Tuesday, March 15, 2005

I want to use Unicode, but...

I want to use Unicode in my work. I really do. UTF-8, to be specific. It's safer, cleaner and easier to use... but I can't.

Let me pause and elaborate. What I'm talking about here is text encoding, which is the way a computer represents characters (letters, numbers, punctuation) as bytes. It's something most people should not have to worry about. Most web pages use some variant of Latin-1 encoding. This is great for English sites, but start adding accents and you get in trouble. And Eastern characters (think Japanese or Chinese)? Shoot, we didn't plan for that.
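To make that concrete, here is a minimal Python sketch of how the same character turns into different bytes depending on the encoding. The helper name encode_sample is made up for illustration, and the three sample characters are arbitrary picks.

```python
def encode_sample(text, encoding):
    """Return the bytes for text in the given encoding, or None if it can't be represented."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # A plain letter, an accented letter, and a Japanese character.
    for text in ("e", "\u00e9", "\u65e5"):
        for encoding in ("latin-1", "utf-8"):
            print(repr(text), encoding, encode_sample(text, encoding))
```

The plain letter is one identical byte in both encodings, the accented letter is one byte in Latin-1 but two in UTF-8, and the Japanese character has no Latin-1 form at all.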

So some nice folks created Unicode (actually, a few flavors - many sub-par) to represent every character in every language known to man.

Here is a simple example of why Unicode is handy when you're writing pages: Let's say I want to show the Copyright symbol: ©. I can't just type it, because it doesn't reliably work in my Latin-1 pages. Instead I have to write &copy;. But wait, using XHTML, that isn't guaranteed to work, so I have to use the difficult-to-remember &#169;. All this just to display a ©. If I were using Unicode, I could just type the symbol (option+g on my Mac) and save the page.
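For the curious, the named and numeric references for the copyright sign can be checked with Python's standard library. This is only a sketch for illustration; the variable names are mine, and nothing here is specific to any particular editor or site.

```python
import html
from html.entities import codepoint2name

symbol = "\u00a9"  # the copyright sign

# Named character reference, looked up from the standard HTML entity table.
named = "&" + codepoint2name[ord(symbol)] + ";"

# Numeric character reference, produced by the xmlcharrefreplace error handler.
numeric = symbol.encode("ascii", "xmlcharrefreplace").decode("ascii")

if __name__ == "__main__":
    print(named)
    print(numeric)
    # Both references decode back to the same single character.
    print(html.unescape(named) == html.unescape(numeric) == symbol)
    # In a UTF-8 file the symbol is simply stored as these two bytes.
    print(symbol.encode("utf-8"))
```

With a UTF-8 page, neither reference is needed: the two bytes on the last line are what gets saved when the symbol is typed directly.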

So back to my problem. Why don't I just use Unicode files (UTF-8)? First, I'm dealing with a silly problem. My (old) version of Dreamweaver saves UTF-8 with a byte-order mark (BOM), which is unnecessary and causes all kinds of problems with PHP. Because my budget doesn't allow for upgrades, I'm stuck using other programs (Xcode, which I don't like for PHP editing) to save UTF-8. This is more a productivity issue than a technical one.
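The BOM in question is three bytes (EF BB BF) at the very start of the file. PHP treats anything before the opening tag as output, so those bytes get sent to the browser before the script runs. A rough sketch of detecting and stripping them, assuming the file contents have already been read as bytes (the function names are hypothetical):

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 byte-order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is returned unchanged."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    saved = codecs.BOM_UTF8 + b"<?php echo 'hi'; ?>"
    print(has_utf8_bom(saved))
    print(strip_utf8_bom(saved))
```

Running something like this over files after saving would be a workaround, not a fix; an editor that never writes the BOM is still the cleaner answer.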

The bigger problem comes in with MySQL, the database I love to hate. Older versions of MySQL didn't support UTF-8. This wouldn't be a problem if the two hosting companies I'm working with right now ran the newest version. (Fortunately, one client just upgraded their database.) So even if I go to Unicode, any data in or out of MySQL is stuck in Latin-1.

One client is running UTF-8 pages with Latin-1 database data thrown in. Another is all Latin-1, but I hope to switch to full UTF-8 after the fortunate upgrade I mentioned before. Unfortunately, my upcoming personal site will be stuck as a mixture.
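Here is roughly what the mixture means in practice, sketched in Python rather than the PHP these sites actually run. It assumes the database hands back plain ISO-8859-1 bytes; the sample value and function name are invented for the example.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 so they can be dropped into a UTF-8 page."""
    return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    from_database = b"caf\xe9"  # hypothetical column value stored as Latin-1

    # Treating the Latin-1 bytes as UTF-8 damages the accented letter.
    print(from_database.decode("utf-8", errors="replace"))

    # Converting first gives bytes that are valid in a UTF-8 page.
    print(latin1_to_utf8(from_database))
```

Every value coming out of the database needs that conversion, and every value going in needs the reverse, which is exactly the kind of bookkeeping a fully UTF-8 setup would avoid.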

I want to go to Unicode, but I'm being held back.


