Lessons learned when using Unicode with Python

4:19pm, 12th October 2003

I’ve been wrestling with Unicode for some time now, but it looks like that time is almost over. Hopefully somebody will find this a useful shortcut to the long meandering trudges through Google that I found myself on.

  • Unicode is an abstract mapping from numbers to character symbols. It is not a document encoding.
  • UTF-8 is a document encoding.
  • Internally, Python has Unicode strings.
  • Reading a UTF-8 encoded file gives you a plain byte string, not a Unicode string. The conversion must be done explicitly:

    >>> f = open('file', 'r')
    >>> unicodeString = unicode(f.read(), 'utf-8')

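  • Python will process Unicode strings perfectly, but printing them or writing them to a file fails as soon as they contain non-ASCII characters, because Python tries to encode them as ASCII. First you must convert them to some encoding yourself. Observe: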

    >>> b = u'a\xe1\xea'
    >>> b
    u'a\xe1\xea'
    >>> print b
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)
    >>> print b.encode('utf-8')
    aáê

  • Given arbitrary input, there is no reliable way of determining what encoding it is in. You have to find out some other way; for example, the XML declaration on the first line of an XML document should name the encoding.
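
As a rough illustration of the XML case in the last point, here is a minimal sketch that pulls the encoding name out of the XML declaration and uses it to decode the bytes. The helper name declared_encoding and the UTF-8 fallback are my own assumptions, and the sketch deliberately ignores byte order marks and UTF-16 input, so treat it as a starting point rather than a complete detector.

```python
import re

# Hypothetical helper for illustration; it is not part of the standard library.
# Matches an XML declaration at the very start of the data and captures the
# value of its encoding="..." attribute.
_XML_DECL = re.compile(
    br'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data, default='utf-8'):
    """Return the encoding named in the XML declaration of a byte string.

    Falls back to `default` when no declaration is found (an assumption of
    this sketch; byte order marks and UTF-16 are not handled).
    """
    match = _XML_DECL.match(data)
    if match is None:
        return default
    return match.group(1).decode('ascii')

# Simulate a Latin-1 encoded XML document arriving as raw bytes.
raw = u'<?xml version="1.0" encoding="iso-8859-1"?><p>a\xe1\xea</p>'.encode('iso-8859-1')

# Decode using the encoding the document itself declares.
text = raw.decode(declared_encoding(raw))
```

The sketch uses the decode method on the raw bytes, which does the same job as the unicode(..., 'utf-8') call earlier in the list.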

The moral of the story is thus:

  • Whenever any string data enters your program, make sure you know what encoding it is in, and convert it from that into Unicode. Keep it in Unicode throughout your entire program. Whenever any string data leaves your program, use the encode method (which all Unicode strings have) to convert it to a specific encoding. UTF-8 seems ideal, if you have no other constraints.
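
The rule above might look something like this in practice: decode at the point of entry, work in Unicode, encode at the point of exit. The function names read_text and write_text are hypothetical, and the temporary file only exists to make the sketch self-contained.

```python
import io
import os
import tempfile

def read_text(path, encoding):
    """Read raw bytes and decode them to Unicode at the program boundary."""
    with io.open(path, 'rb') as f:
        return f.read().decode(encoding)

def write_text(path, text, encoding='utf-8'):
    """Encode a Unicode string to bytes just before it leaves the program."""
    with io.open(path, 'wb') as f:
        f.write(text.encode(encoding))

# Create a scratch file to stand in for real input and output.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Simulate input that arrives in a known encoding (Latin-1 here).
    write_text(path, u'a\xe1\xea', 'iso-8859-1')

    # Entry: convert from the known encoding into Unicode.
    text = read_text(path, 'iso-8859-1')

    # Inside the program: work on Unicode only, then encode as UTF-8 on exit.
    write_text(path, text.upper())

    with io.open(path, 'rb') as f:
        out_bytes = f.read()
finally:
    os.remove(path)
```

The point of keeping both conversions in two small functions is that the rest of the program never has to think about byte encodings at all.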

Warning! I’m in no sense a Python expert. As far as I can tell, this works, but there may be more elegant ways of doing it.

