For many a novice Python programmer, particularly in the English-speaking world, their first brush with Unicode is with this error:

UnicodeDecodeError: 'ascii' codec can't decode
   byte 0xc5 in position 0: ordinal not in range(128)
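
The same failure can be reproduced on purpose. The following is a minimal sketch, assuming Python 3, that decodes a non-ASCII byte with the ASCII codec; the variable names are illustrative only.

```python
# The byte 0xc5 is outside the ASCII range (0-127), so the ASCII codec rejects it.
raw = b'\xc5\x81'  # the UTF-8 bytes for U+0141 (LATIN CAPITAL LETTER L WITH STROKE)

try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode('utf-8')
print(ascii(text))  # '\u0141'
```

The fix is rarely to suppress the error; it is to find out which encoding the bytes are really in and name it explicitly.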

Now I realise that much has been written about Unicode already, but for me it has always been one of those topics that goes in one ear, sticks around while I’m using it, then gradually makes its way out of the other ear when I’m least expecting it.  Last year I read the excellent “Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit” (http://www.nltk.org/book) by Steven Bird, Ewan Klein, and Edward Loper.  An interlude in Chapter 3 discusses Unicode in a refreshingly clear and brief way and I now feel that the topic is well and truly understood.

I thought I’d write a quick cookbook as an aide-memoire, not about Unicode per se as there are many good articles already (see http://www.joelonsoftware.com/articles/Unicode.html for a thorough intro and history), but specifically about Unicode and Python, and to showcase the differences between Python 2 and Python 3.

1. Unicode assigns each character a unique number called a code point.  Unicode string literals are written in Python as follows:

Python 2:

>>> print u'This is Unicode'
This is Unicode

Python 3:

>>> print('This is Unicode')    # Strings are unicode in Python 3
This is Unicode
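
To make the Python 3 comment above concrete, here is a small sketch (Python 3 assumed, names illustrative) showing that a str holds code points while a bytes object holds one particular encoding of them.

```python
text = 'Ma\xf1ana'            # str: a sequence of Unicode code points
data = text.encode('utf-8')   # bytes: one particular encoding of that text

print(type(text).__name__, len(text))   # str 6
print(type(data).__name__, len(data))   # bytes 7 (U+00F1 takes two bytes in UTF-8)
print(data)                             # b'Ma\xc3\xb1ana'
```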

2. Non-ASCII literals can also appear directly in a Python source file if the file's encoding permits.  Python 2 assumes source files are ASCII by default, whereas Python 3 assumes UTF-8.  The source encoding can be changed with an encoding declaration (see PEP 263) at the top of the file.

Python 2:

# -*- coding: utf-8 -*-
# The following literal is Unicode text; this source file stores it as UTF-8 bytes.
print u'This is also Unicode. Mañana!'

Python 3:

# The following literal is Unicode text; this source file stores it as UTF-8 bytes.
print('This is also Unicode. Mañana!')
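
The effect of the encoding declaration can be seen by running a source file that is not stored as UTF-8. This is a sketch only, assuming Python 3; the file name latin1_demo.py and the variable greeting are made up for the illustration.

```python
import os
import runpy
import tempfile

# Source code stored as Latin-1 bytes: the n-with-tilde is the single byte 0xf1,
# which is not valid UTF-8 on its own, so the declaration on line 1 is required.
source = b"# -*- coding: latin-1 -*-\ngreeting = 'Ma\xf1ana!'\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'latin1_demo.py')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(source)
    namespace = runpy.run_path(path)  # runs the file and returns its globals

greeting = namespace['greeting']
print(ascii(greeting))  # 'Ma\xf1ana!'
```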

3. All Unicode characters can be represented with an escape sequence regardless of the file encoding:

Python 2:

# Code points that fit in 1, 2 or 4 bytes are written in hex as
# \xnn, \unnnn and \Unnnnnnnn respectively
>>> print u'This is also Unicode. Ma\xf1ana! Ma\u00f1ana! Ma\U000000f1ana!'
This is also Unicode. Mañana! Mañana! Mañana!

Python 3:

# You can do the same in Python 3
>>> print('This is also Unicode. Ma\xf1ana! Ma\u00f1ana! Ma\U000000f1ana!')
This is also Unicode. Mañana! Mañana! Mañana!
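
As a further sketch (Python 3 assumed), the escape forms can be compared directly. The \N{...} named escape and the G clef character are additions for illustration and are not part of the examples above.

```python
# Four spellings of the same character, U+00F1.
forms = ['\xf1', '\u00f1', '\U000000f1', '\N{LATIN SMALL LETTER N WITH TILDE}']
print(all(form == forms[0] for form in forms))  # True

# Code points above U+FFFF do not fit in \unnnn and need the 8-digit form.
clef = '\U0001D11E'  # MUSICAL SYMBOL G CLEF
print(len(clef), hex(ord(clef)))  # 1 0x1d11e
```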

4. A code point is an abstract number; it does not by itself determine how text is stored as bytes in a file.  Writing text to a file therefore always involves a particular encoding.  UTF-8 can represent every Unicode character and is compact for mostly ASCII text.  Write to a file with a particular encoding as follows:

Python 2:

>>> import codecs
>>> txt = u'Ma\xf1ana!'
>>> with codecs.open('file.txt', 'w', encoding='utf-8') as f:
...     f.write(txt)

… or encode before writing to file …

>>> txt = u'Ma\xf1ana!'
>>> with open('file.txt', 'w') as f:
...     f.write(txt.encode('utf-8'))

Python 3:

>>> txt = 'Ma\xf1ana!'
>>> with open('file.txt', 'w', encoding='utf-8') as f:
...     f.write(txt)
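
Why the choice of encoding matters can be seen without touching a file. The sketch below (Python 3 assumed, variable names illustrative) encodes the same text several ways and compares the resulting bytes.

```python
txt = 'Ma\xf1ana!'

# The same text becomes different byte sequences under different encodings.
as_utf8 = txt.encode('utf-8')        # b'Ma\xc3\xb1ana!'
as_latin1 = txt.encode('latin-1')    # b'Ma\xf1ana!'
as_utf16 = txt.encode('utf-16-le')   # two bytes per character here, no BOM
print(len(as_utf8), len(as_latin1), len(as_utf16))  # 8 7 14

# ASCII has no byte for U+00F1, so encoding fails unless an error handler is given.
try:
    txt.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)
as_ascii = txt.encode('ascii', errors='replace')
print(as_ascii)  # b'Ma?ana!'
```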

5. A file of text is just a file of binary data unless the encoding is known.  A text file with a given encoding is read as follows:

Python 2:

>>> import codecs
>>> with codecs.open('file.txt', encoding='utf-8') as f:
...     print f.read()
Mañana!

… or decode after reading from file …

>>> with open('file.txt') as f:
...     print f.read().decode('utf-8')
Mañana!

Python 3:

>>> with open('file.txt', encoding='utf-8') as f:
...     print(f.read())
Mañana!
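
Reading with the wrong encoding does not always raise an error. The following sketch (Python 3 assumed; the temporary directory is only there to keep the example self-contained) writes UTF-8 and then reads the file back three different ways.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Ma\xf1ana!')

    # Wrong codec that happens to accept every byte: no error, just mojibake.
    with open(path, encoding='latin-1') as f:
        garbled = f.read()

    # Wrong codec that rejects the bytes: errors='replace' substitutes U+FFFD.
    with open(path, encoding='ascii', errors='replace') as f:
        replaced = f.read()

    # Binary mode skips decoding entirely and returns the raw bytes.
    with open(path, 'rb') as f:
        raw = f.read()

print(ascii(garbled))   # 'Ma\xc3\xb1ana!'
print(ascii(replaced))  # 'Ma\ufffd\ufffdana!'
print(raw)              # b'Ma\xc3\xb1ana!'
```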

6. A special encoding peculiar to Python is “unicode_escape”, which encodes a Unicode string into a Python 2 str or a Python 3 bytes object with the appropriate escape sequences.  For example:

Python 2:

>>> tomorrow = u'Ma\xf1ana'
>>> print tomorrow.encode('unicode_escape')
Ma\xf1ana

Python 3:

>>> tomorrow = 'Ma\xf1ana'
>>> print(tomorrow.encode('unicode_escape'))
b'Ma\\xf1ana'
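
The encoding also works in the other direction. A short sketch, assuming Python 3, of the round trip and of a character outside the \xnn range:

```python
tomorrow = 'Ma\xf1ana'
escaped = tomorrow.encode('unicode_escape')
print(escaped)                                       # b'Ma\\xf1ana'

# Decoding turns the escape sequences back into the original characters.
restored = escaped.decode('unicode_escape')
print(restored == tomorrow)                          # True

# Characters above U+00FF use the \u form (and \U above U+FFFF).
euro = '\u20ac'.encode('unicode_escape')
print(euro)                                          # b'\\u20ac'
```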

7. To get the code point for a Unicode character:

Python 2:

>>> print ord(u'\xf1')
241

Python 3:

>>> print(ord('\xf1'))
241

8. To get the Unicode character for a given code point:

Python 2:

>>> print unichr(0xf1)
ñ

Python 3:

>>> print(chr(0xf1))
ñ
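
Taken together, ord and chr are inverses of each other. A minimal Python 3 sketch (the U+XXXX labels are just the conventional way of writing code points, and the variable names are illustrative):

```python
word = 'Ma\xf1ana'

# ord maps each character to its code point...
code_points = [ord(ch) for ch in word]
print(code_points)  # [77, 97, 241, 97, 110, 97]

labels = ['U+%04X' % cp for cp in code_points]
print(labels)  # ['U+004D', 'U+0061', 'U+00F1', 'U+0061', 'U+006E', 'U+0061']

# ...and chr maps each code point back to its character.
rebuilt = ''.join(chr(cp) for cp in code_points)
print(rebuilt == word)  # True
```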

9. And finally, the unicodedata module has some useful functions worth knowing about.  For example, to get the name of a given Unicode character:

Python 2:

>>> import unicodedata
>>> print unicodedata.name(u'\xf1')
LATIN SMALL LETTER N WITH TILDE

Python 3:

>>> import unicodedata
>>> print(unicodedata.name('\xf1'))
LATIN SMALL LETTER N WITH TILDE
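
A few other functions in the same module are sketched below, assuming Python 3. The choice of lookup, category and normalize is mine, as examples of what else is available.

```python
import unicodedata

# lookup is the inverse of name.
enye = unicodedata.lookup('LATIN SMALL LETTER N WITH TILDE')
print(ascii(enye))                  # '\xf1'
print(unicodedata.category(enye))   # Ll (letter, lowercase)

# NFD normalization splits the character into a base letter plus a combining mark.
decomposed = unicodedata.normalize('NFD', enye)
names = [unicodedata.name(ch) for ch in decomposed]
print(names)  # ['LATIN SMALL LETTER N', 'COMBINING TILDE']

# The two forms look the same when displayed but compare unequal until normalized.
print(decomposed == enye)                                # False
recomposed = unicodedata.normalize('NFC', decomposed)
print(recomposed == enye)                                # True
```

Normalizing both sides to the same form before comparing strings avoids surprises when text comes from different sources.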

References:

[1] Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit – Text Processing with Unicode
[2] Unicode HOWTO (Python 2)
[3] The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky