
Archive for the 'Programming' Category

Python Northwest Returns

Just under a year ago a group of like-minded people in the Northwest of England with excellent taste in programming language got together for Python Northwest.  The plan is to get together again this month.  Whether you’re a beginner Pythoneer or a seasoned Pythonista, or if you just want an excuse to go to the pub, then this meeting is for you!

Details are as follows:

  • When: Thursday 19th August 2010, 6pm
  • Where: Rain Bar, Manchester
  • What: A social meet to chat about stuff we’ve found interesting / useful / fun with Python recently.  Topics likely to include games, robots, web programming, GUIs, parallel processing, audio generation, tips and tricks, and just about anything heard, said or done at the recent Europython conference.
  • Contact:

Building on the well attended and fun meetup organised by Michael Sparks last year, the hope is that Python Northwest will continue to meet every third Thursday of the month starting with social meets then alternating between social meets, technical meets and perhaps coding sessions.

Please forward, tweet and dent this to anyone or anylist you think might be interested, then email python-north-west@googlegroups.com to say you’re coming along!

See you there …

Python And Unicode

For many a novice Python programmer, particularly in the English-speaking world, their first brush with Unicode is with this error:

UnicodeDecodeError: 'ascii' codec can't decode
   byte 0xc5 in position 0: ordinal not in range(128)
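
The error is easy to reproduce on demand.  Here is a minimal sketch in Python 3, assuming the offending byte 0xc5 is the first byte of a two-byte UTF-8 sequence; the variable names are mine and purely illustrative:

```python
# Bytes 0xc5 0x81 are the UTF-8 encoding of code point U+0141.
data = b'\xc5\x81'

try:
    data.decode('ascii')          # ASCII only covers byte values 0 to 127
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# Decoding with the codec the bytes were written in succeeds.
text = data.decode('utf-8')
print(len(text), hex(ord(text)))
```

The fix is rarely to suppress the error; it is to find out which encoding the bytes were written in and decode with that.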

Now I realise that much has been written about Unicode already, but for me it has always been one of those topics that goes in one ear, sticks around while I’m using it, then gradually makes its way out of the other ear when I’m least expecting it.  Last year I read the excellent “Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit” (http://www.nltk.org/book) by Steven Bird, Ewan Klein, and Edward Loper.  An interlude in Chapter 3 discusses Unicode in a refreshingly clear and brief way and I now feel that the topic is well and truly understood.

I thought I’d write a quick cookbook as an aide-memoire, not about Unicode per se as there are many good articles already (see http://www.joelonsoftware.com/articles/Unicode.html for a thorough intro and history), but specifically about Unicode and Python, and to showcase the differences between Python 2 and Python 3.

1. Unicode is a representation of characters where each character is uniquely represented by a number called a code point.  Unicode literals are represented in Python as follows:

Python 2:

>>> print u'This is Unicode'
This is Unicode

Python 3:

>>> print('This is Unicode')    # Strings are unicode in Python 3
This is Unicode

2. Non-ASCII literals can also be represented in a Python file if the encoding permits.  The default source file encoding for Python 2 is ASCII whereas for Python 3 it is UTF-8.  The file encoding can be changed with the appropriate coding declaration at the top of the file.

Python 2:

# -*- coding: utf-8 -*-
# The following string is Unicode represented in UTF-8 characters.
print u'This is also Unicode. Mañana!'

Python 3:

# The following string is Unicode represented in UTF-8 characters.
print('This is also Unicode. Mañana!')

3. All Unicode characters can be represented with an escape sequence regardless of the file encoding:

Python 2:

# Code points can be written in hex with 2, 4 or 8 digits using
# \xnn, \unnnn and \Unnnnnnnn respectively
>>> print u'This is also Unicode. Ma\xf1ana! Ma\u00f1ana! Ma\U000000f1ana!'
This is also Unicode. Mañana! Mañana! Mañana!

Python 3:

# You can do the same in Python 3
>>> print('This is also Unicode. Ma\xf1ana! Ma\u00f1ana! Ma\U000000f1ana!')
This is also Unicode. Mañana! Mañana! Mañana!

4. The Unicode standard does not cover how text is represented in a file.  When it comes to writing text to a file, a particular encoding will be used.  The UTF-8 encoding can efficiently represent all Unicode characters.  Write to a file with a particular encoding as follows:

Python 2:

>>> import codecs
>>> txt = u'Ma\xf1ana!'
>>> with codecs.open('file.txt', 'w', encoding='utf-8') as f:
...     f.write(txt)

… or encode before writing to file …

>>> txt = u'Ma\xf1ana!'
>>> with open('file.txt', 'w') as f:
...     f.write(txt.encode('utf-8'))

Python 3:

>>> txt = 'Ma\xf1ana!'
>>> with open('file.txt', 'w', encoding='utf-8') as f:
...     f.write(txt)
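
To see what an encoding actually does to the text, it helps to look at the bytes themselves.  A small Python 3 sketch using only the built-in str.encode; the choice of latin-1 and utf-16-le as comparison encodings is mine, purely for illustration:

```python
txt = 'Ma\xf1ana!'

utf8 = txt.encode('utf-8')        # the n-with-tilde becomes two bytes: 0xc3 0xb1
latin1 = txt.encode('latin-1')    # the same character becomes one byte: 0xf1
utf16 = txt.encode('utf-16-le')   # every character in this string takes two bytes

print(len(txt), len(utf8), len(latin1), len(utf16))
print(utf8)
```

The same seven characters produce three different byte sequences, which is why the encoding has to be stated when writing and known when reading.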

5. A file of text is just a file of binary data unless the encoding is known.  A text file with a given encoding is read as follows:

Python 2:

>>> import codecs
>>> with codecs.open('file.txt', encoding='utf-8') as f:
...     print f.read()
Mañana!

… or decode after reading from file …

>>> with open('file.txt') as f:
...     print f.read().decode('utf-8')
Mañana!

Python 3:

>>> with open('file.txt', encoding='utf-8') as f:
...     print(f.read())
Mañana!

6. A special encoding peculiar to Python is “unicode_escape” which encodes a Unicode string into a Python 2 string or a Python 3 bytes object with the appropriate escape sequences.  For example:

Python 2:

>>> tomorrow = u'Ma\xf1ana'
>>> print tomorrow.encode('unicode_escape')
Ma\xf1ana

Python 3:

>>> tomorrow = 'Ma\xf1ana'
>>> print(tomorrow.encode('unicode_escape'))
b'Ma\\xf1ana'
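
The encoding works in both directions.  A short Python 3 sketch of the round trip, using the same example string as above:

```python
tomorrow = 'Ma\xf1ana'

escaped = tomorrow.encode('unicode_escape')    # bytes containing a literal backslash
restored = escaped.decode('unicode_escape')    # back to the original string

print(escaped)
print(restored == tomorrow)
```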

7. To get the code point for a Unicode character:

Python 2:

>>> print ord(u'\xf1')
241

Python 3:

>>> print(ord('\xf1'))
241

8. To get the Unicode character for a given code point:

Python 2:

>>> print unichr(0xf1)
ñ

Python 3:

>>> print(chr(0xf1))
ñ

9. And finally, the unicodedata module has some useful functions worth knowing about.  For example, to get the name of a given Unicode character:

Python 2:

>>> import unicodedata
>>> print unicodedata.name(u'\xf1')
LATIN SMALL LETTER N WITH TILDE

Python 3:

>>> import unicodedata
>>> print(unicodedata.name('\xf1'))
LATIN SMALL LETTER N WITH TILDE
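
A few more unicodedata functions I find handy are lookup, category and normalize.  The following Python 3 sketch is my own selection rather than a complete tour of the module:

```python
import unicodedata

composed = unicodedata.lookup('LATIN SMALL LETTER N WITH TILDE')  # name -> character
decomposed = unicodedata.normalize('NFD', composed)               # n + combining tilde

print(len(composed), len(decomposed))
print([unicodedata.name(ch) for ch in decomposed])
print(unicodedata.category(composed))                             # 'Ll' = lowercase letter
print(unicodedata.normalize('NFC', decomposed) == composed)
```

Normalizing both sides to the same form before comparing strings avoids two visually identical strings testing as unequal.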

References:

[1] Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit – Text Processing with Unicode
[2] Unicode HOWTO (Python 2)
[3] The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Python 2.7 Released

The production version of Python 2.7 is now available. Several features from Python 3.1 have been backported including language features such as the syntax for set literals, dictionary and set comprehensions, and multiple context managers in a single with statement.
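
For anyone who hasn't met them yet, here is a quick sketch of those backported language features.  The data and variable names are made up for illustration, and the print call is written for Python 3:

```python
import tempfile

words = ['spam', 'eggs', 'spam', 'ham']

unique = {'spam', 'eggs', 'ham'}               # set literal
lengths = {w: len(w) for w in words}           # dictionary comprehension
initials = {w[0] for w in words}               # set comprehension

# Multiple context managers in a single with statement.
with tempfile.TemporaryFile('w+') as src, tempfile.TemporaryFile('w+') as dst:
    src.write('hello')
    src.seek(0)
    dst.write(src.read().upper())
    dst.seek(0)
    copied = dst.read()

print(sorted(unique), lengths, sorted(initials), copied)
```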

These are great features of Python 3 and I’m sure I’ll use them in my Python 2.7 code, but I can’t help wondering if they also reduce the motivation for developers to move to Python 3. Are there enough compelling Python 3-only features to help promote the “upsell”?

MongoUK

MongoUK has been and gone and the day went very well. Many thanks to the 10gen guys who sponsor MongoDB for putting on a great meeting. My head is spinning with thoughts and ideas about MongoDB and the host of possibilities it clearly opens up for many classes of application.

It was interesting to learn that MongoDB was born primarily out of a desire for high performance and horizontal scalability, i.e. the ability to scale out to multiple nodes as needed. Other properties such as its non-relational and schema-free nature are happy side effects, as joins and the maintenance of referential integrity are difficult in a highly parallel environment. It also explains why there is no strong drive to implement transactions, a feature completely absent from MongoDB: they are not only hard to implement in a highly parallel environment but would also be detrimental to performance.

Some stand-out features of MongoDB discussed at the meeting include the following:

  • MongoDB is schema free and far more capable than simply storing arbitrary key / value pairs. MongoDB stores BSON documents, a binary JSON format meaning that data of any complexity and nesting can be represented.
  • Through the use of BSON, MongoDB understands several basic data types including integers, strings, floating point and datetime. Many similar databases only support strings and perhaps numeric data types usually leaving the developer to do some parsing in their application.
  • Indexes can be created at any level of nesting in a document, and significantly, an index can be created on an array. Imagine a database full of objects with nested attributes one of which is an array of tags then rapidly searching for all objects with a given tag.
  • A large number of drivers are already developed and well supported.
  • MongoDB has extremely rich query support. Apparently the developers considered using SQL as the query language but settled on a JSON query format. The result is an expressive query language which is satisfyingly symmetrical to the nature of the data being queried upon.
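
To make the document and query shapes concrete, here is a sketch in plain Python with no MongoDB driver or server involved.  The asset documents, the matches helper and its matching rules are all hypothetical: they cover only simple equality, dotted paths into nested documents, and membership in an array, which is a tiny fraction of what the real query language supports.

```python
# Hypothetical documents: nested attributes plus an array of tags.
documents = [
    {'name': 'Guernica print', 'details': {'artist': 'Picasso'},
     'tags': ['artwork', 'lobby']},
    {'name': 'Office PC', 'details': {'colour': 'blue'},
     'tags': ['hardware', 'lobby']},
]


def matches(document, query):
    """Toy matcher: every key in the query must match the document."""
    for path, wanted in query.items():
        value = document
        for key in path.split('.'):        # walk dotted paths such as 'details.artist'
            if not isinstance(value, dict) or key not in value:
                return False
            value = value[key]
        if isinstance(value, list):        # an array matches if it contains the value
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


# The query is itself a document, mirroring the data it searches.
by_tag = [d['name'] for d in documents if matches(d, {'tags': 'lobby'})]
by_artist = [d['name'] for d in documents
             if matches(d, {'details.artist': 'Picasso'})]
print(by_tag)
print(by_artist)
```

The point of the sketch is the symmetry: the query for a tag or a nested attribute has the same shape as the document that stores it.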

In my previous post I hoped to learn whether MongoDB would be a good solution to storing and querying our arbitrary key value data. MongoDB would clearly solve some of our issues and I look forward to putting MongoDB through its paces over the coming days and assessing it with respect to other document-oriented databases.

Will MongoDB Save The Day?

I’m greatly looking forward to attending tomorrow’s MongoUK, the UK meeting for MongoDB, one of a new breed of so-called document-oriented databases.  I thought I’d pen a few thoughts about MongoDB and document-oriented databases in general before the meeting so I could compare them to my thoughts after the meeting.

MongoDB is “a scalable, high-performance, open source, document-oriented database”.  The terms scalable, high-performance, document-oriented, object-oriented, non-relational, schema-free, key/value store, and NoSQL are all used interchangeably for this class of database.  All serve to describe the qualities of this class of database whilst at the same time muddying the water a little by providing multiple and sometimes conflicting reasons for choosing them, or more commonly, for switching over from the commonplace relational database.

I’ve followed the rise and rise of document-oriented databases with interest over the last few years and I often sense that the reason for making the switch is to latch onto one of the qualities mentioned above then try hard to make the other qualities work.  For example, to make the switch to gain massive scalability then work around the loss of SQL and rich query functionality.  The other reason to make the switch is that these databases are new, cool, often a joy to work with, and a refreshing change from the confines of the relational database.  I’ll be the first to put my hand up and favour these reasons for considering the switch for certain applications.  More specifically, I’m investigating the use of a schema-free database whilst contemplating what this might mean with regard to database transactions, referential integrity, and the ability to query your data.

For several years I’ve worked on an application called Asset DB which records the location and attributes of company assets.  There’s no end to the different attributes that might belong to an asset, for example, Serial No, Colour, IP Address, Installation Date, and this list very much depends on the type of asset, for example, a piece of artwork will have an Artist but a PC will have a MAC Address.  Added to this, there’s no end to the different types of asset which might be recorded from artwork to waste-bins to paper-clips.  Asset DB uses a relational database and we were faced with the conundrum of how to structure and store this mish-mash of data.  The problem is not unlike the question of how to store tag data in a relational database (see http://forge.mysql.com/wiki/TagSchema for opinions on how this might be done).  As it happens, we completely side-stepped the issue and simply lumped the attributes into a single string field structured in our own format, in effect, turning the field into a flat file.  The following gives you a rough representation of the schema we used:

 +----+------------+------------+--------------------------------------------------+
 | id | date_added | asset_type | attributes                                       |
 +====+============+============+==================================================+
 | 1  | 2010-06-16 | PC         | assetno=ABC010,colour=blue,mac=00:11:22:33:44:55 |
 | 2  | 2010-06-17 | Artwork    | assetno=ABC020,artist=Picasso,value=$20000       |
 +----+------------+------------+--------------------------------------------------+

The clear advantage is that we can store any key / value data for any asset.  The disadvantages are numerous:

  • There’s no possibility of using standard database constructs for maintaining structural integrity of the data.
  • Over time, our needs have changed and we’ve had to devise ways of storing more complex data structures such as lists, maps, time-series data etc.
  • To read even a small part of the data requires reading in the entire string of attributes then parsing it.  Also, any change to a small part of that data, for example, to amend the artist on a piece of artwork, requires reading, parsing, changing then writing back the entire string to the database.
  • The ability to query the data is severely limited, for example, to find all artwork by Picasso, we can’t use the database to do the query for us.  Instead every reference to artwork in the database has to be read and parsed, or some sort of full-text search is needed.  Additionally, opening the database to third parties isn’t as simple as saying “here’s the database, now run some SQL as you please”.
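
The read, parse, change, write cycle described above looks roughly like this in Python 3.  This is a sketch only: the rows are illustrative data in the same key=value format, the helper names are hypothetical, and the parser assumes values never contain a comma.

```python
# Illustrative rows: (id, asset_type, attributes string).
rows = [
    (1, 'PC', 'assetno=ABC010,colour=blue'),
    (2, 'Artwork', 'assetno=ABC020,artist=Picasso,value=$20000'),
]


def parse_attributes(raw):
    """Split 'key=value,key=value' into a dict (assumes no commas in values)."""
    return dict(pair.split('=', 1) for pair in raw.split(','))


def format_attributes(attributes):
    """Join a dict back into the single-string format."""
    return ','.join('{}={}'.format(k, v) for k, v in attributes.items())


# Finding all artwork by Picasso means parsing every candidate row.
picassos = [row_id for row_id, asset_type, raw in rows
            if asset_type == 'Artwork'
            and parse_attributes(raw).get('artist') == 'Picasso']

# Amending one attribute means read, parse, change, then write back the lot.
attributes = parse_attributes(rows[1][2])
attributes['artist'] = 'Braque'
updated = format_attributes(attributes)
print(picassos, updated)
```

None of this work is visible to the database, which is why it can neither index the attributes nor answer the query itself.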

To be fair, we could have formatted our data in something like XML or JSON (although the decision was made prior to the standardisation of JSON), and perhaps we could have used an XML or object database rather than a relational database, but our needs were very simple at the time.  I’d be interested to learn how others have implemented the above in their applications. Meanwhile, I’m hoping to learn whether document-oriented databases might be a good solution.

Why MongoDB?  I was first introduced to MongoDB about a year ago and I’ve played with it a few times since then.  It appears to provide a good balance between providing a schema free database whilst retaining the ability to richly query your data.  As the front page of the MongoDB web site says:

The Best Features of Document Databases, Key-Value Stores, and RDBMSes.

MongoDB bridges the gap between key-value stores (which are fast and highly scalable) and traditional RDBMS systems (which provide rich queries and deep functionality).

Perhaps this is exactly what our application needs and I’m looking forward to learning more tomorrow.
