
Is English Tonal?

An interesting feature of several widely used Asian languages is that they’re tonal. In tonal languages, changing the intonation of what seems to be the same word (at least to the Western ear) can markedly change the meaning of that word. This can be quite hard to fathom for the typical English speaker. A celebrated example of this can be found in Mandarin Chinese:

妈 mā mother
麻 má hemp
马 mǎ horse
骂 mà scold
吗 ma (question tag)

I was in an English supermarket recently and read the word “discount” on several items for sale. It occurred to me that the word discount can be used with at least a couple of different but related meanings:

  1. In the supermarket it’s often used as a noun meaning “a reduction in the sale price”.
  2. It can also mean the verb “to dismiss”, “to remove from consideration” and sometimes “to reduce in price”.

What then struck me is that these two usages are spelled the same but pronounced differently. In the first meaning, the first syllable is stressed whereas in the second meaning, the second syllable is stressed. I tried to think of more words which followed this pattern and it took me some time to come up with “reject”, “survey” and “upset”. My hunch was that there were plenty more words like that so I set about seeing if I could automate finding them.

One can argue that changing the stresses on a word’s syllables changes its intonation. Does that make English tonal after all, albeit on a small scale?

Pronunciation

The Carnegie Mellon Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English. Its database of over 100,000 words gives one or more pronunciations for each word, each organised as a list of sounds, for example:

tree = ['T', 'R', 'IY1']
biscuit = ['B', 'IH1', 'S', 'K', 'AH0', 'T']
undo = ['AH0', 'N', 'D', 'UW1']

I’m interested here not in the actual consonant and vowel sounds which can vary quite markedly with differences in regional accent, but in the stresses of the vowel sounds. These are indicated by a numeric suffix:

0 – No stress
1 – Primary stress
2 – Secondary stress

In the examples above, “biscuit” is pronounced with the stress on the first syllable and “undo” with the stress on the second.

In the Python programming language, the CMU Pronouncing Dictionary can be accessed using the Natural Language Toolkit (NLTK). If you’re using the NLTK for the first time, you’ll need to do the following:

>>> import nltk
>>> nltk.download()

A GUI will appear where you can choose to download the CMU Pronouncing Dictionary. This only needs to be done once. The dictionary can then be accessed as follows:

>>> from nltk.corpus import cmudict
>>> pronunciations = cmudict.dict()
>>> pronunciations['tree']
[['T', 'R', 'IY1']]
>>> pronunciations['discount']
[['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'], ['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T']]

Here we can see that “discount” is indeed listed with more than one pronunciation. Now let’s distill the stresses in these pronunciations:

>>> def stresses(pronunciation):
...     return [i[-1] for i in pronunciation if i[-1].isdigit()]
...
>>> stresses(['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'])
['0', '1']
>>> stresses(['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T'])
['1', '0']

So in one pronunciation, the stress is on the first syllable and in the other pronunciation, the stress is on the second, just as we suspected.
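This check can be wrapped into a small helper. The sketch below is my own (the function names included), not part of NLTK; it reports whether a word has at least two pronunciations with differing stress patterns, using the “discount” and “tree” pronunciations listed above:

```python
def stresses(pronunciation):
    """Extract the stress digits ('0', '1', '2') from a CMU-style pronunciation."""
    return [sound[-1] for sound in pronunciation if sound[-1].isdigit()]

def has_stress_variants(pronunciations):
    """True if at least two of a word's pronunciations differ in stress pattern."""
    patterns = {tuple(stresses(p)) for p in pronunciations}
    return len(patterns) > 1

# Pronunciations as listed by the CMU Pronouncing Dictionary:
discount = [['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'],
            ['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T']]
tree = [['T', 'R', 'IY1']]

print(has_stress_variants(discount))  # True
print(has_stress_variants(tree))      # False
```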

Part of Speech

WordNet is a lexical database of English nouns, verbs, adjectives and adverbs. The database lists the multiple uses of a given word, and for any given use, its definition and most remarkably, its relationship to other words. For example, “dog” is a type of “canine” and a “poodle” is a type of “dog”. We’re interested in the fact that WordNet also helpfully stores the part of speech (i.e. noun, verb etc.) for any given usage.

WordNet can also be accessed using NLTK. Once again, for first use, the WordNet database needs to be downloaded using nltk.download().

Each usage of a word is called a “synset” (i.e. Synonym Set) in WordNet parlance and can be accessed as follows:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('discount')
[Synset('discount.n.01'), Synset('discount_rate.n.02'), Synset('rebate.n.01'), Synset('deduction.n.02'), Synset('dismiss.v.01'), Synset('discount.v.02')]

As might be apparent from this example, the synset’s primary word may or may not be ‘discount’. In fact, each synset contains a list of words (known as lemmas) which can represent that usage:

>>> synsets = wordnet.synsets('discount')
>>> synsets[0]
Synset('discount.n.01')
>>> synsets[0].definition
'the act of reducing the selling price of merchandise'
>>> synsets[0].lemma_names
['discount', 'price_reduction', 'deduction']

We’ll concentrate on those synsets whose primary lemma is the word we are interested in.

Finally, the part of speech for a synset is easily obtained:

>>> synsets[0]
Synset('discount.n.01')
>>> synsets[0].definition
'the act of reducing the selling price of merchandise'
>>> synsets[0].pos
'n'
>>> synsets[5]
Synset('discount.v.02')
>>> synsets[5].definition
'give a reduction in price on'
>>> synsets[5].pos
'v'

Putting It All Together

So to find our “tonal” words, all we need to do is find words which fit the following criteria:

  1. Two or more syllables.
  2. Multiple pronunciations with different stresses.
  3. Can be used as a noun or verb.

A sample Python script can be found here.
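The script itself isn’t reproduced here, but its core logic might look something like the sketch below. To keep it self-contained, the dictionary data is injected as plain mappings; in practice `pronunciations` would come from `cmudict.dict()` and `pos_for` would be built from the primary lemmas of WordNet synsets (both parameter names are mine, not from the original script):

```python
def stresses(pronunciation):
    """Extract the stress digits from a CMU-style pronunciation."""
    return tuple(sound[-1] for sound in pronunciation if sound[-1].isdigit())

def tonal_words(pronunciations, pos_for):
    """Find words meeting the three criteria.

    pronunciations: word -> list of CMU-style pronunciations
    pos_for: word -> set of parts of speech, e.g. {'n', 'v'}
    """
    result = []
    for word, prons in pronunciations.items():
        patterns = {stresses(p) for p in prons}
        if (all(len(p) >= 2 for p in patterns)              # 1. two or more syllables
                and len(patterns) > 1                       # 2. differing stresses
                and {'n', 'v'} <= pos_for.get(word, set())):  # 3. both noun and verb
            result.append(word)
    return sorted(result)

# A toy data set based on the examples in this post:
prons = {
    'discount': [['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'],
                 ['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T']],
    'tree': [['T', 'R', 'IY1']],
}
pos = {'discount': {'n', 'v'}, 'tree': {'n'}}

print(tonal_words(prons, pos))  # ['discount']
```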

And here’s the full list of 112 tonal English words found using this script:

['addict', 'address', 'affiliate', 'affix', 'ally', 'annex', 'associate', 'average', 'bachelor', 'buffet', 'combine', 'commune', 'compact', 'compound', 'compress', 'concert', 'concrete', 'confederate', 'conflict', 'content', 'contest', 'contract', 'contrast', 'converse', 'convert', 'convict', 'coordinate', 'correlate', 'costume', 'debut', 'decrease', 'defect', 'delegate', 'desert', 'detail', 'detour', 'dictate', 'digest', 'discharge', 'discount', 'duplicate', 'effect', 'escort', 'estimate', 'excerpt', 'excise', 'ferment', 'finance', 'forearm', 'geminate', 'general', 'graduate', 'impact', 'implant', 'import', 'impress', 'imprint', 'increase', 'insert', 'interest', 'intrigue', 'invalid', 'laminate', 'leverage', 'mentor', 'mismatch', 'object', 'offset', 'overflow', 'permit', 'pervert', 'postulate', 'predicate', 'present', 'privilege', 'produce', 'progress', 'project', 'protest', 'ratchet', 'recall', 'recess', 'record', 'recount', 'reference', 'refund', 'regress', 'research', 'reset', 'retake', 'rewrite', 'romance', 'segment', 'separate', 'sophisticate', 'subject', 'submarine', 'subordinate', 'supplement', 'surcharge', 'survey', 'suspect', 'syndicate', 'syringe', 'transfer', 'transport', 'trespass', 'underestimate', 'update', 'upgrade', 'upset', 'veto']

Observations

Interesting observations include:

  1. In most cases, stressing the first syllable yields the noun whereas stressing a later syllable yields the verb.
  2. The noun and verb are usually closely related in meaning; however, the nouns of some words have taken on a common usage which has detached them from the meaning of the verb. Obvious examples include “project”, “subject”… and “pervert”!
  3. There also seems to be a high frequency of words beginning with ‘com’, ‘con’ and ‘re’. Is this significant, or is it simply common among English verbs? I’ll leave that question as an exercise for the reader.

With a minor tweak to the script, we can find words that are combinations of adjectives, nouns and verbs. This gives us much smaller lists of words:

  • adjective/noun: ['antecedent', 'commemorative', 'compact', 'complex', 'compound', 'concrete', 'deliverable', 'eccentric', 'general', 'hostile', 'inside', 'invalid', 'invertebrate', 'juvenile', 'liberal', 'mineral', 'national', 'natural', 'oblate', 'peripheral', 'present', 'salient', 'separate', 'subordinate', 'worsening']
  • adjective/verb: ['abstract', 'alternate', 'animate', 'appropriate', 'articulate', 'compact', 'compound', 'concrete', 'frequent', 'general', 'invalid', 'moderate', 'perfect', 'present', 'separate', 'subordinate']
  • adjective/noun/verb: ['compact', 'compound', 'concrete', 'general', 'invalid', 'present', 'separate', 'subordinate']
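The “minor tweak” amounts to changing which combination of parts of speech the script tests for. A hypothetical sketch of that test, parameterised (the function name is mine, not from the original script):

```python
def pos_match(word_pos, required):
    """True if a word can be used as every part of speech in `required`.

    word_pos: the set of parts of speech a word can take, e.g. {'a', 'n', 'v'}
    required: the combination we're searching for
    """
    return required <= word_pos

# 'compact' can be an adjective, noun and verb, so it matches all three searches:
print(pos_match({'a', 'n', 'v'}, {'a', 'n', 'v'}))  # True
# 'discount' is only a noun and verb, so it never appears in the adjective lists:
print(pos_match({'n', 'v'}, {'a', 'n'}))            # False
```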

Epilogue

It turns out that what we’ve found here are heteronyms: words which share the same spelling (making them homographs) but differ in pronunciation and meaning. More specifically, we’ve found plenty of initial-stress-derived nouns, where a verb can be turned into a noun by stressing the first syllable.

I’m not sure we’ve proven that English is a truly tonal language, but this has been a good exercise in cross-referencing two major natural language databases to find interesting words.

Be good to your colon

Programmers spend more time reading code than writing it (a fact well known by most programmers who tend not to publicise this to their employers).  It therefore stands to reason that (most?) programming languages should be designed as much for human consumption as for machine consumption and should be as readable as possible.

Python is a very readable language (a fact which contributes to its popularity) and has been termed “executable pseudocode” on account of its readability.  An aspect of Python which makes it readable is its avoidance of syntactic fluff, extraneous words and symbols which add nothing to the code’s meaning but serve to detract from it.

In the past I’ve felt somewhat negative about Python’s terminal colon “:”, the symbol used to terminate if, while, def and class statements and to signify the start of a new block of indented code.  For example:

if a == 1:
    b = do_something_cool()

def do_something_cool():
    return 'Doing something cool'

Even without the colon, it’s quite clear that we’re starting a new block of indented code because (a) the statement starts with the keyword if, while, def or class and (b) the next line of code is indented. For comparison, Ruby gets on just fine without the colon after its def statement. So why the need for a colon in Python? Is it syntactic fluff?

The Python FAQ explains that the colon enhances readability and helps editors with syntax highlighting and code indentation. Let’s face it, any self-respecting editor should be capable of parsing a line beginning with an if, while, def or class, so the “helps editors” argument is bogus. I do, however, buy the argument that the code is visibly more readable. But how does it enhance readability?

I’ve already mentioned that a programmer spends more time reading than writing code. What I haven’t yet suggested is that a programmer will often reread and scan the same code repeatedly to form a mental picture of a larger codebase. It’s what the eyes do when they’re scanning code that’s key to the importance of the colon. There is some evidence to suggest that the eyes linger at the beginning and at the end of a sentence when reading text and draw especially from visual cues at those locations. Let’s assume for the moment that this holds true for a line of code. So the visual cue heralding an indented block of code is clear at the beginning of a line of code, namely an if, while, def or class followed by an indented line. The only visual cue at the end of a line of Python code is the colon, and without the colon there would be no cue. So even though the colon is not strictly necessary, there is an argument that its existence is there for human consumption and aids readability.

When all’s said and done, the advantage of the colon is probably slight at best, and then probably only for a newcomer to the language. (This sort of advantage possibly vanishes completely for experienced users of any language.) Nevertheless, on balance, I’m now happy it’s there!

Pro Python – Book Review

A recent thread on the Python Northwest mailing list asked for opinions on Marty Alchin‘s book Pro Python.  I thought I’d reproduce the answer I gave and expand on it a little.

I’ve owned Marty Alchin’s first book, Pro Django, for some time and was very happy with that purchase.  Based on that, I decided to buy his Pro Python book last year.  Pro Python is targeted at readers who are proficient with basic Python but are looking to push their skills further.  Quite naturally there’s a large number of beginners’ Python books out there but a shortage of more advanced books so it was nice to see this published.

Marty Alchin starts his book with a refreshing approach.  Rather than regurgitating Python facts to the reader, he takes a step-by-step tour of The Zen of Python, discussing how its philosophy can be practically applied to make your programming more Pythonic.  He then delves into traditional topics such as classes, objects and strings as well as development topics such as packaging and testing.

I like Marty Alchin’s style of writing and find it to be clear and concise.  Even if you’re reasonably knowledgeable about the advanced topics he covers such as metaclasses, descriptors, introspection and multiple inheritance, I think the book benefits from the fact that these topics are backed up with good examples of how they work, and just as importantly, how they might usefully be used in ways you might not have seen before.  In fact, Chapter 11 walks through the building of a real world Python library which can be found on PyPI (try pip install Sheets) using the principles outlined in the previous chapters.

The other aspect of the book I find very useful is the fact that it is based on Python 3, however all examples are annotated and compared with the “legacy” Python 2 equivalent where relevant.  I’ve gotten a lot more comfortable with Python 3 by reading this book and better understand the improvements in the language from Python 2 to Python 3.

This isn’t a book aimed at newcomers to Python, even if you have a lot of programming experience, as it expects a reasonable amount of basic Python proficiency.  It’s also a “thin” book in the sense that it gives each topic a light treatment rather than aiming to be a complete reference.  This may or may not suit your needs, however there’s plenty of reference material elsewhere both online (e.g. the official Python documentation) and in print.

By comparison, the other advanced Python book I’ve read (and reread!) is Python In a Nutshell by Alex Martelli.  It’s based on Python 2.5 and getting a bit out of date, but much of it is still very relevant for all Python 2.x versions.  (I think a Python 3 version might be in the works).  It’s a much heftier and more detailed book and acts as much a reference text as well as being a book you’d enjoy reading from cover to cover.

In summary, I’d recommend Pro Python to any intermediate level Python programmer who’d like to advance their Python skills with a clear and concise text.

N.B. I am in no way associated with Pro Python, Apress or Marty Alchin … except of course for owning the book!

Hessian RPC Services. What’s not to like?

Over the last few days I’ve been playing with Hessian, “a compact binary protocol for connecting web services”. In my previous company we used Hessian extensively for communicating between a Java thick client and a Java Apache Tomcat HTTP server with good success. These days we talk of JSON and REST and look down our noses at thick clients, so Hessian might seem irrelevant; however, around the time we were implementing our client-server communications (2004 / 2005), we were bathing in the waters of SOAP, WSDL and so-called heavyweight web services. The beauty of Hessian was our ability to take our Plain Old Java Objects which we had already implemented on our thick client and send them down the pipe unchanged to our server. Hessian took care of the marshalling and unmarshalling of data. In fact, because we took advantage of Hessian integration with the Spring Framework, a declarative application framework which encourages defining objects and their relationships and dependencies in configuration files, all it took was a bit of code and a bit of configuration to get everything working.

So does it now make more sense to use JSON / REST? One of the advantages of JSON / REST includes the inherent decoupling of client and server. The client fires a JSON string to the server at the correct URL using an HTTP POST and the server parses what it needs from that string and happily replies. This process is platform agnostic as HTTP and JSON libraries are available for many programming languages and platforms, not least including Javascript in the web browser. This model is widely used by service providers such as Google and Amazon whereby they can provide and update REST interfaces to their services without having to deliver and maintain multiple API client libraries. A drawback of this model is the need to hand code the marshalling and unmarshalling of JSON data by both client and server, though this can also be seen as an advantage as it decouples an application’s internal representation of data from the wire format.

Hessian compares well with the JSON / REST model. Hessian is also designed around HTTP POST whereby a client connects to a URL on the server and sends data, however Hessian goes one step further and encodes an RPC call i.e. a function name and arguments. In fact the Hessian library makes this process transparent by proxying the server i.e. it provides an object on which the client makes function calls without knowing that the call will be sent to a server. Note that there is no “contract” or abstract interface which you are forced to code to – client and server ensure they’re sending and receiving the correct function arguments by “unwritten agreement” much like the JSON / REST model. Unlike JSON, Hessian is a binary protocol meaning that the data exchanged between client and server is very compact. It also encodes type information, in fact, entire object structures are maintained when unmarshalled on either client or server. Hessian is also cross platform and libraries exist for many programming languages including Javascript.

So what’s not to like?  Well, binary communication and the concept of RPC function calls in general seem to have gained a bad reputation, possibly due to the extra complexity and library support needed over simple JSON / REST and possibly because of the increased coupling an RPC call implies.  Experience at my previous company taught us that the communication can be a little brittle if the definitions of objects sent over the wire are not kept in step on both client and server. If an object sent from the client to the server has an extra unknown field, there will be an error when the Hessian library on the server tries to unmarshall that data to create an object.  (The reverse, however, is not true – any fields missing from data over the wire will simply end up unset on the unmarshalled object).

Passing JSON over HTTP is much more forgiving in that the client or server will blissfully ignore any field it doesn’t know how to handle, though of course if an expected field is not found, the server must know how to handle that.  Ordinarily, keeping the client and server in step shouldn’t be a problem; however, we had many clients in the field with different versions of our software all connecting to the same server.

It has only recently occurred to me that the brittleness described above is peculiar to statically typed languages such as Java, where an Exception is thrown at any attempt to apply a value to a field which has not been defined in an object’s class. The same is not true of dynamically typed languages such as Python, which is forgiving when applying values to arbitrary fields on an object. For many years, hessianlib.py has been the standard Python implementation of Hessian. It has been largely unmaintained over that time and includes a Hessian client implementation but no Hessian server implementation. The code is also a little impenetrable. Happily, earlier this year a fork of hessianlib.py called Mustaine appeared. It doesn’t (yet) contain a server implementation, but the code is more penetrable, so I submitted a patch with an implementation of a Hessian WSGI server.
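The difference is easy to demonstrate with plain Python, no Hessian required. Setting a field that was never declared on a class simply works, which is why an unexpected extra field arriving over the wire needn’t be fatal (a minimal illustration of dynamic typing, not Mustaine code):

```python
class Person(object):
    def __init__(self, name):
        self.name = name

# Simulated wire data containing a field unknown to the Person class:
wire_data = {'name': 'Alice', 'nickname': 'Al'}

p = Person(wire_data['name'])
for field, value in wire_data.items():
    setattr(p, field, value)  # no error: Python happily adds 'nickname'

print(p.nickname)  # Al
```

The equivalent reflection-based assignment in Java would throw, since the field doesn’t exist on the class.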

Let’s see some code based on the proposed mustaine.server module. (Please note that Mustaine server support is in flux so this example is subject to change). An object can be served via WSGI by wrapping it with mustaine.server.WsgiApp. An object’s methods are only exposed if decorated with the mustaine.server.exposed decorator. For example:

from mustaine.server import exposed

class Calculator(object):
    @exposed
    def add(self, a, b):
        return a + b

    @exposed
    def subtract(self, a, b):
        return a - b

The following code will serve a Calculator() object on port 8080 using the Python reference WSGI server:

from wsgiref import simple_server
from mustaine.server import WsgiApp
s = simple_server.make_server('', 8080, WsgiApp(Calculator()))
s.serve_forever()

This object can now be accessed over the network using the Hessian client:

>>> from mustaine.client import HessianProxy
>>> h = HessianProxy('http://localhost:8080/')
>>> h.add(2, 3)
5

As a result of providing server support to Mustaine, I’ve started developing django-hessian, a library which serves Hessian objects in Django. Objects can be served using djangohessian.Dispatcher at a given URL with an entry in urls.py. The Calculator() object described above can be served at the URL http://localhost:8000/rpc/calculator/ in the Django development server as follows:

# mysite/urls.py:

from django.conf.urls.defaults import *

urlpatterns = patterns('',
    (r'^rpc/', include('mysite.myapp.urls')),
)
# mysite/myapp/urls.py:

from django.conf.urls.defaults import *
from djangohessian import Dispatcher
from server import Calculator

urlpatterns = patterns('',
    url(r'^calculator/', Dispatcher(Calculator())),
)

Full source can be found at http://bitbucket.com/safehammad/django-hessian/.

I can’t help wondering whether the Hessian protocol is getting the attention it deserves, particularly in environments where both client and server are delivered and maintained by a single provider. Have you implemented JSON / REST systems which would have benefited from using Hessian? Do you have good arguments as to why the use of Hessian is to be discouraged?

PyWeek 11 – And the winner is …

PyWeek 11 has come to an end.  The judging is over and the winners have been announced.  The deserving winners are Universe Factory 11 as an individual entry with the game Mortimer the Lepidopterist, and Super Effective 11 as a team entry with the game Trident Escape.

Mortimer the Lepidopterist

Trident Escape: The Dungeon of Destiny

I made several interesting observations during the course of the contest.

Firstly, I’m no gamer; however, that was clearly irrelevant as I thoroughly enjoyed the competition, the pressure of having to deliver a piece of software to a deadline (but not losing my job if I didn’t) and generally having free rein to hack with Python to produce a creative end product.  Not only that, but I was doing it in the knowledge that at least 39 teams of people would be playing with my creation.

Secondly, in telling my non-geek friends that I was entering this competition, I received all sorts of interest and support in what I was programming to a level I’d not experienced before.  It was both humbling and refreshing to be able to talk to my non-geek friends about what I was programming without a familiar glazed look descending on their faces.

Thirdly, I really enjoyed playing the other teams’ games and learnt a lot from doing so.  It was interesting to see the sheer variety of games and the creative thought that went into them.  It was also interesting to look at the code behind the games.  For me this was a real win and an affirmation that you learn most about coding from reading others’ code.

As mentioned previously, my entry was Voices Under Water and was written using the excellent pyglet and cocos2d libraries.  The game is based around a dolphin who has to catch life rings being thrown by a ship’s captain to save his crew from drowning.  It’s probably not the most exciting story, but I found myself writing the game then shoehorning the story onto it, and that was the best I could come up with!  Coming up with the name of the game was much easier.  My other half’s niece and her boyfriend are part of a band formerly called The Bacchae and more recently called Black Moth.  They kindly gave me permission to use one of their tracks which is fittingly called Voices Under Water as the backing music for the game.

Many thanks to the organisers and to the other teams for an enjoyable competition!
