Is English Tonal?

An interesting feature of several widely used Asian languages is that they’re tonal. In tonal languages, changing the intonation of what seems to be the same word (at least to the Western ear) can markedly change the meaning of that word. This can be quite hard to fathom for the typical English speaker. A celebrated example of this can be found in Mandarin Chinese:

妈 mā mother
麻 má hemp
马 mǎ horse
骂 mà scold
吗 ma (question tag)

I was in an English supermarket recently and read the word “discount” on several items for sale. It occurred to me that the word discount can be used with at least a couple of different but related meanings:

  1. In the supermarket it’s often used as a noun meaning “a reduction in the sale price”.
  2. It can also mean the verb “to dismiss”, “to remove from consideration” and sometimes “to reduce in price”.

What then struck me is that these two usages are spelled the same but pronounced differently. In the first meaning, the first syllable is stressed whereas in the second meaning, the second syllable is stressed. I tried to think of more words which followed this pattern and it took me some time to come up with “reject”, “survey” and “upset”. My hunch was that there were plenty more words like that so I set about seeing if I could automate finding them.

One can argue that changing the stresses on a word’s syllables changes its intonation. Does that make English tonal after all, albeit on a small scale?

Pronunciation

The Carnegie Mellon Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English. Its database of 100,000+ words contains a set of pronunciations organised as a list of sounds, for example:

tree = ['T', 'R', 'IY1']
biscuit = ['B', 'IH1', 'S', 'K', 'AH0', 'T']
undo = ['AH0', 'N', 'D', 'UW1']

I’m interested here not in the actual consonant and vowel sounds which can vary quite markedly with differences in regional accent, but in the stresses of the vowel sounds. These are indicated by a numeric suffix:

0 – No stress
1 – Primary stress
2 – Secondary stress

In the examples above, “biscuit” is pronounced with the stress on the first syllable and “undo” with the stress on the second.
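To make the stress markers concrete, here is a small sketch that reports which syllable carries the primary stress. The helper name `primary_stress_syllable` is my own invention, not part of the dictionary or of NLTK. It assumes that every vowel sound ends in a stress digit, as in the examples above.

```python
def primary_stress_syllable(phonemes):
    """Return the 1-based index of the syllable carrying primary stress.

    Each vowel sound ends in a stress digit, so counting the digit-suffixed
    sounds counts the syllables. Returns None if no sound is marked '1'.
    """
    syllable = 0
    for sound in phonemes:
        if sound[-1].isdigit():
            syllable += 1
            if sound[-1] == '1':
                return syllable
    return None


# The three pronunciations listed above.
examples = {
    'tree': ['T', 'R', 'IY1'],
    'biscuit': ['B', 'IH1', 'S', 'K', 'AH0', 'T'],
    'undo': ['AH0', 'N', 'D', 'UW1'],
}

for word, phonemes in sorted(examples.items()):
    print(word, primary_stress_syllable(phonemes))
```

This should report the first syllable for “biscuit” and the second for “undo”.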

In the Python programming language, the CMU Pronouncing Dictionary can be accessed using the Natural Language Toolkit (NLTK). If you’re using the NLTK for the first time, you’ll need to do the following:

>>> import nltk
>>> nltk.download()

A GUI will appear where you can choose to download the CMU Pronouncing Dictionary. This only needs to be done once. The dictionary can then be accessed as follows:

>>> from nltk.corpus import cmudict
>>> pronunciations = cmudict.dict()
>>> pronunciations['tree']
[['T', 'R', 'IY1']]
>>> pronunciations['discount']
[['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'], ['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T']]

Here we can see that “discount” is indeed listed with more than one pronunciation. Now let’s distil the stresses in these pronunciations:

>>> def stresses(pronunciation):
...     return [i[-1] for i in pronunciation if i[-1].isdigit()]
...
>>> stresses(['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'])
['0', '1']
>>> stresses(['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T'])
['1', '0']

So in one pronunciation, the stress is on the first syllable and in the other pronunciation, the stress is on the second, just as we suspected.

Part of Speech

WordNet is a lexical database of English nouns, verbs, adjectives and adverbs. The database lists the multiple uses of a given word, and for any given use, its definition and most remarkably, its relationship to other words. For example, “dog” is a type of “canine” and a “poodle” is a type of “dog”. We’re interested in the fact that WordNet also helpfully stores the part of speech (i.e. noun, verb etc.) for any given usage.

WordNet can also be accessed using NLTK. Once again, for first use, the WordNet database needs to be downloaded using nltk.download().

Each usage of a word is called a “synset” (i.e. Synonym Set) in WordNet parlance and can be accessed as follows:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('discount')
[Synset('discount.n.01'), Synset('discount_rate.n.02'), Synset('rebate.n.01'), Synset('deduction.n.02'), Synset('dismiss.v.01'), Synset('discount.v.02')]

As might be apparent from this example, the synset’s primary word may or may not be ‘discount’. In fact, each synset contains a list of words (known as lemmas) which can represent that usage:

>>> synsets = wordnet.synsets('discount')
>>> synsets[0]
Synset('discount.n.01')
>>> synsets[0].definition()
'the act of reducing the selling price of merchandise'
>>> synsets[0].lemma_names()
['discount', 'price_reduction', 'deduction']

We’ll concentrate on those synsets whose primary lemma is the word we are interested in.
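One way to sketch that filter without touching WordNet at all is to work on the synset names themselves. This assumes the names always take the form `lemma.pos.number`, as they do in the output above. The helper names `split_synset_name` and `synsets_led_by` are hypothetical.

```python
def split_synset_name(name):
    """Split a name such as 'discount_rate.n.02' into (lemma, pos, sense)."""
    lemma, pos, sense = name.rsplit('.', 2)
    return lemma, pos, sense


def synsets_led_by(word, names):
    """Keep only the synset names whose primary lemma is the given word."""
    return [name for name in names if split_synset_name(name)[0] == word]


# Synset names copied from the wordnet.synsets('discount') output above.
synset_names = [
    'discount.n.01', 'discount_rate.n.02', 'rebate.n.01',
    'deduction.n.02', 'dismiss.v.01', 'discount.v.02',
]

print(synsets_led_by('discount', synset_names))
```

For “discount” this leaves one noun synset and one verb synset, which is exactly the pairing we are after.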

Finally, the part of speech for a synset is easily obtained:

>>> synsets[0]
Synset('discount.n.01')
>>> synsets[0].definition()
'the act of reducing the selling price of merchandise'
>>> synsets[0].pos()
'n'
>>> synsets[5]
Synset('discount.v.02')
>>> synsets[5].definition()
'give a reduction in price on'
>>> synsets[5].pos()
'v'

Putting It All Together

So to find our “tonal” words, all we need to do is find words which fit the following criteria:

  1. Two or more syllables.
  2. Multiple pronunciations with different stresses.
  3. Can be used as both a noun and a verb.

A sample Python script can be found here.
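The three criteria can be sketched as a single test on one word. This is a minimal illustration rather than the original script. The function `is_candidate` and its `wanted` parameter are my own names, and the parts of speech in the sample data are typed in by hand for illustration rather than looked up in WordNet.

```python
def stresses(pronunciation):
    """Extract the stress digits from a list of sounds."""
    return [sound[-1] for sound in pronunciation if sound[-1].isdigit()]


def is_candidate(pronunciations, parts_of_speech, wanted=('n', 'v')):
    """Apply the three criteria to one word.

    pronunciations:  list of pronunciations, each a list of sounds
    parts_of_speech: set of part-of-speech codes the word can take
    wanted:          the codes that must all be present
    """
    patterns = set(tuple(stresses(p)) for p in pronunciations)
    multi_syllable = all(len(pattern) >= 2 for pattern in patterns)
    different_stresses = len(patterns) >= 2
    right_pos = all(pos in parts_of_speech for pos in wanted)
    return multi_syllable and different_stresses and right_pos


# Hand-built sample: (pronunciations, parts of speech) per word.
sample = {
    'discount': ([['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'],
                  ['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T']], {'n', 'v'}),
    'biscuit': ([['B', 'IH1', 'S', 'K', 'AH0', 'T']], {'n'}),
    'tree': ([['T', 'R', 'IY1']], {'n', 'v'}),
}

found = sorted(word for word, (prons, pos) in sample.items()
               if is_candidate(prons, pos))
print(found)
```

In a full script, the pronunciations would presumably come from `cmudict.dict()` and the parts of speech from the synsets whose primary lemma is the word in question.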

And here’s the full list of 112 tonal English words found using this script:

['addict', 'address', 'affiliate', 'affix', 'ally', 'annex', 'associate', 'average', 'bachelor', 'buffet', 'combine', 'commune', 'compact', 'compound', 'compress', 'concert', 'concrete', 'confederate', 'conflict', 'content', 'contest', 'contract', 'contrast', 'converse', 'convert', 'convict', 'coordinate', 'correlate', 'costume', 'debut', 'decrease', 'defect', 'delegate', 'desert', 'detail', 'detour', 'dictate', 'digest', 'discharge', 'discount', 'duplicate', 'effect', 'escort', 'estimate', 'excerpt', 'excise', 'ferment', 'finance', 'forearm', 'geminate', 'general', 'graduate', 'impact', 'implant', 'import', 'impress', 'imprint', 'increase', 'insert', 'interest', 'intrigue', 'invalid', 'laminate', 'leverage', 'mentor', 'mismatch', 'object', 'offset', 'overflow', 'permit', 'pervert', 'postulate', 'predicate', 'present', 'privilege', 'produce', 'progress', 'project', 'protest', 'ratchet', 'recall', 'recess', 'record', 'recount', 'reference', 'refund', 'regress', 'research', 'reset', 'retake', 'rewrite', 'romance', 'segment', 'separate', 'sophisticate', 'subject', 'submarine', 'subordinate', 'supplement', 'surcharge', 'survey', 'suspect', 'syndicate', 'syringe', 'transfer', 'transport', 'trespass', 'underestimate', 'update', 'upgrade', 'upset', 'veto']

Observations

Interesting observations include:

  1. In most cases, stressing the first syllable yields the noun whereas stressing a later syllable yields the verb.
  2. The noun and verb are usually closely related in meaning. However, some nouns have taken on a common usage which has detached them from the meaning of the verb. Obvious examples include “project”, “subject”… and “pervert”!
  3. There also seems to be a high frequency of words beginning with ‘com’, ‘con’ and ‘re’. Is this significant or is this simply common among English verbs? I’ll leave that question as an exercise for the reader.
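As a starting point for that exercise, the prefixes in the list above can simply be counted. This sketch only measures how often each prefix occurs in the 112 words. Deciding whether that is significant would also need a baseline count over English verbs in general, which is not attempted here. The function name `prefix_counts` is my own.

```python
from collections import Counter

# The 112 words found by the script, copied from the list above.
words = [
    'addict', 'address', 'affiliate', 'affix', 'ally', 'annex', 'associate',
    'average', 'bachelor', 'buffet', 'combine', 'commune', 'compact',
    'compound', 'compress', 'concert', 'concrete', 'confederate', 'conflict',
    'content', 'contest', 'contract', 'contrast', 'converse', 'convert',
    'convict', 'coordinate', 'correlate', 'costume', 'debut', 'decrease',
    'defect', 'delegate', 'desert', 'detail', 'detour', 'dictate', 'digest',
    'discharge', 'discount', 'duplicate', 'effect', 'escort', 'estimate',
    'excerpt', 'excise', 'ferment', 'finance', 'forearm', 'geminate',
    'general', 'graduate', 'impact', 'implant', 'import', 'impress',
    'imprint', 'increase', 'insert', 'interest', 'intrigue', 'invalid',
    'laminate', 'leverage', 'mentor', 'mismatch', 'object', 'offset',
    'overflow', 'permit', 'pervert', 'postulate', 'predicate', 'present',
    'privilege', 'produce', 'progress', 'project', 'protest', 'ratchet',
    'recall', 'recess', 'record', 'recount', 'reference', 'refund',
    'regress', 'research', 'reset', 'retake', 'rewrite', 'romance',
    'segment', 'separate', 'sophisticate', 'subject', 'submarine',
    'subordinate', 'supplement', 'surcharge', 'survey', 'suspect',
    'syndicate', 'syringe', 'transfer', 'transport', 'trespass',
    'underestimate', 'update', 'upgrade', 'upset', 'veto',
]


def prefix_counts(words, prefixes):
    """Count how many words start with each of the given prefixes."""
    return Counter(prefix for word in words for prefix in prefixes
                   if word.startswith(prefix))


prefixes = ('com', 'con', 're')
counts = prefix_counts(words, prefixes)
for prefix in prefixes:
    share = 100.0 * counts[prefix] / len(words)
    print('%s: %d words (%.0f%%)' % (prefix, counts[prefix], share))
```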

With a minor tweak to the script, we can find words that are combinations of adjectives, nouns and verbs. This gives us much smaller lists of words:

  • adjective/noun: ['antecedent', 'commemorative', 'compact', 'complex', 'compound', 'concrete', 'deliverable', 'eccentric', 'general', 'hostile', 'inside', 'invalid', 'invertebrate', 'juvenile', 'liberal', 'mineral', 'national', 'natural', 'oblate', 'peripheral', 'present', 'salient', 'separate', 'subordinate', 'worsening']
  • adjective/verb: ['abstract', 'alternate', 'animate', 'appropriate', 'articulate', 'compact', 'compound', 'concrete', 'frequent', 'general', 'invalid', 'moderate', 'perfect', 'present', 'separate', 'subordinate']
  • adjective/noun/verb: ['compact', 'compound', 'concrete', 'general', 'invalid', 'present', 'separate', 'subordinate']
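As a quick sanity check on the three lists above, the adjective/noun/verb list turns out to be exactly the set of words common to the other two. The sketch below confirms this with a set intersection. The variable names are my own.

```python
adjective_noun = [
    'antecedent', 'commemorative', 'compact', 'complex', 'compound',
    'concrete', 'deliverable', 'eccentric', 'general', 'hostile', 'inside',
    'invalid', 'invertebrate', 'juvenile', 'liberal', 'mineral', 'national',
    'natural', 'oblate', 'peripheral', 'present', 'salient', 'separate',
    'subordinate', 'worsening',
]
adjective_verb = [
    'abstract', 'alternate', 'animate', 'appropriate', 'articulate',
    'compact', 'compound', 'concrete', 'frequent', 'general', 'invalid',
    'moderate', 'perfect', 'present', 'separate', 'subordinate',
]
adjective_noun_verb = [
    'compact', 'compound', 'concrete', 'general', 'invalid', 'present',
    'separate', 'subordinate',
]

# Words appearing in both two-way lists, in alphabetical order.
shared = sorted(set(adjective_noun) & set(adjective_verb))
print(shared == adjective_noun_verb)
```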

Epilogue

It turns out that what we’ve found here are heteronyms: two or more words which share the same spelling (making them homographs) but differ in pronunciation and meaning. More specifically, we’ve found plenty of initial-stress-derived nouns, where a verb can be turned into a noun by stressing the first syllable.

I’m not sure we’ve proven that English is a truly tonal language, but this has been a good exercise in cross-referencing two major natural language databases to find interesting words.

The Freedom of the City

A couple of days ago I visited the beautiful John Rylands Library in Manchester with the family. Within the library is a document recording the honour of “Freedom of the City of Manchester” awarded to Enriqueta Augustina Rylands, third wife of John Rylands, when she founded the library in 1899.

Freedom of the City of Manchester

Aside from the beauty and the colourful vibrancy of this document, what struck me was the verbosity and sheer length of the sentences contained within. Here’s a key sub-sentence from the document which is 39 words long and drawn from a parent sentence no fewer than 73 words long:

“…the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City…”

So how do we break down a relatively complex sentence such as this in order to analyse it?  The answer is to build a syntax tree, a representation of the sentence decomposed into its constituent sub-sentences, decomposed in turn into noun phrases and verb phrases, decomposed in turn into nouns, verbs and other parts of speech. This is a three-step process:

  1. Tokenising –  splitting the sentence into its constituent entities (mainly words).
  2. Part of speech tagging – assigning a part of speech to each word.
  3. Parsing – turning the tagged text into a syntax tree.

I’ll be using the nltk to help me. Here goes…

1. Tokenise

Splitting a sentence into words seems like it should be an easy task but the main gotcha is deciding what to do with punctuation such as full stops and apostrophes.  Thankfully, nltk just “does the right thing” (or at least it does the same thing predictably and consistently).  In our case, there’s no punctuation to worry about so we could just split the sentence on whitespace, but we’ll use the nltk anyway as good practice.

>>> import nltk
>>> sent = 'the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City'
>>> tokens = nltk.word_tokenize(sent)
>>> print(tokens)
['the', 'members', 'of', 'this', 'council', 'desire', 'to', 'express', 'their', 'opinion', 'that', 'the', 'powers', 'accorded', 'to', 'them', 'by', 'law', 'for', 'the', 'recognition', 'of', 'eminent', 'services', 'would', 'be', 'fittingly', 'exercised', 'by', 'conferring', 'upon', 'Mrs', 'Enriqueta', 'Rylands', 'the', 'Freedom', 'of', 'the', 'City']
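Because this particular sentence contains no punctuation, a plain whitespace split gives the same tokens as the list above. The sketch below does just that and also confirms the 39-word count mentioned earlier. It deliberately avoids nltk, so it says nothing about how nltk treats punctuation.

```python
sent = ('the members of this council desire to express their opinion that '
        'the powers accorded to them by law for the recognition of eminent '
        'services would be fittingly exercised by conferring upon Mrs '
        'Enriqueta Rylands the Freedom of the City')

# str.split() with no arguments splits on any run of whitespace.
tokens = sent.split()
print(len(tokens))
print(tokens[:5])
```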

2. Tag

Part of speech tagging is also catered for by the nltk. The built-in tagger uses a maximum entropy classifier and assigns tags from the Penn Treebank Project. A list of tags and guidelines for assigning tags can be found in this document.


>>> nltk.pos_tag(tokens)
[('the', 'DT'), ('members', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('council', 'NN'), ('desire', 'NN'), ('to', 'TO'), ('express', 'NN'), ('their', 'PRP$'), ('opinion', 'NN'), ('that', 'WDT'), ('the', 'DT'), ('powers', 'NNS'), ('accorded', 'VBD'), ('to', 'TO'), ('them', 'PRP'), ('by', 'IN'), ('law', 'NN'), ('for', 'IN'), ('the', 'DT'), ('recognition', 'NN'), ('of', 'IN'), ('eminent', 'NN'), ('services', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('fittingly', 'RB'), ('exercised', 'VBN'), ('by', 'IN'), ('conferring', 'NN'), ('upon', 'IN'), ('Mrs', 'NNP'), ('Enriqueta', 'NNP'), ('Rylands', 'NNPS'), ('the', 'DT'), ('Freedom', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('City', 'NNP')]

As expected, some tagging decisions are questionable and some are just plain wrong. The most common errors tend to be with words which can be used as both nouns and verbs, for example, desire and express. These are incorrectly tagged as nouns rather than verbs as a “best guess” as there are far more nouns than verbs in the English language. By my reckoning, we’ve achieved about 85% accuracy in this sentence with just six manual corrections required:

('desire', 'NN')      ->  ('desire', 'VB')
('express', 'NN')     ->  ('express', 'VB')
('that', 'WDT')       ->  ('that', 'IN')
('accorded', 'VBD')   ->  ('accorded', 'VBN')
('eminent', 'NN')     ->  ('eminent', 'JJ')
('conferring', 'NN')  ->  ('conferring', 'VBG')
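The accuracy figure can be checked with a few lines of arithmetic. The sketch below applies the six corrections to the tagger output shown earlier and counts how many tags change. Keying the corrections by word only works here because each corrected word appears just once in the sentence.

```python
# Tagger output copied from the nltk.pos_tag(tokens) result above.
tagged = [
    ('the', 'DT'), ('members', 'NNS'), ('of', 'IN'), ('this', 'DT'),
    ('council', 'NN'), ('desire', 'NN'), ('to', 'TO'), ('express', 'NN'),
    ('their', 'PRP$'), ('opinion', 'NN'), ('that', 'WDT'), ('the', 'DT'),
    ('powers', 'NNS'), ('accorded', 'VBD'), ('to', 'TO'), ('them', 'PRP'),
    ('by', 'IN'), ('law', 'NN'), ('for', 'IN'), ('the', 'DT'),
    ('recognition', 'NN'), ('of', 'IN'), ('eminent', 'NN'),
    ('services', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('fittingly', 'RB'),
    ('exercised', 'VBN'), ('by', 'IN'), ('conferring', 'NN'), ('upon', 'IN'),
    ('Mrs', 'NNP'), ('Enriqueta', 'NNP'), ('Rylands', 'NNPS'), ('the', 'DT'),
    ('Freedom', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('City', 'NNP'),
]

# The six manual corrections, keyed by word.
corrections = {
    'desire': 'VB', 'express': 'VB', 'that': 'IN',
    'accorded': 'VBN', 'eminent': 'JJ', 'conferring': 'VBG',
}

corrected = [(word, corrections.get(word, tag)) for word, tag in tagged]
changed = sum(1 for old, new in zip(tagged, corrected) if old != new)
accuracy = 1.0 - float(changed) / len(tagged)
print('%d of %d tags corrected, accuracy %.1f%%'
      % (changed, len(tagged), 100 * accuracy))
```

Six corrections out of 39 tokens leaves 33 correct tags, which is where the figure of roughly 85% comes from.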

3. Parse

Now the hard part. Analysing sentence structure tends to be a manually intensive process. I’ll start by hand crafting a context free grammar by gradually splitting the sentence into its constituent parts in multiple iterations, for example:

Iteration 1


S    = Sentence
NP   = Noun Phrase
VP   = Verb Phrase
SBAR = Subordinate Clause
IN   = Preposition or subordinating conjunction

(S the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City)

Iteration 2


(S (NP the members of this council) (VP desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City))

Iteration 3


(S (NP the members of this council) (VP (VP desire to express their opinion) (SBAR (IN that) (S the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City))))

...etc...

By repeating this process, the following grammar is produced, shown here together with an application to display the generated syntax tree.

import nltk

sent = 'the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City'

tokens = nltk.word_tokenize(sent)

grammar = """
    S    -> NP VP
    NP   -> NP PP | DT NNS | DT NN | PRPS NN | NP IN NP | NP VBN PP | JJ NNS | DT NNP
    PP   -> IN NP | TO VP | TO PRP IN NN | IN VP
    SBAR -> IN S
    VP   -> VP SBAR | VB PP | VB NP | VP NP | VP PP | MD VB RB VBN | VBG RP NNP NNP NNP NP

    DT   -> 'the' | 'this'
    NNS  -> 'members' | 'powers' | 'services'
    IN   -> 'of' | 'that' | 'by' | 'for'
    NN   -> 'council' | 'opinion' | 'law' | 'recognition'
    VB   -> 'desire' | 'express' | 'be'
    TO   -> 'to'
    PRPS -> 'their'
    VBN  -> 'accorded' | 'exercised'
    PRP  -> 'them'
    JJ   -> 'eminent'
    MD   -> 'would'
    RB   -> 'fittingly'
    VBG  -> 'conferring'
    RP   -> 'upon'
    NNP  -> 'Mrs' | 'Enriqueta' | 'Rylands' | 'Freedom' | 'City'
"""

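# Build a chart parser from the grammar and collect every possible parse.
parser = nltk.ChartParser(nltk.CFG.fromstring(grammar))
trees = list(parser.parse(tokens))  # parse() yields trees lazily
trees[0].draw()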

This grammar results in no less than 1956 different possible syntax trees for this sentence (in theory meaning that this sentence could be interpreted in up to 1956 different ways).

Syntax Tree

The first of these syntax trees has a maximum depth of 11.  Contrast this with a sentence such as “the cat sat on the mat” with a maximum depth of approximately 5.  The depth of the syntax tree gives a feel for the complexity of the sentence and the depth of sub-sentences, sub-clauses and dependent phrases within the sentence.
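One simple way to put a number on depth is to count the deepest level of bracket nesting in a tree written in the bracketed notation used above. The sketch below does this for “the cat sat on the mat”. The parse of that sentence is my own hand-written guess, and `bracket_depth` is a hypothetical helper, not necessarily how the figures above were obtained. Other conventions, such as counting the words themselves as a level, would shift the result by one.

```python
def bracket_depth(tree_text):
    """Return the deepest level of bracket nesting in a bracketed tree."""
    depth = 0
    deepest = 0
    for char in tree_text:
        if char == '(':
            depth += 1
            deepest = max(deepest, depth)
        elif char == ')':
            depth -= 1
    return deepest


# A hand-written parse of a simple sentence (an assumption, not nltk output).
cat = ('(S (NP (DT the) (NN cat)) '
       '(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))')

print(bracket_depth(cat))
```

By this measure the simple sentence comes out at a depth of 5.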

Now when it comes to considering how the human brain might parse and understand this sentence, it might be interesting to consider whether the depth of the syntax tree can be thought of similarly to the stack depth in a running application.  Does the human brain contain a stack for parking sentence fragments as a complex sentence unfolds?  Is there a maximum stack depth, and if so, does this vary greatly from person to person?

Complex sentences certainly require more concentration to understand and perhaps the phrase: “Could you repeat that, please!” is the direct result of a cerebral stack overflow error!

Our days are numbered

Whenever I learn a new word in any language, I often find myself comparing that word with equivalent words in other languages. I was recently thinking about words for days of the week in various languages. For related languages, not only are there similarities in the words themselves, but the origins of those words also fall into a small number of distinct categories.

I chose three groups of related languages, partly for reasons of familiarity and interest, and partly to allow me to compare words within and across groups:

  • English and German (Germanic)
  • French and Spanish (Romance)
  • Arabic and Hebrew (Semitic)

Interestingly, the origins of the words for days of the week can almost all be placed into the categories Planetary, Pagan Gods, Religious and Numeric. The following table lists the words for days of the week in each language together with their meaning / origin:

The origins of words for days of the week

What struck me immediately is that this table is strongly reminiscent of the Periodic Table.  This probably isn’t surprising considering that related languages have been placed adjacent to each other.

Head over to this Wikipedia page for a more complete study of the origins of the words for days of the week (but without the colourful table).

Update

Mercury, Venus, Mars, Jupiter and Saturn have been known about since ancient times and were named after Roman gods. So many of the planetary days of the week were actually named after ancient gods, albeit indirectly. Thanks to M Stallman for pointing this out!

Perhaps it makes sense that the sun and moon being the most obvious celestial bodies lend their names to the first and second days of the week. But how were the remaining days assigned their planets?

Be good to your colon

Programmers spend more time reading code than writing it (a fact well known by most programmers who tend not to publicise this to their employers).  It therefore stands to reason that (most?) programming languages should be designed as much for human consumption as for machine consumption and should be as readable as possible.

Python is a very readable language (a fact which contributes to its popularity) and has been termed “executable pseudocode” on account of its readability.  An aspect of Python which makes it readable is its avoidance of syntactic fluff, extraneous words and symbols which add nothing to the code’s meaning but serve to detract from it.

In the past I’ve felt somewhat negative about Python’s terminal colon “:”, the symbol used to terminate if, while, def and class statements and to signify the start of a new block of indented code.  For example:

if a == 1:
    b = do_something_cool()

def do_something_cool():
    return 'Doing something cool'

Even without the colon, it’s quite clear that we’re starting a new block of indented code because (a) the statement starts with the keyword if, while, def or class and (b) the next line of code is indented. For comparison, Ruby gets on just fine without the colon after its def statement. So why the need for a colon in Python? Is it syntactic fluff?

The Python FAQ explains that the colon enhances readability and helps editors with syntax highlighting and code indentation. Let’s face it, any self-respecting editor should be capable of parsing a line beginning with an if, while, def or class, so the “helps editors” argument is bogus. I do, however, buy the argument that the colon makes the code visibly more readable. But how does it enhance readability?

I’ve already mentioned that a programmer spends more time reading than writing code. What I haven’t yet suggested is that a programmer will often reread and scan the same code repeatedly to form a mental picture of a larger codebase. It’s what the eyes do when they’re scanning code that’s key to the importance of the colon. There is some evidence to suggest that the eyes linger at the beginning and at the end of a sentence when reading text and draw especially from visual cues at those locations. Let’s assume for the moment that this holds true for a line of code. So the visual cue heralding an indented block of code is clear at the beginning of a line of code, namely an if, while, def or class followed by an indented line. The only visual cue at the end of a line of Python code is the colon, and without the colon there would be no cue. So even though the colon is not strictly necessary, there is an argument that its existence is there for human consumption and aids readability.

When all’s said and done, the advantage of the colon is probably slight at best, and then probably only for a newcomer to the language. (This sort of advantage possibly vanishes completely for experienced users of any language.) Nevertheless, on balance, I’m now happy it’s there!

Django JavaScript Integration: AJAX and jQuery

Django JavaScript Integration: AJAX and jQuery is a book about the building of Ajax-enabled web applications using Django and jQuery.  Django has rapidly shot to fame as the most popular web development framework for the Python programming language.  Similarly, jQuery has taken the Javascript world by storm as a client-side Javascript framework making the development of sophisticated browser based clients both easier and even more pleasurable than using Javascript alone.  The strapline to this book is: “Develop AJAX applications using Django and jQuery” and I would suggest that this describes the aim of the book more accurately than its title.

There’s a wealth of both online and dead-tree texts covering Django and jQuery, however by comparison, there’s far less information covering the integration of both technologies so the arrival of this book is timely.  I’m also always happy to see new books aimed at the more experienced Python programmer in a time when the rapid (and very welcome) growth in the adoption of Python has led to the recent publication of a large number of beginners’ books.

To get the most out of this book, a knowledge of Python is expected and a working knowledge of Javascript and Django highly recommended.  The author also makes occasional (and perhaps inevitable) comparisons between Javascript and the Java language in the first couple of chapters, however a working knowledge of Java is definitely not needed.

The first chapter covers Python and Javascript.  As a Django/jQuery developer you’ll be using both languages and the author provides some interesting comparisons between the two.  The author is also quite candid and realistic about the weaknesses of Javascript and its cross-browser incompatibilities whilst carefully highlighting its strengths:  “If you can figure out why Python is a good language, you can figure out why JavaScript is a good language.”

The second chapter gets stuck into the basics of jQuery and the constructs which simplify the implementation of Ajax.  The third chapter then dives into Django with a tour of Django validation and a detailed discussion of validation in general.  The remainder of the book builds a reasonably large web application with each chapter pulling together a good number of disparate features you’d want to provide in any self-respecting Web 2.0 application.  Autocompletion, form validation, server-side validation, client-side and server-side search and login handling are all described and integrated into the application.  Even the creation of a “favicon.ico” is mentioned to put a company logo on your users’ web browser tabs and make them look distinctive.

It quickly became apparent that this book is not a regurgitation of “the same old stuff”. Rather, it makes the effort not only to show you what to do, but also to discuss why you do something in a particular way and how you can improve on it, leaving the reader with a deeper understanding.  For example, the book is quite happy to extend the provided Django classes where they fall short, and to show validation of more unusual types such as GPS coordinates, which are not natively supported by Django.  Another example is the book’s excellent treatment of validation, which discusses cultural awareness and suggests that a “less is more” approach to validation can sometimes make sense.

Apart from a couple of typos here and there (which are possibly restricted to my electronic copy), a minor annoyance is what I felt to be a rather unorthodox Javascript formatting style.  For example:

set: function(newValue)
   {
   var value = parseInt(newValue.toString());
   if (isNaN(value))
       {
       return false;
       }
   else
       {
       field = value;
       return true;
       }
   }

It’s quite possible again that this is a formatting issue restricted to my electronic copy (and I’ll investigate and update this review accordingly).  I also acknowledge that you can never please everyone with your coding style and layout!

The book stops short of helping you organise the inevitable growing mass of Javascript code, a difficult but increasingly important topic.  A little information around the modularisation of Javascript files or strategies and libraries for implementing MVC in client side code would have gone a long way.  Another aspect of the book which is notably glossed over is the topic of testing.  Testing can be hard, and testing web applications can be very hard, particularly those which rely on a lot of Javascript.  Admittedly this isn’t a book about testing, but implementing tests is a very important part of a developer’s life and a section or chapter setting the reader on the right path would have been welcome.

There are several parts of the book which deserve a special mention, however Chapter 11 particularly stands out.  The topic of usability is one often brushed over in technical books in favour of delivering more how-to’s and code examples.  The author devoted an entire chapter to usability, a chapter which I can only hope the authors of many web applications I’ve used might one day read.

I find it hard to characterise the author’s style of writing but I’d probably describe it as intellectual bordering on philosophical with a colourful vocabulary, a style which I enjoy but might not be to everyone’s taste.  An amusing example of the intellectual nature of the book can be found in Chapter 2: “Prototypal inheritance is more like the evolutionary picture of single-celled organisms that can mutate, reproduce asexually by cell division, and pass on (accumulated) mutations when they divide.”  I actually found this an interesting and useful analogy however it’s probably a little hard to relate to unless you remember your school biology!

In summary, I like this book.  I like the fact that it’s filled with gems of information you won’t easily find online.  I like the colourful language and the interesting discussion around the concepts the author is conveying.  Most importantly, this book is written by someone who has clearly developed real web applications.  If you’re someone merely looking to get cracking on a project using Django and jQuery in the shortest time possible, then this book might disappoint.  But then again, the online tutorials and references are there to get you started and this book can take over where they leave off.

Finally, the author strikes me as someone both interesting and accomplished and I look forward to reading other books he might have in the works.
