Posts RSS Comments RSSTwitter 18 Posts and 32 Comments till now
This wordpress theme is downloaded from wordpress themes website.

Archive for the 'Natural Language Processing' Category

Is English Tonal?

An interesting feature of several widely used Asian languages is that they’re tonal. In tonal languages, changing the intonation of what seems to be the same word (at least to the Western ear) can markedly change the meaning of that word. This can be quite hard to fathom for the typical English speaker. A celebrated example of this can be found in Mandarin Chinese:

妈 mā mother
麻 má hemp
马 mǎ horse
骂 mà scold
吗 ma (question tag)

I was in an English supermarket recently and read the word “discount” on several items for sale. It occurred to me that the word discount can be used with at least a couple of different but related meanings:

  1. In the supermarket it’s often used as a noun meaning “a reduction in the sale price”.
  2. It can also mean the verb “to dismiss”, “to remove from consideration” and sometimes “to reduce in price”.

What then struck me is that these two usages are spelled the same but pronounced differently. In the first meaning, the first syllable is stressed whereas in the second meaning, the second syllable is stressed. I tried to think of more words which followed this pattern and it took me some time to come up with “reject”, “survey” and “upset”. My hunch was that there were plenty more words like that so I set about seeing if I could automate finding them.

One can argue that changing the stresses on a word’s syllables changes its intonation. Does that make English tonal after all, albeit on a small scale?

Pronunciation

The Carnegie Mellon Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English. Its database of 100,000+ words contains a set of pronunciations organised as a list of sounds, for example:

tree = ['T', 'R', 'IY1']
biscuit = ['B', 'IH1', 'S', 'K', 'AH0', 'T']
undo = ['AH0', 'N', 'D', 'UW1']

I’m interested here not in the actual consonant and vowel sounds which can vary quite markedly with differences in regional accent, but in the stresses of the vowel sounds. These are indicated by a numeric suffix:

0 – No stress
1 – Primary stress
2 – Secondary stress

In the examples above, “biscuit” is pronounced with the stress on the first syllable and “undo” with the stress on the second.

In the Python programming language, the CMU Pronouncing Dictionary can be accessed using the Natural Language Toolkit (NLTK). If you’re using the NLTK for the first time, you’ll need to do the following:

>>> import nltk
>>> nltk.download()

A GUI will appear where you can choose to download the CMU Pronouncing Dictionary. This only needs to be done once. The dictionary can then be accessed as follows:

>>> from nltk.corpus import cmudict
>>> pronunciations = cmudict.dict()
>>> pronunciations['tree']
[['T', 'R', 'IY1']]
>>> pronunciations['discount']
[['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'], ['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T']]

Here we can see that “discount” is indeed listed with more than one pronunciation. Now lets distill the stresses in theses pronunciations:

>>> def stresses(pronunciation):
...     return [i[-1] for i in pronunciation if i[-1].isdigit()]
...
>>> stresses(['D', 'IH0', 'S', 'K', 'AW1', 'N', 'T'])
['0', '1']
>>> stresses(['D', 'IH1', 'S', 'K', 'AW0', 'N', 'T'])
['1', '0']

So in one pronunciation, the stress is on the first syllable and in the other pronunciation, the stress is on the second, just as we suspected.

Part of Speech

WordNet is a lexical database of English nouns, verbs, adjectives and adverbs. The database lists the multiple uses of a given word, and for any given use, its definition and most remarkably, its relationship to other words. For example, “dog” is a type of “canine” and a “poodle” is a type of “dog”. We’re interested in the fact that WordNet also helpfully stores the part of speech (i.e. noun, verb etc.) for any given usage.

WordNet can also be accessed using NLTK. Once again, for first use, the WordNet database needs to be downloaded using nltk.download().

Each usage of a word is called a “synset” (i.e. Synonym Set) in WordNet parlance and can be accessed as follows:

>>> wordnet.synsets('discount')
[Synset('discount.n.01'), Synset('discount_rate.n.02'), Synset('rebate.n.01'), Synset('deduction.n.02'), Synset('dismiss.v.01'), Synset('discount.v.02')]

As might be apparent from this example, the synset’s primary word may or may not be ‘discount’. In fact, each synset contains a list of words (known as lemmas) which can represent that usage:

>>> synsets = wordnet.synsets('discount')
>>> synsets[0]
Synset('discount.n.01')
>>> synsets[0].definition
'the act of reducing the selling price of merchandise'
>>> synsets[0].lemma_names
['discount', 'price_reduction', 'deduction']

We’ll concentrate on those synsets whose primary lemma is the word we are interested in.

Finally, the part of speech for a synset is easily obtained:

>>> synsets[0]
Synset('discount.n.01')
>>> synsets[0].definition
'the act of reducing the selling price of merchandise'
>>> synsets[0].pos
'n'
>>> synsets[5]
Synset('discount.v.02')
>>> synsets[5].definition
'give a reduction in price on'
>>> synsets[5].pos
'v'

Putting It All Together

So to find our “tonal” words, all we need to do is find words which fit the following criteria:

  1. Two or more syllables.
  2. Multiple pronunciations with different stresses.
  3. Can be used as a noun or verb.

A sample Python script can be found here.

And here’s the full list of 112 tonal English words found using this script:

['addict', 'address', 'affiliate', 'affix', 'ally', 'annex', 'associate', 'average', 'bachelor', 'buffet', 'combine', 'commune', 'compact', 'compound', 'compress', 'concert', 'concrete', 'confederate', 'conflict', 'content', 'contest', 'contract', 'contrast', 'converse', 'convert', 'convict', 'coordinate', 'correlate', 'costume', 'debut', 'decrease', 'defect', 'delegate', 'desert', 'detail', 'detour', 'dictate', 'digest', 'discharge', 'discount', 'duplicate', 'effect', 'escort', 'estimate', 'excerpt', 'excise', 'ferment', 'finance', 'forearm', 'geminate', 'general', 'graduate', 'impact', 'implant', 'import', 'impress', 'imprint', 'increase', 'insert', 'interest', 'intrigue', 'invalid', 'laminate', 'leverage', 'mentor', 'mismatch', 'object', 'offset', 'overflow', 'permit', 'pervert', 'postulate', 'predicate', 'present', 'privilege', 'produce', 'progress', 'project', 'protest', 'ratchet', 'recall', 'recess', 'record', 'recount', 'reference', 'refund', 'regress', 'research', 'reset', 'retake', 'rewrite', 'romance', 'segment', 'separate', 'sophisticate', 'subject', 'submarine', 'subordinate', 'supplement', 'surcharge', 'survey', 'suspect', 'syndicate', 'syringe', 'transfer', 'transport', 'trespass', 'underestimate', 'update', 'upgrade', 'upset', 'veto']

Observations

Interesting observations include:

  1. In most cases, stressing the first syllable yields the noun whereas stressing a later syllable yieds the verb.
  2. The noun and verb are usually closely related in meaning, however the nouns of some words have taken on a common usage which has detached it from the meaning of the verb. Obvious examples include “project”, “subject”… and “pervert”!
  3. There also seems to be a high frequency of words beginning with ‘com’, ‘con’ and ‘re’. Is this significant or is this is common of English verbs? I’ll leave that question as an exercise for the reader.

With a minor tweak to the script, we can find words that are combinations of adjectives, nouns and verbs. This gives us much smaller lists of words:

  • adjective/noun: ['antecedent', 'commemorative', 'compact', 'complex', 'compound', 'concrete', 'deliverable', 'eccentric', 'general', 'hostile', 'inside', 'invalid', 'invertebrate', 'juvenile', 'liberal', 'mineral', 'national', 'natural', 'oblate', 'peripheral', 'present', 'salient', 'separate', 'subordinate', 'worsening']
  • adjective/verb: ['abstract', 'alternate', 'animate', 'appropriate', 'articulate', 'compact', 'compound', 'concrete', 'frequent', 'general', 'invalid', 'moderate', 'perfect', 'present', 'separate', 'subordinate']
  • adjective/noun/verb: ['compact', 'compound', 'concrete', 'general', 'invalid', 'present', 'separate', 'subordinate']

Epilogue

It turns out that what we’ve found here are heteronyms which are two or more words which share the same spelling (also known as homographs) but have different meanings. More specifically, we’ve found plenty of initial-stress-derived nouns where a verb can be turned into a noun by stressing the first syllable.

I’m not sure we’ve proven that English is a truly tonal language, but this has been a good exercise in cross-referencing two major natural language databases to find interesting words.

The Freedom of the City

A couple of days ago I visited the beautiful John Rylands Library in Manchester with the family. Within the library is a document recording the honour of “Freedom of the City of Manchester” awarded to Enriqueta Augustina Rylands, third wife of John Rylands, when she founded the library in 1899.

Freedom of the City of Manchester

Aside from the beauty and the colourful vibrancy of this document, what struck me was the verbosity and sheer length of the sentences contained within. Here’s a key sub-sentence from the document which is 39 words long and drawn from a parent sentence no less than 73 words long.:

“…the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City…”

So how do we break down a relatively complex sentence such as this in order to analyse it?  The answer is to build a syntax tree, a representation of the sentence decomposed into its constituent sub-sentences, decomposed in turn into noun phrases and verb phrases, decomposed in turn into nouns, verbs and other parts of speech. This is a three-step process:

  1. Tokenising –  splitting the sentence into its constituent entities (mainly words).
  2. Part of speech tagging – assigning a part of speech to each word.
  3. Parsing – turning the tagged text into a syntax tree.

I’ll be using the nltk to help me. Here goes…

1. Tokenise

Splitting a sentence into words seems like it should be an easy task but the main gotcha is deciding what to do with punctuation such as full stops and apostrophes.  Thankfully, nltk just “does the right thing” (or at least it does the same thing predictably and consistently).  In our case, there’s no punctuation to worry about so we could just split the sentence on whitespace, but we’ll use the nltk anyway as good practice.

>>> import nltk
>>> sent = 'the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City'
>>> tokens = nltk.word_tokenize(sent)
>>> print tokens
['the', 'members', 'of', 'this', 'council', 'desire', 'to', 'express', 'their', 'opinion', 'that', 'the', 'powers', 'accorded', 'to', 'them', 'by', 'law', 'for', 'the', 'recognition', 'of', 'eminent', 'services', 'would', 'be', 'fittingly', 'exercised', 'by', 'conferring', 'upon', 'Mrs', 'Enriqueta', 'Rylands', 'the', 'Freedom', 'of', 'the', 'City']

2. Tag

Part of speech tagging is also catered for by the nltk. The built in tagger uses a maximum entropy classifier and assigns tags from the Penn Treebank Project.  A list of tags and guidelines for assigning tags can be found in this document.


>>> nltk.pos_tag(tokens)
[('the', 'DT'), ('members', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('council', 'NN'), ('desire', 'NN'), ('to', 'TO'), ('express', 'NN'), ('their', 'PRP$'), ('opinion', 'NN'), ('that', 'WDT'), ('the', 'DT'), ('powers', 'NNS'), ('accorded', 'VBD'), ('to', 'TO'), ('them', 'PRP'), ('by', 'IN'), ('law', 'NN'), ('for', 'IN'), ('the', 'DT'), ('recognition', 'NN'), ('of', 'IN'), ('eminent', 'NN'), ('services', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('fittingly', 'RB'), ('exercised', 'VBN'), ('by', 'IN'), ('conferring', 'NN'), ('upon', 'IN'), ('Mrs', 'NNP'), ('Enriqueta', 'NNP'), ('Rylands', 'NNPS'), ('the', 'DT'), ('Freedom', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('City', 'NNP')]

As expected, some tagging decisions are questionable and some are just plain wrong. The most common errors tend to be with words which can be used as both nouns and verbs, for example, desire and express. These are incorrectly tagged as nouns rather than verbs as a “best guess” as there are far more nouns than verbs in the English language. By my reckoning, we’ve achieved about 85% accuracy in this sentence with just six manual corrections required:

('desire', 'NN')      ->  ('desire', 'VB')
('express', 'NN')     ->  ('express', 'VB')
('that', 'WDT')       ->  ('that', 'IN')
('accorded', 'VBG')   ->  ('accorded', 'VBN')
('eminent', 'NN')     ->  ('eminent', 'JJ')
('conferring', 'NN')  ->  ('conferring', 'VBG')

3. Parse

Now the hard part. Analysing sentence structure tends to be a manually intensive process. I’ll start by hand crafting a context free grammar by gradually splitting the sentence into its constituent parts in multiple iterations, for example:

Iteration 1


S    = Sentence
NP   = Noun Phrase
VP   = Verb Phrase
SBAR = Subordinating Clause
IN   = Preposition or subordination conjunction.

(S the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City)

Iteration 2


(S (NP the members of this council) (VP desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City))

Iteration 3


(S (NP the members of this council) (VP (VP desire to express their opinion) (SBAR (IN that) (S the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City))))

...etc...
By repeating this process, the following grammar is produced, shown here together with an application to display the generated syntax tree.
import nltk

sent = 'the members of this council desire to express their opinion that the powers accorded to them by law for the recognition of eminent services would be fittingly exercised by conferring upon Mrs Enriqueta Rylands the Freedom of the City'

tokens = nltk.word_tokenize(sent)

grammar = """
    S    -> NP VP
    NP   -> NP PP | DT NNS | DT NN | PRPS NN | NP IN NP | NP VBN PP | JJ NNS | DT NNP
    PP   -> IN NP | TO VP | TO PRP IN NN | IN VP
    SBAR -> IN S
    VP   -> VP SBAR | VB PP | VB NP | VP NP | VP PP | MD VB RB VBN | VBG RP NNP NNP NNP NP

    DT   -> 'the' | 'this'
    NNS  -> 'members' | 'powers' | 'services'
    IN   -> 'of' | 'that' | 'by' | 'for'
    NN   -> 'council' | 'opinion' | 'law' | 'recognition'
    VB   -> 'desire' | 'express' | 'be'
    TO   -> 'to'
    PRPS -> 'their'
    VBN  -> 'accorded' | 'exercised'
    PRP  -> 'them'
    JJ   -> 'eminent'
    MD   -> 'would'
    RB   -> 'fittingly'
    VBG  -> 'conferring'
    RP   -> 'upon'
    NNP  -> 'Mrs' | 'Enriqueta' | 'Rylands' | 'Freedom' | 'City'
"""

parser = nltk.ChartParser(nltk.parse_cfg(grammar))
trees = parser.nbest_parse(tokens)
trees[0].draw()

This grammar results in no less than 1956 different possible syntax trees for this sentence (in theory meaning that this sentence could be interpreted in up to 1956 different ways).

Syntax Tree

The first of these syntax trees has a maximum depth of 11.  Contrast this with a sentence such as “the cat sat on the mat” with a maximum depth of approximately 5.  The depth of the syntax tree gives a feel for the complexity of the sentence and the depth of sub-sentences, sub-clauses and dependent phrases within the sentence.

Now when it comes to considering how the human brain might parse and understand this sentence, it might be interesting to consider whether the depth of the syntax tree can be thought of similarly to the stack depth in a running application.  Does the human brain contain a stack for parking sentence fragments as a complex sentence unfolds?  Is there a maximum stack depth, and if so, does this vary greatly from person to person?

Complex sentences certainly require more concentration to understand and perhaps the phrase: “Could you repeat that, please!” is the direct result of a cerebral stack overflow error!