Deconstructing Khmer

Matt and I are currently writing a program to extract all the legal syllable clusters that can occur in the Khmer written language.

Text is often represented inside a computer using ASCII, which can handle only 128 possible characters (256 in its common 8-bit extensions). That’s fine for English, but to work with the tens of thousands of characters found in the world’s more exotic languages, you use ‘Unicode’. We’ve never worked with Unicode before, so there’s been a learning curve for both of us.
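To make the ASCII/Unicode difference concrete, here is a small Python sketch (not part of our actual program). It spells the word ‘Khmer’ in Khmer script using code point escapes, looks up each character’s official Unicode name, and shows that the word can’t be squeezed into ASCII. The names `KHMER_WORD`, `describe` and `fits_in_ascii` are made up for this illustration.

```python
import unicodedata

# The word "Khmer" in Khmer script, written as Unicode escapes so the
# individual code points are visible (illustrative sample only).
KHMER_WORD = "\u1781\u17d2\u1798\u17c2\u179a"


def describe(text):
    """Return a (code point, Unicode name) pair for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def fits_in_ascii(text):
    """True if every character in the text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for code_point, name in describe(KHMER_WORD):
        print(code_point, name)
    print(fits_in_ascii("Khmer"), fits_in_ascii(KHMER_WORD))
    # Five code points, but three bytes each once encoded as UTF-8.
    print(len(KHMER_WORD), len(KHMER_WORD.encode("utf-8")))
```

Note that what looks like three or four ‘letters’ on screen is five code points underneath, which is exactly why the clusters need picking apart.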

We’re working with Diethelm ‘Didi’ Kanjahn, a translator and font designer (he created the Khmer font ‘Mondulkiri’). He’s part of the global translation organisation ‘SIL’, which works with non-Roman scripts around the world. We had a chance to meet up with him last week to discuss the intricacies of Unicode Khmer rendering before he returned to Mondulkiri.

Didi needs to check how each syllable looks visually, since Khmer vowels can appear above, below, in front of or behind their consonant (and sometimes in front AND behind!). Runs of dependent glyphs need to position themselves relative to each other, like successive toppings on an ice-cream sundae. Fortunately Didi will handle all of that; we’re just coding a Python script to extract all the possible clusters from a large body of text: the entire Bible, plus lots of modern sources of Khmer language, like news sites.
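For the curious, here is a rough sketch of what ‘extracting clusters’ can look like in Python. It is a deliberately simplified assumption, not our finished script and not SIL’s rules: a cluster is taken to be one base character (a consonant or independent vowel), followed by any number of subscript consonants (the COENG sign U+17D2 plus a consonant) and dependent vowels or signs. The names `CLUSTER_RE`, `extract_clusters` and `count_clusters` are invented for this example.

```python
import re
from collections import Counter

# Simplified cluster pattern (an assumption, not the full orthographic rules):
#   base:       consonant or independent vowel, U+1780..U+17B3
#   subscript:  COENG (U+17D2) followed by a consonant, U+1780..U+17A2
#   dependents: vowel signs and diacritics, U+17B6..U+17D1, plus U+17DD
CLUSTER_RE = re.compile(
    r"[\u1780-\u17B3](?:\u17D2[\u1780-\u17A2]|[\u17B6-\u17D1\u17DD])*"
)


def extract_clusters(text):
    """Return every cluster found in the text, in order of appearance."""
    return CLUSTER_RE.findall(text)


def count_clusters(texts):
    """Tally the distinct clusters across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(extract_clusters(text))
    return counts


if __name__ == "__main__":
    # Two sample words written as escapes; the second has two clusters
    # that share the same vowel sign (U+17B6).
    sample = "\u1781\u17d2\u1798\u17c2\u179a \u1797\u17b6\u179f\u17b6"
    for cluster, n in count_clusters([sample]).most_common():
        print(" ".join(f"U+{ord(ch):04X}" for ch in cluster), n)
```

Because the pattern ignores spaces, punctuation and Latin text, it can be pointed at messy sources like news pages; a real version would need more careful rules about which combinations are actually legal.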

The SIL team are also using the Khmer glyphs (letter shapes) to give a written form to five other indigenous languages which have never developed their own written scripts. The plan is that our program should also work for them.

This is Matt’s first time coding something ‘real’ for a client other than himself, so he’s learning about testing and quality assurance: the code isn’t finished when you’ve typed the last line, or when it produces something that ‘looks’ right, but when ALL your output tests pass and you can’t think of any new ones to add.


Khmer cluster construction
