Monday, October 12, 2009

FAQ and Resources on Khmer in Unicode

Table of Contents:

1.What is Unicode?

2. What is meant by ‘Khmer in Unicode’?

3. Isn’t there something odd about this Khmer encoding?

4. How can one ‘see’ Khmer Unicode text?

5. How do I change this Khmer Unicode encoding? I don’t like it.

6. What will a Khmer Unicode keyboard look like?

7. How should Khmer Unicode be sorted/collated/ordered?

8. Can legacy encoded Khmer text be converted to Khmer Unicode?

9. How can Khmer be transliterated?

10. Language Codes (including Khmer)

11. Khmer Unicode Effect on Other Standards

12. Using Khmer Unicode in Microsoft Office 2003

13. Voice to Khmer Unicode text

14. Locale

15. Text editors; escape codes

16. Khmer URLs


Revision History:

5 August 2004 XeTeX, UltraXML - Professional Desktop Publishing in Khmer, updated files

1 May 2004 Om Mony's new Web site (with fonts); KhmerOS.ttf version 1.6 URL.

19 March 2004 Ukelele - Mac OS X Keyboard Layout Editor; Macintosh TextEdit layout bug illustrated

28 February 2004 Macintosh programmes that handle Khmer

9 February 2004 Macintosh savvy Khmer Unicode font being made by Daniel Kai: http://www.xenotypetech.com/osxCambodian.html

4-5 February 2004 Added Khmer Unicode URLs section and notice of Lin Chear's ABC Khmer encoding to Unicode, Khmer sorting sample

19 January 2004 Added reference to KhmerOS web site

7 January 2004 Added a Pango page pointing to Lin Chear's Linux implementation of Khmer

1 January 2004 Extensive reworking of TECkit.html page

27 December 2003 Reference to browser.html page

24, 26 December 2003 Lin Chear's work on Pango

17 December 2003 Locale URL

14 December 2003 Word breaks, procedures for text encoding conversion.

3 December 2003 STSF from Sun; ICU information, locales

25 November 2003 MacOS X keyboard URL

23 November 2003 MacOS X Font tools URL, Javier Sola, OpenOffice/StarOffice URLs

13 November 2003 VOLT/usp10.dll URLs, audio to Khmer text, editing

10 November 2003 URL to revised Kent Karlsson Khmer sorting document

9 November 2003 Unicode 4.0/Microsoft Office 2003

21 October 2002 Khmer Unicode deprecation recommendations

31 August 2002 UN Transliteration Scheme URL added

23 July 2002 Early release of a Graphite enabled Khmer Unicode font for font developers

24 May 2002 21st International Unicode Conference; Additions/modifications to Khmer Unicode. Note especially the reluctant acceptance by the Cambodian government of the Khmer Unicode encoding (N2459R).

7 April 2002 Keyman keyboard, Code 2000 font, Adobe naming convention URL, Kent Karlsson sorting contribution, IUC announcement and paper, general editing.

4 April 2002 XML Blueberry

28 March 2002 Unicode 3.2 Technical Report URL added

17 February 2002 Added PDF files of proposed changes to Khmer Unicode

11 February 2002 Added information on ‘seeing’ Khmer Unicode

1 December 2001 Table of Contents, Transliteration scheme and Language Code URLs.

21 November 2001 New Khmer Sorting document

17 November 2001 New Khmer Unicode mailing list

26 October 2001 Revised submission of above with code table

23 October 2001 The Khmer Philology Project and some others encouraged the Khmer government to challenge the Khmer Unicode encoding. The documents are referenced below.

21 September 2001 Graphite/Keyman Khmer Unicode possibilities; Keyboard layout

27 August 2001 Standardisation of a Non-Commercial Script

26 May 2001 Added information on Pali dictionary, Unicode font URLs, RTFEditorKit and MIFFile

18 May 2001 Frequencies file replaced with proper one

Top.


1. What is Unicode?

“The Unicode Standard is the universal character encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software.“ The Unicode Consortium. The Unicode Standard Version 3.0. Reading: Addison-Wesley, January 2000. p. 1 ISBN 0-201-61633-5. The International Standards Organisation coordinates in this encoding: “The Unicode Standard is fully compatible with the international standard ISO/IEC 10646-1:2000, Information Technology-Universal Multiple-Octet Coded Character Set (UCS)-Part 1: Architecture and Basic Multilingual Plane, which is also known as the Universal Character Set (UCS).” Idem, p. 5. In due time virtually all the scripts of the world will be included in Unicode. Since the base plane is filling up quickly, supplementary character codes are being made available in additional planes. Unicode Consortium, The Unicode Standard, Version 4.0. Addison Wesley, 4 September 2003 1520 pages ISBN: 0321185781. Virtually the whole book is available in multiple PDF files at http://www.unicode.org/versions/Unicode4.0.0/

Top.

2. What is meant by ‘Khmer in Unicode’?

The hexadecimal range U+1780 to U+17FF in Unicode/ISO10646 includes characters of the Khmer/Cambodian script. With Unicode Version 4.0 an additional range has been added: U+19E0 to U+19FF. The Khmer script, in turn, is used to record text in the following languages: Khmer, Sanskrit (transliteration), Pali (transliteration), Cham, Krung, and other minority languages used in Cambodia. Khmer Unicode is the only globally standardised encoding of the Khmer script. Top.

3. Isn’t there something odd about this Khmer encoding?

3.1 What are the sources of the Khmer encoding of Unicode?

Significant study, discussion, and debate went into this encoding. Some key steps: UNESCO-sponsored conference in Phnom Penh on Unicode (immediately after UNTAC), Khmer government sponsored meetings of Khmer linguists in Cambodia (Report of 14 August 1996, pages: 1, 2, 3, 4, Rough English translation), Unicode sponsored mailing lists, and discussions within an ISO subcommittee. There was a healthy debate at every stage. A 1992 proposal by Andy Daniels, Unicode Technical Report #1 (http://www.unicode.org/unicode/reports/tr1.html and http://www.unicode.org/Public/TEXT/UTR-1.TXT), did not progress to the level of a standard but heavily influenced what followed. I offered a number of proposals and encodings: (14 March 1997, 17 March 1997, 30 July 1997, 20 September 1997, 13 March 1998). An ad hoc committee of the WG2 working group focused on the Khmer encoding proposals...and approved one to be reviewed by the International Standards Organization members as a PDAM (pre-draft amendment to ISO10646). That decision was made 17 March 1998. The agenda of the ISO/IEC JTC 1/SC 2/WG 2 starting 30 June 1997 http://www.dkuug.dk/JTC1/SC2/WG2/docs/N1517.html Discussions by the ISO subcommittee led to documents dated: 18 March 1998 and 23 October 1998. A great debt is owed to Michael Everson (everson at evertype.ie; http://www.evertype.com) and Paul Nelson (paulnel at microsoft.com) who have worked very hard on the standardisation and implementation respectively of Khmer Unicode. An OpenType specification for Khmer is now available on the Microsoft site.

3.2 Why is this encoding so different from those previously used (glyph versus phonetic encoding)?

This encoding breaks from the tradition of glyph-based encodings where each subscript, vowel-part, ligature, and positional variation were separately encoded. Although glyph-based encodings of Indic scripts are easy to implement (for display purposes, except that there were not enough codes to render all ligatures and characters in one byte encoding schemes), they hinder: (1) extensibility [i.e., vowels in other languages in the Khmer script could use one glyph where the Khmer language uses two glyphs or ancient Thai-influenced writing of a vowel may use two glyphs whereas Khmer now uses only one glyph], (2) sorting, (3) searching, (4) transliteration [i.e., switching the display via fonts for an identically encoded text in the same language; i.e., from Khmer script into International Phonetic Alphabet for foreigners], (5) text-to-speech, and (6) rule-based display [i.e., presenting a visual warning if by mistake two vowels were entered for one cluster/quantum in the Khmer language; a visual warning if characters are not entered in the proper order, which is: base consonant/independent vowel/number, [rarely] robat, optional register shifter (but in a secondary priority for sorting), optional subscript, optional subscript, vowel (often inherent), optional vowel-like sign (U+17C6, U+17C7, U+17C8), optional sign, optional sign].

Therefor Khmer Unicode is by preference a phonetic encoding, one in which each character occurs in the order it is pronounced and spelled [which is also the sequence of sorting/collation/ordering].

3.3 Why are there no subscripts in this encoding (COENG)?

In addition, a decision was made to implement subscript characters as a COENG character (U+17D2) followed by a ‘base’ character. The advantages of using COENG include: (1) it maintains the close relationship there is between the base and subscript forms of characters [except for the COENG prefix, the names are the same; the pronunciation is the same excepting that only the last in a cluster/quantum bears the vowel sound; the analogous ordering is the same], (2) it limits the number of characters that have to be encoded [explicit encoding could have added as many as 52 additional codes: all Khmer/transliteration consonants PLUS all the independent vowels], (3) it makes typing easier by limiting the number of keys that need to be learned, (4) it maintains consistency with most other Indic scripts, (5) it forces a decision in typing between subscript TA (U+178F) and subscript DA (U+178A) which are visually identical in subscript form but need to be distinguished in sorting and searching, (6) it allows a standardised means of encoding every possible subscript form [consonants and independent vowels, for example]; without enshrining in a standard some subscripts which never occur and (7) it could have facilitated the old representation of lunar dates (but those later were encoded [like ligatures] at U+19E0 to U+19FF) http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2491.pdf ).

3.4 There appear to be duplicates in the encoding. Why?

The apparent duplication of U+17A2 and U+17A3 reflected the linguist's perception that there was a need to retain the Khmer consonant (U+17A2) and the Sanskrit/Pali independent vowels (U+17A3, U+17A4) separately. U+17A3 was later deprecated as it reflects language-specific sorting differences rather than a true character difference (even though one is considered a consonant and the other a vowel). Deprecation does not mean the character is taken out of the standards...but its use is strongly discouraged. The U+17A4 character should only be used for Sanskrit/Pali transliteration in specialist works (which would be necessary if ever found in a subscript form).

Anders Keller picked up a Pali dictionary transcribed in Khmer script in Phnom Penh and discovered that in that book the independent vowels (including KHMER INDEPENDENT VOWEL QAQ U+17A3 [substitute U+17A2 due to the deprecation]) sort before KHMER LETTER KA U+1780 on pages 1-84. KHMER LETTER LA U+17A1 is a fairly recent Khmer invention so is absent in the Pali series (but has recently been found in a subscript form in Northern Thai use of the Khmer script).

The order of the independent vowels from pages 1-84:
U+17a3 (this has subsequently been deprecated)
U+17a4
U+17a5
U+17a6
U+17a7
U+17a9
U+17af
U+17b1
U+17a2 U+17c6 (QAM or something like that)
The next character is KA U+1780

Similarly the normally invisible U+17B4 and U+17B5 are only meant to distinguish Khmer versus Pali use of inherent vowels and would only be used in specialist works.Words spelled with these vowels would be identical on the primary level with words otherwise identically spelled, but without these vowels. On a secondary level they would be different. Explicitly tagging these inherent vowels also facilitates round trip transliteration.

3.5 Why are the vowels inconsistent with what is taught in schools?

The three signs U+17C6, U+17C7, and U+17C8 have been encoded separately from vowels so as to avoid inconsistency in data entry under the direction of the Khmer linguists committee. There are many vowel combinations which are possible using these signs in addition to the commonly recognised U+17C6, U+17B6 + U+17C6, U+17C4 + U+17C7. The available choices were: (1) encode these three as separate vowels, in which case U+17C6 would sometimes function as a vowel, sometimes as a sign, so users would enter U+17B6 + U+17C6 sometimes as one code and other times by the constitutent parts (Note this paper dealing with such scenarios), (2) encode U+17C6, U+17C7, and U+17C8 as signs as was ultimately done. For sorting purposes, however, each combination of vowel plus one of these three signs should be treated as a unique vowel.

3.6 What else is recorded about Khmer in Unicode?

Each character in Unicode also has a number of properties encoded in a database. The details of those can be found at:

Version 4.0 Khmer encoding (search for Khmer):

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

Explanation of properties database:

ftp://ftp.unicode.org/Public/UNIDATA/UCD.htmlUCD.html

Link to Unicode version 4.0:

http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Latest

Link to Unicode version 3.2 discussion re Khmer:

http://www.unicode.org/unicode/reports/tr28/#9_15_khmer

Link to Unicode version 4.0 chapter dealing with Khmer:

http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf

Top.

4. How can one ‘see’ Khmer Unicode text?

GENERIC FONT ENCODING

Whatever font type you use, it is important to use Adobe’s Unicode PostScript naming scheme: http://partners.adobe.com/asn/developer/type/unicodegn.html A font editor is needed: I find ‘Font Creator Program 4.0’ (http://www.high-logic.com/) to be one of the cheapest ones for TrueType/OpenType fonts...and it does not destroy the added font tables needed for intelligent fonts.

GRAPHITE

An early beta copy of a public domain Khmer Unicode font using Graphite technology was available on this site from 23 July 2002. The outlines were those generated by Mr. Om Mony for UNICEF in 1995. These have been extensively edited (but not hinted;-() by Maurice Bauhahn. The Graphite scripts are largely the work of Maurice Bauhahn with patient help from Sharon Carrol and Martin Hosken. This font automatically generates wrap around vowels, prefix subscript RO, subscripts, ligatures, lunar dates, stacking superscripts, etc. It is available for use on Microsoft Windows 98 and Win2000. This requires an installation of Graphite and WorldPad from http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=RenderingGraphite, download and decompression of KhmerGraphite.zip, the normal manual installation of PhnomPOT_Krung_gr.ttf, execution of grfontinst PhnomPOT_Krung_gr.ttf, installation of Keyman and either the KhmerDV keyboard or the Khmer phonetic keyboard, and setup of WorldPad Writing System Properties for KHM (Keyboard: Keyman; Font: PhnomPenh 95).

Didi Kanjahn subsequently created a separate Khmer Unicode font using Graphite: http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=GraphiteFonts

While attending a fantastic one-week conference (Non-Roman Script Initiative) the middle of September 2001, I did a successful test run of typing Unicode Khmer using the Graphite engine (from the Summer Institute of Linguistics), WorldPad, and Keyman. These are at the moment Windows only applications, but 20 September 2001 the source code was made available under an Open Source license (so hopefully it will be ported to Linux). First of all one needs to create a Unicode enabled TrueType font (the ‘cmap’ table needs to reference Unicode code points and the ‘post’ table uses Adobe certified names (when there is no such name, that is the Unicode hex value preceded by ‘uni’: KHMER LETTER KA would be ‘uni1780’). Then an XML-like text file is prepared with the proper ordering, substitution, positioning, etc rules. The Graphite font compiler then moves those rules into the TrueType font. A special font installer incorporates the Graphite-enabled TrueType font into the operating system (Win98, WinNT, Win2000, or higher) and registers it three different places in the Windows Registry. At the moment the only application taking advantage of Graphite is a simple word-processor, World Pad. A presentation of Khmer Unicode in a Graphite engine was done at the 21st International Unicode Conference in Dublin, Ireland: http://www.unicode.org/iuc/iuc21/a329.html However, work needs to done on an interface with Microsoft SQL Server and other applications.

OPEN TYPE

For ‘quick and dirty’ viewing of Khmer Unicode an immediately available tool is the James Kass’ shareware Code 2000 font (not the Code 2001 which rather encodes ‘supplementary’ code plane characters). This is an OpenType font which in the presence of a Uniscribe engine can properly display Khmer (excepting some ligatures). It does not appear to be hinted...so screen display is a bit fuzzy.

Development of an OpenType ‘intelligent’ Khmer font (dependant upon the USP10.dll) was undertaken on the Windows platform but missed an important software development cycle so was delayed some years until 2003. A copy of that usp10.dll file is available to people under a non-disclosure agreement in the VOLT user group area. Note at the bottom of this URL issues about later versions of Uniscribe not working right with Windows version earlier than WinXP: http://uk.geocities.com/BabelStone1357/Software/BabelPad.html#download You may download the excellent VOLT OpenType development tool from Microsoft. There is also a good VOLT community at http://groups.msn.com/MicrosoftVOLTuserscommunity/.
In a recent version you'll find a file named VOLTSupplementalFiles.exe from which can be extracted usp.cab containing the 1.0460.3707.0 version of
Uniscribe (usp10.dll). ,
http://download.microsoft.com/download/Typography/Install/1.1.206/W98NT42KMeXP/EN-US/msvolt.exe An excellent article on OpenType can be found at:
http://www.microsoft.com/typography/developers/opentype/default.htm

FREETYPE

However it is hoped that FreeType (http://www.freetype.org/) Khmer fonts on Linux/Unix, AAT (http://fonts.apple.com/) fonts on Macintosh (using the ATSUI engine: http://www.apple.com/developer/; http://developer.apple.com/fonts/OSXTools.html), and Graphite fonts (using the Summer Institute of Linguistic’s engine documented at http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=RenderingGraphite) will be developed as well.

ICU

Possibly the most important area for development of Khmer would be in the OpenSource ICU project which is the basis of Unicode development for Sun and IBM's Java layout engine as well as for many applications. This depends upon OpenType fonts with the reordering being handled by some C programmes (which effectively replace the Uniscribe functionality, but cross-platform). The OpenI18N has recommended the inclusion of ICU in Linux distributions (http://www.openi18n.org/openi18nspec/index.html). OpenOffice uses the ICU layout engine as well as other services. You may join the ICU mailing list at http://oss.software.ibm.com/icu/archives/

PANGO

Robert Brady (robert at suse.co.uk) writes in March 2001 describing the Pango engine for rendering complex scripts: “[Pango] is the text rendering and shaping infrastructure that’s going to be used by the next major version of GTK+ and Gnome (see http://www.gnome.org/), a desktop environment thing for Unixish platforms, including Linux.” Pango uses some heavily modified code from ICU (but does not use UTF-16 internally).

Lin Chear (linchear at rogers.com) reports on Christmas Eve 2003 that he is making good progress in implementing Khmer with Pango 1.3.1 using the KhmerOS.ttf font along with GTK+. This is a significant breakthrough in the field of Open Source using of Khmer. Congratulations to Lin Chear on these exciting developments. See the present state of play on this site at Pango.html and also: (http://www.khmer.cc/members/homepage.html?member=CrashX): klinux.jpg

X11

Sun has made available under BSD license X11 technology that reportedly can take advantage of TrueType, PostScript Type1, OpenType and TrueType GX (AAT) technologies called Standard Type Services Framework (STSF): http://stsf.sourceforge.net/about.html

AAT

Unicode fonts for Mac, AAT (Apple Advanced Typography), Apple Font Tools, ATSUI applications. http://groups.yahoo.com/search?query=aattype

Daniel Kai is rapidly assembling AAT fonts for the Macintosh (note that you can encourage and help development by preordering on the following page): http://www.xenotypetech.com/osxCambodian.html

Programmes which have an excellent display of Khmer Unicode (if supplied with an AAT font) are:

MacOS 10.3 (Panther) - you can type or paste Khmer names of files on the desktop...and they show up in list boxes!

Safari 1.1 (v100) browser on MacOS 10.3

WorldText 1.3.16 (buried in the Developer installs; can be used to create or edit text files with special fonts, typography effects, or embedded media) on MacOS 10.3

Somewhat problematic display of Khmer Unicode (these are unusable due to a bug in the TextEdit libraries: http://homepage.mac.com/bauhahnm/applications.html):

Stone Create (11.3.2 (v February 9 2004 [build 359]) http://www.stone.com (desktop publishing programme)

TextEdit 1.3 (v202) (Drag & drop text, graphics, and attachments; Edit RTF, Word, Unicode and plain text files; import SimpleText and HTML files)

SubEthaEdit Version 1.1.5 (v1038) http://www.codingmonkeys.de (collaborative distributed text editor)

Reportedly Apple's implementation of the Java JRE also supports AAT fonts.

Jonathan Kew of SIL has implemented what is possibly the most significant world-aware desktop publishing solution...it works great for Khmer Unicode and is FREE! XeTeX (pronounced zeetek) is a Macintosh-based implementation of Donald E. Knuth's TeX programme that uses Apple's AAT fonts. Its preferred output format is PDF.(http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=xetex)

A programme available on PCs with very powerful desktop publishing functions is UltraXML 3.3. Although in preliminary trials it did not handle Khmer word-wrap (inadequate handling of Zero Width Space), it is more GUI oriented and handles the display of Khmer Unicode well. Very discouraging, however, is its price £2,400 for one license. It lacks the ability to generate PDF files (http://www.webxsystems.com)

Ready, Set, Go! 7.6 (compatible with Mac OS X 10.3) and Mellel 1.7.0.1 do NOT support AAT fonts, unfortunately.

I am also working on some legacy<->Unicode transcoders for people working with legacy encodings using TECkit (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit).

GENERIC OPEN SOURCE

Note the following URLs that point to work being done on incorporating other Indic scripts in open source software:

http://trinetra.ncb.ernet.in/bharateeyaoo/; http://marketing.openoffice.org/conference/presentations-pdf/IndianPerspective.pdf

GNU Unicode Font Project (http://czyborra.com/unifont/); http://groups.yahoo.com/search?query=gnu-unifont


FONT DEBUGGING

A Perl script (to be used with Perl 5.8.1) can be used to generate all conceivable [many more than actually occur!] consonant cluster combinations: TestKhmerFont.pl.txt A partial output of the above script in Khmer Unicode (UTF-8) is available here: Khmer6016.txt

Top.

5. How do I change this Khmer Unicode encoding? I don’t like it.

5.1 What cannot be changed?

One of the characteristics of a standard is that it is stable and identifiable. Hence, one of the rules of Unicode is that no code or code name (or other ‘normative’ property) that is in the standard may be deleted or changed. It took years for Khmer Unicode as it now stands to pass through to become a standard, so similar thoughtfulness and discussion should be applied in considering changes.

5.2 How are mistakes corrected?

With respect to codes and names: use of a particular code may be deprecated and justifiable additions may be made if they pass through the proper procedures (they may not be deleted, however). Descriptions and non-normative properties are relatively easily corrected.

5.3 Are there mistakes in Khmer Unicode?

The paper copy of Khmer Unicode Version 3.0 had some significant display errors (U+17BF, U+17C1, U+17C2, U+17C3). These were corrected in a PDF file issued subsequently. The corrected code charts for the scripts in Unicode may be found at: http://www.unicode.org/charts The descriptive writeup on pages 251 and 253 also has mistakes (Such as: “The ZERO WIDTH SPACE can grow to have a visible width when justified”). These may be corrected (but persists in Unicode 4.0;-().

One of the characters that should be deprecated (but probably will not be) is KHMER SIGN ROBAT (U+17CC). A careful study of sorted words which include it, reveals that it is handled in Khmer like in other Indic scripts: as a glyph variant of an initial KHMER LETTER RO, followed by COENG/virama, followed by the base character over which it resides (in those rare instances that a full base-KHMER LETTER RO is followed by a subscript the entry order is RO, U+200C, subscriptable-base). Note that the Khmer COENG functions like the VIRAMA in other indic scripts. COENG/VIRAMA should be distinguished from VIRIAM (U+17D1) which has other functionality, indicating that the base character is part of the previous word). Some characters used for transcription of Sanskrit/Pali have subsequently had warnings attached to them (as they should not be used in everyday Khmer data entry).

5.4 What additions might be added to Unicode relating to the Khmer script?

5.4.1 Why should a SOFT SPACE be added to Unicode?

One of the characters which I would like to see added to Unicode is a CONDITIONAL SPACE/SOFT SPACE (a pseudo-phrase break which could added to facilitate justification of Khmer and other Indic scripts, but which would disappear when no longer needed in re-wrapping). At the moment the SPACE character must be used to represent both real phrase breaks as well as justification spaces that have been added to lines in particular contexts. Expansion of space between words for the purpose of justification is not allowed in Khmer, so these justification spaces are added on lines that need them in such a way as to minimize a disruption in the underlying thought. The retention of these ‘justification spaces’ introduces significant corruption into Khmer texts when they are wrapped differently.

5.4.2. What other characters/signs should be added to Khmer Unicode?

There are also a number of transliterated Sanskrit/Pali signs which could be added. The accompanying scans of a Sanskirt Grammar are worth investigating (Cover, 002, 003, 004, 018, 019, 020, 021, 063, 064, 065, 066, 070, 139, 140, 172). There is also a Pali independent vowel which resembles U+17A3 + U+17C6 that is sorted separately in a Pali dictionary in Khmer script.

5.5 Official challenge to the standard.

5.5.1 Official complaint was document N2380.

5.5.2 Response from Maurice Bauhahn and Michael Everson, N2384, and corrections.

5.5.3 Notes on meeting, N2394. ISO WG2 decision, N2404R

5.5.4 Some problems yet to be solved: RemainingIssues.html

5.5.5 21st meeting of Unicode Consortium http://www.unicode.org/iuc/iuc21 14-17 May 2002 in Dublin and WG2 http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/n2404r.doc 20-23 May 2002 also in Dublin, Ireland.

5.5.6 Revised N2380R document with encoding table

5.5.7 Response to N2384 (N2406)

5.5.8 New Khmer Unicode mailing list (KhmerUnicodeMailingList)

5.5.9 Additional proposals to change Khmer Unicode: Lunar dates, Vowels Om/Am, Additional Signs, Deprecations, Krung Voicing Marks, and Explicit Subscript Encoding. The vowels Om/Am and Explicit Subscript Encodings were rejected by the UTC.

5.5.10 Unicode Technical Report 28 (Unicode 3.2): http://www.unicode.org/unicode/reports/tr28/#9_15_khmer

55.11 Comments on Technical Report 28 OM/AM Vowels, inherent vowels, deprecation comments.

5.5.12 Official documents considered at the WG2 meeting in Dublin included: N2406, N2412, N2458, N2459R, N2470, N2459, N2471, N2472-AMD2. The Lunar dates and additional divination numbers were accepted by the UTC and WG2 of ISO. The WG2 does not actually provide for deprecations, but an ad hoc committee at the WG2 meeting did agree that certain characters should be very restricted in their use. The Krung voicing marks were rejected in the WG2 ad hoc.

5.5.13 Khmer Unicode Deprecation recommendations by UTC: http://www.unicode.org/unicode/consortium/utc-minutes/UTC-092-200208.html

5.5.14 Khmer Unicode 4.0 (with clearer descriptions, addition of one rare sign, ten divination digits, and lunar dates). Updated chart, new chart, description.

Top.

6. What will a Khmer Unicode keyboard look like?

6.1 Where are keyboard layouts standardised?

The current contact is Mr. Alain LaBonté (email: alb at iquebec.com), ISO/IEC 9995 series Project editor, International Standards Organisation/International Electrotechnical Commission JTC1/SC 35 – User Interfaces, N0232, ISO/IEC 9995, Keyboard layouts for text and office systems. Alain’s Web site contains some useful documents: http://www.iquebec.com/cyberiel

6.2 What characters should be included in a Khmer keyboard?

Please follow this hyperlink to a PDF document that I have prepared detailing characters useful for a Khmer Unicode keyboard: Khmer Characters. In addition Unicode savvy operating systems will offer a means to enter any Unicode character via code number.

6.3 What are the issues involved in a Khmer Unicode keyboard?

Here are two documents which discuss these issues: KeyboardingofKhmerUnicode.rtf (in Microsoft Word Rich Text Format) and KhmerUnicodeKeyboard.rtf (also in Microsoft Word Rich Text Format). A generalised page with several useful utilities may be found at: http://www.hclrss.demon.co.uk/unicode/utilities_fonts.html

A Khmer Unicode keyboard layout for Windows may be created using Keyman 6.0 (available in the Developer version from Marc Durdin of Travultesoft; however once a keyboard is created a run-time version is available for free for non-commercial or non-governmental use). A sample keyboard resembling the Khek phonetic layout is available here in the compiled and un-compiled form.

Macintosh Khmer Unicode keyboards for MacOS X (http://developer.apple.com/technotes/tn2002/pdf/tn2056.pdf) may be created using a text editor or http://homepage.mac.com/poorant79/software/download.html / http://wordherd.com/keyboards/

6.4 What statistics need to be considered in designing a keyboard?

A keyboard should be designed to be efficient and relaxing. Some of the rules which help to facilitate that are: (1) the most frequently used characters are easiest to type (generally on the home row, row B as designated in the standards), (2) characters that follow each other should generally be entered by opposite hands, (3) either hand has about the same load and (4) the fingers of each hand have a fairly balanced load. Here is a document which presents the statistics on a Khmer text with about a quarter of a million Khmer characters: KhmerLetterPairFrequenciesB.txt. It would be helpful if we could calculate the latency of reaching to various keyboard positions from other positions. This chart has been helpfully reworked by Javier Sola: Frequency_of_key_strokes_in_Khmer_Unicode.pdf

A small computer programme that accepts text and measures the time to reach from one character to the other (correlating that with keyboard position) would be helpful. Some years earlier I attempted something like that by measuring the gaps between the sounds of text being typed on to the keyboard and comparing that against the keyboard positions of the individual characters. Examples of Java applets that compile other statistics are at the following URL: http://www.acm.vt.edu/~jmaxwell/dvorak/compare.html

6.5 What is the keyboard layout for Khmer Unicode?

That has not been decided, yet. The final decisions will lie with official bodies of the Cambodian government. However I have designed two sample layouts. One is a keyboard designed on the principles of the Dvorak keyboard where letter pair frequencies decided the placement of Khmer characters (source). The second is what I call a ‘phonetic’ keyboard...where Khmer characters are often placed on keys that have letters of a similar sound when using the English Qwerty keyboard (UnicodePhoneticKeyboard.pdf) [source]. Alpha versions of both have been created and used on Windows. Here is a graphic which can be used with these. My apologies that some of the non-Khmer characters do not show up in the PDFs (my version of Adobe Acrobat will not embed these TrueType glyphs). As earlier notes indicated, this keyboard has no subscript or ligature forms. To form a subscript type the COENG character followed by the BASE form of the character you want expressed as a subscript (which may involve typing BASE, COENG, DEAD-KEY, BASE for rarely used characters). At the moment on the Phonetic keyboard the DEAD-KEY is on the semicolon keycap (same side as COENG)...however I would prefer that it were on the left-hand side of the keyboard. I would very much appreciate feedback on this and comparisons of the placement of BASE forms and signs with the base forms of standard/pseudo-standard layouts. A Macintosh implementation is taking longer. Reportedly a Cambodian standards committee has settled on another keyboard layout...but I do not have a copy of that layout.

6.6 Keyboard layout creation programmes

Tavultesoft (http://www.tavultesoft.com/keyman/)

Microsoft (http://www.microsoft.com/globaldev/tools/msklc.mspx, http://www.microsoft.com/downloads/details.aspx?FamilyId=FB7B3DCD-D4C1-4943-9C74-D8DF57EF19D7&displaylang=en)

Keyboard Layout Manager (http://solair.eunet.yu/~minya/Programs/klm/index.html)

Unicode Keyboards for Mac OS X (http://wordherd.com/keyboards/) (http://developer.apple.com/technotes/tn2002/pdf/tn2056.pdf) (Ukelele 1.0.3 and many other data entry programmes: http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=inputtoollinks)

Top.

7. How should Khmer Unicode be sorted/collated/ordered?

The actual mechanism of sorting differs from one implementer to another. A detailed explanation of how Khmer is ordered in the standard Chuon Nath dictionary (and description of options) is present the documents entitled: KhmerSortingQuestions.rtf and KhmerSortingUnicodegamma.pdf. Some of the first implementations of sorting Khmer Unicode have come from Anders Karl Keller (Email: anderskarlkeller at yahoo.com). From 16 November 2001 there has been a suggestion that indic scripts could be sorted under the Unicode Collation Algorithm (UCA).

To store Khmer Unicode data in SQL Server 2000 be sure to upgrade to service pack 2. There is no provision yet for intelligent font display or sorting...but the information still should store.

Kent Karlsson [Email: kentk at md.chalmers.se] has created a document, L2/01-476, concerning Khmer sorting. This needed some significant changes but was a good start by someone expert in Unicode sorting algorithms. A significantly improved later edition is at http://std.dkuug.dk/jtc1/sc22/wg20/docs/n1076-Khmer-order11.pdf

A Perl programme (still in need of some changes suggested by Kent Karlsson) that sorts a Khmer Unicode wordlist into alphabetical order (sample output from the beginning of the Chuon Nath dictionary [duplicates and short words removed] CNList.txt) : KhmerCollation.pl.txt and KhmerCollation4.data

A site to test out collation: http://oss.software.ibm.com/icu/demo/

Top.

8. Can legacy encoded Khmer text be converted to Khmer Unicode?

8.1 Plain text

Many code conversions with other scripts are fairly simple: there is a one to one code conversion from a legacy code to a Unicode code. Khmer conversions are rather complicated by contrast. Some of the problems encountered in moving legacy glyph-based encodings to Unicode: (1) Initial dependent vowels and subscript RO need to be reordered to a position after the base consonant, (2) Sometimes initial dependent vowels must be merged with vowel-part glyph to create a separate vowel , (3) Ambiguous KHMER VOWEL SIGN U (U+17BB)/KHMER SIGN MUUSIKATOAN (U+17C9)/KHMER SIGN TRUYSAP (U+17CA) need to be distinguished, (4) Ambiguous subscript KHMER LETTER DA (U+178A)/KHMER LETTER TA (U+178F) need to be distinguished, (5) {KHMER SIGN ROBAT needs to be converted into KHMER LETTER RO + KHMER SIGN COENG + ‘base consonant’} Not yet! Although it probably is a mistake in the encoding, ROBAT is treated as a diacritic immediately following the base character, (6) word break ZERO WIDTH SPACE (U+200B) may need to be inserted to create word breaks at appropriate points, (7) duplicate/similar zero width superscripts/subscripts which occurred in the original text need to be flagged, (8) and diacritics may need to be rearranged in the order: ROBAT, register shifter, first subscript, second subscript, single vowel, signs NIKAHIT, REAHMUK, and YUUKALEAPINTU, sign2, sign3. Anders Karl Keller and Maurice Bauhahn have both been working on programmes to convert legacy encodings to Khmer Unicode (as well as bidirectional encodings). Here may I put in an urgent request that legacy (one byte) encoded fonts incorporate distinct subscripts for KHMER LETTER DA, KHMER LETTER TA, KHMER VOWEL SIGN U, subscript KHMER SIGN MUUSIKATOAN, and subscript KHMER SIGN TRIISAP. I would also appreciate it if someone would send me a nearly-complete set of Khmer words with subscript KHMER LETTER DA.

Word breaks are an important part of the use of Khmer on computer, but because Khmer does not natively have word breaks text is typically typed in without any character to distinguish the end of one word and the beginning of the next. In Unicode a ZERO WIDTH SPACE (U+200B) may be typed in to signal that break...but hopefully in the future an automatic algorithm will detect words without that mechanism (in order to handle wordwrap, spell checking, etc). Martin Hosken from SIL has kindly shared some useful hints on means he has used with Thai to handle an analogous problem: WordBreaks.html

Lin Chear has produced a useful tool (this is for Microsoft Windows, but knowing his interest in Linux...you'll find that buried somewhere) to convert plain text written with the encoding of ABC fonts to HTML: http://unicode.khmer.cc/khmeros/cvt.zip

Usage: cvt infile.txt > outfile.html

While working on TECkit scripts for cross-platform transcoding of Khmer from legacy to Unicode (encodings derived from legacy fonts by Sok Nhep Arun and fonts by Maurice Bauhahn) I've developed some methodologies and small Perl scripts that could be improved...but may be a help to beginners: TECkit.html.

8.2 Khmer text embedded in formatting markup.

8.2.1 Generic problem of text embedded in markup

Two or more different encodings may need to be converted to Khmer Unicode dependent upon the font used for individual characters in a given word.

Note this paper.

8.2.2 Rich Text Format markup

Java Swing includes RTFEditorKit that knows how to read and write text using RTF (Rich Text Format), a generic format that many word processors are able to work with (com.sun.java.swing.text.rtf). Some useful URLs: http://javafaq.nu/java/free-swing-book/free-swing-book-chapter20.shtml

http://www.cs.cf.ac.uk/Dave/HCI/HCI_Handout_CALLER/node123.html

http://www.cs.cf.ac.uk/Dave/HCI/Exams/HCI_EXAM_2000_Solns.pdf

http://jdelavarene.online.fr/Java/SwingBook/chap05.pdf

Some people have used Visual Basic for Applications to convert from legacy to Unicode encodings (and back again).

8.2.3 Maker Interchange Format

A Python toolkit for parsing Maker Interchange Format markup (used in Adobe FrameMaker) is available at: http://members.home.net/doughellmann/MIFFile/index.html

There was a fear that Adobe was going to discontinue support for its FrameMaker product, but the growing interest in XML has now persuaded the company to continue development (there has long been an SGML version). A developer has been hired to create a Unicode-savvy version of FrameMaker...I met him at the 21st IUC in Dublin. Furthermore, they are using parts of the ICU library in this Unicode port. Top.

9. How can Khmer be transliterated?

9.1 American Library Association - Library of Congress

An excellent Khmer script transliteration scheme is described on pages 96-98 in the book ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts/approved by the Library of Congress and the American Library Association; tables compiled and edited by Randall K. Barry. Library of Congress: Cataloging Distribution Service, 1997 Edition. ISBN 0-8444-0940-5 (pbk. only). (Dewey: P226.A4 1997 or LOC card number 97-012740). This is largely reversible (except for ambiguity with independent vowels/dependent vowels and base characters/their matching subscripts). In addition to the Latin alphabet it uses a scheme of upper and lower diacritics and a few spacing signs. Maurice worked on a Graphite font which had an optional feature to display Khmer Unicode text in this transliteration scheme (this project now dormant [2004]). The mapping of Unicode characters to the US MARC encoding can be found at: http://lcweb.loc.gov/marc/specifications/speccharlatin.html

9.2 Judith M. Jacob IPA transliteration

Another scheme which is very phonetic and reversible is that of Judith M. Jacob described in her book: Introduction to Cambodian. London: Oxford University Press, 1990. ISBN: 0197135560. This scheme uses quite a few characters from the International Phonetic Alphabet.

9.3 UN System

Søren Binks in Copenhagen kindly sent me on 18 August 2002 the following URL for a transliteration scheme for Khmer from the UNGEGN Working Group on Romanisation Systems (March 2000): http://www.eki.ee/wgrs/; the Khmer document is at http://www.eki.ee/wgrs/rom1_km.pdf

Top.

10. Language Codes (including Khmer)

10.1 http://www.sesame.demon.co.uk/revcodes/part3rev.htm

10.2 http://www.sesame.demon.co.uk/revcodes/

11. Khmer Unicode Effect on Other Standards

11.1 XML Blueberry: http://www.w3.org/TR/xml-blueberry-req

12. Using Khmer Unicode in Microsoft Office 2003

Happily after what seemed like an interminable wait, Khmer Unicode can now be used legally in commercial software...and it is not on Macintosh! Microsoft Windows 2003 (for PC) was formally released 21 October 2003. In particular Microsoft Publisher 2003 is highly recommended for Khmer Unicode publishing efforts. Putting the USP10.DLL into a WINNT or WINDOWS directory (so that the Khmer Unicode functionality is usable in a wider variety of programmes (such as browsers) may require getting around the permissions problems. See usp10.html

12.1 Keyboard lacking

One thing Office 2003 lacks is a Khmer Unicode keyboard. I have heard that a standardisation body in Cambodia has recommended a keyboard, however I have not seen it (and certainly would like to see it!). Meanwhile please find attached two Khmer Unicode keyboards that can be used with the Keyman 6.0.x (free for personal use) utility. The layouts in PDF format (with apologies for the non-appearance of certain non-Khmer characters) are of two sorts: Phonetic and Dvorak-like. The source files are: bitmap, Phonetic and Dvorak-like. Precompiled files are: Phonetic and Dvorak-like. These files have not been updated (only recompiled) since 2001, so lack the ability to reference items from Unicode 4.0. One of the advantages of Khmer Unicode is that it allows many fewer keys (for example, ligatures and subscripts no longer need separate keyboard positions). Both keyboards were created to maximize speed (seldom do two keys need to be pressed at the same time) and after careful analysis of Khmer character frequency tables. It will be confusing at first, however, to have to press COENG, then a dead key, and finally a consonant key for rarer consonantal/independent vowel subscripts.

12.2 Font installation

You can secure Khmer Unicode fonts from the following:

Om Mony (ommony_cdt at hotmail.com)

Danh Hong (danhhong at hotmail.com; http://www.geocities.com/danhhong2003/typefaces.html; http://geocities.com/chruytoem/khkeyonline.html; http://www.geocities.com/dnhhng/typefaces.html (Danh Hong's public domain font, KhmerOS.ttf is available in version 1.6 at: http://www.khmeros.info/khmeros_download.html) A description of Danh Hong's keyboard layout is also here.

Note also the efforts by Javier Sola to enable Khmer Unicode (http://bugzilla.gnome.org/show_bug.cgi?id=125605) and a new web site in cooperation with Open Forum of Cambodia: http://www.khmeros.info/

Masavang Sean (Masavang.sean at datagraphic.fr; http://www.cfcambodge.org/KhmerUnicodFont.htm)

http://home.att.net/~jameskass/ (Code2000)

http://www.khmer.cc/community/t.c?b=16&t=907

Microsoft licensed a font from Om Mony (but I have not found it in Office 2003). I am told it is in the Microsoft Office 2003 Multilanguage Pack.

Didi and Mimi Kanjahn (http://projects.thedanielmay.com/khmerfonts) and email: didimimi at web.de

12.3 Help Concerning Viewing and Creating Khmer Unicode Web Pages (browser.html)

13. Voice to Khmer Unicode text

Some research is being undertaken by Ryan Thomas (ryanpt2000 at yahoo.com) on voice to Khmer Unicode text.

It is my understanding that Khmer has more vowels than virtually any other language (except that Khmer is not tonal) and that discrimination between one word and another is frequently dependent upon vowel sounds. Furthermore, computer voice to text algorithms distinguish vowel sounds better than consonantal ones. According to Ryan, Khmer 'has a more regular stress pattern than most languages which helps segment bits of sound into words'. Hence presumably a Khmer voice to text programme will be more accurate than such a programme for other scripts.

Top.

14. Locale

The best locations to work on standardisation of locales (date/number formats, etc) are the Common Locale Data Repository:
http://www.openi18n.org/subgroups/lade/locale/

http://oss.software.ibm.com/cvs/icu/~checkout~/locale/data_formats.html

http://oss.software.ibm.com/cvs/icu/locale/common/

An early attempt at part of a Khmer locale: pc3.htm (updated 12:45pm 26 December 2003)

15. Text editors; escape codes

HTML: You can enter any Khmer Unicode value into an HTML document by entering serially the following escape sequence (NCR):

Ampersand (&)

Number Sign (#)

x

(Unicode hex code; for Khmer that is any one code in the ranges 1780 to 17dd; 17e0 to 17e9; 17f0 to 17f9; 19e0 to 19ff; for decimal base 10 values omit the preceding x)

Semicolon (;)

Macintosh:

Alpha 7 is a relatively useful and inexpensive text editor for Macintosh, but it has a severe limitation in that one cannot print from it in MacOS 10.3: Alpha 7 (available from http://www.kelehers.org/alpha/ or ftp://ftp.ucsd.edu/alpha/); hence I needed to fall back on the simple TextEdit that comes with the MacOS system.

TextEdit has an annoying way of substituting certain characters from another font (i.e., 222, 223, 240).

PC:

BabelPad (http://uk.geocities.com/BabelStone1357/Software/BabelPad.html#download) is freeware that facilitates conversion between Unicode (UTF-16BE) and UCR (\u1780) and NCR (see above). Displays UTF-8 encoded documents. Uses OpenType/TrueType fonts. Uses Uniscribe. Highly recommended. May have multiple copies in memory at one time (allows copying between documents).

SC Unipad (http://www.unipad.org/download/) Shareware, US$99 for unlimited edition. Uses bitmapped fonts and not Uniscribe (not recommended for Khmer).

UltraEdit (http://www.ultraedit.com) Shareware, US$35. Uses Uniscribe.

Microsoft Word: You can enter Unicode characters into Word by entering the hexadecimal codes (say, 17a0) followed by the "Alt-x" and then save the file as "encoded text" (Word2000) or just "plain text" (Word2002). Similarly, place the cursor after a character and type Alt-x; the character will be replaced with the hexadecimal code.

Unix/Linux:

Advanced vim: http://www.vim.org supports Ctrl-v + 'u' + 'hhhh' sequence for BMP Unicode character input

16. Khmer Unicode URLs

http://khmertech.org/ An excellent virtually all Khmer site by Rattanack X. Ath

http://tech.khmerknowledge.com/ http://unicode.khmer.cc A couple Lin Chear sites

http://www.khmeros.info/ KhmerOS project

http://www.camboday.com/ (Om Mony with an impressive team; note that several new Khmer Unicode fonts are available on this site!!)

No comments:

Post a Comment