otsukare Thoughts after a day of work

Khmer Line Breaking

I'm not an expert in the Khmer language. This is just me stumbling on a webcompat issue and trying to make sense of it.

Khmer Language

The Khmer language, apart from being the official language of Cambodia (Southeast Asia), is also spoken by some people in Thailand and Vietnam.

Webcompat Issue - 56316

We recently received a webcompat issue where a long line of Khmer text on a mobile device was not wrapping, hence breaking the layout of the site. Jonathan Kew helped me figure out whether the issue was with the fonts or with the browser.

I don't think this is about fonts, it's that we don't have Khmer line-breaking support on Android. Line-breaking for SEAsian languages that are written without word spaces (e.g. Thai, Lao, Khmer) is based on calling an operating system API to find potential word-break positions. Hence the results are platform-dependent. Unfortunately on Android we don't have any such API to call, and so we don't find break positions within long runs of text. We have an internal line-breaker for Thai (and recently implemented some basic support for Tibetan), but nothing for Khmer.

So the "find potential word-break positions" part intrigued me.

Khmer Language Line Breaking

In Western languages like French and English, there are breaking opportunities, usually the spaces between words. So, for example,

a sentence can break like this

because there are spaces between words, but in Khmer, there are no spaces between words inside a phrase.

Thai, Lao, and Khmer are languages that are written with no spaces between words. Spaces do occur, but they serve as phrase delimiters, rather than word delimiters. However, when Thai, Lao, or Khmer text reaches the end of a line, the expectation is that text is wrapped a word at a time.
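To make the problem concrete, here is a minimal sketch in Python of a line wrapper that only knows about spaces as breaking opportunities. It is not how Gecko does it. The function name `wrap_on_spaces` and the sample strings are made up for the illustration, and the width is counted in code points, which is only a rough stand-in for the real rendered width.

```python
def wrap_on_spaces(text, width):
    """Greedy wrapping where the only break opportunities are spaces.

    `width` is a number of code points, a crude approximation of
    the space available on a line.
    """
    lines = []
    current = ""
    for chunk in text.split(" "):
        candidate = chunk if not current else current + " " + chunk
        if len(candidate) <= width or not current:
            # Either it fits, or the line is empty and we have no choice.
            current = candidate
        else:
            lines.append(current)
            current = chunk
    if current:
        lines.append(current)
    return lines


# Hypothetical sample phrases for the demonstration.
ENGLISH = "a sentence can break like this"
KHMER = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"  # a phrase written without spaces

print(wrap_on_spaces(ENGLISH, 12))
print(wrap_on_spaces(KHMER, 12))
```

The English sentence is cut into several short lines. The Khmer phrase comes back as one single chunk longer than the requested width, which is the overflow we saw on the site.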

So how do you discover the word boundaries?

Most applications do this by using dictionary lookup. It’s not 100% perfect, and authors may need to adjust things from time to time.

It means that if you would like better rendering in a browser, you need to either include a dictionary of words inside the browser or call a dictionary loaded on the system. And there are subtleties for compound words.
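The dictionary lookup idea can be sketched in a few lines of Python. This is a toy greedy longest-match segmenter, one simple technique among others, and I am not claiming it is what any browser or ICU actually does. The names `WORDS`, `DICTIONARY`, `segment` and `break_positions` are hypothetical, and the four-word dictionary is only there for the demonstration. A real one would hold tens of thousands of entries.

```python
# Tiny illustrative word list (an assumption for this sketch).
WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
DICTIONARY = set(WORDS)


def segment(text, dictionary=DICTIONARY):
    """Split text into words using greedy longest-match lookup.

    Characters that do not start any known word are emitted as
    one-character chunks, so the function always makes progress.
    """
    longest = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown character, keep it alone
        words.append(candidate)
        i += len(candidate)
    return words


def break_positions(text, dictionary=DICTIONARY):
    """Return the code point offsets where a line break would be allowed."""
    positions = []
    offset = 0
    for word in segment(text, dictionary)[:-1]:
        offset += len(word)
        positions.append(offset)
    return positions


phrase = "".join(WORDS)  # the words glued together, without spaces
print(segment(phrase))
print(break_positions(phrase))
```

Greedy matching is the part that is "not 100% perfect": when one dictionary word is a prefix of another, or for compound words, the longest match is not always the right one.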

"How is Khmer line-breaking handled on the Web?" tries to establish the current status.

But let's go back to Gecko on mobile.

Gecko Source Code For Line Breaking

I found a reference to line breaking for these specific languages in the Gecko source code, in LineBreaker.h.

I opened an issue on Bugzilla so we can try to implement line breaking for the Khmer language. I was wondering if it would be a simple modification, but Makoto Kato jumped in and commented:

Old Android doesn't have native line break API, but Android 24+ can use ICU from Java (android.icu.text.BreakIterator). Since we still support Android 5+ on Fenix, so not easy.

Chromium has some tests in Issue 136148: Add Khmer and Lao Line-Breaking layout tests, which might help if Mozilla decides to solve this issue.

Firefox Usage In Khmer Language Areas

I don't know how widely Firefox is used by Khmer speakers, but on mobile this kind of bug would definitely have a strong impact on the usability of the browser. It is important to report bugs; it helps improve the platform. It also shows how challenging it can be to implement a browser across all the diversity and variability of contexts.

A small report might benefit a lot of people.

Otsukare!