Limited Unicode Support in LibreOffice 220.127.116.11? Character insertion, non-zero (SMP, SIP) planes, and multi language documents
(Note: this is a complaint I posted to the LibreOffice user support mailing list. It discusses Unicode support issues in LibreOffice Writer, notably support for non-zero plane glyphs and document display.)
Limited Unicode Support in LibreOffice 18.104.22.168? Character insertion, non-zero (SMP, SIP) planes, and multi language documents
I am trying out LibreOffice 22.214.171.124 and am quite satisfied with most of its features, especially the experimental sidebar panel. However, several issues occurred when I was playing around with Unicode characters in LO.
Environment: MS Win7 64bit Ultimate English with French and Chinese language pack, non-Unicode code page: Chinese (CP936), LibreOffice 126.96.36.199, JRE 1.7.0_09 (both 32-bit and 64-bit).
The Special Characters Dialog
1. no Unicode code point entry field in the dialog
Let's first talk about issues regarding the Special Characters dialog. Unlike MS Word, it does not feature a Unicode code point entry field. This is inconvenient if you know the code point of a certain glyph and need to insert it regularly. The “Compose Character” extension only solves part of the puzzle, as explained later under the issues with non-zero plane glyphs.
2. wrong detection of Unicode ranges
A more complicated problem with the Special Characters dialog is that it misidentifies which code points a font supports; LO doesn't seem to handle this correctly at all. For a Unicode range that the font only partially supports, it displays boxes, boxed question marks, or blanks in place of the missing glyphs. For a range that the font does not support at all, it displays glyphs from fallback fonts assigned by the OS. In both cases there is almost no sign that unsupported glyphs or ranges are being suppressed from display. Only very few fonts are correctly identified (Cardo, Code2000 and Quivira, for example).
To illustrate this, install FreeSerif (version 0412.2263), and bring up the Special Characters dialog in Writer. Switch to this font, and go to the “Tibetan” subset. FreeSerif does not support Tibetan glyphs as of now, but LO still assumes that it does, and therefore displays rows of boxed question marks in this range.
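To make the distinction between "partially supported" and "not supported at all" concrete, here is a minimal Python sketch. It is not how LO decides what to show; the function name block_coverage and the sample data are hypothetical, and the set of supported code points is made up rather than read from a real font file.

```python
# Hypothetical helper: classify how well a font covers one Unicode range.
# `supported` stands in for the set of code points in a font's character
# map; here it is invented sample data, not read from a real font.

TIBETAN = (0x0F00, 0x0FFF)          # Tibetan block
PRINTABLE_ASCII = (0x0020, 0x007E)  # printable part of Basic Latin

def block_coverage(supported, block):
    """Return 'none', 'partial' or 'full' for the given (start, end) range."""
    start, end = block
    hits = sum(1 for cp in range(start, end + 1) if cp in supported)
    if hits == 0:
        return "none"
    # 'full' means every code point in the range, including any that
    # Unicode leaves unassigned, so real blocks with gaps never reach it.
    if hits == end - start + 1:
        return "full"
    return "partial"

# Sample data: a font that covers printable ASCII and nothing in Tibetan.
sample_font = set(range(0x20, 0x7F))

print(block_coverage(sample_font, PRINTABLE_ASCII))  # full
print(block_coverage(sample_font, TIBETAN))          # none
```

A dialog built on a check like this could hide the "none" ranges entirely and grey out missing cells in the "partial" ones, which is roughly the behavior I expected to see.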
Glyphs in non-zero Unicode planes
3. limited support for fonts with non-zero plane glyphs?
More serious issues are found in LO’s support for non-zero plane Unicode glyphs. For many fonts, LO doesn't seem to support non-zero plane Unicode glyphs even though the font itself does. Pasting the glyph from external applications doesn't help either, regardless of the text format: unformatted, RTF or HTML. The glyph is shown as a square, a question mark, or a blank space.
For example, suppose you try to input the Unicode character U+1F374 (fork and knife), which is supported by the font Segoe UI Symbol (version 5.01). It is not shown in the Special Characters dialog: the highest code point available for this font lies within the BMP (plane 0). You open MS Word 2010 and find that in Insert→Symbols, this glyph is correctly displayed and inserted when you switch to this font. You select this glyph, copy it, and paste it into LO. The glyph becomes a square. Paste it into Windows Notepad (ensuring that Segoe UI Symbol is the font used), and the glyph is again displayed correctly. You copy the glyph again from Notepad, and it is still shown as a square in LO.
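For readers unfamiliar with the term, the short Python sketch below shows what "non-zero plane" means for U+1F374: the plane is the code point divided by 0x10000, and in UTF-16 such a character takes two code units (a surrogate pair) instead of one. The helper names plane_of and utf16_units are my own, and I am not claiming that surrogate handling is the cause of the LO behavior described above.

```python
def plane_of(cp):
    """Unicode plane number: 0 is the BMP, 1 the SMP, 2 the SIP."""
    return cp >> 16

def utf16_units(cp):
    """Return the UTF-16 code units that encode code point `cp`."""
    if cp < 0x10000:
        return [cp]                      # BMP: a single code unit
    v = cp - 0x10000                     # 20-bit offset above the BMP
    return [0xD800 + (v >> 10),          # high (lead) surrogate
            0xDC00 + (v & 0x3FF)]        # low (trail) surrogate

cp = 0x1F374                             # FORK AND KNIFE
print(plane_of(cp))                              # 1
print([f"{u:04X}" for u in utf16_units(cp)])     # ['D83C', 'DF74']
```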
In fact, among all the fonts with non-zero plane glyphs installed on my PC (namely: freeserif, freesans, freemono, code2001, code2002, WenQuan Yi Zen Hei, Cardo, PMingLiU-ExtB, Quivira, Segoe UI Symbol, Sun-ExtB, SunManPUA, SimSun-ExtB, and SimSun(Founder Extended)), only code2001, Cardo and Quivira are correctly identified for the non-zero plane glyphs the three fonts each support. The other fonts either do not expose their non-zero plane glyphs in LO or do so only sometimes.
4. no built-in mechanism to produce glyphs from typed code points
Nor can you work around this issue by manually specifying code points, as is usually done in MS Word (to generate the glyph previously discussed, key in 1F374, and press [Alt]+[X]). There is no built-in mechanism in LO to generate a glyph from typed code points. One may argue the “Compose Character” extension will work, but this extension adopts a fairly obsolete Unicode standard and does not recognize any code point higher than U+FFFF, i.e. no support for non-zero plane glyphs. Some professionals would suggest configuring the Windows registry to allow hexadecimal Unicode Alt input using the Numpad, but this method also does not support non-zero planes. Therefore, you can basically only work with non-zero plane glyphs in LO through a fairly limited set of fonts, which may not share the same typeface as your document.
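The kind of conversion I have in mind is simple. The Python sketch below imitates the idea behind Word's [Alt]+[X]: take the hex digits at the end of the text and replace them with the character they name, including code points above U+FFFF. The function alt_x is hypothetical and only an approximation; I do not know how Word actually implements the shortcut.

```python
import re

def alt_x(text):
    """Replace a trailing run of 2 to 6 hex digits with the character it names.

    Text is returned unchanged if there is no such run, or if the value is
    a surrogate or lies beyond U+10FFFF. Hex-looking letters directly
    before the digits are treated as part of the number, so separate the
    code point from preceding text with a space.
    """
    m = re.search(r"[0-9A-Fa-f]{2,6}$", text)
    if not m:
        return text
    cp = int(m.group(), 16)
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return text
    return text[:m.start()] + chr(cp)

result = alt_x("menu 1F374")
print(result)                    # "menu " followed by FORK AND KNIFE
print(f"U+{ord(result[-1]):X}")  # U+1F374
```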
5. unstable detection of Unicode ranges: weird behaviors in the Special Characters dialog
This compatibility issue is further complicated by some strange behavior of the Special Characters dialog in Writer. Sometimes, when you switch to a font which supports non-zero plane glyphs after browsing or inserting glyphs using a font that does not, the non-zero plane Unicode ranges supported by the new font disappear from the “subset” drop-down, and remain invisible until you restart Writer. Conversely, when you switch to a font which does not support non-zero plane glyphs after browsing or inserting glyphs using a font that does, LO still tries to display the last used glyph in the new font, and occasionally even the last used Unicode range. What you see is, of course, boxes, question marks or blanks.
Interestingly enough, this issue does not occur every time I use Writer. Two hours ago I was encountering it constantly, but right now, with no settings changed, no new programs launched and no updates installed in the interval, Writer appears to be comparatively more stable in terms of non-zero plane support, although still not all Unicode ranges are correctly detected for all fonts.
Messed up with mixed-script documents
Some of the most dramatic blunders I have come across are found in documents with multiple languages. Generally, support for documents with multiple writing systems and/or complex scripts exhibits defects in four aspects: broken bi-di display, broken glyph display, improperly spaced glyphs, and wrongly applied fonts and glyphs.
A typical example which illustrates this issue to its full extent is the UTF-8 test page from Columbia University. This test page features glyphs from a number of writing systems of different families, and can be used as a stress test for correct rendering of Unicode glyphs in different applications. Internet Explorer displayed the page perfectly without any glitches, and Microsoft Word, as well as Firefox, also scored high. LibreOffice, however, failed to present a number of lines correctly. The renditions by IE, MS Word and LO were each printed to a separate PDF file using PDFCreator. You can compare them for future research. I have highlighted lines that are evidently ill-displayed in these files.
6. broken bi-di display in mixed-script document
LO Writer failed to display lines with LTR text mixed with RTL text, including Pashto, Persian/Farsi, Hebrew, Yiddish, Arabic, and Urdu. These lines appear completely in RTL direction and may sometimes display abnormally overlapped glyphs. However, pasting these lines one by one, unformatted, into a new document resolves the problem. This is not seen in IE or MS Word.
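To show what "mixed LTR and RTL" means at the character level, the Python sketch below uses the standard unicodedata module to read each character's bidirectional class and list the direction changes in a line. It only inspects strong character classes; it is not an implementation of the Unicode Bidirectional Algorithm, and the helper names and sample string are my own.

```python
import unicodedata

def strong_direction(ch):
    """Map a character's bidi class to 'LTR', 'RTL', or None if not strong."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "LTR"
    if cls in ("R", "AL"):      # Hebrew-style and Arabic-style letters
        return "RTL"
    return None                 # digits, spaces, punctuation, etc.

def direction_runs(text):
    """List the sequence of strong directions, collapsing repeats."""
    runs = []
    for ch in text:
        d = strong_direction(ch)
        if d is not None and (not runs or runs[-1] != d):
            runs.append(d)
    return runs

# Latin labels followed by a Hebrew word and an Arabic word.
sample = "Hebrew: \u05E9\u05DC\u05D5\u05DD, Arabic: \u0633\u0644\u0627\u0645"
print(direction_runs(sample))   # ['LTR', 'RTL', 'LTR', 'RTL']
```

A line like this one should keep its Latin labels running left to right; in my test LO laid out the whole line right to left.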
7. broken glyph display in mixed-script document
LO Writer failed to display glyphs in several languages, including Gothic, Bengali, Telugu, Sinhalese, Burmese, Vietnamese (nôm), Khmer, Lao and Tibetan. Not all glyphs are shown for Vietnamese (nôm), and the diacritics did not combine in Lao. It appears LO Writer was not able to automatically assign fonts to these writing systems properly. One has to do so manually, which still leaves Gothic and Vietnamese (nôm) ill-formed in LO, probably because the two systems use glyphs from non-zero planes. This is not seen in IE, or MS Word (after manually setting font for Gothic).
8. problematic character spacing
LO Writer failed to space glyphs properly for certain fonts in certain writing systems. One example is the Runic glyphs used for ancient Scandinavian texts, displayed using Code2001. The glyphs are densely packed and sometimes overlap, and display problems appear when one scrolls the page. Changing the font to Segoe UI Symbol resolves the issue. This is not seen in IE or MS Word.
9. wrong font information for mixed language documents
LO Writer failed to display font information correctly for mixed language documents. Although a number of glyphs from non-Latin writing systems have been identified and correctly displayed, the font information is not set correspondingly. One example is Ogham. To correctly display Ogham glyphs, one needs to use fonts like Code2000 or Segoe UI Symbol. However, Times New Roman, a Latin/Greek/Cyrillic/Arabic font, is displayed in the “font name” drop-down or the “character” dialog when you select Ogham text.
These are the issues I am concerned with after using LO Writer for a dozen hours. I am happy to see that the document text is more elegantly displayed in LO than in MS Office, and that combining marks are well supported in this system, but getting LO to work as an MS Office substitute means finding a resolution to these issues, as I rely heavily on these rarely used Unicode glyphs in my work, especially Chinese characters in the SIP (plane 2).
With only several hours of experience with LibreOffice, it may be unfair of me to simply label these problems as bugs or flaws. Still, the limited Unicode support LO exhibits on initial runs, compared with MS Office and web browsers, underlines the urgent need to address these issues, at least by raising them here. Are there any possible answers to these problems, e.g. extensions? Or at least workarounds? Are these issues already known to the LibreOffice development team? Is any work to address them in progress? And finally, if these issues do qualify as bugs or limitations in LibreOffice, how can I report them to the development team? Thanks very much.