Talk:Unicode normalization considerations

From mediawiki.org

PHP 6 Native Unicode Support

Hey guys,

Last week I volunteered at PHP Québec 2007, held in Montreal. I managed to attend Andrei Zmievski's presentation on the upcoming Unicode support in PHP 6, and it simply blew my mind! Not only will there be fully native Unicode support, but once the encoding has been declared, PHP will be able to recognize all languages simultaneously, directly in your class, function or whatever. The truly amazing part of the demonstration was that PHP recognized a function written in, say, Greek (LTR) with an argument passed in Hebrew (RTL), without the text direction ever having to be declared...

OK, I'm still a newbie in the PHP world, but that seemed pretty powerful to me! I'm not sure whether this could help solve the current issue, but it is definitely worth looking into before planning too far into the future.

Stéphane Thibault 06:55, 19 March 2007 (UTC)

Firefox 3

Hebrew vowelization seems much improved in Firefox 3. It is important to document exactly what changed and how.

Firefox 3 seems to correctly represent the vowel order for webpages in general and Wikimedia pages in particular.

The only anomaly I found is that vowelized text pasted into the edit page shows only partial vowelization. On the "saved" wiki page it appears correctly. Dovi 05:49, 18 June 2008 (UTC)

Examples when normalization should be performed and when it should not

bugzilla:022031 "deactivate Unicode normalization via <foobar>bla</foobar>"

While investigating some authors in various book catalogues, I ran into a problem because I had used homographic tags at http://www.librarything.com/ :

http://www.librarything.com/work/9352937/book/54517382 tag Kálmán Kalocsay
>>> http://www.librarything.com/catalog/gangleri&tag=K%C3%A1lm%C3%A1n%20Kalocsay

http://www.librarything.com/work/9393183/book/54949365 tag Kálmán Kalocsay
>>> http://www.librarything.com/catalog/gangleri&tag=Ka%CC%81lma%CC%81n%20Kalocsay

At http://pastebin.org/ I could see that the first tag was « Kálmán Kalocsay » and the second « Ka&#769;lma&#769;n Kalocsay ». The second example uses the Unicode character 'COMBINING ACUTE ACCENT' (U+0301).
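
The two tags can be reproduced with a minimal Python sketch. It assumes only the standard library (`unicodedata`, `urllib.parse`); the variable names are illustrative and nothing here reflects how LibraryThing or MediaWiki handle the strings internally.

```python
import unicodedata
from urllib.parse import quote

# First tag: precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE
precomposed = "K\u00e1lm\u00e1n Kalocsay"
# Second tag: plain "a" followed by U+0301 COMBINING ACUTE ACCENT
decomposed = "Ka\u0301lma\u0301n Kalocsay"

# The strings look alike but are different code point sequences.
print(precomposed == decomposed)

# Percent-encoding the UTF-8 bytes gives the two tag URLs' query values.
print(quote(precomposed))  # K%C3%A1lm%C3%A1n%20Kalocsay
print(quote(decomposed))   # Ka%CC%81lma%CC%81n%20Kalocsay

# NFC maps the decomposed form onto the precomposed one; NFD does the reverse.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

A catalogue that stores both forms without normalizing keeps them as two distinct tags, which is the homograph problem described above.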

I use various computers in various places, with different operating systems, different browser versions and different fonts installed. Sometimes it is not possible to distinguish the homographs. The chance of detecting them is higher on older computers, older versions etc.

Many sites, such as loc.org, worldcat.com and librarything.com, use data records that are not normalized.

http://www.worldcat.org/oclc/63378583
>>> La kontrubuo de Kálmán Kalocsay al la Esperanta kulturo

http://opc4.kb.nl/DB=1/PPN?PPN=801854571
>>> La kontrubuo de Kálmán Kalocsay al la Esperanta kulturo / Reinhard Haupenthal

http://pastebin.org/71649 shows that the first example also uses &#769; : « Ka&#769;lma&#769;n Kalocsay »

a) I wonder how such texts could be documented. MediaWiki normalizes the text as soon as a page is previewed or saved.

b) I tried to generate some search links for loc.org and worldcat.org because I saw many different spellings in the transliteration of Yiddish authors, book titles etc. The work is pointless when the search terms are passed as UTF-8 in parameters together with {{URLENCODE:foo}} (see template talk:Bswc). Useful wiki or HTML code can be generated only if properly URL-encoded substrings are passed.

Conclusion: "copy and paste" is a wonderful feature when the context is known. But sometimes the content should be preserved, and sometimes normalization makes sense. Documenting historical data processing systems, historical digital data collections, catalogues etc. would require partially deactivating Unicode normalization.

Probably the best way would be to implement such a deactivation via <foobar>bla</foobar>. This would be a fair solution for citations. I am not sure how this should be handled for template parameters.
1) {{foo|<foobar>bla</foobar>}} would require a lot of additional work in combination with copy and paste.
2) <foobar>{{foo|bla}}{{foo|bla bla}}{{foo|bla bla bla}}</foobar> would be easier when generating lists.

Best regards user:Gangleri
לערי ריינהארט 11:41, 6 January 2010 (UTC)

Longer term: Three different titles

For the longer term solution, IMHO there should be three titles (in order of increasing normalisation):

  1. the displayed title
  2. the URI title
  3. the title that forms the DB key

For example, the displayed title could be iMonkëy 123 (extending the example on the main page), the URI title could be iMonkey_123, and the DB key would be IMONKEY123. To find an article, MediaWiki would normalise the given URI to the DB key. If the URI title for that article does not match the given URI, the user would be HTTP-redirected to the nice URI. The normalisation would be configurable on a per-wiki basis; for example, the English Wikipedia could use:

  1. NFC
  2. NFKC, remove diacritics, replace spaces with "_"
  3. upper case, remove non-alphanumeric chars

whereas the German Wikipedia might choose to map umlauts to their ASCII ersatz rendering (ä=>ae, ö=>oe, etc.) for the URI. — Cfaerber 08:58, 23 September 2010 (UTC)
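
The three-step scheme proposed for the English Wikipedia can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the function names `display_title`, `uri_title` and `db_key` are hypothetical, "remove diacritics" is approximated by dropping combining marks after decomposition, and none of this is existing MediaWiki behaviour.

```python
import unicodedata

def display_title(title: str) -> str:
    # Step 1: NFC only.
    return unicodedata.normalize("NFC", title)

def uri_title(title: str) -> str:
    # Step 2: NFKC, then strip diacritics, then spaces -> "_".
    compat = unicodedata.normalize("NFKC", title)
    decomposed = unicodedata.normalize("NFD", compat)
    # Crude diacritic removal: drop characters with a non-zero
    # canonical combining class (e.g. U+0308 COMBINING DIAERESIS).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).replace(" ", "_")

def db_key(title: str) -> str:
    # Step 3: upper case, keep only alphanumeric characters.
    # str.isalnum() also keeps non-ASCII letters and digits.
    return "".join(ch for ch in uri_title(title).upper() if ch.isalnum())

title = "iMonk\u00eby 123"
print(display_title(title))  # iMonkëy 123
print(uri_title(title))      # iMonkey_123
print(db_key(title))         # IMONKEY123

# Any spelling that reduces to the same DB key finds the same article.
print(db_key("imonkey 123") == db_key(title))
```

A per-wiki variant such as the German umlaut mapping would replace the diacritic-stripping step with an explicit substitution table.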

Upgrading ICU Project Library

According to the MW 1.21.1 installer...

"The installed version of the Unicode normalization wrapper uses an older version of the ICU project's library.
You should upgrade if you are at all concerned about using Unicode."

The "upgrade" link directs people to this page. Coming here, however, there's no mention of upgrading or even the ICU project. Given the link, we should probably give some explicit instructions on what to do about the warning message. (And before someone says it, if I knew myself, I'd write something up.) – RobinHood70 talk 21:30, 28 July 2013 (UTC)

I arrived at this page for that same reason. Anyone? Chrisarnesen (talk) 22:08, 19 June 2014 (UTC)
I also arrived here and didn't find any clue how to upgrade (Neofun, 2nd Jan 2015)

The BabelPad Unicode text editor for Windows and the custom normalization for Hebrew

BabelPad is a powerful Unicode text editor for the Windows platform. The most recent versions have an option for the custom normalization of Hebrew text. This option implements the special Canonical Combining Class scheme recommended in the SBL Hebrew Font manual issued by the Society of Biblical Literature. This may prove useful for repairing damaged Hebrew text. DFH David Haslam (talk) 19:16, 24 August 2015 (UTC)
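
For background on why a custom combining-class scheme matters at all, here is a small Python sketch of what standard normalization does to Hebrew marks. It uses only `unicodedata` and shows the standard Unicode reordering; it does not implement the SBL scheme or BabelPad's option, and the variable names are illustrative.

```python
import unicodedata

BET = "\u05d1"     # HEBREW LETTER BET
DAGESH = "\u05bc"  # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05b7"   # HEBREW POINT PATAH

# Each point carries a fixed canonical combining class.
print(unicodedata.combining(PATAH))   # 17
print(unicodedata.combining(DAGESH))  # 21

# Text entered as letter + dagesh + vowel ...
typed = BET + DAGESH + PATAH
# ... is reordered by normalization so the lower class comes first.
normalized = unicodedata.normalize("NFC", typed)
print([f"U+{ord(ch):04X}" for ch in normalized])
# ['U+05D1', 'U+05B7', 'U+05BC']
```

A custom scheme assigns different class values to some of these marks, so the resulting order can differ from the standard one shown here.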

In Unicode tests the characters must ALWAYS be preserved!

At https://unicode-subsets.fandom.com/wiki/User:PiotrGrochowski there are Unicode subsets, but I had to escape every single character with that &#x; thingy. Otherwise it replaces the characters and ruins the tests! Two of the characters were even replaced with characters NOT in Subset2! I'm so upset. 2A01:119F:21D:7900:1D17:2FA3:A158:AE9F 18:01, 4 February 2019 (UTC)
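
The escaping workaround can be automated. The following is a minimal Python sketch, assuming the goal is simply to turn every character into a hexadecimal numeric character reference; the helper name `to_char_refs` is hypothetical, and whether a given wiki still normalizes the text after decoding the references is not something this sketch can show.

```python
import html
import unicodedata

def to_char_refs(text: str) -> str:
    # Write every character as a hexadecimal numeric character reference.
    return "".join(f"&#x{ord(ch):X};" for ch in text)

sample = "\u1f75"  # GREEK SMALL LETTER ETA WITH OXIA
escaped = to_char_refs(sample)
print(escaped)  # &#x1F75;

# The escaped source is pure ASCII, so NFC leaves it untouched ...
print(unicodedata.normalize("NFC", escaped) == escaped)
# ... whereas the raw character itself is changed by NFC.
print(unicodedata.normalize("NFC", sample) == sample)

# Decoding the reference gives back the original code point.
print(html.unescape(escaped) == sample)
```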

No normalization after percent-decoding

Look: U+1F75 GREEK SMALL LETTER ETA WITH OXIA is normalized to U+03AE GREEK SMALL LETTER ETA WITH TONOS.

There is a page Χρήστης:Kalogeropoulos having the latter character in its name. Replace “ή” with U+1F75 UTF-8 encoded in percents, namely %E1%BD%B5, and place the title into [[…]]: Χρήστης:Kalogeropoulos. The resulting link is red – no normalization occurs (although the action=edit&redlink=1 query immediately redirects to the existing page). When the same code point U+1F75 is supplied via a numeric reference, namely Χρήστης:Kalogeropoulos, then the resulting link is blue.

Is it a bug? Incnis Mrsi (talk) 17:04, 21 January 2020 (UTC)
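
The expected behaviour (percent-decode first, then normalize) can be illustrated with a short Python sketch. It uses only the standard library and says nothing about MediaWiki's actual code path; the variable names are illustrative.

```python
import unicodedata
from urllib.parse import unquote

OXIA = "\u1f75"   # GREEK SMALL LETTER ETA WITH OXIA
TONOS = "\u03ae"  # GREEK SMALL LETTER ETA WITH TONOS

# NFC replaces the oxia character with the tonos character.
print(unicodedata.normalize("NFC", OXIA) == TONOS)

# %E1%BD%B5 is the percent-encoded UTF-8 form of U+1F75.
print(unquote("%E1%BD%B5") == OXIA)

# Title with the percent-encoded oxia in place of the tonos letter.
link_target = "\u03a7\u03c1%E1%BD%B5\u03c3\u03c4\u03b7\u03c2:Kalogeropoulos"
existing_page = "\u03a7\u03c1\u03ae\u03c3\u03c4\u03b7\u03c2:Kalogeropoulos"

decoded = unquote(link_target)
# Without normalization after decoding, the titles do not match ...
print(decoded == existing_page)
# ... with normalization after decoding, they do.
print(unicodedata.normalize("NFC", decoded) == existing_page)
```

If the link lookup compares the decoded but unnormalized title, the mismatch above would be consistent with the red link described.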

Replacement docs

It would actually be good to have a page that discusses the current Unicode normalization process, even if it basically says "look at utfnormal" for all detailed matters.

I might have added something here myself, but for the translation markup. MaxEnt (talk) 23:55, 3 December 2021 (UTC)

Undesirable normalization cases

    I believe that in the cases below (and the like) normalization should not be applied:
    |*| `<nowiki>é</nowiki>`
    |*| `<syntaxhighlight>é</syntaxhighlight>`

    <& Strikeout>Adding a directive for controlling the normalization behavior page-wise would be also desirable.</&> Probably just the XML tag alone should suffice. ("<normalization/>" form applying for all following content)
    And a relevant XML tag for applying the normalization selectively.
[ E.g.
    |*| `<normalization none>`
    |*| `<normalization NFC>`
    ; alike. ]

    Related: https://www.mediawiki.org/?diffonly=1&oldid=194969&diff=296218 (# Examples when normalization should be performed and when it should not; as of the time)

- MasterQuestionable (talk) 06:34, 15 September 2022 (UTC)
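
One way to picture the request is a normalizer that skips protected regions. The Python sketch below is purely hypothetical: the function `normalize_outside_tags` and the regular expression are illustrative assumptions, it handles only non-nested `<nowiki>` and `<syntaxhighlight>` pairs, and it is not how MediaWiki processes page text.

```python
import re
import unicodedata

# Matches a whole <nowiki>...</nowiki> or <syntaxhighlight ...>...</syntaxhighlight> region.
PROTECTED = re.compile(r"<(nowiki|syntaxhighlight)\b[^>]*>.*?</\1>", re.DOTALL)

def normalize_outside_tags(text: str, form: str = "NFC") -> str:
    pieces = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Normalize the text before the protected region ...
        pieces.append(unicodedata.normalize(form, text[pos:match.start()]))
        # ... and copy the protected region through unchanged.
        pieces.append(match.group(0))
        pos = match.end()
    # Normalize whatever follows the last protected region.
    pieces.append(unicodedata.normalize(form, text[pos:]))
    return "".join(pieces)

source = "e\u0301 and <nowiki>e\u0301</nowiki>"
result = normalize_outside_tags(source)
# Outside the tag the sequence becomes U+00E9; inside it stays e + U+0301.
print(result == "\u00e9 and <nowiki>e\u0301</nowiki>")
```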