Wikipedia talk:Unicode

From Wikipedia, the free encyclopedia

Discussion

Discussion page about using Unicode in Wikipedia.

(Note that this page and the discussion page are covering much the same topic, so you might want to read both. m.e. 11:02, 14 Jul 2004 (UTC))

Old discussions are now archived here in the standard fashion.  — Scott talk 13:41, 5 October 2015 (UTC)

From the Village Pump

Unicode question

This may be the wrong place to ask this, or it may be answered elsewhere, but can anyone tell me if and when the English Wiki will be changed over to UTF-8? I ask because it's hugely inconvenient to work with text that's full of &#347;'s, but for some topics (Sanskrit and associated languages and subjects, in my case), there is no adequate alternative to using Unicode characters. This is true even if I eschew Devanagari and work in roman, because standardized roman transliteration requires characters with diacritics that aren't available in Latin-1. कुक्कुरोवाच 20:51, 31 Mar 2004 (UTC)

I'm assuming that moniker is in Tamil, because my Mozilla 1.6 is totally fazed by it. -Phil | Talk 14:, Apr 1, 2004 (UTC)
No, regular everyday Sanskrit, in Devanagari.
As I understand it: Until recently, the general prognosis was "never", but the French Wikipedia recently converted, and I believe it was mostly successful. So if the remaining problems highlighted by that conversion get ironed out, there may be a possibility that the English 'pedia could make the switch as well if the desire is there. - IMSoP 22:08, 31 Mar 2004 (UTC)
What are the pros and cons? (I am sure this conversation has been had before, so a pointer will be plenty.) Pete/Pcb21 (talk) 22:30, 31 Mar 2004 (UTC)
The pros are that people who edit pages using special characters or non-Roman alphabets can just enter the characters as normal, and it'll just "work," instead of them having to encode the characters using a somewhat random numerical code. For example, the characters in Kukkurovaca's name above must be encoded as &#2325;&#2369;&#2325;&#2381;&#2325;&#2369;&#2352;&#2379;&#2357;&#2366;&#2330;.
I'm not sure of all of the cons, but one is that some older browsers don't support Unicode, in input if not in output; the database back end that Wikipedia uses may not support it either, in which case there would have to be a layer of code that would convert the Unicode-encoding text into something the database can handle when it is stored, and convert that text back into Unicode when it is retrieved. Also, special characters which are already on many pages currently in Wikipedia could go glitchy due to the change. Garrett Albright 22:41, 31 Mar 2004 (UTC)
Those older browsers are not able to browse half the WWW by now. — Jor (Talk) 12:21, 1 Apr 2004 (UTC)
The masses clamor for Unicode! I'm surprised something so standards-oriented as Wikipedia isn't using it already... Garrett Albright 22:23, 31 Mar 2004 (UTC)
The main reason it isn't Unicode is because the original version of the software didn't support it, and conversion is difficult. It'll require some downtime. There were worries about corruption of the database in various ways, but we have a fairly good handle on that problem now thanks to the recent conversion of the French Wikipedia. I think conversion of the English Wikipedia would be a good idea, some time during the next few months. -- Tim Starling 00:04, Apr 1, 2004 (UTC)
The only Mac browsers able to use Unicode are Safari, Opera etc. on MacOS X, as far as I know, while it is not possible to edit unicode pages with IE. A switch to unicode would be very problematic for many Mac users. Ertz 00:12, 1 Apr 2004 (UTC)
OS 9 has Unicode support; not quite as slick as OS X, no, but it's there. Either way, the number of people still using OS 9 is dwindling rapidly, and will continue to do so. Garrett Albright 02:43, 1 Apr 2004 (UTC)
Which masses have you polled? Unicode would be largely impossible to edit. RickK | Talk 02:45, 1 Apr 2004 (UTC)
How so? I mean, what are the specific drawbacks, other than for the users of older Macs? कुक्कुरोवाच 03:10, 1 Apr 2004 (UTC)
If I were trying to edit a page, and came across something looking like |कुक्कुरोवाच, I would have NO idea what to do with it. RickK | Talk 03:35, 1 Apr 2004 (UTC)
RickK: Just work around it and don't touch it. :)
Judging by your <nowiki> tags, do you mean "something looking like &#2325;&#2369;&#2325;&#2381;&#2325;&#2369;&#2352;&#2379;&#2357;&#2366;&#2330;"? In which case, I'm not sure I see your point. We already use such character entities extensively in articles. The idea of UTF-8 is to allow unicode characters to be inserted without resorting to such ugly constructions. Also, switching en to UTF-8 will make it easier to implement some proposed interwiki features, such as merging the meta recent changes (which is UTF-8) with the local wiki recent changes. -- Tim Starling 03:49, Apr 1, 2004 (UTC)
Doesn't work in Safari, at least not whatever particular language that is. I see the same character (a box surrounding a char I don't recognize) repeated for each character in your sig. Other languages work fine: Japanese, Chinese, Greek, some Cyrillic, but there's one Cyrillic-alphabet-based language that also doesn't work (not sure which it is). That's the problem: support is spotty. If user A enters in text in Japanese natively, what happens when user B who doesn't have Unicode support saves the page? I'm pretty sure the characters would change to little boxes (or whatever the browser displays when it doesn't understand a character) in the textarea, the user would save the page and then everybody would see the "little boxes." I think it could be a problem waiting to happen. RADICALBENDER 05:02, 1 Apr 2004 (UTC)
The web browser does *not* rewrite the characters to "little boxes" when editing -- they are simply shown that way by whatever display mechanism the browser uses. silsor 05:29, Apr 1, 2004 (UTC)
RB: Next time you (re)install OS X, make sure to let it install every language file it can. I'm running Safari on OS X, and I see the characters just fine. Garrett Albright 05:34, 1 Apr 2004 (UTC)
I have no idea how to do that. And how many other random Wikipedia editors would? RickK | Talk 04:14, 1 Apr 2004 (UTC)
The whole point is that if the software were switched over to UTF, you wouldn't need to interact with these strings or know anything about them at all. They would just work as regular characters.
I'm at an utter loss. How would I possibly be able to insert a character that isn't on my keyboard? RickK | Talk 04:56, 1 Apr 2004 (UTC)
Rick, if you're using Windows, then the Character Map applet is your friend. Find the character you want and it will either tell you how to enter it from the keyboard or allow you to copy+paste it. You'll need some nice Unicode fonts, like Junicode, but newer versions of Windows come with Lucida Sans Unicode anyway. --Phil | Talk 14:, Apr 1, 2004 (UTC)
In most Windows applications, Left alt + numeric keyboard types (dec) Unicode. alt+0549 is ȥ for example. — Jor (Talk) 12:21, 1 Apr 2004 (UTC)
The prefixing 0 is important by the way: otherwise the Windows encoding is used instead, which wraps around (alt+256 = alt+0) — Jor (Talk) 12:25, 1 Apr 2004 (UTC)
Actually, with or without the 0 you don't get Unicode, but the system's ANSI and OEM codepages respectively. You can use WordPad (or anything else that uses a RichEdit control): type the hex number for the Unicode character, then type Alt-x. Pjacobi 08:43, 14 Jul 2004 (UTC)
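For anyone following the keyboard-entry methods above, the decimal and hexadecimal forms of a code point are easy to convert between. A minimal Python sketch (Python is used here only for illustration and is not part of the original discussion), based on the ȥ example mentioned earlier:

```python
# Decimal 549 and hexadecimal 225 name the same code point, U+0225.
ch = chr(549)

hex_form = format(ord(ch), "X")   # the hex number typed before Alt-x
dec_form = ord(ch)                # the decimal number quoted for Alt+0549

print(f"U+{ord(ch):04X}", hex_form, dec_form)
```

This only shows the numeric relationship; whether a given Alt sequence actually produces the character depends on the application, as noted above.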
With a compose key, maybe, or with copy-and-paste. I keep a set of characters I need which I don't have on my keyboard on my userpage on cy:, and c+p them when I need them in articles. Marnanel 05:01, Apr 1, 2004 (UTC)
People who use the languages in question know how to type in them. Someone who studies Sanskrit needs to be aware of how to produce the relevant unicode characters. Similarly, someone who writes mathematical articles may need to learn TeX, and someone who works in science may need to produce diagrams. You contribute what you know, it's not necessary to be an encyclopedia to contribute to an encyclopedia. That said, there's a good resource at http://www.alanwood.net/unicode/ . If you go to the test pages, you'll see a list of characters which can be copied and pasted into an edit box. -- ɫɪɱ ʂɫɒɼʅɪɳɠ 05:10, Apr 1, 2004 (UTC)
If you were going to work with Sanskrit (or other languages in its family) I would suggest http://www.aczone.com/itrans/online/. Other tools would apply for other languages (there's also http://www.emeld.org/tools/charwrite.cfm for IPA in Unicode, which would offer pan-linguistic functionality of a certain kind). Of course, it's entirely possible you'll never need to deal with nonstandard characters (in which case it shouldn't make the least difference to you which encoding the site uses, as your keyboard will suffice in either), but for those who contribute to articles that necessarily involve terms from languages that aren't representable with the characters that go into English, there's a basic need here. कुक्कुरोवाच 05:42, 1 Apr 2004 (UTC)

Switching the entire project over to UTF-8 or leaving things in ISO-8859-1 are not the only two choices. It would be straightforward to add a user option for "Edit in UTF-8". When a logged-in user with this option set requests to edit a page, the server translates HTML character references to their UTF-8 equivalents. When the user submits their edit, the server translates non-ASCII (or non-ISO-8859-1) characters back to the HTML character references for storage in the database. Users who don't set this option would see no differences. See my Editing in UTF-8 feature request. — Gdr 12:33 2004-04-01.
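The round trip proposed above can be sketched roughly as follows. This is an illustrative Python sketch, not MediaWiki code: the function names `to_edit_form` and `to_stored_form` are hypothetical, and the choice to expand only references above U+00FF (so that references such as &#60; keep their meaning in markup) is an assumption, not something the proposal specifies.

```python
import re

# Matches decimal (&#945;) and hexadecimal (&#x3B1;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def to_edit_form(stored):
    """Expand numeric references above U+00FF into real characters for editing."""
    def expand(match):
        code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # Assumption: leave Latin-1 range references and invalid code points untouched.
        if code <= 0xFF or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    return NUMERIC_REF.sub(expand, stored)

def to_stored_form(edited):
    """Encode for an ISO-8859-1 database, escaping everything outside Latin-1."""
    return edited.encode("iso-8859-1", errors="xmlcharrefreplace")

stored = "Sanskrit: &#2325;&#2369; and caf\u00e9"
edited = to_edit_form(stored)        # what the editor would see, sent as UTF-8
saved = to_stored_form(edited)       # what would go back into the database
```

Under these assumptions, text that is loaded and saved without changes comes back byte-for-byte identical, which is what lets users without the option see no difference.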

For complex scripts, this is a nontrivial operation. This would require the server to change all entities over #255 in Unicode to numeric entities when converting to ISO-8859-1, and likewise to convert all entities back to direct characters when converting to UTF-8. Let alone the problem of combining diacritics and RTL/LTR! — Jor (Talk) 12:41, 1 Apr 2004 (UTC)
I don't see the difficulty. Numeric character references are trivial to translate since HTML &#x1234; turns into Unicode U+1234 and vice versa. Named character references like &ouml; and &rarr; can be looked up in a table. There's no need to do anything with diacritics and bidirectional text. Just store and transmit the text as it was written and leave it up to the browser to render it. — Gdr 13:52 2004-04-01 (UTC)
I agree with the last part. But that, if anything, is an argument for UTF-8 only rather than for a server-side ISO-8859-1/UTF-8 conversion. Just for argument's sake, browsers that can't handle Unicode won't be affected as UTF-8 is identical to ISO-8859-1 in the first 256 characters. Any chars above that probably will not display correctly for people using archaic browsers anyway. — Jor (Talk) 17:43, 1 Apr 2004 (UTC)
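A note on the "first 256 characters" point above: Unicode gives its first 256 code points the same numbers as ISO-8859-1, but the UTF-8 byte sequences match ISO-8859-1 only for the first 128 (the ASCII range). The Python sketch below is included purely as an illustration and compares the two encodings:

```python
ascii_char = "A"          # U+0041, inside the ASCII range
latin1_char = "\u00e9"    # U+00E9 (e with acute accent), upper half of ISO-8859-1

# ASCII characters produce the same single byte in both encodings.
same_for_ascii = ascii_char.encode("utf-8") == ascii_char.encode("iso-8859-1")

# U+00E9 is one byte in ISO-8859-1 but two bytes in UTF-8.
latin1_bytes = latin1_char.encode("iso-8859-1")
utf8_bytes = latin1_char.encode("utf-8")

# Count how many of the first 256 code points have identical bytes in both encodings.
identical = sum(
    1 for cp in range(256)
    if chr(cp).encode("utf-8") == chr(cp).encode("iso-8859-1")
)
print(same_for_ascii, latin1_bytes, utf8_bytes, identical)
```

So a page containing only ASCII is unaffected by the change of charset, while accented Latin-1 characters do change their byte representation.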
I think you misunderstand. The point of having an "edit in UTF-8" option has nothing to do with display. Pages display just fine with the current system. The point is to make it easy to enter international text in browsers other than Mozilla. If the editing page is transmitted in UTF-8, I can type international characters directly into the edit box in many browsers, including Opera, Safari, and Internet Explorer. With the current system (editing page transmitted in ISO-8859-1), I have to convert international characters into the corresponding HTML character entity references. This is tedious. — Gdr, 11:44 2004-04-02.
Hehe- even early versions of moz are more advanced than IE, not only when it comes to utf-8. IE4 has patchy support, NS4 as well. Nobody editing pages in languages where utf-8 is important uses these browsers though. A check if the posted text validates as utf-8 makes sense imo, throw error otherwise. Just somebody has to write it. Volunteers? -- Gabriel Wicke 13:24, 2 Apr 2004 (UTC)
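The validation check suggested above might look something like this. It is only a sketch in Python, and `validate_utf8` is a hypothetical name, not an existing MediaWiki function:

```python
def validate_utf8(posted: bytes) -> str:
    """Return the decoded text, or raise ValueError if the bytes are not valid UTF-8."""
    try:
        return posted.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"submission is not valid UTF-8 at byte {exc.start}") from exc

good = "\u0141".encode("utf-8")        # U+0141 sent as UTF-8
bad = "\u00e9".encode("iso-8859-1")    # U+00E9 sent as ISO-8859-1: a lone 0xE9 byte

text = validate_utf8(good)
try:
    validate_utf8(bad)
    rejected = False
except ValueError:
    rejected = True
```

A strict decode rejects the typical failure case, where a browser submits the form in a legacy single-byte encoding instead of UTF-8.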
I guess using Opera made me lazy. I just type non–West European chars like Ł or 匥, and Opera does the conversion to the HTML entity for me if the page is in a non-Unicode charset :). Thanks for clarifying! — Jor (Talk) 19:55, 2 Apr 2004 (UTC)

Hi! I am a user from the French Wikipedia. I know that some of you were interested in the conversion to UTF-8. As you may want to test on your personal wiki before considering the switch, here is the software to convert the MySQL dump: http://mboquien.free.fr/wikiconvert/ . It converts:

  • HTML entities, for instance &szlig; => ß, excluding on purpose &gt;, &lt;, &nbsp; and &amp;
  • Unicode entities (decimal or hexadecimal), for instance &#223; => ß
  • all other characters valid in your encoding are converted properly

What it doesn't do:

  • badly formatted entities are not converted, typically an entity that doesn't end with ;
  • windows-1252 characters are also not converted. To have them corrected before the conversion, you can ask Looxix on the French wiki. He has a very good bot to perform this kind of task, if you don't already have one.
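The conversion rules listed above can be sketched in a few lines. This is not the wikiconvert source (which is C++ with Qt); it is a hypothetical Python illustration of the same rules, and the names `convert_entities` and `convert_dump` are invented for the example:

```python
import re
from html.entities import name2codepoint

# Entities deliberately left alone, as in the list above.
PROTECTED = {"gt", "lt", "nbsp", "amp"}

# Only well-formed references (terminated by ';') are matched.
ENTITY = re.compile(r"&(?:#[xX]([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));")

def convert_entities(text):
    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        if name is not None:
            if name in PROTECTED or name not in name2codepoint:
                return match.group(0)
            return chr(name2codepoint[name])
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code == 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)  # not a usable code point; leave untouched
        return chr(code)
    return ENTITY.sub(replace, text)

def convert_dump(raw, source_encoding="iso-8859-1"):
    """Decode the dump, expand entities, and re-encode everything as UTF-8."""
    return convert_entities(raw.decode(source_encoding)).encode("utf-8")

sample = "Stra&szlig;e &#223; &#xDF; &amp; &lt;b&gt; &alpha &#945"
converted = convert_entities(sample)
```

The sketch expands every well-formed numeric reference, including ones such as &#60; that stand for markup characters; the list above does not say how the real tool treats those, so a real converter would need an explicit decision there.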

This version is a rewrite of the one we used to convert the French wiki (which was really dirty). I rewrote it this afternoon and tested it on an old cur dump of the French wiki; everything seems to work as expected. As for the details, it depends on Qt (no trolling about the choice of toolkit, please) and I ran it on Mandrake 10.0. I was told that it also compiles out of the box on Slackware. If you use another distribution, you may need to tweak the Makefile to get the correct path for Qt (you should set QTDIR correctly before trying to compile). Needless to say, you need the Qt development packages installed. Using it is quite easy. The Makefile produces a wikiconvert executable. To convert, you just need to run: ./wikiconvert < dump > converteddump (if you don't use iso8859-1, there is one line to change in wikiconvert.cpp, as explained in the source). On my computer (an Athlon XP 2000+ underclocked to 1.5 GHz), converting a 90 MB dump of cur takes about 100 seconds. You should ask for an uncompressed dump of cur for your test: the compressed dumps available at http://download.wikipedia.org/ are not suitable, because once they are converted MySQL can't load the dump completely (apparently a problem of overly long lines, last time I tried).

I'd be very happy to get some feedback, and I would gladly accept patches to make the program faster/better. :) If you have any questions, you can reach me on #fr.wikipedia on Freenode or on my discussion page (French or English only please). Med 09:41, 4 Apr 2004 (UTC)

I think the ironic thing is that Wikipedia is already using Unicode. Tagging the pages as ISO-8859-1 and forcing users to use HTML entities just takes up more bandwidth and makes the editing slower.

-浪人
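The bandwidth point can be checked with a quick count. The Python sketch below is only an illustration; it compares the size of the Devanagari signature used earlier on this page when sent as UTF-8 and when sent as decimal character references:

```python
# The Devanagari signature from the discussion above, written with escapes.
name = "\u0915\u0941\u0915\u094d\u0915\u0941\u0930\u094b\u0935\u093e\u091a"

as_utf8 = name.encode("utf-8")
as_refs = name.encode("ascii", errors="xmlcharrefreplace")

print(len(name), len(as_utf8), len(as_refs))
```

For this string the references take more than twice as many bytes as UTF-8. The comparison depends on the text, though: characters that exist in ISO-8859-1 take one byte there and two in UTF-8, so pages that are mostly Western European text would not shrink.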

Update: By now the Spanish and German Wikipedias have been converted successfully to UTF-8. Only Dutch, Danish, Swedish and English still use ISO-8859-1.


While the whole Unicode debate is going on, you might find a little tool I wrote useful. Just go to my user page for the source and a link to a "runnable" version. All it does is convert all non-ASCII Unicode characters you type in it into the &#0000; format. I didn't know if there was something like this already out there, so I just spent 25 minutes writing my own. --Aramgutang 06:46, 8 Aug 2004 (UTC)

Greek unicodes

I have placed a set of Greek alphabet unicodes at the foot of my User page for anyone who works on Greek-related articles and shares my inability to memorise them. Adam 03:12, 23 Apr 2004 (UTC)

Wouldn't it be best to use HTML entities, for backwards compatibility? Dysprosia 10:28, 23 Apr 2004 (UTC)
Plus they are a lot easier to remember. theresa knott 11:01, 23 Apr 2004 (UTC)
HTML entities are hard to edit and look ugly in the editing window, not to mention that they are SGML only, and that Unicode can just be copied&pasted in any text editor. — Jor (Talk) 12:21, 23 Apr 2004 (UTC)
What was wrong with the Unicode tables in the Greek alphabet article? Gdr 11:56, 2004 Apr 23 (UTC)
There is nothing inherently wrong with Unicode, but most people who are on non-Unicode compliant systems can't see Unicode glyphs. Dysprosia 12:05, 23 Apr 2004 (UTC)
But people using those archaic systems won't be able to access most non-US ASCII websites anyway. Why punish everyone to cater to a very small minority which probably has no interest in reading Greek in the first place? — Jor (Talk) 12:21, 23 Apr 2004 (UTC)


That doesn't mean we should actively seek to prevent users on different, non-Unicode-compatible systems from reading the text. I was somewhat sure that Windows 9x versions were not natively Unicode compatible, but [1] seems to suggest that this is the case.
In any case, how are the HTML entities "punishment" in comparison to the Unicode glyphs? One would think that the numerical Unicode entity would be more painful to enter than the slightly more intuitive HTML text-based entity... Dysprosia 12:53, 23 Apr 2004 (UTC)
You can't save unicode characters into articles on en, the encoding is ISO 8859-1. If you paste in a unicode character, or type it somehow, most browsers will automatically convert it to a numeric character entity. You can type in unicode if you wish, but it means that numeric character entities will be saved (e.g. &#945;) rather than the more readable named character entities, e.g. &alpha;. Unicode support in browsers is irrelevant. -- Tim Starling 01:15, Apr 24, 2004 (UTC)
I don't think the named entities are really necessary for typing Greek text: they exist mostly as a coincidental accident because of the fact that Greek letters are used as symbols in a lot of other areas. We type Cyrillic using the numeric entities, for example, because that's the only way to do it, and it doesn't seem like doing the same for Greek is somehow worse. Furthermore, it is not possible to write correct Greek text using only the named entities, because no entities are provided for accented characters, and nearly every Greek word has at least one accent in it (and spelling it without the accent is not correct). Writing a word using all named entities except for one numeric entity in the middle would be kind of odd. --Delirium 02:50, Apr 28, 2004 (UTC)
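The mix of named and numeric entities described above is easy to reproduce. The following Python sketch is an illustration only; it relies on the standard library's `html.entities.codepoint2name` table, which covers the HTML 4 named entities, and the helper name `to_references` is made up for the example:

```python
from html.entities import codepoint2name

def to_references(text):
    """Use a named reference where the HTML 4 table has one, else a decimal one."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                           # plain ASCII stays as typed
        elif code in codepoint2name:
            parts.append(f"&{codepoint2name[code]};")  # e.g. &alpha;
        else:
            parts.append(f"&#{code};")                 # no name available
    return "".join(parts)

# "Khristos" in monotonic Greek, written with escapes so the code points are unambiguous.
word = "\u03a7\u03c1\u03b9\u03c3\u03c4\u03cc\u03c2"
encoded = to_references(word)
print(encoded)
```

The accented omicron (U+03CC) has no named entity in that table, so the result is six named references with one numeric reference in the middle, which is the odd-looking mix mentioned above.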


Which Unicode characters can/should we use?

I started a few weeks ago changing various Greek language entries (e.g. in the top line of Jesus, I put Greek Ἰησοῦς Χριστός Iēsoûs Khristós) to display the proper accent marks. This displays fine in Mozilla. But when I try to display the same pages in Microsoft Internet Explorer all I get is little squares not Greek letters.

Is there an official Wikipedia policy on which Unicode characters we should and should not use? m.e. 10:58, 24 Jun 2004 (UTC)/m.e. 08:12, 9 Jul 2004 (UTC)

I can see a few question marks within the Aramaic spelling, and I have the rather complete MS Arial Unicode font installed. The different display in Mozilla and IE might be a font selection problem; maybe you have set your Mozilla to use a different default font? I am not aware of any official policy on Unicode, only that we should limit ourselves to the original and the English spelling, as there is not much point in having the Cyrillic spelling of someplace in Greece. If it displays better in most cases, you can try it without the accent marks, and maybe put the correct version in an HTML comment behind it. andy 11:33, 24 Jun 2004 (UTC)
The Mozilla is on Linux and the MSIE is on XP, so I'm not surprised to get different results. I know that some users will be reading Wikipedia using Mosaic on Windows 1.0 and some will have the complete Unicode everything installed. I'm not sure how to strike a compromise in between. m.e. 12:02, 24 Jun 2004 (UTC)
IE displays a subset of the Unicode characters that Mozilla displays on the same machine with the same operating system and the same font. I think this is because Mozilla has a better developed character code mapping table (it's had three years' more development). Mr. Jones 14:07, 24 Jun 2004 (UTC)
You might find this page on meta useful. theresa knott 14:02, 24 Jun 2004 (UTC)   — thank you, Theresa, I've read it now; I have been creating the characters using &#xffff;, I was wondering which characters I should and should not use. m.e. 10:45, 25 Jun 2004 (UTC)
I suggest that MS Arial Unicode is perhaps the worst font for page compatibility tests because, although it is probably the most complete Unicode font commonly available, it is limited to only those who have a Microsoft product like MS Office 2000 or later installed on their MS Windows IBM-compatible computer. Even though this probably includes more than half the computer user population of the world, it leaves out a huge minority as well. (Personally, I've never gone beyond Office 97, having no compelling reason to pay the huge expense.) Microsoft doesn't seem to offer it as a separately downloadable font, even for a price. (Just another of the thousands of little ways it encourages everyone to buy its major software products.) -- Jeff Q 21:09, 24 Jun 2004 (UTC)
Are there any good alternative fonts that are more widely available? Also, is the En wiki ever going to go UTF-8 like all the others? -- कुक्कुरोवाच|Talk‽ 21:13, 24 Jun 2004 (UTC)
Alan Wood's Unicode Resources page is an excellent resource for Unicode font issues. His "Introduction" section includes a set of links in the line reading: "Lists of fonts for Windows, Mac OS 9, Mac OS X 10 and Unix, with the Unicode ranges they support, and where to obtain them." -- Jeff Q 11:22, 25 Jun 2004 (UTC)
On the basis of this, it appears that IE is rendering Greek but not Extended Greek. According to the Alanwood pages that you referred to, Arial Unicode MS should render both Greek and Extended Greek correctly. Does the Wikipedia CSS force IE to another font that does not have Extended Greek? Also, I notice that Wikipedia pages have charset=iso-8859-1 in the header, but I presume this doesn't matter as I am coding my characters as &#x0000; codes rather than directly inserting the characters themselves.
I suppose this means we need a rule that says only use the characters supported by Arial???? m.e. 10:03, 27 Jun 2004 (UTC)
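To see which characters of a word fall outside the basic Greek block, the code points can be checked against the block ranges. This Python sketch is an illustration only; it assumes the polytonic spelling uses precomposed characters, and `block_of` is a hypothetical helper name:

```python
# Block ranges as published in the Unicode code charts.
GREEK_AND_COPTIC = range(0x0370, 0x0400)
GREEK_EXTENDED = range(0x1F00, 0x2000)

def block_of(ch):
    code = ord(ch)
    if code in GREEK_AND_COPTIC:
        return "Greek and Coptic"
    if code in GREEK_EXTENDED:
        return "Greek Extended"
    return "other"

# The polytonic spelling of "Iesous", written with escapes for clarity.
word = "\u1f38\u03b7\u03c3\u03bf\u1fe6\u03c2"
report = [(f"U+{ord(ch):04X}", block_of(ch)) for ch in word]
needs_extended = any(block_of(ch) == "Greek Extended" for ch in word)

for code, block in report:
    print(code, block)
```

Under that assumption, two of the six characters (the initial iota with breathing and the upsilon with circumflex) come from Greek Extended, so a font that covers only the basic Greek block would show boxes for exactly those letters.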
Font rendering is an incredibly complex, multidimensional problem that is far from being adequately solved, especially for a global Web resource like Wikipedia. You can't really speak of what IE will render; you've got to specify what version it is, what platform you're running on, what fonts you have installed (by manufacturer name, not style), how your browser is configured to render certain types of fonts, what language it's set to, and so on. (I can see that you, m.e., know much of this already, but I state it here explicitly for other folks reading this.) Most of these settings are done very differently for different browsers and even between versions of the same browser. Frankly, I don't understand a good bit of it myself. Just when I think I've got everything configured properly for my Opera browser, something weird happens and I have to delve back into this confusion. From what little I've seen, MSIE is simpler to configure but more difficult to customize properly. One thing to keep in mind is that simply finding a font that renders your desired characters isn't sufficient, since you can't expect anybody to have done this for their browsers. Any Wikipedia page that displays nicely in your customized browser will be useless to the vast majority. I have no good answer for this annoyance. It seems to require a commitment to robust Unicode font inclusion in browser installations and preconfigurations AND cooperation between the mercilessly-competitive platform, browser, and font vendors that just doesn't exist yet. -- Jeff Q 14:47, 28 Jun 2004 (UTC)

I suppose someone should jump in and write a policy that says which characters one should and should not use? Where would it go? Who should write it? Would it go through some sort of acceptance test before it reaches 'production'? I'd think it would be a bit contextual; in some (more specialised) contexts you might go for the 'real' characters, and accept that they might not display for everyone.

Also, could we solve this by using the TeX option? Can we use the TeX display mode, normally used for mathematics, for displaying non-Latin characters?... TeX mode doesn't seem to work for this, as it throws you straight into math mode, and it seems only to recognise a limited subset of TeX commands; is this true? m.e. 09:22, 29 Jun 2004 (UTC)

I think the policy should be use any Unicode characters you think right for the article. Writing excellent encyclopedia articles is more important than worrying too much about browser and operating system capabilities. Browsers and operating systems will catch up (some are pretty good already). To cater for people who can't see some characters, the right thing to do is to present the same information in several forms. For example many articles give pronunciation indications in both IPA and ASCII-IPA. Gdr 19:12, 2004 Jul 3 (UTC) — that's a point, I suppose we should work on the principle that Wikipedia will still be around in 10n years and we should write for then as well as for now. m.e. 09:53, 5 Jul 2004 (UTC)

Little conversion tool

I posted this on the article page, and thought I'd post it here as well, so that more people know about it.

While the whole Unicode debate is going on, you might find a little tool I wrote useful. Just go to my user page for the source and a link to a "runnable" version. All it does is convert all non-ASCII Unicode characters you type in it into the &#0000; format. I didn't know if there was something like this already out there, so I just spent 25 minutes writing my own. --Aramgutang 06:50, 8 Aug 2004 (UTC)
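For readers without access to that tool, the same kind of conversion can be approximated in a couple of lines of Python. This is not Aramgutang's code, just a sketch of the idea using the standard `xmlcharrefreplace` error handler; `escape_non_ascii` is a hypothetical name:

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

sample = "na\u00efve \u0141\u00f3d\u017a"
escaped = escape_non_ascii(sample)
print(escaped)
```

ASCII characters pass through unchanged, so the output can be pasted straight into an edit box on an ISO-8859-1 wiki.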

Resin identification code

The Resin identification code Unicode symbols don't work (on Firefox anyway). Is there someone here who knows how to fix them?

Duk 16:00, 10 Oct 2004 (UTC)

They work here in Firefox on Windows XP. It's all about whether you have appropriate fonts installed. DopefishJustin (・∀・) 17:16, Nov 11, 2004 (UTC)