
Unicode

Unicode is a character encoding standard that supports characters from almost every language in the world, including quite a few ancient languages. I have learned a lot about Unicode while working as an editor, since it is quickly becoming the encoding of choice for eBooks.

If you are unfamiliar with Unicode, I suggest you take a look at the related links on my Links page. They provide much more foundational information than I cover here.

At WORDsearch, we use Unicode because of the Hebrew, Greek and transliterations in many of our books. Encoding the foreign languages in Unicode allows us to create functionality that is extremely hard to create otherwise. Since all of our books are created in XHTML, the same rules and hints for displaying Unicode characters apply to websites and other HTML-based uses of foreign languages.

An ASCII Example

For instance, let’s say I wanted to put my Hebrew name on a web page: יְהוֹשֻׁעַ. I could use SP Tiberian, an ASCII font that is somewhat common, to display the name. In my source code it would appear like this:

<span class="heb">(a#$uwOhy;</span>

Notice a few things about this example. First, since SP Tiberian is an ASCII font, the underlying characters are regular ASCII characters. The web browser applies the font to the text based on a style in the stylesheet called “heb”. This means that the user will not see the word in Hebrew if they do not have the SP Tiberian font installed on their computer.

Second, notice the order of the characters. The “y” is the character used for a yod, the first letter of the name, and the “(” is the character used for an ayin, the last letter. Because the browser lays out these ASCII characters from left to right, the word has to be stored in reverse in the source so that it reads correctly from right to left on screen. Third, you will notice that each vowel is placed to the right of its consonant. These two encoding peculiarities make transforming foreign languages from one font map to another quite a task, especially when you don't know whether the source documents were proofed by someone who knows the language you are displaying! The task is further complicated by the different ordering of the characters in different fonts. For instance, “SIL Ezra,” SIL’s ASCII Hebrew font, requires the diacritics to be placed to the left of the consonant.

A Unicode Example

Another option I have for displaying my name in Hebrew on a web page is Unicode. In this case, the source code could look like this:

<span class="heb">&#x05D9;&#x05B0;&#x05D4;&#x05D5;&#x05B9;&#x05E9;&#x05C1;&#x05BB;&#x05E2;&#x05B7;</span>

That example is encoded in Unicode using the characters’ hexadecimal (base-16) values, as indicated by the “x” in each numeric character reference and by the letters that appear among the digits. (These are the same values commonly written in the form “U+123A” in discussions on Unicode.) The name could also be encoded using the decimal (base-10) number for each character, like this:

<span class="heb">&#1497;&#1456;&#1492;&#1493;&#1465;&#1513;&#1473;&#1467;&#1506;&#1463;</span>

Both of these encodings will display exactly the same on the web site. They are also both treated the same way by the browser, so let’s use the second encoding to compare Unicode to ASCII.
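To check that the two forms really do describe the same characters, here is a minimal Python sketch. It is only an illustration and not part of how this site is built; the helper names `to_hex_refs` and `to_decimal_refs` are made up for the example. It decodes both sets of references with the standard library's `html.unescape` and then converts the result back into each form.

```python
import html

# The two sets of numeric character references from the examples above.
HEX_FORM = "&#x05D9;&#x05B0;&#x05D4;&#x05D5;&#x05B9;&#x05E9;&#x05C1;&#x05BB;&#x05E2;&#x05B7;"
DEC_FORM = "&#1497;&#1456;&#1492;&#1493;&#1465;&#1513;&#1473;&#1467;&#1506;&#1463;"


def to_hex_refs(text):
    """Encode every character as a hexadecimal reference such as &#x05D9;."""
    return "".join(f"&#x{ord(ch):04X};" for ch in text)


def to_decimal_refs(text):
    """Encode every character as a decimal reference such as &#1497;."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# Decoding either form gives the same ten Unicode characters.
decoded_hex = html.unescape(HEX_FORM)
decoded_dec = html.unescape(DEC_FORM)

print(decoded_hex == decoded_dec)              # True
print(to_hex_refs(decoded_dec) == HEX_FORM)    # True
print(to_decimal_refs(decoded_hex) == DEC_FORM)  # True
```

A helper like this can be handy when you have Hebrew text typed directly in an editor and want to paste it into a page as references.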

The first important note to make about Unicode is that it does not require the use of any specific font. While a specific font lets you control exactly how the text displays, the most common fonts supplied with most operating systems are Unicode fonts. For instance, the stylesheet used on this web site specifies that “Ezra SIL,” SIL’s Unicode Hebrew font, be used to display Hebrew characters. If your computer does not have Ezra SIL installed, the secondary font is Narkisim, and the tertiary font is Times New Roman, both of which will display the Hebrew text acceptably. As a result, unlike with ASCII fonts, users do not need to have the specific fonts requested by the web site in order to display the foreign languages.

In the above example, “&#1497;” is the code for yod, and “&#1506;” is the code for ayin. Notice that the characters appear in the source in reading order: the yod, the first letter of the name, comes first, and the ayin comes near the end. This same character order is followed in all Unicode uses, and when displaying Hebrew, the browser knows to display the text right-to-left. The vowels follow the same order, meaning that the vowel shewa under the yod is encoded immediately after it in the source (&#1456;). This consonant/diacritic encoding provides a more easily editable source.
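If you want to see this order for yourself, the following Python sketch lists the characters of the name in the order they are stored. It uses only the standard `unicodedata` module; the function name `describe` is my own choice for the illustration. The third column is each character's bidirectional class, which is the property a browser consults when it decides to lay the text out right-to-left: "R" marks a right-to-left letter and "NSM" marks a non-spacing mark, such as a vowel point, that attaches to the letter before it.

```python
import unicodedata

# The name from the examples above, written with Python escapes.
NAME = "\u05D9\u05B0\u05D4\u05D5\u05B9\u05E9\u05C1\u05BB\u05E2\u05B7"


def describe(text):
    """Return (code point, Unicode name, bidi class) for each character, in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        for ch in text
    ]


for code_point, name, bidi_class in describe(NAME):
    print(code_point, name, bidi_class)
```

The first two rows are the yod and its shewa, which matches the order of the references in the source.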

These same concepts can be applied to any other Unicode text in a document, whether you are creating Chinese text or simply adding a combined diacritic to a character.

Also, be aware that these two examples of Unicode encodings are not the only ways to display Unicode characters on your web site. However, I have found them to be the easiest to work with and the easiest to debug, so I use them almost exclusively.

Notes On Hebrew

While we are talking about Hebrew Unicode, I would like to note a few issues that I have encountered in the use of Hebrew in web pages. These notes and fixes are worth mentioning not only because the issues can be frustrating, but also because I could not readily find the answers to them anywhere else on the Internet. So, I am hoping to keep you from banging your head against the computer screen too much.

Character Order

The first issue I would like to note is that sometimes, when you display Hebrew in a sentence with English punctuation marks directly attached to it, the browser will mess up the ordering of the characters and place the punctuation marks incorrectly. If you do run into this issue (which I cannot seem to reproduce here...), you can fix it by forcing the browser to display the Hebrew characters from right to left. This can be accomplished using the directional marks: the left-to-right mark, U+200E (&#8206;), and the right-to-left mark, U+200F (&#8207;). Just place these at the proper ends of your Hebrew text and the problem should be resolved.
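As a rough illustration, here is a small Python sketch that brackets a piece of Hebrew with right-to-left marks before it goes into a page. The helper name `mark_rtl` is hypothetical, and putting a mark on both sides of the text is an assumption about what "the proper ends" means in a given sentence; you may only need one of them, or a left-to-right mark instead, depending on where the stray punctuation lands.

```python
import html

RLM_REFERENCE = "&#8207;"  # U+200F, the right-to-left mark
LRM_REFERENCE = "&#8206;"  # U+200E, the left-to-right mark


def mark_rtl(hebrew_markup):
    """Surround Hebrew text (or references) with right-to-left marks.

    Assumption: a mark on each side is wanted; adjust to suit the sentence.
    """
    return f"{RLM_REFERENCE}{hebrew_markup}{RLM_REFERENCE}"


# Yod followed by shewa, as in the examples above, with punctuation attached.
fragment = mark_rtl("&#1497;&#1456;") + "!"
print(fragment)  # &#8207;&#1497;&#1456;&#8207;!

# The marks are invisible characters once the references are decoded.
print(html.unescape(RLM_REFERENCE) == "\u200F")  # True
```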

Cantillation Marks

Another common issue with the displaying of Hebrew text is the apparent lack of support for cantillation marks. While common Unicode fonts such as Times New Roman and Verdana do not support cantillation marks, I know of two fonts that do support them: Ezra SIL and SBL Hebrew (from the Society of Biblical Literature). The use of these fonts is restricted by their copyrights, but if you are looking to display Hebrew text in any program, they are both great fonts to utilize. I have links to their download pages in my list of Links.

Normalization

The order in which you encode your Hebrew text is important for its proper display. Putting the characters into a standard order is known as “normalization.” While the ordering of Hebrew characters on your web site may not be affected directly by the commonly used normalization routines, you will still need to order the characters yourself to ensure their proper display.

The normalization routine utilized on the Web is Normalization Form C (NFC). The problem is that NFC is not compatible with Biblical Hebrew: when it is applied, the characters are often moved out of alignment with each other, resulting in the text being displayed incorrectly.
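To see NFC move Hebrew marks around, here is a short Python sketch using the standard `unicodedata` module. It takes the shin from the name in the Unicode example, which is stored there as shin, shin dot, qubuts, and runs it through NFC. The helper names are mine. Whether the new order actually displays badly depends on the font and the rendering software, so the sketch only shows that the stored order changes.

```python
import unicodedata

# Shin, shin dot, qubuts: the order used for this letter in the name above.
TYPED = "\u05E9\u05C1\u05BB"


def code_points(text):
    """List the code points of a string in stored order."""
    return [f"U+{ord(ch):04X}" for ch in text]


def combining_classes(text):
    """List each character's canonical combining class (0 for base letters)."""
    return [unicodedata.combining(ch) for ch in text]


normalized = unicodedata.normalize("NFC", TYPED)

print(code_points(TYPED))        # ['U+05E9', 'U+05C1', 'U+05BB']
print(combining_classes(TYPED))  # [0, 24, 20]
print(code_points(normalized))   # ['U+05E9', 'U+05BB', 'U+05C1']
```

NFC sorts the marks that follow a letter by their combining class, so the qubuts (class 20) ends up ahead of the shin dot (class 24) even though it was entered after it.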

For more information on Biblical Hebrew normalization, check out my normalization page. For more resources on normalization in general, check out my Professional Links page.

Other Issues

There are other issues with the use of Unicode Hebrew for Biblical and other ancient texts. Some of them are somewhat technical and deal with issues that the common user will probably not encounter. For an explanation of a few of these issues, check out Peter Kirk’s web site.
