Overview of Unicode
This section describes how textual information is represented and managed in Unicode, and how EMu handles Unicode data.
Code Points
The basic unit of information in Unicode is known as a code point. A code point is simply a number between zero and 10FFFF (hexadecimal) that represents a single entity. Code points are generally written as hexadecimal (base 16) numbers. An entity may be any of the following:
| Entity | Description |
|---|---|
| graphic | A letter, mark, number, punctuation, symbol or space. |
| format | Controls the formatting of text, e.g. the soft hyphen. |
| control | A control character, e.g. the tab character. |
| private-use | Not defined in the Unicode 8.0 standard but used by other non-Unicode scripts, e.g. the unused cp 1252 character 91 (hex). |
| surrogate | Used to select supplementary planes in UTF-16. Code points in the range D800-DFFF (hex). |
| non-character | Permanently reserved for internal use. Code points in the ranges FFFE-FFFF and FDD0-FDEF (hex). |
| reserved | All unassigned code points, that is code points that are not one of the above. |
The table below lists some code points along with their representation, label and category:
| Code point (hex) | Representation | Label | Category |
|---|---|---|---|
| E9 | é | Latin small letter e with acute | graphic (letter - lower case) |
| 600 | | Arabic number sign | format (other) |
| D6A1 | 횡 | Hangul syllable hoeng | graphic (letter - other) |
| B4 | ´ | Acute accent | graphic (symbol - modifier) |
| F900 | 豈 | Chinese, Japanese, Korean (CJK) compatibility ideograph | graphic (letter - other) |
A piece of text is logically just a sequence of code points, where each code point represents a part of the text. For example, the piece of text:
豈 ↔ how?
consists of the following code points:
| Code point (hex) | Representation | Label |
|---|---|---|
| F900 | 豈 | Chinese, Japanese, Korean (CJK) compatibility ideograph |
| 20 | | Space |
| 2194 | ↔ | Left right arrow |
| 20 | | Space |
| 68 | h | Latin small letter h |
| 6F | o | Latin small letter o |
| 77 | w | Latin small letter w |
| 3F | ? | Question mark |
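The code point sequence of a string can be inspected directly in most programming languages. The following minimal sketch (Python, used here purely for illustration) lists the code points of the example string above:

```python
# List the code points (in hex) of the example string.
text = "\uF900 \u2194 how?"   # 豈 ↔ how?
print([f"{ord(c):X}" for c in text])
# ['F900', '20', '2194', '20', '68', '6F', '77', '3F']
```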
The code point sequence defines the text itself. There are a number of different ways that a code point sequence can be saved on a computer. One method, called UTF-32, represents each code point as a 32-bit (4 byte) quantity. Such a scheme uses a large amount of storage space, as most text consists of Latin alphabet (ASCII) characters, each of which can be represented in a single byte.
Another encoding is UTF-8. This allows ASCII characters (code points 00-7F) to be stored as a single byte, with multiple bytes used for higher code points. UTF-8 is very space efficient where the text consists mainly of ASCII characters, and the World Wide Web has adopted it as the preferred encoding for Unicode code points. EMu also uses UTF-8 as its encoding method. Below, we show a string encoded in UTF-32 with a space between each code point:
豈 ↔ how?
0000F900 00000020 00002194 00000020 00000068 0000006F 00000077 0000003F
And the same string encoded in UTF-8:
EFA480 20 E28694 20 68 6F 77 3F
As you can see, the UTF-8 encoding saves considerable space.
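The relative sizes of the two encodings can be checked with a short sketch (Python, for illustration only; not part of EMu):

```python
# Compare the storage required by UTF-32 and UTF-8 for the example string.
text = "\uF900 \u2194 how?"        # 豈 ↔ how?

utf32 = text.encode("utf-32-be")   # 4 bytes per code point
utf8 = text.encode("utf-8")        # 1 to 3 bytes per code point here

print(len(utf32))   # 32 bytes
print(len(utf8))    # 12 bytes
```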
Prior to EMu 5.0 either UTF-8 or ISO-8859-1 could be configured as the encoding method. EMu 5.0 drops support for ISO-8859-1 and only supports UTF-8 encoded characters. The change means that moving to EMu 5.0 requires all data to be converted from ISO-8859-1 to UTF-8 before the system may be used. The upgrade process performs this important function.
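The conversion from ISO-8859-1 to UTF-8 is a re-encoding of each character; a minimal sketch of the idea (Python, illustrative only, not the actual upgrade code):

```python
# Re-encode an ISO-8859-1 (Latin-1) byte string as UTF-8.
latin1_bytes = b"Fr\xe9deric"      # "Fréderic" stored as ISO-8859-1
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(utf8_bytes)                  # b'Fr\xc3\xa9deric'
```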
Inputting Unicode Characters
Now that we understand that text is made up of a sequence of Unicode code points, it is worth considering how these characters can be entered into EMu. Two methods are available: escaped code points and raw characters.
The escaped code point mechanism allows an escape sequence to be placed in a text string to represent a Unicode code point. When the string is sent to the EMu server, the escape sequence is converted into a Unicode code point encoded in UTF-8.
For example, if the text Fr\u{E9}deric was input while creating or modifying a record, the data saved would be Fréderic. The format of the escape sequence is \u{x}, where x is the code point, in hexadecimal, of the Unicode character required. The escape sequence may also be used when entering search terms.
The escape sequence may also be used in texql statements wherever a string constant is required. For example, the query statement:
select NamFirst from eparties where NamFirst contains 'Fr\u{E9}deric'
will find all Parties records where the First Name is Fréderic (and variations where diacritics are ignored). The escape sequence format may also be used for data imported into EMu via the Import facility.
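Conceptually, expanding such an escape sequence is just a matter of replacing each \u{x} with the character for code point x. The sketch below illustrates the idea in Python; it is not the EMu server implementation:

```python
import re

def expand_escapes(text: str) -> str:
    # Replace each \u{XXXX} escape with the corresponding Unicode character.
    return re.sub(r"\\u\{([0-9A-Fa-f]+)\}",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(expand_escapes(r"Fr\u{E9}deric"))   # Fréderic
```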
The raw character method involves pasting Unicode characters into the required EMu field. There are a number of ways of adding Unicode characters to the Windows clipboard. One way is to use the Windows Character Map application. This can be found on a Windows PC by selecting Search on the Windows Start menu (or pressing the Windows logo key and the letter S at the same time) and searching for charmap.
With the Windows Character Map application it is possible to select a character and copy it to the clipboard. By selecting Advanced view, it is possible to search for a character by name. For example, to find the oe ligature character (œ), enter oe ligature in the Search for: field and press Search. A grid of all matching characters is displayed.
Double-click the required character, then press Copy to place it on the Windows clipboard. The character can then be pasted into EMu.
Another way to add a Unicode character to the Windows clipboard is to use a website that allows Unicode characters to be searched for and displayed; several such sites exist. With these sites it is possible to search for a character by name or code point (in hex).
Highlight the character on the page and copy it to the clipboard. The character can then be pasted into the required EMu field.
These websites also display the code point for the character; the code point for œ is hex 153. If you wanted to use the escaped code point method, the escape sequence to use would be:
\u{153}
Tip: If you need to enter certain Unicode characters on a regular basis, you could create a WordPad (or Word) document that contains the characters. When you need a character, simply copy the character from the document and paste it into EMu, without the need to search for the character.
Graphemes
It is important to understand that what we think of as a character, that is a basic unit of writing, may not be represented by a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points.
For example, "g" + acute accent (ǵ
) is a user-perceived character as we think of it as a single character, however it is represented by two Unicode code points (67 301). A user-perceived character, which consists of one or more code points, is known as a grapheme. The use of graphemes is important for:
- collation (sorting);
- regular expressions;
- indexing; and
- counting character positions within text.
EMu uses graphemes as the basic building block for text. Thus a text string is handled as a sequence of graphemes.
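The difference between code points and graphemes can be seen by counting both for the same string. A minimal sketch using the third-party Python regex module (which supports \X for grapheme clusters), for illustration only:

```python
import regex  # third-party "regex" package; \X matches one grapheme cluster

text = "g\u0301a\u0301"              # ǵá: four code points, two graphemes
print(len(text))                     # 4 code points
print(regex.findall(r"\X", text))    # ['ǵ', 'á'] - 2 graphemes
```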
A grapheme consists of one or more base code points followed by zero or more zero width code points and zero or more non-spacing mark code points. In the case of "g" + acute accent (ǵ), the letter g is the base code point (67) and the acute accent is a non-spacing mark code point (301). The table below shows some multiple code point graphemes:
| Grapheme | Code points |
|---|---|
| 각 | 1100 (ᄀ) Hangul choseong kiyeok (base code point); 1161 (ᅡ) Hangul jungseong a (base code point); 11A8 (ᆨ) Hangul jongseong kiyeok (base code point) |
| d̥́ | 64 (d) Latin small letter d (base code point); 325 ( ̥ ) combining ring below (non-spacing mark); 301 ( ́ ) combining acute accent (non-spacing mark) |
| á | 61 (a) Latin small letter a (base code point); 301 ( ́ ) combining acute accent (non-spacing mark) |
Some common multiple code point graphemes have been combined into a single code point. For example, the last entry in the table above, á, can also be represented by the single code point E1. Hence we have two representations, or two graphemes, that represent the same character (á in this case).
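The two representations can be compared directly; a short sketch (Python, for illustration) showing that á as a single code point and á as a base letter plus combining mark are the same user-perceived character:

```python
import unicodedata

single = "\u00E1"     # á as one code point (E1)
combined = "a\u0301"  # á as base letter a (61) + combining acute accent (301)

print([f"{ord(c):X}" for c in single])    # ['E1']
print([f"{ord(c):X}" for c in combined])  # ['61', '301']

# Normalising makes the two representations compare equal.
print(unicodedata.normalize("NFC", combined) == single)   # True
```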
Index Terms
An index term is the basic unit for searching. It is a sequence of one or more graphemes that can be found in a search but for which searching of sub-parts is not supported (except via regular expressions). EMu provides word based searching, so an index term generally corresponds to a word: you can search for a word, and records that contain that word will be matched. In languages that define a word as a sequence of letters separated by spaces or punctuation, an index term corresponds to a word. In languages in which a single (or sometimes multiple) letter makes up a word, such as kanji, an index term corresponds to each individual letter. EMu 5.0 added support for searching for punctuation, so each punctuation character is also considered to be an index term.
Consider the following text:
香港 is Chinese for "Hong Kong" (香 = fragrant, 港 = harbour).
The index terms for the above text are:
| Index Term |
|---|
| 香 |
| 港 |
| is |
| Chinese |
| for |
| " |
| Hong |
| Kong |
| " |
| ( |
| 香 |
| = |
| fragrant |
| , |
| 港 |
| = |
| harbour |
| ) |
| . |
Each of the above terms can be used in a search and the query will be able to use the high speed indexes to locate the matching records. It is possible to use regular expression characters (e.g. fra\* to find all words beginning with fra) to search for sub-parts of words, however the high speed indexes will not be used in this case (unless partial indexing is enabled).
Each index term is folded and converted to its base form. The folding process, as described earlier, removes case significance from the term. The conversion to its base form involves removing all "mark" code points from the term and then converting the remaining code points to their compatible form (as defined by the Unicode 8.0 standard). The compatible form for a code point is a mapping from the current code point to a base character that has the same meaning. For example, the code point for subscript five (₅) has a compatible code point of the digit five (5).
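A minimal sketch of this kind of folding and base-form conversion, using Python's unicodedata module purely for illustration (EMu's actual implementation may differ):

```python
import unicodedata

def base_form(term: str) -> str:
    # Case-fold, decompose to the compatibility form (NFKD),
    # then drop all "mark" code points.
    decomposed = unicodedata.normalize("NFKD", term.casefold())
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

print(base_form("Sacré"))   # sacre
print(base_form("\u2085"))  # 5  (subscript five maps to the digit five)
```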
The table below shows some more examples of compatibility mappings:

| Type | Code point | Compatible form |
|---|---|---|
| Font variants | ℌ | H |
| | ℍ | H |
| Positional variants | ﻉ | ع |
| | ﻊ | ع |
| | ﻋ | ع |
| | ﻌ | ع |
| Circled variants | ① | 1 |
| Width variants | ｶ | カ |
| Rotated variants | ︷ | { |
| | ︸ | } |
| Superscripts / subscripts | i⁹ | i9 |
| | i₉ | i9 |
Unfortunately, some of the compatibility mappings in the Unicode 8.0 standard are narrower than we might expect when searching text. For example, the oe ligature (œ) does not map to the characters "oe". So the French word cœur ("heart") does not have an index term of coeur, but remains as cœur. When searching you would need to enter cœur as the search term; a search for coeur would not find it.
In order to correct some of the compatibility mappings, EMu provides a mapping file in which a code point can be mapped to its compatible code point(s); hence "œ" can be mapped to "oe". The mapping file is located in the Texpress installation directory in the etc/unicode/base.map file.
    #
    # The following file is used to extend the Unicode NFKD mappings for
    # characters not specified in the standard. The format of the file is
    # a sequence of numbers as hex. Each number represents a single code
    # point in UTF-32 format. The first code point is the code point to map
    # and the second and subsequent code points are what it maps to.
    #
    00C6 0041 0045    # Latin capital letter AE -> A E
    00E6 0061 0065    # Latin small letter ae -> a e
    00D0 0044         # Latin capital letter Eth -> D
    00F0 0064         # Latin small letter eth -> d
    00D8 004F         # Latin capital letter O with stroke -> O
    00F8 006F         # Latin small letter o with stroke -> o
    00DE 0054 0068    # Latin capital letter Thorn -> Th
    00FE 0074 0068    # Latin small letter thorn -> th
    0110 0044         # Latin capital letter D with stroke -> D
    0111 0064         # Latin small letter d with stroke -> d
    0126 0048         # Latin capital letter H with stroke -> H
    0127 0068         # Latin small letter h with stroke -> h
    0131 0069         # Latin small letter dotless i -> i
    0138 006B         # Latin small letter kra -> k
    0141 004C         # Latin capital letter L with stroke -> L
    0142 006C         # Latin small letter l with stroke -> l
    014A 004E         # Latin capital letter Eng -> N
    014B 006E         # Latin small letter eng -> n
    0152 004F 0045    # Latin capital ligature OE -> O E
    0153 006F 0065    # Latin small ligature oe -> o e
    0166 0054         # Latin capital letter T with stroke -> T
    0167 0074         # Latin small letter t with stroke -> t
Compatible mappings may be added to the file as required.
Note: If the file is modified, a complete reindex of the system is required in order for the new mappings to be used to calculate the index terms.
If you consider the French phrase:
Sacré-Cœur est situé à Paris.
the index terms after folding and conversion to base form are:
| Index Term |
|---|
| sacre |
| coeur |
| est |
| situe |
| a |
| paris |
| . |
When a record is saved in EMu all index terms are folded and converted to their base form before indexing occurs. Similarly, when a search is performed, the query terms are folded and converted to their base form before the search commences. Hence a search for "coeur" or "Cœur" or even "COEUR" will still match the text in the French phrase above.
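A sketch of how an extra mapping, like the base.map entry for œ, can be applied on top of the standard folding (Python, illustrative only):

```python
import unicodedata

# Extra compatibility mapping mirroring the base.map entry 0153 006F 0065.
EXTRA = {"\u0153": "oe"}   # œ -> oe

def index_term(term: str) -> str:
    folded = unicodedata.normalize("NFKD", term.casefold())
    stripped = "".join(c for c in folded
                       if not unicodedata.category(c).startswith("M"))
    return "".join(EXTRA.get(c, c) for c in stripped)

print(index_term("Cœur"))                          # coeur
print(index_term("COEUR") == index_term("cœur"))   # True
```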
Auto-Phrasing
For use in EMu, each Unicode grapheme is classified into one of three categories.
The categories are:
| Category | Description |
|---|---|
| combining | A grapheme that is a simple letter or number. It is not a word in its own right but requires other characters to form words. Examples are Latin, Arabic and Hebrew letters and numbers. |
| single | A single grapheme that is used to represent a base word or meaning. Examples are kanji and punctuation characters. |
| break | A character that delineates words, typically a space character. |
Consider the following text:
香港 = "Hong Kong".
The graphemes along with categories are:
| Grapheme | Category |
|---|---|
| 香 | single |
| 港 | single |
| (space) | break |
| = | single |
| (space) | break |
| " | single |
| H | combining |
| o | combining |
| n | combining |
| g | combining |
| (space) | break |
| K | combining |
| o | combining |
| n | combining |
| g | combining |
| " | single |
| . | single |
EMu uses the category to determine what constitutes an index term. Each single grapheme is treated as a separate index term, while combining graphemes are joined together to form a "word" until a break or single category grapheme is encountered. A break grapheme is not an index term and is discarded.
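The following sketch shows how such a categorisation can drive the splitting of text into index terms. The classification rules here are greatly simplified and are only an illustration of the idea, not EMu's actual rules:

```python
import unicodedata

def category(ch: str) -> str:
    # Simplified classification: whitespace is "break", letters and digits
    # outside the CJK range are "combining", everything else is "single".
    if ch.isspace():
        return "break"
    if unicodedata.category(ch).startswith(("L", "N")) and ord(ch) < 0x2E80:
        return "combining"
    return "single"

def index_terms(text: str):
    terms, word = [], ""
    for ch in text:
        kind = category(ch)
        if kind == "combining":
            word += ch
            continue
        if word:                 # a break or single grapheme ends the word
            terms.append(word)
            word = ""
        if kind == "single":     # break graphemes are discarded
            terms.append(ch)
    if word:
        terms.append(word)
    return terms

print(index_terms('香港 = "Hong Kong".'))
# ['香', '港', '=', '"', 'Hong', 'Kong', '"', '.']
```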
In general, a phrase-based search must be performed where you want to find records in which a list of index terms occurs sequentially. For example, to find the two kanji characters 香港 (Hong Kong) next to each other, the query \"香 港\" may be used. Where a grapheme is part of the single category (like the two kanji characters), the system knows what the index term is and is able to treat a run of them as a phrase provided a break character is not found. In fact EMu treats a combination of combining and single graphemes as a phrase, without the need for the phrase operator, until a break grapheme is encountered. This process is known as auto-phrasing.
Auto-phrasing means that a query of 香港 is equivalent to \"香 港\" without the need to add the quotes or space. Another example is an email address such as fred@global.com. In this case the index terms fred, @, global, ., com must be located sequentially. Auto-phrasing effectively allows you to enter non-space separated terms and EMu will retrieve records where the terms are adjacent. If you do not want the terms to appear next to each other, for example if you want to find 香 (fragrant) and 港 (harbour) anywhere in a record, simply place a space between the two kanji characters to disable auto-phrasing.
Collation
Collation is the general term for the process of determining the sorting order of strings of characters. EMu 5.0 onwards uses the Default Unicode Collation Element Table (DUCET), as defined in the Unicode 8.0 standard, to determine how text should be sorted. DUCET provides a locale independent mechanism for ordering values.
If you are interested in the ordering used by DUCET, please consult the Unicode Collation Chart.