Overview of Unicode

This section looks at how information is represented in Unicode, before exploring the changes that have been implemented in EMu and how those changes affect the use of EMu 5.0 onwards.

Code Points

The basic unit of information in Unicode is known as a code point. A code point is simply a number between 0 and 10FFFF₁₆ (1,114,111 in decimal) that represents a single entity. Code points are generally written as hexadecimal (base 16) numbers. An entity may be any of the following:

Inputting Unicode Characters

Now that we understand that text is made up of a sequence of Unicode code points, it is worth considering how these characters can be entered into EMu.

Graphemes

It is important to understand that what we think of as a character, that is a basic unit of writing, may not be represented by a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points.
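
As a minimal illustration of this point, the Python sketch below builds the same visible character in two ways: as a base letter followed by a combining accent (two code points) and as a single precomposed code point. The helper name code_points is hypothetical and is used only for this example; it is not part of EMu.

```python
import unicodedata

# One user-perceived character built from two code points:
# U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# The same character expressed as a single precomposed code point, U+00E9.
precomposed = "\u00e9"


def code_points(text):
    """Return the code points of text in U+XXXX hexadecimal notation.

    Hypothetical helper for illustration only.
    """
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(decomposed))   # ['U+0065', 'U+0301']
print(code_points(precomposed))  # ['U+00E9']

# Both strings display as the same character, yet they differ in length
# and do not compare equal code point by code point.
print(len(decomposed), len(precomposed), decomposed == precomposed)  # 2 1 False

# Normalising to NFC composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Counting code points is therefore not the same as counting the characters a reader sees.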

Index Terms

An index term is the basic unit for searching. It is a sequence of one or more graphemes that can be found in a search, but whose sub-parts cannot be searched for separately (except by using regular expressions). EMu provides word-based searching: you can search for a word, and records that contain that word will be matched. In languages that define a word as a sequence of letters separated by either spaces or punctuation, an index term therefore corresponds to a word. In writing systems in which single (or sometimes multiple) characters make up a word, such as kanji, an index term corresponds to each individual character. EMu 5.0 added support for searching for punctuation, so each punctuation character is also considered to be an index term.
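
The rules described above can be sketched in Python as follows. This is only an approximation under stated assumptions, not EMu's actual indexing algorithm: the function names index_terms and is_ideograph are hypothetical, ideographs are detected by their Unicode character name, and symbols such as currency signs are simply treated as separators.

```python
import unicodedata


def is_ideograph(ch):
    # Assumption: any character whose Unicode name starts with
    # "CJK UNIFIED IDEOGRAPH" is treated as a one-character word.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def index_terms(text):
    """Split text into illustrative index terms (hypothetical helper)."""
    terms = []
    word = ""
    for ch in text:
        category = unicodedata.category(ch)
        if is_ideograph(ch) or category.startswith("P"):
            # Ideographs and punctuation each form a term of their own,
            # so any word collected so far is finished first.
            if word:
                terms.append(word)
                word = ""
            terms.append(ch)
        elif category[0] in "LNM":
            # Letters, digits and combining marks extend the current word.
            word += ch
        else:
            # Spaces and everything else (including symbols, in this
            # sketch) act as separators between words.
            if word:
                terms.append(word)
                word = ""
    if word:
        terms.append(word)
    return terms


print(index_terms("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(index_terms("漢字 test"))       # ['漢', '字', 'test']
```

A search for "world" would match the first string because "world" is one of its terms, whereas "wor" is only a sub-part of a term and would not match without a regular expression.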

Auto-Phrasing

For use in EMu, each Unicode grapheme is placed in one of three categories.

Collation

Collation is the general term for the process of determining the sorting order of strings of characters. EMu 5.0 and later use the Default Unicode Collation Element Table (DUCET), as defined in the Unicode 8.0 standard, to determine how text should be sorted. DUCET provides a locale-independent mechanism for ordering values.
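
To see why a collation table is needed at all, the Python sketch below compares sorting by raw code point value with sorting by a crude key that ignores case and accents first. The crude key is an assumption made purely for illustration: it is not DUCET and not what EMu does, and it only approximates the kind of ordering a collation algorithm produces for this small input.

```python
import unicodedata

# "\u00e9clair" is "éclair" written with the precomposed code point U+00E9.
words = ["Zebra", "apple", "\u00e9clair", "eclair"]

# Plain sorting compares code point values, so every uppercase ASCII
# letter sorts before every lowercase one, and accented letters sort
# after "z".
by_code_point = sorted(words)
print(by_code_point)  # ['Zebra', 'apple', 'eclair', 'éclair']


def crude_key(text):
    """Illustrative stand-in for a collation key (hypothetical, not DUCET)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks so that accented letters compare like their
    # base letters, then ignore case. The original text breaks ties.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)


by_crude_key = sorted(words, key=crude_key)
print(by_crude_key)  # ['apple', 'eclair', 'éclair', 'Zebra']
```

The second ordering is closer to what a reader expects from an alphabetical list, which is the kind of result a full collation table such as DUCET is designed to deliver across all scripts.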

If you are interested in the ordering used by DUCET, please consult the Unicode Collation Chart.