EMu's support for Unicode

EMu introduced full support for the Unicode 8.0 standard in version 5.0. Earlier versions of EMu could store and retrieve Unicode characters, but the system did not interpret the characters entered, which limited searching. To retrieve a record containing a Unicode character, the search term had to be entered in exactly the same case (upper or lower) and with the same diacritics. For example, a search for the name Frederic would not match Fréderic, because the é (e acute) character was not interpreted as an e with an associated diacritic.

Now EMu supports case folding and base character mapping:

  • Case folding is similar to converting a character to its lower case equivalent, except that it also handles some special cases. Its purpose is to make searching case insensitive. One special case is the German lower case sharp s character (ß), which is generally written in upper case as SS, so Großen becomes GROSSEN in upper case. A search for either term should find all case variations. To achieve this, the ß character is folded to ss for searching purposes.
  • The base version of a character is its most basic representation after all diacritics and marks have been removed. For example, the base character of é is e.

Together, case folding and base character mapping provide the basic mechanisms required for flexible searching over the full range of Unicode characters.
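
The two mechanisms can be illustrated with a short Python sketch. It is not EMu's implementation: the function name search_key is hypothetical, and the code relies on the Unicode tables shipped with the Python interpreter rather than Unicode 8.0 specifically. It only shows how folding case and removing diacritics lets differently written terms compare as equal.

```python
import unicodedata


def search_key(text: str) -> str:
    """Return a comparison key: base characters only, case folded.

    Hypothetical helper for illustration; not part of EMu.
    """
    # Decompose each character so diacritics become separate combining marks
    # (for example, é becomes e followed by U+0301 COMBINING ACUTE ACCENT).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every mark (Unicode categories Mn, Mc, Me), keeping base characters.
    base = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Case fold the result; unlike lower(), casefold() maps ß to ss.
    return base.casefold()


# Terms that differ only in case or diacritics share the same key.
print(search_key("Fréderic") == search_key("Frederic"))
print(search_key("Großen") == search_key("GROSSEN"))
```

Comparing keys rather than the original strings is one common way to get case-insensitive and diacritic-insensitive matching; how EMu applies these mappings internally is not described here.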

All data stored in EMu is encoded in UTF-8. UTF-8 is a compact way of representing Unicode characters, particularly ASCII characters, and the World Wide Web has adopted it as the character encoding for web documents. EMu now enforces the use of UTF-8 by not allowing invalid byte sequences to be stored in the system. This has implications for data imports, as all imported data must be encoded in UTF-8. In earlier versions of EMu, systems may have been configured to allow ISO-8859-1 (Latin-1) as the standard input format. ISO-8859-1 encoding is no longer supported.

Tip: The Import Tool is able to convert ANSI to UTF-8.
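
Because imported data must be valid UTF-8, it can be useful to check a file before importing it. The Python sketch below is independent of EMu and of the Import Tool. The function names are hypothetical, and the conversion assumes the source bytes really are ISO-8859-1, which cannot be detected reliably from the bytes alone.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes assumed to be ISO-8859-1 as UTF-8."""
    # Every byte value is defined in ISO-8859-1, so this decode never fails,
    # even when the assumption about the source encoding is wrong.
    return data.decode("iso-8859-1").encode("utf-8")


legacy = b"Fr\xe9deric"  # "Fréderic" encoded as ISO-8859-1
if not is_valid_utf8(legacy):
    converted = latin1_to_utf8(legacy)
    print(converted.decode("utf-8"))
```

The byte 0xE9 on its own is not a valid UTF-8 sequence, which is why the legacy value fails the check and needs converting first.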

With Unicode support, searching in EMu has been extended to include punctuation characters. It is now possible to search for punctuation either as an individual character (?) or as part of a longer string (fred@global.com).
