How Unicode affects searching in EMu

Now that we understand what an index term is, we can talk about searching. The incorporation of Unicode into EMu has resulted in the searching mechanism being extended to handle all code points that have a base representation. In essence, this is all graphic code points except for marks and spaces, namely:

letters

numbers

punctuation

symbols

The inclusion of punctuation as an index term means that punctuation may now be included in searches and the high speed indexes will be used to locate matches.


An issue arises in versions of EMu prior to 5.0 as certain punctuation characters are used to adjust the type of searching performed. For example, in EMu 4.3 a search for:

@John will find all records containing words that sound like John (phonetic searching).
^joh* uses wildcards to match records where the first word starts with the letters joh (case ignored).
=John will locate records containing John with case significance (that is an upper case J and lower case ohn).

Because versions of EMu prior to 5.0 removed punctuation and symbols from searching (only letters and numbers were supported) there was no ambiguity about the punctuation associated with search terms. As EMu 5.0 does allow symbols and punctuation to be searched for, some ambiguity can creep in. For example, what does searching for fred@global.com mean? In EMu 4.3 it would have meant finding:

"fred"

AND the phonetic of "global"

AND "com"

However, in EMu 5.0 onwards, is the @ character to be treated as punctuation, or does it mean the phonetic of the word "global"?

When searching for a word prior to EMu 5.0 you simply entered the word and performed the search. We have taken the same approach in EMu 5.0 with punctuation characters. In other words, when you have punctuation in a search, only records containing the punctuation are matched. Thus, in the previous example the @ character is treated as punctuation and so must appear in matching records.

How then do we indicate that the @ character means we want the phonetic version of the following word? We precede the character with a special marker indicating that the character is to take on its phonetic meaning. The marker character used is the backslash (\) character. The introduction of a marker character to alter the meaning of a character is not new in EMu. For example, \n can be used in strings to represent the newline character; similarly, \u{} is used to introduce the escape sequence for a Unicode code point.

There is a simple rule to determine how to format a search in EMu 5.0 onwards:

All graphic characters, except for spaces and marks, in a search are matched as the character. Where the special meaning of a character (e.g. @) is required, the character must be preceded by the backslash (\) escape character. The only exception to this rule is that the backslash character itself must be entered twice (\\) where the actual character is required.
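
One consequence of this rule is that preparing text for a purely literal search needs very little work. The sketch below is illustrative only: `literal_search_text` is a hypothetical helper and not part of EMu, and it assumes the search text is being assembled in Python before being passed to EMu.

```python
def literal_search_text(text: str) -> str:
    """Return text so that every character is matched as itself in EMu 5.0+.

    Per the rule above, punctuation such as @ or ^ already matches
    literally; only the backslash must be doubled.
    """
    return text.replace("\\", "\\\\")


print(literal_search_text("fred@global.com"))  # no change needed
print(literal_search_text(r"C:\temp"))         # the backslash is doubled
```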

The table below compares some searches in EMu 4.3 and their equivalent in EMu 5.0:

Find	EMu 4.3	EMu 5.0
Records containing `Fred`	`fred`	`fred`
Records where `Fred` is the only word in the field	`^fred$`	`\^fred\$`
Records that contain `Fred` phonetically	`@fred`	`\@fred`
Records containing `Fred` with matching case	`=Fred`	`\=Fred`
Records containing the phrase `Sacré-Cœur`	`"sacré cœur"`	`\"sacre-coeur\"`
Records where `blue` and `sky` are within five index terms of each other	`'(blue sky) <= 5 words'`	`\'\(blue sky\) <= 5 words\'`

In the following sections we look at all available special search operators and show examples of their use in EMu 5.0 onwards. Each of the operators is displayed with its leading escape character, the backslash character:

Transformations: ~, &, @, =, ==

A transformation is an operator applied to a search term to alter how the term is interpreted. The table below lists all valid transformations:

Transformation	Type of search	Description
`\~`	Stemming	Search for all variations of a word. For example, searching for `\~elect` will match `elect`, `election`, `electing` and `elected`, but not `electricity` (its base word is `electric`)
`\&`	Case	Ignore the case (upper or lower) of the search term. This is the default transformation if one is not specified explicitly.
`\@`	Phonetic	Use phonetic or sounds like searching for the specified word.
`\=`	Case	Perform the search using case significance for the following word.
`\==`	Diacritics	Perform the search not only matching the case but also matching any marks (diacritics).

A transformation is always applied to a word and is placed immediately before the word to which it applies. Some examples are:

Find	Search
Records containing all tenses of the word `locate`.	`\~locate`
Records where `melbourne` is all in lower case.	`\=melbourne`
Records with `Sacré` and `Cœur` exactly as specified, that is matching case and diacritics, but not necessarily next to each other.	`\==Sacré \==Cœur`
Records containing words similar to `smythe` phonetically.	`\@smythe`
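
Because a transformation is simply a prefix on a word, search strings like those above can be assembled mechanically. The following is a minimal sketch; the dictionary keys and the `transform` function are hypothetical names chosen for illustration, and only the operator characters themselves come from the table of transformations.

```python
# Operator characters from the transformations table; the keys are
# illustrative labels, not EMu terminology.
TRANSFORMATIONS = {
    "stem": "~",
    "ignore_case": "&",
    "phonetic": "@",
    "case": "=",
    "case_and_marks": "==",
}


def transform(kind: str, word: str) -> str:
    """Prefix word with the escaped transformation operator."""
    return "\\" + TRANSFORMATIONS[kind] + word


print(transform("stem", "locate"))      # \~locate
print(transform("phonetic", "smythe"))  # \@smythe
# Each word needs its own transformation:
print(" ".join(transform("case_and_marks", w) for w in ["Sacré", "Cœur"]))
```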

Regular Expressions: ?, *, [any one of], {one or more of}

Regular expressions provide a mechanism for searching for patterns in a word. With regular expressions, sub-parts of a word may be matched. In general the high speed indexes cannot be used with regular expression searches. The only exception is trailing regular expressions (that is a regular expression that has leading letters), where partial indexing has been enabled.

Regular expressions can be intermixed with the \= and \== transformations to enforce case and diacritic significance.

The table below lists all valid regular expressions:

Regular Expression	Type of search	Description
`\?`	Wildcard	Matches any single grapheme.
`\*`	Wildcard	Matches zero or more graphemes.
`\[`any one of`\]`	Pattern Matching	Matches exactly one grapheme from the set given between the brackets. The set may list individual graphemes, or a range given as a beginning and end grapheme separated by a minus sign (e.g. `a-z`).
`\{`one or more of`\}`	Pattern Matching	Matches one or more graphemes from the set given between the braces. The set may list individual graphemes, or a range given as a beginning and end grapheme separated by a minus sign (e.g. `0-9`).

Some examples are:

Find	Search
Records containing words starting with `abs`.	`abs\*`
Records containing Arabic numbers.	`\{0-9\}`
Records with a three grapheme word.	`\?\?\?`
Records with `organisation` spelled with either an `s` or `z`.	`organi\[sz\]ation`
Records with at least one word containing a capital `S`.	`\=\*S\*`
Records containing either an upper case or lower case `é`.	`\==\*\[éÉ\]\*`
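
To make the meaning of these operators concrete, the sketch below translates an EMu 5.0 pattern for a single word into a Python regular expression. It is an approximation for illustration, not EMu's implementation: the function names are hypothetical, transformations such as `\=` are not handled, and Python's `re` module works on code points rather than graphemes, so a letter followed by a separate combining mark counts as two characters here.

```python
import re


def emu_pattern_to_regex(pattern: str) -> str:
    """Translate \\?, \\*, \\[...\\] and \\{...\\} into Python regex syntax."""
    out = []
    in_set = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash")
            op = pattern[i + 1]
            i += 2
            if op == "?":
                out.append(".")        # any single character
            elif op == "*":
                out.append(".*")       # zero or more characters
            elif op in ("[", "{"):
                out.append("[")
                in_set = True
            elif op == "]":
                out.append("]")        # exactly one from the set
                in_set = False
            elif op == "}":
                out.append("]+")       # one or more from the set
                in_set = False
            elif op == "\\":
                out.append(re.escape("\\"))  # \\ is a literal backslash
            else:
                raise ValueError(f"unsupported operator \\{op}")
        else:
            # Unescaped characters are literal; inside a set, keep '-'
            # unescaped so that ranges such as a-z still work.
            out.append("-" if in_set and ch == "-" else re.escape(ch))
            i += 1
    return "".join(out)


def word_matches(pattern: str, word: str, case_sensitive: bool = False) -> bool:
    """Check one word against an EMu-style pattern (case ignored by default)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(emu_pattern_to_regex(pattern), word, flags) is not None


print(word_matches(r"organi\[sz\]ation", "Organization"))
print(word_matches(r"\?\?\?", "cats"))
```

Case is ignored by default here to mirror the default transformation described earlier; passing `case_sensitive=True` roughly approximates the effect of `\=`.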

Anchors: ^, $

Anchors are used to indicate that a search term should be located as either the first or last word in a piece of text. Anchors can be used in combination with all other types of search operators, namely transformations, regular expressions, phrases and proximity.

The table below lists all valid anchors:

Anchors	Type of search	Description
`\^`	Wildcard	The search term following must be the first word in the text.
`\$`	Wildcard	The preceding search term must be the last word in the text.

Some examples are:

Find	Search
Records that have text ending in a question mark.	`?\$`
Records with text beginning with the word `the`.	`\^the`
Records where the text contains only the word `Unknown`.	`\^Unknown\$`
Records with text where the first word starts with a lower case Latin letter.	`\^\==\[a-z\]\*`
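
The effect of anchors can be pictured as a test on the first or last index term of a piece of text. The sketch below is a simplified model, not EMu's indexing: `index_terms` is a hypothetical tokenizer that assumes each word and each punctuation character is a separate index term, and case is ignored as in a default search.

```python
import re


def index_terms(text: str) -> list:
    """Split text into lower-cased words and single punctuation marks.

    Assumption: this stands in for EMu's index terms for illustration only.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def anchored_match(text: str, term: str, first: bool = False, last: bool = False) -> bool:
    """Model \\^term (first=True), term\\$ (last=True) or \\^term\\$ (both)."""
    terms = index_terms(text)
    term = term.lower()
    if not terms:
        return False
    if first and last:
        return terms == [term]
    if first:
        return terms[0] == term
    if last:
        return terms[-1] == term
    return term in terms


print(anchored_match("Is the date known?", "?", last=True))               # models ?\$
print(anchored_match("Unknown origin", "Unknown", first=True, last=True))  # models \^Unknown\$
```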

Proximity

Proximity searching provides a mechanism for finding a list of words within a specified distance (either words, sentences or paragraphs). EMu supports two types of proximity searches:

The first is phrase searches where the words must appear next to each other and in the order they are specified. The words in a phrase search may have transformations, regular expressions and anchors applied.

The second is a regular proximity search. Proximity searches may include transformations, regular expressions, anchors and phrases.

The table below lists all valid proximity operators:

Proximity	Type of search	Description
`\"`search terms`\"`	Phrase	The search terms enclosed within the phrase operator (`\"`) must appear next to each other and in the order they are specified.
`\'\(`search terms`\)` distance`\'`	Proximity	The search terms may appear in any order unless otherwise specified. The distance indicates the range within which the search terms must appear.

The syntax for distance is:

[ordered] relop number type

where:

relop is one of the relational operators <, <=, =, >, >=

number is the distance to use

type is one of words, sentences or paragraphs

The keyword ordered is optional, but if given, requires the search terms to be in the order specified.
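
The distance syntax is regular enough to be checked with a small parser. The sketch below is one possible reading of the syntax, not EMu's own parser: `Distance` and `parse_distance` are hypothetical names, and accepting a singular unit (for example `1 sentence`) is an assumption.

```python
import re
from typing import NamedTuple


class Distance(NamedTuple):
    ordered: bool
    relop: str
    number: int
    unit: str  # normalised to "words", "sentences" or "paragraphs"


# [ordered] relop number type
_DISTANCE = re.compile(
    r"\s*(?P<ordered>ordered\s+)?"
    r"(?P<relop><=|>=|<|>|=)\s*"
    r"(?P<number>\d+)\s+"
    r"(?P<unit>word|sentence|paragraph)s?\s*"
)


def parse_distance(text: str) -> Distance:
    """Parse a proximity distance such as '<= 5 words'."""
    m = _DISTANCE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid distance: {text!r}")
    return Distance(
        ordered=m["ordered"] is not None,
        relop=m["relop"],
        number=int(m["number"]),
        unit=m["unit"] + "s",
    )


print(parse_distance("<= 5 words"))
print(parse_distance("ordered = 1 sentence"))
```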

Some examples are:

Find	Search
Records where the phrase `the black cat` occurs.	`\"the black cat\"`
Records containing only the phrase `Not Applicable`.	`\"\^Not Applicable\$\"`
Records where `Fred` occurs case significantly in the same sentence as the phonetic of `Smith` where `Fred` appears first.	`\'\(\=Fred \@Smith\) ordered = 1 sentence\'`
Records where the kanji character 豈 appears within 5 characters of the phrase 香港.	`\'\(豈 \"香港\"\) <= 5 words\'`
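
Phrase and proximity searches are built by wrapping the search terms in escaped delimiters, so they can be composed programmatically. The helpers below are hypothetical and simply reproduce the delimiters listed in the table of proximity operators; they do not validate the terms or the distance.

```python
def phrase(terms: str) -> str:
    """Wrap terms in the escaped phrase operator: \\"terms\\"."""
    return '\\"' + terms + '\\"'


def proximity(terms: str, distance: str) -> str:
    """Wrap terms and a distance in the escaped proximity operator."""
    return "\\'\\(" + terms + "\\) " + distance + "\\'"


print(phrase("the black cat"))
print(proximity("blue sky", "<= 5 words"))
# A phrase may be nested inside a proximity search:
print(proximity("豈 " + phrase("香港"), "<= 5 words"))
```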

Conditionals: NOT

EMu provides support for one conditional operator, NOT. The NOT operator excludes records that have the next term in the searched field.

For example, a search for \!rock roll returns records that mention sausage roll in the searched field, but not records that mention rock and roll.

The NOT operator can be applied to any of the other search operators, that is transformations, regular expressions, anchors and proximity.

The table below lists the valid conditional operator:

Conditionals	Type of search	Description
`\!`	Boolean	Records that contain the next search term in the searched field are excluded from search results.

Some examples are:

Find	Search
Records that do not contain the kanji character 豈.	`\!豈`
Records that contain anything apart from the single word `Unknown`.	`\!\^Unknown\$`
Records that do not contain the phrase `Not Applicable`.	`\!\"Not Applicable\"`
Records containing the phrase `Sacré Cœur` with case and diacritic significance but not `Paris`.	`\"\==Sacré \==Cœur\" \!Paris`
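
The behaviour of NOT alongside ordinary terms can be modelled as a filter: every plain term must be present and every negated term must be absent. The sketch below is a toy model over a list of strings, assuming whitespace-separated, case-insensitive words; the `search` function and its parameters are hypothetical and do not reflect how EMu evaluates queries internally.

```python
def search(records, required=(), excluded=()):
    """Return records containing all required words and no excluded words."""
    hits = []
    for record in records:
        words = set(record.lower().split())
        has_required = all(term in words for term in required)
        has_excluded = any(term in words for term in excluded)
        if has_required and not has_excluded:
            hits.append(record)
    return hits


records = ["sausage roll", "rock and roll", "rock pool"]
# Models the search \!rock roll
print(search(records, required=["roll"], excluded=["rock"]))
```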
