Indexing methods

Word indexing: Two Level method

The basic searching provided in EMu is word based. A term in word based searching is a sequence of alphabetic and numeric characters up to a word break character. A word break character consists of all punctuation characters except for apostrophes (') and underscores (_), and all space characters. For example the string:

Relax. You know you're in safe hands.

consists of the words:

relax

you
know
youre

in
safe
hands

Notice how the punctuation is removed and the case is converted to lower case (searching is case insensitive). The words above are the search terms that can be used to locate the sample string. When word based indexing is used, extra terms are generated for word pairs (two consecutive words within the same sentence). For the above sample the word pairs generated are:

relax you
you know
know youre

youre in
in safe
safe hands

The word pairs provide direct support for phrase based searching, that is a search where the words are enclosed in double quotes (e.g. "in safe hands"). Enabling word based searching automatically provides phrase based searching.

Stem indexing

In many instances it may be useful to search for variations on a base word regardless of the tense or form of the word used. Consider a search for the word election:

A word based search will return all occurrences of the word election, but it will not find elect, elected, elector, electing, etc.
A stem based search, which is specified by preceding a word with a tilde (~), will find all variations of the word.

In stem based indexing the basic term is the root of the word (in this case elect). Entering any variation of the word preceded by \~ will result in a search for the root word (e.g. searching for \~elected will result in the search term elect). The algorithm to convert a word to its root word is a complex one that has been refined over many years. It is not a simple truncation mechanism, otherwise \~elect would match electric. The rules used to determine the root word are effective for English text only.

Stem based indexing is not enabled on many fields in EMu. If it is not enabled, stem based searches are still possible, however the exhaustive search method is used. In general, fields that contain descriptive text or notes based fields are good candidates for stem indexing.

Phonetic indexing

When searching for names or scientific terms it is often useful to be able to search for records that contain words that sound like the search term. phonetic based indexing provides this functionality. The basic term for phonetic searching is a number sequence. Each word is converted into a number sequence that encodes the basic sound groups that make up the word. The number sequence is then used for the search and words that generate the same number sequence (and hence sound the same) will be matched.

A phonetic search is specified using the @ character before a word, e.g. \@term.

The algorithm used to convert a word into its basic sound group is very complex as letters can take on different sounds depending on the letters that surround them. A number of refinements to the basic algorithm have been made over many years.

Phonetic based indexing is not enabled on many fields in EMu. It is enabled on fields that typically contain names (e.g. first name and surname fields). If phonetic indexing is not enabled, phonetic based searching is still available, however the exhaustive search method is used.

String indexing

Occasionally you may want a search to return records where only the exact value entered in a field is matched. String based indexing uses the complete contents of the field as the term for retrieval and only records that exactly match the search terms will be returned. For example, two records have either one or the other of the following values in a suburb field:

Melbourne
North Melbourne

A word based search using Melbourne as the search term will return both records (as they both contain the word Melbourne). A String based search, using Melbourne will only return the first record as only one record contains Melbourne and nothing else.

String based searching is not used widely in EMu (as word based searching provides a more flexible and useful searching mechanism). It is enabled for the security fields (SecCanDisplay_tab, SecCanEdit_tab and SecCanDelete_tab) used by Record Level Security. Since Record Level Security uses group names and user names it is useful to use String based searching to avoid mismatches on names (e.g. on a group called Herpetology and another called Herpetology Managers).

Range searching

Some forms of data may be searched by locating all records between start and end points, otherwise known as a range search. EMu provides range based searching for a number of data types: latitude, longitude, date, time and numeric (both integer and floating point) types. Range based searches are specified using the following relational operators:

Less than (<)
Less than or equal to (<=)
Greater than (>)
Greater than or equal to (>=)

To search for all records inserted between 1 January 2006 and 28 February 2006, in the Date Inserted field you would specify:

>="1 January 2006" <="28 February 2006"

Note the use (and position) of the double quotes as the dates contain spaces.

Range based searching is a variation on the standard Two Level scheme used by the other indexing methods. Essentially it divides the possible values for a range based field into a series of discrete intervals. Then, for each field, value terms are generated to indicate into which intervals the value falls. Using these terms it is possible to provide efficient range based searches.

NULL indexing

Sometimes you may only be interested in whether a field contains a value or not. For example, to determine whether records have been assigned to a department, you would use the wildcard search !* in the Department field (SecDepartment_tab).

NULL based searching provides an additional term that indicates whether the field contains a value or not. This term is used to provide high speed retrieval when certain wildcard searches are used:

\!\* finds records where a field does not contain a value
\* finds records where a field contains a value

Null based searching can also provide support for searches used to determine whether an attachment field is empty or not:

\!\+ for records that do not have any attachments
\+ for records that do have one or more attachments

Partial indexing

Most wildcard based searches (i.e. searches looking for patterns or part of a word) specify leading letters. For example, you may want to find all people with a surname starting with Al*. Since the standard EMu indexing is word based, wildcard based searches must use the exhaustive search method to locate matches. Such a search may take some time for very large numbers of records.

Partial based indexing provides a mechanism for providing very efficient wildcard searching where leading letters are specified (that is where one or more letters or digits appear at the start of the pattern). The method takes each word in a field and generates a term based on the number of leading letters indexed. Part of the partial based indexing configuration is the number of letters to be used to generate partial terms. For example, a field may be configured to provide partial indexing for the first 1,3,5 letters of each word. If we take the sentence:

Relax. You know you're in safe hands.

the following terms are generated for partial based indexing:

r
rel
relax
y
you
k
kno
y
you
youre
i
s
saf
h
han
hands

Thus if we searched for rel\* the three letter terms can be searched to locate the matching records. The searching mechanism makes use of any indexing it can; for example, if you search for re\* the single letter terms are used to retrieve a set of potential matches, which are then checked to see if the second letter matches. Partial based indexing provides the same level of response for wildcard based searches where leading letters are supplied as does word based searching.

Partial indexing is enabled on a small number of fields, typically fields containing names.