How it works
The integration of Solr indexing with Texpress provides an alternative but compatible way of searching for records. In this section the integration is examined in detail, providing useful information for those who would like to interact with the Solr indexes directly. Such interaction may be useful for web-based systems.
When the schema for a Texpress table is saved using texdesign, the equivalent Solr schema is generated. The Texpress schema is located in the file:
data/table/ins
which contains the Insertion Form; and the Solr schema is found in the file:
solr/table/solr/conf/schema.xml
The Solr schema is in two sections. The first contains the field types used to define the indexing types to apply to fields. The field types are enclosed within the <fieldType> XML tag. For example:
<fieldType name='string' class='solr.StrField' stored='false' />
<fieldType name='strings' class='solr.StrField' stored='false' multiValued='true' />
<fieldType name='int' class='solr.LongPointField' stored='false' />
<fieldType name='ints' class='solr.LongPointField' stored='false' multiValued='true' />
The second section lists the field indexes, which are defined using the <field> XML tag. For example:
<field name='irn_value' type='value' />
<field name='SummaryData_text' type='text' />
<field name='SummaryData_phonetic' type='phonetic' />
<field name='ExtendedData_text' type='text' />
The Solr schema file is generated automatically and must not be altered by hand. It is very important that the Texpress schema and the Solr schema are in sync.
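For applications that need to discover which indexes exist, the generated schema can be read (never written) programmatically. The following is a minimal sketch, assuming the field types and fields sit under a single root element; the `SAMPLE_SCHEMA` fragment and the `summarise_schema` helper are illustrative names, and a real schema would be read from solr/table/solr/conf/schema.xml instead.

```python
import xml.etree.ElementTree as ET

# Sample fragment modelled on the examples above. The <schema> root
# element is an assumption made so the fragment parses on its own.
SAMPLE_SCHEMA = """
<schema>
  <fieldType name='string' class='solr.StrField' stored='false' />
  <fieldType name='strings' class='solr.StrField' stored='false' multiValued='true' />
  <field name='irn_value' type='value' />
  <field name='SummaryData_text' type='text' />
  <field name='SummaryData_phonetic' type='phonetic' />
</schema>
"""


def summarise_schema(xml_text):
    """Return (field types, fields) found in a Solr schema document."""
    root = ET.fromstring(xml_text)
    # First section: field type name -> Solr class.
    field_types = {ft.get("name"): ft.get("class") for ft in root.iter("fieldType")}
    # Second section: field (index) name -> field type name.
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}
    return field_types, fields


field_types, fields = summarise_schema(SAMPLE_SCHEMA)
for name, type_name in fields.items():
    print(f"{name}: {type_name}")
```

The sketch only reads the schema, in keeping with the rule that the file must not be altered by hand.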
The Solr field types declare the type of index to be built for a given data type. Each field type maps onto a Texpress index type. The table below shows the mapping between the Solr field type and the corresponding Texpress index type:
Solr field type | Texpress indexing type | Comments |
---|---|---|
| | | Word-based index where each word and symbol is indexed. |
| | | Same as |
| | | Word-based index where the phonetic (sounds like) of each word and symbol is indexed. |
| | | Same as |
| | | Word-based index where the stem (base word) of each word and symbol is indexed. |
| | | Same as |
| | | The exact value for range and tuple-based fields. Range fields are date, time, latitude and longitude fields. Tuple fields are Texpress library items with more than one field or key items. |
| | | Same as |
| | | The minimum value for a partial value (e.g. Jan 2020 has a minimum value of 1st Jan 2020). |
| | | Same as |
| | | The maximum value for a partial value (e.g. Jan 2020 has a maximum value of 31st Jan 2020). |
| | | Same as |
| | | Term-based index where the complete value is indexed as a single term. |
| | | Same as |
| | | Integer number-based index. Supports exact value and range-based searching. |
| | | Same as |
| | | Floating point number-based index. Supports exact value and range-based searching. |
| | | Same as |
| | | Index used to search whether a field is empty or non-empty. |
Each field in the Texpress schema with indexing enabled will have an entry in the Solr schema. For each Texpress index type enabled, a corresponding Solr field of the same index type is generated. The name given to the Solr field is the Texpress field name followed by an underscore and the Solr field type. For example, the NamTitle field in the eparties module has the following Texpress index types enabled:
HASH (word)
STEM
NULL
The corresponding Solr field definitions are:
NamTitle_text
NamTitle_stem
NamTitle_null
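The naming rule can be sketched in a few lines. The mapping below is deliberately limited to the three index types shown in the NamTitle example; the `INDEX_TYPE_TO_SOLR` table and the `solr_field_names` helper are illustrative names, not part of Texpress.

```python
# Texpress index type -> Solr field type, limited to the three pairs
# shown in the NamTitle example; other index types are omitted here.
INDEX_TYPE_TO_SOLR = {
    "HASH (word)": "text",
    "STEM": "stem",
    "NULL": "null",
}


def solr_field_names(field_name, index_types):
    """Build Solr field names as <field name>_<Solr field type>."""
    return [f"{field_name}_{INDEX_TYPE_TO_SOLR[t]}" for t in index_types]


print(solr_field_names("NamTitle", ["HASH (word)", "STEM", "NULL"]))
# → ['NamTitle_text', 'NamTitle_stem', 'NamTitle_null']
```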
When a record is saved, Texpress processes each field one at a time. For each field it generates the terms for each index type enabled. When Solr indexing is enabled, the terms are stored in the field corresponding to the index type. For example, if a record had the value Doctor in the NamTitle field, the following values would be added to each Solr field:
Field | Value |
---|---|
NamTitle_text | doctor |
NamTitle_stem | doct |
NamTitle_null | false |
The term Doctor is converted to lowercase by Texpress so that case-insensitive searching is the default.
When a search is performed on a table, a check of its indexing type is made. If Solr indexing is enabled, the query engine generates a Solr query on the required index supplying the query term. For example, if a search is performed on eparties for records where the Title field (NamTitle) contains the value Doctor, the Solr query NamTitle_text:doctor is generated and sent to Solr for processing. Solr returns a list of record offsets matching the query supplied. The offsets are added to the list of matching records.
In order to reduce indexing overhead, the query terms generated by Texpress are added to the Solr indexes but not stored as data: the query terms can be searched but not displayed. This approach avoids the overhead of saving the indexed values for each record. In addition, the complete data for the record is not stored, only the offset in the Texpress data file where the record is located. The combination of these two optimizations vastly reduces the indexing overhead. For example, the indexing size for a 2GB data file for the eaudit table using Texpress indexing is 3.7GB, while the size for Solr indexing is 630MB (0.6GB). In ratio terms the Texpress indexing in this case is roughly six times larger than Solr indexing.
If the solrdata option is enabled, the Solr indexes will also contain a field called _data_. The field is not used by EMu, but is available to third-party applications that search Solr directly. The field value is a string containing a JSON representation of the indexed record. If the _data_ field is to be retrieved, the fl (Field List) parameter in Solr should contain _data_:[json] so that the _data_ string field is returned as JSON rather than as a quoted string.
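As a rough illustration of the fl parameter in use, the sketch below builds the query string and pulls the _data_ objects out of a Solr JSON response. The helper names are illustrative, and the sample response body, including the record content inside _data_, is invented for the example; the actual layout of the JSON record is not described here.

```python
import json
from urllib.parse import urlencode


def data_request_params(query):
    """Query parameters asking Solr to return _data_ as nested JSON."""
    return urlencode({"q": query, "fl": "_data_:[json]", "wt": "json"})


def extract_records(response_text):
    """Return the _data_ objects from a Solr JSON response body."""
    body = json.loads(response_text)
    return [doc["_data_"] for doc in body["response"]["docs"] if "_data_" in doc]


# Hypothetical response body; the record content is invented for illustration.
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"_data_": {"NamTitle": "Doctor"}}]}}'

print(data_request_params("NamTitle_text:doctor"))
print(extract_records(sample))
```

Without the [json] transformer, each _data_ value would arrive as a string and would need a second json.loads call.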
The JSON record generated by the solrdata option contains the full data for the record, and all records are stored. In effect, the EMu security system is bypassed if this option is enabled. Care must be taken when retrieving data to ensure any institution-based policies are observed before displaying data. If third-party access is required where record-level security is observed, the EMu RESTful API should be used.