Apache Solr

Mtas can be used as plugin for Apache Solr

Prerequisites

  • Installed Apache Solr
  • Currently supported and advised version is 7.3.0

Start with a new Solr core.

Libraries

Add the mtas-7.3.0.3.jar to the lib directory of the new Solr core. A prebuilt mtas-7.3.0.3.jar is available from the download page.

Furthermore, add the Apache Commons Mathematics Library to the lib directory of the new Solr core.

Solrconfig.xml

Some changes have to be made within the solrconfig.xml file, elements have to be added to the <config/> or existing elements have te be adjusted:

Define a new mtas searchComponent:

<searchComponent name="mtas" class="mtas.solr.handler.component.MtasSolrSearchComponent"/>

Add this component to the select requestHandler by inserting the following within the <requestHandler/> with name "/select":

<arr name="last-components">
  <str>mtas</str>
</arr>

Define a new mtas_cql queryParser and mtas_join queryParser:

<queryParser name="mtas_cql" class="mtas.solr.search.MtasSolrCQLQParserPlugin"/>
<queryParser name="mtas_join" class="mtas.solr.search.MtasSolrJoinQParserPlugin"/>

Define a new mtas requestHandler:

<requestHandler name="/mtas" class="mtas.solr.handler.MtasRequestHandler" />

Define a new updateRequestProcessorChain:

<updateRequestProcessorChain name="mtasUpdateProcessor">
  <processor class="mtas.solr.update.processor.MtasUpdateRequestProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Define or adjust the update requestHandler with this updateRequestProcessorChain:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mtasUpdateProcessor</str>
  </lst>    
</requestHandler>

Finally, in this instruction we will use a classic schema instead of the managed-schema. So the configuration must contain:

<schemaFactory class="ClassicIndexSchemaFactory"/>

Schema.xml

We extend a (classic) schema with one (or multiple) fields that may contain annotated text, e.g.

<field name="text" type="mtas" required="false" multiValued="false" indexed="true" stored="true" />

We define the referred Mtas fieldType by

<fieldType name="mtas" class="solr.TextField" postingsFormat="MtasCodec">
  <analyzer type="index">
    <charFilter class="mtas.analysis.util.MtasCharFilterFactory" type="url" prefix="http://localhost/demo/" postfix="" />
    <tokenizer class="mtas.analysis.util.MtasTokenizerFactory" configFile="folia.xml" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="mtas.analysis.util.MtasPrefixTokenFilterFactory" prefix="t" />
  </analyzer>
</fieldType>

The charFilter with class mtas.analysis.util.MtasCharFilterFactory in the index analyzer contains an obligatory attribute type and two optional attributes prefix and postfix. The type can be url or file, referring to either an external url or a file on the filesystem. On indexing, the optional prefix and postfix attributes will be added to the provided value, resulting in a full url or location of a file.

The tokenizer with class mtas.analysis.util.MtasTokenizerFactory in the index analyzer has an attribute configFile containing the name of the required tokenizer configuration.

The filter in the query analyzer contains an obligatory attribute prefix defining the assumed prefix when this field will be queried directly within Solr.

See configuration for more information about the definition of a tokenizer configuration.

Multiple tokenize configurations

If multiple tokenizer configurations are required, the Mtas fieldType has to be defined slightly different:

<fieldType name="mtas_config" class="solr.TextField" postingsFormat="MtasCodec">
  <analyzer type="index">
    <charFilter class="mtas.analysis.util.MtasCharFilterFactory" config="mtas.xml" />
    <tokenizer class="mtas.analysis.util.MtasTokenizerFactory" config="mtas.xml" />
  </analyzer>
</fieldType>
<fieldType name="mtas" class="mtas.solr.schema.MtasPreAnalyzedField"
  followIndexAnalyzer="mtas_config" defaultConfiguration="default"
  configurationFromField="type" setNumberOfTokens="numberOfTokens"
  setNumberOfPositions="numberOfPositions" setSize="size"
  setError="error" setPrefix="prefixes" postingsFormat="MtasCodec">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="mtas.analysis.util.MtasPrefixTokenFilterFactory" prefix="t" />
  </analyzer>
</fieldType>

An additional fieldType (here named mtas_config) is defined, containing only an index analyzer. Both the charFilter and tokenizer within this analyzer have an attribute config referring to a Mtas configuration file. Depending on the required tokenizer configuration, for the charFilter this file will define type, prefix and postfix and for the tokenizer this file will define the configFile. An example of a Mtas configuration file is added below.

The Mtas fieldType is defined with class mtas.solr.schema.MtasPreAnalyzedField, an obligatory attribute followIndexAnalyzer referring to the additional fieldType we defined before. The optional attribute defaultConfiguration contains the name of the default configuration to be used, and the obligatory attribute configurationFromField contains the name of the field defining the required configuration. The optional attributes setNumberOfTokens, setNumberOfPositions, setSize, setPrefix and setError define fields that may be filled with respectively number of tokens, number of positions, filesize, prefixes and possible errors.

Example of a Mtas configuration file

<?xml version="1.0" encoding="UTF-8" ?>
<mtas>
  <configurations type="mtas.analysis.util.MtasTokenizerFactory">
    <configuration name="folia" file="folia.xml" />
    <configuration name="tei" file="tei.xml" />
  </configurations>
  <configurations type="mtas.analysis.util.MtasCharFilterFactory">
    <configuration name="folia" type="url" prefix="http://www.mycompany.com/archive/" postfix=".xml" />
    <configuration name="tei" type="file" prefix="/storage/tei/" postfix="" />
  </configurations>
</mtas>