Corpus Query Language

Within Lucene and Solr, each field containing tokenized text can be considered as a set of tokens, where each token is associated with a position and its value can be seen as a word from the original text. Mtas extends this concept by allowing to associate multiple positions with one token and by associating each token with a prefix and optional postfix instead of the single value. This makes it possible to use multiple tokens on the same position, and distinguish annotations by using a unique prefix for each type, and allows structures like sentences, paragraphs or entities consisting of multiple adjacent or non-adjacent positions.

To describe sets of tokens matching some condition, a query language is needed. Mtas supports CQL based on the Corpus Query Language introduced by the Corpus WorkBench and supported by the Lexicom Sketch Engine.

Prefix

For each field containing Mtas tokenized text, every token is associated with a prefix. Within the field, only a limited set of prefixes is used to distinguish between the different types of annotation. By using a prefix query a full list of used prefixes can be produced.

Value

The optional postfix associated with a token can be queried within CQL by providing a value. This is a regular expression, the supported syntax is documented in the RegExp class provided by Lucene. By using a termvector query, for each prefix a list of postfix values can be produced.

Variable

The optional postfix associated with a token can also be queried within CQL by providing a variable. Each variable may occur only once in a CQL query, and should be provided as a comma separated list together with this query. Each provided variable has to occur in the query.

CQL

Syntax Description Example
token Matches a single position token [t="de"]
multi-position Matches a (single or) multi position token <s/>
sequence Matches a sequence [pos="ADJ"]{2}[pos="N"]
Syntax Description Example
cql { <number> } Matches provided number of occurrence from cql [pos="ADJ"]{2}
cql { <number> , <number>} Matches each number between provided start and end of occurrence from cql [pos="ADJ"]{2,3}
Syntax Description Example
( cql ) within ( cql ) Matches CQL expression within another CQL expression ([t="de"]) within (<s/>)
( cql ) !within ( cql ) Matches CQL expression not within another CQL expression ([t="de"]) !within (<s/>)
( cql ) containing ( cql ) Matches CQL expression containing another CQL expression (<s/>) containing ([t="de"])
( cql ) !containing ( cql ) Matches CQL expression not containing another CQL expression (<s/>) !containing ([t="de"])
( cql ) intersecting ( cql ) Matches CQL expression intersecting another CQL expression (<s/>) intersecting (<div/>)
( cql ) !intersecting ( cql ) Matches CQL expression not intersecting another CQL expression (<s/>) !intersecting (<div/>)
( cql ) fullyalignedwith ( cql ) Matches CQL expression fully aligned with another CQL expression (<s/>) fullyalignedwith (<div/>)
( cql ) !fullyalignedwith ( cql ) Matches CQL expression not fully aligned with another CQL expression (<s/>) !fullyalignedwith (<div/>)
( cql ) followedby ( cql ) Matches CQL expression followed by another CQL expression ([t="de"]) followedby ([pos="ADJ"])
( cql ) !followedby ( cql ) Matches CQL expression not followed by another CQL expression ([t="de"]) !followedby ([pos="ADJ"])
( cql ) precededby ( cql ) Matches CQL expression preceded by another CQL expression ([pos="ADJ"]) precededby ([t="de"])
( cql ) !precededby ( cql ) Matches CQL expression not preceded by another CQL expression ([pos="ADJ"]) !precededby ([t="de"])

Token

Syntax Description Example
[ ] Matches each single position token []
" value " Matches a single position token with condition defined by a basic single-position-expression, where the prefix is the default prefix provided with the query "de"
[ single-position-expression ] Matches single position token with condition defined by an single-position-expression [t="de"]

Single Position Expression

Expression Syntax Example
basic prefix = "value" t="de"
variable prefix = $[variable-name] t=$1
not ! single-position-expression !t="de"
and ( single-position-expression & single-position-expression &) t="de" & pos="LID"
or ( single-position-expression | single-position-expression |) t="de" | t="het"
position # <position> #100
range # <position> - <position> #100-110

Multi-position

Syntax Description Example
< multi-position-expression /> Matches (single and) multi position tokens with condition defined by multi-position-expression <s/>
< multi-position-expression > Matches start of (single and) multi position tokens with condition defined by multi-position-expression <s>
</ multi-position-expression > Matches end of (single and) multi position tokens with condition defined by multi-position-expression </s>

Multi Position Expression

Expression Syntax
prefix prefix
basic prefix = "value"

Sequence

Syntax Description Example
cql cql cql A sequence of cql [t="de"][pos="ADJ"]{2}[pos="N"]