Statistics - spans

To obtain statistics on the occurrence of a span within a set of documents in a Solr request, the following parameter must be provided in addition to the parameter that enables statistics.

Parameter	Value	Obligatory
mtas.stats.spans	true	yes

Multiple statistics on the occurrence of a span can be produced within the same request. To distinguish them, a unique identifier must be provided for each required statistic. Statistics can also be produced for the occurrence of multiple spans. Each span is described by a query, and to distinguish multiple spans, a query identifier must be provided as well.

Parameter	Value	Info	Obligatory
mtas.stats.spans.<identifier>.key	<string>	key used in response	no
mtas.stats.spans.<identifier>.field	<string>	Mtas field	yes
mtas.stats.spans.<identifier>.query.<identifier query>.type	<string>	query language: cql	yes
mtas.stats.spans.<identifier>.query.<identifier query>.value	<string>	query: cql	yes
mtas.stats.spans.<identifier>.query.<identifier query>.prefix	<string>	default prefix	no
mtas.stats.spans.<identifier>.query.<identifier query>.ignore	<string>	ignore query: cql	no
mtas.stats.spans.<identifier>.query.<identifier query>.maximumIgnoreLength	<integer>	maximum number of succeeding occurrences to ignore	no
mtas.stats.spans.<identifier>.type	<string>	required type of statistics	no
mtas.stats.spans.<identifier>.minimum	<double>	minimum number of occurrences of the span	no
mtas.stats.spans.<identifier>.maximum	<double>	maximum number of occurrences of the span	no

The key is added to the response and may be used to distinguish between multiple statistics on the occurrence of spans; it should therefore be unique. The optional minimum and maximum restrict the statistics to documents that satisfy a condition on the number of occurrences of the spans. When multiple queries are provided, the bounds apply to the sum of the occurrences of the resulting spans.
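The effect of minimum and maximum with multiple queries can be illustrated with a small local sketch. It does not call Mtas; the document identifiers and counts are hypothetical, and the bounds are treated as inclusive, which is an assumption based on the example responses further down (a minimum of 100 yields "min":100.0).

```python
# Hypothetical per-document occurrence counts, one entry per query.
doc_counts = {
    "doc1": [120, 30],  # sum 150
    "doc2": [40, 20],   # sum 60
    "doc3": [90, 15],   # sum 105
    "doc4": [250, 10],  # sum 260
}


def select_documents(counts, minimum=None, maximum=None):
    """Keep documents whose summed occurrences lie within the optional bounds."""
    selected = {}
    for doc_id, per_query in counts.items():
        total = sum(per_query)  # the bounds apply to the sum over all queries
        if minimum is not None and total < minimum:
            continue
        if maximum is not None and total > maximum:
            continue
        selected[doc_id] = total
    return selected


selected = select_documents(doc_counts, minimum=100, maximum=200)
print(selected)
```

Only the documents whose summed count falls between 100 and 200 remain; statistics would then be computed over those documents.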

Variables

The query may contain one or more variables; the value(s) of these variables must be defined.

Parameter	Value	Info	Obligatory
mtas.stats.spans.<identifier>.query.<identifier query>.variable.<identifier variable>.name	<string>	name of variable	yes
mtas.stats.spans.<identifier>.query.<identifier query>.variable.<identifier variable>.value	<string>	comma separated list of values	yes
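As a rough sketch, the variable parameters from the table above could be generated as follows. The helper `variable_params` and the variable name `article` are hypothetical; how a variable is referenced inside the query itself is not covered here.

```python
def variable_params(stats_id, query_id, variables):
    """Build the .variable.<id>.name and .variable.<id>.value parameters.

    `variables` maps a variable name to its list of values (hypothetical helper).
    """
    base = f"mtas.stats.spans.{stats_id}.query.{query_id}.variable"
    params = {}
    for var_id, (name, values) in enumerate(variables.items()):
        params[f"{base}.{var_id}.name"] = name
        # The value is passed as a comma separated list.
        params[f"{base}.{var_id}.value"] = ",".join(values)
    return params


params = variable_params(0, 0, {"article": ["de", "het"]})
for name, value in params.items():
    print(name, "=", value)
```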

Functions

To compute statistics for values based on the occurrence of one or more spans, functions can optionally be added. The parameters for these functions are the number of occurrences $q0, $q1, … for each span and the number of positions $n in a document. Statistics on the value computed for each document in the set are added to the response.

Parameter	Value	Info	Obligatory
mtas.stats.spans.<identifier>.function.<identifier function>.key	<string>	key used in response	no
mtas.stats.spans.<identifier>.function.<identifier function>.expression	<string>	see functions	yes
mtas.stats.spans.<identifier>.function.<identifier function>.type	<string>	required type of statistics	no

Again, the key is added to the response and may be used to distinguish between multiple functions; it should therefore be unique.
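The following local sketch illustrates what a function such as $q0/$n means: one value is computed per document, and statistics are then taken over those values. It is not the Mtas implementation; the documents are hypothetical, and skipping documents that raise a division by zero while counting them (compare the errorList entries in the example responses) is an assumption.

```python
def evaluate(expression, docs):
    """Apply `expression(q, n)` to each document; collect values and errors."""
    values = []
    errors = {}
    for q, n in docs:
        try:
            values.append(expression(q, n))
        except ZeroDivisionError:
            errors["division by zero"] = errors.get("division by zero", 0) + 1
    return values, errors


# Hypothetical documents: (occurrences per query [q0, ...], number of positions n).
docs = [([5], 100), ([0], 50), ([3], 0)]

# Equivalent of the expression $q0/$n.
relative, errors = evaluate(lambda q, n: q[0] / n, docs)
print(relative, errors)
```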

Examples

Basic: basic statistics on the occurrence of a word.
Minimum and Maximum: statistics on the occurrence of a word with restrictions on the number of occurrences.
Subset: statistics on the occurrence of a word within a subset of documents.
Multiple: statistics on the occurrence of multiple words.
Prefix: default prefix for the query.
Ignore: query with ignore.
Ignore and maximumIgnoreLength: query with ignore and maximumIgnoreLength.
Functions: statistics using functions.
Multiple and Functions: statistics using functions on the occurrence of multiple words.

Basic

Example
Total and average number of occurrences of the word “de” and the number of documents.

CQL
[t="de"]

Request and response
q=*%3A*&mtas=true&mtas.stats=true&mtas.stats.spans=true&mtas.stats.spans.0.field=text&mtas.stats.spans.0.query.0.type=cql&mtas.stats.spans.0.query.0.value=%5Bt%3D%22de%22%5D&mtas.stats.spans.0.key=example - basic&mtas.stats.spans.0.type=n%2Csum%2Cmean&rows=0&wt=json&indent=true

"mtas":{
    "stats":{
      "spans":[{
          "key":"example - basic",
          "mean":10.488239100197209,
          "sum":21656200,
          "n":2064808}]}}
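For convenience, the query string of the basic request can be assembled from a dictionary with Python's standard library. This is a sketch rather than a prescribed client: the encoding produced by `urlencode` may differ cosmetically from the request shown above (for example, spaces become `+`), but it decodes to the same parameter values.

```python
from urllib.parse import parse_qs, urlencode

# Parameters taken from the basic example request.
params = {
    "q": "*:*",
    "mtas": "true",
    "mtas.stats": "true",
    "mtas.stats.spans": "true",
    "mtas.stats.spans.0.field": "text",
    "mtas.stats.spans.0.query.0.type": "cql",
    "mtas.stats.spans.0.query.0.value": '[t="de"]',
    "mtas.stats.spans.0.key": "example - basic",
    "mtas.stats.spans.0.type": "n,sum,mean",
    "rows": "0",
    "wt": "json",
    "indent": "true",
}

query_string = urlencode(params)   # percent-encodes reserved characters
decoded = parse_qs(query_string)   # round trip back to the raw values
print(query_string)
```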

Minimum and Maximum

Example
Full statistics on the number of occurrences of the word “de” for documents with a minimum of 100 occurrences, for documents with a maximum of 200 occurrences, and for documents with between 100 and 200 occurrences.

CQL
[t="de"]

"mtas":{
    "stats":{
      "spans":[{
          "key":"example - minimum",
          "sumsq":8.697655383E9,
          "populationvariance":419224.862744871,
          "max":18192.0,
          "sum":4531747.0,
          "kurtosis":164.01633761739456,
          "standarddeviation":647.4937185426337,
          "n":18030,
          "quadraticmean":694.5495506941058,
          "min":100.0,
          "median":136.0,
          "variance":419248.1155521673,
          "mean":251.3448141985584,
          "geometricmean":160.50112302303313,
          "sumoflogs":91561.76594051626,
          "skewness":10.552060273112971},
        {
          "key":"example - maximum",
          "sumsq":7.37391079E8,
          "populationvariance":271.8217238864797,
          "max":200.0,
          "sum":1.9102393E7,
          "kurtosis":31.734626574581217,
          "standarddeviation":16.487020826545898,
          "n":2061623,
          "quadraticmean":18.91229851589547,
          "min":0.0,
          "median":4.0,
          "variance":271.82185573495815,
          "mean":9.265706193615522,
          "geometricmean":0.0,
          "sumoflogs":"-Infinity",
          "skewness":4.741031505227169},
        {
          "key":"example - minimum and maximum",
          "sumsq":2.73698488E8,
          "populationvariance":684.3248008017308,
          "max":200.0,
          "sum":1977940.0,
          "kurtosis":-0.47377181206297303,
          "standarddeviation":26.16048359466255,
          "n":14845,
          "quadraticmean":135.78321834689768,
          "min":100.0,
          "median":127.0,
          "variance":684.3709019066084,
          "mean":133.23947457056252,
          "geometricmean":130.83072059647412,
          "sumoflogs":72353.10901272473,
          "skewness":0.7177265003819447}]}}
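The fields of a full statistics response are related by standard formulas. The sketch below checks a few of them against the values of the "example - minimum and maximum" entry above; the formulas are the usual textbook definitions and are assumed, not confirmed, to be the ones Mtas applies.

```python
import math

# Values copied from the "example - minimum and maximum" response entry.
stats = {
    "sumsq": 2.73698488e8,
    "sum": 1977940.0,
    "n": 14845,
    "mean": 133.23947457056252,
    "quadraticmean": 135.78321834689768,
    "variance": 684.3709019066084,
    "populationvariance": 684.3248008017308,
    "standarddeviation": 26.16048359466255,
}

n = stats["n"]
mean = stats["sum"] / n                               # arithmetic mean
quadratic_mean = math.sqrt(stats["sumsq"] / n)        # root mean square
standard_deviation = math.sqrt(stats["variance"])     # from sample variance
population_variance = stats["variance"] * (n - 1) / n

print(mean, quadratic_mean, standard_deviation, population_variance)
```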

Subset

Example
Total and average number of occurrences of the word “de” and the number of documents for a subset of documents.

CQL
[t="de"]

Request and response
q=text:koe&rows=0&mtas=true&mtas.stats=true&mtas.stats.tokens=true&mtas.stats.tokens.0.field=text&mtas.stats.tokens.0.key=example - subset&mtas.stats.tokens.0.type=sum,mean,n&wt=json&indent=true

"mtas":{
    "stats":{
      "tokens":[{
          "key":"example - subset",
          "mean":42901.60996309963,
          "sum":116263363,
          "n":2710}]}}

Multiple

Example
Total and average number of occurrences of the word “de” and “het”, and the number of documents.

CQL
1. combined cql: [t="de"|t="het"]
2. combined regexp: [t="(de|het)"]
3. two queries: [t="de"] [t="het"]

"mtas":{
    "stats":{
      "spans":[{
          "key":"multiple - combined cql",
          "mean":15.178130848001365,
          "sum":31339926,
          "n":2064808},
        {
          "key":"multiple - combined regexp",
          "mean":15.178130848001365,
          "sum":31339926,
          "n":2064808},
        {
          "key":"multiple - two queries",
          "mean":15.178130848001365,
          "sum":31339926,
          "n":2064808}]}}
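The request for this example is not shown. As a reconstruction under stated assumptions, the parameters for the "two queries" variant could be built as follows; the helper `span_stats_params` is hypothetical, and the field name `text` and the type `n,sum,mean` are inferred from the other examples and the response keys.

```python
from urllib.parse import urlencode


def span_stats_params(stats_id, field, queries, key=None, stat_type=None):
    """Build mtas.stats.spans parameters for one statistic with several CQL queries."""
    prefix = f"mtas.stats.spans.{stats_id}"
    params = {f"{prefix}.field": field}
    if key is not None:
        params[f"{prefix}.key"] = key
    if stat_type is not None:
        params[f"{prefix}.type"] = stat_type
    for query_id, cql in enumerate(queries):
        params[f"{prefix}.query.{query_id}.type"] = "cql"
        params[f"{prefix}.query.{query_id}.value"] = cql
    return params


two_queries = span_stats_params(
    0,
    "text",
    ['[t="de"]', '[t="het"]'],
    key="multiple - two queries",
    stat_type="n,sum,mean",
)
print(urlencode(two_queries))
```

Because both queries belong to the same statistic, their occurrences are summed per document, which is why this variant reports the same numbers as the two combined queries.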

Prefix

Example
Total and average number of occurrences of the word “de” followed by an adjective.

CQL
"de" [pos="ADJ"]

Request and response
q=*%3A*&mtas=true&mtas.stats=true&mtas.stats.spans=true&mtas.stats.spans.0.field=text&mtas.stats.spans.0.query.0.type=cql&mtas.stats.spans.0.query.0.value="de" [pos%3D"ADJ"]&mtas.stats.spans.0.query.0.prefix=t_lc&mtas.stats.spans.0.key=example - prefix&mtas.stats.spans.0.type=n%2Csum%2Cmean&rows=0&wt=json&indent=true

"mtas":{
    "stats":{
      "spans":[{
          "key":"example - prefix",
          "mean":2.1725308115815127,
          "sum":4485859,
          "n":2064808}]}}

Ignore

Example
Total and average number of occurrences of an article followed by a noun, ignoring adjectives.

CQL
[pos="LID"][pos="N"]

Ignore [pos="ADJ"]

Ignore and maximumIgnoreLength

Functions

Example
Statistics for the relative frequency of the word “de” and the total number of words in documents containing this word.

CQL
[t_lc="de"]

Functions
$q0/$n
$n

"mtas":{
    "stats":{
      "spans":[{
          "key":"functions - de",
          "mean":12.352043386116287,
          "sum":25504598,
          "n":2064808,
          "functions":{
            "number of words":{
              "sum":504361094,
              "n":2064808},
            "relative frequency":{
              "distribution(start=0,end=0.1,number=10)":{
                "[0.000,0.010)":390003,
                "[0.010,0.020)":120903,
                "[0.020,0.030)":173830,
                "[0.030,0.040)":209994,
                "[0.040,0.050)":245098,
                "[0.050,0.060)":253528,
                "[0.060,0.070)":218325,
                "[0.070,0.080)":163982,
                "[0.080,0.090)":115929,
                "[0.090,0.100)":77207},
              "mean":0.04538673326024501,
              "errorList":{"division by zero":1039},
              "standarddeviation":0.03284884758453086,
              "errorNumber":1039}}}]}}

Multiple and Functions

Example
Statistics for the absolute and relative frequency of the words “de”, “het” and “een”, for part of speech type “LID” and the total number of words in documents containing this word.

CQL
[t_lc="de"]
[t_lc="het"]
[t_lc="een"]
[pos="LID"]

Functions
$q0/$n
$q1/$n
$q2/$n
$q3/$n
$q0/$q3
$q1/$q3
$q2/$q3
($q0+$q1+$q2)/$q3

Request and response
q=*%3A*&mtas=true&mtas.stats=true&mtas.stats.spans=true&mtas.stats.spans.0.field=text&mtas.stats.spans.0.query.0.type=cql&mtas.stats.spans.0.query.0.value=[t_lc%3D"de"]&mtas.stats.spans.0.query.1.type=cql&mtas.stats.spans.0.query.1.value=[t_lc%3D"het"]&mtas.stats.spans.0.query.2.type=cql&mtas.stats.spans.0.query.2.value=[t_lc%3D"een"]&mtas.stats.spans.0.query.3.type=cql&mtas.stats.spans.0.query.3.value=[pos%3D"LID"]&mtas.stats.spans.0.key=multiple+and+functions+-+de%2Bhet%2Been+and+LID&mtas.stats.spans.0.type=n&mtas.stats.spans.0.minimum=1&mtas.stats.spans.0.function.0.expression=%24q0&mtas.stats.spans.0.function.0.key=de+-+absolute&mtas.stats.spans.0.function.0.type=n%2Csum&mtas.stats.spans.0.function.1.expression=%24q1&mtas.stats.spans.0.function.1.key=het+-+absolute&mtas.stats.spans.0.function.1.type=n%2Csum&mtas.stats.spans.0.function.2.expression=%24q2&mtas.stats.spans.0.function.2.key=een+-+absolute&mtas.stats.spans.0.function.2.type=n%2Csum&mtas.stats.spans.0.function.3.expression=%24q3&mtas.stats.spans.0.function.3.key=LID+-+absolute&mtas.stats.spans.0.function.3.type=n%2Csum&mtas.stats.spans.0.function.4.expression=%24q0%2F%24n&mtas.stats.spans.0.function.4.key=de+-+relative+to+positions&mtas.stats.spans.0.function.4.type=n%2Cmean&mtas.stats.spans.0.function.5.expression=%24q1%2F%24n&mtas.stats.spans.0.function.5.key=het+-+relative+to+positions&mtas.stats.spans.0.function.5.type=n%2Cmean&mtas.stats.spans.0.function.6.expression=%24q2%2F%24n&mtas.stats.spans.0.function.6.key=een+-+relative+to+positions&mtas.stats.spans.0.function.6.type=n%2Cmean&mtas.stats.spans.0.function.7.expression=%24q3%2F%24n&mtas.stats.spans.0.function.7.key=LID+-+relative+to+positions&mtas.stats.spans.0.function.7.type=n%2Cmean&mtas.stats.spans.0.function.8.expression=%24q0%2F%24q3&mtas.stats.spans.0.function.8.key=de+-+relative+to+LID&mtas.stats.spans.0.function.8.type=n%2Cmean&mtas.stats.spans.0.function.9.expression=%24q1%2F%24q3&mtas.stats.spans.0.function.9.key=het+-+relative+to+LID&mtas.stats.spans.0.function.9.type=n%2Cmean&mtas.stats.spans.0.function.10.expression=%24q2%2F%24q3&mtas.stats.spans.0.function.10.key=een+-+relative+to+LID&mtas.stats.spans.0.function.10.type=n%2Cmean&mtas.stats.spans.0.function.11.expression=(%24q0%2B%24q1%2B%24q2)%2F%24q3&mtas.stats.spans.0.function.11.key=de%2Bhet%2Been+-+relative+to+LID&mtas.stats.spans.0.function.11.type=n%2Cmean&rows=0&wt=json&indent=true

"mtas":{
    "stats":{
      "spans":[{
          "key":"multiple and functions - de+het+een and LID",
          "n":1890377,
          "functions":{
            "een - relative to LID":{
              "mean":0.26177400695591124,
              "errorList":{"division by zero":24175},
              "n":1890377,
              "errorNumber":24175},
            "LID - absolute":{
              "sum":44077220,
              "n":1890377},
            "de+het+een - relative to LID":{
              "mean":1.0864079360130154,
              "errorList":{"division by zero":24175},
              "n":1890377,
              "errorNumber":24175},
            "het - relative to LID":{
              "mean":0.2740826070638114,
              "errorList":{"division by zero":24175},
              "n":1890377,
              "errorNumber":24175},
            "een - relative to positions":{
              "mean":0.021631171906706374,
              "n":1890377},
            "een - absolute":{
              "sum":10620744,
              "n":1890377},
            "het - relative to positions":{
              "mean":0.02235754528581941,
              "n":1890377},
            "de - absolute":{
              "sum":25504598,
              "n":1890377},
            "het - absolute":{
              "sum":11530937,
              "n":1890377},
            "LID - relative to positions":{
              "mean":0.08693980190126971,
              "n":1890377},
            "de - relative to LID":{
              "mean":0.5505513219945993,
              "errorList":{"division by zero":24175},
              "n":1890377,
              "errorNumber":24175},
            "de - relative to positions":{
              "mean":0.049574709134571515,
              "n":1890377}}}]}}
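Since function statistics are computed over per-document values, the mean of a ratio generally differs from the ratio of the corresponding totals. The sketch below compares the two using numbers copied from the response above; it only reads those numbers and makes no further assumption about how Mtas handles the documents reported under errorList.

```python
# Values copied from the response of the "Multiple and Functions" example.
de_sum = 25504598                      # "de - absolute" -> sum
lid_sum = 44077220                     # "LID - absolute" -> sum
mean_of_ratios = 0.5505513219945993    # "de - relative to LID" -> mean

# Ratio of corpus-wide totals, as opposed to the mean of per-document ratios.
ratio_of_sums = de_sum / lid_sum

print(f"mean of per-document ratios: {mean_of_ratios:.4f}")
print(f"ratio of totals:             {ratio_of_sums:.4f}")
```

The per-document mean weights every document equally, whereas the ratio of totals is dominated by long documents, so the two values are not expected to coincide.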

Lucene

To use statistics on the occurrence of a span directly in Lucene, use ComponentSpan together with the provided collect method.
