Setting up Elastic to use the MLT Widget

Term Vectors - Term vectors are additional data stored with each document that record the terms in a field and how often each one occurs. Term vectors increase the size of an index but are required for highlighting and More Like This (MLT) searches. They can only be applied to text fields.

Default Fields

The content, simflofy_typename, and simflofy_filename fields have term vectors enabled by default.

Sample Size

MLT searches need a sufficient number of documents to work properly, likely at least 200-300 with relevant metadata. The exact number is hard to pin down, but the default configuration of the widget is very permissive.

Job Specification

info

If a field mapping already exists in ElasticSearch without term vectors, attempting to add term vectors to it will cause an exception. You can review mapping properties either by using Kibana or by checking the following endpoint in ElasticSearch:

(elasticSearch)/(indexName)/_mappings
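
As a rough illustration of that check, the sketch below inspects a mapping response and reports which fields already store term vectors. It assumes the response shape used by ElasticSearch 7 and later (index name, then mappings, then properties). The helper name, index name, and sample data are hypothetical and are not part of Simflofy.

```python
def fields_with_term_vectors(mapping_response, index_name):
    """Return {field: term_vector setting} for top-level fields of one index.

    Assumes the ElasticSearch 7+ mapping response shape:
    {index: {"mappings": {"properties": {field: {...}}}}}
    """
    properties = (
        mapping_response.get(index_name, {})
        .get("mappings", {})
        .get("properties", {})
    )
    result = {}
    for field, definition in properties.items():
        # A field with no "term_vector" entry (or "no") stores no term vectors.
        setting = definition.get("term_vector", "no")
        if setting != "no":
            result[field] = setting
    return result


# Hypothetical mapping response, for illustration only.
sample = {
    "documents": {
        "mappings": {
            "properties": {
                "content": {"type": "text", "term_vector": "yes"},
                "author": {"type": "text"},
            }
        }
    }
}
print(fields_with_term_vectors(sample, "documents"))
```

Any field missing from the result that you intend to use for MLT would need its mapping handled before the job runs, since adding term vectors to an existing mapping fails.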

Any Indexing Job with ElasticSearch can apply term vectors. The job specification takes a comma-delimited list of fields for which term vectors will be stored. The fields included in this list must also be mapped, as seen below:

Mapping

ElasticSearch Server Properties
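
To make the effect of the field list more concrete, the sketch below expands a comma-delimited list into ElasticSearch mapping properties with term vectors enabled. This is an assumption about the general shape of such a mapping, not a description of what Simflofy generates. The function name is hypothetical, and the "yes" setting is only one of the values ElasticSearch accepts for term_vector.

```python
def term_vector_properties(field_list, term_vector="yes"):
    """Build mapping properties that enable term vectors on text fields.

    field_list is a comma-delimited string, as in the job specification.
    The term_vector value is an assumption; ElasticSearch also accepts
    settings such as "with_positions_offsets".
    """
    fields = [name.strip() for name in field_list.split(",") if name.strip()]
    return {
        "properties": {
            name: {"type": "text", "term_vector": term_vector} for name in fields
        }
    }


print(term_vector_properties("content, simflofy_filename"))
```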

The MLT Widget

Simflofy does not have a widget instance for the MLT Widget out of the box. You will need to go to the Federation menu and select widget instances.

Widget Definitions

Then, select Create New Widget Instance.

Create New Widget Instance

Select MLTWidget from the dropdown.

Select MLT Widget

You will then be taken to the Widget Instance page.

MLT Widget Configuration

Configuring the Widget

It is highly recommended that you read up on how MLT searches work. Most of the configuration for the widget is pulled directly from ElasticSearch.

Property - Description
refDocs - The number of sample documents used. An MLT search requires a source set of documents to use as search criteria.
maxqt - The maximum number of query terms that will be selected from the source documents to build the search. Increasing this value gives greater accuracy at the expense of query execution speed.
mintf - The minimum term frequency (how often a term appears in the input document) below which terms from the input document will be ignored. A setting of 1 means a term only needs to appear once in the input document to be used.
mindf - The minimum document frequency (how many documents in the index contain the term) below which terms from the input document will be ignored.
minwl - The minimum word length below which terms will be ignored.
maxwl - The maximum word length above which terms will be ignored. Defaults to unbounded (0).
mltfl - The list of fields checked for similarities. The default values are the fields that have term vectors enabled by default.
btnLabel - The label for the More Like This search button.
note

These default values are meant to be a starting point, as they require very few matches to be considered like another document. More refined searches will require more tuning.
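
Because most of these properties come from ElasticSearch, it can help to see how they might line up with a more_like_this query. The sketch below is an assumption about that correspondence, not the widget's actual implementation. The function name, index name, document id, and example values are hypothetical, and the values shown are not the widget's shipped defaults.

```python
import json

# Widget property -> ElasticSearch more_like_this parameter (assumed pairing).
WIDGET_TO_ES = {
    "maxqt": "max_query_terms",
    "mintf": "min_term_freq",
    "mindf": "min_doc_freq",
    "minwl": "min_word_length",
    "maxwl": "max_word_length",
}


def build_mlt_query(widget_settings, like_docs):
    """Build a more_like_this query body from widget-style settings.

    widget_settings["mltfl"] is taken to be a list of field names.
    like_docs is a list of {"_index": ..., "_id": ...} reference documents.
    """
    mlt = {
        "fields": list(widget_settings["mltfl"]),
        "like": [{"_index": d["_index"], "_id": d["_id"]} for d in like_docs],
    }
    # Copy over only the tuning values that were actually configured.
    for widget_key, es_key in WIDGET_TO_ES.items():
        if widget_key in widget_settings:
            mlt[es_key] = int(widget_settings[widget_key])
    return {"query": {"more_like_this": mlt}}


# Illustrative values only.
example_settings = {
    "mltfl": ["content", "simflofy_filename"],
    "maxqt": 25,
    "mintf": 1,
    "mindf": 1,
}
example_docs = [{"_index": "documents", "_id": "abc123"}]
print(json.dumps(build_mlt_query(example_settings, example_docs), indent=2))
```

Raising mintf or mindf in a query like this narrows the set of terms taken from the reference documents, which is the kind of tuning a more refined search would need.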

Usage

The widget can be placed on the sidebar of any view using the simflofy template. Upon initial load, assuming your indexes have enough data for MLT to get results, you should see something like the following:

More Like This TSearch View

The information icons can be used to view the metadata of the sample documents.

After a search is completed (including the initial one), the widget performs a separate search using the results (number determined by refDocs) to provide a sample of the MLT documents.

Pressing the button on the widget will use the sample documents as a reference and will load the results as normal. Note that the widget will also perform an MLT search on the results of this search.
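
The role of refDocs in this flow can be sketched as follows. The example assumes that the reference documents are simply the first refDocs hits of the preceding search, which this page does not confirm. The function name and sample response are hypothetical.

```python
def reference_docs(search_response, ref_docs):
    """Pick the first ref_docs hits of a search response as MLT references.

    Assumes the standard ElasticSearch search response shape:
    {"hits": {"hits": [{"_index": ..., "_id": ...}, ...]}}
    """
    hits = search_response.get("hits", {}).get("hits", [])
    return [{"_index": h["_index"], "_id": h["_id"]} for h in hits[:ref_docs]]


# Hypothetical search response, for illustration only.
response = {
    "hits": {
        "hits": [
            {"_index": "documents", "_id": "1", "_score": 2.0},
            {"_index": "documents", "_id": "2", "_score": 1.5},
            {"_index": "documents", "_id": "3", "_score": 1.1},
        ]
    }
}
print(reference_docs(response, 2))
```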

Interactions with Other Facet Widgets

MLT results will return facet counts, but those values cannot be drilled down into by other widgets as of now.

For example, you could not perform an MLT Search and then use the Facet Select widget to select all PDFs. For situations like this:

Pre 3.1

It is recommended that you separately index the content type using the following calculated mapping:

Content Type Mapping

Then add content_type to your Term Vector field list in the output specification and the field list for the widget.

Post 3.1

simflofy_content_type will have its term vector added automatically as part of the migration and can be added to the field