From Relevant Search by Doug Turnbull and John Berryman

This article discusses debugging query matching in search engines.

Save 37% on Relevant Search. Just enter code fccturnbull into the discount code box at checkout at

In our work, we’ll often encounter cases where a piece of text we assumed would match our search didn’t, eliminating a relevant document from the search results. Conversely, we might be surprised when low-value or spurious terms match, adding an irrelevant document to the results. Even within the documents retrieved, matching or not matching a term might influence relevance ranking — unexpectedly causing poor results to be ranked highly due to spurious matches or ranked low due to unexpected misses. We need to be able to take apart this process with Elasticsearch’s analysis and query validation debugging tools.

First, let’s talk about what precisely what we mean by “matching.” Declaring a term a match in the inverted index is a strict, exact binary equivalence. Search engines don’t have the intelligence even to know that “Aliens” and “alien” refer to the same idea. Or that “extraterrestrial” almost refers to the same idea. English speaking humans understand that these mentions should be counted as signifiers of the idea of “alien”; or an indicator of the feature of “alien-ness” present in the text. The way human’s learn language is through being taught, with some using somewhere like AJ Hoge’s Effortless English Club to learn the language and to communicate effectively. However, to the unintelligent search engine, these two tokens exist as distinct UTF-8 binary strings. The two strings, [0x41,0x6c,0x69,0x65,0x6e,0x73] (“Aliens”) and [0x61,0x6c,0x69,0x65,0x6e] (“alien”), are not at all the same, and do not match.

This exacting matching behavior points to two areas to take apart (1) query parsing – how our Query DSL query above translated into a matching strategy of specific terms to fields, and (2) analysis — the process of creating tokens from the query and document text. By understanding query parsing, we can see exactly how our query DSL query is translated into internal uses of Lucene’s data structures to satisfy searches, starting with matches in fields. Through analysis, we can massage, interrogate, pry, and prod text with hope that the text’s true “alien-ness” can be boiled down to a single term. We can further identify meaningless terms, like “the” that might match and really represent no important feature, creating spurious matches on low value terms.

Only once we understand how the underlying data structures are created and accessed can we hope to take control of the process. Let’s walk through our search, and see if there’s a matching problem inadvertently including spurious matches like Fire with Fire.

The Query Validate API

The first thing we’ll do to inspect matching behavior is ask Elasticsearch to explain how the query was parsed. This will decompose our search query into an alternate description that more closely describes the underlying interaction with Lucene’s data structures. To do this, we’ll use Elasticsearch’s query validation endpoint. This endpoint, seen in the subsequent listing, takes as an argument a Query DSL query and returns a low-level explanation of the strategy used to satisfy the query

Listing 1 – Explain The Behavior Of Our Query

search = {    'query': {         'multi_match': {             'query': usersSearch,             'fields': ['title^10', 'overview']         }     } } httpResp = requests.get('http://localhost:9200' + #A                     '/tmdb/movie/_validate/query?explain',                      data=json.dumps(search)) print json.loads(httpResp.text)   Response:  {u'_shards': {u'failed': 0, u'successful': 1, u'total': 1},  u'explanations': [{u'explanation': u'filtered((((title:basketball title:with title:cartoon title:aliens)^10.0) | (overview:basketball overview:with overview:cartoon overview:aliens)))->cache(_type:movie)',    u'index': u'tmdb',    u'valid': True}],  u'valid': True}

#A Issue validation request to validate API

Here the returned “explanation” field (bolded) lists what we’re interested in, our query translated to a more precise syntax that gives us deeper information about how Lucene will work with our Elasticsearch query.

((title:basketball title:with title:cartoon title:aliens)^10.0) | (overview:basketball overview:with overview:cartoon overview:aliens)


Taking Apart Query Parsing

The query validation endpoint has returned an alternative representation of our query DSL query to help us debug our matching issues. Let’s examine this alternative syntax to see what it tells us. The query validation output is reminiscent of Lucene query syntax* — a low-level, precise way of specifying a search. Because of the additional precision, Lucene query syntax describes the requirements of a relevant document a bit more closely to how Lucene itself will perform the search using the inverted index.

Lucene queries are composed of a number of Boolean (MUST(+), SHOULD, and MUST_NOT(-)) clauses, each one specifying a field to search in the underlying document. Each clause takes the form [+/-]:. To debug matching, the most important part of the clause is the component that specifies the match itself: :. If we examine one of the clauses above, such as title:basketball, we can see that we’re asking the title field to look for the specific term basketball. Each clause here is a simple term query – a single term lookup in the inverted index. Besides term query, the most prominent queries we’ll encounter are phrase queries. We discussed these in chapter 2. In Lucene query syntax, these are specified using quotes, as in title:”space jam” to indicate the terms should be adjacent.

* In reality, the representation depends on each Lucene Query’s Java “toString” method, which attempts (but doesn’t always accurately reflect) strict “Lucene Query Syntax”.

In our example, as we move one layer out from each match, we can see Lucene’s query strategy. While we’re currently focused on matching, this encompasses more than matching. Above the inner most matches, we see, four “should” clauses scored together (grouped with parenthesis):

(title:basketball title:with title:cartoon title:aliens)

Boosted by a factor of 10 (as we’ve requested when searching):

(title:basketball title:with title:cartoon title:aliens)^10

Compared to another query, with a maximum score taken (| symbol)

((title:basketball title:with title:cartoon title:aliens)^10.0) | (overview:basketball overview:with overview:cartoon overview:aliens)

It seems odd that a lot of surprising scoring mumbo-jumbo is already happening for us. What matters is using the information about what terms are being searched for to answer why spurious matches like Dances with Wolves or Fire with Fire even considered matches.

Debugging Analysis

Now that we know what terms are being searched for, the next step to debugging matching is to see how documents are decomposed into terms and placed in the index. After all, our searches will fail if the terms we’re searching for don’t actually exist in the index. We gave an example of this above. Searches for the term “Aliens” will not match the term “alien” regardless of our intuition. Further, term searches might result in spurious matches that in reality don’t signify anything valuable. For example, matching on “the” in isolation is spurious for English. It signifies no important feature latent in the text to our user’s English language trained minds.

It’s often the case that despite our intuitive notion of how a document should be decomposed into terms, the mechanics of analysis surprise us. It’s a process we’ll need to debug often. We already know how these terms are extracted – through index-time analysis. Analyzers are the entity that defines the analysis process. They contain these components: character filters, a tokenizer, and token filters. In Elasticsearch, analyzer used can be specified at many levels, including the ability to specify an analyzer for the index (all of tmdb), a node (a running instance of Elasticsearch), a type (all movies), a field, or even at query time for a particular query. We have yet to specify an analyzer, so the default Standard analyzer is used. We can use this knowledge along with Elasticsearch’s useful analyze endpoint to view how text from our documents was transformed into the tokens that will form the inverted index.

Perhaps if we see how the analysis for the title Fire with Fire translates to the inverted index, we might see the terms that match our query. Then we might see why this seemingly random, irrelevant movie is included in the results.

Listing 2 – Debugging Analysis

resp =  requests.get('http://localhost:9200/tmdb/_analyze?analyzer=standard',           data="Fire with Fire") #A print resp.text

#A Request analysis of the string “Fire with Fire” as a title field

The result (in prettier YAML) is:

tokens: - token: "fire" #A   start_offset: 0 #B   end_offset: 4   type: ""   position: 1 #C - token: "with"   start_offset: 5   end_offset: 9   type: ""   position: 2 - token: "fire"   start_offset: 10   end_offset: 14   type: ""        position: 3

#A An entry in the token stream, showing the extracted properties of a token (in this case the first time “fire” was mentioned

#B Start/end offsets indicate where the token exists in the source text

#C Position indicates term ordering, distance, and adjacency

This output shows us important properties of each token extracted from the snippet Fire with Fire by the standard analyzer. This list of tokens resulting from analysis is known as the token stream. Here in this token stream, we extract three tokens, “fire”, “with” and “fire”. Notice how the text has been tokenized by whitespace and lowercased? Further notice how more attributes than just the token text are included. Notice the offset values – indicating the exact character position of each term in the source text, and the “position” indicating the position of the token in the stream.

After analysis, the token stream is indexed and placed into the inverted index. For debugging and illustration purposes, we can represent the inverted index in a simple data structure known as SimpleText – an index storage format created by Mike McCandless purely for educational purposes.

Let’s take a second to reflect on how the token stream above is translated to a SimpleText representation of an index, focused just on the term “fire”.

Listing 3 — SimpleText Index Representation For Term “fire”

field title   term fire     doc 0  #A       freq 2       position 1       position 3     doc 2 

#A Posting indicates doc 0 contains the term “fire”

The search engine’s goal when indexing is to consume the token stream into the inverted index, placing documents under their appropriate terms. After counting how many occurrences of any particular token occur (we see two instances of “fire”) indexing adds entries to the postings list for the term “fire”. Under “fire” we add our document, doc 0, at #A. We’ll further store the number of occurrences of “fire” in doc 0 as “freq” and record where it occurred through each “position” entry. With all the tokens taken together, this document is added to the postings for two terms, “fire” and “with” as below postings for two terms, “fire” and “with” as below.

Listing 4 — View of Title Index with Fire With Fire terms highlighted

field title   term fire     doc 0    #A       freq 2       position 1       position 3     doc 2    term with doc 0       #B       freq 1   

#A “Fire with Fire” under “title:fire” posting

#B “Fire with Fire” under “title:with” posting

Of course, as we’ve discussed, data structures other than the inverted index consume this token stream. There are numerous features that can be enabled in Lucene. For our purposes, we should consider data structures that consume this token stream to provide other forms of global-term statistics like the document frequency. In this case, the document frequency of “fire” will increase by one, reflecting the new document.

What’s important to note is we can deeply control this process. Typically, analysis is controlled on a field-by-field basis. We’ll see how we can define our own analyzers for our fields, using the components of character filters, tokenizers, and token filters. Before that, armed with what we know about the query and the terms in the index, we need to examine why a spurious match like Fire with Fire would even match in the first place.

Our Query Vs The Inverted Index

We’re now prepared to compare our parsed query with the context of the inverted index. If we compare the parsed query:

((title:basketball title:with title:cartoon title:aliens)^10.0) | (overview:basketball overview:with overview:cartoon overview:aliens)

against the inverted index snippet from the token stream for Fire with Fire, we see exactly where the match occurs:

field title  term fire     doc 0       freq 2       position 1       position 3     doc 2     … (more docs)   term with #A doc 0.       freq 1       position 2 

#A with given equal prominence to fire in postings list

The clause “title:with” pulls in doc 0, Fire with Fire, from the inverted index. Recalling how term matches work, we see how the mechanics of how this would work. Our document is listed under “with” in the index. Therefore, it gets included in the search results along with other matches for “with”. This is an entirely mechanical process, devoid of any consciousness of how to English works. Of course, to English speakers, a match on “with” is not helpful. Matches on “with” will only leave English speaking searchers scratching their head on why such a noisy word was considered important in matching.

Other spurious movies seem to fall into this category, movies like Dances with Wolves or From Russia with Love get slurped up into the search results just as easily as documents that match more important terms like “basketball” or “aliens.” Without help the search engine can’t discriminate between meaningful, valid, and important English terms and those that are noise and low value.

Fixing Our Matching By Changing Analyzers

Our matching problem luckily has a fairly straightforward fix. We’ve teased Elasticsearch for not knowing much about English. In actuality, Elasticsearch has an analyzer that handles English text fairly well. It does a reasonable job doing some of the work we’d expect from an English-focused analyzer for general-purpose use cases. It will stem English terms to root forms (running -> run), and remove noise terms like “the” known as stop words. Lucky for us “with” is one such stop word. Removing it from the index could solve our problem.

How do we do use this analyzer? Simple, we need to assign a different analyzer to the fields. As modifications to index-time analysis alter the structure of the inverted index, we’ll have to re-index our documents. To customize our analysis, we will simply recreate our index, and rerun much of the indexing code from above. The main difference is at #1 below, we’ll explicitly configure the field with the English analyzer before creating the index.

Listing 5 — Reindexng With The English Analyzer

mappingSettings = {        'movie': {             'properties': {                'title': { #A                    'type': 'string',                    'analyzer': 'english'                },             'overview': {                    'type': 'string',                    'analyzer': 'english'                }             }        } } reindex(mappingSettings=mappingSettings, movieDict=movieDict) #B  

#A Modifying fields “title” and “overview” to use the English analyzer

#B Reindex with new field mappings

Great! Did it work? Well let’s reanalyze Fire with Fire to see the results.

resp = requests.get('http://localhost:9200/tmdb/_analyze?field=_all',"Fire with Fire")


tokens: - token: "fire"   start_offset: 0   end_offset: 4   type: ""   position: 1 - token: "fire" #A   start_offset: 10   end_offset: 14   type: ""   position: 3 #B

#A “with” token no longer in postings

#B position of second “fire” term unchanged

Notice the removal of with in this token stream. Particularly, notice the gap between positions 1 and 3. Elasticsearch is reflecting the removal of the token by this position gap to avoid spurious phrase or positional matches. Indeed, rerunning the query validation also shows a removal of “with” from the query:

u'valid': True} {u'_shards': {u'failed': 0, u'successful': 1, u'total': 1},  u'explanations': [{u'explanation': u'filtered((((title:basketbal #A title:cartoon title:alien)^10.0) | (overview:basketbal overview:cartoon overview:alien)))- >cache(_type:movie)',    u'index': u'tmdb',    u'valid': True}],

#A The new query strategy, note “basketball” is now stemmed to “basketball” due to rules for stemming English

And indeed, the matches become much closer to what we want. At least we’re in the range of “aliens”. Further, because of more sophisticated analysis, stemming, and token normalization we’re picking up other matches of “alien” that were missing.

Num   Relevance Score   Movie Title 1     1.0643067         Alien 2     1.0643067         Aliens  3     1.0643067         Alien³  4     1.0254613         The Basketball Diaries 5     0.66519165        Cowboys & Aliens 6     0.66519165        Aliens in the Attic 7     0.66519165        Alien: Resurrection 8     0.53215337        Aliens vs Predator: Requiem 9     0.53215337        AVP: Alien vs. Predator 10    0.53215337        Monsters vs Aliens 11    0.08334568        Space Jam

Congratulations, we’ve made a marked improvement! By turning on the English analyzer, we’ve made a significant leap forward. Our target has moved up to #11. We’ve achieved a saner mapping of text that corresponds to the text’s “alien-ness” feature through simple English-focused analysis. We’ve also eliminated text that shouldn’t be thought of as corresponding to any feature of the text: stop words.

If you want to learn more about the book, check it out on liveBook here.