Elasticsearch: Boosting the relevance of documents
Elasticsearch provides many features for querying and presenting data according to business needs, with as much ease as a traditional RDBMS.
Here are a few simple snippets of the different ways we can boost the relevance of records.
Setting up a simple index
Let's start by setting up an index and adding two documents:
- The index has a nested property called Categories, which the later queries use as the basis for boosting:
DELETE my-index
PUT my-index
{
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"Name": {
"type": "text"
},
"Categories" : {
"type" : "nested",
"properties" : {
"Id" : {
"type" : "integer"
},
"Level" : {
"type" : "integer"
},
"Name" : {
"type" : "text"
}
}
}
}
}
}
- Insert the following 2 documents into the index:
PUT my-index/_doc/1
{
"Id": 1,
"Name": "Alcoholic Drinks",
"Categories" : [
{
"Id" : 61348,
"Name" : "Alcoholic Drinks",
"Level" : 0
},
{
"Id" : 64346,
"Name" : "Spirits",
"Level" : 1
},
{
"Id" : 54819,
"Name" : "RTDs",
"Level" : 1
},
{
"Id" : 19467,
"Name" : "Wine",
"Level" : 1
},
{
"Id" : 59258,
"Name" : "Beer",
"Level" : 1
},
{
"Id" : 76261,
"Name" : "Cider/Perry",
"Level" : 1
},
{
"Id" : 59262,
"Name" : "Stout",
"Level" : 1
},
{
"Id" : 33534,
"Name" : "Non/Low Alcohol Beer",
"Level" : 1
}
]
}
PUT my-index/_doc/2
{
"Id": 2,
"Name": "Packaging",
"Categories" : [
{
"Id" : 68054,
"Name" : "Packaging",
"Level" : 0
},
{
"Id" : 30379,
"Name" : "Ice Cream and Frozen Desserts Packaging",
"Level" : 1
},
{
"Id" : 60793,
"Name" : "Ready Meals Packaging",
"Level" : 1
},
{
"Id" : 92362,
"Name" : "Soup Packaging",
"Level" : 1
},
{
"Id" : 52672,
"Name" : "Spirits",
"Level" : 2
},
{
"Id" : 94441,
"Name" : "RTDs",
"Level" : 2
},
{
"Id" : 92398,
"Name" : "Wine",
"Level" : 2
},
{
"Id" : 51836,
"Name" : "Beer",
"Level" : 2
},
{
"Id" : 45116,
"Name" : "Cider/Perry",
"Level" : 2
},
{
"Id" : 23913,
"Name" : "Stout",
"Level" : 2
}
]
}
As seen above, both documents have an array of Categories, defined by their Id, Name and Level.
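The nested type matters for the boosting queries later on, which combine a condition on Categories.Name with one on Categories.Level. The sketch below is a simplified model in Python, not Elasticsearch's actual implementation: it assumes that a plain (non-nested) array of objects is indexed as flat per-field value lists, while a nested field keeps each object separate. The helper names are hypothetical.

```python
# Two category objects taken from document 2 ("Packaging") above.
categories = [
    {"Name": "Packaging", "Level": 0},
    {"Name": "Beer", "Level": 2},
]


def matches_flattened(cats, name, level):
    """Model a non-nested object array: every field becomes one flat bag of
    values, so the link between a Name and its own Level is lost."""
    names = {c["Name"] for c in cats}
    levels = {c["Level"] for c in cats}
    return name in names and level in levels


def matches_nested(cats, name, level):
    """Model a nested field: both conditions must hold within the same object."""
    return any(c["Name"] == name and c["Level"] == level for c in cats)


# "Beer at Level 0" does not exist in the data.
print(matches_flattened(categories, "Beer", 0))  # True: a false positive
print(matches_nested(categories, "Beer", 0))     # False
```

Under this model, only the nested variant can tell "Beer at Level 2" apart from "Beer somewhere, Level 0 somewhere else", which is what level-based boosting relies on.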
Querying data
For starters, let's write a simple query to list documents by filtering the Name property:
GET my-index/_search
{
"query": {
"match": {
"Name": "Packaging"
}
},
"_source": [
"Name"
]
}
This returns only the Packaging document, since it is the only one whose Name matches.
Now, let's say we want to fetch the results if there's a match in any of the categories. We'd write a nested query as shown below:
GET my-index/_search
{
"query": {
"nested": {
"path": "Categories",
"query": {
"bool": {
"must": [
{
"match": {
"Categories.Name": "Beer"
}
}
]
}
}
}
},
"_source": [
"Name"
]
}
The category Beer is present in both documents, so both of them are returned.
Here comes the challenge:
Notice that for the document with name Alcoholic Drinks, the category Beer is at Level 1, whereas for Packaging it is Level 2.
How do we rewrite the above query to filter by categories AND give more relevance to documents whose matched category is at Level 1 rather than Level 2? (This would mean that, contrary to the above query's result, Alcoholic Drinks should come first, followed by Packaging.)
This is where boosting comes into the picture!
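To make the target concrete, here is a small Python sketch of the ordering we are after. It is only a statement of the desired outcome, not of how Elasticsearch scores documents; the data is a trimmed copy of the two documents above, and the function names are hypothetical.

```python
# Only the fields relevant to the challenge, copied from the documents above.
docs = [
    {"Name": "Packaging",
     "Categories": [{"Name": "Packaging", "Level": 0},
                    {"Name": "Beer", "Level": 2}]},
    {"Name": "Alcoholic Drinks",
     "Categories": [{"Name": "Alcoholic Drinks", "Level": 0},
                    {"Name": "Beer", "Level": 1}]},
]


def matched_level(doc, category_name):
    """Return the lowest Level at which the category occurs, or None."""
    levels = [c["Level"] for c in doc["Categories"]
              if c["Name"].lower() == category_name.lower()]
    return min(levels) if levels else None


def desired_order(documents, category_name):
    """Keep documents containing the category, lowest matched Level first."""
    hits = [(matched_level(d, category_name), d["Name"]) for d in documents]
    hits = [h for h in hits if h[0] is not None]
    return [name for _, name in sorted(hits)]


print(desired_order(docs, "beer"))  # ['Alcoholic Drinks', 'Packaging']
```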
Boost records
Boost using Conditions
There are many ways to boost records, the simplest being conditions in the query itself. In our example, we want to boost category level 1 more than category level 2, so we write two sub-queries along the lines of this pseudo-code:
(Categories.Name == 'beer' && Boost(Categories.Level == 1, 4))
|| (Categories.Name == 'beer' && Boost(Categories.Level == 2, 1))
Following the pseudo-code, we wrap each Name+Level pair in a must (logical AND), and both of these sub-queries in a should (logical OR).
The query would look like this:
GET my-index/_search
{
"query": {
"nested": {
"path": "Categories",
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"bool": {
"must": [
{
"match": {
"Categories.Name": {
"query": "beer"
}
}
},
{
"match": {
"Categories.Level": {
"query": 1,
"boost": 4
}
}
}
]
}
},
{
"bool": {
"must": [
{
"match": {
"Categories.Name": {
"query": "beer"
}
}
},
{
"match": {
"Categories.Level": {
"query": 2,
"boost": 1
}
}
}
]
}
}
]
}
}
]
}
}
}
},
"_source": [
"Name"
]
}
With this query, Alcoholic Drinks is ranked above Packaging.
This approach has a caveat: if we were to boost on even more levels, we would have to add a condition for each one, and the query would become bulky. Luckily, there are built-in utilities that help us write cleaner queries.
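If the condition-based approach is still preferred, one way to keep it manageable is to generate the clauses instead of writing them by hand. The following Python sketch builds the same request body shape as the query above from a level-to-boost mapping. The function name and the example boosts for level 3 are assumptions for illustration, and the sketch only builds the body; sending it to Elasticsearch is left out.

```python
import json


def level_boost_query(category_name, boosts):
    """Build the nested bool query with one should clause per (level, boost)."""
    should = [
        {
            "bool": {
                "must": [
                    {"match": {"Categories.Name": {"query": category_name}}},
                    {"match": {"Categories.Level": {"query": level,
                                                    "boost": boost}}},
                ]
            }
        }
        for level, boost in sorted(boosts.items())
    ]
    return {
        "query": {
            "nested": {
                "path": "Categories",
                # Same wrapping as the hand-written query: must -> bool -> should.
                "query": {"bool": {"must": [{"bool": {"should": should}}]}},
            }
        },
        "_source": ["Name"],
    }


# Hypothetical mapping: level 1 boosted most, level 3 least.
body = level_boost_query("beer", {1: 4, 2: 2, 3: 1})
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET my-index/_search. The request itself still grows with every level; only the effort of maintaining it shrinks.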
Boost using function_score
We can wrap our query in function_score and define functions that adjust the relevance of records based on a secondary match.
Here's a simple query using only weights in the functions:
GET my-index/_search
{
"query": {
"nested": {
"path": "Categories",
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"match": {
"Categories.Name": "Beer"
}
},
"functions": [
{
"filter": {
"match": {
"Categories.Level": 1
}
},
"weight": 5
},
{
"filter": {
"match": {
"Categories.Level": 2
}
},
"weight": 2
}
]
}
}
]
}
}
}
},
"_source": [
"Name"
]
}
The weight of the matching function is factored into the document's score, so the query gives the desired result: Alcoholic Drinks ranks above Packaging.
But this has the same disadvantage: the query becomes bulky as more functions are added. This is where the next approach comes into play.
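To see how the weights influence the ranking, here is a rough Python model of the arithmetic. It assumes score_mode and boost_mode are left at their defaults, which to my knowledge are both multiply, and that a document matching none of the filters keeps a neutral factor of 1. The query score of 1.0 is a placeholder; real scores from Elasticsearch will differ.

```python
# Weighted functions from the query above: (Level the filter matches, weight).
FUNCTIONS = [(1, 5), (2, 2)]


def boosted_score(query_score, level, functions=FUNCTIONS):
    """Combine a base query score with the weights of all matching filters."""
    factor = 1.0  # neutral when no filter matches (assumption)
    for filter_level, weight in functions:
        if level == filter_level:
            factor *= weight  # score_mode "multiply" (assumed default)
    return query_score * factor  # boost_mode "multiply" (assumed default)


for level in (1, 2, 3):
    print(level, boosted_score(1.0, level))
```

Under these assumptions a level 1 match is multiplied by 5, a level 2 match by 2, and any other level is left unchanged, so every additional level that needs its own multiplier needs its own function entry.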
Boost using script_score
The script_score utility allows us to define transformations based on one or more values of our choice and generate the score however we want.
Since our use case requires giving more relevance to level 1 than to level 2, and so on, we can do something as simple as dividing an arbitrary constant by the level, so that level 1 yields a higher value than level 2.
The query would look something like this:
GET my-index/_search
{
"query": {
"nested": {
"path": "Categories",
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"match": {
"Categories.Name": "Beer"
}
},
"script_score": {
"script": "4.0 / (1 + doc['Categories.Level'].value)"
}
}
}
]
}
}
}
},
"_source": [
"Name"
]
}
In the above query, the higher the category level, the lower the quotient, and thus the lower the score, so Alcoholic Drinks again ranks above Packaging.
Also, we don't need to add anything more for levels 3 and beyond, which keeps the query short.
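A formula like this is easy to sanity-check outside Elasticsearch before relying on it. The sketch below assumes Painless follows Java-style arithmetic, where dividing two integers truncates the result, and compares a truncating 4 / level with a floating-point variant 4.0 / (1 + level). Both helper names are hypothetical.

```python
def truncating_score(level, numerator=4):
    """Integer division, as expected when both operands are integers."""
    return numerator // level


def floating_score(level, numerator=4.0):
    """Floating-point division; the +1 offset keeps Level 0 well defined."""
    return numerator / (1 + level)


for level in range(1, 7):
    print(level, truncating_score(level), round(floating_score(level), 3))
```

In this model the truncating form gives levels 3 and 4 the same value and drops to 0 from level 5 onward, and it cannot handle Level 0 at all, whereas the floating-point form keeps decreasing steadily at every level.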
Wrapping up
There are various ways to improve how search results are ranked, and the above approaches can serve as stepping stones for building more complex queries. Anyone interested can go through Elastic's own documentation and look into function_score and script_score in depth.
Written by Akshay Kumar R