Skip to content

Latest commit

 

History

History
452 lines (397 loc) · 13.7 KB

30_Tutorial_Search.asciidoc

File metadata and controls

452 lines (397 loc) · 13.7 KB

Retrieving a document

Now that we have some data stored in Elasticsearch, we can get to work on the business requirements for this application. The first requirement is the ability to retrieve individual employee data.

This is very easy in Elasticsearch. We simply execute an HTTP GET request and specify the ``address'' of the document — the index, type and id. Using those three pieces of information, we can return the original JSON document:

GET /megacorp/employee/1

And the response contains some metadata about the document, and John Smith’s original JSON document as the _source field:

{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "John",
      "last_name" :   "Smith",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}

In the same way that we changed the HTTP verb from PUT to GET in order to retrieve the document, we could use the DELETE verb to delete the document, and the HEAD verb to check whether or not the document exists. To replace an existing document with an updated version, we just PUT it again.

Search Lite

A GET is fairly simple — you get back the document that you ask for. Let’s try something a little more advanced, like a simple search!

The first search we will try is the simplest search possible. We will search for all employees, with this request:

GET /megacorp/employee/_search

You can see that we’re still using index megacorp and type employee, but instead of specifying a document ID, we now use the _search endpoint. The response includes all three of our documents in the hits array. By default, a search will return the top 10 results.

{
   "took":      6,
   "timed_out": false,
   "_shards": { ... },
   "hits": {
      "total":      3,
      "max_score":  1,
      "hits": [
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "3",
            "_score":         1,
            "_source": {
               "first_name":  "Douglas",
               "last_name":   "Fir",
               "age":         35,
               "about":       "I like to build cabinets",
               "interests": [ "forestry" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "1",
            "_score":         1,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "2",
            "_score":         1,
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
Note
The response not only tells us which documents matched, but it also includes the whole document itself: all of the information that we need to display the search results to the user.

Next, let’s try searching for employees who have Smith'' in their last name. To do this, we’ll use a lightweight'' search method which is easy to use from the command line. This method is often referred to as a query string search, since we pass the search as a URL query string parameter:

GET /megacorp/employee/_search?q=last_name:Smith

We use the same _search endpoint in the path, and we add the query itself in the q= parameter. The results that come back show all Smith’s:

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.30685282,
      "hits": [
         {
            ...
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

Search with Query DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see Search Lite). Elasticsearch provides a rich, flexible, query language called the Query DSL, which allows us to build much more complicated, robust queries.

The DSL (Domain Specific Language) is specified using a JSON request body. We can represent the previous search for all Smith’s like so:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

This will return the same results as the previous query. You can see that a number of things have changed. For one, we are no longer using query string parameters, but instead a request body. This request body is built with JSON, and uses a match query (one of several types of queries, which we will learn about later).

More complicated searches

Let’s make the search a little more complicated. We still want to find all employees with a last name of ``Smith'', but we only want employees who are older than 30. Our query will change a little to accommodate a filter, which allows us to execute structured searches efficiently:

GET /megacorp/employee/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 } (1)
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" (2)
                }
            }
        }
    }
}
  1. This portion of the query is a range filter, which will find all ages older than 30 — gt stands for ``greater than''.

  2. This portion of the query is the same match query that we used before.

Don’t worry about the syntax too much for now, we will cover it in great detail later on. Just recognize that we’ve added a filter which performs a range search, and reused the same match query as before. Now our results only show one employee who happens to be 32 and is named ``Jane Smith'':

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.30685282,
      "hits": [
         {
            ...
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

The searches so far have been simple: single names, filtering by age. Let’s try a more advanced, full-text search — a task which traditional databases would really struggle with.

We are going to search for all employees who enjoy ``rock climbing'':

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

You can see that we use the same match query as before to search the about field for ``rock climbing''. We get back two matching documents:

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            ...
            "_score":         0.16273327, (1)
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_score":         0.016878016, (1)
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
  1. The relevance scores.

By default, Elasticsearch sorts matching results by their relevance score, that is: by how well each document matched the query. The first and highest scoring result is obvious: John Smith’s about field clearly says ``rock climbing'' in it.

But why did Jane Smith, come back as a result? The reason her document was returned is because the word rock'' was mentioned in her about field. Because only rock'' was mentioned, and not `climbing'', her `_score is lower than John’s.

This is a good example of how Elasticsearch can search within full text fields and return the most relevant results first. This concept of relevance is important to Elasticsearch, and is a concept that is completely foreign to traditional relational databases where a record either matches or it doesn’t.

Finding individual words in a field is all well and good, but sometimes you want to match exact sequences of words or phrases. For instance, we could perform a query that will only match employees that contain both rock'' and climbing'' and where the words are next to each other in the phrase ``rock climbing''.

To do this, we use a slight variation of the match query called the match_phrase query:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

Which, to no surprise, returns only John Smith’s document:

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         }
      ]
   }
}

Highlighting our searches

Many applications like to highlight snippets of text from each search result so that the user can see why the document matched their query. Retrieving highlighted fragments is very easy in Elasticsearch.

Let’s rerun our previous query, but add a new highlight parameter:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

When we run this query, the same hit is returned as before, but now we get a new section in the response called highlight. This contains a snippet of text from the about field with the matching words wrapped in <em></em> HTML tags:

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            },
            "highlight": {
               "about": [
                  "I love to go <em>rock</em> <em>climbing</em>" (1)
               ]
            }
         }
      ]
   }
}
  1. The highlighted fragment from the original text.

You can read more about the highlighting of search snippets in the {ref}search-request-highlighting.html[highlighting reference documentation].