Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Learning Elasticsearch
Learning Elasticsearch

Learning Elasticsearch: Structured and unstructured data using distributed real-time search and analytics

Arrow left icon
Profile Icon Andhavarapu
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3 (4 Ratings)
Paperback Jun 2017 404 pages 1st Edition
eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Andhavarapu
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3 (4 Ratings)
Paperback Jun 2017 404 pages 1st Edition
eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Learning Elasticsearch

Introduction to Elasticsearch

In this chapter, we will focus on the basic concepts of Elasticsearch. We will start by explaining the building blocks and then discuss how to create, modify and query in Elasticsearch. Getting started with Elasticsearch is very easy; most operations come with default settings. The default settings can be overridden when you need more advanced features.

I first started using Elasticsearch in 2012 as a backend search engine to power our Analytics dashboards. It has been more than five years, and I never looked for any other technologies for our search needs. Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before. To understand how this magic happens, we will briefly discuss how Elasticsearch works internally and then discuss how to talk to Elasticsearch. Knowing how it works internally will help you understand its strengths and limitations. Elasticsearch, like any other open source technology, is very rapidly evolving, but the core fundamentals that power Elasticsearch don't change. By the end of this chapter, we will have covered the following:

  • Basic concepts of Elasticsearch
  • How to interact with Elasticsearch
  • How to create, read, update, and delete
  • How does search work
  • Availability and horizontal scalability
  • Failure handling
  • Strengths and limitations

Basic concepts of Elasticsearch

Elasticsearch is a highly scalable open source search engine. Although it started as a text search engine, it is evolving as an analytical engine, which can support not only search but also complex aggregations. Its distributed nature and ease of use makes it very easy to get started and scale as you have more data.

One might ask what makes Elasticsearch different from any other document stores out there. Elasticsearch is a search engine and not just a key-value store. It's also a very powerful analytical engine; all the queries that you would usually run in a batch or offline mode can be executed in real time. Support for features such as autocomplete, geo-location based filters, multilevel aggregations, coupled with its user friendliness resulted in industry-wide acceptance. That being said, I always believe it is important to have the right tool for the right job. Towards the end of the chapter, we will discuss it’s strengths and limitations.

In this section, we will go through the basic concepts and terminology of Elasticsearch. We will start by explaining how to insert, update, and perform a search. If you are familiar with SQL language, the following table shows the equivalent terms in Elasticsearch:

Database Table Row Column
Index Type Document Field

Document

Your data in Elasticsearch is stored as JSON (Javascript Object Notation) documents. Most NoSQL data stores use JSON to store their data as JSON format is very concise, flexible, and readily understood by humans. A document in Elasticsearch is very similar to a row when compared to a relational database. Let's say we have a User table with the following information:

Id Name Age Gender Email
1 Luke 100 M [email protected]
2 Leia 100 F [email protected]

The users in the preceding user table, when represented in JSON format, will look like the following:

{
"id": 1,
"name": "Luke",
"age": 100,
"gender": "M",
"email": "[email protected]"
} {
"id": 2,
"name": "Leia",
"age": 100,
"gender": "F",
"email": "[email protected]"
}

A row contains columns; similarly, a document contains fields. Elasticsearch documents are very flexible and support storing nested objects. For example, an existing user document can be easily extended to include the address information. To capture similar information using a table structure, you need to create a new address table and manage the relations using a foreign key. The user document with the address is shown here:

{
"id": 1,
"name": "Luke",
"age": 100,
"gender": "M",
"email": "[email protected]",
"address": {
"street": "123 High Lane",
"city": "Big City",
"state": "Small State",
"zip": 12345
}
}

Reading similar information without the JSON structure would also be difficult as the information would have to be read from multiple tables. Elasticsearch allows you to store the entire JSON as it is. For a database table, the schema has to be defined before you can use the table. Elasticsearch is built to handle unstructured data and can automatically determine the data types for the fields in the document. You can index new documents or add new fields without adding or changing the schema. This process is also known as dynamic mapping. We will discuss how this works and how to define schema in Chapter 3, Modeling Your Data and Document Relations.

Index

An index is similar to a database. The term index should not be confused with a database index, as someone familiar with traditional SQL might assume. Your data is stored in one or more indexes just like you would store it in one or more databases. The word indexing means inserting/updating the documents into an Elasticsearch index. The name of the index must be unique and typed in all lowercase letters. For example, in an e-commerce world, you would have an index for the items--one for orders, one for customer information, and so on.

Type

A type is similar to a database table, an index can have one or more types. Type is a logical separation of different kinds of data. For example, if you are building a blog application, you would have a type defined for articles in the blog and a type defined for comments in the blog. Let's say we have two types--articles and comments.

The following is the document that belongs to the article type:

{
"articleid": 1,
"name": "Introduction to Elasticsearch"
}

The following is the document that belongs to the comment type:

{
"commentid": "AVmKvtPwWuEuqke_aRsm",
"articleid": 1,
"comment": "Its Awesome !!"
}

We can also define relations between different types. For example, a parent/child relation can be defined between articles and comments. An article (parent) can have one or more comments (children). We will discuss relations further in Chapter 3, Modeling Your Data and Document Relations.

Cluster and node

In a traditional database system, we usually have only one server serving all the requests. Elasticsearch is a distributed system, meaning it is made up of one or more nodes (servers) that act as a single application, which enables it to scale and handle load beyond what a single server can handle. Each node (server) has part of the data. You can start running Elasticsearch with just one node and add more nodes, or, in other words, scale the cluster when you have more data. A cluster with three nodes is shown in the following diagram:

In the preceding diagram, the cluster has three nodes with the names elasticsearch1, elasticsearch2, elasticsearch3. These three nodes work together to handle all the indexing and query requests on the data. Each cluster is identified by a unique name, which defaults to elasticsearch. It is often common to have multiple clusters, one for each environment, such as staging, pre-production, production.

Just like a cluster, each node is identified by a unique name. Elasticsearch will automatically assign a unique name to each node if the name is not specified in the configuration. Depending on your application needs, you can add and remove nodes (servers) on the fly. Adding and removing nodes is seamlessly handled by Elasticsearch.

We will discuss how to set up an Elasticsearch cluster in Chapter 2, Setting Up Elasticsearch and Kibana.

Shard

An index is a collection of one or more shards. All the data that belongs to an index is distributed across multiple shards. By spreading the data that belongs to an index to multiple shards, Elasticsearch can store information beyond what a single server can store. Elasticsearch uses Apache Lucene internally to index and query the data. A shard is nothing but an Apache Lucene instance. We will discuss Apache Lucene and why Elasticsearch uses Lucene in the How search works section later.

I know we introduced a lot of new terms in this section. For now, just remember that all data that belongs to an index is spread across one or more shards. We will discuss how shards work in the Scalability and Availability section towards the end of this chapter.

Interacting with Elasticsearch

The primary way of interacting with Elasticsearch is via REST API. Elasticsearch provides JSON-based REST API over HTTP. By default, Elasticsearch REST API runs on port 9200. Anything from creating an index to shutting down a node is a simple REST call. The APIs are broadly classified into the following:

  • Document APIs: CRUD (Create Retrieve Update Delete) operations on documents
  • Search APIs: For all the search operations
  • Indices APIs: For managing indices (creating an index, deleting an index, and so on)
  • Cat APIs: Instead of JSON, the data is returned in tabular form
  • Cluster APIs: For managing the cluster

We have a chapter dedicated to each one of them to discuss more in detail. For example, indexing documents in Chapter 4, Indexing and Updating Your Data and search in Chapter 6, All About Search and so on. In this section, we will go through some basic CRUD using the Document APIs. This section is simply a brief introduction on how to manipulate data using Document APIs. To use Elasticsearch in your application, clients in all major languages, such as Java, Python, are also provided. The majority of the clients acts as a wrapper around the REST API.

To better explain the CRUD operations, imagine we are building an e-commerce site. And we want to use Elasticsearch to power its search functionality. We will use an index named chapter1 and store all the products in the type called product. Each product we want to index is represented by a JSON document. We will start by creating a new product document, and then we will retrieve a product by its identifier, followed by updating a product's category and deleting a product using its identifier.

Creating a document

A new document can be added using the Document API's. For the e-commerce example, to add a new product, we execute the following command. The body of the request is the product document we want to index.

PUT http://localhost:9200/chapter1/product/1
{
"title": "Learning Elasticsearch",
"author": "Abhishek Andhavarapu",
"category": "books"
}

Let's inspect the request:

INDEX chapter1
TYPE product
IDENTIFIER 1
DOCUMENT JSON
HTTP METHOD PUT

The document's properties, such as title, author, the category, are also known as fields, which are similar to SQL columns.

Elasticsearch will automatically create the index chapter1 and type product if they don't exist already. It will create the index with the default settings.

When we execute the preceding request, Elasticsearch responds with a JSON response, shown as follows:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 1,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"created": true
}

In the response, you can see that Elasticsearch created the document and the version of the document is 1. Since you are creating the document using the HTTP PUT method, you are required to specify the document identifier. If you don’t specify the identifier, Elasticsearch will respond with the following error message:

No handler found for uri [/chapter1/product/] and method [PUT]

If you don’t have a unique identifier, you can let Elasticsearch assign an identifier for you, but you should use the POST HTTP method. For example, if you are indexing log messages, you will not have a unique identifier for each log message, and you can let Elasticsearch assign the identifier for you.

In general, we use the HTTP POST method for creating an object. The HTTP PUT method can also be used for object creation, where the client provides the unique identifier instead of the server assigning the identifier.

We can index a document without specifying a unique identifier as shown here:

POST http://localhost:9200/chapter1/product/
{
"title": "Learning Elasticsearch",
"author": "Abhishek Andhavarapu",
"category": "books"
}

In the above request, URL doesn't contain the unique identifier and we are using the HTTP POST method. Let's inspect the request:

INDEX chapter1
TYPE product
DOCUMENT JSON
HTTP METHOD POST

The response from Elasticsearch is shown as follows:

{
"_index": "chapter1",
"_type": "product",
"_id": "AVmKvtPwWuEuqke_aRsm",
"_version": 1,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"created": true
}

You can see from the response that Elasticsearch assigned the unique identifier AVmKvtPwWuEuqke_aRsm to the document and created flag is set to true. If a document with the same unique identifier already exists, Elasticsearch replaces the existing document and increments the document version. If you have to run the same PUT request from the beginning of the section, the response from Elasticsearch would be this:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 2,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"created": false
}

In the response, you can see that the created flag is false since the document with id: 1 already exists. Also, observe that the version is now 2.

Retrieving an existing document

To retrieve an existing document, we need the index, type and a unique identifier of the document. Let’s try to retrieve the document we just indexed. To retrieve a document we need to use HTTP GET method as shown below:

GET http://localhost:9200/chapter1/product/1

Let’s inspect the request:

INDEX chapter1
TYPE product
IDENTIFIER 1
HTTP METHOD GET

Response from Elasticsearch as shown below contains the product document we indexed in the previous section:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 2,
"found": true,
"_source": {
"title": "Learning Elasticsearch",
"author": "Abhishek Andhavarapu",
"category": "books"
}
}

The actual JSON document will be stored in the _source field. Also note the version in the response; every time the document is updated, the version is increased.

Updating an existing document

Updating a document in Elasticsearch is more complicated than in a traditional SQL database. Internally, Elasticsearch retrieves the old document, applies the changes, and re-inserts the document as a new document. The update operation is very expensive. There are different ways of updating a document. We will talk about updating a partial document here and in more detail in the Updating your data section in Chapter 4, Indexing and Updating Your Data.

Updating a partial document

We already indexed the document with the unique identifier 1, and now we need to update the category of the product from just books to technical books. We can update the document as shown here:

 POST http://localhost:9200/chapter1/product/1/_update
{
"doc": {
"category": "technical books"
}
}

The body of the request is the field of the document we want to update and the unique identifier is passed in the URL.

Please note the _update endpoint at the end of the URL.

The response from Elasticsearch is shown here:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 3,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
}
}

As you can see in the response, the operation is successful, and the version of the document is now 3. More complicated update operations are possible using scripts and upserts.

Deleting an existing document

For creating and retrieving a document, we used the POST and GET methods. For deleting an existing document, we need to use the HTTP DELETE method and pass the unique identifier of the document in the URL as shown here:

DELETE http://localhost:9200/chapter1/product/1

Let's inspect the request:

INDEX chapter1
TYPE product
IDENTIFIER 1
HTTP METHOD DELETE

The response from Elasticsearch is shown here:

{
"found": true,
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 4,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
}
}

In the response, you can see that Elasticsearch was able to find the document with the unique identifier 1 and was successful in deleting the document.

How does search work?

In the previous section, we discussed how to create, update, and delete documents. In this section, we will briefly discuss how search works internally and explain the basic query APIs. Mostly, I want to talk about the inverted index and Apache Lucene. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. We will discuss Search API in detail in Chapter 6, All About Search.

Importance of information retrieval

As the computation power is increasing and cost of storage is decreasing, the amount of day-to-day data we deal with is growing exponentially. But without a way to retrieve the information and to be able to query it, the information we collect doesn't help.

Information retrieval systems are very important to make sense of the data. Imagine how hard it would be to find some information on the Internet without Google or other search engines out there. Information is not knowledge without information retrieval systems.

Simple search query

Let's say we have a User table as shown here:

Id Name Age Gender Email
1 Luke 100 M [email protected]
2 Leia 100 F [email protected]

Now, we want to query for all the users with the name Luke. A SQL query to achieve this would be something like this:

select * from user where name like ‘%luke%’

To do a similar task in Elasticsearch, you can use the search API and execute the following command:

GET http://127.0.0.1:9200/chapter1/user/_search?q=name:luke

Let's inspect the request:

INDEX chapter1
TYPE user
FIELD name

Just like you would get all the rows in the User table as a result of the SQL query, the response to the Elasticsearch query would be JSON documents:

{
"id": 1,
"name": "Luke",
"age": 100,
"gender": "M",
"email": "[email protected]"
}

Querying using the URL parameters can be used for simple queries as shown above. For more practical queries, you should pass the query represented as JSON in the request body. The same query passed in the request body is shown here:

POST http://127.0.0.1:9200/chapter1/user/_search 
{
"query": {
"term": {
"name": "luke"
}
}
}

The Search API is very flexible and supports different kinds of filters, sort, pagination, and aggregations.

Inverted index

Before we talk more about search, I want to talk about the inverted index. Knowing about inverted index will help you understand the limitations and strengths of Elasticsearch compared with the traditional database systems out there. Inverted index at the core is how Elasticsearch is different from other NoSQL stores, such as MongoDB, Cassandra, and so on.

We can compare an inverted index to an old library catalog card system. When you need some information/book in a library, you will use the card catalog, usually at the entrance of the library, to find the book. An inverted index is similar to the card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear.

Document1: Fear leads to anger

Document2: Anger leads to hate

Document3: Hate leads to suffering

In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Computer-based information retrieval systems do the same.

Without the inverted index, the application has to go through each web page and check whether the word exists in the web page. An inverted index is similar to the following table. It is like a map with the term as a key and list of the documents the term appears in as value.

Term Document
Fear 1
Anger 1,2
Hate 2,3
Suffering 3
Leads 1,2,3

Once we construct an index, as shown in this table, to find all the documents with the term fear is now just a lookup. Just like when a library gets a new book, the book is added to the card catalog, we keep building an inverted index as we encounter a new web page. The preceding inverted index takes care of simple use cases, such as searching for the single term. But in reality, we query for much more complicated things, and we don’t use the exact words. Now let’s say we encountered a document containing the following:

Yosemite national park may be closed for the weekend due to forecast of substantial rainfall

We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in the human language, we might query something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer this query as there are no common terms between the query and the document, as shown:

Document Query
rainfall rain

To be able to answer queries like this and to improve the search quality, we employ various techniques such as stemming, synonyms discussed in the following sections.

Stemming

Stemming is the process of reducing a derived word into its root word. For example, rain, raining, rained, rainfall has the common root word "rain". When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we end up storing rain, raining, rained in the index, and search relevance would be very low. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of the user finding what he is looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain term rain.

We can configure stemming in Elasticsearch using Analyzers. We will discuss how to set up and configure analyzers in Chapter 3, Modeling Your Data and Document Relations.

Synonyms

Similar to rain and raining, weekend and sunday mean the same thing. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language deals with a lot of things, such as tense, gender, numbers. Stemming and synonyms will not only improve the search quality but also reduce the index size by removing the differences between similar words.

More examples:

Pen, Pen[s] -> Pen

Eat, Eating -> Eat

Phrase search

As a user, we almost always search for phrases rather than single words. The inverted index in the previous section would work great for individual terms but not for phrases. Continuing the previous example, if we want to query all the documents with a phrase anger leads to in the inverted index, the previous index would not be sufficient. The inverted index for terms anger and leads is shown below:

Term Document
Anger 1,2
Leads 1,2,3

From the preceding table, the words anger and leads exist both in document1 and document2. To support phrase search along with the document, we also need to record the position of the word in the document. The inverted index with word position is shown here:

Term Document
Fear 1:1
Anger 1:3, 2:1
Hate 2:3, 3:1
Suffering 3:3
Leads 1:2, 2:2, 3:2

Now, since we have the information regarding the position of the word, we can search if a document has the terms in the same order as the query.

Term Document
anger 1:3, 2:1
leads 1:2, 2:2

Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 would be a better match than document1. With the inverted index, any query on the documents is just a simple lookup. This is just an introduction to inverted index; in real life, it's much more complicated, but the fundamentals remain the same. When the documents are indexed into Elasticsearch, documents are processed into the inverted index.

Apache Lucene

Apache Lucene is one of the most matured implementations of the inverted index. Lucene is an open source full-text search library. It's very high performing, entirely written in Java. Any application that requires text search can use Lucene. It allows adding full-text search capabilities to any application. Elasticsearch uses Apache Lucene to manage and create its inverted index. To learn more about Apache Lucene, please visit http://lucene.apache.org/core/.

We will talk about how distributed search works in Elasticsearch in the next section.

The term index is used both by Apache Lucene (inverted index) and Elasticsearch index. For the remainder of the book, unless specified the term index refers to an Elasticsearch index.

Scalability and availability

Let's say you want to index a billion documents; having just a single machine might be very challenging. Partitioning data across multiple machines allows Elasticsearch to scale beyond what a single machine do and support high throughput operations. Your data is split into small parts called shards. When you create a index, you need to tell Elasticsearch the number of shards you want for the index and Elasticsearch handles the rest for you. As you have more data, you can scale horizontally by adding more machines. We will go in to more details in the sections below.

There are type of shards in Elasticsearch - primary and replica. The data you index is written to both primary and replica shards. Replica is the exact copy of the primary. In case of the node containing the primary shard goes down, the replica takes over. This process is completely transparent and managed by Elasticsearch. We will discuss this in detail in the Failure Handling section below. Since primary and replicas are the exact copies, a search query can be answered by either the primary or the replica shard. This significantly increases the number of simultaneous requests Elasticsearch can handle at any point in time.

As the index is distributed across multiple shards, a query against an index is executed in parallel across all the shards. The results from each shard are then gathered and sent back to the client. Executing the query in parallel greatly improves the search performance.

In the next section, we will discuss the relation between node, index and shard.

Relation between node, index, and shard

Shard is often the most confusing topic when I talk about Elasticsearch at conferences or to someone who has never worked on Elasticsearch. In this section, I want to focus on the relation between node, index, and shard. We will use a cluster with three nodes and create the same index with multiple shard configuration, and we will talk through the differences.

Three shards with zero replicas

We will start with an index called esintroduction with three shards and zero replicas. The distribution of the shards in a three node cluster is as follows:

In the above screenshot, shards are represented by the green squares. We will talk about replicas towards the end of this discussion. Since we have three nodes(servers) and three shards, the shards are evenly distributed across all three nodes. Each node will contain one shard. As you index your documents into the esintroduction index, data is spread across the three shards.

Six shards with zero replicas

Now, let's recreate the same esintroduction index with six shards and zero replicas. Since we have three nodes (servers) and six shards, each node will now contain two shards. The esintroduction index is split between six shards across three nodes.

The distribution of shards for an index with six shards is as follows:

The esintroduction index is spread across three nodes, meaning these three nodes will handle the index/query requests for the index. If these three nodes are not able to keep up with the indexing/search load, we can scale the esintroduction index by adding more nodes. Since the index has six shards, you could add three more nodes, and Elasticsearch automatically rearranges the shards across all six nodes. Now, index/query requests for the esintroduction index will be handled by six nodes instead of three nodes. If this is not clear, do not worry, we will discuss more about this as we progress in the book.

Six shards with one replica

Let's now recreate the same esintroduction index with six shards and one replica, meaning the index will have 6 primary shards and 6 replica shards, a total of 12 shards. Since we have three nodes (servers) and twelve shards, each node will now contain four shards. The esintroduction index is split between six shards across three nodes. The green squares represent shards in the following figure.

The solid border represents primary shards, and replicas are the dotted squares:

As we discussed before, the index is distributed into multiple shards across multiple nodes. In a distributed environment, a node/server can go down due to various reasons, such as disk failure, network issue, and so on. To ensure availability, each shard, by default, is replicated to a node other than where the primary shard exists. If the node containing the primary shard goes down, the shard replica is promoted to primary, and the data is not lost, and you can continue to operate on the index. In the preceding figure, the esintroduction index has six shards split across the three nodes. The primary of shard 2 belongs to node elasticsearch 1, and the replica of the shard 2 belongs to node elasticsearch 3. In the case of the elasticsearch 1 node going down, the replica in elasticsearch 3 is promoted to primary. This switch is completely transparent and handled by Elasticsearch.

Distributed search

One of the reasons queries executed on Elasticsearch are so fast is because they are distributed. Multiple shards act as one index. A search query on an index is executed in parallel across all the shards.

Let's take an example: in the following figure, we have a cluster with two nodes: Node1, Node2 and an index named chapter1 with two shards: S0, S1 with one replica:

Assuming the chapter1 index has 100 documents, S1 would have 50 documents, and S0 would have 50 documents. And you want to query for all the documents that contain the word Elasticsearch. The query is executed on S0 and S1 in parallel. The results are gathered back from both the shards and sent back to the client. Imagine, you have to query across million of documents, using Elasticsearch the search can be distributed. For the application I'm currently working on, a query on more than 100 million documents comes back within 50 milliseconds; which is simply not possible if the search is not distributed.

Failure handling

Elasticsearch handles failures automatically. This section describes how the failures are handled internally. Let’s say we have an index with two shards and one replica. In the following diagram, the shards represented in solid line are primary shards, and the shards in the dotted line are replicas:

As shown in preceding diagram, we initially have a cluster with two nodes. Since the index has two shards and one replica, shards are distributed across the two nodes. To ensure availability, primary and replica shards never exist in the same node. If the node containing both primary and replica shards goes down, the data cannot be recovered. In the preceding diagram, you can see that the primary shard S0 belongs to Node 1 and the replica shard S0 to the Node 2.

Next, just like we discussed in the Relation between Node, Index and Shard section, we will add two new nodes to the existing cluster, as shown here:

The cluster now contains four nodes, and the shards are automatically allocated to the new nodes. Each node in the cluster will now contain either a primary or replica shard. Now, let's say Node2, which contains the primary shard S1, goes down as shown here:

Since the node that holds the primary shard went down, the replica of S1, which lives in Node3, is promoted to primary. To ensure the replication factor of 1, a copy of the shard S1 is made on Node1. This process is known as rebalancing of the cluster.

Depending on the application, the number of shards can be configured while creating the index. The process of rebalancing the shards to other nodes is entirely transparent to the user and handled automatically by Elasticsearch.

Strengths and limitations of Elasticsearch

The strengths of Elasticsearch are as follows:

  • Very flexible Query API:
    • It supports JSON-based REST API.
    • Clients are available for all major languages, such as Java, Python, PHP, and so on.
    • It supports filtering, sort, pagination, and aggregations in the same query.
  • Supports auto/dynamic mapping:
    • In the traditional SQL world, you should predefine the table schema before you can add data. Elasticsearch handles unstructured data automatically, meaning you can index JSON documents without predefining the schema. It will try to figure out the field mappings automatically.
    • Adding/removing the new/existing fields is also handled automatically.
  • Highly scalable:
    • Clustering, replication of data, automatic failover are supported out of the box and are completely transparent to the user. For more details, refer to the Availability and Horizontal Scalability section.
  • Multi-language support:
    • We discussed how stemming works and why it is important to remove the difference between the different forms of root words. This process is completely different for different languages. Elasticsearch supports many languages out of the box.
  • Aggregations:
    • Aggregations are one of the reasons why Elasticsearch is like nothing out there.
    • It comes with very a powerful analytics engine, which can help you slice and dice your data.
    • It supports nested aggregations. For example, you can group users first by the city they live in and then by their gender and then calculate the average age of each bucket.
  • Performance:
    • Due to the inverted index and the distributed nature, it is extremely high performing. The queries you traditionally run using a batch processing engine, such as Hadoop, can now be executed in real time.
  • Intelligent filter caching:
    • The most recently used queries are cached. When the data is modified, the cache is invalidated automatically.

The limitations of Elasticsearch are as follows:

  • Not real time - eventual consistency (near real time):
    • The data you index is only available for search after 1 sec. A process known as refresh wakes up every 1 sec by default and makes the data searchable.
  • Doesn't support SQL like joins but provides parent-child and nested to handle relations.
  • Doesn't support transactions and rollbacks: Transactions in a distributed system are expensive. It offers version-based control to make sure the update is happening on the latest version of the document.
  • Updates are expensive. An update on the existing document deletes the document and re-inserts it as a new document.
  • Elasticsearch might lose data due to the following reasons:

We will discuss all these concepts in detail in the further chapters.

Summary

In this chapter, you learned the basic concepts of Elasticsearch. Elasticsearch REST APIs make most operations simple and straightforward. We discussed how to index, update, and delete documents. You also learned how distributed search works and how failures are handled automatically. At the end of the chapter, we discussed various strengths and limitations of Elasticsearch.

In the next chapter, we will discuss how to set up Elasticsearch and Kibana. Installing Elasticsearch is very easy as it's designed to run out of the box. Throughout this book, several examples have been used to better explain various concepts. Once you have Elasticsearch up and running, you can try the queries for yourself.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Get to grips with the basics of Elasticsearch concepts and its APIs, and use them to create efficient applications
  • Create large-scale Elasticsearch clusters and perform analytics using aggregation
  • This comprehensive guide will get you up and running with Elasticsearch 5.x in no time

Description

Elasticsearch is a modern, fast, distributed, scalable, fault tolerant, and open source search and analytics engine. You can use Elasticsearch for small or large applications with billions of documents. It is built to scale horizontally and can handle both structured and unstructured data. Packed with easy-to- follow examples, this book will ensure you will have a firm understanding of the basics of Elasticsearch and know how to utilize its capabilities efficiently. You will install and set up Elasticsearch and Kibana, and handle documents using the Distributed Document Store. You will see how to query, search, and index your data, and perform aggregation-based analytics with ease. You will see how to use Kibana to explore and visualize your data. Further on, you will learn to handle document relationships, work with geospatial data, and much more, with this easy-to-follow guide. Finally, you will see how you can set up and scale your Elasticsearch clusters in production environments.

Who is this book for?

If you want to build efficient search and analytics applications using Elasticsearch, this book is for you. It will also benefit developers who have worked with Lucene or Solr before and now want to work with Elasticsearch. No previous knowledge of Elasticsearch is expected.

What you will learn

  • See how to set up and configure Elasticsearch and Kibana
  • Know how to ingest structured and unstructured data using Elasticsearch
  • Understand how a search engine works and the concepts of relevance and scoring
  • Find out how to query Elasticsearch with a high degree of performance and scalability
  • Improve the user experience by using autocomplete, geolocation queries, and much more
  • See how to slice and dice your data using Elasticsearch aggregations.
  • Grasp how to use Kibana to explore and visualize your data
  • Know how to host on Elastic Cloud and how to use the latest X-Pack features such as Graph and Alerting

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jun 30, 2017
Length: 404 pages
Edition : 1st
Language : English
ISBN-13 : 9781787128453
Vendor :
Elastic
Category :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Jun 30, 2017
Length: 404 pages
Edition : 1st
Language : English
ISBN-13 : 9781787128453
Vendor :
Elastic
Category :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 154.97
Learning Elastic Stack 6.0
$38.99
Learning Elasticsearch
$54.99
Elasticsearch 5.x Cookbook
$60.99
Total $ 154.97 Stars icon
Banner background image

Table of Contents

10 Chapters
Introduction to Elasticsearch Chevron down icon Chevron up icon
Setting Up Elasticsearch and Kibana Chevron down icon Chevron up icon
Modeling Your Data and Document Relations Chevron down icon Chevron up icon
Indexing and Updating Your Data Chevron down icon Chevron up icon
Organizing Your Data and Bulk Data Ingestion Chevron down icon Chevron up icon
All About Search Chevron down icon Chevron up icon
More Than a Search Engine (Geofilters, Autocomplete, and More) Chevron down icon Chevron up icon
How to Slice and Dice Your Data Using Aggregations Chevron down icon Chevron up icon
Production and Beyond Chevron down icon Chevron up icon
Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting) Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3
(4 Ratings)
5 star 50%
4 star 25%
3 star 25%
2 star 0%
1 star 0%
Ravikanth Dec 11, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
When Abhishek called me after he almost finished writing his book on elastic search he had a thought that bothered him. Is there a necessity for another book where are so many other experts who wrote books on elastic search?The language “C” was created by Dennis Ritchie but his book is not the most used or referred book in C. Creating something and making something approachable are two different skills.Abhishek has made the book very approachable. I haven’t read the book entirely but I have already used it(If you are an architect or a developer you know that it is a good sign). The book is very hands-on and gives you the right amount of theory for you to make your technological choices. When we approach a new technology we want the stack up and running as soon as possible so that you can play around with it and test the things you want to test. The theory becomes important for you to make long-term choices. The book plays really well on the first count and reasonably well on the second one.If you are planning to use Dynamodb in your architecture you will find that elastic search becomes a necessity because of the limitation of querying capabilities in Dynamodb. If you are moving in that direction get this book, you will most likely need it.There is one missing chapter Abhishek should consider adding and that is about analyzers. Everybody needs to add some twist to the default search capabilities provided by elastic search. Just giving exposure helps people walk that road with ease.There were two times when I had to ask him personally about a couple of questions. One of which is an email and other one was a short phone call to verify if the architecture in my mind was right. The phone call was because I did not read enough of the book at that time.Full-Disclosure - Abhishek is my cousin whom I know since his birth and hence I have a vested interest in his success. That does not mean anything I wrote so far is inaccurate.@Abhishek - You did not send a free copy to me. I had to buy it just like everyone else(from packtpub). I am going to remember this :P
Amazon Verified review Amazon
Jim O'Quinn Jan 27, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Why didn't you put the Elasticsearch version on the cover page of your book? In the description here on Amazon?I hope it covers 6.X otherwise I'm returning it.
Amazon Verified review Amazon
J. Franks Nov 24, 2018
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
I worked my way through this on a camping trip and now use it regularly when I need to reindex etc. I don't use ElasticSearch every day but do need to work out queries or fix problems from time to time, and it's a great book to have on kindle - the online docs and/or google just aren't as good at helping solve problems. This gets it done.4 stars; it's a workman-like book, there are places where it could go further, but still, I wouldn't be without it.
Amazon Verified review Amazon
Mike A. May 05, 2018
Full star icon Full star icon Full star icon Empty star icon Empty star icon 3
IMO, he just brushes over the more complex issues. Will need to do more digging.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.