You're reading from ElasticSearch Cookbook - Second Edition Over 130 advanced recipes to search, analyze, deploy, manage, and monitor data effectively with ElasticSearch

Product type Paperback

Published in Jan 2015

Publisher

ISBN-13 9781783554836

Length 472 pages

Edition 2nd Edition

Languages

Java

Tools

Elasticsearch

Concepts

Enterprise Search

Author (1):

Alberto Paro

View More author details

Table of Contents (14) Chapters

Preface

1. Getting Started

2. Downloading and Setting Up FREE CHAPTER

3. Managing Mapping

4. Basic Operations

5. Search, Queries, and Filters

6. Aggregations

7. Scripting

8. Rivers

9. Cluster and Node Monitoring

10. Java Integration

11. Python Integration

12. Plugin Development

Index

Understanding clusters, replication, and sharding

Related to shard management, there is the key concept of replication and cluster status.

Getting ready

You need one or more nodes running to have a cluster. To test an effective cluster, you need at least two nodes (that can be on the same machine).

How it works...

An index can have one or more replicas; the shards are called primary if they are part of the primary replica, and secondary ones if they are part of replicas.

To maintain consistency in write operations, the following workflow is executed:

The write operation is first executed in the primary shard
If the primary write is successfully done, it is propagated simultaneously in all the secondary shards
If a primary shard becomes unavailable, a secondary one is elected as primary (if available) and then the flow is re-executed

During search operations, if there are some replicas, a valid set of shards is chosen randomly between primary and secondary to improve its performance. ElasticSearch has several allocation algorithms to better distribute shards on nodes. For reliability, replicas are allocated in a way that if a single node becomes unavailable, there is always at least one replica of each shard that is still available on the remaining nodes.

The following figure shows some examples of possible shards and replica configuration:

The replica has a cost in increasing the indexing time due to data node synchronization, which is the time spent to propagate the message to the slaves (mainly in an asynchronous way).

Note

To prevent data loss and to have high availability, it's good to have a least one replica; so your system can survive a node failure without downtime and without loss of data.

There's more...

Related to the concept of replication, there is the cluster status indicator that will show you information on the health of your cluster. It can cover three different states:

Green: This shows that everything is okay
Yellow: This means that some shards are missing but you can work on your cluster
Red: This indicates a problem as some primary shards are missing

Solving the yellow status...

Mainly, yellow status is due to some shards that are not allocated.

If your cluster is in the recovery status (meaning that it's starting up and checking the shards before they are online), you need to wait until the shards' startup process ends.

After having finished the recovery, if your cluster is always in the yellow state, you may not have enough nodes to contain your replicas (for example, maybe the number of replicas is bigger than the number of your nodes). To prevent this, you can reduce the number of your replicas or add the required number of nodes. A good practice is to observe that the total number of nodes must not be lower than the maximum number of replicas present.

Solving the red status

This means you are experiencing lost data, the cause of which is that one or more shards are missing.

To fix this, you need to try to restore the node(s) that are missing. If your node restarts and the system goes back to the yellow or green status, then you are safe. Otherwise, you have obviously lost data and your cluster is not usable; the next action would be to delete the index/indices and restore them from backups or snapshots (if you have done them) or from other sources. To prevent data loss, I suggest having always a least two nodes and a replica set to 1 as good practice.

Note

Having one or more replicas on different nodes on different machines allows you to have a live backup of your data, which stays updated always.