Elasticsearch 5.x Cookbook

Distributed Search and Analytics

Product type Paperback
Published in Feb 2017
ISBN-13 9781786465580
Length 696 pages
Edition 3rd Edition
Author: Alberto Paro

Understanding cluster, replication, and sharding


Closely related to shard management are the key concepts of replication and cluster status.

Getting ready

You need one or more running nodes to form a cluster. To test a cluster effectively, you need at least two nodes (they can be on the same machine).

How it works...

An index can have one or more replicas (full copies of your data, automatically managed by Elasticsearch). Shards are called primary if they belong to the primary replica, and secondary if they belong to any other replica.

To maintain consistency in write operations, the following workflow is executed:

  • The write is first executed on the primary shard

  • If the primary write succeeds, it is propagated simultaneously to all the secondary shards

  • If a primary shard becomes unavailable, a secondary one is elected as primary (if available) and the flow is re-executed
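
The workflow above can be sketched as a small in-memory model. This is only a toy illustration of the three steps, not how Elasticsearch implements them: the `ShardCopy` class and `write_document` function are hypothetical names introduced here, and promoting the first available secondary is an assumption made for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    """One copy of a shard: the primary or a secondary (hypothetical model)."""
    node: str
    available: bool = True
    docs: dict = field(default_factory=dict)


def write_document(primary, secondaries, doc_id, doc):
    """Toy model of the write flow; returns the copy that acted as primary."""
    if not primary.available:
        # Step 3: promote an available secondary, then re-execute the flow.
        candidates = [s for s in secondaries if s.available]
        if not candidates:
            raise RuntimeError("no shard copy available: write rejected")
        promoted = candidates[0]  # assumption: first available copy wins
        remaining = [s for s in secondaries if s is not promoted]
        return write_document(promoted, remaining, doc_id, doc)

    # Step 1: the write is executed on the primary first.
    primary.docs[doc_id] = doc
    # Step 2: on success, it is propagated to every available secondary.
    for secondary in secondaries:
        if secondary.available:
            secondary.docs[doc_id] = doc
    return primary


failed_primary = ShardCopy("node-1", available=False)
secondary = ShardCopy("node-2")
acting = write_document(failed_primary, [secondary], "1", {"title": "hello"})
print(acting.node)  # prints: node-2
```

In this run the primary on `node-1` is unavailable, so the secondary on `node-2` is promoted and receives the write.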

During search operations, if there are replicas, a valid set of shards is chosen randomly among the primary and secondary shards to improve performance. Elasticsearch has several allocation algorithms to better distribute shards across nodes. For reliability, replicas are allocated so that, if a single node becomes unavailable, at least one copy of each shard is still available on the remaining nodes.
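
Both ideas in this paragraph can be illustrated with a minimal sketch. It assumes a hypothetical `allocation` mapping from shard number to the list of nodes holding a copy of that shard; the function names are made up for this example, and the real routing and allocation logic in Elasticsearch is considerably more elaborate.

```python
import random


def pick_search_copies(allocation, rng=random):
    """Pick one copy (node) per shard at random, primary or secondary alike."""
    return {shard: rng.choice(nodes) for shard, nodes in allocation.items()}


def survives_single_node_failure(allocation):
    """True if every shard keeps at least one copy when any one node is lost."""
    all_nodes = {node for nodes in allocation.values() for node in nodes}
    return all(
        any(node != failed for node in nodes)
        for failed in all_nodes
        for nodes in allocation.values()
    )


# Hypothetical layout: two shards, each with copies on two different nodes.
allocation = {0: ["node-1", "node-2"], 1: ["node-2", "node-1"]}
print(pick_search_copies(allocation))
print(survives_single_node_failure(allocation))         # prints: True
print(survives_single_node_failure({0: ["node-1"]}))    # prints: False
```

The last call shows why an index without replicas is fragile: losing `node-1` leaves no copy of shard 0.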

The following figure shows some examples of possible shard and replica configurations:

Replicas come at a cost: they increase indexing time, because the data nodes must synchronize and time is spent propagating each write to the secondary shards (mainly in an asynchronous way).

Best practice

To prevent data loss and to have high availability, it's good to have at least one replica, so that your system can survive a node failure without downtime and without loss of data.

A typical approach for scaling search performance when your number of customers grows is to increase the number of replicas.

There's more...

Related to the concept of replication is the cluster status, an indicator of the health of your cluster.

It can be in one of three states:

  • Green: Everything is OK.

  • Yellow: Some secondary shards are not allocated, but all primary shards are, so you can keep working.

  • Red: "Houston, we have a problem". Some primary shards are missing. The cluster will not accept writes, and errors and stale actions may happen due to the missing shards. If the missing shards cannot be restored, you have lost your data.
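
The three states can be expressed as a small decision rule. The sketch below is a simplified model with hypothetical dictionary keys, not Elasticsearch code; on a running cluster the status is reported in the `status` field of the cluster health API (`GET /_cluster/health`).

```python
def cluster_status(shards):
    """Derive a status color from per-shard allocation info.

    Each shard is a dict with two hypothetical keys:
      primary_active   -- True if the primary copy is allocated
      missing_replicas -- number of secondary copies not allocated
    """
    shards = list(shards)
    if any(not shard["primary_active"] for shard in shards):
        return "red"      # at least one primary is missing
    if any(shard["missing_replicas"] > 0 for shard in shards):
        return "yellow"   # all primaries present, some secondaries missing
    return "green"        # every copy is allocated


healthy = [{"primary_active": True, "missing_replicas": 0}]
degraded = [{"primary_active": True, "missing_replicas": 1}]
broken = [{"primary_active": False, "missing_replicas": 1}]
print(cluster_status(healthy), cluster_status(degraded), cluster_status(broken))
# prints: green yellow red
```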

Solving the yellow status

  • Yellow status is mainly due to some shards not being allocated.

  • If your cluster is in "recovery" status (this means that it's starting up and checking the shards before putting them online), just wait for the shard startup process to finish.

  • If your cluster is still in the yellow state after recovery has finished, you may not have enough nodes to hold your replicas (because, for example, the number of replicas is not smaller than the number of your nodes). To fix this, you can reduce the number of your replicas or add the required number of nodes.

Note

The total number of nodes must be greater than the maximum number of replicas, because a secondary shard is never allocated on the same node as its primary.
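
The node-count constraint can be checked with a little arithmetic. The sketch below assumes that two copies of the same shard are never placed on the same node, so each shard can have at most one copy per data node; the function name and the example figures are illustrative only.

```python
def unassigned_replica_copies(primaries, replicas, data_nodes):
    """Count the secondary copies that cannot be placed on the given nodes."""
    if data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    # One node holds the primary, so at most data_nodes - 1 secondaries fit.
    placeable_per_shard = min(replicas, data_nodes - 1)
    return primaries * (replicas - placeable_per_shard)


# Hypothetical index with 5 primary shards.
print(unassigned_replica_copies(primaries=5, replicas=1, data_nodes=1))  # prints: 5
print(unassigned_replica_copies(primaries=5, replicas=1, data_nodes=2))  # prints: 0
print(unassigned_replica_copies(primaries=5, replicas=3, data_nodes=2))  # prints: 10
```

Any result above zero means the cluster stays yellow until nodes are added or the number of replicas is reduced.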

Solving the red status

  • You may have lost data. Red status occurs when one or more primary shards are missing.

  • You need to try to restore the missing node(s). If your nodes restart and the system goes back to yellow or green status, you are safe. Otherwise, you have lost data and your cluster is not usable: delete the affected index/indices and restore them from backups or snapshots (if you have taken them) or from other sources.

To prevent data loss, I suggest always having at least two nodes and the number of replicas set to 1.
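
As a hedged sketch of how the replica count might be set, the following builds (but does not send) an update-settings request using only the Python standard library. The `index.number_of_replicas` setting and the `_settings` endpoint are part of the Elasticsearch REST API; the address `http://localhost:9200`, the index name `myindex`, and the helper function are assumptions made for this example.

```python
import json
import urllib.request


def replica_settings_request(base_url, index, replicas):
    """Build a PUT <index>/_settings request that changes the replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed address and index name; adjust them to your own cluster.
request = replica_settings_request("http://localhost:9200", "myindex", 1)
print(request.get_method(), request.full_url)

# Sending the request requires a running cluster:
# urllib.request.urlopen(request)
```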

Tip

Having one or more replicas on different nodes on different machines allows you to have a live backup of your data, always updated.

See also

  • We'll see replica and shard management in the Managing index settings recipe in Chapter 4, Basic Operations.
