Elastic Stack 8.x Cookbook

Ingesting General Content Data

This chapter, along with Chapter 4, will focus on data ingestion. Generally, we can categorize data into two groups – general content (data from APIs, HTML pages, catalogs, data from a Relational Database Management System (RDBMS), PDFs, spreadsheets, etc.), and time series (data indexed in chronological order, such as logs, metrics, traces, and security events). In this chapter, we will ingest general content to illustrate the basic concepts of data ingestion, including fundamental data operations (index, delete, and update), analyzers, static and dynamic index mappings, and index templates.

Figure 2.1 illustrates the connections between various components, and in this chapter, we will explore recipes dedicated to the Client APP, Analyzer, Mapping, and Index template components (you can view the color image when you download the free PDF version of this book):

Figure 2.1 – Elasticsearch index management components

In this chapter, we are going to cover the following main topics:

Adding data from the Elasticsearch client
Updating data in Elasticsearch
Deleting data in Elasticsearch
Using an analyzer
Defining index mapping
Using dynamic templates in document mapping
Creating an index template
Indexing multiple documents using Bulk API

Adding data from the Elasticsearch client

To ingest general content such as catalogs, HTML pages, and files from your application, Elastic provides a wide range of language clients that make it easy to ingest data via the Elasticsearch REST APIs. In this recipe, we will learn how to add sample data to Elasticsearch hosted on Elastic Cloud using the Python client.

To use Elasticsearch’s REST APIs through various programming languages, a client application chooses a suitable client library. The client initializes and sends HTTP requests, directing them to the Elasticsearch cluster for data operations. Elasticsearch processes the requests and returns HTTP responses containing results or errors. The client application parses these responses and acts on the data accordingly. Figure 2.2 shows the summarized data flow:

Figure 2.2 – Elasticsearch’s client request and response flow

Getting ready

To simplify package management, we recommend that you install pip (https://pypi.org/project/pip/).

The snippets of this recipe are available here: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#adding-data-from-the-elasticsearch-client.

How to do it…

First, we will install the Elasticsearch Python client:

Add elasticsearch, elasticsearch-async, and load_dotenv to the requirements.txt file of your Python project (the sample requirements.txt file can be found at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/requirements.txt).
Run the following command to install the Elasticsearch Python client library:
```
$ pip install -r requirements.txt
```
Now, let’s set up a connection to Elasticsearch.
Prepare a .env file to store the access information for basic authentication: the Cloud ID ("ES_CID"), username ("ES_USER"), and password ("ES_PWD"). You can find the sample .env file at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/.env.
Remember that we saved the password for our default user, elastic, in the Deploying Elastic Stack on Elastic Cloud recipe in Chapter 1, and the instructions to find the cloud ID can be found in the same recipe.
Import the libraries in a Python file (sampledata_index.py), which we will use for this recipe:
```
import os
import time  # used later in the script for time.sleep()
from elasticsearch import Elasticsearch
from dotenv import load_dotenv
```

Load the environment variables and initiate an Elasticsearch connection:

load_dotenv()
ES_CID = os.getenv('ES_CID')
ES_USER = os.getenv('ES_USER')
ES_PWD = os.getenv('ES_PWD')
es = Elasticsearch(
    cloud_id=ES_CID,
    basic_auth=(ES_USER, ES_PWD)
)
print(es.info())

Now, you can run the script to check whether the connection is successful. Run the following command:
```
$ python sampledata_index.py
```
You should see an output that looks like the following screenshot:

Figure 2.3 – Connected Elasticsearch information

We can now extend the script to ingest a document. Prepare a sample JSON document from the movie dataset:

mymovie = {
    'release_year': '1908',
    'title': 'It is not this day.',
    'origin': 'American',
    'director': 'D.W. Griffith',
    'cast': 'Harry Solter, Linda Arvidson',
    'genre': 'comedy',
    'wiki_page':'https://en.wikipedia.org/wiki/A_Calamitous_Elopement',
    'plot': 'A young couple decides to elope after being caught in the midst of a romantic moment by the woman.'
}

Index the sample data in Elasticsearch. Here, we will choose the index name 'movies' and print the index results. Finally, we will store the document ID in a tmp file that we will reuse for the following recipes:

response = es.index(index='movies', document=mymovie)
print(response)
# Write the '_id' to a file named tmp.txt
with open('tmp.txt', 'w') as file:
    file.write(response['_id'])
# Print the contents of the file to confirm it's written correctly
with open('tmp.txt', 'r') as file:
    print(f"document id saved to tmp.txt: {file.read()}")
time.sleep(2)

Verify the data in Elasticsearch to ensure that it has been successfully indexed; wait two seconds after the indexing, query Elasticsearch using the _search API, and then print the results:
```
response = es.search(index='movies', query={"match_all": {}})
print("Sample movie data in Elasticsearch:")
for hit in response['hits']['hits']:
    print(hit['_source'])
```
Execute the script again with the following command:
```
$ python sampledata_index.py
```
You should have the following result in the console output:

Figure 2.4 – The output of the sampledata_index.py script

The full code sample can be found at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_index.py.

How it works...

In this recipe, we learned how to use the Elastic Python client to securely connect to a hosted deployment on Elastic Cloud.

Elasticsearch created the movies index automatically during the first ingestion, and the fields were created with the default mapping.

Later in this chapter, we will learn how to define static and dynamic mapping to customize field types with the help of concrete recipes.

It’s also important to note that as we did not provide a document ID, Elasticsearch automatically generated an ID during the indexing phase as well.
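If you prefer to control the document ID yourself, the index call also accepts an id argument. The following is a minimal sketch under a few assumptions: es is the connected client from this recipe, and the index_movie helper and the example ID are hypothetical names used only for illustration.

```python
def index_movie(es, document, doc_id=None, index="movies"):
    """Index a document, optionally with a caller-supplied ID.

    `es` is assumed to be a connected Elasticsearch 8.x Python client.
    When doc_id is None, Elasticsearch generates the ID itself.
    """
    kwargs = {"index": index, "document": document}
    if doc_id is not None:
        kwargs["id"] = doc_id  # explicit ID instead of an auto-generated one
    response = es.index(**kwargs)
    return response["_id"]

# Usage (hypothetical ID; requires a connected client `es`):
# index_movie(es, mymovie, doc_id="movie-1908-001")
```

Indexing a second document with the same explicit ID replaces the first one, so auto-generated IDs remain the simpler choice when you do not need to address documents by a known key.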

The following diagram (Figure 2.5) shows the index processing flow:

Figure 2.5 – The ingestion flow

There’s more…

In this recipe, we used the HTTP basic authentication method. The Elastic Python client also provides other authentication methods, such as HTTP Bearer authentication and API key authentication. Detailed documentation can be found at the following link: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-bearer.
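As a rough sketch of how API key authentication could fit into the same script, the helper below builds the keyword arguments for the Elasticsearch constructor. It assumes a hypothetical ES_API_KEY variable in the .env file (not part of the book's sample) and falls back to the basic authentication variables used in this recipe; client_kwargs is an illustrative name.

```python
import os

def client_kwargs(env=None):
    """Build keyword arguments for Elasticsearch(...).

    Prefers an API key (hypothetical ES_API_KEY variable) and falls back
    to the ES_USER / ES_PWD pair used for basic authentication.
    """
    env = os.environ if env is None else env
    kwargs = {"cloud_id": env["ES_CID"]}
    api_key = env.get("ES_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    else:
        kwargs["basic_auth"] = (env["ES_USER"], env["ES_PWD"])
    return kwargs

# Usage (requires the elasticsearch package and a reachable deployment):
# from elasticsearch import Elasticsearch
# es = Elasticsearch(**client_kwargs())
```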

We chose to illustrate the simplicity of general content data ingestion by using the Python client. Detailed documentation for other client libraries can be found at the following link: https://www.elastic.co/guide/en/elasticsearch/client/index.html

During the development and testing phase, it’s also very useful to call the Elasticsearch REST API directly, either with an HTTP client such as cURL or Postman, or with the Kibana Dev Tools console (https://www.elastic.co/guide/en/kibana/current/console-kibana.html).

Updating data in Elasticsearch

In this recipe, we will explore how to update data in Elasticsearch using the Python client.

Getting ready

Ensure that you have installed the Elasticsearch Python client and have successfully set up a connection to your Elasticsearch cluster (refer to the Adding data from the Elasticsearch client recipe). You will also need to have completed the previous recipe, which involves ingesting a document into the movies index.

Note

The following three recipes will use the same set of requirements.

How to do it…

In this recipe, we’re going to update the director field of a particular document within the movies index. The director field will be changed from its current value, D.W. Griffith, to a new value, Clint Eastwood. The following are the steps you’ll need to follow in your Python script to perform this update and confirm that it has been successfully applied. Let’s inspect the Python script that we will use to update the ingested document (https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_update.py):

First, we need to retrieve the document ID of the previously ingested document from the tmp.txt file, which we intend to update. The field to update here is director; we are going to update the value from D.W. Griffith to Clint Eastwood:
```
index_name = 'movies'
document_id = ''
# Read the document_id of the document ingested in the previous recipe
with open('tmp.txt', 'r') as file:
    document_id = file.read()
document = {
    'director': 'Clint Eastwood'
}
```

We can now check document_id, verify that the document exists in the index, and then perform the update operation:

# Update the document in Elasticsearch if document_id is valid
if document_id != '':
    if es.exists(index=index_name, id=document_id):
        response = es.update(index=index_name, id=document_id,
                             doc=document)
        print(f"Update status: {response['result']}")

Once the document is updated, to verify that the update is successful, you can retrieve the updated document from Elasticsearch and print the modified fields:
```
updated_document = es.get(index=index_name, id=document_id)
print("Updated document:")
print(updated_document)
```
After inspecting the script, let’s run it with the following command:
```
$ python sampledata_update.py
```

Figure 2.6 – The output of the sampledata_update.py script

You should see that the _version and director fields are updated.

How it works...

Each document includes a _version field in Elasticsearch. Elasticsearch documents cannot be modified directly, as they are immutable. When you update an existing document, a new document is generated with an incremented version, while the previous document is flagged for deletion.
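To make this behavior concrete, here is a deliberately simplified, in-memory model of it. This is only an illustration of the idea (a new copy with an incremented version, the old copy flagged as deleted), not a description of how Elasticsearch or Lucene store data; the ToyIndex class is a hypothetical name.

```python
class ToyIndex:
    """Simplified in-memory illustration of update semantics (not real storage)."""

    def __init__(self):
        # Every copy ever written, including superseded ones.
        self.copies = []

    def _live_copy(self, doc_id):
        for copy in self.copies:
            if copy["_id"] == doc_id and not copy["deleted"]:
                return copy
        return None

    def index(self, doc_id, source):
        self.copies.append(
            {"_id": doc_id, "_version": 1, "deleted": False, "_source": dict(source)}
        )

    def update(self, doc_id, partial_doc):
        old = self._live_copy(doc_id)
        if old is None:
            raise KeyError(doc_id)
        old["deleted"] = True  # the previous copy is only flagged, never edited
        new = {
            "_id": doc_id,
            "_version": old["_version"] + 1,
            "deleted": False,
            "_source": {**old["_source"], **partial_doc},
        }
        self.copies.append(new)
        return new


toy = ToyIndex()
toy.index("1", {"title": "It is not this day.", "director": "D.W. Griffith"})
updated = toy.update("1", {"director": "Clint Eastwood"})
print(updated["_version"], updated["_source"]["director"])
```

After the update, the toy index holds two copies of the document: the flagged original and the new version 2.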

There’s more…

We have just seen how to update a single document in Elasticsearch. Updating many documents one request at a time is generally not optimal from a performance point of view. To update multiple documents that match a specific query, you can use the Update By Query API. This allows you to define a query to select the documents you want to update and specify the changes to be made; here is an example of how to do it with the Python client:

q = {
    "script": {
        "source": "ctx._source.genre = 'comedies'",
        "lang": "painless"
    },
    "query": {
        "bool": {
            "must": [
              {
                "term": {
                    "genre": "comedy"
                }
              }
            ]
        }
    }
}
es.update_by_query(body=q, index=index_name)

The full Python script is available here: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_update_by_query.py.

Note

The script used here is written in Painless, the Elasticsearch scripting language; we will see more examples in Chapter 6.

The other way to update multiple documents in a single request is via Elasticsearch’s Bulk API. The Bulk API can be used to insert, update, and delete multiple documents efficiently. We will learn how to use the Bulk API to ingest multiple documents at the end of this chapter. For more detailed information, refer to the following documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html.
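As a preview, bulk updates could be sketched with the helpers.bulk function of the Python client, which accepts a stream of action dictionaries. The build_update_actions generator below is a hypothetical helper, and the document IDs are assumed to be known in advance.

```python
def build_update_actions(index, partial_docs):
    """Yield one bulk 'update' action per (document ID, partial document) pair."""
    for doc_id, partial in partial_docs.items():
        yield {
            "_op_type": "update",  # partial update rather than a full re-index
            "_index": index,
            "_id": doc_id,
            "doc": partial,
        }

# Usage (requires the elasticsearch package and a connected client `es`):
# from elasticsearch import helpers
# helpers.bulk(es, build_update_actions("movies", {document_id: {"director": "Clint Eastwood"}}))
```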

To retrieve the ID of the document we want to update, we rely on a tmp.txt file where the ID of a previously created document was saved. Alternatively, you can retrieve the document’s ID by searching the movies index from Kibana: go to Kibana | Dev Tools and execute the following command:

GET movies/_search

This query should return a list of hits that display all documents in the index, along with their respective IDs, as shown in Figure 2.7. Using these results, locate and record the ID of the document you would like to update:

Figure 2.7 – Checking the document ID
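The same lookup can be done from Python. The sketch below assumes es is the connected client from the earlier recipes; list_document_ids is a hypothetical helper name.

```python
def list_document_ids(es, index="movies", size=10):
    """Return the IDs of up to `size` documents in the index."""
    response = es.search(index=index, query={"match_all": {}}, size=size)
    return [hit["_id"] for hit in response["hits"]["hits"]]

# Usage (requires a connected client `es`):
# print(list_document_ids(es))
```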

Deleting data in Elasticsearch

In this recipe, we will explore how to delete a document from an Elasticsearch index.

Getting ready

Refer to the requirements for the Updating data in Elasticsearch recipe.

Make sure to download the following Python script from the GitHub repository: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_delete.py.

The snippets of the recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#deleting-data-in-elasticsearch.

How to do it…

First, let us inspect the sampledata_delete.py Python script. Like the process in the previous recipe, we need to retrieve document_id from the tmp.txt file:
```
with open('tmp.txt', 'r') as file:
    document_id = file.read()
```

We can now check document_id, verify that the document exists in the index, and then perform the delete operation by using the previously obtained document_id:

if document_id != '':
    if es.exists(index=index_name, id=document_id):
        # delete the document in Elasticsearch
        response = es.delete(index=index_name, id=document_id)
        print(f"delete status: {response['result']}")

After reviewing the delete script, execute it with the following command:
```
$ python sampledata_delete.py
```
You should see the following output:

Figure 2.8 – The output of the sampledata_delete.py script

For further verification, return to the Dev Tools in Kibana and execute the search request again on the movies index:
```
GET movies/_search
```
This time, the result should reflect the deletion:

Figure 2.9 – The search results in the movies index after deletion

The total hits will now be 0, confirming that the document has been successfully deleted.

How it works...

When a document is deleted in Elasticsearch, it is not immediately removed from the index. Instead, Elasticsearch marks the document as deleted. These documents remain in the index until a merging process occurs during routine optimization tasks, when Elasticsearch physically expunges the deleted documents from the index.

This mechanism allows Elasticsearch to handle deletions efficiently. By marking documents as deleted rather than expunging them outright, Elasticsearch avoids costly segment reorganizations within the index. The removal occurs during optimized, controlled background tasks.

There’s more…

While we have discussed deleting documents by document_id, this might not be the most efficient approach for deleting multiple documents. For such scenarios, the Delete By Query API is more suitable, as in the following example:

Note

Before executing the upcoming query, it is necessary to re-index the document, since it was deleted earlier in the recipe. Ensure that you have re-added the document to the movies index by executing the sampledata_index.py Python script.

POST /movies/_delete_by_query
{
  "query": {
    "match": {
      "genre": "comedy"
    }
  }
}

The preceding query will delete all movies matching the comedy genre in our index.

Also, when deleting many documents, the best practice is to use the Delete By Query API with the slices parameter. Slicing splits a large deletion task into smaller sub-tasks that run in parallel, which makes the deletion more efficient and spreads the work more evenly across the cluster.
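With the Python client, a sliced Delete By Query could look like the following sketch. It assumes es is the connected client from the earlier recipes; delete_by_genre is a hypothetical helper, and slices="auto" asks Elasticsearch to pick the number of slices itself.

```python
def delete_by_genre(es, genre, index="movies", slices="auto"):
    """Delete all documents whose genre matches, using sliced (parallel) execution."""
    response = es.delete_by_query(
        index=index,
        query={"match": {"genre": genre}},
        slices=slices,  # "auto" or an explicit integer number of slices
    )
    return response["deleted"]  # number of documents deleted

# Usage (requires a connected client `es`):
# print(delete_by_genre(es, "comedy"))
```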

Using an analyzer

In this recipe, we are going to learn how to set up and use a specific analyzer for text analysis. Indexing data in Elasticsearch, especially for search use cases, requires that you define how text should be processed before indexing; this is what analyzers accomplish.

Analyzers in Elasticsearch handle tokenization and normalization functions. Elasticsearch offers a variety of ready-made analyzers for common scenarios, as well as language-specific analyzers for English, German, Spanish, French, Hindi, and so on.

In this recipe, we will see how to configure the standard analyzer with the English stopwords filter.

Getting ready

Make sure that you completed the Adding data from the Elasticsearch client recipe. Also, make sure to download the following sample Python script from the GitHub repository: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_analyzer.py.

The command snippets of this recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#using-analyzer.

How to do it…

In this recipe, you will learn how to configure your Python code to interface with an Elasticsearch cluster, define a custom English text analyzer, create a new index with the analyzer, and verify that the index uses the specified settings.

Let’s look at the provided Python script:

At the beginning of the script, we create an instance of the Elasticsearch client:

es = Elasticsearch(
    cloud_id=ES_CID,
    basic_auth=(ES_USER, ES_PWD)
)

To ensure that we do not use an existing movies index, the script includes code that deletes any such index:

if es.indices.exists(index="movies"):
    print("Deleting existing movies index...")
    es.options(ignore_status=[404, 400]).indices.delete(index="movies")

Next, we define the analyzer configuration:

index_settings = {
    "analysis": {
        "analyzer": {
            "standard_with_english_stopwords": {
                "type": "standard",
                "stopwords": "_english_"
            }
        }
    }
}

We then create the index with settings that define the analyzer:
```
es.indices.create(index='movies', settings=index_settings)
```

Finally, to verify the successful addition of the analyzer, we retrieve the settings:

settings = es.indices.get_settings(index='movies')
analyzer_settings = settings['movies']['settings']['index']['analysis']
print(f"Analyzer used for the index: {analyzer_settings}")

After reviewing the script, execute it with the following command, and you should see the output shown in Figure 2.10:
```
$ python sampledata_analyzer.py
```

Figure 2.10 – The output of the sampledata_analyzer.py script

Alternatively, you can go to Kibana | Dev Tools and issue the following request:

GET /movies/_settings

In the response, you should see the settings currently applied to the movies index with the configured analyzer, as shown in Figure 2.11:

Figure 2.11 – The analyzer configuration in the index settings

How it works...

The settings block of the index configuration is where the analyzer is set. As we are modifying the built-in standard analyzer in our recipe, we will give it a unique name (standard_with_english_stopwords) and set the type to standard. Text indexed from this point will undergo analysis by the modified analyzer. To test this, we can use the _analyze endpoint on the index:

POST movies/_analyze
{
  "text": "A young couple decides to elope.",
  "analyzer": "standard_with_english_stopwords"
}

It should yield the results shown in Figure 2.12:

Figure 2.12 – The index result of a text with the stopword analyzer
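The same check can be scripted. The sketch below assumes es is the connected client and that the movies index was created with the standard_with_english_stopwords analyzer defined in this recipe; analyze_text is a hypothetical helper that returns only the token strings from the _analyze response.

```python
def analyze_text(es, text, analyzer="standard_with_english_stopwords", index="movies"):
    """Run the index's analyzer on `text` and return the resulting tokens."""
    response = es.indices.analyze(index=index, analyzer=analyzer, text=text)
    return [token["token"] for token in response["tokens"]]

# Usage (requires a connected client `es`):
# print(analyze_text(es, "A young couple decides to elope."))
```

Stopwords such as "a" and "to" should be absent from the returned list, while the remaining words appear in lowercase.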

There’s more…

While Elasticsearch offers many built-in analyzers for different languages and text types, you can also define custom analyzers. These allow you to specify how text is broken down and modified for indexing or searching, using components such as tokenizers, token filters, and character filters – either those provided by Elasticsearch or custom ones you create. For example, you can design an analyzer that converts text to lowercase, removes common words, substitutes synonyms, and strips accents.

Reasons for needing a custom analyzer may include the following:

Handling various languages and scripts that require special processing, such as Chinese, Japanese, and Arabic
Enhancing the relevance and comprehensiveness of search results using synonyms, stemming, lemmatization, and so on
Unifying text by removing punctuation, whitespace, and accents and making it case-insensitive
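As an illustration, a custom analyzer that lowercases text, removes English stopwords, and strips accents could be declared with settings along these lines. The analyzer name folding_english and the index name in the usage comment are hypothetical; the tokenizer and token filters are built-in Elasticsearch components.

```python
custom_analysis_settings = {
    "analysis": {
        "analyzer": {
            "folding_english": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "standard",
                # Filters run in order: lowercase, drop stopwords, strip accents.
                "filter": ["lowercase", "stop", "asciifolding"],
            }
        }
    }
}

# Usage (requires a connected client `es`; the index name is an assumption):
# es.indices.create(index="movies-custom-analyzer", settings=custom_analysis_settings)
```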

Defining index mapping

In Elasticsearch, mapping refers to the process of defining the schema or structure of an index. It defines how documents and their fields are stored and indexed within Elasticsearch. Mapping allows you to specify the data type of each field, such as text, keyword, numeric, or date, and configure various properties for each field, including indexing options and analyzers. By defining a mapping, you provide Elasticsearch with crucial information about the data you intend to index, enabling it to efficiently store, search, and analyze the documents.

Mapping plays a critical role in delivering precise search results, efficient data storage, and effective handling of different data types within Elasticsearch.

When no mapping is predefined, Elasticsearch attempts to dynamically infer data types and create the mapping; this is what has occurred with our movie dataset thus far.

In this recipe, we will apply an explicit mapping to the movies index.

Getting ready

Make sure that you have completed the Updating data in Elasticsearch recipe.

All the command snippets for the Dev Tools in this recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#defining-index-mapping.

How to do it…

You can define mappings during index creation or update them in an existing index.

An important note on mappings

When updating the mapping of an existing index that already contains documents, the mapping of those existing documents will not change. The new mapping will only apply to documents indexed afterward.

In this recipe, you are going to create a new index with explicit mapping, and then re-index the data from the movies index, assuming that you have already created that index beforehand:

Head to Kibana | Dev Tools.
Next, let’s check the mapping of the previously created index with the following command:
```
GET /movies/_mapping
```
You will get the results shown in the following figure. Note that, for readability, some fields were collapsed.

Figure 2.13 – The default mapping on the movies index

Let’s review what’s going on in the figure:

a. Examining the current mapping of the genre field reveals a multi-field mapping technique. This approach allows a single field to be indexed in several ways to serve different purposes. For example, the genre field is indexed both as a text field for full-text search and as a keyword field for sorting and aggregation. This dual approach to mapping the genre field is actually beneficial and well-suited for its intended use cases.

b. Examining the release_year field reveals that indexing it as a text field is not optimal, since it represents numerical data; a numeric type would enable range queries and other numeric-specific operations. Retaining the keyword sub-field for this field is still advantageous for sorting and aggregation purposes. The next step is therefore to apply an explicit mapping that treats release_year as a numerical field.

c. There are two other fields that will require mapping adjustments – plot and cast. Given their nature, these fields should be indexed solely as text, considering it is unlikely there will be a need to sort or aggregate on these fields. However, this indexing strategy still allows for effective searching against them.

Now, let’s create a new index with the correct explicit mapping for the cast, plot, and release_year fields:

PUT movies-with-explicit-mapping
{
  "mappings": {
    "properties": {
      "release_year": {
        "type": "short",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "cast": {
        "type": "text"
      },
      "plot": {
        "type": "text"
      }
    }
  }
}

Next, reindex the original data in the new index so that the new mapping is applied:

POST /_reindex
{
  "source": {
    "index": "movies"
  },
  "dest": {
    "index": "movies-with-explicit-mapping"
  }
}

Check whether the new mapping has been applied to the new index:
```
GET movies-with-explicit-mapping/_mapping
```
Figure 2.14 shows the explicit mapping applied to the index:

Figure 2.14 – Explicit mapping
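For completeness, the same two steps can be sketched with the Python client. This assumes es is the connected client from the earlier recipes; create_and_reindex is a hypothetical helper, and the mapping mirrors the Dev Tools request above.

```python
EXPLICIT_MAPPINGS = {
    "properties": {
        "release_year": {
            "type": "short",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "cast": {"type": "text"},
        "plot": {"type": "text"},
    }
}

def create_and_reindex(es, source="movies", dest="movies-with-explicit-mapping"):
    """Create `dest` with the explicit mapping, then copy documents from `source`."""
    es.indices.create(index=dest, mappings=EXPLICIT_MAPPINGS)
    response = es.reindex(source={"index": source}, dest={"index": dest})
    return response["created"]  # number of documents created in `dest`

# Usage (requires a connected client `es`):
# print(create_and_reindex(es))
```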

How it works...

Explicit mapping in Elasticsearch allows you to define the schema or mapping for your index explicitly. Instead of relying on dynamic mapping, which automatically detects field types and creates the mapping the first time each field appears in an indexed document, explicit mapping gives you full control over the data types, field properties, and analysis settings for each field in your index, as shown in Figure 2.15:

Figure 2.15 – The field mapping options

There’s more…

Mapping is a key aspect of data modeling in Elasticsearch. Avoid relying on dynamic mapping and try, when possible, to explicitly define your mappings to have better control over the field types, properties, and analysis settings. This helps maintain consistency and avoids unexpected field mappings.

You should consider using multi-field mapping to index the same field in different ways, depending on the use cases. For instance, for a full-text search of a string field, text mapping is necessary. If the same string field is mostly used for aggregations, filtering, or sorting, then mapping it to a keyword field is more efficient. Also, consider using mapping limit settings to prevent a mapping explosion (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html). A situation where every new ingested document introduces new fields, as with dynamic mapping, can result in defining too many fields in an index. This can cause a mapping explosion. When each new field is continually added to the index mapping, it can grow excessively and lead to memory shortages and recovery challenges.

When it comes to mapping limit settings, there are several best practices to keep in mind. First, limit the number of field mappings to prevent documents from causing a mapping explosion. Second, limit the maximum depth of a field. Third, restrict the number of different nested mappings an index can have. Fourth, set a maximum for the count of nested JSON objects allowed in a single document, across all nested types. Finally, limit the maximum length of a field name. Keep in mind that setting higher limits can affect performance and cause memory problems.
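These five limits correspond to index settings that can be supplied when an index is created. The sketch below shows the setting names; the numeric values are arbitrary examples rather than recommendations, and the index name in the usage comment is hypothetical.

```python
mapping_limit_settings = {
    "index.mapping.total_fields.limit": 500,        # maximum number of fields
    "index.mapping.depth.limit": 10,                # maximum depth of a field
    "index.mapping.nested_fields.limit": 20,        # distinct nested mappings
    "index.mapping.nested_objects.limit": 5000,     # nested objects per document
    "index.mapping.field_name_length.limit": 128,   # maximum field name length
}

# Usage (requires a connected client `es`; the index name is an assumption):
# es.indices.create(index="movies-limited", settings=mapping_limit_settings)
```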

For many years now, Elastic has been developing a specification called Elastic Common Schema (ECS) that provides a consistent and customizable way to structure data in Elasticsearch. Adopting this mapping has a lot of benefits (data correlation, reuse, and future-proofing, to name a few), and as a best practice, always refer to the ECS convention when you consider naming your fields. We will see more examples using ECS in the next chapters.

Using dynamic templates in document mapping

In this recipe, we will explore how to leverage dynamic templates in Elasticsearch to automatically apply mapping rules to fields, based on their data types. Elasticsearch allows you to define dynamic templates that simplify the mapping process by dynamically applying mappings to new fields as they are indexed.

Getting ready

Make sure that you have completed the previous recipes:

Using an analyzer
Defining index mapping

The snippets of the recipe are available at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#using-dynamic-templates-in-document-mapping.

How to do it…

In our example, the default mapping of the year field is set to the long field type, which is suboptimal for storage. We also want to prepare the document mapping so that if additional year fields such as review_year and award_year are introduced, they will have a dynamically applied mapping. Let’s go to Kibana | Dev Tools, where we can extend the previous mapping as follows:
```
PUT movies/_mapping
{
  "dynamic_templates": [{
    "years_as_short": {
      "match_mapping_type": "long",
      "match": "*year",
      "mapping": {
        "type": "short"
      }
    }
  }]
}
```

Next, we ingest a new document with a review_year field using the following command:

POST movies/_doc/
{
  "review_year": 1993,
  "release_year": 1992,
  "title": "Reservoir Dogs",
  "origin": "American",
  "director": "Quentin Tarantino",
  "cast": "Harvey Keitel, Tim Roth, Steve Buscemi, Chris Penn, Michael Madsen, Lawrence Tierney",
  "genre": "crime drama",
  "wiki_page": "https://en.wikipedia.org/wiki/Reservoir_Dogs",
  "plot": "a group of criminals whose planned diamond robbery goes disastrously wrong, leading to intense suspicion and betrayal within their ranks."
}

We can now check the mapping with the following command, and we can see that the movies mapping now contains the dynamic template, and the review_year field correctly maps to short, as shown in Figure 2.16.
```
GET /movies/_mapping
```

Figure 2.16 – Updated mapping for the movies index with a dynamic template

How it works...

In our example for the years_as_short dynamic template, we configured custom mapping as follows:

The match_mapping_type parameter is used to define the data type to be detected. In our example, the template targets values that Elasticsearch detects as long.
The match parameter uses a wildcard pattern to match the field name. In our example, it matches field names ending with year. (It is also possible to use the unmatch parameter, which uses one or more patterns to exclude fields matched by match.)
The mapping parameter is used to define the mapping that the matched field should use. In our example, we map the target field type to short.
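The interplay of these parameters can be approximated in a few lines of plain Python. This is only a toy model of the selection logic, using fnmatch as a stand-in for Elasticsearch's wildcard matching (fnmatch supports a few extra pattern characters); template_applies is a hypothetical function name.

```python
from fnmatch import fnmatchcase

def template_applies(field_name, detected_type, match="*year",
                     unmatch=None, match_mapping_type="long"):
    """Return True if a years_as_short-style template would apply to the field."""
    if detected_type != match_mapping_type:
        return False  # the detected data type must match first
    if not fnmatchcase(field_name, match):
        return False  # the field name must match the wildcard pattern
    if unmatch is not None and fnmatchcase(field_name, unmatch):
        return False  # unmatch excludes fields that match would include
    return True

for name, detected in [("review_year", "long"), ("title", "string"), ("yearly_total", "long")]:
    print(name, template_applies(name, detected))
```

Only review_year satisfies both conditions here: title is not detected as long, and yearly_total does not end with year.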

There’s more…

Apart from the example that we have seen in this recipe, dynamic templates can also be used in the following scenarios:

Only with a match_mapping_type parameter that applies to all the fields of a single type, without needing to match the field name
With path_match or path_unmatch, which match against the full dotted path to the field, such as "path_match": "myfield_prefix.*" or "path_unmatch": "*.year".

For timestamped data, it is common to have many numeric fields such as metrics. In such cases, filtering on those fields is rarely required and only aggregation is useful. Therefore, it is recommended to disable indexing on those fields to save disk space. You can find a concrete example in the following documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html#_time_series.
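A sketch of such a mapping, loosely following the pattern in the linked documentation, is shown below as a Python dictionary. The template names and the index name in the usage comment are illustrative assumptions.

```python
unindexed_metrics_mappings = {
    "dynamic_templates": [
        {
            "unindexed_longs": {
                "match_mapping_type": "long",
                # Not searchable, but still available for aggregations.
                "mapping": {"type": "long", "index": False},
            }
        },
        {
            "unindexed_doubles": {
                "match_mapping_type": "double",
                "mapping": {"type": "float", "index": False},
            }
        },
    ]
}

# Usage (requires a connected client `es`; the index name is an assumption):
# es.indices.create(index="my-metrics", mappings=unindexed_metrics_mappings)
```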

The default dynamic field mapping in Elasticsearch is convenient to get started, but it is beneficial to consider defining field mappings more strategically to optimize storage, memory, and indexing/search speed. The workflow to design new index mappings can be as follows:

Index a sample document containing the desired fields in a dummy index.
Retrieve the dynamic mapping created by Elasticsearch.
Modify and optimize the mapping definition.
Create your index with the custom mapping, either explicit or dynamic.
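The four steps above can be sketched with the Python client as follows. This is a minimal illustration, assuming es is a connected client; design_mapping, the scratch index name, and the final index name are all hypothetical.

```python
import copy

def design_mapping(es, sample_doc, overrides,
                   scratch_index="mapping-scratch", final_index="movies-designed"):
    """Derive a mapping from a sample document, adjust it, and create the final index."""
    # Step 1: index a sample document into a throwaway index.
    es.index(index=scratch_index, document=sample_doc)
    # Step 2: retrieve the mapping that Elasticsearch inferred dynamically.
    inferred = es.indices.get_mapping(index=scratch_index)
    mappings = copy.deepcopy(dict(inferred[scratch_index]["mappings"]))
    # Step 3: override the field definitions we want to optimize.
    mappings.setdefault("properties", {}).update(overrides)
    # Step 4: create the real index with the adjusted mapping, then clean up.
    es.indices.create(index=final_index, mappings=mappings)
    es.indices.delete(index=scratch_index)
    return mappings

# Usage (requires a connected client `es`):
# design_mapping(es, mymovie, {"release_year": {"type": "short"}})
```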

Creating an index template

In this recipe, we will explore how to use index templates in Elasticsearch to define mappings, settings, and other configurations for new indices. Index templates automate the index creation process and ensure consistency across your Elasticsearch cluster.

Getting ready

Before we begin, familiarize yourself with creating component and index templates by using Kibana Dev Tools as explained in this documentation:

Make sure that you have completed the previous recipes:

Using an analyzer
Defining index mapping

All the commands for the Dev Tools in this recipe are available at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#creating-an-index-template.

How to do it…

In this recipe, we will create two component templates – one for the genre field and another for *year fields with dynamic mapping – and then combine them in an index template:

Create the first component template for the genre field:

```
PUT _component_template/movie-static-mapping
{
  "template": {
    "mappings": {
      "properties": {
        "genre": {
          "type": "keyword"
        }
      }
    }
  }
}
```

Create the second component template for the dynamic *year field:

```
PUT _component_template/movie-dynamic-mapping
{
  "template": {
    "mappings": {
      "dynamic_templates": [{
        "years_as_short": {
          "match_mapping_type": "long",
          "match": "*year",
          "mapping": {
            "type": "short"
          }
        }
      }]
    }
  }
}
```

Create the index template, which is composed of the two component templates that we just created; additionally, we define an explicit mapping for the director field directly in the index template:

```
PUT _index_template/movie-template
{
  "index_patterns": ["movie*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "director": {
          "type": "keyword"
        }
      }
    },
    "aliases": {
      "mydata": { }
    }
  },
  "priority": 500,
  "composed_of": ["movie-static-mapping", "movie-dynamic-mapping"],
  "version": 1,
  "_meta": {
    "description": "movie template"
  }
}
```
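
If you prefer to manage templates from application code, the same three requests can be sketched with the Python client. The function below assumes client is a connected Elasticsearch 8.x Python client; the keyword arguments reflect the 8.x client as we understand it (in particular, meta stands for the _meta body field), so check them against the version you have installed.

```python
def create_movie_templates(client):
    """Create the two component templates and the index template of this recipe."""
    # Component template 1: static mapping for the genre field.
    client.cluster.put_component_template(
        name="movie-static-mapping",
        template={"mappings": {"properties": {"genre": {"type": "keyword"}}}},
    )

    # Component template 2: dynamic mapping for fields ending with "year".
    client.cluster.put_component_template(
        name="movie-dynamic-mapping",
        template={
            "mappings": {
                "dynamic_templates": [
                    {
                        "years_as_short": {
                            "match_mapping_type": "long",
                            "match": "*year",
                            "mapping": {"type": "short"},
                        }
                    }
                ]
            }
        },
    )

    # Index template: combines both components and adds the director field.
    client.indices.put_index_template(
        name="movie-template",
        index_patterns=["movie*"],
        template={
            "settings": {"number_of_shards": 1},
            "mappings": {
                "_source": {"enabled": True},
                "properties": {"director": {"type": "keyword"}},
            },
            "aliases": {"mydata": {}},
        },
        priority=500,
        composed_of=["movie-static-mapping", "movie-dynamic-mapping"],
        version=1,
        meta={"description": "movie template"},
    )
```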

Now, we can index another new movie with a field called award_year, as follows:

```
POST movies/_doc/
{
  "award_year": 1998,
  "release_year": 1997,
  "title": "Titanic",
  "origin": "American",
  "director": "James Cameron",
  "cast": "Leonardo DiCaprio, Kate Winslet, Billy Zane, Frances Fisher, Victor Garber, Kathy Bates, Bill Paxton, Gloria Stuart, David Warner, Suzy Amis",
  "genre": "historical epic",
  "wiki_page": "https://en.wikipedia.org/wiki/Titanic_(1997_film)",
  "plot": "The ill-fated maiden voyage of the RMS Titanic, centering on a love story between a wealthy young woman and a poor artist aboard the luxurious, ill-fated R.M.S. Titanic"
}
```

Let’s check the mapping after the document ingestion with the following command:
```
GET /movies/_mapping
```
Note the updated mapping, as illustrated in Figure 2.17, with award_year dynamically mapped to short. Additionally, both the genre and director fields are mapped to keyword, thanks to our field definitions in the movie-static-mapping component template and the movie-template index template.

Figure 2.17 – The updated mapping for the movies index
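
If you want to run the same check from code, a small helper can flatten the mapping response into a field-to-type dictionary. This is a sketch: field_types is a hypothetical helper, and sample_response is a hand-written, abridged stand-in shaped like the response of GET /movies/_mapping, not actual output from a cluster.

```python
def field_types(mapping_response, index):
    """Return {field name: type} for the top-level fields of an index mapping."""
    properties = mapping_response[index]["mappings"].get("properties", {})
    return {name: spec.get("type") for name, spec in properties.items()}


# Hand-written sample with the same shape as a get-mapping response.
sample_response = {
    "movies": {
        "mappings": {
            "properties": {
                "award_year": {"type": "short"},
                "director": {"type": "keyword"},
                "genre": {"type": "keyword"},
            }
        }
    }
}

types = field_types(sample_response, "movies")
print(types)
```

With a live cluster, you would pass the result of client.indices.get_mapping(index="movies") instead of the sample dictionary.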

How it works...

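Index templates include various configuration settings, such as shard and replica initialization parameters, mapping configurations, and aliases. They also allow you to assign a priority to each template: when several templates match a new index, the one with the highest priority is applied. A template without an explicit priority is treated as having priority 0, whereas the templates that ship with Elasticsearch use a priority of 100.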

Component templates act as building blocks for index templates. Each component template can contain settings, aliases, or mappings, and several of them can be combined in an index template by using the composed_of parameter.
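
The composition can be pictured with a simplified model: component templates are merged in the order listed in composed_of, so later ones take precedence, and the index template's own template section is applied last. The deep_merge and resolve helpers below are hypothetical, and Elasticsearch's real merge logic has more rules than this sketch covers.

```python
def deep_merge(base, override):
    """Recursively merge two dictionaries; values from override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(component_templates, composed_of, index_template_body):
    """Merge components in composed_of order, then the index template itself."""
    resolved = {}
    for name in composed_of:
        resolved = deep_merge(resolved, component_templates[name])
    return deep_merge(resolved, index_template_body)


components = {
    "movie-static-mapping": {"mappings": {"properties": {"genre": {"type": "keyword"}}}},
    "movie-dynamic-mapping": {
        "mappings": {
            "dynamic_templates": [
                {"years_as_short": {"match_mapping_type": "long", "match": "*year",
                                    "mapping": {"type": "short"}}}
            ]
        }
    },
}
own_template = {
    "settings": {"number_of_shards": 1},
    "mappings": {"properties": {"director": {"type": "keyword"}}},
}

resolved = resolve(components, ["movie-static-mapping", "movie-dynamic-mapping"], own_template)
print(sorted(resolved["mappings"]["properties"]))
```

In this model, the resolved mapping contains both the genre and director properties as well as the years_as_short dynamic template.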

Legacy index templates were deprecated upon the release of Elasticsearch 7.8.

Figure 2.18 gives you an overview of the relationship between index templates, component templates, and legacy templates:

Figure 2.18 – Index templates versus legacy index templates

There’s more…

Elasticsearch provides predefined index templates that are associated with index and data stream patterns (you can find more details in Chapter 4), such as logs-*-*, metrics-*-*, and synthetics-*-*, with a default priority of 100. If you wish to create custom index templates that override the predefined ones but still use the same patterns, you can assign a priority value higher than 100. If you want to disable the built-in index and component templates altogether, you can set the stack.templates.enabled configuration parameter to false; the detailed documentation can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html.
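
The effect of priorities can be illustrated with a small simulation that picks the winning template for an index name. It is only a sketch: fnmatch approximates Elasticsearch's wildcard matching, select_template is a hypothetical helper, and the custom template name my-logs and its priority of 200 are made up for the example.

```python
from fnmatch import fnmatchcase


def select_template(index_name, templates):
    """Return the matching template with the highest priority, or None."""
    candidates = [
        template
        for template in templates
        if any(fnmatchcase(index_name, pattern) for pattern in template["index_patterns"])
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda template: template["priority"])


templates = [
    # Stand-in for a predefined template with the default priority of 100.
    {"name": "logs", "index_patterns": ["logs-*-*"], "priority": 100},
    # Hypothetical custom template that reuses the same pattern.
    {"name": "my-logs", "index_patterns": ["logs-*-*"], "priority": 200},
]

winner = select_template("logs-nginx-default", templates)
print(winner["name"])
```

Because the custom template has the higher priority, it is the one selected for logs-nginx-default in this simulation.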

Indexing multiple documents using Bulk API

In this recipe, we will explore how to use the Elasticsearch client to ingest an entire movie dataset using the bulk API. We will also integrate various concepts we have covered in previous recipes, specifically related to mappings, to ensure that the correct mappings are applied to our index.

Getting ready

For this recipe, we will work with the sample Wikipedia Movie Plots dataset introduced at the beginning of the chapter. The file is accessible in the GitHub repository via this URL: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/dataset/wiki_movie_plots_deduped.csv.

Make sure that you have completed the previous recipes:

Using an analyzer
Creating an index template

How to do it…

Head to the GitHub repository to download the Python script at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_bulk.py and then follow these steps:

Update the .env file with the MOVIE_DATASET variable, which specifies the path to the downloaded movie dataset CSV file:
```
MOVIE_DATASET=<the-path-of-the-csv-file>
```
Once the .env file is correctly configured, run the sampledata_bulk.py Python script. During execution, you should see output similar to the following (note that, for readability, the output image has been truncated):

Figure 2.19 – The output of the sampledata_bulk.py script

To verify that the new movies index has the appropriate mappings, head to Kibana | Dev Tools and execute the following command:
```
GET /movies/_mapping
```

Figure 2.20 – The movies index with a new mapping

As illustrated in Figure 2.20, dynamic mapping on the release_year field was applied to the newly created movies index, despite a mapping being explicitly specified in the script. This occurred because the index template defined in the Creating an index template recipe uses the index pattern movie*. As a result, any index whose name matches this pattern automatically inherits the settings from the template, including its dynamic mapping configuration.
Next, to verify that the entire dataset has been indexed, execute the following command:
```
GET /movies/_count
```
The command should produce the output illustrated in Figure 2.21. According to this output, your movies index should contain 34,886 documents:

Figure 2.21 – A count of the documents in bulk-indexed movies
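
The same verification can be scripted by comparing the index count with the number of data rows in the CSV file. The sketch below assumes client is a connected Elasticsearch Python client; csv_row_count and check_document_count are hypothetical helpers. The csv module is used rather than counting lines because a quoted field, such as a plot, may span several lines.

```python
import csv


def csv_row_count(csv_path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)


def check_document_count(client, index, csv_path):
    """Compare the document count of an index with the CSV row count."""
    expected = csv_row_count(csv_path)
    actual = client.count(index=index)["count"]
    return expected, actual, expected == actual
```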

We have just set up an index with the right explicit mapping and loaded an entire dataset by using the Elasticsearch Python client.

How it works...

The script we’ve provided contains several sections. First, we delete any existing movies index to make sure we start from a clean slate, which is why the award_year and review_year fields do not appear in the new mapping shown in Figure 2.20. We then use the create_index method to create the movies index and specify the settings and the mappings we wish to apply to the documents that will be stored in this index.

Then, there is the generate_actions function that yields a document for each row in our CSV dataset. This function is then used by the streaming_bulk helper method.
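
A generator of this kind might look like the following sketch. It is not the code from sampledata_bulk.py: the CSV column headers ("Release Year", "Title", and so on) are assumed from the public dataset, and the target field names follow the documents indexed earlier in this chapter.

```python
import csv


def generate_actions(csv_path, index_name="movies"):
    """Yield one bulk action per CSV row, reading the file lazily."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield {
                "_index": index_name,
                "_source": {
                    # Column headers below are assumptions about the dataset.
                    "release_year": int(row["Release Year"]),
                    "title": row["Title"],
                    "origin": row["Origin/Ethnicity"],
                    "director": row["Director"],
                    "cast": row["Cast"],
                    "genre": row["Genre"],
                    "wiki_page": row["Wiki Page"],
                    "plot": row["Plot"],
                },
            }
```

Because the function is a generator, only one row is held in memory at a time, which is what allows a bulk helper to consume the dataset incrementally.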

The streaming_bulk helper function in the Elasticsearch Python client is used to perform bulk indexing of documents in Elasticsearch. It works like the bulk helper function, but it is itself a generator that yields the result of each action as it is processed, which makes it well suited to large datasets.

The streaming_bulk function accepts an iterable of documents and sends them to Elasticsearch in small batches. This strategy allows you to efficiently process substantial datasets without exhausting system memory.
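
The batching idea can be illustrated without a cluster. The sketch below is not the library's implementation: chunked is a hypothetical helper that groups any iterable into lists of at most chunk_size items, and the value 500 mirrors the default chunk_size of the helper.

```python
from itertools import islice


def chunked(iterable, chunk_size=500):
    """Lazily group an iterable into lists of at most chunk_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 1,200 placeholder actions are consumed in three batches; only one
# batch is materialized in memory at any time.
sizes = [len(chunk) for chunk in chunked(range(1200), chunk_size=500)]
print(sizes)
```

With the real helper, the equivalent loop is `for ok, item in streaming_bulk(client, actions, chunk_size=500)`, where ok indicates whether the action succeeded and item holds the response for that action.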

There’s more…

The Elasticsearch Python client provides several helper functions for the bulk API, which can be challenging to use directly because of its specific formatting requirements and other considerations. These helpers accept an instance of the Elasticsearch class and an iterable of actions, which can be any iterable or generator.

The most common format for each action in the iterable is the same as a hit returned by the search() method, that is, a dictionary with keys such as _index, _id, and _source. The bulk API accepts the index, create, delete, and update actions. The _op_type field is used to specify an action, with _op_type defaulting to index. There are several bulk helpers available, including bulk(), parallel_bulk(), and streaming_bulk(). The following table outlines these helpers and their preferred use cases:

| Bulk helper function | Use case |
| --- | --- |
| `bulk()` | This helper performs bulk operations on a single thread and returns a summary once all actions have been processed. It is the simplest of the bulk helpers and is ideal for small- to medium-sized datasets. |
| `parallel_bulk()` | This helper performs bulk operations on multiple threads. It is ideal for large datasets and can significantly improve indexing performance. |
| `streaming_bulk()` | This helper consumes actions lazily and sends them in chunks. It is ideal for datasets that cannot fit into memory and can be used to stream data from a file or other source. |
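
To show what the actions look like for each operation type, here is a short sketch. The document IDs and field values are made up for illustration, and run_bulk assumes that the elasticsearch package is installed and that client is a connected Elasticsearch instance.

```python
def build_actions(index_name="movies"):
    """Return one sample action per operation type accepted by the bulk helpers."""
    return [
        # No _op_type: defaults to "index" (create or overwrite the document).
        {"_index": index_name, "_id": "1", "_source": {"title": "Titanic"}},
        # "create" fails for this item if the ID already exists.
        {"_op_type": "create", "_index": index_name, "_id": "2",
         "_source": {"title": "Avatar"}},
        # "update" sends a partial document under the "doc" key.
        {"_op_type": "update", "_index": index_name, "_id": "1",
         "doc": {"genre": "historical epic"}},
        # "delete" only needs the index and the document ID.
        {"_op_type": "delete", "_index": index_name, "_id": "2"},
    ]


def run_bulk(client, actions):
    """Send the actions with the bulk() helper and return its summary."""
    # Imported here so the sketch can be read and tested without the package.
    from elasticsearch.helpers import bulk

    success_count, errors = bulk(client, actions)
    return success_count, errors


op_types = [action.get("_op_type", "index") for action in build_actions()]
print(op_types)
```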