Python Data Cleaning and Preparation Best Practices

Data Ingestion Techniques

Data ingestion is a critical component of the data life cycle and sets the foundation for subsequent data transformation and cleaning. It involves the process of collecting and importing data from various sources into a storage system where it can be accessed and analyzed. Effective data ingestion is crucial for ensuring data quality, integrity, and availability, which directly impacts the efficiency and accuracy of data transformation and cleaning processes. In this chapter, we will dive deep into the different types of data sources, explore various data ingestion methods, and discuss their respective advantages, disadvantages, and real-world applications.

In this chapter, we’ll cover the following topics:

Ingesting data in batch mode
Ingesting data in streaming mode
Real-time versus semi-real-time ingestion
Data source solutions

Ingesting data in batch mode

Batch ingestion is a data processing technique whereby large volumes of data are collected, processed, and loaded into a system at scheduled intervals, rather than in real-time. This approach allows organizations to handle substantial amounts of data efficiently by grouping data into batches, which are then processed collectively. For example, a company might collect customer transaction data throughout the day and then process it in a single batch during off-peak hours. This method is particularly useful for organizations that need to process high volumes of data but do not require immediate analysis.

Batch ingestion is beneficial because it optimizes system resources by spreading the processing load across scheduled times, often when the system is underutilized. This reduces the strain on computational resources and can lower costs, especially in cloud-based environments where computing power is metered. Additionally, batch processing simplifies data management, as it allows for the easy application of consistent transformations and validations across large datasets. For organizations with regular, predictable data flows, batch ingestion provides a reliable, scalable, and cost-effective solution for data processing and analytics.

Let’s explore batch ingestion in more detail, starting with its advantages and disadvantages.

Advantages and disadvantages

Batch ingestion offers several notable advantages that make it an attractive choice for many data processing needs:

Efficiency: batch processing handles large volumes of data in a single operation, optimizing resource usage and minimizing overhead
Cost-effectiveness: it reduces the need for continuous processing resources, lowering operational expenses
Simplicity: periodic data processing tasks are easier to implement and manage than real-time ingestion, which often requires more complex infrastructure
Robustness: batch processing is well suited to complex data transformations and comprehensive data validation, which helps ensure high-quality, reliable data

However, batch ingestion also comes with certain drawbacks:

Latency: there is an inherent delay between the generation of data and its availability for analysis, which can be a critical issue for applications requiring real-time insights
Resource spikes: batch processing windows can lead to high resource usage and potential performance bottlenecks
Scalability: handling very large datasets may require significant infrastructure investment and management
Maintenance: batch jobs demand careful scheduling and ongoing upkeep to ensure timely and reliable execution

Let’s look at some common use cases for ingesting data in batch mode.

Common use cases for batch ingestion

Any data analytics platform such as data warehouses or data lakes requires regularly updated data for Business Intelligence (BI) and reporting. Batch ingestion is integral as it ensures that data is continually updated with the latest information, enabling businesses to perform comprehensive and up-to-date analyses. By processing data in batches, organizations can efficiently handle vast amounts of transactional and operational data, transforming it into a structured format suitable for querying and reporting. This supports BI initiatives, allowing analysts and decision-makers to generate insightful reports, track Key Performance Indicators (KPIs), and make data-driven decisions.

Extract, Transform, and Load (ETL) processes are a cornerstone of data integration projects, and batch ingestion plays a crucial role in these workflows. In ETL processes, data is extracted from various sources, transformed to fit the operational needs of the target system, and loaded into a database or data warehouse. Batch processing allows for efficient handling of these steps, particularly when dealing with large datasets that require significant transformation and cleansing. This method is ideal for periodic data consolidation, where data from disparate systems is integrated to provide a unified view, supporting activities such as data migration, system integration, and master data management.

Batch ingestion is also widely used for backups and archiving, which are critical processes for data preservation and disaster recovery. Periodic batch processing allows for the scheduled backup of databases, ensuring that all data is captured and securely stored at regular intervals. This approach minimizes the risk of data loss and provides a reliable restore point in case of system failures or data corruption. Additionally, batch processing is used for data archiving, where historical data is periodically moved from active systems to long-term storage solutions. This not only helps in managing storage costs but also ensures that important data is retained and can be retrieved for compliance, auditing, or historical analysis purposes.

Batch ingestion use cases

Batch ingestion is a methodical process involving several key steps: data extraction, data transformation, data loading, scheduling, and automation. To illustrate these steps, let’s explore a use case involving an investment bank that needs to process and analyze trading data for regulatory compliance and performance reporting.

Batch ingestion in an investment bank

An investment bank needs to collect, transform, and load trading data from various financial markets into a central data warehouse. This data will be used for generating daily compliance reports, evaluating trading strategies, and making informed investment decisions.

Data extraction

The first step is identifying the sources from which data will be extracted. For the investment bank, this includes trading systems, market data providers, and internal risk management systems. These sources contain critical data such as trade execution details, market prices, and risk assessments. Once the sources are identified, data is collected using connectors or scripts. This involves setting up data pipelines that extract data from trading systems, import real-time market data feeds, and pull risk metrics from internal systems. The extracted data is then temporarily stored in staging areas before processing.
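
The extraction step can be sketched with the standard library alone. This is a minimal illustration, assuming a trading system that exports trades as CSV text; the sample data, the field names (trade_id, symbol, quantity, price), and the extract_to_staging helper are hypothetical and not taken from the bank's actual systems:

```python
import csv
import io

# Hypothetical CSV export from a trading system (illustrative data only).
TRADES_CSV = """trade_id,symbol,quantity,price
T1,ABC,100,10.5
T2,XYZ,50,20.0
"""

def extract_to_staging(csv_text):
    """Read raw rows from a CSV source into an in-memory staging list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Values stay as raw strings here; typing and cleaning belong to the
    # transformation step.
    return [dict(row) for row in reader]

staging = extract_to_staging(TRADES_CSV)
print(f"Staged {len(staging)} raw records")
```

In a real pipeline, the staging area would more likely be a file store or a staging table than a Python list; the list only keeps the sketch self-contained.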

Data transformation

The extracted data often contains inconsistencies, duplicates, and missing values. Data cleaning is performed to remove duplicates, fill in missing information, and correct errors. For the investment bank, this ensures that trade records are accurate and complete, providing a reliable foundation for compliance reporting and performance analysis. After cleaning, the data undergoes transformations such as aggregations, joins, and calculations. For example, the investment bank might aggregate trade data to calculate daily trading volumes, join trade records with market data to analyze price movements, and calculate key metrics such as Profit and Loss (P&L) and risk exposure. The transformed data must be mapped to the schema of the target system. This involves aligning the data fields with the structure of the data warehouse. For instance, trade data might be mapped to tables representing transactions, market data, and risk metrics, ensuring seamless integration with the existing data model.
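
The cleaning, aggregation, and join steps described above might look like the following sketch. The sample trades, the closing prices, and the helper names (clean, daily_volume, join_with_prices) are hypothetical. Filling a missing quantity with 0 is only one possible policy; the right choice depends on the business rules:

```python
from collections import defaultdict

# Hypothetical staged trades; T2 is duplicated and T3 lacks a quantity.
raw_trades = [
    {"trade_id": "T1", "symbol": "ABC", "quantity": 100},
    {"trade_id": "T2", "symbol": "XYZ", "quantity": 50},
    {"trade_id": "T2", "symbol": "XYZ", "quantity": 50},
    {"trade_id": "T3", "symbol": "ABC", "quantity": None},
    {"trade_id": "T4", "symbol": "ABC", "quantity": 25},
]
# Hypothetical closing prices from a market data provider.
close_prices = {"ABC": 10.0, "XYZ": 20.0}

def clean(trades):
    """Drop duplicate trade IDs and fill missing quantities with 0."""
    seen, cleaned = set(), []
    for trade in trades:
        if trade["trade_id"] in seen:
            continue
        seen.add(trade["trade_id"])
        quantity = trade["quantity"] if trade["quantity"] is not None else 0
        cleaned.append({**trade, "quantity": quantity})
    return cleaned

def daily_volume(trades):
    """Aggregate the traded quantity per symbol."""
    volume = defaultdict(int)
    for trade in trades:
        volume[trade["symbol"]] += trade["quantity"]
    return dict(volume)

def join_with_prices(volume, prices):
    """Join volumes with prices and compute the traded notional."""
    return {
        symbol: {"volume": qty, "notional": qty * prices[symbol]}
        for symbol, qty in volume.items()
    }

report = join_with_prices(daily_volume(clean(raw_trades)), close_prices)
print(report)
```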

Data loading

The transformed data is processed in batches, which allows the investment bank to handle large volumes of data efficiently, performing complex transformations and aggregations in a single run. Once processed, the data is loaded into the target storage system, such as a data warehouse or data lake. For the investment bank, this means loading the cleaned and transformed trading data into their data warehouse, where it can be accessed for compliance reporting and performance analysis.
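
As a rough illustration of the loading step, the sketch below writes one batch into an in-memory SQLite database, which stands in for the data warehouse. The trades table, its columns, and the load_batch helper are assumptions made for this example:

```python
import sqlite3

def load_batch(conn, batch):
    """Insert one batch of records in a single transaction."""
    with conn:  # Commits on success, rolls back if an error is raised
        conn.executemany(
            "INSERT INTO trades (trade_id, symbol, quantity) VALUES (?, ?, ?)",
            [(r["trade_id"], r["symbol"], r["quantity"]) for r in batch],
        )

# In-memory database used as a stand-in for the target system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id TEXT PRIMARY KEY, symbol TEXT, quantity INTEGER)"
)
load_batch(conn, [
    {"trade_id": "T1", "symbol": "ABC", "quantity": 100},
    {"trade_id": "T2", "symbol": "XYZ", "quantity": 50},
])
row_count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
print(f"Rows loaded: {row_count}")
```

Loading a whole batch inside one transaction means a failure partway through leaves the table unchanged, which makes it simpler to rerun the batch.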

Scheduling and automation

To ensure that the batch ingestion process runs smoothly and consistently, scheduling tools such as Apache Airflow or Cron jobs are used. These tools automate the data ingestion workflows, scheduling them to run at regular intervals, such as every night or every day. This allows the investment bank to have up-to-date data available for analysis without manual intervention. Implementing monitoring is crucial to track the success and performance of batch jobs. Monitoring tools provide insights into job execution, identifying any failures or performance bottlenecks. For the investment bank, this ensures that any issues in the data ingestion process are promptly detected and resolved, maintaining the integrity and reliability of the data pipeline.
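
The monitoring idea can be sketched without any scheduler installed. The run_with_monitoring wrapper and the flaky_job function below are hypothetical; a tool such as cron or Apache Airflow would normally trigger the job and provide its own retry and alerting features, so this is only a stand-in for that behavior:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("batch_job")

# A crontab entry such as "0 2 * * * python run_batch.py" would trigger a
# script every night at 02:00 (the script name is hypothetical).

def run_with_monitoring(job, max_retries=2):
    """Run a job, log its duration and outcome, and retry on failure."""
    for attempt in range(1, max_retries + 2):
        start = time.perf_counter()
        try:
            result = job()
        except Exception:
            logger.exception("Attempt %d failed", attempt)
            continue
        elapsed = time.perf_counter() - start
        logger.info("Attempt %d succeeded in %.3fs", attempt, elapsed)
        return {"status": "success", "attempts": attempt, "result": result}
    return {"status": "failed", "attempts": max_retries + 1, "result": None}

calls = {"count": 0}

def flaky_job():
    """Fail on the first call to simulate a transient error."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated transient failure")
    return 42

outcome = run_with_monitoring(flaky_job)
print(outcome)
```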

Batch ingestion with an example

Let’s have a look at a simple example of a batch processing ingestion system written in Python. This example will simulate the ETL process. We’ll generate some mock data, process it in batches, and load it into a simulated database.

You can find the code for this part in the GitHub repository at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/1.batch.py. To run this example, no additional libraries need to be installed; the code uses only the random and time modules from the Python standard library (Python 3.x):

We create a generate_mock_data function that generates a list of mock data records:

import random

def generate_mock_data(num_records):
    data = []
    for _ in range(num_records):
        record = {
            'id': random.randint(1, 1000),
            'value': random.random() * 100
        }
        data.append(record)
    return data

Each record is a dictionary with two fields:

id: A random integer between 1 and 1000
value: A random float between 0 and 100

Let’s have a look at what the data looks like:

print("Original data:", data)
{'id': 449, 'value': 99.79699336555473}
{'id': 991, 'value': 79.65999078145887}

A list of dictionaries is returned, each representing a data record.

Next, we create a batch processing function:
```
def process_in_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
```
This function takes the data, which is a list of data records to process, and batch_size, which represents the number of records per batch, as parameters. The function uses a for loop to iterate over the data in steps of batch_size. The yield keyword is used to generate batches of data, each of the batch_size size. A generator that yields batches of data is returned.
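
To make the slicing behavior concrete, here is a short usage sketch. The function is repeated so that the snippet runs on its own, and the input list is arbitrary:

```python
def process_in_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# Seven records with a batch size of three: the last batch is smaller.
batches = list(process_in_batches(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Note that the final batch holds only the leftover records, and that batch_size must be a positive integer, because range() raises a ValueError when its step is zero.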

We create a transform_data function that transforms each record in the batch:

def transform_data(batch):
    transformed_batch = []
    for record in batch:
        transformed_record = {
            'id': record['id'],
            'value': record['value'],
            'transformed_value': record['value'] * 1.1
        }
        transformed_batch.append(transformed_record)
    return transformed_batch

This function takes as an argument the batch, which is a list of data records to be transformed. The transformation logic is simple: a new transformed_value field is added to each record, which is the original value multiplied by 1.1. At the end, we have a list of transformed records. Let’s have a look at some of our transformed records:

{'id': 558, 'value': 12.15160339587219, 'transformed_value': 13.36676373545941}
{'id': 449, 'value': 99.79699336555473, 'transformed_value': 109.77669270211021}
{'id': 991, 'value': 79.65999078145887, 'transformed_value': 87.62598985960477}

Next, we create a load_data function to load the data. This function simulates loading each transformed record into a database:
```
def load_data(batch):
    for record in batch:
        # Simulate loading data into a database
        print(f"Loading record into database: {record}")
```
This function takes the batch as a parameter, which is a list of transformed data records that is ready to be loaded. Each record is printed to the console to simulate loading it into a database.

Finally, we create a main function. This function calls all the aforementioned functions:

import time

def main():
    # Parameters
    num_records = 100 # Total number of records to generate
    batch_size = 10 # Number of records per batch
    # Generate data
    data = generate_mock_data(num_records)
    # Process and load data in batches
    for batch in process_in_batches(data, batch_size):
        transformed_batch = transform_data(batch)
        print("Batch before loading:")
        for record in transformed_batch:
            print(record)
        load_data(transformed_batch)
        time.sleep(1) # Simulate time delay between batches

if __name__ == "__main__":
    main()

This function calls generate_mock_data to create the mock data and uses process_in_batches to divide the data into batches. For each batch, the function does the following:

Transforms the batch using transform_data
Prints the batch to show its contents before loading
Simulates loading the batch using load_data

Now, let’s transition from batch processing to a streaming paradigm. In streaming, data is processed as it arrives, rather than in predefined batches.

Ingesting data in streaming mode

Streaming ingestion is a data processing technique whereby data is collected, processed, and loaded into a system in real-time, as it is generated. Unlike batch ingestion, which accumulates data for processing at scheduled intervals, streaming ingestion handles data continuously, allowing organizations to analyze and act on information immediately. For instance, a company might process customer transaction data the moment it occurs, enabling real-time insights and decision-making. This method is particularly useful for organizations that require up-to-the-minute data analysis, such as in financial trading, fraud detection, or sensor data monitoring.

Streaming ingestion is advantageous because it enables immediate processing of data, reducing latency and allowing organizations to react quickly to changing conditions. This is particularly beneficial in scenarios where timely responses are critical, such as detecting anomalies, personalizing user experiences, or responding to real-time events. Additionally, streaming can lead to more efficient resource utilization by distributing the processing load evenly over time, rather than concentrating it into specific batch windows. In cloud-based environments, this can also translate into cost savings, as resources can be scaled dynamically to match the real-time data flow. For organizations with irregular or unpredictable data flows, streaming ingestion offers a flexible, responsive, and scalable approach to data processing and analytics. Let’s look at some of its advantages and disadvantages.

Advantages and disadvantages

Streaming ingestion offers several distinct advantages, making it an essential choice for specific data processing needs:

One of the primary benefits is the ability to obtain real-time insights from data. This immediacy is crucial for applications such as fraud detection, real-time analytics, and dynamic pricing, where timely data is vital.
Streaming ingestion supports continuous data processing, allowing systems to handle data as it arrives, thereby reducing latency and improving responsiveness.
This method is highly scalable and capable of managing high-velocity data streams from multiple sources without significant delays.

However, streaming ingestion also presents some challenges:

Implementing a streaming ingestion system can be complex, requiring sophisticated infrastructure and specialized tools to manage data streams effectively.
Continuous processing demands constant computational resources, which can be costly and resource-intensive.
Ensuring data consistency and accuracy in a streaming environment can be difficult due to the constant influx of data and the potential for out-of-order or duplicate records
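
The last challenge, duplicate and out-of-order records, can be illustrated with a small sketch. The event fields (event_id, ts) and the deduplicate and reorder helpers are hypothetical, and the approach is deliberately simple: the set of seen IDs grows without bound, and an event that arrives later than the buffer allows is still emitted out of order. Production streaming platforms handle both problems with more elaborate mechanisms:

```python
import heapq
import itertools

def deduplicate(events):
    """Yield each event once, keyed by its 'event_id'."""
    seen = set()
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            yield event

def reorder(events, buffer_size):
    """Emit events by timestamp, holding back up to buffer_size events."""
    heap = []
    tie_breaker = itertools.count()  # Avoids comparing dicts on equal timestamps
    for event in events:
        heapq.heappush(heap, (event["ts"], next(tie_breaker), event))
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)[2]
    while heap:  # Drain the buffer when the stream ends
        yield heapq.heappop(heap)[2]

# Hypothetical stream: event "c" arrives early and is also delivered twice.
events = [
    {"event_id": "a", "ts": 1},
    {"event_id": "c", "ts": 3},
    {"event_id": "b", "ts": 2},
    {"event_id": "c", "ts": 3},
    {"event_id": "d", "ts": 4},
]
ordered_ids = [e["event_id"] for e in reorder(deduplicate(events), buffer_size=2)]
print(ordered_ids)  # ['a', 'b', 'c', 'd']
```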

Let’s look at common use cases for ingesting data in streaming mode.

Common use cases for streaming ingestion

While batch processing is well-suited for periodic, large-scale data updates and transformations, streaming data ingestion is crucial for real-time data analytics and applications that require immediate insights. Here are some common use cases for streaming data ingestion.

Real-time fraud detection and security monitoring

Financial institutions use streaming data to detect fraudulent activities by analyzing transaction data in real-time. Immediate anomaly detection helps prevent fraud before it can cause significant damage. Streaming data is used in cybersecurity to detect and respond to threats immediately. Continuous monitoring of network traffic, user behavior, and system logs helps identify and mitigate security breaches as they occur.

IoT and sensor data

In manufacturing, streaming data from sensors on machinery allows for predictive maintenance. By continuously monitoring equipment health, companies can prevent breakdowns and optimize maintenance schedules.

Another interesting application in the IoT and sensors space is smart cities. Streaming data from various sensors across a city (traffic, weather, pollution, etc.) helps in managing city operations in real-time, improving services such as traffic management and emergency response.

Online recommendations and personalization

Streaming data enables e-commerce platforms to provide real-time recommendations to users based on their current browsing and purchasing behavior. This enhances user experience and increases sales. Platforms such as Netflix and Spotify use streaming data to update recommendations as users interact with the service, providing personalized content suggestions in real-time.

Financial market data

Stock traders rely on streaming data for up-to-the-second information on stock prices and market conditions to make informed trading decisions. Automated trading systems use streaming data to execute trades based on predefined criteria, requiring real-time data processing for optimal performance.

Telecommunications

Telecommunication companies use streaming data to monitor network performance and usage in real-time, ensuring optimal service quality and quick resolution of issues. Streaming data also helps in tracking customer interactions and service usage in real-time, enabling personalized customer support and improving the overall experience.

Real-time logistics and supply chain management

Streaming data from GPS devices allows logistics companies to track vehicle locations and optimize routes in real-time, improving delivery efficiency. Real-time inventory tracking helps businesses maintain optimal stock levels, reducing overstock and stockouts while ensuring timely replenishment.

Streaming ingestion in an e-commerce platform

Streaming ingestion is a methodical process involving several key steps: data extraction, data transformation, data loading, and monitoring and alerting. To illustrate these steps, let’s explore a use case involving an e-commerce platform that needs to process and analyze user activity data in real-time for personalized recommendations and dynamic inventory management.

An e-commerce platform needs to collect, transform, and load user activity data from various sources such as website clicks, search queries, and purchase transactions into a central system. This data will be used for generating real-time personalized recommendations, monitoring user behavior, and managing inventory dynamically.

Data extraction

The first step is identifying the sources from which data will be extracted. For the e-commerce platform, this includes web servers, mobile apps, and third-party analytics services. These sources contain critical data such as user clicks, search queries, and transaction details. Once the sources are identified, data is collected using streaming connectors or APIs. This involves setting up data pipelines that extract data from web servers, mobile apps, and analytics services in real-time. The extracted data is then streamed to processing systems such as Apache Kafka or AWS Kinesis.

Data transformation

The extracted data often contains inconsistencies and noise. Real-time data cleaning is performed to filter out irrelevant information, handle missing values, and correct errors. For the e-commerce platform, this ensures that user activity records are accurate and relevant for analysis. After cleaning, the data undergoes transformations such as parsing, enrichment, and aggregation. For example, the e-commerce platform might parse user clickstream data to identify browsing patterns, enrich transaction data with product details, and aggregate search queries to identify trending products. The transformed data must be mapped to the schema of the target system. This involves aligning the data fields with the structure of the real-time analytics system. For instance, user activity data might be mapped to tables representing sessions, products, and user profiles, ensuring seamless integration with the existing data model.
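
The parse, enrich, and aggregate steps can be sketched as a chain of generators that handle one event at a time. The raw event format (JSON lines), the CATALOG lookup table, and the function names are assumptions made for this illustration:

```python
import json
from collections import Counter

# Hypothetical product catalog used for enrichment.
CATALOG = {"p1": "laptop", "p2": "phone"}

def parse(raw_lines):
    """Parse JSON lines, skipping malformed records (noise)."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def enrich(events):
    """Attach the product name; drop events with unknown products."""
    for event in events:
        name = CATALOG.get(event.get("product_id"))
        if name is not None:
            yield {**event, "product_name": name}

# Hypothetical raw clickstream, including one malformed line and one
# event that refers to a product missing from the catalog.
raw = [
    '{"user": "u1", "product_id": "p1"}',
    'not json',
    '{"user": "u2", "product_id": "p1"}',
    '{"user": "u1", "product_id": "p2"}',
    '{"user": "u3", "product_id": "p9"}',
]

trending = Counter()
for event in enrich(parse(raw)):
    trending[event["product_name"]] += 1  # Running aggregate per product

print(trending.most_common(1))  # [('laptop', 2)]
```

Because each stage is a generator, no stage waits for the full input: every event flows through parsing and enrichment as soon as it is read.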

Data loading

The transformed data is processed continuously using tools such as Apache Flink or Apache Spark Streaming. Continuous processing allows the e-commerce platform to handle high-velocity data streams efficiently, performing transformations and aggregations in real-time. Once processed, the data is loaded into the target storage system, such as a real-time database or analytics engine, where it can be accessed for personalized recommendations and dynamic inventory management.

Monitoring and alerting

To ensure that the streaming ingestion process runs smoothly and consistently, monitoring tools such as Prometheus or Grafana are used. These tools provide real-time insights into the performance and health of the data ingestion pipelines, identifying any failures or performance bottlenecks. Implementing alerting mechanisms is crucial to promptly detect and resolve any issues in the streaming ingestion process. For the e-commerce platform, this ensures that any disruptions in data flow are quickly addressed, maintaining the integrity and reliability of the data pipeline.
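
The alerting logic itself can be illustrated with a sliding-window error rate. This sketch uses only the standard library and does not reflect the Prometheus or Grafana APIs; the ErrorRateMonitor class, the window size, and the threshold are hypothetical:

```python
from collections import deque

class ErrorRateMonitor:
    """Track the outcomes of the last `window` records and flag high error rates."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)  # Oldest outcomes drop off automatically
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        return self.error_rate() > self.threshold

monitor = ErrorRateMonitor(window=4, threshold=0.25)
for ok in [True, True, True, False]:
    monitor.record(ok)
alert_before = monitor.should_alert()  # 1 failure in 4: at, not above, the threshold

monitor.record(False)  # The window is now [True, True, False, False]
alert_after = monitor.should_alert()
print(alert_before, alert_after)  # False True
```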

Streaming ingestion with an example

As we said, in streaming, data is processed as it arrives rather than in predefined batches. Let’s modify the batch example to transition to a streaming paradigm. For simplicity, we will generate data continuously, process it immediately upon arrival, transform it, and then load it:

The generate_mock_data function generates records continuously using a generator and simulates a delay between each record:

def generate_mock_data():
    while True:
        record = {
            'id': random.randint(1, 1000),
            'value': random.random() * 100
        }
        yield record
        time.sleep(0.5)  # Simulate data arriving every 0.5 seconds

The process_stream function processes each record as it arrives from the data generator, without waiting for a batch to be filled:

def process_stream(run_time_seconds=10):
    start_time = time.time()
    for record in generate_mock_data():
        transformed_record = transform_data(record)
        load_data(transformed_record)
        # Check if the run time has exceeded the limit
        if time.time() - start_time > run_time_seconds:
            print("Time limit reached. Terminating the stream processing.")
            break

The transform_data function transforms each record individually as it arrives:

def transform_data(record):
    transformed_record = {
        'id': record['id'],
        'value': record['value'],
        'transformed_value': record['value'] * 1.1  # Example transformation
    }
    return transformed_record

The load_data function simulates loading data by processing each record as it arrives, instead of processing each record within a batch as before:
```
def load_data(record):
    print(f"Loading record into database: {record}")
```

Let’s move from real-time to semi-real-time processing, which you can think of as batch processing over short intervals. It is usually called micro-batch processing.

Real-time versus semi-real-time ingestion

Real-time ingestion refers to the process of collecting, processing, and loading data almost instantaneously as it is generated, as we have discussed. This approach is critical for applications that require immediate insights and actions, such as fraud detection, stock trading, and live monitoring systems. Real-time ingestion provides the lowest latency, enabling businesses to react to events as they occur. However, it demands robust infrastructure and continuous resource allocation, making it complex and potentially expensive to maintain.

Semi-real-time ingestion, on the other hand, also known as near real-time ingestion, involves processing data with minimal delay, typically in seconds or minutes, rather than instantly. This approach strikes a balance between real-time and batch processing, providing timely insights while reducing the resource intensity and complexity associated with true real-time systems. Semi-real-time ingestion is suitable for applications such as social media monitoring, customer feedback analysis, and operational dashboards, where near-immediate data processing is beneficial but not critically time-sensitive.

Common use cases for near-real-time ingestion

Let’s look at some of the common use cases wherein we can use near-real-time ingestion.

Real-time analytics

Streaming enables organizations to continuously monitor data as it flows in, allowing for real-time dashboards and visualizations. This is critical in industries such as finance, where stock prices, market trends, and trading activities need to be tracked live. It also allows for instant report generation, facilitating timely decision-making and reducing the latency between data generation and analysis.

Social media and sentiment analysis

Companies track mentions and sentiments on social media in real-time to manage brand reputation and respond to customer feedback promptly. Streaming data allows for the continuous analysis of public sentiment towards brands, products, or events, providing immediate insights that can influence marketing and PR strategies.

Customer experience enhancement

Near-real-time processing allows support teams to access up-to-date information on customer issues and behavior, enabling quicker and more accurate responses to customer inquiries. Businesses can also use near-real-time data to update customer profiles and trigger personalized marketing messages, such as emails or notifications, shortly after a customer interacts with their website or app.

Semi-real-time mode with an example

Transitioning from real-time to semi-real-time data processing involves adjusting the example to introduce a more structured approach to handling data updates, rather than processing each record immediately upon arrival. This can be achieved by batching data updates over short intervals, which allows for more efficient processing while still maintaining a responsive data processing pipeline. Let’s have a look at the example. As always, you can find the code in the GitHub repository at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/3.semi_real_time.py:

For generating mock data continuously, the generator is the same as in the previous example, except that it uses a shorter delay between records (time.sleep(0.1)).
For processing in semi-real-time, we can use a deque to buffer incoming records. This function processes records when either the specified time interval has elapsed, or the buffer reaches a specified size (batch_size). Then, it converts the deque to a list (list(buffer)) before passing it to transform_data, ensuring the data is processed in a batch:
```
import time
from collections import deque

def process_semi_real_time(batch_size, interval):
    buffer = deque()
    start_time = time.time()
    for record in generate_mock_data():
        buffer.append(record)
```

Check whether the interval has elapsed, or the buffer size has been reached:

        if (time.time() - start_time) >= interval or len(buffer) >= batch_size:

Process and clear the buffer:

            transformed_batch = transform_data(list(buffer))  # Convert deque to list
            print(f"Batch of {len(transformed_batch)} records before loading:")
            for rec in transformed_batch:
                print(rec)
            load_data(transformed_batch)
            buffer.clear()
            start_time = time.time()  # Reset start time

Then, we transform each record in the batch and load the data. Because transform_data and load_data receive a list of records here, they are the batch versions from the batch example, unchanged.

When you run this code, it continuously generates mock data records. Records are buffered until either the specified time interval (interval) has elapsed, or the buffer reaches the specified size (batch_size). Once the conditions are met, the buffered records are processed as a batch, transformed, and then “loaded” (printed) into the simulated database.
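
The same buffering logic can be written as a generator over a finite input, which makes it easier to test. This is a variant under stated assumptions rather than the repository's code: the micro_batches name, the injectable clock parameter, and the final flush of leftover records are additions made for this illustration:

```python
import time
from collections import deque

def micro_batches(records, batch_size, interval, clock=time.monotonic):
    """Group records into batches by size or by elapsed time."""
    buffer = deque()
    start = clock()
    for record in records:
        buffer.append(record)
        if clock() - start >= interval or len(buffer) >= batch_size:
            yield list(buffer)
            buffer.clear()
            start = clock()  # Reset the timer for the next batch
    if buffer:  # Flush what is left when the stream ends
        yield list(buffer)

# With a long interval, only the size limit triggers a flush.
batches = list(micro_batches(range(7), batch_size=3, interval=60))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

As in the original loop, the elapsed time is only checked when a record arrives, so a stream that goes quiet holds its buffered records until the next record shows up.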

When discussing the different types of data sources that are suitable for batch, streaming, or semi-real-time streaming processing, it’s essential to consider the diversity and characteristics of these sources. Data can originate from various sources, such as databases, logs, IoT devices, social media, or sensors, as we will see in the next section.

Data source solutions

In the world of modern data analytics and processing, the diversity of data sources available for ingestion spans a wide spectrum. From traditional file formats such as CSV, JSON, and XML to robust database systems encompassing both SQL and NoSQL variants, the landscape expands further to include dynamic APIs such as REST, facilitating real-time data retrieval. Message queues such as Kafka offer scalable solutions for handling event-driven data while streaming services such as Kinesis and pub/sub enable continuous data flows crucial for applications demanding immediate insights. Understanding and effectively harnessing these diverse data ingestion sources is fundamental to building robust data pipelines that support a broad array of analytical and operational needs.

Let’s start with event processing.

Event data processing solution

In a real-time processing system, data is ingested, processed, and responded to almost instantaneously, as we’ve discussed. Real-time processing systems often use message queues to handle incoming data streams and ensure that data is processed in the order it is received, without delays.

The following Python code demonstrates a basic example of using a message queue for processing messages, which is a foundational concept in both real-time and semi-real-time data processing systems. The Queue class from Python’s queue module is used to create a queue, a data structure that follows the first-in, first-out (FIFO) principle. In this context, a queue is used to manage messages or tasks that need to be processed. The code simulates an event-based system where messages (in this case, strings such as message 0, message 1, and so on) are added to a queue. This mimics a scenario in which events or tasks are generated and need to be processed in the order they arrive. Let’s have a look at each part of the code. You can find the code file at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/4.work_with_queue.py:

The script first imports the Queue class from the queue module. The read_message_queue() function then initializes a queue object q:
```
from queue import Queue

def read_message_queue():
    q = Queue()
```
This loop adds 10 messages to the queue. Each message is a string in the format message i, where i ranges from 0 to 9:
```
    # Inside read_message_queue()
    for i in range(10):  # Mocking messages
        q.put(f"message {i}")
```
This loop continuously retrieves and processes messages from the queue until it is empty. q.get() retrieves a message from the queue, and q.task_done() signals that the retrieved message has been processed:
```
    # Inside read_message_queue()
    while not q.empty():
        message = q.get()
        process_message(message)
        q.task_done()  # Signal that the task is done
```
The following function takes a message as input and prints it to the console, simulating the processing of the message:
```
def process_message(message):
    print(f"Processing message: {message}")
```
Call the read_message_queue function:
```
read_message_queue()
```

Here, the read_message_queue function reads messages from the queue and processes them one by one using the process_message function. This demonstrates how event-based systems handle tasks—by placing them in a queue and processing each task as it becomes available.

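Because Queue is a FIFO structure, q.get() returns messages in the exact order they were added; the while not q.empty() loop simply keeps draining the queue until no messages remain. Preserving order is crucial in many real-world applications, such as handling user requests or processing logs. Note that q.empty() is only a reliable stopping condition in a single-threaded example like this one; with concurrent producers, the queue can be momentarily empty while more messages are still on the way.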

The q.task_done() method signals that a retrieved message has been fully processed. Its counterpart, q.join(), blocks until every item put on the queue has been marked as done, which is how systems with multiple workers or threads track completion reliably. In this single-threaded example, the call has no visible effect, but it reflects the pattern used in real-world systems.
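To make the role of task_done() more concrete, the following minimal sketch moves the processing into a worker thread and uses q.join() to wait for completion. The worker and run_pipeline names, and the None sentinel used to stop the worker, are illustrative choices rather than part of the chapter's code file:

```python
import threading
from queue import Queue


def worker(q, results):
    """Pull messages from the queue until a None sentinel arrives."""
    while True:
        message = q.get()  # blocks until an item is available
        try:
            if message is None:  # sentinel: no more work
                break
            results.append(f"processed {message}")
        finally:
            q.task_done()  # exactly one task_done() per get()


def run_pipeline(num_messages=5):
    q = Queue()
    results = []
    thread = threading.Thread(target=worker, args=(q, results))
    thread.start()
    for i in range(num_messages):
        q.put(f"message {i}")
    q.put(None)  # tell the worker to stop
    q.join()  # returns once every queued item has been marked done
    thread.join()
    return results


print(run_pipeline())
```

With a single worker, the results keep the order in which the messages were queued. With several workers, each message is still handled once, but the order in which processing finishes is no longer guaranteed.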

In real-world applications, message queues are often integrated into more sophisticated data streaming platforms to ensure scalability, fault tolerance, and high availability. For instance, in real-time data processing, platforms such as Kafka and AWS Kinesis come into play.

Ingesting event data with Apache Kafka

There are different technologies to ingest and handle event data. One of the technologies we will discuss is Apache Kafka. Kafka is an open source distributed event streaming platform first developed by LinkedIn and later donated to the Apache Software Foundation. It is designed to handle large amounts of data in real-time and provides a scalable and fault-tolerant system for processing and storing streams.

Figure 1.1 – Components of Apache Kafka

Let’s see the different components of Apache Kafka:

Ingestion: Data streams can be ingested into Kafka using Kafka producers. Producers are applications that write data to Kafka topics, which are logical channels that can hold and organize data streams.
Processing: Kafka can process streams of data using Kafka Streams, a Java client library for building real-time stream processing applications. Kafka Streams allows developers to build custom stream-processing applications that can perform transformations, aggregations, and other operations on data streams.
Storage: Kafka stores data streams in a distributed, fault-tolerant cluster of servers called brokers. Brokers store the data streams in partitions, which are replicated across multiple brokers for fault tolerance.
Consumption: Data streams can be consumed from Kafka using Kafka consumers. Consumers are applications that read data from Kafka topics and process it as needed.
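The relationship between producers, partitions, and consumers can be sketched with a toy in-memory model. This is not the Kafka client API: ToyTopic and ToyConsumer are hypothetical classes written for illustration, and a CRC32 hash stands in for Kafka's real partitioner. The sketch only shows two ideas: records with the same key land in the same partition, and a consumer tracks an offset per partition so that it reads each record once.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic; not the Kafka client API."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Records with the same key always land in the same partition
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class ToyConsumer:
    """Reads each partition in order and remembers its own offsets."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition index -> next offset

    def poll(self):
        records = []
        for index, log in enumerate(self.topic.partitions):
            start = self.offsets[index]
            records.extend(log[start:])
            self.offsets[index] = len(log)  # advance past what was read
        return records


topic = ToyTopic(num_partitions=2)
for i in range(4):
    topic.produce(key=f"user-{i % 2}", value=f"event {i}")

consumer = ToyConsumer(topic)
first = consumer.poll()   # all four records, ordered within each partition
second = consumer.poll()  # nothing new since the last poll
print(first)
print(second)
```

Ordering here is guaranteed only within a partition, which is why events for the same key stay in sequence while events for different keys may interleave differently.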

Several libraries can be used to interact with Apache Kafka in Python; we will explore the most popular ones in the next section.

Which library should you use for your use case?

kafka-python is a pure Python implementation of Kafka’s protocol, offering a Pythonic interface for interacting with Kafka. Its main advantage is simplicity: it is easy to install and use, which makes it particularly appealing for beginners. It is flexible and well-suited for small to medium-sized applications, providing the essential features needed for basic Kafka operations without the complexity of additional dependencies. Because it is pure Python, it does not require a compiled client library, which streamlines installation and setup.

confluent-kafka-python is a library developed and maintained by Confluent, the company founded by the original creators of Kafka. It stands out for its high performance and low latency, which come from building on the librdkafka C library. The library offers extensive configuration options akin to the Java Kafka client and closely aligns with Kafka’s feature set, often being among the first Python clients to support new Kafka features. It is particularly well-suited for production environments where both performance and stability are crucial, making it an ideal choice for handling high-throughput data streams and ensuring reliable message processing in critical applications.
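As a rough illustration of what producing events looks like, here is a minimal sketch that assumes kafka-python's KafkaProducer, whose send(topic, value) and flush() methods are used below. The serialize_event and publish_events helpers, the transactions topic, and the localhost:9092 broker address are hypothetical names chosen for the example:

```python
import json


def serialize_event(event):
    """Encode a dict as UTF-8 JSON bytes, a common form for message values."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def publish_events(producer, topic, events):
    """Send each event, then block until buffered messages are delivered."""
    count = 0
    for event in events:
        producer.send(topic, serialize_event(event))
        count += 1
    producer.flush()
    return count


# Usage with kafka-python (assumes `pip install kafka-python` and a broker
# listening on the hypothetical address localhost:9092):
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   publish_events(producer, "transactions", [{"id": 1, "amount": 9.99}])

print(serialize_event({"id": 1, "amount": 9.99}))
```

Passing the producer in as an argument keeps the helper independent of a running broker, so it can be exercised with a stand-in object during testing. The confluent-kafka client follows the same overall pattern but names its sending method produce() rather than send().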

Transitioning from event data processing to databases involves shifting focus from real-time data streams to persistent data storage and retrieval. While event data processing emphasizes handling continuous streams of data in near real-time for immediate insights or actions, databases are structured repositories designed for storing and managing data over the long term.

Ingesting data from databases

Databases, whether relational or non-relational, serve as foundational components in data management systems. Classic databases and NoSQL databases are two different types of database management systems that differ in architecture and characteristics. A classic database, also known as a relational database, stores data in tables with a fixed schema. Classic databases are ideal for applications that require complex querying and transactional consistency, such as financial systems or enterprise applications.

On the other hand, NoSQL databases do not store data in tables with a fixed schema. Many use a document-based approach with a flexible schema, while others use key-value, wide-column, or graph models. They are designed to be scalable and handle large amounts of data, with a focus on high-performance data retrieval. NoSQL databases are well-suited for applications that require high performance and scalability, such as real-time analytics, content management systems, and e-commerce platforms.

Let’s start with relational databases.

Performing data ingestion from a relational database

Relational databases are useful for batch ETL processes where structured data from various sources needs consolidation, transformation, and loading into a data warehouse or analytical system. SQL-based operations are efficient for joining and aggregating data before processing. Let’s try to understand how SQL databases represent data in tables with rows and columns using a code example. We’ll simulate a basic SQL database interaction using Python dictionaries to represent tables and rows. You can see the full code example at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/5.sql_databases.py:

We create a read_sql function that simulates reading rows from a SQL table, represented here as a list of dictionaries where each dictionary corresponds to a row in the table:

```
def read_sql():
    # Simulating a SQL table with a list of dictionaries (one per row)
    sql_table = [
        {"id": 1, "name": "Alice", "age": 30},
        {"id": 2, "name": "Bob", "age": 24},
    ]
    for row in sql_table:
        process_row(row)
```

The process_row function takes a row (dictionary) as input and prints its contents, simulating the processing of a row from a SQL table:
```
def process_row(row):
    print(f"Processing row: id={row['id']}, name={row['name']}, age={row['age']}")
read_sql()
```

Let’s print our SQL table in the proper format:

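```
# sql_table is local to read_sql(), so place these lines inside that
# function (or define sql_table at module level)
print(f"{'id':<5} {'name':<10} {'age':<3}")
print("-" * 20)
# Print each row
for row in sql_table:
    print(f"{row['id']:<5} {row['name']:<10} {row['age']:<3}")
```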

This will print the following output:

```
id    name       age
--------------------
1     Alice      30
2     Bob        24
```

The key to learning from the previous example is understanding how SQL databases structure and manage data through tables composed of rows and columns, and how to efficiently retrieve and process these rows programmatically. This knowledge is crucial because it lays the foundation for effective database management and data manipulation in any application.

In real-world applications, this interaction goes through database drivers. Standards such as Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) provide standardized methods for connecting to and querying databases; in Python, most drivers implement the DB-API 2.0 specification (PEP 249). These drivers are typically wrapped by higher-level frameworks or libraries, making it easier for developers to ingest data from various SQL databases without worrying about the underlying connectivity details. Several libraries can be used to interact with SQL databases using Python; we will explore the most popular ones in the following section.

Which library should you use for your use case?

Let’s explore the different libraries available for interacting with SQL databases in Python, and understand when to use each one:

SQLite (sqlite3) is ideal for small to medium-sized applications, local storage, and prototyping. Its zero-configuration, serverless architecture makes it perfect for lightweight, embedded database needs and quick development cycles. It is especially useful in scenarios where the overhead of a full-fledged database server is unnecessary. Avoid using sqlite3 for applications requiring high concurrency or extensive write operations, or where multiple users need to access the database simultaneously. It is not suitable for large-scale applications or those needing robust security features and advanced database functionalities.
SQLAlchemy is suitable for applications requiring a high level of abstraction over raw SQL, support for multiple database engines, and complex queries and data models. It is ideal for large-scale production environments that need flexibility, scalability, and the ability to switch between different databases with minimal code changes. Avoid using SQLAlchemy for small, lightweight applications where the overhead of its comprehensive ORM capabilities is unnecessary. If you need direct, low-level access to a specific database’s features and are comfortable writing raw SQL queries, a simpler database adapter such as sqlite3, Psycopg2, or MySQL Connector/Python might be more appropriate.
Psycopg2 is the go-to choice for interacting with PostgreSQL databases, making it suitable for applications that leverage PostgreSQL’s advanced features, such as ACID compliance, complex queries, and extensive data types. It is ideal for production environments requiring reliability and efficiency in handling PostgreSQL databases. Avoid using Psycopg2 if your application does not interact with PostgreSQL. If you need compatibility with multiple database systems or a higher-level abstraction, consider using SQLAlchemy instead. Also, it might not be the best choice for lightweight applications where the overhead of a full PostgreSQL setup is unnecessary.
MySQL Connector/Python (mysql-connector-python) is great for applications that need to interact directly with MySQL databases. It is suitable for environments where compatibility and official support from Oracle are critical, as well as for applications leveraging MySQL’s features such as transaction management and connection pooling. Do not use MySQL Connector/Python if your application requires compatibility with multiple database systems or a higher-level abstraction. For simpler applications where the overhead of a full MySQL setup is unnecessary, or where MySQL’s features are not specifically needed, consider other lightweight alternatives.
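To connect the simulated table to a real driver, the following minimal sketch uses the standard library's sqlite3 module with an in-memory database. The users table mirrors the simulated one, the extra Charlie row is made up for the example, and the ingest_users function name and batch_size parameter are illustrative assumptions:

```python
import sqlite3


def ingest_users(batch_size=2):
    """Load rows from an in-memory SQLite table in fixed-size batches."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows can be indexed by column name
    try:
        conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
        )
        conn.executemany(
            "INSERT INTO users (id, name, age) VALUES (?, ?, ?)",
            [(1, "Alice", 30), (2, "Bob", 24), (3, "Charlie", 28)],
        )
        cursor = conn.execute("SELECT id, name, age FROM users ORDER BY id")
        rows = []
        while True:
            batch = cursor.fetchmany(batch_size)  # at most batch_size rows
            if not batch:
                break
            rows.extend(dict(row) for row in batch)
        return rows
    finally:
        conn.close()


for row in ingest_users():
    print(row)
```

Fetching with fetchmany() keeps memory use bounded when a table is too large to load at once, which suits the batch ingestion pattern described earlier in the chapter. The ? placeholders let the driver bind values safely instead of formatting them into the SQL string.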

After understanding the various libraries and their use cases for interacting with SQL databases, it’s equally important to explore alternatives for scenarios where the traditional relational model of SQL databases may not be the best fit. This brings us to NoSQL databases, which offer flexibility, scalability, and performance for handling unstructured or semi-structured data. Let’s delve into the key Python libraries for working with popular NoSQL databases and examine when and how to use them effectively.

Performing data ingestion from a NoSQL database

Non-relational databases can be used for storing and processing large volumes of semi-structured or unstructured data in batch operations. They are particularly effective when the schema can evolve or when handling diverse data types in a consolidated manner. NoSQL databases excel in streaming and semi-real-time workloads due to their ability to handle high throughput and low-latency data ingestion. They are commonly used for capturing and processing real-time data from IoT devices, logs, social media feeds, and other sources that generate continuous streams of data.

The provided Python code mocks a NoSQL database with a dictionary and processes each key-value pair. Let’s have a look at each part of the code:

The process_entry function takes a key and its associated value from the data store and prints a formatted message showing the processing of that key-value pair. It provides a simple way to view or handle individual entries, highlighting how data is accessed and manipulated based on its key:
```
def process_entry(key, value):
    print(f"Processing key: {key} with value: {value}")
```
The following function prints the entire data_store dictionary in a tabular format:
```
def print_data_store(data_store):
    print(f"{'Key':<5} {'Name':<10} {'Age':<3}")
    print("-" * 20)
    for key, value in data_store.items():
        print(f"{key:<5} {value['name']:<10} {value['age']:<3}")
```
It starts by printing column headers for Key, Name, and Age, followed by a separator line for clarity. It then iterates over all key-value pairs in the data_store dictionary, printing each entry’s key, name, and age. This function helps visualize the current state of the data store. The initial state of the data is as follows:
```
Initial Data Store:
Key   Name       Age
--------------------
1     Alice      30
2     Bob        24
```
This function adds a new entry to the data_store dictionary:
```
def create_entry(data_store, key, value):
    data_store[key] = value
    return data_store
```
It takes a key and a value, then inserts the value into data_store under the specified key. The updated data_store dictionary is then returned. This demonstrates the ability to add new data to the store, showcasing the creation aspect of Create, Read, Update, and Delete (CRUD) operations.
The update_entry function updates an existing entry in the data_store dictionary:
```
def update_entry(data_store, key, new_value):
    if key in data_store:
        data_store[key] = new_value
    return data_store
```
It takes a key and new_value, and if the key exists in the data_store dictionary, it updates the corresponding value with new_value. The updated data_store dictionary is then returned. This illustrates how existing data can be modified, demonstrating the update aspect of CRUD operations.
The following function removes an entry from the data_store dictionary:
```
def delete_entry(data_store, key):
    if key in data_store:
        del data_store[key]
    return data_store
```
It takes a key, and if the key is found in the data_store dictionary, it deletes the corresponding entry. The updated data_store dictionary is then returned.
The following function ties the whole process together:
```
def read_nosql():
    data_store = {
        "1": {"name": "Alice", "age": 30},
        "2": {"name": "Bob", "age": 24},
    }
    print("Initial Data Store:")
    print_data_store(data_store)
    # Create: Adding a new entry
    new_key = "3"
    new_value = {"name": "Charlie", "age": 28}
    data_store = create_entry(data_store, new_key, new_value)
    # Read: Retrieving and processing an entry
    print("\nAfter Adding a New Entry:")
    process_entry(new_key, data_store[new_key])
    # Update: Modifying an existing entry
    update_key = "1"
    updated_value = {"name": "Alice", "age": 31}
    data_store = update_entry(data_store, update_key, updated_value)
    # Delete: Removing an entry
    delete_key = "2"
    data_store = delete_entry(data_store, delete_key)
    # Print the final state of the data store
    print("\nFinal Data Store:")
    print_data_store(data_store)
```
This code illustrates the core principles of NoSQL databases, including schema flexibility, key-value pair storage, and basic CRUD operations. It begins with the read_nosql() function, which simulates a NoSQL database using a dictionary, data_store, where each key is a unique identifier and each value holds the associated user information. Because each value is simply a dictionary, different entries could carry different fields, which is the schema flexibility of NoSQL systems; note that print_data_store() itself assumes every entry has a name and an age. Initially, the print_data_store() function displays the data in a tabular format. The code then demonstrates CRUD operations. It starts by adding a new entry with the create_entry() function, showing how new data is inserted into the store. Following this, the process_entry() function retrieves and prints the details of the newly added entry, illustrating the read operation. Next, the update_entry() function modifies an existing entry, and the delete_entry() function removes one. Finally, the data_store dictionary is printed again, providing a clear view of how the data has evolved through these operations.

Let’s execute the whole process:

```
read_nosql()
```

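The last part of the output shows the final data store, in which Alice’s age has been updated, Bob has been deleted, and Charlie has been added under key 3:

```
Final Data Store:
Key   Name       Age
--------------------
1     Alice      31
3     Charlie    28
```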

In the preceding example, we demonstrated an interaction with a mocked NoSQL system using Python so that we can showcase the core principles of NoSQL databases such as schema flexibility, key-value pair storage, and basic CRUD operations. We can now better grasp how NoSQL databases differ from traditional SQL databases in terms of data modeling and handling unstructured or semi-structured data efficiently.
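Schema flexibility is easier to see when entries do not all share the same fields. The following minimal sketch extends the mocked store with documents of different shapes; the describe helper and the email and tags fields are hypothetical additions for illustration:

```python
def describe(doc_id, doc):
    """Format one document, tolerating fields that are absent."""
    name = doc.get("name", "<unknown>")
    age = doc.get("age", "n/a")
    extras = sorted(set(doc) - {"name", "age"})
    return f"{doc_id}: name={name}, age={age}, extra_fields={extras}"


data_store = {
    "1": {"name": "Alice", "age": 31},
    "3": {"name": "Charlie", "age": 28, "email": "charlie@example.com"},
    "4": {"name": "Dana", "tags": ["beta", "newsletter"]},  # no age field
}

for doc_id, doc in data_store.items():
    print(describe(doc_id, doc))
```

Reading fields with get() and a default, rather than indexing directly, is one simple way for ingestion code to cope with documents whose structure varies over time.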

There are several libraries that can be used to interact with NoSQL databases. In the next section, we will explore the most popular ones.

Which library should you use for your use case?

Let’s explore the different libraries available for interacting with NoSQL databases in Python, and understand when to use each one:

pymongo is the official Python driver for MongoDB, a popular NoSQL database known for its flexibility and scalability. pymongo allows Python applications to interact seamlessly with MongoDB, offering a straightforward API to perform CRUD operations, manage indexes, and execute complex queries. pymongo is particularly favored for its ease of use and compatibility with Python’s data structures, making it suitable for a wide range of applications from simple prototypes to large-scale production systems.
cassandra-driver provides Python applications with direct access to Apache Cassandra, a highly scalable NoSQL database designed for handling large amounts of data across distributed commodity servers. Cassandra’s architecture is optimized for write-heavy workloads and offers tunable consistency levels, making it suitable for real-time analytics, IoT data, and other applications requiring high availability and fault tolerance.

Transitioning from databases to file systems involves shifting the focus from structured data storage and retrieval mechanisms to more flexible and versatile storage solutions.

Performing data ingestion from cloud-based file systems

Cloud storage is a service model that allows data to be remotely maintained, managed, and backed up over the internet. It involves storing data on remote servers accessed from anywhere via the internet, rather than on local devices. Cloud storage has revolutionized the way we store and access data. It provides a flexible and scalable solution for individuals and organizations, enabling them to store large amounts of data without investing in physical hardware. This is particularly useful for ensuring that data is always accessible and can be shared easily.

Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage are all cloud-based object storage services that allow you to store and retrieve files in the cloud. Cloud-based file systems are becoming increasingly popular for several reasons.

Firstly, they provide a flexible and scalable storage solution that can easily adapt to the changing needs of an organization. This means that as the amount of data grows, additional storage capacity can be added without the need for significant capital investment or physical infrastructure changes. Thus, it can help reduce capital expenditures and operational costs associated with maintaining and upgrading on-premises storage infrastructure.

Secondly, cloud-based file systems offer high levels of accessibility and availability. With data stored in the cloud, users can access it from anywhere with an internet connection, making it easier to collaborate and share information across different teams, departments, or locations. Additionally, cloud-based file systems are designed with redundancy and failover mechanisms, so that data remains available even in the event of a hardware failure or outage.

Finally, they provide enhanced security features to protect data from unauthorized access, breaches, or data loss. Cloud service providers typically have advanced security protocols, encryption, and monitoring tools to safeguard data and ensure compliance with data privacy regulations.

Files in cloud-based storage systems are essentially the same as those on local devices, but they are stored on remote servers and accessed over the internet. However, how are these files organized in these cloud storage systems? Let’s discuss that next.

Organizing files in cloud storage systems

One of the primary methods of organizing files in cloud storage is by using folder structures, similar to local file systems. Users can create folders and subfolders to categorize and store files systematically. Let’s have a look at some best practices:

Creating a logical and intuitive hierarchy that reflects how you work or how your projects are structured is essential. This involves designing a folder structure that mimics your workflow, making it easier to locate and manage files. For instance, you might create main folders for different departments, projects, or clients, with subfolders for specific tasks or document types. This hierarchical organization not only saves time by reducing the effort needed to find files but also enhances collaboration by providing a clear and consistent framework that team members can easily navigate.
Using consistent naming conventions for folders and files is crucial for ensuring easy retrieval and maintaining order within your cloud storage. A standardized naming scheme helps avoid confusion, reduces errors, and speeds up the process of locating specific documents. For example, adopting a format such as YYYY-MM-DD_ProjectName_DocumentType can provide immediate context and make sorting and searching more straightforward. Consistent naming also facilitates automation and integration with other tools, as predictable file names can be more easily processed by scripts and applications.
Grouping files by project or client is an effective way to keep related documents together and streamline project management. This method involves creating dedicated folders for each project or client, where all relevant files, such as contracts, communications, and deliverables, are stored.
Many cloud storage systems allow tagging files with keywords or metadata, which significantly enhances file categorization and searchability. Tags are essentially labels that you can attach to files, making it easier to group and find documents based on specific criteria. Metadata includes detailed information, such as the author, date, project name, and file type, which provides additional context and aids in more precise searches. By using relevant tags and comprehensive metadata, you can quickly filter and locate files, regardless of their location within the folder hierarchy. This practice is particularly useful in large storage systems where traditional folder structures might become cumbersome.
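The naming and grouping practices above can be sketched as a small helper that builds a storage path. In object stores such as Amazon S3, folders are typically represented as prefixes within the object key, so a consistent key layout gives the same benefits as a folder hierarchy. The build_object_key function, its parameters, and the sample client and project names are hypothetical:

```python
import posixpath
from datetime import date


def build_object_key(client, project, doc_type, created, extension):
    """Return a key of the form client/project/YYYY-MM-DD_project_doctype.ext."""

    def slug(text):
        # Lowercase and join words with hyphens so keys sort and match predictably
        return "-".join(text.lower().split())

    file_name = f"{created.isoformat()}_{slug(project)}_{slug(doc_type)}.{extension}"
    return posixpath.join(slug(client), slug(project), file_name)


key = build_object_key("Acme", "Website Redesign", "Contract", date(2024, 10, 25), "pdf")
print(key)
```

Putting the ISO date first in the file name means that a plain alphabetical listing of a project prefix is also a chronological one.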

From discussing cloud storage systems, the focus now shifts to exploring the capabilities and integration opportunities offered by APIs.

APIs

APIs have become increasingly popular in recent years due to their ability to enable seamless communication and integration between different systems and services. They provide developers with a standardized and flexible way to access data and functionality from other systems, allowing them to easily build new applications and services that leverage existing resources. As a result, APIs are now a fundamental building block of modern software development and are used across many industries and applications.

Now that we understand what APIs represent, let’s move on to the requests Python library with which developers can programmatically access and manipulate data from remote servers.

The requests library

When it comes to working with APIs in Python, the requests library is the go-to Python library for making HTTP requests to APIs and other web services. It makes it easy to send HTTP/1.1 requests using Python, and it provides many convenient features for working with HTTP responses.

Run the following command to install the requests library:

```
pip install requests==2.32.3
```

Let’s have a quick look at how we can use this library:

Import the requests library:
```
import requests
```

Specify the API endpoint URL:

```
url = "https://jsonplaceholder.typicode.com/posts"
```

Make a GET request to the API endpoint:
```
response = requests.get(url)
```
Get the response content:
```
print(response.content)
```

Here, we’re making a GET request to the API endpoint at https://jsonplaceholder.typicode.com/posts and storing the response object in the response variable. We can then print the raw response body, as bytes, using the content attribute of the response object; response.text returns the decoded string, and response.json() parses a JSON body into Python objects. The requests library provides many other methods and features for making HTTP requests, including support for POST, PUT, DELETE, and other HTTP methods, handling headers and cookies, and handling redirects and authentication.
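For ingestion code, it usually helps to set a timeout and to fail loudly on HTTP errors. The following minimal sketch wraps those steps in a helper; fetch_json and its get parameter are hypothetical names, and the optional get argument exists only so that the helper can be exercised without network access:

```python
def fetch_json(url, get=None, timeout=10):
    """GET a URL and return the parsed JSON body, raising on HTTP errors."""
    if get is None:
        import requests  # imported lazily so the helper can be tested offline

        get = requests.get
    response = get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()  # raises an exception for 4xx/5xx responses
    return response.json()


# Usage (needs network access and the requests library):
#   posts = fetch_json("https://jsonplaceholder.typicode.com/posts")
#   print(posts[0]["title"])
```

Without a timeout, requests.get() can wait indefinitely on an unresponsive server, which is rarely acceptable in a scheduled ingestion job.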

Now that we’ve explained the requests library, let’s move on to a specific example of retrieving margarita cocktail data from the Cocktail DB API, which can illustrate how practical web requests can be in accessing and integrating real-time data sources into applications.

Learn how to make a margarita!

This use case demonstrates retrieving cocktail data from the Cocktail DB API using Python. If you want to improve your bartending skills and impress your friends, you can use an open API to look up the ingredients required for any cocktail. For this, we will use the Cocktail DB API and the requests library to see which ingredients we need for a margarita:

Define the API endpoint URL. We are making a request to the Cocktail DB API endpoint to search for cocktails with the margarita name:
```
url = "https://www.thecocktaildb.com/api/json/v1/1/search.php?s=margarita"
```
Make the API request. We define the API endpoint URL as a string and pass it to the requests.get() function to make the GET request:
```
response = requests.get(url)
```

Check whether the request was successful (status code 200) and get the data. The API returns the response body as JSON, which the response.json() method parses into a Python dictionary that we assign to a variable called data. If the request was not successful, the else branch prints an error message:

```
import pandas as pd  # place with the other imports at the top of the script

if response.status_code == 200:
    # Parse the JSON response body into a dictionary
    data = response.json()
    # 'drinks' may be missing or null when nothing matches the search
    if data.get('drinks'):
        # Create DataFrame from drinks data
        df = pd.DataFrame(data['drinks'])
        # Print the resulting DataFrame
        print(df.head())
    else:
        print("No drinks found.")
else:
    print(f"Failed to retrieve data from API. Status code: {response.status_code}")
```

You can replace the margarita search parameter with any other cocktail name or ingredient to get data for different drinks.
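To get from the raw record to an actual ingredient list, the fields need a little reshaping. The following sketch assumes that each drink record stores its ingredients in numbered fields named strIngredient1 to strIngredient15, with matching strMeasure fields; treat those field names as an assumption about the response shape and check them against a real response. The extract_ingredients helper and the sample_drink record are hypothetical:

```python
def extract_ingredients(drink):
    """Pair each non-empty strIngredientN field with its strMeasureN field."""
    ingredients = []
    for i in range(1, 16):
        name = drink.get(f"strIngredient{i}")
        if not name:  # skip unused slots (None or empty string)
            continue
        measure = (drink.get(f"strMeasure{i}") or "").strip()
        ingredients.append((name.strip(), measure))
    return ingredients


# Hypothetical record shaped like one entry of data['drinks']
sample_drink = {
    "strDrink": "Margarita",
    "strIngredient1": "Tequila", "strMeasure1": "1 1/2 oz ",
    "strIngredient2": "Triple sec", "strMeasure2": "1/2 oz ",
    "strIngredient3": "Lime juice", "strMeasure3": "1 oz ",
    "strIngredient4": "Salt", "strMeasure4": None,
    "strIngredient5": None, "strMeasure5": None,
}

for name, measure in extract_ingredients(sample_drink):
    print(f"{measure or 'to taste'}: {name}")
```

Stripping whitespace and dropping empty slots at ingestion time is a small example of the cleaning that later chapters build on.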

With this, we come to the end of our first chapter. Let’s summarize what we have learned so far.
