Discovering Optimus internals

Optimus is designed to be easy to use for both non-technical users and developers. Once you know how some of its internals work, you'll understand how certain transformations behave and, hopefully, how to avoid unexpected behavior. You'll also be able to extend Optimus or write more advanced, engine-specific transformations if the situation requires it.

Engines

Optimus handles all the details required to initialize any engine. Because they are non-distributed engines, pandas, Vaex, and Ibis need few configuration parameters, whereas Dask and Spark accept many; some of these are mapped to Optimus arguments, while the rest are passed through via the *args or **kwargs arguments.
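As a minimal sketch, initializing a distributed engine might look like this (the n_workers and threads_per_worker keyword arguments are assumptions here, shown as values forwarded to Dask rather than a confirmed Optimus signature):

from optimus import Optimus

# Assumed pass-through kwargs: anything Optimus does not map itself
# is forwarded to the underlying Dask setup
op = Optimus("dask", n_workers=2, threads_per_worker=4)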

Optimus always keeps a reference to the engine you initialize. For example, if you want to get the Dask client from the Optimus instance, you can use the following command:

op.client

This will show you the following information:

Figure 1.11 – Dask client object inside Optimus

One interesting thing about Optimus is that you can use multiple engines at the same time. This might seem strange at first, but it opens up amazing opportunities if you get creative. For example, you can combine Spark to load data from a database with pandas to profile a data sample in real time, or use pandas to load data and Ibis to output the resulting operations as a set of SQL instructions.
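As a minimal illustration of this, nothing stops you from holding two Optimus instances at once (engine names as used throughout this chapter):

from optimus import Optimus

op_spark = Optimus("spark")    # heavy lifting, such as loading from a database
op_pandas = Optimus("pandas")  # fast, local profiling of a small sample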

At the implementation level, all the engines inherit from BaseEngine, which wraps all the engine functionality into three main operations (see the sketch after this list):

  • Initialization: Here, Optimus handles all the initialization processes for the engine you select.
  • Dataframe creation: op.create.dataframe maps to the DataFrame's creation, depending on the engine that was selected.
  • Data loading: op.load handles loading from files and databases.
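Putting the three operations together, a minimal sketch looks as follows (using the same pandas engine and foo.txt file as the examples in the rest of this chapter):

from optimus import Optimus

op = Optimus("pandas")                         # initialization
df1 = op.create.dataframe({"A": ["A", 2, 3]})  # DataFrame creation
df2 = op.load.csv("foo.txt", sep=",")          # data loading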

The DataFrame behind the DataFrame

The Optimus DataFrame is a wrapper that exposes and implements a set of functions to process string and numerical data. Internally, when Optimus creates a DataFrame, it creates it using the engine you selected and keeps a reference to it in the .data property. The following is an example of this:

from optimus import Optimus

op = Optimus("pandas")
df = op.load.csv("foo.txt", sep=",")
type(df.data)

This produces the following result:

pandas.core.frame.DataFrame

A key point is that Optimus always keeps the data representation as DataFrames and never as Series. This matters because in pandas, for example, some operations return a Series as a result.

In pandas, use the following code:

import pandas as pd
type(pd.DataFrame({"A": ["A", 2, 3]})["A"].str.lower())

This returns the following:

pandas.core.series.Series

In Optimus, we use the following code:

from optimus import Optimus
op = Optimus("pandas")
type(op.create.dataframe({"A": ["A", 2, 3]}).cols.lower().data)

This returns the following:

pandas.core.frame.DataFrame

As you can see, the same operation returns a Series in pandas, while Optimus keeps the result wrapped as a DataFrame.

Meta

Meta is used to keep data that does not belong in the core dataset but can be useful for some operations, such as saving the result of a top-N operation on a specific column. To achieve this, Optimus saves metadata in its DataFrames, which can be accessed using df.meta. This metadata is used for three main purposes. Let's look at each of them.

Saving file information

If you load a DataFrame from a file, Optimus saves the file path and filename, which can be useful for keeping track of the data being handled:

from optimus import Optimus 
op = Optimus("pandas") 
df = op.load.csv("foo.txt", sep=",")
df.meta

You will get the following output:

{'file_name': 'foo.txt', 'name': 'foo.txt'}

Data profiling

Data cleaning is an iterative process; you may want to calculate the histogram or top-N values in a dataset to spot data that you want to remove or modify. When you profile data using df.profile(), Optimus calculates a histogram or a frequency chart, depending on the data type. By tracking Actions, Optimus can identify when the histogram or top-N values need to be recalculated.
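As a minimal sketch, profiling a loaded DataFrame looks like this (the exact structure of the returned profile depends on the engine and the data types):

from optimus import Optimus

op = Optimus("pandas")
df = op.load.csv("foo.txt", sep=",")
profile = df.profile()  # histograms for numeric columns, frequencies for strings

Next, you will see how Actions work.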

Actions

As we saw previously, Optimus tries to cache certain operations to ensure that you do not waste precious compute time rerunning tasks over data that has not changed.

To optimize cache usage and reconstruction, Optimus tracks multiple internal Actions and reacts to each one accordingly.

You can check how Actions are saved by trying out the following code:

from optimus import Optimus 
op = Optimus("pandas") 
df = op.load.csv("foo.txt", sep=",")
df = df.cols.upper("*")

To check the actions you applied to the DataFrame, use the following command:

df.meta["transformations"]

You will get a Python dictionary with the action name and the column that's been affected by the action:

{'actions': [[{'upper': ['name']}], [{'upper': ['function']}]]}

A key point is that different actions have different effects on how the data is profiled and how the DataFrame's metadata is handled. Every Optimus operation has a unique Action name associated with it. Let's look at the five Actions that are available in Optimus and what effect they have on the DataFrame (a short sketch follows this list):

  • Columns: These actions are triggered when operations are applied to entire Optimus columns; for example, df.cols.lower() or df.cols.sqrt().
  • Rows: These actions are triggered when operations are applied to any row in an Optimus column; for example, df.rows.set() or df.rows.drop_duplicate().
  • Copy: Triggered only for a copy column operation, such as df.cols.copy(). Internally, it just creates a new key on the dict meta with the source metadata column. If you copy an Optimus column, a profiling operation is not triggered over it.
  • Rename: Triggered only for a rename column operation, such as df.cols.rename(). Internally, it just renames a key in the meta dictionary. If you rename an Optimus column, a profiling operation is not triggered over it.
  • Drop: Triggered only for a drop column operation, such as df.cols.drop(). Internally, it removes a key from the meta dictionary. If you drop an Optimus column, a profiling operation is not triggered over it.
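As a rough illustration of the difference between these actions (the exact df.cols.rename() signature shown here is an assumption), you can apply a column operation and a rename, then inspect the recorded actions:

from optimus import Optimus

op = Optimus("pandas")
df = op.load.csv("foo.txt", sep=",")

df = df.cols.upper("*")            # a Columns action: profiling is affected
df = df.cols.rename("name", "n")   # a Rename action: metadata-only change

# Inspect how each action was recorded
print(df.meta["transformations"])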

Dummy functions

There are some functions that do not apply to all the DataFrame technologies. Functions such as .repartition(), .cache(), and .compute() are used in distributed DataFrames such as Spark and Dask to trigger operations in the workers, but these concepts do not exist in pandas or cuDF. To preserve the API's cohesion across all the engines, these functions can simply pass or return the same DataFrame object.
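A minimal sketch of the idea (the class and method bodies here are illustrative, not the actual Optimus implementation):

# Illustrative only: how a non-distributed engine can stub out
# distributed-only methods while keeping the same API surface
class PandasDataFrame:
    def __init__(self, data):
        self.data = data

    def repartition(self, n_partitions=None):
        # pandas has no partitions; accept the call and do nothing
        return self

    def cache(self):
        # pandas data already lives in memory; nothing to cache
        return self

    def compute(self):
        # pandas is eager; the data is already computed
        return self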

Diagnostics

When you use Dask and Spark as your Optimus engine, you have access to their respective diagnostics dashboards. For very complex workflows, it can be handy to understand what operations have been executed and what could be slowing down the whole process.

Let's look at how this works in the case of Dask. To gain access to the diagnostic panel, you can use the following command:

op.client

This will provide you with information about the Dask client:

Figure 1.12 – Dask client information

In this case, you can point to http://192.168.86.249:39011/status in your browser to see the Dask Diagnostics dashboard:

Figure 1.13 – Dask Diagnostics dashboard
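Since op.client is the underlying dask.distributed.Client, you can also read the dashboard address programmatically; dashboard_link is a standard attribute of the Dask client:

print(op.client.dashboard_link)  # for example, "http://192.168.86.249:39011/status"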

An in-depth discussion about diagnostics is beyond the scope of this book. To find out more about this topic, go to https://docs.dask.org/en/latest/diagnostics-distributed.html.
