
Extending Excel with Python and R: Unlock the potential of analytics languages for advanced data manipulation and visualization

Steven Sanderson, Kun
Paperback Apr 2024 344 pages 1st Edition

Extending Excel with Python and R

Reading Excel Spreadsheets

In the deep and wide landscape of data analysis, Excel stands tall and by your side as a trusted warrior, simplifying the process of organizing, calculating, and presenting information. Its intuitive interface and widespread usage have cemented its position as a staple in the business world. However, as the volume and complexity of data continue to grow exponentially, Excel’s capabilities may start to feel constrained. It is precisely at this point that the worlds of Excel, R, and Python converge. Extending Excel with R and Python invites you to embark on a truly transformative journey. This trip will show you the power of these programming languages as they synergize with Excel, expanding its horizons and empowering you to conquer data challenges with ease. In this book, we will delve into how to integrate Excel with R and Python, uncovering the hidden potential that lies beneath the surface and enabling you to extract valuable insights, automate processes, and unleash the true power of data analysis.

Microsoft Excel came to market in 1985 and has remained a popular spreadsheet software choice. It succeeded Multiplan, Microsoft’s earlier spreadsheet program. Microsoft Excel and databases in general share some similarities in terms of organizing and managing data, although they serve different purposes. Excel is a spreadsheet program that allows users to store and manipulate data in a tabular format. It consists of rows and columns, where each cell can contain text, numbers, or formulas. Similarly, a database is a structured collection of data stored in tables, consisting of rows and columns.

Both Excel and databases provide a way to store and retrieve data. In Excel, you can enter data, perform calculations, and create charts and graphs. Similarly, databases store and manage large amounts of structured data and enable querying, sorting, and filtering. Excel and databases also support the concept of relationships. In Excel, you can link cells or ranges across different sheets, creating connections between data. Databases use relationships to link tables based on common fields, allowing you to retrieve related data from multiple tables.

This chapter aims to familiarize you with reading Excel files into the R environment and performing some manipulation on them. Specifically, in this chapter, we’re going to cover the following main topics:

  • R packages for Excel manipulation
  • Reading Excel files to manipulate with R
  • Reading multiple Excel sheets with a custom R function
  • Python packages for Excel manipulation
  • Opening an Excel sheet from Python and reading the data

Technical requirements

At the time of writing, we are using the following:

  • R 4.2.1
  • The RStudio 2023.03.1+446 “Cherry Blossom” release for Windows

For this chapter, you will need to install the following packages:

  • readxl
  • openxlsx
  • xlsx

To run the Python code in this chapter, we will be using the following:

  • Python 3.11
  • pandas
  • openpyxl
  • The iris_data.xlsx Excel file available in this book’s GitHub repository

While setting up a Python environment is outside the scope of this book, this is easy to do. The necessary packages can be installed by running the following commands:

python -m pip install pandas==2.0.1
python -m pip install openpyxl==3.1.2

Note that these commands have to be run from a terminal and not from within a Python script.

This book’s GitHub repository also contains a requirements.txt file that you can use to install all dependencies. You can do this by running the following command from the folder where requirements.txt resides, or by replacing requirements.txt with the full path to the file:

python -m pip install -r requirements.txt

This command installs all the packages that will be used in this chapter so that you don’t have to install them one by one. It also helps ensure that the versions of the installed packages and their dependencies match what this book’s authors have used.

Alternatively, when using Jupyter Notebooks, you can use the following magic commands:

%pip install pandas==2.0.1
%pip install openpyxl==3.1.2
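
After installing, it can be useful to confirm which versions are actually present in the active environment. The following is a minimal sketch that uses only the Python standard library; the installed_versions helper is a hypothetical name introduced here for illustration and is not part of this book’s repository:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a {package: version-or-None} mapping for the given names."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found


for name, ver in installed_versions(["pandas", "openpyxl"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a version differs from the one listed in this section, the examples will most likely still run, but the output may not match the figures exactly.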

All of the code examples in this book are available in the GitHub repository at https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R. Each chapter has its own folder, with the current one as Chapter01.

Note

Technical requirements for Python throughout the book are compiled in the requirements.txt file, which is available in the GitHub repository at https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R/blob/main/requirements.txt. Installing these dependencies will streamline your coding experience and ensure smooth progression through the book. Be sure to install them all before diving into the exercises.

Working with R packages for Excel manipulation

There are several packages available both on CRAN and on GitHub that allow for reading and manipulation of Excel files. In this section, we are specifically going to focus on the packages: readxl, openxlsx, and xlsx to read Excel files. These three packages all have their own functions to read Excel files. These functions are as follows:

  • readxl::read_excel()
  • openxlsx::read.xlsx()
  • xlsx::read.xlsx()

Each function has a set of parameters and conventions to follow. Since readxl is part of the tidyverse collection of packages, it follows its conventions and returns a tibble object upon reading the file. If you do not know what a tibble is, it is a modern version of R’s data.frame, a sort of spreadsheet in the R environment. It is the building block of most analyses. Moving on to openxlsx and xlsx, they both return a base R data.frame object, with the latter also able to return a list object. If you are wondering how this relates to manipulating an actual Excel file, I can explain. To manipulate something in R, the data must be in the R environment, so you cannot manipulate the file unless the data is read in. These packages have different functions for manipulating Excel files or reading data in certain ways that allow for further analysis and/or manipulation. It is important to note that xlsx requires Java to be installed.

As we transition from our exploration of R packages for Excel manipulation, we’ll turn our attention to the crucial task of effectively reading Excel files into R, thereby unlocking even more possibilities for data analysis and manipulation.

Reading Excel files to R

In this section, we are going to read data from Excel with a few different R libraries. We need to do this before we can even consider performing any type of manipulation or analysis on the data contained in the sheets of the Excel files.

As mentioned in the Technical requirements section, we are going to be using the readxl, openxlsx, and xlsx packages to read data into R.

Installing and loading libraries

In this section, we are going to install the necessary libraries, if you do not have them yet, and load them. We are going to use the openxlsx, xlsx, and readxl libraries. To install and load them, run the following code block:

pkgs <- c("openxlsx", "xlsx", "readxl")
install.packages(pkgs, dependencies = TRUE)
lapply(pkgs, library, character.only = TRUE)

The lapply() function in R is a versatile tool for applying a function to each element of a list, vector, or data frame. Its two main arguments are X and FUN, where X is the list or vector and FUN is the function that is applied to each element of X. Any further arguments, such as character.only = TRUE here, are passed on to FUN.

Now that the libraries have been installed, we can get to work. To do this, we are going to read a spreadsheet built from the Iris dataset that is built into base R. We are going to read the file with three different libraries, and then we are going to create a custom function to work with the readxl library that will read all the sheets of an Excel file. We will call this the read_excel_sheets() function.

Let’s start reading the files. The first library we will use to open an Excel file is openxlsx. To create the Excel file we are working with, you can run the ch1_create_iris_dataset.R script in the chapter1 folder of this book’s GitHub repository. Refer to the following screenshot to see how to read the file into R.

You will notice a variable called f_pat. This is the path to where the Iris dataset was saved as an Excel file – for example, C:/User/UserName/Documents/iris_data.xlsx:

Figure 1.1 – Using the openxlsx package to read the Excel file


The preceding screenshot shows how to read an Excel file. This example assumes that you have used the ch1_create_iris_dataset.R file to create the example Excel file. In reality, you can read in any Excel file that you would like or need.

Now, we will perform the same type of operation, but this time with the xlsx library. Refer to the following screenshot, which uses the same methodology as with the openxlsx package:

Figure 1.2 – Using the xlsx library and the read.xlsx() function to open the Excel file we’ve created


Finally, we will use the readxl library, which is part of the tidyverse:

Figure 1.3 – Using the readxl library and the read_excel() function to read the Excel file into memory


In this section, we learned how to read in an Excel file with a few different packages. While these packages can do more than simply read in an Excel file, that is what we needed to focus on in this section. You should now be familiar with how to use the readxl::read_excel(), xlsx::read.xlsx(), and openxlsx::read.xlsx() functions.

Building upon our expertise in reading Excel files into R, we’ll now embark on the next phase of our journey: unraveling the secrets of efficiently extracting data from multiple sheets within an Excel file.

Reading multiple sheets with readxl and a custom function

In Excel, we often encounter workbooks that have multiple sheets in them. These could be stats for different months of the year, table data that follows a specific format month over month, or some other period. The point is that we may want to read all the sheets in a file for one reason or another, and we should not call the read function from a particular package for each sheet. Instead, we should use the power of R to loop through this with purrr.

Let’s build a customized function. To do this, we are going to load the readxl library. If you have it already loaded, then this is not necessary. If it is installed and you do not wish to load the library into memory, then you can call the excel_sheets() function by using readxl::excel_sheets():

Figure 1.4 – Creating a function to read all the sheets in an Excel file at once – read_excel_sheets()

The new code can be broken down as follows:

 read_excel_sheets <- function(filename, single_tbl) {

This line defines a function called read_excel_sheets that takes two arguments: filename (the name of the Excel file to be read) and single_tbl (a logical value indicating whether the function should return a single table or a list of tables).

Next, we have the following line:

sheets <- readxl::excel_sheets(filename)

This line uses the readxl package to extract the names of all the sheets in the Excel file specified by filename. The sheet names are stored in the sheets variable.

Here’s the next line:

if (single_tbl) {

This line starts an if statement that checks the value of the single_tbl argument.

Now, we have the following:

x <- purrr::map_df(sheets, readxl::read_excel, path = filename)

If single_tbl is TRUE, this line uses the purrr package’s map_df function to iterate over each sheet name in sheets and read the corresponding sheet using the read_excel function from the readxl package. The resulting data frames are combined into a single table, which is assigned to the x variable.

Now, we have the following line:

} else {

This line indicates the start of the else block of the if statement. If single_tbl is FALSE, the code in this block will be executed.

Here’s the next line:

 x <- purrr::map(sheets, ~ readxl::read_excel(filename, sheet = .x))

In this line, the purrr package’s map function is used to iterate over each sheet name in sheets. For each sheet, the read_excel function from the readxl package is called to read the corresponding sheet from the Excel file specified by filename. The resulting data frames are stored in a list assigned to the x variable.

Now, we have the following:

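 x <- purrr::set_names(x, sheets)

This line uses the set_names function from the purrr package to set the names of the elements in the x list to the sheet names in sheets. The result is assigned back to x because set_names returns a renamed copy rather than modifying x in place.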

Finally, we have the following line:

 x

This line returns the value of x from the function, which will be either a single table (a tibble) if single_tbl is TRUE, or a list of tables (tibbles) if single_tbl is FALSE.

In summary, the read_excel_sheets function takes an Excel filename and a logical value indicating whether to return a single table or a list of tables. It uses the readxl package to extract the sheet names from the Excel file, and then reads the corresponding sheets either into a single table (if single_tbl is TRUE) or into a list of tables (if single_tbl is FALSE). The resulting data is returned as the output of the function. To see how this works, let’s look at the following example.

We have a spreadsheet that has four tabs in it – one for each species in the famous Iris dataset and then one sheet called iris, which is the full dataset.

As shown in Figure 1.5, the read_excel_sheets() function has read all four sheets of the Excel file. We can also see that the function has imported the sheets as a list object and has named each item in the list after the name of the corresponding tab in the Excel file. It is also important to note that, when the sheets are combined into a single table (single_tbl is TRUE), they should all have the same column names and structure for the result to be meaningful:

Figure 1.5 – Excel file read by read_excel_sheets()


In this section, we learned how to write a function that will read all of the sheets in any Excel file. This function will also return them as a named item list, where the names are the names of the tabs in the file itself.

Now that we have learned how to read Excel sheets in R, in the next section, we will cover Python, where we will revisit the same concepts but from the perspective of the Python language.

Python packages for Excel manipulation

In this section, we will explore how to read Excel spreadsheets using Python. One of the key aspects of working with Excel files in Python is having the right set of packages that provide the necessary functionality. In this section, we will discuss some commonly used Python packages for Excel manipulation and highlight their advantages and considerations.

Python packages for Excel manipulation

When it comes to interacting with Excel files in Python, several packages offer a range of features and capabilities. These packages allow you to extract data from Excel files, manipulate the data, and write it back to Excel files. Let’s take a look at some popular Python packages for Excel manipulation.

pandas

pandas is a powerful data manipulation library that can read Excel files using the read_excel function. The advantage of using pandas is that it provides a DataFrame object, which allows you to manipulate the data in a tabular form. This makes it easy to perform data analysis and manipulation. pandas excels in handling large datasets efficiently and provides flexible options for data filtering, transformation, and aggregation.

openpyxl

openpyxl is a widely used library specifically designed for working with Excel files. It provides a comprehensive set of features for reading and writing Excel spreadsheets, including support for various Excel file formats and compatibility with different versions of Excel. In addition, openpyxl allows fine-grained control over the structure and content of Excel files, enabling tasks such as accessing individual cells, creating new worksheets, and applying formatting.

xlrd and xlwt

xlrd and xlwt are older libraries that are still in use for reading and writing Excel files, particularly with legacy formats such as .xls. xlrd enables reading data from Excel files, while xlwt facilitates writing data to Excel files. These libraries are lightweight and straightforward to use, but they lack some of the advanced features provided by pandas and openpyxl.

Considerations

When choosing a Python package for Excel manipulation, it’s essential to consider the specific requirements of your project. Here are a few factors to keep in mind:

  • Functionality: Evaluate the package’s capabilities and ensure it meets your needs for reading Excel files. Consider whether you require advanced data manipulation features or if a simpler package will suffice.
  • Performance: If you’re working with large datasets or need efficient processing, packages such as pandas, which have optimized algorithms, can offer significant performance advantages.
  • Compatibility: Check the compatibility of the package with different Excel file formats and versions. Ensure that it supports the specific format you are working with to avoid any compatibility issues.
  • Learning curve: Consider the learning curve associated with each package. Some packages, such as pandas, have a more extensive range of functionality, but they may require additional time and effort to master.

Each package offers unique features and has its strengths and weaknesses, allowing you to read Excel spreadsheets effectively in Python. For example, if you need to read and manipulate large amounts of data, pandas may be the better choice. However, if you need fine-grained control over the Excel file, openpyxl will likely fit your needs better.

Consider the specific requirements of your project, such as data size, functionality, and compatibility, to choose the most suitable package for your needs. In the following sections, we will delve deeper into how to utilize these packages to read and extract data from Excel files using Python.

Opening an Excel sheet from Python and reading the data

When working with Excel files in Python, it’s common to need to open a specific sheet and read the data into Python for further analysis. This can be achieved using popular libraries such as pandas and openpyxl, as discussed in the previous section.

You can most likely use other Python and package versions, but the code in this section has not been tested with anything other than what we’ve stated here.

Using pandas

pandas is a powerful data manipulation library that simplifies the process of working with structured data, including Excel spreadsheets. To read an Excel sheet using pandas, you can use the read_excel function. Let’s consider an example of using the iris_data.xlsx file with a sheet named setosa:

import pandas as pd
# Read the Excel file
df = pd.read_excel('iris_data.xlsx', sheet_name='setosa')
# Display the first few rows of the DataFrame
print(df.head())

You will need to run this code either with the Python working directory set to the location where the Excel file is located, or you will need to provide the full path to the file in the read_excel() command:

Figure 1.6 – Using the pandas package to read the Excel file


In the preceding code snippet, we imported the pandas library and utilized the read_excel function to read setosa from the iris_data.xlsx file. The resulting data is stored in a pandas DataFrame, which provides a tabular representation of the data. By calling head() on the DataFrame, we displayed the first few rows of the data, giving us a quick preview.

Using openpyxl

openpyxl is a powerful library for working with Excel files, offering more granular control over individual cells and sheets. To open an Excel sheet and access its data using openpyxl, we can utilize the load_workbook function. Please note that openpyxl cannot handle .xls files, only the more modern .xlsx and .xlsm versions.

Let’s consider an example of using the iris_data.xlsx file with a sheet named versicolor:

import openpyxl
import pandas as pd
# Load the workbook
wb = openpyxl.load_workbook('iris_data.xlsx')
# Select the sheet
sheet = wb['versicolor']
# Extract the values (including header)
sheet_data_raw = sheet.values
# Separate the headers into a variable
header = next(sheet_data_raw)[0:]
# Create a DataFrame based on the second and subsequent lines of data with the header as column names
sheet_data = pd.DataFrame(sheet_data_raw, columns=header)
print(sheet_data.head())

The preceding code results in the following output:

Figure 1.7 – Using the openpyxl package to read the Excel file


In this code snippet, we import the openpyxl library (together with pandas) and load the workbook by passing the iris_data.xlsx filename to the load_workbook function. Next, we select the desired sheet by accessing it using its name – in this case, this is versicolor. Once we’ve done this, we read the raw data using the values property of the loaded sheet object. This is a generator and can be accessed via a for loop or by converting it into a list or a DataFrame, for example. Calling next() on it once returns the first row, which holds the headers, and leaves the remaining rows for the DataFrame. In this example, we have converted it into a pandas DataFrame because it is the format that is the most comfortable to work with later.

Both pandas and openpyxl offer valuable features for working with Excel files in Python. While pandas simplifies data manipulation with its DataFrame structure, openpyxl provides more fine-grained control over individual cells and sheets. Depending on your specific requirements, you can choose the library that best suits your needs.

By mastering the techniques of opening Excel sheets and reading data into Python, you will be able to extract valuable insights from your Excel data, perform various data transformations, and prepare it for further analysis or visualization. These skills are essential for anyone seeking to leverage the power of Python and Excel in their data-driven workflows.

Reading in multiple sheets with Python (openpyxl and custom functions)

In many Excel files, it’s common to have multiple sheets containing different sets of data. Being able to read in multiple sheets and consolidate the data into a single data structure can be highly valuable for analysis and processing. In this section, we will explore how to achieve this using the openpyxl library and a custom function.

The importance of reading multiple sheets

When working with complex Excel files, it’s not uncommon to encounter scenarios where related data is spread across different sheets. For example, you may have one sheet for sales data, another for customer information, and yet another for product inventory. By reading in multiple sheets and consolidating the data, you can gain a holistic view and perform a comprehensive analysis.

Let’s start by examining the basic steps involved in reading in multiple sheets:

  1. Load the workbook: Before accessing the sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl.
  2. Get the sheet names: We can obtain the names of all the sheets in the workbook using the sheetnames attribute. This allows us to identify the sheets we want to read.
  3. Read data from each sheet: By iterating over the sheet names, we can access each sheet individually and read the data. openpyxl provides methods such as iter_rows or iter_cols to traverse the cells of each sheet and retrieve the desired data.
  4. Store the data: To consolidate the data from multiple sheets, we can use a suitable data structure, such as a pandas DataFrame or a Python list. As we read the data from each sheet, we concatenate or merge it into the consolidated data structure:
    • If the data in all sheets follows the same format (as is the case in the example used in this chapter), we can simply concatenate the datasets
    • However, if the datasets have different structures because they describe different aspects of a dataset (for example, one sheet contains product information, the next contains customer data, and the third contains the sales of the products to the customers), then we can merge these datasets based on unique identifiers to create a comprehensive dataset
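
The difference between the two options in step 4 can be sketched with a few made-up DataFrames. The data, column names, and variable names below are hypothetical stand-ins for sheets that have already been read in; they are not taken from the iris workbook:

```python
import pandas as pd

# Hypothetical sheets with an identical layout (same columns)
january = pd.DataFrame({"product_id": [1, 2], "units": [10, 5]})
february = pd.DataFrame({"product_id": [1, 3], "units": [7, 2]})

# Same structure: stack the rows on top of each other
sales = pd.concat([january, february], ignore_index=True)

# Hypothetical sheet with a different layout that shares a key column
products = pd.DataFrame({"product_id": [1, 2, 3], "name": ["pen", "ink", "pad"]})

# Different structure: join on the shared identifier
sales_with_names = sales.merge(products, on="product_id", how="left")
print(sales_with_names)
```

A left join is used here so that every sales row is kept even if a product were missing from the lookup sheet; other join types may suit other datasets better.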

Using openpyxl to access sheets

openpyxl is a powerful library that allows us to interact with Excel files using Python. It provides a wide range of functionalities, including accessing and manipulating multiple sheets. Before we dive into the details, let’s take a moment to understand why openpyxl is a popular choice for this task.

One of the primary advantages of openpyxl is its ability to handle various Excel file formats, such as .xlsx and .xlsm. This flexibility allows us to work with different versions of Excel files without compatibility issues. Additionally, openpyxl provides a straightforward and intuitive interface to access sheet data, making it easier for us to retrieve the desired information.

Reading data from each sheet

To begin reading in multiple sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl. This function takes the file path as input and returns a workbook object that represents the entire Excel file.

Once we have loaded the workbook, we can retrieve the names of all the sheets using the sheetnames attribute. This gives us a list of sheet names present in the Excel file. We can then iterate over these sheet names to read the data from each sheet individually.
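
As a small illustration of iterating over sheetnames, the following sketch builds a workbook in memory so that it does not depend on any file on disk. The sheet names are chosen to mirror the iris example, but the workbook itself is empty and purely illustrative:

```python
from openpyxl import Workbook

# Build a small in-memory workbook so the example needs no file on disk
wb = Workbook()
wb.active.title = "setosa"      # rename the default sheet
wb.create_sheet("versicolor")
wb.create_sheet("virginica")

# sheetnames is a plain list of strings, in workbook order
for name in wb.sheetnames:
    sheet = wb[name]            # look up each worksheet by its name
    print(name, sheet.max_row, sheet.max_column)
```

With a real file, the only change would be to obtain wb from load_workbook instead of creating it with Workbook().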

Retrieving sheet data with openpyxl

openpyxl provides various methods to access the data within a sheet.

Two commonly used methods are iter_rows and iter_cols. These methods allow us to iterate over the rows or columns of a sheet and retrieve the cell values.

Let’s have a look at how iter_rows can be used:

# Assuming wb is the workbook loaded earlier, select the versicolor sheet
sheet = wb['versicolor']
# Iterate over rows and print raw values
for row in sheet.iter_rows(min_row=1, max_row=5, values_only=True):
    print(row)

Similarly, iter_cols can be used like this:

# Iterate over columns and print raw values
for column in sheet.iter_cols(min_col=1, max_col=5, values_only=True):
    print(column)

When using iter_rows or iter_cols, the values_only argument controls what is returned. With values_only=True, we get the raw values stored in the cells. With the default, values_only=False, we get Cell objects instead, which expose the value together with cell-level attributes such as the number format applied to dates or numbers.
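
The effect of the values_only argument can be seen in a short sketch. It uses an in-memory workbook with made-up data rather than the iris file: with values_only=True, each row is a tuple of plain Python values, and without it, each row is a tuple of Cell objects whose value and number_format attributes can be inspected:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["species", "sepal_length"])   # header row (made-up data)
ws.append(["versicolor", 7.0])
ws["B2"].number_format = "0.00"          # display format only; stored value is unchanged

# values_only=True yields tuples of plain Python values
for row in ws.iter_rows(min_row=2, values_only=True):
    print(row)

# Without values_only, each item is a Cell object
for species_cell, length_cell in ws.iter_rows(min_row=2):
    print(length_cell.coordinate, length_cell.value, length_cell.number_format)
```

Note that openpyxl exposes the format code, not the text that Excel would display after applying it.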

By iterating over the rows or columns of a sheet, we can retrieve the desired data and store it in a suitable data structure. One popular choice is a pandas DataFrame, which provides a tabular representation of the data and offers convenient methods for data manipulation and analysis.

An alternative solution is using the values attribute of the sheet object. This provides a generator that yields each row of the sheet as a tuple of values (much like iter_rows does with values_only=True). While a generator cannot be indexed directly, it can be used in a for loop to iterate over each row. The pandas DataFrame constructor also accepts a suitable generator object and converts it directly into a DataFrame.
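
Because the values attribute is a generator, it can only be consumed once. The following sketch, again using an in-memory workbook with made-up data, shows the header being taken off with next() and the remaining rows being passed to pandas:

```python
from openpyxl import Workbook
import pandas as pd

wb = Workbook()
ws = wb.active
ws.append(["species", "petal_length"])   # header row (made-up data)
ws.append(["setosa", 1.4])
ws.append(["setosa", 1.3])

rows = ws.values            # a generator: indexing such as rows[0] is not possible
header = next(rows)         # consumes the first row (the header)
df = pd.DataFrame(rows, columns=list(header))   # consumes the remaining rows

print(df)
print(list(rows))           # the generator is now exhausted
```

If the rows are needed more than once, reading ws.values again produces a fresh generator.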

Combining data from multiple sheets

As we read the data from each sheet, we can store it in a list or dictionary, depending on our needs. Once we have retrieved the data from all the sheets, we can combine it into a single consolidated data structure. This step is crucial for further analysis and processing.

To combine the data, we can use pandas DataFrames. By creating an individual DataFrame for each sheet’s data and then concatenating or merging them into a single DataFrame, we can obtain a comprehensive dataset that includes all the information from multiple sheets.
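
As a pandas-only alternative to the openpyxl approach, read_excel accepts sheet_name=None, in which case it returns a dictionary that maps each sheet name to a DataFrame. The sketch below writes a throwaway workbook with made-up data to a temporary folder first, so the file name, sheet names, and the added sheet column are all illustrative assumptions:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo.xlsx"   # throwaway file for the example

    # Write two made-up sheets with the same columns
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
        pd.DataFrame({"x": [3]}).to_excel(writer, sheet_name="second", index=False)

    # sheet_name=None returns a dict mapping sheet name -> DataFrame
    frames = pd.read_excel(path, sheet_name=None)

# Tag each DataFrame with its sheet of origin, then stack them
combined = pd.concat(
    [df.assign(sheet=name) for name, df in frames.items()],
    ignore_index=True,
)
print(combined)
```

Keeping the sheet name as a column makes it possible to tell afterward which sheet each row came from.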

Custom function for reading multiple sheets

To simplify the process of reading in multiple sheets and consolidating the data, we can create custom functions tailored to our specific requirements. These functions encapsulate the necessary steps and allow us to reuse the code efficiently.

In our example, we define a function called read_multiple_sheets that takes the file path as input. Inside the function, we load the workbook using load_workbook and iterate over the sheet names retrieved with the sheetnames attribute.

For each sheet, we access it using the workbook object and retrieve the data using the custom read_single_sheet function. We then store the retrieved data in a list. Finally, we combine the data from all the sheets into a single pandas DataFrame using the appropriate concatenation method from pandas.

By using these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset that’s ready for analysis. The function provides a reusable and efficient solution, saving us time and effort in dealing with complex Excel files.

Customizing the code

The provided example is a starting point that you can customize based on your specific requirements. Here are a few considerations for customizing the code:

  • Filtering columns: If you only need specific columns from each sheet, you can modify the code to extract only the desired columns during the data retrieval step. You can do this by using the iter_cols method instead of the values attribute and using a filtered list in a for loop, or by filtering the resulting pandas DataFrame object(s).
  • Handling missing data: If the sheets contain missing data, you can incorporate appropriate handling techniques, such as filling in missing values or excluding incomplete rows.
  • Applying transformations: Depending on the nature of your data, you might need to apply transformations or calculations to the consolidated dataset. The custom function can be expanded to accommodate these transformations.
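
The three customizations above can be sketched on a small, made-up DataFrame that stands in for the consolidated data. The column names and values are hypothetical, and dropping incomplete rows is only one of several reasonable ways to handle missing data:

```python
import pandas as pd

# Made-up stand-in for a consolidated DataFrame read from several sheets
data = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica"],
        "sepal_length": [5.0, None, 6.5],
        "sepal_width": [3.5, 3.0, 3.3],
    }
)

# Filtering columns: keep only the columns of interest
subset = data[["species", "sepal_length"]]

# Handling missing data: drop incomplete rows (fillna() is the alternative)
complete = subset.dropna()

# Applying transformations: derive a new column from an existing one
result = complete.assign(sepal_length_mm=complete["sepal_length"] * 10)
print(result)
```

Each step returns a new DataFrame, so the original data stays available for comparison.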

Remember, the goal is to tailor the code to suit your unique needs and ensure it aligns with your data processing requirements.
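The three considerations above can be sketched as follows. This is a minimal illustration rather than the book's own code: the in-memory sample workbook, the helper name read_sheet_columns, and the derived sepal_area column are all assumptions made so that the example runs without an Excel file on disk.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook, load_workbook

# Hypothetical sample workbook, built in memory so the sketch runs without a file
sample = Workbook()
first = sample.active
first.title = "setosa"
first.append(["sepal_length", "sepal_width", "species"])
first.append([5.1, 3.5, "setosa"])
first.append([4.9, None, "setosa"])  # this row is missing sepal_width
second = sample.create_sheet("virginica")
second.append(["sepal_length", "sepal_width", "species"])
second.append([6.3, 3.3, "virginica"])
buffer = BytesIO()
sample.save(buffer)
buffer.seek(0)


def read_sheet_columns(workbook, sheet_name, keep_columns):
    # Hypothetical helper: keep only the columns whose header is in keep_columns
    sheet = workbook[sheet_name]
    data = {}
    for column in sheet.iter_cols(values_only=True):
        # Each column arrives as a tuple: the header cell first, then the values
        header, *values = column
        if header in keep_columns:
            data[header] = values
    return pd.DataFrame(data)


workbook = load_workbook(buffer)
wanted = ["sepal_length", "sepal_width"]

# Filtering columns: read only the wanted columns from every sheet
frames = [read_sheet_columns(workbook, name, wanted) for name in workbook.sheetnames]
consolidated = pd.concat(frames, ignore_index=True)

# Handling missing data: drop incomplete rows (fillna would fill them instead)
complete = consolidated.dropna().reset_index(drop=True)

# Applying a transformation: derive a new column from existing ones
complete["sepal_area"] = complete["sepal_length"] * complete["sepal_width"]
print(complete)
```

Filtering while reading keeps unwanted columns out of memory altogether, whereas filtering the finished DataFrame is usually simpler to write; which one fits better depends on the size of the workbook.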

By leveraging the power of openpyxl and creating custom functions, you can efficiently read in multiple sheets from Excel files, consolidate the data, and prepare it for further analysis. This capability enables you to unlock valuable insights from complex Excel files and leverage the full potential of your data.

Now, let’s dive into an example that demonstrates this process:

from openpyxl import load_workbook
import pandas as pd

def read_single_sheet(workbook, sheet_name):
    # Load the sheet from the workbook
    sheet = workbook[sheet_name]
    # Read out the raw data including headers (a generator of row tuples)
    sheet_data_raw = sheet.values
    # Take the first row as the headers; the generator then yields only the data rows
    columns = next(sheet_data_raw)
    # Create a DataFrame from the remaining rows with the headers as column names and return it
    return pd.DataFrame(sheet_data_raw, columns=columns)

def read_multiple_sheets(file_path):
    # Load the workbook
    workbook = load_workbook(file_path)
    # Get a list of all sheet names in the workbook
    sheet_names = workbook.sheetnames
    # Read each sheet into its own DataFrame
    frames = [read_single_sheet(workbook=workbook, sheet_name=sheet_name) for sheet_name in sheet_names]
    # Concatenate them into a single DataFrame with a fresh row index
    return pd.concat(frames, ignore_index=True)

# Define the file path
file_path = 'iris_data.xlsx' # adjust the path as needed
# Read the data from multiple sheets
consolidated_data = read_multiple_sheets(file_path)
# Display the first rows of the consolidated data
print(consolidated_data.head())

Let’s have a look at the results:

Figure 1.8 – Using the openpyxl package to read in the Excel file


In the preceding code, we define two functions:

  • read_single_sheet: This reads the data from a single sheet into a pandas DataFrame
  • read_multiple_sheets: This reads and concatenates the data from all sheets in the workbook

Within the read_multiple_sheets function, we load the workbook using load_workbook and iterate over the sheet names. For each sheet, we call the read_single_sheet helper function, which reads the sheet's data and creates a pandas DataFrame from it, with the header row used as column names. Finally, we use pd.concat to combine all the DataFrames into a single consolidated DataFrame.
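To make the final concatenation step concrete, here is a small sketch using two hand-built DataFrames in place of real sheets. The names sheet_a and sheet_b and their values are illustrative assumptions; only pd.concat and its ignore_index parameter come from the code above.

```python
import pandas as pd

# Two hypothetical per-sheet DataFrames with identical headers
sheet_a = pd.DataFrame({"species": ["setosa", "setosa"], "petal_length": [1.4, 1.3]})
sheet_b = pd.DataFrame({"species": ["virginica"], "petal_length": [6.0]})

# Without ignore_index, each frame keeps its own row labels
stacked = pd.concat([sheet_a, sheet_b])

# With ignore_index=True, the result gets a fresh 0..n-1 index
consolidated = pd.concat([sheet_a, sheet_b], ignore_index=True)

print(list(stacked.index))       # [0, 1, 0]
print(list(consolidated.index))  # [0, 1, 2]
```

Passing ignore_index=True avoids duplicate row labels in the consolidated result, which matters as soon as rows are selected by label.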

By utilizing these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset. This allows us to perform various data manipulations, analyses, or visualizations on the combined data.

Understanding how to handle multiple sheets efficiently enhances our ability to work with complex Excel files and extract valuable insights from diverse datasets.
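For comparison, pandas can also read every sheet in one call by passing sheet_name=None to pd.read_excel, which returns a dictionary that maps sheet names to DataFrames. The sketch below is an alternative to the openpyxl-based functions above, not a replacement for them; the in-memory sample workbook and the "sheet" and "row" level names are assumptions made for illustration.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Hypothetical sample workbook, built in memory so the sketch runs without a file
sample = Workbook()
first = sample.active
first.title = "setosa"
first.append(["species", "petal_length"])
first.append(["setosa", 1.4])
second = sample.create_sheet("virginica")
second.append(["species", "petal_length"])
second.append(["virginica", 6.0])
buffer = BytesIO()
sample.save(buffer)
buffer.seek(0)

# sheet_name=None asks pandas for every sheet: {sheet name: DataFrame}
sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")

# Concatenating the dict uses the sheet names as an outer index level;
# moving that level into a column records which sheet each row came from
consolidated = (
    pd.concat(sheets, names=["sheet", "row"])
    .reset_index(level="sheet")
    .reset_index(drop=True)
)
print(consolidated)
```

The openpyxl route gives finer control over how individual cells are read, while this route is shorter when the sheets are already clean tables.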

Summary

In this chapter, we explored the process of importing data from Excel spreadsheets into our programming environments. For R users, we delved into the functionalities of libraries such as readxl, xlsx, and openxlsx, providing efficient solutions for extracting and manipulating data. We also introduced a custom function, read_excel_sheets, to streamline the process of extracting data from multiple sheets within Excel files. On the Python side, we discussed the essential pandas and openpyxl packages for Excel manipulation, demonstrating their features through practical examples. At this point, you should have a solid understanding of these tools and their capabilities for efficient Excel manipulation and data analysis.

In the next chapter, we will learn how to write the results to Excel.


Key benefits

  • Perform advanced data analysis and visualization techniques with R and Python on Excel data
  • Use exploratory data analysis and pivot table analysis for deeper insights into your data
  • Integrate R and Python code directly into Excel using VBA or API endpoints
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Extending Excel with Python and R is a game-changing resource written by experts Steven Sanderson, the author of the healthyverse suite of R packages, and David Kun, co-founder of Functional Analytics. This comprehensive guide transforms the way you work with spreadsheet-based data by integrating Python and R with Excel to automate tasks, execute statistical analysis, and create powerful visualizations. Working through the chapters, you’ll find out how to perform exploratory data analysis, time series analysis, and even integrate APIs for maximum efficiency. Both beginners and experts will get everything they need to unlock Excel's full potential and take their data analysis skills to the next level. By the end of this book, you’ll be able to import data from Excel, manipulate it in R or Python, and perform the data analysis tasks in your preferred framework while pushing the results back to Excel for sharing with others as needed.

Who is this book for?

If you’re a data analyst, data scientist, quant, actuary, or other data practitioner looking to enhance your Excel skills and expand your data analysis capabilities with R and Python, this book is for you. The comprehensive approach to the topics covered makes it suitable for both beginners and intermediate learners. A basic understanding of Excel, Python, and R is all you need to get started.

What you will learn

  • Read and write Excel files with R and Python libraries
  • Automate Excel tasks with R and Python scripts
  • Use R and Python to execute Excel VBA macros
  • Format Excel sheets using R and Python packages
  • Create graphs with ggplot2 and Matplotlib in Excel
  • Analyze Excel data with statistical methods and time series analysis
  • Explore various methods to call R and Python functions from Excel

Product Details

Publication date : Apr 30, 2024
Length: 344 pages
Edition : 1st
Language : English
ISBN-13 : 9781804610695





Table of Contents

19 Chapters
Part 1: The Basics – Reading and Writing Excel Files from R and Python
Chapter 1: Reading Excel Spreadsheets
Chapter 2: Writing Excel Spreadsheets
Chapter 3: Executing VBA Code from R and Python
Chapter 4: Automating Further – Task Scheduling and Email
Part 2: Making It Pretty – Formatting, Graphs, and More
Chapter 5: Formatting Your Excel Sheet
Chapter 6: Inserting ggplot2/matplotlib Graphs
Chapter 7: Pivot Tables and Summary Tables
Part 3: EDA, Statistical Analysis, and Time Series Analysis
Chapter 8: Exploratory Data Analysis with R and Python
Chapter 9: Statistical Analysis: Linear and Logistic Regression
Chapter 10: Time Series Analysis: Statistics, Plots, and Forecasting
Part 4: The Other Way Around – Calling R and Python from Excel
Chapter 11: Calling R/Python Locally from Excel Directly or via an API
Part 5: Data Analysis and Visualization with R and Python for Excel Data – A Case Study
Chapter 12: Data Analysis and Visualization with R and Python in Excel – A Case Study
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
5 out of 5 (5 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Amazon Customer, Jun 17, 2024 (5 stars, Amazon verified review)
Exciting Book Review for Amazon 📊📈 Extending Excel with Python and R by Steven Sanderson and David Kun is a game-changer! This incredible guide shows you how to integrate Python and R with Excel, automating tasks, performing advanced analyses, and creating stunning visualizations. Perfect for data enthusiasts, it covers everything from VBA macros to time series forecasting. Elevate your data skills and make Excel even more powerful. Highly recommended! 📚✨

Maxim, Jun 17, 2024 (5 stars, Amazon verified review)
I am a Stats major. During my Senior year I took lots of R, Python and Advanced Excel classes. This book covers most of the things that I learned in college and provides code examples that you can use if you want to practice Data Analysis in Python. Great resource if you want to learn Data Science and Data Automation at your own pace.

Taylor Balfanz, Jun 13, 2024 (5 stars, Amazon verified review)
This is a must-read for any data enthusiast or professional seeking to enhance their analytical capabilities. This book seamlessly bridges the gap between Excel and the powerful programming languages Python and R. The book is well-structured, guiding the reader through the integration of Python and R with Excel step-by-step. Each chapter builds on the previous one, ensuring a solid understanding of the material. The practical examples and real-world case studies included in the book are particularly beneficial, as they illustrate the application of the concepts in a tangible way. Another noteworthy aspect of this book is its comprehensive coverage. It not only covers the basics but also delves into advanced topics, ensuring that readers gain a thorough understanding of how to leverage Python and R to extend Excel’s functionality. From automating repetitive tasks to performing sophisticated data analysis, the book equips readers with the knowledge needed to effectively increase their data analysis capabilities.

Gustavo, May 15, 2024 (5 stars, Amazon verified review)
The book is very complete and takes you from very basic tasks like loading data from excel sheets to Python and R until more complex ones, such as VBA coding and automation. I especially liked the chapter about Time Series. Good book to Excel users to get started with programming languages.

Jay, Jul 27, 2024 (5 stars, Amazon verified review)
I did pick this without thinking too much. I am pleasantly surprised by the content in here. I did think people might wonder why do you need Excel if you know Python and R, right? Well, my perspective was in 90% of jobs around the world, excel is going to be go to for majority of folks, that's how you submit reports to management or share stuff with other teams, and knowing to connect R and Python to do the analysis as a plugin to Excel can help a lot! Was surprised how much of the complex excel stuff you can automate easily with Python and R. Helped a ton with automating some of the stuff I never thought about previously. Great read!

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription, go to the account page, found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription. From there, you will see the ‘cancel subscription’ button in the grey box that contains your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle, which is a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage, subscription.packtpub.com, by clicking the ‘my library’ dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; there are always new versions, new frameworks, and new techniques. This feature gives you a head start on our content as it's being created. With Early Access, you'll receive each chapter as it's written and get regular updates throughout the product's development, as well as the final course as soon as it's ready. We created Early Access as a means of giving you the information you need as soon as it's available. As we go through the process of developing a course, 99% of it can be ready, but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.