Running pandas code using AWS Glue for Ray
The pandas library is a highly popular Python library for data manipulation and analysis. It builds on the well-established numpy library and handles data in a table-like format. It is so widely used among Python analysts and data scientists that it has become a de facto standard, to the point that other libraries implement its interfaces so that they can run existing pandas code. This is often done to overcome the limitations of pandas, namely that it is a single-process, in-memory library, which limits scalability.
One such pandas-compatible library is Modin. It can run pandas code by just changing the imports, and it scales by using an engine such as Dask or Ray. In this recipe, you will see how to run pandas code on Glue for Ray using Modin.
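To make the import swap concrete, here is a minimal sketch rather than the recipe's actual script. The function name total_by_city and the sample rows are hypothetical, made up for illustration. The fallback to plain pandas is there only so the sketch also runs where Modin is not installed; in a Modin job you would keep just the first import.

```python
# Use Modin when it is installed; otherwise fall back to plain pandas.
# The rest of the code is identical in both cases, which is the point.
try:
    import modin.pandas as pd  # replaces "import pandas as pd"
except ImportError:
    import pandas as pd


def total_by_city(rows):
    """Sum the 'amount' column per 'city' (hypothetical helper)."""
    df = pd.DataFrame(rows)
    totals = df.groupby("city")["amount"].sum()
    # Convert to plain Python ints so the result is easy to print and compare.
    return {city: int(total) for city, total in totals.to_dict().items()}


# Illustrative data only, not taken from the recipe.
sample_rows = [
    {"city": "Lisbon", "amount": 10},
    {"city": "Madrid", "amount": 5},
    {"city": "Lisbon", "amount": 30},
]

totals = total_by_city(sample_rows)
print(totals)
```

Because only the import line differs, existing pandas code can usually be moved over without touching the data logic. Operations that Modin has not yet parallelized may still work, but they can fall back to running in plain pandas.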
Getting ready
This recipe requires a bash shell with the AWS CLI installed and configured. The GLUE_ROLE_ARN and GLUE_BUCKET environment variables need to be set, as indicated in...