Reusing libraries in your Glue job
Spark provides a rich data processing framework that can be extended with additional plugins, libraries, and Python modules. As you build more jobs, you will likely want to reuse your own code, whether it's UDFs that process data in ways the built-in Spark functions can't, or pipeline code you apply regularly; for instance, a function bundling transformations you perform in many jobs.
In this recipe, you will see how you can reuse Python code on Glue for Spark jobs.
Getting ready
This recipe requires a bash shell with the AWS CLI installed and configured, and the GLUE_ROLE_ARN and GLUE_BUCKET environment variables set, as indicated in the Technical requirements section at the beginning of the chapter.
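If they are not set in your current shell, you can export both variables as follows; the values shown are placeholders, not part of the recipe, and must be replaced with your own role ARN and bucket name:

# Placeholder values -- replace with your own role ARN and bucket name
export GLUE_ROLE_ARN=arn:aws:iam::123456789012:role/MyGlueRole
export GLUE_BUCKET=my-glue-bucket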
How to do it...
- The following bash commands will create a Python module and config file:

mkdir my_module
cat <<EOF > my_module/__init__.py
from random import randint

def do_some_calculation(a):
    return ...
EOF
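Once the module exists, the usual way to make it available to a Glue for Spark job is to package it, upload it to Amazon S3, and reference it through the --extra-py-files job parameter; the job script can then use a regular import such as from my_module import do_some_calculation. The following bash commands are a minimal sketch of that flow; the job name, script location, and S3 paths are illustrative assumptions, not part of the recipe:

# Package the module so Python can import it from the zip root
zip -r my_module.zip my_module
# Upload the archive to S3 (illustrative path)
aws s3 cp my_module.zip s3://$GLUE_BUCKET/libs/my_module.zip
# Create a job that loads the module through the --extra-py-files special
# parameter (assumes a job script was already uploaded to the ScriptLocation below)
aws glue create-job \
    --name reuse-my-module \
    --role "$GLUE_ROLE_ARN" \
    --command Name=glueetl,ScriptLocation=s3://$GLUE_BUCKET/scripts/job.py \
    --default-arguments '{"--extra-py-files":"s3://'$GLUE_BUCKET'/libs/my_module.zip"}'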