Code development on EMR using Workspaces
Developing data processing code on complex distributed frameworks is much more productive when done interactively, using representative data and inspecting the results of the transformations at each step. This has increased the popularity of languages that can be run interactively, such as Python or Scala.
While you can do some interactive development in a shell, this stops being practical as the code grows. A more productive approach is a notebook made of cells, where each cell holds and executes a block of code but the variables are shared across the notebook, so the work you do in one cell is visible to the others. That way, you can develop and test a small piece of code at a time and see the results.
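The cell-by-cell workflow described above can be sketched in plain Python, so that it runs without a cluster. The sample records and variable names here are hypothetical, and the comments only mark where cell boundaries would fall in a real notebook:

```python
# Cell 1: load a small, representative sample (hypothetical data).
records = [
    {"user": "ana", "amount": 12.5},
    {"user": "bob", "amount": -3.0},
    {"user": "ana", "amount": 7.5},
]

# Cell 2: apply one transformation and inspect the result before moving on.
valid = [r for r in records if r["amount"] > 0]
print(valid)

# Cell 3: build on the variable defined in the previous cell.
totals = {}
for r in valid:
    totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
print(totals)
```

Because `valid` stays in the shared state, you could rerun only the last cell while refining the aggregation, without reloading or refiltering the data.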
EMR has traditionally supported this style of development with Apache Zeppelin, which can be installed on the cluster to run multiple types of notebooks including Spark or Bash, with multiple...