Geospatial with EMR
Alright, now let’s get to the good part: Geospatial with EMR! Out of the box, EMR has pretty limited geospatial support. I’ve seen a few projects add SQL geospatial functions to Hive with some success. Two big initiatives are Apache Sedona and Esri GeoAnalytics. Both have launched and are running production workloads today. In our previous example, we talked about using EMR with the default Python frameworks loaded to work with data, but to take advantage of parallel processing on the cluster, you would want to use something built on Spark. Both Sedona and GeoAnalytics do exactly that: they let you work with your data from PySpark using either Python or SQL syntax, and they parallelize the processing across the cluster.
With PySpark, the syntax is pretty simple, as in the following example, where we load GeoJSON into a pandas DataFrame. We could just as easily create a Spark DataFrame with a command like this: SparkDataFrame=spark...
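To make that a bit more concrete, here is a minimal sketch of the GeoJSON-to-DataFrame step. The sample data, the geojson_to_records helper, and the to_spark_dataframe helper are all hypothetical names made up for illustration. The command above is cut off, so the use of spark.createDataFrame here is an assumption about one common way to do it, not necessarily the call the original example used. The Spark part needs pandas and pyspark installed and an existing SparkSession, so it is wrapped in a function rather than run directly.

```python
import json

# Hypothetical sample data; in practice this would come from a file or S3.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Depot A"},
     "geometry": {"type": "Point", "coordinates": [-122.33, 47.61]}},
    {"type": "Feature",
     "properties": {"name": "Depot B"},
     "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]}}
  ]
}
"""


def geojson_to_records(text):
    """Flatten a GeoJSON FeatureCollection into one dict per feature."""
    collection = json.loads(text)
    records = []
    for feature in collection["features"]:
        # Start from the feature's attributes (may be null in GeoJSON).
        record = dict(feature.get("properties") or {})
        # Keep the geometry as a GeoJSON string so it fits a plain string column.
        record["geometry"] = json.dumps(feature["geometry"])
        records.append(record)
    return records


def to_spark_dataframe(spark, records):
    """Build a Spark DataFrame from the records by way of pandas.

    Assumes pandas and pyspark are installed and `spark` is a SparkSession.
    """
    import pandas as pd

    pandas_df = pd.DataFrame(records)
    return spark.createDataFrame(pandas_df)


records = geojson_to_records(SAMPLE_GEOJSON)
```

On a cluster you would call to_spark_dataframe(spark, records) with the session your notebook already provides. Keeping the geometry as a string is just one option; a library like Sedona can then parse that column into a real geometry type for spatial queries.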