Architectural considerations
Before you can do any real analysis or prediction, it is critical to find the right data and determine its quality. There is an old rule of thumb for data preparation: any analyst will tell you that 80% of the work is in normalizing the data. Where possible, store data in binary formats such as Parquet and GeoParquet, which are optimized for parallel reads, columnar storage, and compression. Choosing a partitioning scheme is also crucial to performance; common choices are partitioning by time, by location (such as country, state, or zip code), or by Geohash. Lastly, you may want to consider storing data in a database that can create a true geospatial index in memory. Geospatial partitioning in a data lake can only be optimized so much; at some point, you may need a database such as RDS or Redshift that has native index support for geometry and geography datatypes.
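To make the partitioning idea concrete, here is a minimal sketch of how a record's timestamp and coordinates could be turned into a partition prefix that combines time and Geohash. It uses only the Python standard library. The function names (geohash_encode, partition_path) and the year=/month=/geohash= key layout are assumptions chosen for illustration, not a prescribed scheme; in practice, the Geohash precision should be tuned to the data volume and query patterns.

```python
from datetime import datetime

# Standard Geohash base-32 alphabet (digits plus letters, omitting a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a Geohash string.

    Bits alternate between longitude and latitude, starting with
    longitude; every 5 bits become one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit always refines longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


def partition_path(event_time: datetime, lat: float, lon: float,
                   precision: int = 3) -> str:
    """Build a Hive-style partition prefix (hypothetical layout).

    A short Geohash keeps the number of partitions manageable while
    still grouping nearby records together.
    """
    cell = geohash_encode(lat, lon, precision)
    return f"year={event_time.year}/month={event_time.month:02d}/geohash={cell}"


if __name__ == "__main__":
    print(partition_path(datetime(2023, 5, 17), 42.6, -5.6))
```

Records that fall in the same Geohash cell and month land under the same prefix, so a query filtered on time and area only has to read the matching partitions. This pruning is coarse compared with a true spatial index, which is the limitation the paragraph above describes.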