Common libraries and algorithms
SageMaker originally launched mostly as a notebook service called SageMaker Notebooks. Most data scientists are intimately familiar with Jupyter Notebook, which is a fantastic environment for analysis as well as collaboration. In the following demo, we will launch a SageMaker notebook and pull in a few common Python libraries to work with a dataset. Probably the most common of these libraries is pandas; there is also GeoPandas, which extends pandas for geospatial datasets. The pandas library has fantastic support for reading in and writing out CSV datasets as well as binary formats such as Parquet and ORC. Avro is not read natively by pandas and usually goes through a third-party package such as fastavro.
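As a rough illustration of the CSV support described above, here is a minimal sketch of a read and write round trip with pandas. The sample data, column names, and the `weather.parquet` file name are made up for this example and do not come from the demo dataset. An in-memory buffer stands in for a file on disk so the snippet runs anywhere, including a notebook cell.

```python
import io

import pandas as pd

# Hypothetical sample data, invented purely for illustration.
csv_text = """city,temp_c
Seattle,11.5
Austin,24.0
Boston,8.25
"""

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write the frame back out as CSV; index=False omits the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the written CSV again to confirm nothing was lost on the way.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

# Parquet follows the same pattern, but needs an optional engine
# (pyarrow or fastparquet) to be installed:
#   df.to_parquet("weather.parquet")
#   df = pd.read_parquet("weather.parquet")
```

In a real notebook you would normally pass a file path or an S3 URI instead of a buffer; reading directly from S3 assumes the optional s3fs package is available in the environment.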
There is also a plethora of functions for just about any kind of data manipulation you could think of: pivots, resampling, dropping columns, renaming columns, trimming, and regex filters. The pandas library is the de facto standard for ETL in the data science community. Another library we will load in is a built...