Data acquisition, discovery, and preprocessing
Data for building and training ML models is often prepared by the data engineering team and handed over to the data science team. In our case, data engineers might have already brought the data into the lakehouse, the data warehouse, or both. For simplicity's sake, however, this example ingests data from Azure Open Datasets (https://learn.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets) into the lakehouse that we created earlier in this chapter.
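As a rough illustration of what ingesting from Azure Open Datasets into a lakehouse can look like, here is a minimal sketch. It assumes the NYC Taxi (yellow) open dataset and its publicly documented storage account, container, and path; the chapter's notebook may use a different dataset or location. The function names and the table name nyc_yellow_taxi_raw are hypothetical, and the spark argument is assumed to be the session that a Fabric notebook provides once a default lakehouse is attached.

```python
# Assumed public storage details for the NYC Taxi (yellow) open dataset.
# These are illustrative; the chapter's notebook may point elsewhere.
ACCOUNT = "azureopendatastorage"
CONTAINER = "nyctlc"
RELATIVE_PATH = "yellow"


def build_wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Build the wasbs:// URL for a folder in a public blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def ingest_to_lakehouse(spark, table_name: str = "nyc_yellow_taxi_raw"):
    """Read the open dataset and save it as a Delta table.

    `spark` is the notebook's SparkSession; `table_name` is a hypothetical
    name chosen for this sketch. With a default lakehouse attached, the
    table is expected to land in that lakehouse.
    """
    # An empty SAS token is used here for anonymous access to the public container.
    spark.conf.set(
        f"fs.azure.sas.{CONTAINER}.{ACCOUNT}.blob.core.windows.net", ""
    )
    source_path = build_wasbs_path(CONTAINER, ACCOUNT, RELATIVE_PATH)
    df = spark.read.parquet(source_path)
    df.write.mode("overwrite").format("delta").saveAsTable(table_name)
    return df
```

In a notebook cell, this would be called as ingest_to_lakehouse(spark). Keeping the path construction in its own function makes it easy to check the source location before starting a large read.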
Data acquisition
In Chapter 3, Building an End-to-End Analytics System – Lakehouse, we learned how to open an imported notebook and how to attach a lakehouse as the default lakehouse for the opened notebook. Please ensure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating a lakehouse and ingesting data using Apache Spark section of this chapter. Once you've done that, you can import the data you will...