Practical lab
Your team has been given a new data source that delivers Parquet files to DBFS. These files could arrive every minute or only once a day; the rate and volume will vary. The data must be updated once every hour, but only if new files have been delivered.
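The "once every hour, if any new data has been delivered" requirement can be read as a check that the hourly job runs before doing any work. The following is a minimal sketch of that check, not the lab's solution: the function name new_parquet_files, the reliance on file modification times, and the idea of passing in the previous run's timestamp are all assumptions made for illustration.

```python
from pathlib import Path


def new_parquet_files(directory, last_run_ts):
    """Return Parquet files under `directory` modified after `last_run_ts`.

    `last_run_ts` is a Unix timestamp (seconds) for the previous run.
    Hypothetical helper: the name and the mtime-based approach are assumptions.
    """
    root = Path(directory)
    if not root.exists():
        # Nothing has been delivered yet, so there is nothing new to process.
        return []
    return sorted(
        str(p)
        for p in root.rglob("*.parquet")
        if p.is_file() and p.stat().st_mtime > last_run_ts
    )
```

An hourly job could call this with the landing folder and the time of its last run, and skip the update when the returned list is empty.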
Setup
Let’s set up our environment and create some fake data using Python.
Setting up folders
The following code can be run in a notebook. Here, I am using the %sh shell magic to remove any lab folders left over from a previous run:
%sh
rm -rf /dbfs/tmp/chapter_4_lab_test_data
rm -rf /dbfs/tmp/chapter_4_lab_bronze
rm -rf /dbfs/tmp/chapter_4_lab_silver
rm -rf /dbfs/tmp/chapter_4_lab_gold
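If you would rather stay in Python than use the shell magic, the same cleanup can be sketched with the standard library. This is an illustrative alternative, not the lab's own code: the names LAB_DIRS and reset_dirs are hypothetical, and the sketch assumes the /dbfs paths are reachable as ordinary local paths from the notebook.

```python
import shutil

# The four lab folders removed by the shell commands above.
LAB_DIRS = [
    "/dbfs/tmp/chapter_4_lab_test_data",
    "/dbfs/tmp/chapter_4_lab_bronze",
    "/dbfs/tmp/chapter_4_lab_silver",
    "/dbfs/tmp/chapter_4_lab_gold",
]


def reset_dirs(paths):
    """Remove each directory tree in `paths` (hypothetical helper).

    ignore_errors=True means a folder that does not exist yet is skipped
    instead of raising, but it also hides other failures such as
    permission errors.
    """
    for path in paths:
        shutil.rmtree(path, ignore_errors=True)
```

Calling reset_dirs(LAB_DIRS) in a notebook cell would then clear the lab folders before each run.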
Creating fake data
Use the following code to create fake data for our problems:
from faker import Faker

# Faker produces realistic-looking fake values (names, addresses, and so on)
fake = Faker()
def generate_data(num):
    row = [{"name":fake.name(),
           "address":fake.address(),
           "city"...