Preprocessing the CIDDS-001 dataset
In the last section, we identified some issues with the dataset that we need to address to improve the accuracy of our model.
The CIDDS-001 dataset includes diverse types of data: we have numerical values such as duration, categorical features such as protocols (TCP, UDP, ICMP, and IGMP), and others such as timestamps or IP addresses. In the following exercise, we will choose how to represent these data types based on the information from the previous section and expert knowledge:
- First, we can one-hot-encode the day of the week by retrieving this information from the timestamp. We will rename the resulting columns to make them more readable:
df['weekday'] = df['Date first seen'].dt.weekday
df = pd.get_dummies(df, columns=['weekday']).rename(columns = {'weekday_0': 'Monday','weekday_1': 'Tuesday','weekday_2': 'Wednesday', 'weekday_3': 'Thursday...
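To try the weekday encoding in isolation, here is a minimal self-contained sketch. Only the column name 'Date first seen' comes from the text above; the three sample rows, the 'Duration' values, and the WEEKDAY_NAMES helper are made up for illustration. The sketch also converts the weekday to a categorical with all seven values before encoding. That step is an assumption on our part, not something the original code does: it guarantees that every day gets a column even if no flow was recorded on that day.

```python
import pandas as pd

# Toy stand-in for the CIDDS-001 flows (hypothetical values).
df = pd.DataFrame({
    "Date first seen": pd.to_datetime([
        "2017-03-13 08:00:00",  # a Monday
        "2017-03-14 09:30:00",  # a Tuesday
        "2017-03-19 23:59:00",  # a Sunday
    ]),
    "Duration": [0.5, 1.2, 0.1],
})

# Hypothetical helper: pandas numbers weekdays 0 (Monday) to 6 (Sunday).
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Extract the day of the week from the timestamp.
df["weekday"] = df["Date first seen"].dt.weekday

# Assumption: declare all seven categories so that get_dummies emits
# a column for every day, including days absent from the data.
df["weekday"] = pd.Categorical(df["weekday"], categories=list(range(7)))

# One-hot encode, then rename weekday_0..weekday_6 to readable names.
df = pd.get_dummies(df, columns=["weekday"]).rename(
    columns={f"weekday_{i}": name for i, name in enumerate(WEEKDAY_NAMES)}
)

print(df.columns.tolist())
```

Without the categorical step, get_dummies only creates columns for the weekdays that actually occur in the data, so the set of columns can differ between, say, a training file and a test file.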