Practical considerations for handling time series data
When working with time series data, respecting the temporal order of observations is crucial. Unlike typical machine learning tasks, where data can be randomly shuffled, time series data must be split sequentially, so that the training set contains earlier observations and the test set contains more recent ones. A random shuffle would let the model train on observations that come after the ones it is tested on, leaking future information into training. A sequential split ensures the model is evaluated as it would be in production, where future values are predicted from past patterns. Here, we split the dataset into an 80% training set and a 20% test set:
- Split the data into training and testing sets (80% training, 20% test):
    train_size = int(len(lagged_data) * 0.8)
    train_data = lagged_data[:train_size]
    test_data = lagged_data[train_size:]
- With our features set up in this way, the Value column is our target (what we aim to predict), and the other columns (including lagged features) serve as the input features:
    X_train = train_data.drop('Value', axis=1)
    y_train...
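To see the whole step run end to end, here is a minimal, self-contained sketch. It makes several assumptions that are not part of the original data: the toy daily series, the lag_1 and lag_2 column names, and the way the target is pulled out of the Value column are illustrative only, and your own lagged_data will look different.

```python
import pandas as pd

# Hypothetical stand-in for lagged_data: 100 daily values plus two lag features.
dates = pd.date_range("2024-01-01", periods=100, freq="D")
lagged_data = pd.DataFrame({"Value": [float(i) for i in range(100)]}, index=dates)
lagged_data["lag_1"] = lagged_data["Value"].shift(1)
lagged_data["lag_2"] = lagged_data["Value"].shift(2)
lagged_data = lagged_data.dropna()  # the first two rows lack a full lag history

# Chronological 80/20 split: no shuffling, earlier rows train, later rows test.
train_size = int(len(lagged_data) * 0.8)
train_data = lagged_data.iloc[:train_size]
test_data = lagged_data.iloc[train_size:]

# Sanity check: every training timestamp precedes every test timestamp.
assert train_data.index.max() < test_data.index.min()

# Assumed feature/target separation: Value is the target, the rest are inputs.
X_train, y_train = train_data.drop("Value", axis=1), train_data["Value"]
X_test, y_test = test_data.drop("Value", axis=1), test_data["Value"]

print(len(train_data), len(test_data))
```

The assertion on the index is a cheap guard worth keeping in real pipelines: if the data were ever shuffled or left unsorted before the split, it would fail immediately instead of silently leaking future observations into training.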