Predicting High Electrical Grid Stress Days in France

Alexis VANNSON

The French electrical grid operator RTE (Réseau de Transport d'Électricité) uses a system of PP1 and PP2 days to signal periods of high stress on the electrical grid. These designations typically occur during intense cold spells or peak consumption periods, helping to manage grid stability through specific incentives for reducing energy consumption. This analysis explores the development of a random forest model to predict these critical days using weather data.

Advantages of Random Forest for This Context

Random Forest is particularly well-suited for this context due to its ability to handle non-linear relationships, naturally capture feature interactions, and manage noise and outliers. By constructing multiple decision trees, the model can learn complex decision boundaries without requiring extensive feature engineering. For instance, it can identify how different temperature thresholds interact under various conditions, making it highly effective for weather-related predictions.

Robustness to noise and outliers is another key advantage. Because Random Forest uses bootstrapping to train each tree on a slightly different subset of the data, no single noisy observation can dominate the learning process. The model's random feature selection at each split further reduces the risk of overfitting to noise, since no single variable can dominate the ensemble's splits.

Furthermore, Random Forest provides valuable insights through feature importance measures. These allow us to determine which weather parameters are most predictive of PP1/PP2 days, validate the model’s reasoning against domain knowledge, and potentially refine future iterations by selecting the most relevant variables.
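As a quick illustration, here is a minimal sketch of how those importances might be read off a fitted model; the names model_pp1 and X_train are assumed from the training section later in this post.

import pandas as pd

# Rank weather features by how much they contribute to the forest's splits
importances = pd.Series(model_pp1.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))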

Why Not Deep Learning?

Deep learning is not the best choice for this problem primarily due to the limited dataset size. Neural networks typically require large amounts of data to generalize well, and with a small dataset, they are more likely to overfit rather than produce reliable predictions.

Additionally, interpretability is a key concern. Random Forest provides clear insights into feature importance, helping us understand which variables drive predictions. In contrast, deep learning models function as black boxes, making it difficult to trace how individual features influence the outcome.

Overview of the Data Pipeline

The preprocessing pipeline transforms raw weather and temporal data into a format optimized for machine learning. Let's walk through each step of this transformation process and understand why each decision was made.

Temporal Feature Engineering

The datetime processing is particularly sophisticated. Rather than just dropping the raw dates, we extract and transform temporal information in ways that preserve its predictive power:

import numpy as np

# Basic calendar components from the parsed date column
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Trigonometric encoding for cyclical features
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

The trigonometric encoding keeps the cyclical nature of days and months intact. This means the distance between December and January is the same as between January and February, unlike when months are simply encoded as integers from 1 to 12. This is important because PP1/PP2 days show strong seasonal patterns.
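To see this concretely, here is a small check (a sketch reusing the encoding above) showing that December and January land as close together on the encoded circle as January and February:

import numpy as np

def month_encoding(m):
    # Map a month number to its (sin, cos) point on the unit circle
    angle = 2 * np.pi * m / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distance on the circle: December-January equals January-February
print(np.linalg.norm(month_encoding(12) - month_encoding(1)))  # ~0.518
print(np.linalg.norm(month_encoding(1) - month_encoding(2)))   # ~0.518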

Handling Sunrise and Sunset Times

Daylight hours can significantly impact energy consumption patterns. The code processes sunrise and sunset times in a particularly clever way:

# Parse the raw timestamps, then express each as seconds since midnight
df['sunrise'] = pd.to_datetime(df['sunrise'])
df['sunset'] = pd.to_datetime(df['sunset'])
df['sunrise_seconds'] = df['sunrise'].dt.hour * 3600 + df['sunrise'].dt.minute * 60 + df['sunrise'].dt.second
df['sunset_seconds'] = df['sunset'].dt.hour * 3600 + df['sunset'].dt.minute * 60 + df['sunset'].dt.second

By converting these times to seconds since midnight, we create continuous numerical features that preserve the exact timing information while being more suitable for machine learning algorithms.
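Because both features now live on the same seconds-since-midnight scale, a derived daylight-duration feature is one natural extension; this line is a hedged sketch, not part of the original pipeline:

# Daylight duration in seconds; short winter days coincide with peak heating demand
df['daylight_seconds'] = df['sunset_seconds'] - df['sunrise_seconds']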

Seasonal Filtering

A key preprocessing decision is the filtering of months to focus on the winter period:

df = df[df['month'].isin([1, 2, 3, 11, 12])]

This filter reflects the domain knowledge that PP1/PP2 days typically occur during colder months when energy demand is highest. This focused approach helps the model learn patterns specific to high-risk periods rather than diluting its learning with irrelevant summer data.

Categorical Data Handling

Weather conditions come as categorical data and need special treatment:

df = pd.get_dummies(df, columns=['conditions'], drop_first=True)

The one-hot encoding transforms categorical weather conditions into binary columns, making them suitable for machine learning while preserving their predictive power. The drop_first=True parameter helps avoid multicollinearity by omitting one category as a reference level.
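For a concrete picture of what get_dummies produces, here is a toy example with made-up condition values:

import pandas as pd

# Toy frame with three hypothetical weather conditions
toy = pd.DataFrame({'conditions': ['Clear', 'Rain', 'Snow']})
encoded = pd.get_dummies(toy, columns=['conditions'], drop_first=True)
print(encoded.columns.tolist())  # ['conditions_Rain', 'conditions_Snow']
# 'Clear' is the dropped reference level: an all-zero row means 'Clear'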

Feature Selection and Cleanup

The pipeline removes several types of columns (a sketch of the drop step follows the list):

  • Administrative fields like station identifiers

  • Redundant information after feature engineering (raw dates)

  • Low-value predictors like severe risk indicators

  • Text descriptions that can't be directly used for prediction
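A minimal sketch of that cleanup step; the column names below are hypothetical stand-ins, since the exact schema depends on the weather data source:

# Hypothetical column names: adjust to match the actual weather dataset
cols_to_drop = ['name', 'stations', 'date', 'severerisk', 'description']
df = df.drop(columns=cols_to_drop, errors='ignore')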

Dataset Splitting and Training

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One split call keeps X and both targets aligned row-for-row
X_train, X_test, y_pp1_train, y_pp1_test, y_pp2_train, y_pp2_test = train_test_split(
    X, y_pp1, y_pp2, test_size=0.2, random_state=42)

# Train one forest per target
model_pp1 = RandomForestClassifier(random_state=42).fit(X_train, y_pp1_train)
model_pp2 = RandomForestClassifier(random_state=42).fit(X_train, y_pp2_train)

The dataset is split into training and test sets so that the models are evaluated on unseen data, which gives a fair estimate of their generalization performance. We allocate 80% of the data for training and 20% for testing using train_test_split. Since there are two target variables (y_pp1 and y_pp2), both are passed to the same train_test_split call, which keeps the input features (X) and the two sets of labels aligned row by row.

After splitting, two separate Random Forest models are trained: one for predicting y_pp1 and another for y_pp2. This allows each model to focus on learning the patterns specific to its respective target variable. By setting random_state=42, we ensure that the data splits remain the same across runs, making the results reproducible.

An alternative approach would be to use a multi-output classification model, which would train a single Random Forest to predict both y_pp1 and y_pp2 simultaneously.
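A sketch of that alternative, using scikit-learn's built-in multi-output support for random forests (assuming y_pp1_train and y_pp2_train are the aligned label arrays from the split above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stack both targets into one (n_samples, 2) label matrix
Y_train = np.column_stack([y_pp1_train, y_pp2_train])

# scikit-learn random forests accept multi-output targets natively
multi_model = RandomForestClassifier(random_state=42).fit(X_train, Y_train)
pp1_pred, pp2_pred = multi_model.predict(X_test).T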

Evaluation

The PP1/PP2 prediction models are evaluated with a confusion matrix and the F1-score. The F1-score is particularly appropriate here: PP1/PP2 days are rare, so raw accuracy could look impressive for a model that simply predicts "no alert" every day.
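A minimal evaluation sketch for the PP1 model; the same two calls apply to model_pp2:

from sklearn.metrics import confusion_matrix, f1_score

y_pp1_pred = model_pp1.predict(X_test)
print(confusion_matrix(y_pp1_test, y_pp1_pred))
print(f1_score(y_pp1_test, y_pp1_pred))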

Finally, we save both models so they can be deployed later with Streamlit.

import joblib

joblib.dump(model_pp1, 'model_pp1.joblib')
joblib.dump(model_pp2, 'model_pp2.joblib')
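On the deployment side, here is a minimal sketch of how the saved models might be loaded inside a Streamlit app; the app structure is illustrative, not the original code:

import joblib
import streamlit as st

@st.cache_resource  # load each model only once per app session
def load_models():
    return joblib.load('model_pp1.joblib'), joblib.load('model_pp2.joblib')

model_pp1, model_pp2 = load_models()
st.title('PP1/PP2 Day Prediction')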

In Conclusion

We've built a Random Forest model to predict high-stress days for France's electrical grid, focusing on PP1 and PP2 days. By tapping into weather data and applying smart preprocessing, our model effectively flags these critical days, helping maintain grid stability. Opting for Random Forest over deep learning ensures our model remains both interpretable and robust, which suits our dataset's size and complexity. This project underscores the value of thoughtful feature engineering, highlighting how machine learning can play a pivotal role in sustainable energy management.
