import pandas as pd
import numpy as np
# functions from sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# data viz libraries
import matplotlib.pyplot as plt
import seaborn as sns
Link to GitHub repository
Context
This project was completed for my Machine Learning class, taken as part of my Master’s program at UC Santa Barbara. Provided with data and questions, I carried out this analysis using appropriate machine learning techniques.
Question
How can machine learning models, specifically random forest and stochastic gradient boosted tree models, be used to predict Dissolved Inorganic Carbon (DIC) in California coastal ecosystems based on ocean chemistry features?
Analysis Summary
I used data from a marine ecosystem research program to build three models (single decision tree, random forest, and stochastic gradient boosted trees) that predict dissolved inorganic carbon (DIC) based on other ocean chemistry features (e.g., silicate concentration) that were also measured during water sampling. I then developed visualizations comparing root mean squared error (RMSE) among the three models and analyzing feature importances in the best performing model.
Data
Our data set comes from the California Cooperative Oceanic Fisheries Investigations (CalCOFI), an oceanographic and marine ecosystem research program located in California (link to data set source). All water samples were taken off the California coast.
Setup
Data import & basic pre-processing
# import training data
df = pd.read_csv('data/train.csv')
# inspect data
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1454 entries, 0 to 1453
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1454 non-null int64
1 Lat_Dec 1454 non-null float64
2 Lon_Dec 1454 non-null float64
3 NO2uM 1454 non-null float64
4 NO3uM 1454 non-null float64
5 NH3uM 1454 non-null float64
6 R_TEMP 1454 non-null float64
7 R_Depth 1454 non-null int64
8 R_Sal 1454 non-null float64
9 R_DYNHT 1454 non-null float64
10 R_Nuts 1454 non-null float64
11 R_Oxy_micromol.Kg 1454 non-null float64
12 Unnamed: 12 0 non-null float64
13 PO4uM 1454 non-null float64
14 SiO3uM 1454 non-null float64
15 TA1.x 1454 non-null float64
16 Salinity1 1454 non-null float64
17 Temperature_degC 1454 non-null float64
18 DIC 1454 non-null float64
dtypes: float64(17), int64(2)
memory usage: 216.0 KB
None
# remove 'id' and 'Unnamed: 12' columns
df = df.drop(df.columns[[0, 12]], axis=1)
Data exploration & additional pre-processing
# heatmap of correlation between features
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()
I’ll use this correlation heatmap to conduct feature selection, removing features that are highly correlated with other features. Specifically, I’ll remove ‘Temperature_degC’ (correlation with ‘R_TEMP’ is 1) and ‘R_Nuts’ (correlation with ‘NH3uM’ is 1).
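As a quick sanity check on this selection, here’s a minimal sketch of doing the same thing programmatically (an illustrative addition, not part of the original workflow; the 0.95 cutoff is an arbitrary threshold chosen for the example). It lists every feature pair whose absolute correlation exceeds the cutoff.
# illustrative sketch: list feature pairs with |correlation| above an arbitrary cutoff
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)) # keep each pair once
high_corr_pairs = (upper.stack() # (feature_i, feature_j) -> |correlation|
                        .loc[lambda s: s > 0.95] # arbitrary illustrative cutoff
                        .sort_values(ascending=False))
print(high_corr_pairs)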
# define highly correlated features to remove when creating feature matrix
columns_to_remove = ['Temperature_degC', 'R_Nuts']

# define feature matrix and target vector
X = df.drop(['DIC'] + columns_to_remove, axis=1)
y = df['DIC']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# check shapes of training and testing sets
print(f'X_train : {X_train.shape}')
print(f'y_train : {y_train.shape}')
print(f'X_test : {X_test.shape}')
print(f'y_test : {y_test.shape}')
X_train : (1163, 14)
y_train : (1163,)
X_test : (291, 14)
y_test : (291,)
Model using single decision tree
For our first model, we’ll build just a single decision tree. A decision tree generates predictions by asking a sequence of simple yes-or-no questions about the features. Which question to ask at each split is determined by the partitioning objective. For our partitioning objective, we’ll minimize mean squared error (MSE), the most common objective for regression tasks: each split is chosen so that the observations in the resulting groups are as close as possible to their group’s mean. After we build all three models, we’ll compare them based on root mean squared error (RMSE) on the held-out test set.
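To make the partitioning objective concrete, here’s a small sketch with made-up numbers (purely illustrative, not taken from the CalCOFI data): for a handful of candidate thresholds on a single feature, the tree prefers the split that minimizes the weighted MSE of the two resulting child nodes.
# illustrative sketch with hypothetical values (not from the actual data set)
y_toy = np.array([2150., 2160., 2200., 2210., 2230.]) # hypothetical DIC values
x_toy = np.array([10., 12., 30., 32., 35.]) # hypothetical feature values

def split_mse(x, y, threshold):
    """Weighted MSE of the two child nodes created by splitting at x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    node_mse = lambda v: ((v - v.mean()) ** 2).mean() if len(v) else 0.0
    return (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)

# the tree picks the threshold with the lowest weighted child MSE
for t in [11., 20., 31.]:
    print(f'split at {t}: weighted MSE = {split_mse(x_toy, y_toy, t):.1f}')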
# define tree model
tree_regressor = DecisionTreeRegressor()
# create tuning grid for hyperparameters
param_grid = {
    'decisiontreeregressor__max_depth': [None, 10, 20, 30, 40, 50], # max depth of tree
    'decisiontreeregressor__min_samples_split': [2, 10, 20], # min number of samples required to split an internal node
    'decisiontreeregressor__min_samples_leaf': [1, 5, 10] # min number of samples required to be at a leaf node
}
# setup for Pipeline and GridSearchCV
pipeline = Pipeline([
    ('decisiontreeregressor', tree_regressor)
])
grid_search = GridSearchCV(pipeline,
                           param_grid,
                           cv=5, # split data into 5 folds for CV
                           scoring='neg_mean_squared_error') # determine best version of model based on MSE
# fit model to training data and extract best version
grid_search.fit(X_train, y_train)
best_tree = grid_search.best_estimator_

# predict testing data
y_pred = best_tree.predict(X_test)
# show RMSE and best parameters
mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(mse)
print(f'RMSE: {tree_rmse:.2f}')
print(f'Best parameters: {grid_search.best_params_}')
RMSE: 8.21
Best parameters: {'decisiontreeregressor__max_depth': None, 'decisiontreeregressor__min_samples_leaf': 5, 'decisiontreeregressor__min_samples_split': 10}
Model using random forest
Random forest models build a large collection of de-correlated trees to improve predictive performance. They introduce an additional hyperparameter, often called mtry, for the number of features considered as split candidates at each node. Because each split in each tree only looks at a randomized subset of the features, the trees end up far less correlated with one another than they would be if every tree saw every feature. Averaging many of these de-correlated trees reduces variance, which lets us grow many deep trees without badly overfitting the training data. Because the model combines many trees grown on randomized subsets of features (and bootstrapped samples of the data), it is referred to as a random forest.
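As a side note, scikit-learn exposes the mtry idea through the max_features argument of RandomForestRegressor. The tuning grid below doesn’t vary it, but here’s a minimal sketch of setting it explicitly (an illustrative addition, assuming the X_train/X_test split from above):
# illustrative sketch: mtry corresponds to max_features in scikit-learn
rf_sqrt = RandomForestRegressor(n_estimators=100,
                                max_features='sqrt', # consider sqrt(n_features) candidates per split
                                random_state=123)
rf_sqrt.fit(X_train, y_train)
sqrt_rmse = np.sqrt(mean_squared_error(y_test, rf_sqrt.predict(X_test)))
print(f'Test RMSE with max_features="sqrt": {sqrt_rmse:.2f}')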
# define RF model
random_forest_regressor = RandomForestRegressor()
# create tuning grid for hyperparameters
param_grid = {
    'randomforestregressor__n_estimators': [100], # number of trees in forest
    'randomforestregressor__max_depth': [30], # max depth of each tree
    'randomforestregressor__min_samples_split': [10], # min number of samples required to split an internal node
    'randomforestregressor__min_samples_leaf': [5] # min number of samples required to be at a leaf node
}
# setup for Pipeline and GridSearchCV
pipeline = Pipeline([
    ('randomforestregressor', random_forest_regressor)
])
grid_search = GridSearchCV(pipeline,
                           param_grid,
                           cv=5, # split data into 5 folds for CV
                           scoring='neg_mean_squared_error') # determine best version of model based on MSE
# fit model to training data and extract best version
grid_search.fit(X_train, y_train)
best_forest = grid_search.best_estimator_

# predict testing data
y_pred = best_forest.predict(X_test)
# show RMSE and best parameters
mse = mean_squared_error(y_test, y_pred)
rf_rmse = np.sqrt(mse)
print(f'RMSE: {rf_rmse:.2f}')
print(f'Best parameters: {grid_search.best_params_}')
RMSE: 6.86
Best parameters: {'randomforestregressor__max_depth': 30, 'randomforestregressor__min_samples_leaf': 5, 'randomforestregressor__min_samples_split': 10, 'randomforestregressor__n_estimators': 100}
Model using stochastic gradient boosted trees
Boosting is a general strategy that is often applied to decision tree models to improve predictive performance. Rather than averaging many independently grown trees, boosted models are built sequentially: each new tree is fit to the residuals of the current ensemble, so every stage focuses on the errors the model is still making.
Boosted tree models use a large number of shallow decision trees as the base learner. Each shallow tree is a “weak learner” on its own, but adding these trees together stage by stage produces a strong overall model.
In stochastic gradient boosted (SGB) tree models, each stage is a gradient-descent-style step on the chosen loss function (in this case, MSE), and the contribution of each new tree is scaled by the learning rate; the “stochastic” part refers to fitting each stage on a random subsample of the training data, which adds randomness that helps guard against overfitting. If we choose too low a learning rate, the model may need a very large number of stages to make meaningful progress; if we choose one that is too high, each stage can overshoot and the model may settle on a worse solution.
Another important thing to note is how n_estimators changes meaning. In the random forest, it counts independent trees whose predictions are averaged, so adding trees mainly stabilizes the average. In the SGB model, it counts boosting stages, i.e., how many iterations of the sequential fitting algorithm to perform; each stage still adds one shallow tree, but because every stage corrects the previous ensemble, adding more stages keeps changing the model and can eventually lead to overfitting.
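To visualize the learning-rate trade-off, here’s a sketch (an illustrative addition, not part of the original analysis; the stage count and tree depth are arbitrary) that uses scikit-learn’s staged_predict to track test RMSE after each boosting stage for two learning rates:
# illustrative sketch: test RMSE at each boosting stage for two learning rates
for lr in [0.01, 0.1]:
    sgb_demo = GradientBoostingRegressor(n_estimators=500, learning_rate=lr,
                                         max_depth=3, random_state=123)
    sgb_demo.fit(X_train, y_train)
    stage_rmse = [np.sqrt(mean_squared_error(y_test, pred))
                  for pred in sgb_demo.staged_predict(X_test)]
    plt.plot(range(1, len(stage_rmse) + 1), stage_rmse, label=f'learning rate = {lr}')
plt.xlabel('Boosting stage')
plt.ylabel('Test RMSE')
plt.title('Effect of learning rate on SGB test error')
plt.legend()
plt.show()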
# define SGB model
gradient_boosting_regressor = GradientBoostingRegressor()
# create tuning grid for hyperparameters
param_grid = {
    'gradientboostingregressor__n_estimators': [3000], # number of boosting stages to perform
    'gradientboostingregressor__learning_rate': [0.01, 0.1], # learning rate
    'gradientboostingregressor__max_depth': [10], # max depth of individual regression estimators
    'gradientboostingregressor__min_samples_split': [10], # min number of samples required to split an internal node
    'gradientboostingregressor__min_samples_leaf': [5] # min number of samples required to be at a leaf node
}
# setup for Pipeline and GridSearchCV
pipeline = Pipeline([
    ('gradientboostingregressor', gradient_boosting_regressor)
])
grid_search = GridSearchCV(pipeline,
                           param_grid,
                           cv=5, # split data into 5 folds for CV
                           scoring='neg_mean_squared_error') # determine best version of model based on MSE
# fit model to training data and extract best version
grid_search.fit(X_train, y_train)
best_gradient_boosting = grid_search.best_estimator_

# predict testing data
y_pred = best_gradient_boosting.predict(X_test)
# show RMSE and best parameters
mse = mean_squared_error(y_test, y_pred)
sgb_rmse = np.sqrt(mse)
print(f'RMSE: {sgb_rmse:.2f}')
print(f'Best parameters: {grid_search.best_params_}')
RMSE: 6.91
Best parameters: {'gradientboostingregressor__learning_rate': 0.1, 'gradientboostingregressor__max_depth': 10, 'gradientboostingregressor__min_samples_leaf': 5, 'gradientboostingregressor__min_samples_split': 10, 'gradientboostingregressor__n_estimators': 3000}
Compare models
# create lists of RMSE values and models
rmse_values = [sgb_rmse, rf_rmse, tree_rmse]
models = ['SG Boosted Trees\n(n=3000)', 'Random Forest\n(n=100)', 'Single Tree\n(n=1)']

# create bar plot
plt.figure(figsize=(10, 6))
bars = plt.bar(models, rmse_values, color=['cornflowerblue'])
plt.title('Model Comparison')
plt.xlabel('Models')
plt.ylabel('RMSE')

# add value labels on top of each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 3), ha='center', va='bottom')

plt.show()
Our best performing model was the random forest model, with an RMSE of 6.856. In plain terms, an RMSE of 6.856 means that the model’s predictions typically deviate from the actual DIC values by roughly 6.856 micromoles per kilogram.
Compare feature importance
# extract importance of each feature from best version of model
importances = best_forest.named_steps['randomforestregressor'].feature_importances_

# create df to hold feature names and their importance
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
})

# sort by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# create plot
plt.figure(figsize=(12, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances in RF Model')
plt.grid(True, which='both', axis='x', linestyle='--', linewidth=0.5)
plt.show()
Here, we see that the most important feature for predicting DIC in the random forest model was ‘SiO3uM’, with a feature importance over 0.70. This was substantially higher than the second most important feature, ‘PO4uM’, whose importance was close to 0.2.
Citation
@online{ghanadan2024,
author = {Ghanadan, Linus},
title = {Regression {Models} to Predict {Dissolved} {Inorganic}
{Carbon} in {California} {Coastal} {Ecosystems}},
date = {2024-04-03},
url = {https://linusghanadan.github.io/blog/2023-4-3-post/dic_ml.html},
langid = {en}
}