Time Series Analysis of Nutrient Concentration in Chesapeake Bay

My favorite data science project from graduate school!
Statistical Modeling
Python
Author

Linus Ghanadan

Published

January 20, 2024

Context

For this final project in my master’s statistics course at UC Santa Barbara, I worked independently to find data, pose a question, and carry out analysis using statistical modeling and Python.

Central Question

Over the ten years that followed the onset of new regulations in 2010, what linear trends best capture how average nitrogen and phosphorus concentrations have changed in the Chespeake Bay?

Summary of Analysis

Proposed statistical question on possible underlying linear trends in average nitrogen and phosphorus concentrations across the Chesapeake Bay since 2010 (when new Clean Water Act regulations were implemented for the Bay) and found appropriate data for answering the question (over 43,000 samples from the Bay’s tidal regions). Constructed two Seasonal-Trend using LOESS (STL) decomposition models to conduct time series analysis of nitrogen and phosphorus concentrations (selected length of seasons based on autocorrelation). For each pollutant, visualized model parameters comparatively and ran regressions of parameters (to determine proportion of variation attributable to seasonality and 95% confidence interval for 10-year linear trend component).

Introduction

The Chesapeake Bay is the largest estuary in the United States, and the largest body of water that is regulated under the Clean Water Act (U.S. Environmental Protection Agency 2023). Federal regulation pertaining to the Bay took decades to implement, and this is in large part because of the Bay’s large size and the many stakeholders involved. In the 1970s, the Bay was identified as one of the first areas in the world to have a marine dead zone, a phenomenon that literally suffocates aquatic life due to lack of oxygen in the water. Despite dead zones being identified in the 1970s, it was not until 2000 that the Bay was designated as an “impaired water” under the Clean Water Act. Then, it took another ten years, until 2010, for the EPA to take the next step of issuing Total Maximum Daily Load (TMDL) requirements, the regulatory action mandating water quality improvement.

Specifically, a TMDL is the maximum amount of a particular pollutant that a body of water can receive and still meet applicable water quality standards (U.S. Environmental Protection Agency 2023). This maximum amount is calculated in pounds based on measurements taken at areas where pollution is likely to end up in the Bay. In their 2010 regulation, the EPA established TMDL requirements for nitrogen, phosphorus, and sediment. Nitrogen and phosphorus, referred to as nutrients because of their role in providing nutrition to many animals and plants, cause algal blooms, which cause marine dead zones through taking in dissolved oxygen and blocking sunlight. Sediment contributes to dead zones by blocking sunlight as well, leading it to also be included in the 2010 TMDL requirements.

This analysis will focus on nitrogen and phosphorus, the two pollutants responsible for algal blooms in the Chesapeake Bay. A 2022 study found that agricultural runoff was the largest source of nutrient pollution, accounting for 48% of nitrogen and 27% of phosphorus in the Chesapeake Bay (Chesapeake Progress, n.d.). Both of these pollutants also get to the Bay as a result of urban and suburban runoff, wastewater treatment plants releasing treated water, and natural sources (e.g., runoff from forests, wetlands, etc.). In addition, about 25% of nitrogen that ends up in the Bay comes from air pollution that is originally emitted to the atmosphere by sources such as cars and factories (Burns et al. 2021. Through a process called atmospheric deposition, these nitrogen compounds react with other chemicals to become nitrous oxides, which can be deposited back to Earth’s surface through precipitation or as dry deposition.

Through conducting a time series analysis of post-2010 nitrogen and phosphorus concentration measurements, my goal is to better understand how concentrations have changed since the introduction of TMDL requirements. I’m also interested in the nature of any seasonality and whether the three time series components (i.e., seasonal, trend, and residual) are consistent across both nitrogen and phosphorus.

Data

Yearly water quality data on the Chesapeake Bay’s tidal and non-tidal regions going back to 1984 is publicly available on the Chesapeake Bay Program (CBP) DataHub (Chesapeake Bay Program DataHub, n.d.). Data is organized into either Tier 1, 2, or 3 depending on how it was collected. While Tier 1 and 2 data can be collected by any interested group, Tier 3 data is collected by monitoring stations overseen by experienced professionals. Only Tier 3 data can be used for governmental regulatory assessments.

For my analysis, I will be using 2010 to 2019 Tier 3 data collected at 143 different monitoring stations positioned throughout the Chesapeake Bay tidal regions, which includes the mainstem Bay and tributary components. Across the 10 years that we are looking at, we’ll have more than 43,800 nitrogen observations and more than 43,500 phosphorus observations.

Below, we import the Python libraries used in this analysis. Then, we read in the yearly water quality data using their CBP DataHub URL. We also process the data by creating separate data.frames for total nitrogen and phosphorus measurements.

import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics libraries
import statsmodels.api as sm
from statsmodels.tsa.seasonal import STL
from statsmodels.regression.linear_model import OLS
from statsmodels.iolib.summary import Summary
from scipy import stats

# Other libraries
import requests
from io import BytesIO
# Create a list of data URLs
excel_urls = [
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2019_CEDR_tidal_data_01jun21.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2018_CEDR_tidal_data_01jun21.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2017_CEDR_tidal_data_11oct18.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2016_CEDR_tidal_data_15jun17.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2015_CEDR_tidal_data_15jun17.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2014_CEDR_tidal_data_15jun17.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2013_CEDR_tidal_data_15jun17.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2012_CEDR_tidal_data_15jun17.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2011_CEDR_tidal_data_15jun17.xlsx',
    'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2010_CEDR_tidal_data_15jun17.xlsx'
]

# Create an empty list to store data frames
dfs = []

# Loop through each URL, read the Excel file directly into pandas, and append to list of data frames
for url in excel_urls:
    # Get the content of the Excel file
    response = requests.get(url)
    
    # Read the Excel file directly from the content
    wq_data = pd.read_excel(BytesIO(response.content), sheet_name=0)
    dfs.append(wq_data)

# Combine all data frames into a single data frame
wq_data_combined = pd.concat(dfs, ignore_index=True)
# Wrangle data for relevant column variables, filter for TN (total nitrogen), and remove Parameter column
nitr_data = wq_data_combined[["SampleDate", "Parameter", "MeasureValue", "Latitude", "Longitude"]]
nitr_data = nitr_data[nitr_data["Parameter"] == "TN"]
nitr_data.drop(columns=['Parameter'], inplace=True)

nitr_data.info()

# Wrangle data for relevant column variables, and filter for TP (total phosphorus)
phos_data = wq_data_combined[["SampleDate", "Parameter", "MeasureValue", "Latitude", "Longitude"]]
phos_data = phos_data[phos_data["Parameter"] == "TP"]
phos_data.drop(columns=['Parameter'], inplace=True)

phos_data.info()

# Remove from environment
del wq_data_combined
<class 'pandas.core.frame.DataFrame'>
Index: 43809 entries, 56 to 2484389
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   SampleDate    43809 non-null  datetime64[ns]
 1   MeasureValue  43809 non-null  float64       
 2   Latitude      43809 non-null  float64       
 3   Longitude     43809 non-null  float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.7 MB
<class 'pandas.core.frame.DataFrame'>
Index: 43590 entries, 60 to 2484391
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   SampleDate    43590 non-null  datetime64[ns]
 1   MeasureValue  43590 non-null  float64       
 2   Latitude      43590 non-null  float64       
 3   Longitude     43590 non-null  float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.7 MB

Exploratory Analysis

We’ll start by briefly looking at average concentration by location using a heatmap (courtesy of the seaborn library).

# Nitrogen

# Bin latitude and longitude into groups
nitr_data_copy = pd.DataFrame(nitr_data)
nitr_data_copy['Latitude Group'] = pd.cut(nitr_data['Latitude'], bins=10)
nitr_data_copy['Longitude Group'] = pd.cut(nitr_data['Longitude'], bins=10)

# Pivot the data to create a heatmap
nitr_heatmap_data = nitr_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')

# Remove from environment
del nitr_data_copy

# Initialize the figure
plt.figure(figsize=(10, 8))

# Set style
plt.style.use('seaborn-whitegrid')

# Plot heatmap
sns.heatmap(nitr_heatmap_data, cmap='Reds')

# Customize the plot
plt.title('Heatmap of Mean Nitrogen Concentration (mg/L) by Location')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.tight_layout()
plt.show()

# Remove from environment
del nitr_heatmap_data
/tmp/ipykernel_17/2516891936.py:9: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  nitr_heatmap_data = nitr_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')
/tmp/ipykernel_17/2516891936.py:18: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-whitegrid')

# Phosphorus

# Bin latitude and longitude into groups
phos_data_copy = pd.DataFrame(phos_data)
phos_data_copy['Latitude Group'] = pd.cut(phos_data['Latitude'], bins=10)
phos_data_copy['Longitude Group'] = pd.cut(phos_data['Longitude'], bins=10)

# Pivot the data to create a heatmap
phos_heatmap_data = phos_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')

# Remove from environment
del phos_data_copy

# Initialize the figure
plt.figure(figsize=(10, 8))

# Set style
plt.style.use('seaborn-whitegrid')

# Plot heatmap
sns.heatmap(phos_heatmap_data, cmap='Reds')

# Customize the plot
plt.title('Heatmap of Mean Phosphorus Concentration (mg/L) by Location')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.tight_layout()
plt.show()

# Remove from environment
del phos_heatmap_data
/tmp/ipykernel_17/190755117.py:9: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  phos_heatmap_data = phos_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')
/tmp/ipykernel_17/190755117.py:18: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-whitegrid')

Interestingly, nitrogen and phosphorus concentration are both generally higher in the southeast regions of the Chesapeake Bay.

Moving forward, we’ll start with our time series analysis by calculating moving averages for each year-month. We’ll plot these moving averages just to get a general idea of what types of trends we might be looking at.

# Nitrogen

# Remove columns from the nitrogen data that are no longer needed
nitr_data.drop(columns=['Latitude', 'Longitude'], inplace=True)

# Set 'SampleDate' as the index
nitr_data.set_index('SampleDate', inplace=True)

# Resample to monthly frequency and calculate the mean for the 'MeasureValue' column
nitr_monthly_avg = nitr_data['MeasureValue'].resample('M').mean()

# Convert the result to a DataFrame, but keep the datetime index
nitr_monthly_avg = nitr_monthly_avg.to_frame()

# Remove from environment
del nitr_data

# Initialize figure
plt.figure(figsize=(12, 6))

# Plot monthly average nitrogen data
plt.plot(nitr_monthly_avg.index, nitr_monthly_avg['MeasureValue'], color='black', alpha=0.7)

# Customize the plot
plt.title('Monthly Average Concentration of Nitrogen Samples (2010-2019)')
plt.xlabel('Year')
plt.ylabel('Monthly Average Concentration (mg/L)')

# Rotate and align tick labels
plt.gcf().autofmt_xdate()

# Use tight layout to ensure everything fits without overlapping
plt.tight_layout()

plt.show()
/tmp/ipykernel_17/3811197196.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  nitr_monthly_avg = nitr_data['MeasureValue'].resample('M').mean()

From this plot, there does appear to be seasonality for nitrogen concentrations, but there is no clear trend component. In addition, there is a notable spike in early 2014.

# Phosphorus

# Remove columns from the nitrogen data that are no longer needed
phos_data.drop(columns=['Latitude', 'Longitude'], inplace=True)

# Set 'SampleDate' as the index
phos_data.set_index('SampleDate', inplace=True)

# Resample to monthly frequency and calculate the mean for the 'MeasureValue' column
phos_monthly_avg = phos_data['MeasureValue'].resample('M').mean()

# Convert the result to a DataFrame, but keep the datetime index
phos_monthly_avg = phos_monthly_avg.to_frame()

# Remove from environment
del phos_data

# Initialize figure
plt.figure(figsize=(12, 6))

# Plot monthly average phosphorus data
plt.plot(phos_monthly_avg.index, phos_monthly_avg['MeasureValue'], color='black', alpha=0.7)

# Customize the plot
plt.title('Monthly Average Concentration of Phosphorus Samples (2010-2019)')
plt.xlabel('Year')
plt.ylabel('Monthly Average Concentration (mg/L)')

# Rotate and align tick labels
plt.gcf().autofmt_xdate()

# Use tight layout to ensure everything fits without overlapping
plt.tight_layout()

plt.show()
/tmp/ipykernel_17/729465269.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  phos_monthly_avg = phos_data['MeasureValue'].resample('M').mean()

Similar to the nitrogen plot, phosphorus also seems to exhibit a distinct seasonal component. Again, it is unclear whether there is an underlying trend.

Methods

Autocorrelation Function

The autocorrelation function calculates the correlation between the dependent variable at a given point in time and various time lags for this same variable. Thus, the autocorrelation function provides us with a tool that allows us to better understand any seasonality present in our data. This will be useful for us in subsequent steps of our time series analysis.

STL Decomposition

As a core part of this time series analysis, I’ll be constructing a seasonal trend decomposition using locally estimated scatterplot smoothing (LOESS), which is often abbreviated as a STL decomposition model. STL allows us to separate our monthly average concentrations into three components: seasonal component, trend component, and residual component. Through plotting these components next to each other, we gain a more intuitive understanding of the underlying forces contributing to variation in our dependent variable, which in this case is the monthly average concentration.

As opposed to other decomposition methods, one thing that is particular to STL models is the ability to specify the length of a season. It can be helpful to adjust this input depending on our desired level of smoothing for the trend component. We will use our results from the autocorrelation function to inform our chosen length of seasons. The autocorrelation function is useful in this context because it can tell us after how many lags we see a drop-off in autocorrelation, indicating there is a drop in the significance of the seasonal component.

Results

Autocorrelation Function

# Nitrogen

# Plot autocorrelation function for nitrogen with lags going back three years
sm.graphics.tsa.plot_acf(nitr_monthly_avg['MeasureValue'], lags=36, alpha=0.05)

plt.xlabel('Lags (Months)')
plt.title('Autocorrelation Function for Nitrogen')
plt.show()

# Phosphorus

# Plot autocorrelation function for phosphorus with lags going back three years
sm.graphics.tsa.plot_acf(phos_monthly_avg['MeasureValue'], lags=36)

plt.xlabel('Lags (Months)')
plt.title('Autocorrelation Function for Phosphorus')
plt.show()

For both autocorrelation functions, there is a drop-off in autocorrelation after the t-24 lag, so I will choose 24 months as constituting one seasonal period when constructing my model in the next step.

STL Decomposition

# Nitrogen

# Perform STL decomposition
# Select trend parameter that is about halfway into the second season-period
nitr_stl = STL(nitr_monthly_avg['MeasureValue'], period=24, trend=35)
nitr_decomp = nitr_stl.fit()

nitr_decomp.plot()

# Remove from environment
del nitr_stl

In this plot of the three STL components for nitrogen, it is still difficult to see a long-term trend, despite the line being fairly smooth. There does seem to be a slight downward trend until 2018. However, starting in 2018, there is a clear increase for the rest of the time series.

# Phosphorus

# Perform STL decomposition
# Select trend parameter that is about halfway into the second season-period
phos_stl = STL(phos_monthly_avg['MeasureValue'], period=24, trend=35)
phos_decomp = phos_stl.fit()

phos_decomp.plot()

# Remove from environment
del phos_stl

The STL plot for phosphorus does make it seem like there is a long-term downward trend, but it is difficult to tell how significant it is compared to the other components.The STL plot for phosphorus does make it seem like there is a long-term downward trend, but it is difficult to tell how significant it is because of the long grey bar, which indicates it is least influential of the three components in our STL model.

# Nitrogen

# Extract the components
nitr_trend = nitr_decomp.trend
nitr_seasonal = nitr_decomp.seasonal
nitr_resid = nitr_decomp.resid

# Remove from environment
del nitr_decomp

# Calculate the seasonally adjusted series
nitr_seasonally_adjusted = nitr_monthly_avg['MeasureValue'].values - nitr_seasonal

# Create the plot
plt.figure(figsize=(16, 10))

# Plot all components on the same axis
plt.plot(nitr_monthly_avg.index, nitr_monthly_avg['MeasureValue'].values, label='Monthly Mean', color='black')
plt.plot(nitr_monthly_avg.index, nitr_seasonally_adjusted, label='Seasonal-Adjusted Monthly Mean', color='blue', linewidth=3)
plt.plot(nitr_monthly_avg.index, nitr_trend, label='Trend Component', color='red', linewidth=3)
plt.plot(nitr_monthly_avg.index, nitr_seasonal, label='Seasonal Component', color='green', linewidth=3)

# Customize x-axis for date breaks
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(10))  # Yearly major ticks
plt.gca().xaxis.set_minor_locator(plt.MaxNLocator(20))  # Half-yearly minor ticks
plt.gca().xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%Y-%m'))

# Set labels and title
plt.xlabel('Year-Month')
plt.ylabel('Concentration (mg/L)')
plt.title('Nitrogen in Chesapeake Bay (2010-2019)', fontsize=14)

# Customize legend
plt.legend(loc='upper right')

# Add a grid and theme
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()

# Show the plot
plt.show()

# Remove from environment
del nitr_seasonally_adjusted

I decided to make this visualization to get a better idea of exactly how these components map on to each other. This plot seems to confirm the idea of negligible trend component for nitrogen. Since the x-axis is labeled for each year here, it is also easier to see the seasonality. Each year, nitrogen concentrations increase sharply around December. They then peak around February to March, before decreasing substantially and reaching their minimum around July.

# Phosphorus

# Extract the components
phos_trend = phos_decomp.trend
phos_seasonal = phos_decomp.seasonal
phos_resid = phos_decomp.resid

# Remove from environment
del phos_decomp

# Calculate the seasonally adjusted series
phos_seasonally_adjusted = phos_monthly_avg['MeasureValue'].values - phos_seasonal

# Create the plot
plt.figure(figsize=(16, 10))

# Plot all components on the same axis
plt.plot(phos_monthly_avg.index, phos_monthly_avg['MeasureValue'].values, label='Monthly Mean', color='black')
plt.plot(nitr_monthly_avg.index, phos_seasonally_adjusted, label='Seasonal-Adjusted Monthly Mean', color='blue', linewidth=3)
plt.plot(phos_monthly_avg.index, phos_trend, label='Trend Component', color='red', linewidth=3)
plt.plot(phos_monthly_avg.index, phos_seasonal, label='Seasonal Component', color='green', linewidth=3)

# Customize x-axis for date breaks
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(10))  # Yearly major ticks
plt.gca().xaxis.set_minor_locator(plt.MaxNLocator(20))  # Half-yearly minor ticks
plt.gca().xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%Y-%m'))

# Set labels and title
plt.xlabel('Year-Month')
plt.ylabel('Concentration (mg/L)')
plt.title('Phosphorus in Chesapeake Bay (2010-2019)', fontsize=14)

# Customize legend
plt.legend(loc='upper right')

# Add a grid and theme
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()

# Show the plot
plt.show()

# Remove from environment
del phos_seasonally_adjusted

Our plot for phosphorus further supports the idea that there is a slight downward trend over the decade. If you independently trace the maximums or minimums, the line does seem to be moving downward at an oscillating but fairly consistent rate. Unlike nitrogen, phosphorus concentrations shoot up in the middle of the year around May, have a relatively flat peak lasting from June to August, and then shoot down at the end of Summer.

Linear Regressions of Model Parameters

# Nitrogen

# Run regression of season component on monthly average
X = sm.add_constant(nitr_seasonal)
y = nitr_monthly_avg['MeasureValue'].values
nitr_season_reg = OLS(y, X).fit()

# Print R-squared
r_squared = nitr_season_reg.rsquared
print(f"R-squared: {r_squared:.3f}")
R-squared: 0.725

The adjusted R-squared of 0.73 indicates that seasonality explains about 73% of the variation in nitrogen monthly mean.

# Phosphorus

# Run regression of season component on monthly average
X = sm.add_constant(phos_seasonal)
y = phos_monthly_avg['MeasureValue'].values
phos_season_reg = OLS(y, X).fit()

# Print R-squared
r_squared = phos_season_reg.rsquared
print(f"R-squared: {r_squared:.3f}")
R-squared: 0.867

For phosphorus, the adjusted R-squared of this regression is 0.87, confirming our idea that seasonality is more pronounced with phosphorus.

# Nitrogen

# Calculate months since start
start_date = nitr_monthly_avg.index[0]
nitr_monthly_avg['months_since_start'] = (nitr_monthly_avg.index.year - start_date.year) * 12 + (nitr_monthly_avg.index.month - start_date.month)

# Run regression of trend component on days since start
X = sm.add_constant(nitr_monthly_avg['months_since_start'])
y = nitr_trend
nitr_season_reg = OLS(y, X).fit()

# Print the regression summary
print(nitr_season_reg.summary())

# Get the coefficient (slope)
monthly_trend_estimate = nitr_season_reg.params['months_since_start']

# Get the standard error of the coefficient
se = nitr_season_reg.bse['months_since_start']

# Calculate the t-statistic for 95% confidence interval
# Degrees of freedom: n - k - 1, where n is number of observations and k is number of predictors
df = nitr_season_reg.df_resid
t_value = stats.t.ppf(0.975, df)  # 0.975 for two-tailed 95% CI

# Calculate estimate for 10-year aggregated trend component
fullseries_trend_estimate = 120 * monthly_trend_estimate

# Calculate the confidence interval for 10 year change (120 * monthly trend component)
ci_lower = 120 * (monthly_trend_estimate - t_value * se)
ci_upper = 120 * (monthly_trend_estimate + t_value * se)

# Print results
print(f"Estimated 10-year trend component (mg/L):\n{fullseries_trend_estimate:.3f}")
print(f"95% Confidence Interval for 10-year trend component (mg/L):\n({ci_lower:.3f}, {ci_upper:.3f})")

# Calculate estimate for 10-year aggregated trend component (as percent of January 2010)
fullseries_trend_estimate_aspercent = fullseries_trend_estimate / nitr_trend.values[0]

# Calculate the confidence interval for 10 year change (120 * monthly trend component) as percent of January 2010
ci_lower = ci_lower / nitr_trend.values[0]
ci_upper = ci_upper / nitr_trend.values[0]

# Print results
print(f"Estimated 10-year trend component (as percent of January 2010):\n{fullseries_trend_estimate_aspercent:.3f}")
print(f"95% Confidence Interval for 10-year trend component (as percent of January 2010):\n({ci_lower:.3f}, {ci_upper:.3f})")

# Remove from environment
del nitr_monthly_avg
del nitr_trend
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  trend   R-squared:                       0.046
Model:                            OLS   Adj. R-squared:                  0.038
Method:                 Least Squares   F-statistic:                     5.698
Date:                Tue, 10 Sep 2024   Prob (F-statistic):             0.0186
Time:                        06:46:33   Log-Likelihood:                 193.25
No. Observations:                 120   AIC:                            -382.5
Df Residuals:                     118   BIC:                            -376.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.9187      0.009    103.858      0.000       0.901       0.936
months_since_start    -0.0003      0.000     -2.387      0.019      -0.001   -5.23e-05
==============================================================================
Omnibus:                        3.998   Durbin-Watson:                   0.008
Prob(Omnibus):                  0.135   Jarque-Bera (JB):                3.239
Skew:                          -0.289   Prob(JB):                        0.198
Kurtosis:                       2.441   Cond. No.                         137.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Estimated 10-year trend component (mg/L):
-0.037
95% Confidence Interval for 10-year trend component (mg/L):
(-0.067, -0.006)
Estimated 10-year trend component (as percent of January 2010):
-0.039
95% Confidence Interval for 10-year trend component (as percent of January 2010):
(-0.071, -0.007)

In this linear regression, we look at the influence of year-month on our trend component for nitrogen. The regression output tells us that we can say at an alpha equals 0.05 significance level that the 10-year trend component was negative. Furthermore, the first interval tells us that there is a 95% chance that the interval from -0.067 mg/L to -0.006 mg/L contains the true 10-year trend component. The second interval shows that this represents a -0.071% to -0.007% change as compared to the trend component in January 2010.

# Phosphorus

# Calculate months since start
start_date = phos_monthly_avg.index[0]
phos_monthly_avg['months_since_start'] = (phos_monthly_avg.index.year - start_date.year) * 12 + (phos_monthly_avg.index.month - start_date.month)

# Run regression of trend component on days since start
X = sm.add_constant(phos_monthly_avg['months_since_start'])
y = phos_trend
phos_season_reg = OLS(y, X).fit()

# Print the regression summary
print(phos_season_reg.summary())

# Get the coefficient (slope)
monthly_trend_estimate = phos_season_reg.params['months_since_start']

# Get the standard error of the coefficient
se = phos_season_reg.bse['months_since_start']

# Calculate the t-statistic for 95% confidence interval
# Degrees of freedom: n - k - 1, where n is number of observations and k is number of predictors
df = phos_season_reg.df_resid
t_value = stats.t.ppf(0.975, df)  # 0.975 for two-tailed 95% CI

# Calculate estimate for 10-year aggregated trend component
fullseries_trend_estimate = 120 * monthly_trend_estimate

# Calculate the confidence interval for 10 year change (120 * monthly trend component)
ci_lower = 120 * (monthly_trend_estimate - t_value * se)
ci_upper = 120 * (monthly_trend_estimate + t_value * se)

# Print results
print(f"Estimated 10-year trend component (mg/L):\n{fullseries_trend_estimate:.5f}")
print(f"95% Confidence Interval for 10-year trend component (mg/L):\n({ci_lower:.5f}, {ci_upper:.5f})")

# Calculate estimate for 10-year aggregated trend component (as percent of January 2010)
fullseries_trend_estimate_aspercent = fullseries_trend_estimate / phos_trend.values[0]

# Calculate the confidence interval for 10 year change (120 * monthly trend component) as percent of January 2010
ci_lower = ci_lower / phos_trend.values[0]
ci_upper = ci_upper / phos_trend.values[0]

# Print results
print(f"Estimated 10-year trend component (as percent of January 2010):\n{fullseries_trend_estimate_aspercent:.3f}")
print(f"95% Confidence Interval for 10-year trend component (as percent of January 2010):\n({ci_lower:.3f}, {ci_upper:.3f})")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  trend   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.795
Method:                 Least Squares   F-statistic:                     461.4
Date:                Tue, 10 Sep 2024   Prob (F-statistic):           1.38e-42
Time:                        06:46:33   Log-Likelihood:                 639.74
No. Observations:                 120   AIC:                            -1275.
Df Residuals:                     118   BIC:                            -1270.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.0609      0.000    284.396      0.000       0.060       0.061
months_since_start -6.683e-05   3.11e-06    -21.480      0.000    -7.3e-05   -6.07e-05
==============================================================================
Omnibus:                       33.059   Durbin-Watson:                   0.016
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.540
Skew:                           0.051   Prob(JB):                       0.0380
Kurtosis:                       1.861   Cond. No.                         137.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Estimated 10-year trend component (mg/L):
-0.00802
95% Confidence Interval for 10-year trend component (mg/L):
(-0.00876, -0.00728)
Estimated 10-year trend component (as percent of January 2010):
-0.126
95% Confidence Interval for 10-year trend component (as percent of January 2010):
(-0.138, -0.115)

For phosphorus, the linear regression output tells us that we are confident at the alpha equals 0.05 level that the 10-year trend component was negative. Specifically, we find that there is a 95% chance that the interval between -0.0087 mg/L and -0.0073 mg/L contains the true 10-year trend component. This represents a -0.14% to -0.12% change as compared to the trend component in January 2010.

Conclusion

My time series analysis suggests that seasonality plays a substantial role in contributing to variation in the monthly mean concentration of nitrogen and phosphorus in tidal regions of the Chesapeake Bay. For nitrogen, seasonal trends explained 73% of the variation in monthly means, and the relationship was even stronger for phosphorus, with seasonal trends explaining 87% of the variation. While the seasonal component for nitrogen was highest during Winter, the seasonal component for phosphorus was highest during Summer.

This analysis was also interested in any non-seasonal trend that has occurred since the introduction of TMDL requirements in 2010. For both nitrogen and phosphorus, we find evidence at an alpha level of 0.05 that the 10-year trend component was negative. However, our confidence intervals suggest that, for both nutrient pollutants, these changes in trend represent a less than 0.2% decrease over the 10-year time series.

The main limitation of this analysis was that no form of spatial interpolation was employed to estimate concentrations across the tidal region based on the location of measurements. It would be interesting to compare such an analysis to what we did here, as any significant differences would imply that sampled areas are not spread throughout the region in a representative manner. Further analysis might also investigate what happened at the beginning of 2014 that could have led to the high spike in nitrogen levels at that time, in addition to factors that might have fueled the increase seen over the course of 2018.

Citation

BibTeX citation:
@online{ghanadan2024,
  author = {Ghanadan, Linus},
  title = {Time {Series} {Analysis} of {Nutrient} {Concentration} in
    {Chesapeake} {Bay}},
  date = {2024-01-20},
  url = {https://linusghanadan.github.io/blog/2024-8-20-post/chesapeake-bay-python.html},
  langid = {en}
}
For attribution, please cite this work as:
Ghanadan, Linus. 2024. “Time Series Analysis of Nutrient Concentration in Chesapeake Bay.” January 20, 2024. https://linusghanadan.github.io/blog/2024-8-20-post/chesapeake-bay-python.html.