import numpy as np
import pandas as pd
# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Statistics libraries
import statsmodels.api as sm
from statsmodels.tsa.seasonal import STL
from statsmodels.regression.linear_model import OLS
from statsmodels.iolib.summary import Summary
from scipy import stats
# Other libraries
import requests
from io import BytesIO
Link to GitHub Repository
Context
For this final project in my master’s statistics course at UC Santa Barbara, I worked independently to find data, pose a question, and carry out analysis using statistical modeling and Python.
Central Question
Over the ten years that followed the onset of new regulations in 2010, what linear trends best capture how average nitrogen and phosphorus concentrations have changed in the Chespeake Bay?
Summary of Analysis
Proposed statistical question on possible underlying linear trends in average nitrogen and phosphorus concentrations across the Chesapeake Bay since 2010 (when new Clean Water Act regulations were implemented for the Bay) and found appropriate data for answering the question (over 43,000 samples from the Bay’s tidal regions). Constructed two Seasonal-Trend using LOESS (STL) decomposition models to conduct time series analysis of nitrogen and phosphorus concentrations (selected length of seasons based on autocorrelation). For each pollutant, visualized model parameters comparatively and ran regressions of parameters (to determine proportion of variation attributable to seasonality and 95% confidence interval for 10-year linear trend component).
Introduction
The Chesapeake Bay is the largest estuary in the United States, and the largest body of water that is regulated under the Clean Water Act (U.S. Environmental Protection Agency 2023). Federal regulation pertaining to the Bay took decades to implement, and this is in large part because of the Bay’s large size and the many stakeholders involved. In the 1970s, the Bay was identified as one of the first areas in the world to have a marine dead zone, a phenomenon that literally suffocates aquatic life due to lack of oxygen in the water. Despite dead zones being identified in the 1970s, it was not until 2000 that the Bay was designated as an “impaired water” under the Clean Water Act. Then, it took another ten years, until 2010, for the EPA to take the next step of issuing Total Maximum Daily Load (TMDL) requirements, the regulatory action mandating water quality improvement.
Specifically, a TMDL is the maximum amount of a particular pollutant that a body of water can receive and still meet applicable water quality standards (U.S. Environmental Protection Agency 2023). This maximum amount is calculated in pounds based on measurements taken at areas where pollution is likely to end up in the Bay. In their 2010 regulation, the EPA established TMDL requirements for nitrogen, phosphorus, and sediment. Nitrogen and phosphorus, referred to as nutrients because of their role in providing nutrition to many animals and plants, cause algal blooms, which cause marine dead zones through taking in dissolved oxygen and blocking sunlight. Sediment contributes to dead zones by blocking sunlight as well, leading it to also be included in the 2010 TMDL requirements.
This analysis will focus on nitrogen and phosphorus, the two pollutants responsible for algal blooms in the Chesapeake Bay. A 2022 study found that agricultural runoff was the largest source of nutrient pollution, accounting for 48% of nitrogen and 27% of phosphorus in the Chesapeake Bay (Chesapeake Progress, n.d.). Both of these pollutants also get to the Bay as a result of urban and suburban runoff, wastewater treatment plants releasing treated water, and natural sources (e.g., runoff from forests, wetlands, etc.). In addition, about 25% of nitrogen that ends up in the Bay comes from air pollution that is originally emitted to the atmosphere by sources such as cars and factories (Burns et al. 2021. Through a process called atmospheric deposition, these nitrogen compounds react with other chemicals to become nitrous oxides, which can be deposited back to Earth’s surface through precipitation or as dry deposition.
Through conducting a time series analysis of post-2010 nitrogen and phosphorus concentration measurements, my goal is to better understand how concentrations have changed since the introduction of TMDL requirements. I’m also interested in the nature of any seasonality and whether the three time series components (i.e., seasonal, trend, and residual) are consistent across both nitrogen and phosphorus.
Data
Yearly water quality data on the Chesapeake Bay’s tidal and non-tidal regions going back to 1984 is publicly available on the Chesapeake Bay Program (CBP) DataHub (Chesapeake Bay Program DataHub, n.d.). Data is organized into either Tier 1, 2, or 3 depending on how it was collected. While Tier 1 and 2 data can be collected by any interested group, Tier 3 data is collected by monitoring stations overseen by experienced professionals. Only Tier 3 data can be used for governmental regulatory assessments.
For my analysis, I will be using 2010 to 2019 Tier 3 data collected at 143 different monitoring stations positioned throughout the Chesapeake Bay tidal regions, which includes the mainstem Bay and tributary components. Across the 10 years that we are looking at, we’ll have more than 43,800 nitrogen observations and more than 43,500 phosphorus observations.
Below, we import the Python libraries used in this analysis. Then, we read in the yearly water quality data using their CBP DataHub URL. We also process the data by creating separate data.frames for total nitrogen and phosphorus measurements.
# Create a list of data URLs
= [
excel_urls 'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2019_CEDR_tidal_data_01jun21.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2018_CEDR_tidal_data_01jun21.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2017_CEDR_tidal_data_11oct18.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2016_CEDR_tidal_data_15jun17.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2015_CEDR_tidal_data_15jun17.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2014_CEDR_tidal_data_15jun17.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2013_CEDR_tidal_data_15jun17.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2012_CEDR_tidal_data_15jun17.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2011_CEDR_tidal_data_15jun17.xlsx',
'https://datahub-content.chesapeakebay.net/traditional_annual_tidal_02jun21/2010_CEDR_tidal_data_15jun17.xlsx'
]
# Create an empty list to store data frames
= []
dfs
# Loop through each URL, read the Excel file directly into pandas, and append to list of data frames
for url in excel_urls:
# Get the content of the Excel file
= requests.get(url)
response
# Read the Excel file directly from the content
= pd.read_excel(BytesIO(response.content), sheet_name=0)
wq_data
dfs.append(wq_data)
# Combine all data frames into a single data frame
= pd.concat(dfs, ignore_index=True) wq_data_combined
# Wrangle data for relevant column variables, filter for TN (total nitrogen), and remove Parameter column
= wq_data_combined[["SampleDate", "Parameter", "MeasureValue", "Latitude", "Longitude"]]
nitr_data = nitr_data[nitr_data["Parameter"] == "TN"]
nitr_data =['Parameter'], inplace=True)
nitr_data.drop(columns
nitr_data.info()
# Wrangle data for relevant column variables, and filter for TP (total phosphorus)
= wq_data_combined[["SampleDate", "Parameter", "MeasureValue", "Latitude", "Longitude"]]
phos_data = phos_data[phos_data["Parameter"] == "TP"]
phos_data =['Parameter'], inplace=True)
phos_data.drop(columns
phos_data.info()
# Remove from environment
del wq_data_combined
<class 'pandas.core.frame.DataFrame'>
Index: 43809 entries, 56 to 2484389
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SampleDate 43809 non-null datetime64[ns]
1 MeasureValue 43809 non-null float64
2 Latitude 43809 non-null float64
3 Longitude 43809 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.7 MB
<class 'pandas.core.frame.DataFrame'>
Index: 43590 entries, 60 to 2484391
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SampleDate 43590 non-null datetime64[ns]
1 MeasureValue 43590 non-null float64
2 Latitude 43590 non-null float64
3 Longitude 43590 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.7 MB
Exploratory Analysis
We’ll start by briefly looking at average concentration by location using a heatmap (courtesy of the seaborn library).
# Nitrogen
# Bin latitude and longitude into groups
= pd.DataFrame(nitr_data)
nitr_data_copy 'Latitude Group'] = pd.cut(nitr_data['Latitude'], bins=10)
nitr_data_copy['Longitude Group'] = pd.cut(nitr_data['Longitude'], bins=10)
nitr_data_copy[
# Pivot the data to create a heatmap
= nitr_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')
nitr_heatmap_data
# Remove from environment
del nitr_data_copy
# Initialize the figure
=(10, 8))
plt.figure(figsize
# Set style
'seaborn-whitegrid')
plt.style.use(
# Plot heatmap
='Reds')
sns.heatmap(nitr_heatmap_data, cmap
# Customize the plot
'Heatmap of Mean Nitrogen Concentration (mg/L) by Location')
plt.title('Longitude')
plt.xlabel('Latitude')
plt.ylabel(
plt.tight_layout()
plt.show()
# Remove from environment
del nitr_heatmap_data
/tmp/ipykernel_17/2516891936.py:9: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
nitr_heatmap_data = nitr_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')
/tmp/ipykernel_17/2516891936.py:18: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
plt.style.use('seaborn-whitegrid')
# Phosphorus
# Bin latitude and longitude into groups
= pd.DataFrame(phos_data)
phos_data_copy 'Latitude Group'] = pd.cut(phos_data['Latitude'], bins=10)
phos_data_copy['Longitude Group'] = pd.cut(phos_data['Longitude'], bins=10)
phos_data_copy[
# Pivot the data to create a heatmap
= phos_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')
phos_heatmap_data
# Remove from environment
del phos_data_copy
# Initialize the figure
=(10, 8))
plt.figure(figsize
# Set style
'seaborn-whitegrid')
plt.style.use(
# Plot heatmap
='Reds')
sns.heatmap(phos_heatmap_data, cmap
# Customize the plot
'Heatmap of Mean Phosphorus Concentration (mg/L) by Location')
plt.title('Longitude')
plt.xlabel('Latitude')
plt.ylabel(
plt.tight_layout()
plt.show()
# Remove from environment
del phos_heatmap_data
/tmp/ipykernel_17/190755117.py:9: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
phos_heatmap_data = phos_data_copy.pivot_table(index='Latitude Group', columns='Longitude Group', values='MeasureValue', aggfunc='mean')
/tmp/ipykernel_17/190755117.py:18: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
plt.style.use('seaborn-whitegrid')
Interestingly, nitrogen and phosphorus concentration are both generally higher in the southeast regions of the Chesapeake Bay.
Moving forward, we’ll start with our time series analysis by calculating moving averages for each year-month. We’ll plot these moving averages just to get a general idea of what types of trends we might be looking at.
# Nitrogen
# Remove columns from the nitrogen data that are no longer needed
=['Latitude', 'Longitude'], inplace=True)
nitr_data.drop(columns
# Set 'SampleDate' as the index
'SampleDate', inplace=True)
nitr_data.set_index(
# Resample to monthly frequency and calculate the mean for the 'MeasureValue' column
= nitr_data['MeasureValue'].resample('M').mean()
nitr_monthly_avg
# Convert the result to a DataFrame, but keep the datetime index
= nitr_monthly_avg.to_frame()
nitr_monthly_avg
# Remove from environment
del nitr_data
# Initialize figure
=(12, 6))
plt.figure(figsize
# Plot monthly average nitrogen data
'MeasureValue'], color='black', alpha=0.7)
plt.plot(nitr_monthly_avg.index, nitr_monthly_avg[
# Customize the plot
'Monthly Average Concentration of Nitrogen Samples (2010-2019)')
plt.title('Year')
plt.xlabel('Monthly Average Concentration (mg/L)')
plt.ylabel(
# Rotate and align tick labels
plt.gcf().autofmt_xdate()
# Use tight layout to ensure everything fits without overlapping
plt.tight_layout()
plt.show()
/tmp/ipykernel_17/3811197196.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
nitr_monthly_avg = nitr_data['MeasureValue'].resample('M').mean()
From this plot, there does appear to be seasonality for nitrogen concentrations, but there is no clear trend component. In addition, there is a notable spike in early 2014.
# Phosphorus
# Remove columns from the nitrogen data that are no longer needed
=['Latitude', 'Longitude'], inplace=True)
phos_data.drop(columns
# Set 'SampleDate' as the index
'SampleDate', inplace=True)
phos_data.set_index(
# Resample to monthly frequency and calculate the mean for the 'MeasureValue' column
= phos_data['MeasureValue'].resample('M').mean()
phos_monthly_avg
# Convert the result to a DataFrame, but keep the datetime index
= phos_monthly_avg.to_frame()
phos_monthly_avg
# Remove from environment
del phos_data
# Initialize figure
=(12, 6))
plt.figure(figsize
# Plot monthly average phosphorus data
'MeasureValue'], color='black', alpha=0.7)
plt.plot(phos_monthly_avg.index, phos_monthly_avg[
# Customize the plot
'Monthly Average Concentration of Phosphorus Samples (2010-2019)')
plt.title('Year')
plt.xlabel('Monthly Average Concentration (mg/L)')
plt.ylabel(
# Rotate and align tick labels
plt.gcf().autofmt_xdate()
# Use tight layout to ensure everything fits without overlapping
plt.tight_layout()
plt.show()
/tmp/ipykernel_17/729465269.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
phos_monthly_avg = phos_data['MeasureValue'].resample('M').mean()
Similar to the nitrogen plot, phosphorus also seems to exhibit a distinct seasonal component. Again, it is unclear whether there is an underlying trend.
Methods
Autocorrelation Function
The autocorrelation function calculates the correlation between the dependent variable at a given point in time and various time lags for this same variable. Thus, the autocorrelation function provides us with a tool that allows us to better understand any seasonality present in our data. This will be useful for us in subsequent steps of our time series analysis.
STL Decomposition
As a core part of this time series analysis, I’ll be constructing a seasonal trend decomposition using locally estimated scatterplot smoothing (LOESS), which is often abbreviated as a STL decomposition model. STL allows us to separate our monthly average concentrations into three components: seasonal component, trend component, and residual component. Through plotting these components next to each other, we gain a more intuitive understanding of the underlying forces contributing to variation in our dependent variable, which in this case is the monthly average concentration.
As opposed to other decomposition methods, one thing that is particular to STL models is the ability to specify the length of a season. It can be helpful to adjust this input depending on our desired level of smoothing for the trend component. We will use our results from the autocorrelation function to inform our chosen length of seasons. The autocorrelation function is useful in this context because it can tell us after how many lags we see a drop-off in autocorrelation, indicating there is a drop in the significance of the seasonal component.
Results
Autocorrelation Function
# Nitrogen
# Plot autocorrelation function for nitrogen with lags going back three years
'MeasureValue'], lags=36, alpha=0.05)
sm.graphics.tsa.plot_acf(nitr_monthly_avg[
'Lags (Months)')
plt.xlabel('Autocorrelation Function for Nitrogen')
plt.title( plt.show()
# Phosphorus
# Plot autocorrelation function for phosphorus with lags going back three years
'MeasureValue'], lags=36)
sm.graphics.tsa.plot_acf(phos_monthly_avg[
'Lags (Months)')
plt.xlabel('Autocorrelation Function for Phosphorus')
plt.title( plt.show()
For both autocorrelation functions, there is a drop-off in autocorrelation after the t-24 lag, so I will choose 24 months as constituting one seasonal period when constructing my model in the next step.
STL Decomposition
# Nitrogen
# Perform STL decomposition
# Select trend parameter that is about halfway into the second season-period
= STL(nitr_monthly_avg['MeasureValue'], period=24, trend=35)
nitr_stl = nitr_stl.fit()
nitr_decomp
nitr_decomp.plot()
# Remove from environment
del nitr_stl
In this plot of the three STL components for nitrogen, it is still difficult to see a long-term trend, despite the line being fairly smooth. There does seem to be a slight downward trend until 2018. However, starting in 2018, there is a clear increase for the rest of the time series.
# Phosphorus
# Perform STL decomposition
# Select trend parameter that is about halfway into the second season-period
= STL(phos_monthly_avg['MeasureValue'], period=24, trend=35)
phos_stl = phos_stl.fit()
phos_decomp
phos_decomp.plot()
# Remove from environment
del phos_stl
The STL plot for phosphorus does make it seem like there is a long-term downward trend, but it is difficult to tell how significant it is compared to the other components.The STL plot for phosphorus does make it seem like there is a long-term downward trend, but it is difficult to tell how significant it is because of the long grey bar, which indicates it is least influential of the three components in our STL model.
# Nitrogen
# Extract the components
= nitr_decomp.trend
nitr_trend = nitr_decomp.seasonal
nitr_seasonal = nitr_decomp.resid
nitr_resid
# Remove from environment
del nitr_decomp
# Calculate the seasonally adjusted series
= nitr_monthly_avg['MeasureValue'].values - nitr_seasonal
nitr_seasonally_adjusted
# Create the plot
=(16, 10))
plt.figure(figsize
# Plot all components on the same axis
'MeasureValue'].values, label='Monthly Mean', color='black')
plt.plot(nitr_monthly_avg.index, nitr_monthly_avg[='Seasonal-Adjusted Monthly Mean', color='blue', linewidth=3)
plt.plot(nitr_monthly_avg.index, nitr_seasonally_adjusted, label='Trend Component', color='red', linewidth=3)
plt.plot(nitr_monthly_avg.index, nitr_trend, label='Seasonal Component', color='green', linewidth=3)
plt.plot(nitr_monthly_avg.index, nitr_seasonal, label
# Customize x-axis for date breaks
10)) # Yearly major ticks
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(20)) # Half-yearly minor ticks
plt.gca().xaxis.set_minor_locator(plt.MaxNLocator('%Y-%m'))
plt.gca().xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter(
# Set labels and title
'Year-Month')
plt.xlabel('Concentration (mg/L)')
plt.ylabel('Nitrogen in Chesapeake Bay (2010-2019)', fontsize=14)
plt.title(
# Customize legend
='upper right')
plt.legend(loc
# Add a grid and theme
True, which='both', linestyle='--', linewidth=0.5)
plt.grid(
plt.tight_layout()
# Show the plot
plt.show()
# Remove from environment
del nitr_seasonally_adjusted
I decided to make this visualization to get a better idea of exactly how these components map on to each other. This plot seems to confirm the idea of negligible trend component for nitrogen. Since the x-axis is labeled for each year here, it is also easier to see the seasonality. Each year, nitrogen concentrations increase sharply around December. They then peak around February to March, before decreasing substantially and reaching their minimum around July.
# Phosphorus
# Extract the components
= phos_decomp.trend
phos_trend = phos_decomp.seasonal
phos_seasonal = phos_decomp.resid
phos_resid
# Remove from environment
del phos_decomp
# Calculate the seasonally adjusted series
= phos_monthly_avg['MeasureValue'].values - phos_seasonal
phos_seasonally_adjusted
# Create the plot
=(16, 10))
plt.figure(figsize
# Plot all components on the same axis
'MeasureValue'].values, label='Monthly Mean', color='black')
plt.plot(phos_monthly_avg.index, phos_monthly_avg[='Seasonal-Adjusted Monthly Mean', color='blue', linewidth=3)
plt.plot(nitr_monthly_avg.index, phos_seasonally_adjusted, label='Trend Component', color='red', linewidth=3)
plt.plot(phos_monthly_avg.index, phos_trend, label='Seasonal Component', color='green', linewidth=3)
plt.plot(phos_monthly_avg.index, phos_seasonal, label
# Customize x-axis for date breaks
10)) # Yearly major ticks
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(20)) # Half-yearly minor ticks
plt.gca().xaxis.set_minor_locator(plt.MaxNLocator('%Y-%m'))
plt.gca().xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter(
# Set labels and title
'Year-Month')
plt.xlabel('Concentration (mg/L)')
plt.ylabel('Phosphorus in Chesapeake Bay (2010-2019)', fontsize=14)
plt.title(
# Customize legend
='upper right')
plt.legend(loc
# Add a grid and theme
True, which='both', linestyle='--', linewidth=0.5)
plt.grid(
plt.tight_layout()
# Show the plot
plt.show()
# Remove from environment
del phos_seasonally_adjusted
Our plot for phosphorus further supports the idea that there is a slight downward trend over the decade. If you independently trace the maximums or minimums, the line does seem to be moving downward at an oscillating but fairly consistent rate. Unlike nitrogen, phosphorus concentrations shoot up in the middle of the year around May, have a relatively flat peak lasting from June to August, and then shoot down at the end of Summer.
Linear Regressions of Model Parameters
# Nitrogen
# Run regression of season component on monthly average
= sm.add_constant(nitr_seasonal)
X = nitr_monthly_avg['MeasureValue'].values
y = OLS(y, X).fit()
nitr_season_reg
# Print R-squared
= nitr_season_reg.rsquared
r_squared print(f"R-squared: {r_squared:.3f}")
R-squared: 0.725
The adjusted R-squared of 0.73 indicates that seasonality explains about 73% of the variation in nitrogen monthly mean.
# Phosphorus
# Run regression of season component on monthly average
= sm.add_constant(phos_seasonal)
X = phos_monthly_avg['MeasureValue'].values
y = OLS(y, X).fit()
phos_season_reg
# Print R-squared
= phos_season_reg.rsquared
r_squared print(f"R-squared: {r_squared:.3f}")
R-squared: 0.867
For phosphorus, the adjusted R-squared of this regression is 0.87, confirming our idea that seasonality is more pronounced with phosphorus.
# Nitrogen
# Calculate months since start
= nitr_monthly_avg.index[0]
start_date 'months_since_start'] = (nitr_monthly_avg.index.year - start_date.year) * 12 + (nitr_monthly_avg.index.month - start_date.month)
nitr_monthly_avg[
# Run regression of trend component on days since start
= sm.add_constant(nitr_monthly_avg['months_since_start'])
X = nitr_trend
y = OLS(y, X).fit()
nitr_season_reg
# Print the regression summary
print(nitr_season_reg.summary())
# Get the coefficient (slope)
= nitr_season_reg.params['months_since_start']
monthly_trend_estimate
# Get the standard error of the coefficient
= nitr_season_reg.bse['months_since_start']
se
# Calculate the t-statistic for 95% confidence interval
# Degrees of freedom: n - k - 1, where n is number of observations and k is number of predictors
= nitr_season_reg.df_resid
df = stats.t.ppf(0.975, df) # 0.975 for two-tailed 95% CI
t_value
# Calculate estimate for 10-year aggregated trend component
= 120 * monthly_trend_estimate
fullseries_trend_estimate
# Calculate the confidence interval for 10 year change (120 * monthly trend component)
= 120 * (monthly_trend_estimate - t_value * se)
ci_lower = 120 * (monthly_trend_estimate + t_value * se)
ci_upper
# Print results
print(f"Estimated 10-year trend component (mg/L):\n{fullseries_trend_estimate:.3f}")
print(f"95% Confidence Interval for 10-year trend component (mg/L):\n({ci_lower:.3f}, {ci_upper:.3f})")
# Calculate estimate for 10-year aggregated trend component (as percent of January 2010)
= fullseries_trend_estimate / nitr_trend.values[0]
fullseries_trend_estimate_aspercent
# Calculate the confidence interval for 10 year change (120 * monthly trend component) as percent of January 2010
= ci_lower / nitr_trend.values[0]
ci_lower = ci_upper / nitr_trend.values[0]
ci_upper
# Print results
print(f"Estimated 10-year trend component (as percent of January 2010):\n{fullseries_trend_estimate_aspercent:.3f}")
print(f"95% Confidence Interval for 10-year trend component (as percent of January 2010):\n({ci_lower:.3f}, {ci_upper:.3f})")
# Remove from environment
del nitr_monthly_avg
del nitr_trend
OLS Regression Results
==============================================================================
Dep. Variable: trend R-squared: 0.046
Model: OLS Adj. R-squared: 0.038
Method: Least Squares F-statistic: 5.698
Date: Tue, 10 Sep 2024 Prob (F-statistic): 0.0186
Time: 06:46:33 Log-Likelihood: 193.25
No. Observations: 120 AIC: -382.5
Df Residuals: 118 BIC: -376.9
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const 0.9187 0.009 103.858 0.000 0.901 0.936
months_since_start -0.0003 0.000 -2.387 0.019 -0.001 -5.23e-05
==============================================================================
Omnibus: 3.998 Durbin-Watson: 0.008
Prob(Omnibus): 0.135 Jarque-Bera (JB): 3.239
Skew: -0.289 Prob(JB): 0.198
Kurtosis: 2.441 Cond. No. 137.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Estimated 10-year trend component (mg/L):
-0.037
95% Confidence Interval for 10-year trend component (mg/L):
(-0.067, -0.006)
Estimated 10-year trend component (as percent of January 2010):
-0.039
95% Confidence Interval for 10-year trend component (as percent of January 2010):
(-0.071, -0.007)
In this linear regression, we look at the influence of year-month on our trend component for nitrogen. The regression output tells us that we can say at an alpha equals 0.05 significance level that the 10-year trend component was negative. Furthermore, the first interval tells us that there is a 95% chance that the interval from -0.067 mg/L to -0.006 mg/L contains the true 10-year trend component. The second interval shows that this represents a -0.071% to -0.007% change as compared to the trend component in January 2010.
# Phosphorus
# Calculate months since start
= phos_monthly_avg.index[0]
start_date 'months_since_start'] = (phos_monthly_avg.index.year - start_date.year) * 12 + (phos_monthly_avg.index.month - start_date.month)
phos_monthly_avg[
# Run regression of trend component on days since start
= sm.add_constant(phos_monthly_avg['months_since_start'])
X = phos_trend
y = OLS(y, X).fit()
phos_season_reg
# Print the regression summary
print(phos_season_reg.summary())
# Get the coefficient (slope)
= phos_season_reg.params['months_since_start']
monthly_trend_estimate
# Get the standard error of the coefficient
= phos_season_reg.bse['months_since_start']
se
# Calculate the t-statistic for 95% confidence interval
# Degrees of freedom: n - k - 1, where n is number of observations and k is number of predictors
= phos_season_reg.df_resid
df = stats.t.ppf(0.975, df) # 0.975 for two-tailed 95% CI
t_value
# Calculate estimate for 10-year aggregated trend component
= 120 * monthly_trend_estimate
fullseries_trend_estimate
# Calculate the confidence interval for 10 year change (120 * monthly trend component)
= 120 * (monthly_trend_estimate - t_value * se)
ci_lower = 120 * (monthly_trend_estimate + t_value * se)
ci_upper
# Print results
print(f"Estimated 10-year trend component (mg/L):\n{fullseries_trend_estimate:.5f}")
print(f"95% Confidence Interval for 10-year trend component (mg/L):\n({ci_lower:.5f}, {ci_upper:.5f})")
# Calculate estimate for 10-year aggregated trend component (as percent of January 2010)
= fullseries_trend_estimate / phos_trend.values[0]
fullseries_trend_estimate_aspercent
# Calculate the confidence interval for 10 year change (120 * monthly trend component) as percent of January 2010
= ci_lower / phos_trend.values[0]
ci_lower = ci_upper / phos_trend.values[0]
ci_upper
# Print results
print(f"Estimated 10-year trend component (as percent of January 2010):\n{fullseries_trend_estimate_aspercent:.3f}")
print(f"95% Confidence Interval for 10-year trend component (as percent of January 2010):\n({ci_lower:.3f}, {ci_upper:.3f})")
OLS Regression Results
==============================================================================
Dep. Variable: trend R-squared: 0.796
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 461.4
Date: Tue, 10 Sep 2024 Prob (F-statistic): 1.38e-42
Time: 06:46:33 Log-Likelihood: 639.74
No. Observations: 120 AIC: -1275.
Df Residuals: 118 BIC: -1270.
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const 0.0609 0.000 284.396 0.000 0.060 0.061
months_since_start -6.683e-05 3.11e-06 -21.480 0.000 -7.3e-05 -6.07e-05
==============================================================================
Omnibus: 33.059 Durbin-Watson: 0.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6.540
Skew: 0.051 Prob(JB): 0.0380
Kurtosis: 1.861 Cond. No. 137.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Estimated 10-year trend component (mg/L):
-0.00802
95% Confidence Interval for 10-year trend component (mg/L):
(-0.00876, -0.00728)
Estimated 10-year trend component (as percent of January 2010):
-0.126
95% Confidence Interval for 10-year trend component (as percent of January 2010):
(-0.138, -0.115)
For phosphorus, the linear regression output tells us that we are confident at the alpha equals 0.05 level that the 10-year trend component was negative. Specifically, we find that there is a 95% chance that the interval between -0.0087 mg/L and -0.0073 mg/L contains the true 10-year trend component. This represents a -0.14% to -0.12% change as compared to the trend component in January 2010.
Conclusion
My time series analysis suggests that seasonality plays a substantial role in contributing to variation in the monthly mean concentration of nitrogen and phosphorus in tidal regions of the Chesapeake Bay. For nitrogen, seasonal trends explained 73% of the variation in monthly means, and the relationship was even stronger for phosphorus, with seasonal trends explaining 87% of the variation. While the seasonal component for nitrogen was highest during Winter, the seasonal component for phosphorus was highest during Summer.
This analysis was also interested in any non-seasonal trend that has occurred since the introduction of TMDL requirements in 2010. For both nitrogen and phosphorus, we find evidence at an alpha level of 0.05 that the 10-year trend component was negative. However, our confidence intervals suggest that, for both nutrient pollutants, these changes in trend represent a less than 0.2% decrease over the 10-year time series.
The main limitation of this analysis was that no form of spatial interpolation was employed to estimate concentrations across the tidal region based on the location of measurements. It would be interesting to compare such an analysis to what we did here, as any significant differences would imply that sampled areas are not spread throughout the region in a representative manner. Further analysis might also investigate what happened at the beginning of 2014 that could have led to the high spike in nitrogen levels at that time, in addition to factors that might have fueled the increase seen over the course of 2018.
Citation
@online{ghanadan2024,
author = {Ghanadan, Linus},
title = {Time {Series} {Analysis} of {Nutrient} {Concentration} in
{Chesapeake} {Bay}},
date = {2024-01-20},
url = {https://linusghanadan.github.io/blog/2024-8-20-post/chesapeake-bay-python.html},
langid = {en}
}