Commit: Polynomial WIP

s2t2 committed Sep 21, 2024
1 parent 4d32d5f commit 68704b1
Showing 6 changed files with 220 additions and 26 deletions.
8 changes: 4 additions & 4 deletions docs/notes/dataviz/overview.qmd
@@ -145,13 +145,13 @@ Starting with some example data:
```{python}
scatter_data = [
    {"income": 30_000, "life_expectancy": 65.5},
-   {"income": 30_000, "life_expectancy": 62.1},
+   {"income": 35_000, "life_expectancy": 62.1},
    {"income": 50_000, "life_expectancy": 66.7},
-   {"income": 50_000, "life_expectancy": 71.0},
+   {"income": 55_000, "life_expectancy": 71.0},
    {"income": 70_000, "life_expectancy": 72.5},
-   {"income": 70_000, "life_expectancy": 77.3},
+   {"income": 75_000, "life_expectancy": 77.3},
    {"income": 90_000, "life_expectancy": 82.9},
-   {"income": 90_000, "life_expectancy": 80.0},
+   {"income": 95_000, "life_expectancy": 80.0},
]
```

25 changes: 17 additions & 8 deletions docs/notes/dataviz/trendlines.qmd
@@ -9,24 +9,20 @@ execute:

# Charts with Trendlines

Consider the previous scatter plot example:

```{python}
scatter_data = [
    {"income": 30_000, "life_expectancy": 65.5},
-   {"income": 30_000, "life_expectancy": 62.1},
+   {"income": 35_000, "life_expectancy": 62.1},
    {"income": 50_000, "life_expectancy": 66.7},
-   {"income": 50_000, "life_expectancy": 71.0},
+   {"income": 55_000, "life_expectancy": 71.0},
    {"income": 70_000, "life_expectancy": 72.5},
-   {"income": 70_000, "life_expectancy": 77.3},
+   {"income": 75_000, "life_expectancy": 77.3},
    {"income": 90_000, "life_expectancy": 82.9},
-   {"income": 90_000, "life_expectancy": 80.0},
+   {"income": 95_000, "life_expectancy": 80.0},
]
incomes = []
@@ -51,6 +47,8 @@ fig.show()

Upon viewing the chart, it looks like there may be evidence of a trend.

## Linear Trends

The [`scatter` function](https://plotly.com/python-api-reference/generated/plotly.express.scatter) has some trend-line related parameters:

```{python}
@@ -68,6 +66,13 @@ fig.show()
Under the hood, `plotly` uses the `statsmodels` package to calculate the trend, so you may have to install that package as well.
:::

A linear trend assumes that there is a straight-line relationship between the independent and dependent variables. In the context of US GDP data, a linear trend suggests that GDP changes at a constant rate over time. When applying linear regression, the goal is to find the best-fit line that minimizes the residuals (differences between the predicted and actual values) under the assumption that the underlying relationship is linear.

Linear regression is simple and interpretable but can be overly restrictive when the real-world data follows a more complex, non-linear pattern.
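To make the straight-line idea concrete, here is a minimal sketch (illustrative, not part of the notes' code) that fits an ordinary least squares line to the example scatter data using `numpy.polyfit`; `plotly` itself delegates this calculation to `statsmodels`:

```python
import numpy as np

incomes = [30_000, 35_000, 50_000, 55_000, 70_000, 75_000, 90_000, 95_000]
life_expectancies = [65.5, 62.1, 66.7, 71.0, 72.5, 77.3, 82.9, 80.0]

# a degree-1 polynomial fit is a straight line minimizing squared residuals
slope, intercept = np.polyfit(incomes, life_expectancies, deg=1)
print(f"y = {slope:.6f} * x + {intercept:.2f}")  # slope is per dollar of income
```

The fitted slope is positive, matching the upward trend visible in the scatter plot.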


## Non-linear Trends

In addition to the "ols" trend, which is an Ordinary Least Squares linear trend, we can use a "lowess" trend, which is a [non-parametric method](https://www.investopedia.com/terms/n/nonparametric-statistics.asp) that can be a better fit for non-linear relationships:

```{python}
@@ -82,3 +87,7 @@ fig.show()
```

Notice that, for the lowess trend, there is a slight bend in the curve.

LOWESS (Locally Weighted Scatterplot Smoothing) is a non-parametric method that fits multiple local regressions to different segments of the data. Instead of assuming a global linear relationship, it captures local patterns by fitting simple models in small neighborhoods around each point. These local models are then combined to create a smooth curve that adjusts to non-linearities in the data.

A LOWESS trend can adapt to sudden changes, curves, and other complex behaviors in the data, making it ideal for datasets where the relationship between variables changes over time. In the case of US GDP, a LOWESS trend might capture short-term fluctuations and inflection points that a simple linear model would miss.
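As a standalone sketch of what a lowess smoother computes (assuming the `statsmodels` package, which `plotly` uses under the hood; the synthetic data here is illustrative only):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(0, 0.2, size=50)  # noisy non-linear signal

# frac = fraction of the data used in each local regression neighborhood
smoothed = lowess(y, x, frac=0.3)  # returns an (n, 2) array of (x, smoothed y) pairs
print(smoothed.shape)
```

Smaller values of `frac` follow local wiggles more closely; larger values produce a smoother curve.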
@@ -20,18 +20,19 @@ Fetching the data, going back as far as possible:

```{python}
from pandas_datareader import get_data_fred
-from datetime import datetime
 DATASET_NAME = "POPTHM"
-df = get_data_fred(DATASET_NAME, start=datetime(1900,1,1))
+df = get_data_fred(DATASET_NAME, start="1900-01-01")
print(len(df))
df
```

:::{.callout-tip title="Data Source"}
Here is some more information about the ["POPTHM" dataset](https://fred.stlouisfed.org/series/POPTHM):

- >"Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month." The data is expressed in "Thousands", and is "Not Seasonally Adjusted".
+ "Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month."
+
+ The data is expressed in "Thousands", and is "Not Seasonally Adjusted".
:::


187 changes: 183 additions & 4 deletions docs/notes/predictive-modeling/time-series-forecasting/polynomial.qmd
@@ -1,20 +1,199 @@
# Regression with Polynomial Features for Time Series Forecasting


```{python}
#| echo: false
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#from pandas import set_option
#set_option('display.max_rows', 6)
```

## Data Loading

As an example time series dataset that follows a quadratic trend, let's consider this dataset of U.S. GDP over time, from the Federal Reserve Economic Data (FRED).

Fetching the data, going back as far as possible:

```{python}
from pandas_datareader import get_data_fred
df = get_data_fred("GDP", start="1900-01-01")
df.index.name = "date"
df.rename(columns={"GDP": "gdp"}, inplace=True)
df.head()
```

:::{.callout-note title="Data Source"}
Here is some more information about the ["GDP" dataset](https://fred.stlouisfed.org/series/GDP):

"Gross domestic product (GDP), the featured measure of U.S. output, is the market value of the goods and services produced by labor and property located in the United States."

The data is expressed in "Billions of Dollars", and is a "Seasonally Adjusted Annual Rate".

The dataset frequency is "Quarterly".
:::

## Data Exploration

Plotting the data over time with a linear trendline to examine a possible linear relationship:

```{python}
import plotly.express as px
px.scatter(df, y="gdp", title="US GDP (Quarterly) vs Linear Trend", height=450,
labels={"gdp": "GDP (in billions of USD)"},
trendline="ols", trendline_color_override="red"
)
```

A linear trend might not be the best fit.

Plotting the data over time with a Lowess trendline to examine a possible non-linear relationship:

```{python}
import plotly.express as px
px.scatter(df, y="gdp", title="US GDP (Quarterly) vs Lowess Trend", height=450,
labels={"gdp": "GDP (in billions of USD)"},
trendline="lowess", trendline_color_override="red"
)
```

In this case, a non-linear trend seems to fit better.

Let's perform a linear regression and a polynomial features regression more formally, and compare the results.

## Linear Regression


Sorting time series data:

```{python}
# sort chronologically by the "date" index:
df.sort_values(by="date", ascending=True, inplace=True)
df["time_step"] = range(1, len(df)+1)
df.head()
```

Identifying labels and features (x/y split):

```{python}
x = df[['time_step']]
y = df['gdp']
print(x.shape)
print(y.shape)
```

Test/train split for time-series data:

```{python}
training_size = round(len(df) * .8)
x_train = x.iloc[:training_size] # all before cutoff
y_train = y.iloc[:training_size] # all before cutoff
x_test = x.iloc[training_size:] # all after cutoff
y_test = y.iloc[training_size:] # all after cutoff
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```

### Model Training

Training a linear regression model:

```{python}
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
```

Examining the coefficients and line of best fit:

```{python}
print("COEF:", model.coef_)
print("INTERCEPT:", model.intercept_)
```

Examining the training results:

```{python}
from sklearn.metrics import mean_squared_error, r2_score
y_pred_train = model.predict(x_train)
r2_train = r2_score(y_train, y_pred_train)
print("R^2 (TRAINING):", r2_train)
mse_train = mean_squared_error(y_train, y_pred_train)
print("MSE (TRAINING):", mse_train)
```

A strong positive R-squared score indicates that the linear regression model explains about 85% of the variance in the GDP data during the training period. It suggests that the model fits the training data reasonably well.

These results are promising, however what we really care about is how the model generalizes to the test set.

### Prediction and Evaluation


Examining the test results:


```{python}
y_pred = model.predict(x_test)
r2 = r2_score(y_test, y_pred)
print("R^2 (TEST):", r2.round(3))
mse = mean_squared_error(y_test, y_pred)
print("MSE (TEST):", mse.round(3))
```

Storing the predictions back in the original data:

```{python}
df.loc[x_train.index, "y_pred_train"] = y_pred_train
# Add predictions for the test set
df.loc[x_test.index, "y_pred_test"] = y_pred
```

Charting the predictions:


```{python}
import plotly.express as px
fig = px.line(df, y=['gdp', 'y_pred_train', 'y_pred_test'],
              title='Linear Regression on GDP Time Series Data',
labels={'value': 'GDP', 'date': 'Date'},
)
# update legend:
fig.update_traces(line=dict(color='blue'), name="Actual GDP", selector=dict(name='gdp'))
fig.update_traces(line=dict(color='green'), name="Predicted GDP (Train)", selector=dict(name='y_pred_train'))
fig.update_traces(line=dict(color='red'), name="Predicted GDP (Test)", selector=dict(name='y_pred_test'))
fig.show()
```




It seems that although the model performs well on the training set, it performs poorly on future data it hasn't seen yet: a negative R-squared score on the test set means it does worse than a simple horizontal line at the mean. The linear model is not capturing the temporal patterns in the GDP data and fails to generalize beyond the training period.
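As a quick illustration (with made-up numbers, not the GDP data) of how an R-squared score goes negative when predictions are worse than simply predicting the mean:

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3, 4]
y_pred = [4, 4, 4, 4]  # consistently off target, worse than predicting the mean (2.5)

print(r2_score(y_true, y_pred))  # negative: below the mean-prediction baseline
```

An R-squared of 0 corresponds to always predicting the mean; anything below that is negative.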


## Polynomial

After observing that the linear regression model, which relied on the original features, struggled to capture the complexity of the GDP data on future or unseen data, we can alternatively try training a linear regression model on polynomial features instead.

By transforming the original features into higher-order terms, **polynomial features** allow the model to capture non-linear relationships, offering greater flexibility and improving the model's ability to generalize to more complex patterns in the data.

$$y = mx + b \quad \text{(linear)}$$

$$y = ax^2 + bx + c \quad \text{(quadratic)}$$
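One possible next step (a sketch on synthetic data, not the notes' finished code) is to add a squared term via `sklearn`'s `PolynomialFeatures` and fit the same linear regression on the expanded features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic quadratic series standing in for the GDP time steps
x = np.arange(1, 41).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + 3 * x.ravel() + 10

# degree=2 adds an x^2 column, so the "linear" model can fit a curve
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y).round(3))  # R^2 on this noiseless quadratic series
```

Because the synthetic series is exactly quadratic, the degree-2 pipeline fits it essentially perfectly; on real GDP data the same pipeline would be trained on `x_train`/`y_train` and evaluated on the held-out test period.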
@@ -34,12 +34,17 @@ df
:::{.callout-tip title="Data Source"}
Here is some more information about the ["PAYNSA" dataset](https://fred.stlouisfed.org/series/PAYNSA):

- > All Employees: Total Nonfarm, commonly known as Total Nonfarm Payroll, is a measure of the number of U.S. workers in the economy that excludes proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed.
- > This measure accounts for approximately 80 percent of the workers who contribute to Gross Domestic Product (GDP).
- > This measure provides useful insights into the current economic situation because it can represent the number of jobs added or lost in an economy.
- > Increases in employment might indicate that businesses are hiring which might also suggest that businesses are growing. Additionally, those who are newly employed have increased their personal incomes, which means (all else constant) their disposable incomes have also increased, thus fostering further economic expansion.
- > Generally, the U.S. labor force and levels of employment and unemployment are subject to fluctuations due to seasonal changes in weather, major holidays, and the opening and closing of schools.
- > The Bureau of Labor Statistics (BLS) adjusts the data to offset the seasonal effects to show non-seasonal changes: for example, women's participation in the labor force; or a general decline in the number of employees, a possible indication of a downturn in the economy. To closely examine seasonal and non-seasonal changes, the BLS releases two monthly statistical measures: the seasonally adjusted All Employees: Total Nonfarm (PAYEMS) and All Employees: Total Nonfarm (PAYNSA), which is not seasonally adjusted.
+ "All Employees: Total Nonfarm, commonly known as Total Nonfarm Payroll, is a measure of the number of U.S. workers in the economy that excludes proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed."
+
+ "Generally, the U.S. labor force and levels of employment and unemployment are subject to fluctuations due to seasonal changes in weather, major holidays, and the opening and closing of schools."
+
+ "The Bureau of Labor Statistics (BLS) adjusts the data to offset the seasonal effects to show non-seasonal changes: for example, women's participation in the labor force; or a general decline in the number of employees, a possible indication of a downturn in the economy.
+
+ To closely examine seasonal and non-seasonal changes, the BLS releases two monthly statistical measures: the seasonally adjusted All Employees: Total Nonfarm (PAYEMS) and All Employees: Total Nonfarm (PAYNSA), which is not seasonally adjusted."

This "PAYNSA" data is expressed in "Thousands of Persons", and is "Not Seasonally Adjusted".

The dataset frequency is "Monthly".
:::


2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -31,7 +31,7 @@ yfinance


# predictive modeling:
-scikit-learn
+scikit-learn==1.3.2 # match colab environment
joblib
ucimlrepo
