From 68704b17eda2165d76f367e9a8cf3b5ec2aaccc9 Mon Sep 17 00:00:00 2001 From: MJ Rossetti Date: Fri, 20 Sep 2024 21:07:22 -0400 Subject: [PATCH] Polynomial WIP --- docs/notes/dataviz/overview.qmd | 8 +- docs/notes/dataviz/trendlines.qmd | 25 ++- .../time-series-forecasting/index.qmd | 7 +- .../time-series-forecasting/polynomial.qmd | 187 +++++++++++++++++- .../time-series-forecasting/seasonality.qmd | 17 +- docs/requirements.txt | 2 +- 6 files changed, 220 insertions(+), 26 deletions(-) diff --git a/docs/notes/dataviz/overview.qmd b/docs/notes/dataviz/overview.qmd index 5478fd0..faa4f38 100644 --- a/docs/notes/dataviz/overview.qmd +++ b/docs/notes/dataviz/overview.qmd @@ -145,13 +145,13 @@ Starting with some example data: ```{python} scatter_data = [ {"income": 30_000, "life_expectancy": 65.5}, - {"income": 30_000, "life_expectancy": 62.1}, + {"income": 35_000, "life_expectancy": 62.1}, {"income": 50_000, "life_expectancy": 66.7}, - {"income": 50_000, "life_expectancy": 71.0}, + {"income": 55_000, "life_expectancy": 71.0}, {"income": 70_000, "life_expectancy": 72.5}, - {"income": 70_000, "life_expectancy": 77.3}, + {"income": 75_000, "life_expectancy": 77.3}, {"income": 90_000, "life_expectancy": 82.9}, - {"income": 90_000, "life_expectancy": 80.0}, + {"income": 95_000, "life_expectancy": 80.0}, ] ``` diff --git a/docs/notes/dataviz/trendlines.qmd b/docs/notes/dataviz/trendlines.qmd index 38e1f83..0ed9bc1 100644 --- a/docs/notes/dataviz/trendlines.qmd +++ b/docs/notes/dataviz/trendlines.qmd @@ -9,10 +9,6 @@ execute: # Charts with Trendlines - - - - Consider the previous scatter plot example: @@ -20,13 +16,13 @@ Consider the previous scatter plot example: ```{python} scatter_data = [ {"income": 30_000, "life_expectancy": 65.5}, - {"income": 30_000, "life_expectancy": 62.1}, + {"income": 35_000, "life_expectancy": 62.1}, {"income": 50_000, "life_expectancy": 66.7}, - {"income": 50_000, "life_expectancy": 71.0}, + {"income": 55_000, "life_expectancy": 71.0}, {"income": 
70_000, "life_expectancy": 72.5}, - {"income": 70_000, "life_expectancy": 77.3}, + {"income": 75_000, "life_expectancy": 77.3}, {"income": 90_000, "life_expectancy": 82.9}, - {"income": 90_000, "life_expectancy": 80.0}, + {"income": 95_000, "life_expectancy": 80.0}, ] incomes = [] @@ -51,6 +47,8 @@ fig.show() Upon viewing the chart, looks like there may be evidence of a trend. +## Linear Trends + The [`scatter` function](https://plotly.com/python-api-reference/generated/plotly.express.scatter) has some trend-line related parameters: ```{python} @@ -68,6 +66,13 @@ fig.show() Under the hood, `plotly` uses the `statsmodels` package to calculate the trend, so you may have to install that package as well. ::: +A linear trend assumes that there is a straight-line relationship between the independent and dependent variables. In the context of US GDP data, a linear trend suggests that GDP changes at a constant rate over time. When applying linear regression, the goal is to find the best-fit line that minimizes the residuals (differences between the predicted and actual values) under the assumption that the underlying relationship is linear. + +Linear regression is simple and interpretable but can be overly restrictive when the real-world data follows a more complex, non-linear pattern. + + +## Non-linear Trends + In addition to \"ols\" trend, which is an Ordinary Least Squares linear trend, we can use a \"lowess\" trend which is a [non-parametric method](https://www.investopedia.com/terms/n/nonparametric-statistics.asp) that can be a better fit for non-linear relationships: ```{python} @@ -82,3 +87,7 @@ fig.show() ``` If you notice, for the lowess trend, there is a slight bend in the curve. + +LOWESS (Locally Weighted Scatterplot Smoothing) is a non-parametric method that fits multiple local regressions to different segments of the data. 
Instead of assuming a global linear relationship, it captures local patterns by fitting simple models in small neighborhoods around each point. These local models are then combined to create a smooth curve that adjusts to non-linearities in the data. + +A LOWESS trend can adapt to sudden changes, curves, and other complex behaviors in the data, making it ideal for datasets where the relationship between variables changes over time. In the case of US GDP, a LOWESS trend might capture short-term fluctuations and inflection points that a simple linear model would miss. diff --git a/docs/notes/predictive-modeling/time-series-forecasting/index.qmd b/docs/notes/predictive-modeling/time-series-forecasting/index.qmd index c45bd45..95b9219 100644 --- a/docs/notes/predictive-modeling/time-series-forecasting/index.qmd +++ b/docs/notes/predictive-modeling/time-series-forecasting/index.qmd @@ -20,10 +20,9 @@ Fetching the data, going back as far as possible: ```{python} from pandas_datareader import get_data_fred -from datetime import datetime DATASET_NAME = "POPTHM" -df = get_data_fred(DATASET_NAME, start=datetime(1900,1,1)) +df = get_data_fred(DATASET_NAME, start="1900-01-01") print(len(df)) df ``` @@ -31,7 +30,9 @@ df :::{.callout-tip title="Data Source"} Here is some more information about the ["POPTHM" dataset](https://fred.stlouisfed.org/series/POPTHM): ->"Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month." The data is expressed in "Thousands", and is "Not Seasonally Adjusted". +"Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month." + +The data is expressed in "Thousands", and is "Not Seasonally Adjusted". 
::: diff --git a/docs/notes/predictive-modeling/time-series-forecasting/polynomial.qmd b/docs/notes/predictive-modeling/time-series-forecasting/polynomial.qmd index daaa085..7a0f92b 100644 --- a/docs/notes/predictive-modeling/time-series-forecasting/polynomial.qmd +++ b/docs/notes/predictive-modeling/time-series-forecasting/polynomial.qmd @@ -1,20 +1,199 @@ # Regression with Polynomial Features for Time Series Forecasting +```{python} +#| echo: false + +import warnings +warnings.simplefilter(action='ignore', category=FutureWarning) + +#from pandas import set_option +#set_option('display.max_rows', 6) +``` + +## Data Loading + +As an example time series dataset that follows a quadratic trend, let's consider this dataset of U.S. GDP over time, from the Federal Reserve Economic Data (FRED). + +Fetching the data, going back as far as possible: + +```{python} +from pandas_datareader import get_data_fred + +df = get_data_fred("GDP", start="1900-01-01") +df.index.name = "date" +df.rename(columns={"GDP": "gdp"}, inplace=True) +df.head() +``` + +:::{.callout-note title="Data Source"} +Here is some more information about the ["GDP" dataset](https://fred.stlouisfed.org/series/GDP): + +"Gross domestic product (GDP), the featured measure of U.S. output, is the market value of the goods and services produced by labor and property located in the United States." + +The data is expressed in "Billions of Dollars", and is a "Seasonally Adjusted Annual Rate". + +The dataset frequency is "Quarterly". +::: + +## Data Exploration + +Plotting the data over time with a linear trendline to examine a possible linear relationship: + +```{python} +import plotly.express as px + +px.scatter(df, y="gdp", title="US GDP (Quarterly) vs Linear Trend", height=450, + labels={"gdp": "GDP (in billions of USD)"}, + trendline="ols", trendline_color_override="red" +) +``` + +A linear trend might not be the best fit.
+ +Plotting the data over time with a Lowess trendline to examine a possible non-linear relationship: + +```{python} +import plotly.express as px + +px.scatter(df, y="gdp", title="US GDP (Quarterly) vs Lowess Trend", height=450, + labels={"gdp": "GDP (in billions of USD)"}, + trendline="lowess", trendline_color_override="red" +) +``` + +In this case, a non-linear trend seems to fit better. + +Let's perform a linear regression and a polynomial features regression more formally, and compare the results. ## Linear Regression +Sorting time series data: + +```{python} +from pandas import to_datetime + +df.sort_values(by="date", ascending=True, inplace=True) +df["time_step"] = range(1, len(df)+1) +df.head() +``` + +Identifying labels and features (x/y split): + +```{python} +x = df[['time_step']] +y = df['gdp'] +print(x.shape) +print(y.shape) +``` + +Test/train split for time-series data: + +```{python} +training_size = round(len(df) * .8) + +x_train = x.iloc[:training_size] # all before cutoff +y_train = y.iloc[:training_size] # all before cutoff + +x_test = x.iloc[training_size:] # all after cutoff +y_test = y.iloc[training_size:] # all after cutoff + +print("TRAIN:", x_train.shape, y_train.shape) +print("TEST:", x_test.shape, y_test.shape) +``` +### Model Training -Results and Interpretation: +Training a linear regression model: -Train R-squared: 0.85 +```{python} +from sklearn.linear_model import LinearRegression -This indicates that the linear regression model explains about 85% of the variance in the GDP data during the training period. It suggests that the model fits the training data reasonably well. +model = LinearRegression() +model.fit(x_train, y_train) +``` -A negative R-squared score on the test set means that the model performs poorly on future data, doing worse than a simple horizontal line (mean prediction).
This is a clear indication that the linear regression model is not capturing the temporal patterns in the GDP data and fails to generalize beyond the training period. +Examining the coefficients and line of best fit: +```{python} +print("COEF:", model.coef_) +print("INTERCEPT:", model.intercept_) +``` + +Examining the training results: + +```{python} +from sklearn.metrics import mean_squared_error, r2_score + +y_pred_train = model.predict(x_train) + +r2_train = r2_score(y_train, y_pred_train) +print("R^2 (TRAINING):", r2_train) + +mse_train = mean_squared_error(y_train, y_pred_train) +print("MSE (TRAINING):", mse_train) +``` + +A training R-squared of about 0.85 indicates that the linear regression model explains about 85% of the variance in the GDP data during the training period. It suggests that the model fits the training data reasonably well. + +These results are promising; however, what we really care about is how the model generalizes to the test set. + +### Prediction and Evaluation + + +Examining the test results: + + +```{python} +y_pred = model.predict(x_test) + +r2 = r2_score(y_test, y_pred) +print("R^2 (TEST):", r2.round(3)) + +mse = mean_squared_error(y_test, y_pred) +print("MSE (TEST):", mse.round(3)) +``` + +Storing the predictions back in the original data: + +```{python} +df.loc[x_train.index, "y_pred_train"] = y_pred_train + +# Add predictions for the test set +df.loc[x_test.index, "y_pred_test"] = y_pred +``` + +Charting the predictions: + + +```{python} +import plotly.express as px + +fig = px.line(df, y=['gdp', 'y_pred_train', 'y_pred_test'], + title='Linear Regression on GDP Time Series Data', + labels={'value': 'GDP', 'date': 'Date'}, +) +# update legend: +fig.update_traces(line=dict(color='blue'), name="Actual GDP", selector=dict(name='gdp')) +fig.update_traces(line=dict(color='green'), name="Predicted GDP (Train)", selector=dict(name='y_pred_train')) +fig.update_traces(line=dict(color='red'), name="Predicted GDP (Test)", selector=dict(name='y_pred_test')) + 
+fig.show() +``` + + + + +It seems that although the model performs well on the training set, it performs poorly on future data it hasn't seen yet, and doesn't generalize beyond the training period. ## Polynomial + +After observing that the linear regression model, which relied on the original features, struggled to capture the complexity of the GDP data on future or unseen data, we can alternatively try training a linear regression model on polynomial features instead. + +By transforming the original features into higher-order terms, **polynomial features** allow the model to capture non-linear relationships, offering greater flexibility and improving the model's ability to generalize to more complex patterns in the data. + +$$y = mx + b$$ + +$$y = ax^2 + bx + c$$ diff --git a/docs/notes/predictive-modeling/time-series-forecasting/seasonality.qmd b/docs/notes/predictive-modeling/time-series-forecasting/seasonality.qmd index ebc4794..3d4e539 100644 --- a/docs/notes/predictive-modeling/time-series-forecasting/seasonality.qmd +++ b/docs/notes/predictive-modeling/time-series-forecasting/seasonality.qmd @@ -34,12 +34,17 @@ df :::{.callout-tip title="Data Source"} Here is some more information about the ["PAYNSA" dataset](https://fred.stlouisfed.org/series/PAYNSA): -> All Employees: Total Nonfarm, commonly known as Total Nonfarm Payroll, is a measure of the number of U.S. workers in the economy that excludes proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed. -> This measure accounts for approximately 80 percent of the workers who contribute to Gross Domestic Product (GDP). -> This measure provides useful insights into the current economic situation because it can represent the number of jobs added or lost in an economy. -> Increases in employment might indicate that businesses are hiring which might also suggest that businesses are growing. 
Additionally, those who are newly employed have increased their personal incomes, which means (all else constant) their disposable incomes have also increased, thus fostering further economic expansion. -> Generally, the U.S. labor force and levels of employment and unemployment are subject to fluctuations due to seasonal changes in weather, major holidays, and the opening and closing of schools. -> The Bureau of Labor Statistics (BLS) adjusts the data to offset the seasonal effects to show non-seasonal changes: for example, women's participation in the labor force; or a general decline in the number of employees, a possible indication of a downturn in the economy. To closely examine seasonal and non-seasonal changes, the BLS releases two monthly statistical measures: the seasonally adjusted All Employees: Total Nonfarm (PAYEMS) and All Employees: Total Nonfarm (PAYNSA), which is not seasonally adjusted. +"All Employees: Total Nonfarm, commonly known as Total Nonfarm Payroll, is a measure of the number of U.S. workers in the economy that excludes proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed." + +"Generally, the U.S. labor force and levels of employment and unemployment are subject to fluctuations due to seasonal changes in weather, major holidays, and the opening and closing of schools." + +"The Bureau of Labor Statistics (BLS) adjusts the data to offset the seasonal effects to show non-seasonal changes: for example, women's participation in the labor force; or a general decline in the number of employees, a possible indication of a downturn in the economy. + +To closely examine seasonal and non-seasonal changes, the BLS releases two monthly statistical measures: the seasonally adjusted All Employees: Total Nonfarm (PAYEMS) and All Employees: Total Nonfarm (PAYNSA), which is not seasonally adjusted." + +This "PAYNSA" data is expressed in "Thousands of Persons", and is "Not Seasonally Adjusted". 
+ +The dataset frequency is "Monthly". ::: diff --git a/docs/requirements.txt b/docs/requirements.txt index feed6cf..7b0b75e 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -31,7 +31,7 @@ yfinance # predictive modeling: -scikit-learn +scikit-learn==1.3.2 # match colab environment joblib ucimlrepo