From 386be19eb0e97eed91893501da76d514c78281c2 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 26 Sep 2023 08:45:17 -0500
Subject: [PATCH 01/14] Update setup.md

---
 setup.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/setup.md b/setup.md
index 062623c..8ebdd75 100644
--- a/setup.md
+++ b/setup.md
@@ -3,7 +3,7 @@ title: Setup
---

# Software Packages Required
-You will need to have an installation of Python 3 with the matplotlib, pandas, numpy and optionally opencv packages.
+You will need to have an installation of Python 3 with the matplotlib, pandas, numpy and opencv packages. If you can't successfully install opencv, you may use Google Colab on day 2 of the workshop.

The [Anaconda Distribution](https://www.anaconda.com/products/individual#Downloads) includes all of these except opencv by default.

From 88380c6323123361506e5e386e144e058ff23d03 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 26 Sep 2023 17:50:22 -0500
Subject: [PATCH 02/14] Update 02-regression.md

---
 _episodes/02-regression.md | 92 +++++++++++++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 6 deletions(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index 36ab10e..40ab049 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -35,10 +35,59 @@ y_data = [4,5,7,10,15]
~~~
{: .language-python}

-Let's take a look at the math required to fit a line of best fit to this data. Open `regression_helper_functions.py` and view the code for the `least_squares()` function. The equations you see in this function are derived using some calculus. Specifically, to find a slope and y-intercept that minimizes the sum of squared errors (SSE), we have to take the partial derivative of SSE w.r.t. both of the model's parameters — slope and y-intercept. We can set those partial derivatives to zero (where the rate of SSE change goes to zero) to find the optimal values of these parameters. The terms used in the for loop are derived from these partial derivatives.
+We can use the `least_squares()` helper function to calculate a line of best fit through this data.
+
+Let's take a look at the math required to fit a line of best fit to this data. Open `regression_helper_functions.py` and view the code for the `least_squares()` function.
+~~~
+def least_squares(data: List[List[float]]) -> Tuple[float, float]:
+    """
+    Calculate the line of best fit for a data matrix of [x_values, y_values] using
+    ordinary least squares optimization.
+
+    Args:
+        data (List[List[float]]): A list containing two equal-length lists, where the
+        first list represents x-values and the second list represents y-values.
+
+    Returns:
+        Tuple[float, float]: A tuple containing the slope (m) and the y-intercept (c) of
+        the line of best fit.
+    """
+    x_sum = 0
+    y_sum = 0
+    x_sq_sum = 0
+    xy_sum = 0
+
+    # Ensure the list of data has two equal-length lists
+    assert len(data) == 2
+    assert len(data[0]) == len(data[1])
+
+    n = len(data[0])
+    # Least squares regression calculation
+    for i in range(0, n):
+        if isinstance(data[0][i], str):
+            x = int(data[0][i]) # Convert date string to int
+        else:
+            x = data[0][i] # For GDP vs. life-expectancy data
+        y = data[1][i]
+        x_sum = x_sum + x
+        y_sum = y_sum + y
+        x_sq_sum = x_sq_sum + (x ** 2)
+        xy_sum = xy_sum + (x * y)
+
+    m = ((n * xy_sum) - (x_sum * y_sum))
+    m = m / ((n * x_sq_sum) - (x_sum ** 2))
+    c = (y_sum - m * x_sum) / n
+
+    print("Results of linear regression:")
+    print("m =", format(m, '.5f'), "c =", format(c, '.5f'))
+
+    return m, c
+~~~
+{: .language-python}
+
+The equations you see in this function are derived using some calculus. Specifically, to find a slope and y-intercept that minimizes the sum of squared errors (SSE), we have to take the partial derivative of SSE w.r.t. both of the model's parameters — slope and y-intercept. We can set those partial derivatives to zero (where the rate of SSE change goes to zero) to find the optimal values of these model coefficients (a.k.a. parameters, a.k.a. weights).
+
-To see how ordinary least squares optimization is derived, visit: [https://are.berkeley.edu/courses/EEP118/current/derive_ols.pdf](https://are.berkeley.edu/courses/EEP118/current/derive_ols.pdf)
+To see how ordinary least squares optimization is fully derived, visit: [https://are.berkeley.edu/courses/EEP118/current/derive_ols.pdf](https://are.berkeley.edu/courses/EEP118/current/derive_ols.pdf)
~~~
from regression_helper_functions import least_squares
m, b = least_squares([x_data,y_data])
~~~
{: .language-python}

~~~
m = 1.51829 c = 0.30488
~~~
{: .output}

We can use our new model to generate a line that predicts y-values at all x-coordinates fed into the model. Open `regression_helper_functions.py` and view the code for the `get_model_predictions()` function. Find the FIXME tag in the function, and fill in the missing code to output linear model predictions.
-
~~~
-def get_model_predictions(x_data, m, c):
-    """Using the input slope (m) and y-intercept (c), calculate linear model predictions (y-values) for a given list of x-coordinates."""
-
+def get_model_predictions(x_data: List[float], m: float, c: float) -> List[float]:
+    """
+    Calculate linear model predictions (y-values) for a given list of x-coordinates using
+    the provided slope and y-intercept.
+
+    Args:
+        x_data (List[float]): A list of x-coordinates for which predictions are calculated.
+        m (float): The slope of the linear model.
+        c (float): The y-intercept of the linear model.
+
+    Returns:
+        List[float]: A list of predicted y-values corresponding to the input x-coordinates.
+    """
    linear_preds = []
    for x in x_data:
        # FIXME: Uncomment below line and complete the line of code to get a model prediction from each x value

@@ -82,6 +140,28 @@ We can now plot our model predictions along with the actual data using the `make
~~~
from regression_helper_functions import make_regression_graph
+help(make_regression_graph)
+~~~
+{: .language-python}
+
+~~~
+Help on function make_regression_graph in module regression_helper_functions:
+
+make_regression_graph(x_data: List[float], y_data: List[float], y_pred: List[float], axis_labels: Tuple[str, str]) -> None
+    Plot data points and a model's predictions (line) on a graph.
+
+    Args:
+        x_data (List[float]): A list of x-coordinates for data points.
+        y_data (List[float]): A list of corresponding y-coordinates for data points.
+        y_pred (List[float]): A list of predicted y-values from a model (line).
+        axis_labels (Tuple[str, str]): A tuple containing the labels for the x and y axes.
+
+    Returns:
+        None: The function displays the plot but does not return a value.
+~~~ +{: .output} + +~~~ make_regression_graph(x_data, y_data, y_preds, ['X', 'Y']) ~~~ {: .language-python} From 6ca901c31bd1697c0fe83f86aefeaab2bce9c4f8 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 26 Sep 2023 18:17:49 -0500 Subject: [PATCH 03/14] Update 02-regression.md --- _episodes/02-regression.md | 42 +++++++++++++++++++++++++++++++++++--- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md index 40ab049..f767a62 100644 --- a/_episodes/02-regression.md +++ b/_episodes/02-regression.md @@ -167,10 +167,46 @@ make_regression_graph(x_data, y_data, y_preds, ['X', 'Y']) {: .language-python} ### Testing the accuracy of a linear regression model -We now have a linear model for some data. It would be useful to test how accurate that model is. We can do this by computing the y value for every x value used in our original data and comparing the model’s y value with the original. We can turn this into a single overall error number by calculating the root mean square error (RMSE), this squares each comparison, takes the sum of all of them, divides this by the number of items and finally takes the square root of that value. By squaring and square rooting the values we prevent negative errors from cancelling out positive ones. The RMSE gives us an overall error number which we can then use to measure our model’s accuracy with. +We now have a linear model for some training data. It would be useful to assess how accurate that model is. + +One popular measure of a model's error is the Root Mean Squared Error (RMSE). RMSE is expressed in the same units as the data being measured. This makes it easy to interpret because you can directly relate it to the scale of the problem. For example, if you're predicting house prices in dollars, the RMSE will also be in dollars, allowing you to understand the average prediction error in a real-world context. + +To calculate the RMSE, we: +1. Calculate the sum of squared differences (SSE) between observed values of y and predicted values of y: `SSE = (y-y_pred)**2` +2. Convert the SSE into the mean-squared error by dividing by the total number of obervations, n, in our data: `MSE = SSE/n` +3. Take the square root of the MSE: `RMSE = math.sqrt(MSE)` + +The RMSE gives us an overall error number which we can then use to measure our model’s accuracy with. Open `regression_helper_functions.py` and view the code for the `measure_error()` function. Find the FIXME tag in the function, and fill in the missing code to calculate RMSE. +~~~ +import math +def measure_error(y: List[float], y_pred: List[float]) -> float: + """ + Calculate the Root Mean Square Error (RMSE) of a model's predictions. + + Args: + y (List[float]): A list of actual (observed) y values. + y_pred (List[float]): A list of predicted y values from a model. + + Returns: + float: The RMSE (root mean square error) of the model's predictions. 
+ """ + assert len(y)==len(y_pred) + err_total = 0 + for i in range(0,len(y)): + # add up the squared error for each observation + # FIXME: Uncomment the below line and fill in the blank to add up the squared error for each observation +# err_total = err_total + ________ + # SOLUTION + err_total = err_total + (y[i] - y_pred[i])**2 + + err = math.sqrt(err_total / len(y)) + return err +~~~ +{: .language-python} + ~~~ import math def measure_error(data1, data2): @@ -201,8 +237,8 @@ print(measure_error(y_data,y_preds)) This will output an error of 0.7986268703523449, which means that on average the difference between our model and the real values is 0.7986268703523449. The less linear the data is the bigger this number will be. If the model perfectly matches the data then the value will be zero. -> ## Model Parameters VS Hyperparameters -> Model parameters/coefficients/weights are parameters that are learned during the model-fitting stage. How many parameters does our linear model have? In addition, what hyperparameters does this model have, if any? +> ## Model Parameters (a.k.a. coefs or weights) VS Hyperparameters +> Model parameters/coefficients/weights are parameters that are learned during the model-fitting stage. That is, they are estimated from the data. How many parameters does our linear model have? In addition, what hyperparameters does this model have, if any? > > > ## Solution > > In a univariate linear model (with only one variable predicting y), the two parameters learned from the data include the model's slope and its intercept. One hyperparameter of a linear model is the number of variables being used to predict y. In our previous example, we used only one variable, x, to predict y. However, it is possible to use additional predictor variables in a linear model (e.g., multivariate linear regression). From d2f027aee979c10736ea8b5287142e7250b0e37b Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 26 Sep 2023 19:30:37 -0500 Subject: [PATCH 04/14] Update 02-regression.md --- _episodes/02-regression.md | 73 +++----------------------------------- 1 file changed, 5 insertions(+), 68 deletions(-) diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md index f767a62..45ce16f 100644 --- a/_episodes/02-regression.md +++ b/_episodes/02-regression.md @@ -207,23 +207,6 @@ def measure_error(y: List[float], y_pred: List[float]) -> float: ~~~ {: .language-python} -~~~ -import math -def measure_error(data1, data2): - """Calculating RMSE (root mean square error) of model.""" - - assert len(data1) == len(data2) - err_total = 0 - for i in range(0, len(data1)): - # FIXME: Uncomment the below line and fill in the blank to add up the squared error for each observation -# err_total = err_total + ________ - err_total = err_total + (data1[i] - data2[i]) ** 2 - - err = math.sqrt(err_total / len(data1)) - return err -~~~ -{: .language-python} - Using this function, let's calculate the error of our model in term's of its RMSE. Since we are calculating RMSE on the same data that was used to fit or "train" the model, we call this error the model's training error. ~~~ from regression_helper_functions import measure_error @@ -335,6 +318,10 @@ Train RMSE = 0.32578 ~~~ {: .output} +Quick Quiz +1. Based on the above result, how much do we expect life expectancy to change each year? +2. What does an RMSE value of 0.33 indicate? + Let's see how the model performs in terms of its ability to predict future years. 
Run the `process_life_expectancy_data()` function again using the period 1950-1980 to train the model, and the period 2010-2016 to test the model's performance on unseen data.

~~~
m, c = process_life_expectancy_data("data/gapminder-life-expectancy.csv",
                                    "United Kingdom", [1950, 1980], [2010, 2016])
~~~
{: .language-python}

When we train our model using data between 1950 and 1980, we aren't able to accurately predict life expectancy in later decades. To explore this issue further, try out the exercise below.

> ## Models Fit Their Training Data — For Better Or Worse
> What happens to the test RMSE as you extend the training data set to include additional dates? Try out a couple of ranges (e.g., 1950:1990, 1950:2000, 1950:2005). Explain your observations.
>
> > ## Solution
> >
> {: .solution}
{: .challenge}

-> ## Predicting Life Expectancy
-> 1) Model Germany's predicted life expectancy between the years 1950 and 2000. What is the value of and c?
->
-> 2) Use the linear model you’ve just created to predict life expectancy in Germany for every year between 2001 and 2016. How accurate are your answers? If you worked for a pension scheme would you trust your answers to predict the future costs for paying pensioners?
-> > ## Solution
-> > ~~~
-> > m,c = process_life_expectancy_data("data/gapminder-life-expectancy.csv", "Germany", [1950, 2000])
-> >
-> > for x in range(2001,2017):
-> >     print(x,0.212219909502 * x - 346.784909502)
-> > ~~~
-> > {: .language-python}
-> >
-> > ~~~
-> > df = pd.read_csv('data/gapminder-life-expectancy.csv',index_col="Life expectancy")
-> > for x in range(2001,2017):
-> >     y = m*x + c
-> >     real = df.loc['Germany', str(x)]
-> >     print(x, "Predicted", y, "Real", real, "Difference", y-real)
-> >
-> > ~~~
-> > {: .language-python}
-> >
-> > Predicted answers
-> > ~~~
-> > 2001 Predicted 77.86712941175517 Real 78.4 Difference -0.5328705882448332
-> > 2002 Predicted 78.07934932125704 Real 78.6 Difference -0.5206506787429532
-> > 2003 Predicted 78.29156923075897 Real 78.8 Difference -0.5084307692410306
-> > 2004 Predicted 78.50378914026084 Real 79.2 Difference -0.6962108597391676
-> > 2005 Predicted 78.71600904976276 Real 79.4 Difference -0.683990950237245
-> > 2006 Predicted 78.92822895926463 Real 79.7 Difference -0.7717710407353735
-> > 2007 Predicted 79.1404488687665 Real 79.9 Difference -0.7595511312335077
-> > 2008 Predicted 79.35266877826842 Real 80.0 Difference -0.6473312217315765
-> > 2009 Predicted 79.56488868777029 Real 80.1 Difference -0.5351113122297022
-> > 2010 Predicted 79.77710859727222 Real 80.3 Difference -0.5228914027277796
-> > 2011 Predicted 79.98932850677409 Real 80.5 Difference -0.5106714932259138
-> > 2012 Predicted 80.20154841627601 Real 80.6 Difference -0.3984515837239826
-> > 2013 Predicted 80.41376832577788 Real 80.7 Difference -0.2862316742221225
-> > 2014 Predicted 80.6259882352798 Real 80.7 Difference -0.07401176472019699
-> > 2015 Predicted 80.83820814478167 Real 80.8 Difference 0.03820814478167733
-> > 2016 Predicted 81.0504280542836 Real 80.9 Difference 0.1504280542835943
-> > ~~~
-> > {: .output}
-> >
-> > Answers are between 0.15 years over and 0.77 years under the reality.
-> > If this was being used in a pension scheme it might lead to a slight under prediction of life expectancy and cost the pension scheme a little more than expected.
-> {: .solution}
-{: .challenge}

# Logarithmic Regression
-
We've now seen how we can use linear regression to make a simple model and use that to predict values, but what do we do when the relationship in the data isn't linear? As an example, let's take the relationship between income (GDP per Capita) and life expectancy. The gapminder website will [graph](https://www.gapminder.org/tools/#$state$time$value=2017&showForecast:true&delay:206.4516129032258;&entities$filter$;&dim=geo;&marker$axis_x$which=life_expectancy_years&domainMin:null&domainMax:null&zoomedMin:45&zoomedMax:84.17&scaleType=linear&spaceRef:null;&axis_y$which=gdppercapita_us_inflation_adjusted&domainMin:null&domainMax:null&zoomedMin:115.79&zoomedMax:144246.37&spaceRef:null;&size$domainMin:null&domainMax:null&extent@:0.022083333333333333&:0.4083333333333333;;&color$which=world_6region;;;&chart-type=bubbles) this for us.

From 4a2c8e002e3265beeac31443f14ca30601d4d16a Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 26 Sep 2023 19:46:03 -0500
Subject: [PATCH 05/14] Update 02-regression.md

---
 _episodes/02-regression.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index 45ce16f..19e8c18 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -385,9 +385,19 @@ Let's start by reading in the data. We'll collect GDP and life expectancy from t
~~~
from regression_helper_functions import read_data
+help(read_data)
+~~~
+{: .language-python}
+
+~~~
data = read_data("data/worldbank-gdp.csv",
                 "data/gapminder-life-expectancy.csv", "1980")
-data
+~~~
+{: .language-python}
+
+~~~
+print(data.shape)
+data.head()
~~~
{: .language-python}

From f77ab87a0c4c25a47ab876aeac23130874620471 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 26 Sep 2023 20:54:45 -0500
Subject: [PATCH 06/14] Update 02-regression.md

---
 _episodes/02-regression.md | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index 19e8c18..cd40bf8 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -401,6 +401,32 @@ data.head()
~~~
{: .language-python}

+Let's check out how GDP changes with life expectancy with a simple scatterplot.
+
+~~~
+import matplotlib.pyplot as plt
+plt.scatter(data['Life Expectancy'], data['GDP'])
+plt.xlabel('Life Expectancy')
+plt.ylabel('GDP');
+~~~
+{: .language-python}
+
+Clearly, this is not a linear relationship. Let's see how log(GDP) changes with life expectancy.
+
+We can use `apply()` to run a function on all elements of a pandas series (alternatively, np.log() can be used directly on a pandas series).
+~~~
+import math
+data['GDP'].apply(math.log)
+~~~
+{: .language-python}
+
+~~~
+plt.scatter(data['Life Expectancy'], data['GDP'].apply(math.log))
+plt.xlabel('Life Expectancy')
+plt.ylabel('log(GDP)');
+~~~
+{: .language-python}
+
### Model GDP vs Life Expectancy
Review the `process_lifeExpt_gdp_data()` function found in `regression_helper_functions.py`. Review the FIXME tags found in the function and try to fix them. Afterwards, use this function to model life-expectancy versus GDP for the year 1980.
~~~
def process_lifeExpt_gdp_data(gdp_file, life_expectancy_file, year):
    """Model and plot life expectancy vs GDP in a specific year."""

    data = read_data(gdp_file, life_expectancy_file, year)

    gdp = data["GDP"].tolist()
    gdp_log = data["GDP"].apply(math.log).tolist()
    life_exp = data["Life Expectancy"].tolist()

    m, c = least_squares([life_exp, gdp_log])

    # model predictions on transformed data
    gdp_preds = []
    # list for plotting model predictions on top of untransformed GDP. For this, we will need to transform the model's predictions.
    gdp_preds_transformed = []
    for x in life_exp:
        y_pred = m * x + c
        gdp_preds.append(y_pred)

        # FIXME: Uncomment the below line of code and fill in the blank
#        y_pred = math._______
        y_pred = math.exp(y_pred)
        gdp_preds_transformed.append(y_pred)

    # Plot both the transformed and untransformed data
    make_regression_graph(life_exp, gdp_log, gdp_preds, ['Life Expectancy', 'log(GDP)'])
    make_regression_graph(life_exp, gdp, gdp_preds_transformed, ['Life Expectancy', 'GDP'])

    train_error = measure_error(gdp_preds, gdp)
    print("Train RMSE =", format(train_error,'.5f'))
~~~
{: .language-python}

~~~
process_lifeExpt_gdp_data("data/worldbank-gdp.csv",
                          "data/gapminder-life-expectancy.csv", "1980")
~~~
{: .language-python}

On average, our model over- or underestimates GDP by 8741.12499. GDP is predicted to grow by .127 for each year added to life.

> ## Removing outliers from the data
> The correlation of GDP and life expectancy has a few big outliers that are probably increasing the error rate on this model. These are typically countries with very high GDP and sometimes not very high life expectancy. These tend to be either small countries with artificially high GDPs such as Monaco and Luxembourg or oil rich countries such as Qatar or Brunei. Kuwait, Qatar and Brunei have already been removed from this data set, but are available in the file worldbank-gdp-outliers.csv. Try experimenting with adding and removing some of these high income countries to see what effect it has on your model's error rate.
> Do you think it's a good idea to remove these outliers from your model?
> How might you do this automatically?
>
{: .challenge}

{% include links.md %}

From 27df1d5f0973a83f7ccf22b5992ad4e6738c2485 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 26 Sep 2023 21:06:47 -0500
Subject: [PATCH 07/14] Update 02-regression.md

---
 _episodes/02-regression.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index cd40bf8..539ca49 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -480,3 +480,14 @@
{: .challenge}

{% include links.md %}

### More on removing outliers
Whether or not it's a good idea to remove outliers from your model depends on the specific goals and context of your analysis. Here are some considerations:
1. Impact on Model Accuracy: Outliers can significantly affect the accuracy of a statistical model. They can pull the regression line towards them, leading to a less accurate representation of the majority of the data. Removing outliers may improve the model's predictive accuracy.
2. Data Integrity: It's important to consider whether the outliers are a result of data entry errors or represent legitimate data points. If they are due to errors, removing them can be a good idea to maintain data integrity.
3. Contextual Relevance: Consider the context of your analysis. Are the outliers relevant to the problem you're trying to solve? For example, if you're studying income inequality or the impact of extreme wealth on life expectancy, you may want to keep those outliers.
4. Model Interpretability: Removing outliers can simplify the model and make it more interpretable. However, if the outliers have meaningful explanations, removing them might lead to a less accurate model.

To automatically identify and remove outliers, you can use statistical methods like the Z-score or the IQR (Interquartile Range) method:
1. Z-Score: Calculate the Z-score for each data point and remove data points with Z-scores above a certain threshold (e.g., 3 or 4 standard deviations from the mean).
2. IQR Method: Calculate the IQR for the data and then remove data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles, respectively.

From bd55656f791c95ea43dc6bde0bece8a65237f9d3 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 26 Sep 2023 21:42:46 -0500
Subject: [PATCH 08/14] Update 03-introducing-sklearn.md

---
 _episodes/03-introducing-sklearn.md | 77 -----------------------------
 1 file changed, 77 deletions(-)

diff --git a/_episodes/03-introducing-sklearn.md b/_episodes/03-introducing-sklearn.md
index 9ddf883..a61bd9d 100644
--- a/_episodes/03-introducing-sklearn.md
+++ b/_episodes/03-introducing-sklearn.md
@@ -123,84 +123,7 @@ plt.show()
~~~
{: .language-python}

-> ## Comparing the Scikit learn and our own linear regression implementations.
-> Adjust both the original program and the sklearn version to calculate the life expectancy for Germany between 1950 and 2000. What are the values (m and c) of linear equation
-> linking date and life expectancy? Are they the same in both?
-> > ## Solution
-> > ~~~
-> > process_life_expectancy_data("../data/gapminder-life-expectancy.csv", "Germany", 1950, 2000)
-> > ~~~
-> > {: .language-python}
-> >
-> > m= 0.212219909502 c= -346.784909502
-> > They should be identical
-> {: .solution}
-{: .challenge}
-
-
-> ## Predicting Life Expectancy
-> Use the linear equation you've just created to predict life expectancy in Germany for every year between 2001 and 2016. How accurate are your answers?
-> If you worked for a pension scheme would you trust your answers to predict the future costs for paying pensioners?
-> > ## Solution
-> > ~~~
-> > for x in range(2001,2017):
-> >     print(x,0.212219909502 * x - 346.784909502)
-> > ~~~
-> > {: .language-python}
-> >
-> > Predicted answers:
-> > ~~~
-> > 2001 77.86712941150199
-> > 2002 78.07934932100403
-> > 2003 78.29156923050601
-> > 2004 78.503789140008
-> > 2005 78.71600904951003
-> > 2006 78.92822895901202
-> > 2007 79.140448868514
-> > 2008 79.35266877801604
-> > 2009 79.56488868751802
-> > 2010 79.77710859702
-> > 2011 79.98932850652199
-> > 2012 80.20154841602402
-> > 2013 80.41376832552601
-> > 2014 80.62598823502799
-> > 2015 80.83820814453003
-> > 2016 81.05042805403201
-> > ~~~
-> > Compare with the real values:
-> > ~~~
-> > df = pd.read_csv('../data/gapminder-life-expectancy.csv',index_col="Life expectancy")
-> > for x in range(2001,2017):
-> >     y = 0.215621719457 * x - 351.935837103
-> >     real = df.loc['Germany', str(x)]
-> >     print(x, "Predicted", y, "Real", real, "Difference", y-real)
-> > ~~~
-> > {: .language-python}
-> >
-> > ~~~
-> > 2001 Predicted 77.86712941150199 Real 78.4 Difference -0.532870588498
-> > 2002 Predicted 78.07934932100403 Real 78.6 Difference -0.520650678996
-> > 2003 Predicted 78.29156923050601 Real 78.8 Difference -0.508430769494
-> > 2004 Predicted 78.503789140008 Real 79.2 Difference -0.696210859992
-> > 2005 Predicted 78.71600904951003 Real 79.4 Difference -0.68399095049
-> > 2006 Predicted 78.92822895901202 Real 79.7 Difference -0.771771040988
-> > 2007 Predicted 79.140448868514 Real 79.9 Difference -0.759551131486
-> > 2008 Predicted 79.35266877801604 Real 80.0 Difference -0.647331221984
-> > 2009 Predicted 79.56488868751802 Real 80.1 Difference -0.535111312482
-> > 2010 Predicted 79.77710859702 Real 80.3 Difference -0.52289140298
-> > 2011 Predicted 79.98932850652199 Real 80.5 Difference -0.510671493478
-> > 2012 Predicted 80.20154841602402 Real 80.6 Difference -0.398451583976
-> > 2013 Predicted 80.41376832552601 Real 80.7 Difference -0.286231674474
-> > 2014 Predicted 80.62598823502799 Real 80.7 Difference -0.074011764972
-> > 2015 Predicted 80.83820814453003 Real 80.8 Difference 0.03820814453
-> > 2016 Predicted 81.05042805403201 Real 80.9 Difference 0.150428054032
-> > ~~~
-> {: .solution}
-{: .challenge}
-
-
## Polynomial regression
-
Linear regression obviously has its limits for working with data that isn't linear. Scikit-learn has a number of other regression techniques which can be used on non-linear data. Some of these (such as isotonic regression) will only interpolate data in the range of the training data and can't extrapolate beyond it. One non-linear technique that works with many types of data is polynomial regression.
This creates a polynomial

From 7149392f61a63d33b0c43a4200f4f9eb4bb35267 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Wed, 27 Sep 2023 06:53:48 -0500
Subject: [PATCH 09/14] Update 02-regression.md

---
 _episodes/02-regression.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index 539ca49..9e9b718 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -117,13 +117,14 @@ def get_model_predictions(x_data: List[float], m: float, c: float) -> List[float
    """
    linear_preds = []
    for x in x_data:
-        # FIXME: Uncomment below line and complete the line of code to get a model prediction from each x value
-#        y = _______
+        y_pred = None # FIXME: get a model prediction from each x value
+
        # ANSWER
-        y = m * x + c
+        y_pred = m * x + c

        #add the result to the linear_data list
-        linear_preds.append(y)
+        linear_preds.append(y_pred)
+
    return(linear_preds)

From 7149392f61a63d33b0c43a4200f4f9eb4bb35267 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Wed, 27 Sep 2023 06:57:14 -0500
Subject: [PATCH 10/14] Update 02-regression.md

---
 _episodes/02-regression.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index 9e9b718..ea8d62e 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -119,7 +119,7 @@ def get_model_predictions(x_data: List[float], m: float, c: float) -> List[float
    for x in x_data:
        y_pred = None # FIXME: get a model prediction from each x value

-        # ANSWER
+        # SOLUTION
        y_pred = m * x + c

@@ -197,9 +197,7 @@ def measure_error(y: List[float], y_pred: List[float]) -> float:
    assert len(y)==len(y_pred)
    err_total = 0
    for i in range(0,len(y)):
-        # add up the squared error for each observation
-        # FIXME: Uncomment the below line and fill in the blank to add up the squared error for each observation
-#        err_total = err_total + ________
+        # FIXME: add up the squared error for each observation
        # SOLUTION
        err_total = err_total + (y[i] - y_pred[i])**2

    err = math.sqrt(err_total / len(y))
    return err

From 56f770041f11558e70eec5e741e1113a05c8270b Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Wed, 27 Sep 2023 07:06:36 -0500
Subject: [PATCH 11/14] Update 02-regression.md

---
 _episodes/02-regression.md | 67 ++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index ea8d62e..05ce520 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -248,8 +248,26 @@ print(df.index) # There are 243 countries in this dataset.

Let's try to model life expectancy as a function of time for individual countries. To do this, review the 'process_life_expectancy_data()' function found in regression_helper_functions.py. Review the FIXME tags found in the function and try to fix them. Afterwards, use this function to model life expectancy in the UK between the years 1950 and 1980. How much does the model predict life expectancy to increase or decrease per year?

~~~
-def process_life_expectancy_data(filename, country, train_data_range, test_data_range=None):
-    """Model and plot life expectancy over time for a specific country. Model is fit to data spanning train_data_range, and tested on data spanning test_data_range"""
+def process_life_expectancy_data(
+    filename: str,
+    country: str,
+    train_data_range: Tuple[int, int],
+    test_data_range: Optional[Tuple[int, int]] = None) -> Tuple[float, float]:
+    """
+    Model and plot life expectancy over time for a specific country.
+
+    Args:
+        filename (str): The filename of the CSV data file.
+        country (str): The name of the country for which life expectancy is modeled.
+        train_data_range (Tuple[int, int]): A tuple representing the date range (start, end) used
+            for fitting the model.
+        test_data_range (Optional[Tuple[int, int]]): A tuple representing the date range
+            (start, end) for testing the model.
+
+    Returns:
+        Tuple[float, float]: A tuple containing the slope (m) and the y-intercept (c) of the
+        line of best fit.
+    """

    # Extract date range used for fitting the model
    min_date_train = train_data_range[0]
    max_date_train = train_data_range[1]

    # Read life expectancy data
    df = pd.read_csv(filename, index_col="Life expectancy")

    # get the data used to estimate line of best fit (life expectancy for specific country across some date range)
    # we have to convert the dates to strings as pandas treats them that way
    y_train = df.loc[country, str(min_date_train):str(max_date_train)]

    # create a list with the numerical range of min_date to max_date
    # we could use the index of life_expectancy but it will be a string
    # we need numerical data
    x_train = list(range(min_date_train, max_date_train + 1))

    # calculate line of best fit
    # FIXME: Uncomment the below line of code and fill in the blank
#    m, c = _______([x_train, y_train])
    m, c = least_squares([x_train, y_train])

    # Get model predictions for train data.
    # FIXME: Uncomment the below line of code and fill in the blank
#    y_train_pred = _______(x_train, m, c)
    y_train_pred = get_model_predictions(x_train, m, c)

    # FIXME: Uncomment the below line of code and fill in the blank
#    train_error = _______(y_train, y_train_pred)
    train_error = measure_error(y_train, y_train_pred)

    print("Train RMSE =", format(train_error,'.5f'))
    if test_data_range is None:
        make_regression_graph(x_train, y_train, y_train_pred, ['Year', 'Life Expectancy'])

    # Test RMSE
    if test_data_range is not None:
        min_date_test = test_data_range[0]
        if len(test_data_range)==1:
            max_date_test=min_date_test
        else:
            max_date_test = test_data_range[1]

        # extract test data (x and y)
        x_test = list(range(min_date_test, max_date_test + 1))
        y_test = df.loc[country, str(min_date_test):str(max_date_test)]

        # get test predictions
        y_test_pred = get_model_predictions(x_test, m, c)

        # measure test error
        test_error = measure_error(y_test, y_test_pred)
        print("Test RMSE =", format(test_error,'.5f'))

        # plot train and test data along with line of best fit
        make_regression_graph(x_train, y_train, y_train_pred,
                              ['Year', 'Life Expectancy'],
                              x_test, y_test, y_test_pred)

    return m, c
~~~
{: .language-python}

From 7c733e9397413b0f82c47ce7a7913fce60b34208 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Wed, 27 Sep 2023 07:11:26 -0500
Subject: [PATCH 12/14] Update 02-regression.md

---
 _episodes/02-regression.md | 44 ++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index 05ce520..e6b33cf 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -460,32 +460,48 @@ plt.ylabel('log(GDP)');

### Model GDP vs Life Expectancy
Review the `process_life_expt_gdp_data()` function found in `regression_helper_functions.py`. Review the FIXME tags found in the function and try to fix them. Afterwards, use this function to model life-expectancy versus GDP for the year 1980.
~~~
def process_life_expt_gdp_data(gdp_file: str, life_expectancy_file: str, year: str) -> None:
    """
    Model and plot the relationship between life expectancy and log(GDP) for a specific year.

    Args:
        gdp_file (str): The file path to the GDP data file.
        life_expectancy_file (str): The file path to the life expectancy data file.
        year (str): The specific year for which data is analyzed.

    Returns:
        None: The function generates and displays plots but does not return a value.
+ """ data = read_data(gdp_file, life_expectancy_file, year) gdp = data["GDP"].tolist() - gdp_log = data["GDP"].apply(math.log).tolist() + # FIXME: uncomment the below line and fill in the blank +# log_gdp = data["GDP"].apply(____).tolist() + # SOLUTION + log_gdp = data["GDP"].apply(math.log).tolist() + life_exp = data["Life Expectancy"].tolist() - m, c = least_squares([life_exp, gdp_log]) + m, c = least_squares([life_exp, log_gdp]) # model predictions on transformed data + log_gdp_preds = [] + # predictions converted back to original scale gdp_preds = [] - # list for plotting model predictions on top of untransformed GDP. For this, we will need to transform the model's predicitons. - gdp_preds_transformed = [] for x in life_exp: - y_pred = m * x + c - gdp_preds.append(y_pred) + log_gdp_pred = m * x + c + log_gdp_preds.append(log_gdp_pred) # FIXME: Uncomment the below line of code and fill in the blank -# y_pred = math._______ - y_pred = math.exp(y_pred) - gdp_preds_transformed.append(y_pred) +# gdp_pred = _____(log_gdp_pred) + # SOLUTION + gdp_pred = math.exp(log_gdp_pred) + gdp_preds.append(gdp_pred) - # Plot both the transformed and untransformed data - make_regression_graph(life_exp, gdp_log, gdp_preds, ['Life Expectancy', 'log(GDP)']) - make_regression_graph(life_exp, gdp, gdp_preds_transformed, ['Life Expectancy', 'GDP']) + # plot both the transformed and untransformed data + make_regression_graph(life_exp, log_gdp, log_gdp_preds, ['Life Expectancy', 'log(GDP)']) + make_regression_graph(life_exp, gdp, gdp_preds, ['Life Expectancy', 'GDP']) + # typically it's best to measure error in terms of the original data scale train_error = measure_error(gdp_preds, gdp) print("Train RMSE =", format(train_error,'.5f')) ~~~ From 8557e57246a3a731ea51f42eaa7ae716ff751b4c Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Wed, 27 Sep 2023 07:18:39 -0500 Subject: [PATCH 13/14] Update 03-introducing-sklearn.md --- _episodes/03-introducing-sklearn.md | 71 +++++++++++++++++------------ 1 file changed, 43 insertions(+), 28 deletions(-) diff --git a/_episodes/03-introducing-sklearn.md b/_episodes/03-introducing-sklearn.md index a61bd9d..b922d8a 100644 --- a/_episodes/03-introducing-sklearn.md +++ b/_episodes/03-introducing-sklearn.md @@ -40,7 +40,8 @@ The scikit-learn regression function is much more capable than the simple one we ~~~ def process_life_expectancy_data_sklearn(filename, country, train_data_range, test_data_range=None): - """Model and plot life expectancy over time for a specific country. Model is fit to data spanning train_data_range, and tested on data spanning test_data_range""" + """Model and plot life expectancy over time for a specific country. 
Model is fit to data + spanning train_data_range, and tested on data spanning test_data_range""" # Extract date range used for fitting the model min_date_train = train_data_range[0] @@ -49,38 +50,47 @@ def process_life_expectancy_data_sklearn(filename, country, train_data_range, te # Read life expectancy data df = pd.read_csv(filename, index_col="Life expectancy") - # get the data used to estimate line of best fit (life expectancy for specific country across some date range) + # get the data used to estimate line of best fit (life expectancy for specific + # country across some date range) + # we have to convert the dates to strings as pandas treats them that way - y_data_train = df.loc[country, str(min_date_train):str(max_date_train)] + y_train = df.loc[country, str(min_date_train):str(max_date_train)] # create a list with the numerical range of min_date to max_date # we could use the index of life_expectancy but it will be a string # we need numerical data - x_data_train = list(range(min_date_train, max_date_train + 1)) + x_train = list(range(min_date_train, max_date_train + 1)) # NEW: Sklearn functions typically accept numpy arrays as input. This code will convert our list data into numpy arrays (N rows, 1 column) - x_data_train = np.array(x_data_train).reshape(-1, 1) - y_data_train = np.array(y_data_train).reshape(-1, 1) - - # FIXME: calculate line of best fit using sklearn. OLD VERSION: m, c = least_squares([x_data_train, y_data_train]) - #ANSWER - regression = skl_lin.LinearRegression().fit(x_data_train, y_data_train) - m = regression.coef_[0][0] # coefs stored as in matrix as (n_targets, n_features), where n_targets is the number of variables in Y, and n_features is the number of variables in X - c = regression.intercept_[0] + x_train = np.array(x_train).reshape(-1, 1) + y_train = np.array(y_train).reshape(-1, 1) + + # OLD VERSION: m, c = least_squares([x_train, y_train]) + regression = None # FIXME: calculate line of best fit and extract m and c using sklearn. + regression = skl_lin.LinearRegression().fit(x_train, y_train) + + # extract slope (m) and intercept (c) + m = regression.coef_[0][0] # store coefs as (n_targets, n_features), where n_targets is the number of variables in Y, and n_features is the number of variables in X + c = regression.intercept_[0] # print model parameters print("Results of linear regression:") print("m =", format(m,'.5f'), "c =", format(c,'.5f')) - # FIXME: get model predictions for test data. OLD VERSION: y_preds_train = get_model_predictions(x_data_train, m, c) - #ANSWER - y_preds_train = regression.predict(x_data_train) + # OLD VERSION: y_train_pred = get_model_predictions(x_train, m, c) + y_train_pred = None # FIXME: get model predictions for test data. + y_train_pred = regression.predict(x_train) - # FIXME: calculate model train set error. OLD VERSION: train_error = measure_error(y_data_train, y_preds_train) - train_error = math.sqrt(skl_metrics.mean_squared_error(y_data_train, y_preds_train)) + # OLD VERSION: train_error = measure_error(y_train, y_train_pred) + train_error = None # FIXME: calculate model train set error. 
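    # SOLUTION: sklearn's mean_squared_error returns the MSE, so taking its square root gives the RMSE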
    train_error = math.sqrt(skl_metrics.mean_squared_error(y_train, y_train_pred))

    print("Train RMSE =", format(train_error,'.5f'))
    if test_data_range is None:
        make_regression_graph(x_train.tolist(),
                              y_train.tolist(),
                              y_train_pred.tolist(),
                              ['Year', 'Life Expectancy'])

    # Test RMSE
    if test_data_range is not None:
        min_date_test = test_data_range[0]
        if len(test_data_range)==1:
            max_date_test=min_date_test
        else:
            max_date_test = test_data_range[1]
        x_test = list(range(min_date_test, max_date_test + 1))
        y_test = df.loc[country, str(min_date_test):str(max_date_test)]

        # convert data to numpy array
        x_test = np.array(x_test).reshape(-1, 1)
        y_test = np.array(y_test).reshape(-1, 1)

        # get predictions
        y_test_pred = regression.predict(x_test)

        # measure error
        test_error = math.sqrt(skl_metrics.mean_squared_error(y_test, y_test_pred))
        print("Test RMSE =", format(test_error,'.5f'))

        # plot train and test data along with line of best fit
        make_regression_graph(x_train.tolist(), y_train.tolist(), y_train_pred.tolist(),
                              ['Year', 'Life Expectancy'],
                              x_test.tolist(), y_test.tolist(), y_test_pred.tolist())

    return m, c
~~~
{: .language-python}

From 38f638ae1d8e56cc95bb46acae7658ea26ccfa1a Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Wed, 27 Sep 2023 07:28:17 -0500
Subject: [PATCH 14/14] Update 03-introducing-sklearn.md

---
 _episodes/03-introducing-sklearn.md | 104 +++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/_episodes/03-introducing-sklearn.md b/_episodes/03-introducing-sklearn.md
index b922d8a..2c701c1 100644
--- a/_episodes/03-introducing-sklearn.md
+++ b/_episodes/03-introducing-sklearn.md
@@ -147,7 +147,7 @@ equation of the form y = a + bx + cx^2 + dx^3 etc. The more terms we add to the

Scikit-learn includes a polynomial modelling tool as part of its pre-processing library which we'll need to add to our list of imports.
1. Add the following line of code to the top of regression_helper_functions.py: `import sklearn.preprocessing as skl_pre`
2. Review the process_life_expectancy_data_poly() function and fix the FIXME tags.
3. Fit a linear model to a 5-degree polynomial transformation of x (dates). For a 5-degree polynomial applied to one feature (dates), we will get six new features or predictors: [1, x, x^2, x^3, x^4, x^5]

~~~
import sklearn.preprocessing as skl_pre
~~~
{: .language-python}

Fix the FIXME tags.
~~~
def process_life_expectancy_data_poly(degree: int,
                                      filename: str,
                                      country: str,
                                      train_data_range: Tuple[int, int],
                                      test_data_range: Optional[Tuple[int, int]] = None) -> None:
    """
    Model and plot life expectancy over time for a specific country using polynomial regression.

    Args:
        degree (int): The degree of the polynomial regression.
        filename (str): The CSV file containing the data.
        country (str): The name of the country for which the model is built.
        train_data_range (Tuple[int, int]): A tuple specifying the range of training data years (min_date, max_date).
        test_data_range (Optional[Tuple[int, int]]): A tuple specifying the range of test data years (min_date, max_date).

    Returns:
        None: The function displays plots but does not return a value.
    """

    # Extract date range used for fitting the model
    min_date_train = train_data_range[0]
    max_date_train = train_data_range[1]

    # Read life expectancy data
    df = pd.read_csv(filename, index_col="Life expectancy")

    # get the data used to estimate line of best fit (life expectancy for specific
    # country across some date range)
    # we have to convert the dates to strings as pandas treats them that way
    y_train = df.loc[country, str(min_date_train):str(max_date_train)]

    # create a list with the numerical range of min_date to max_date
    # we could use the index of life_expectancy but it will be a string
    # we need numerical data
    x_train = list(range(min_date_train, max_date_train + 1))

    # This code will convert our list data into numpy arrays (N rows, 1 column)
    x_train = np.array(x_train).reshape(-1, 1)
    y_train = np.array(y_train).reshape(-1, 1)

    # Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
    # for a 5-degree polynomial applied to one feature (dates), we will get six new features: [1, x, x^2, x^3, x^4, x^5]
    polynomial_features = None # FIXME: initialize polynomial features, [1, x, x^2, x^3, ...]
    polynomial_features = skl_pre.PolynomialFeatures(degree=degree)

    x_poly_train = None # FIXME: apply polynomial transformation to training data
    x_poly_train = polynomial_features.fit_transform(x_train)

    print('x_train.shape', x_train.shape)
    print('x_poly_train.shape', x_poly_train.shape)

    # Calculate line of best fit using sklearn.
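    # Note: polynomial regression is still a linear model; it is linear in the expanded
    # features [1, x, x^2, ...], which is why ordinary LinearRegression can fit it below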
    regression = None # fit regression model
    regression = skl_lin.LinearRegression().fit(x_poly_train, y_train)

    # Get model predictions for train data
    y_train_pred = regression.predict(x_poly_train)

    # Calculate model train set error
    train_error = math.sqrt(skl_metrics.mean_squared_error(y_train, y_train_pred))

    print("Train RMSE =", format(train_error,'.5f'))
    if test_data_range is None:
        make_regression_graph(x_train.tolist(),
                              y_train.tolist(),
                              y_train_pred.tolist(),
                              ['Year', 'Life Expectancy'])

    # Test RMSE
    if test_data_range is not None:
        min_date_test = test_data_range[0]
        if len(test_data_range)==1:
            max_date_test=min_date_test
        else:
            max_date_test = test_data_range[1]

        # index data
        x_test = list(range(min_date_test, max_date_test + 1))
        y_test = df.loc[country, str(min_date_test):str(max_date_test)]

        # convert to numpy array
        x_test = np.array(x_test).reshape(-1, 1)
        y_test = np.array(y_test).reshape(-1, 1)

        # transform x data using the polynomial features already fit on the training data
        x_poly_test = polynomial_features.transform(x_test)

        # get predictions on transformed data
        y_test_pred = regression.predict(x_poly_test)

        # measure error
        test_error = math.sqrt(skl_metrics.mean_squared_error(y_test, y_test_pred))
        print("Test RMSE =", format(test_error,'.5f'))

        # plot train and test data along with line of best fit
        make_regression_graph(x_train.tolist(), y_train.tolist(), y_train_pred.tolist(),
                              ['Year', 'Life Expectancy'],
                              x_test.tolist(), y_test.tolist(), y_test_pred.tolist())
~~~
{: .language-python}

Next, let's fit a polynomial regression model of life expectancy in the UK between the years 1950 and 1980. How many predictor variables are used to predict life expectancy in this model? What do you notice about the plot? What happens if you decrease the degree of the polynomial?

There are 6 predictor variables in a 5-degree polynomial: [1, x, x^2, x^3, x^4, x^5]. The model appears to fit the data quite well when a 5-degree polynomial is used. As we decrease the degree of the polynomial, the model fits the training data less precisely.
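A minimal usage sketch is below (assuming, as in the earlier episodes, that the life expectancy file lives at `data/gapminder-life-expectancy.csv` and that the country key in that file is "United Kingdom"):

~~~
# fit and plot a 5-degree polynomial model of UK life expectancy (1950-1980)
process_life_expectancy_data_poly(5, "data/gapminder-life-expectancy.csv",
                                  "United Kingdom", [1950, 1980])

# re-run with a lower degree to see a stiffer, less flexible fit
process_life_expectancy_data_poly(2, "data/gapminder-life-expectancy.csv",
                                  "United Kingdom", [1950, 1980])
~~~
{: .language-python}

Passing a test range as well (e.g., `[2010, 2016]`) reports a test RMSE, which is a good way to see how poorly a high-degree polynomial extrapolates beyond its training data.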