
Bias vs variance
s2t2 committed Sep 17, 2024
1 parent c82b365 commit 2910040
Showing 4 changed files with 19 additions and 261 deletions.
Binary file added docs/images/bias-variance-tradeoff.ppm
22 changes: 6 additions & 16 deletions docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd
When preparing features (`x` values) for training machine learning models, the main requirement is that the features are numeric.
So if we have categorical or textual data, we will need to use a **data encoding** strategy to represent the data in a different way.


For categorical data, we'll use either an ordinal or one-hot encoding strategy, depending on whether or not the categories have an inherent order. For time-series data, we can use time step encoding.


## Ordinal Encoding for Categorical Data

If the categories have an inherent order, where one category means more or less than another, then we will convert the categories into a linear range of numbered values.
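
For example, here is a minimal sketch using pandas, with a hypothetical `"size"` column whose categories have a natural order:

```python
import pandas as pd

# hypothetical data, with categories that have a natural order
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# map each category to its position on the ordered scale
size_scale = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_scale)
df
```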

## One-hot Encoding for Categorical Data


If the data is truly categorical, where there is no ordinal relationship present, we will perform "one-hot" encoding.
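
For example, a minimal sketch using the pandas `get_dummies` function, with a hypothetical `"genre"` column:

```python
import pandas as pd

# hypothetical data, with no inherent order among the categories
df = pd.DataFrame({"genre": ["rock", "jazz", "pop", "jazz"]})

# one binary column per category (1 if the row belongs to that category, else 0)
pd.get_dummies(df["genre"], dtype=int)
```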


## Time Step Encoding for Time-series Data

For example, given a `DataFrame` of time-series data with a `"date"` column, we can sort the rows chronologically and assign each a sequential time step number:

```python
# sort chronologically, then number the rows from 1 to n
df.sort_values(by="date", ascending=True, inplace=True)
df["time_step"] = range(1, len(df) + 1)
df
```



## Bag of Words for Natural Language Processing
17 changes: 13 additions & 4 deletions docs/notes/predictive-modeling/ml-foundations/generalization.qmd
In technical terms, overfitting happens when a model has low bias but high variance.

Common causes of overfitting include:

+ Using a model that is too complex for the given data (e.g. deep neural networks on small datasets).
+ Training the model for too long without proper regularization.
+ Using too many features or irrelevant features.

In technical terms, underfitting happens when a model has high bias but low variance.

Common causes of underfitting include:

+ Using a model that is too simple for the task at hand (e.g. linear regression for non-linear data).
+ Not training the model long enough or with sufficient data.
+ Using too few features or ignoring important features.

Symptoms of underfitting:

+ High error on both the training and testing datasets.

### Finding a Balance

In the context of generalization, bias and variance represent two types of errors that can affect a model's performance. **Bias** refers to errors introduced by overly simplistic models that fail to capture the underlying patterns in the data, leading to underfitting. A high-bias model makes strong assumptions about the data, resulting in consistently poor predictions on both the training and test sets. On the other hand, **variance** refers to errors caused by overly complex models that fit the training data too closely, capturing noise along with the signal. This leads to overfitting, where the model performs well on the training data but poorly on unseen test data.


![Illustration of bias vs variance, using a bulls-eye. Source: [Gudivada 2017 Data](https://www.researchgate.net/publication/318432363_Data_Quality_Considerations_for_Big_Data_and_Machine_Learning_Going_Beyond_Data_Cleaning_and_Transformations).](../../../images/bias-variance-tradeoff.ppm)

The challenge in machine learning is to find the right balance between bias and variance, often called the bias-variance tradeoff, in order to achieve good generalization. A model with the right balance will generalize well to new data by capturing the essential patterns without being too sensitive to specific details in the training data.
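
As a concrete illustration, here is a minimal sketch (assuming scikit-learn and hypothetical synthetic data) that fits polynomial models of increasing complexity to the same noisy dataset and compares training vs testing error. The simplest model tends to show high error on both sets (high bias), while the most complex model tends to show low training error but much higher testing error (high variance):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# hypothetical noisy non-linear data
rng = np.random.default_rng(99)
x = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = np.sin(2 * x).ravel() + rng.normal(0, 0.3, 40)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)

# degree 1 tends to underfit (high bias); degree 15 tends to overfit (high variance)
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(x_train))
    test_error = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree}: train MSE={train_error:.3f}, test MSE={test_error:.3f}")
```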


![Three different models, illustrating trade-offs between overfitting and underfitting. Source: [AWS Machine Learning](https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html).](../../../images/aws-underfitting-overfitting.png)






