LinReg

By Mohammed Alqumairi

About

LinReg is a regression problem solver that uses linear regression fitted with gradient descent. It takes in a dataset and calculates the weights and bias of the hyperplane of best fit. These can then be used to make predictions about unknown samples.

The equation for the hypothesis (h) of a multivariate linear regression is:

h = b + (w1 * x1) + (w2 * x2) + (w3 * x3) + ... + (wn * xn)

Where:

  • b is the bias (also known as the intercept)
  • w1 to wn are the weights, one for each feature (each feature corresponds to a column in the dataset)
  • x1 to xn are the feature values for a specific sample
  • n is the number of features

LinReg calculates the bias term and a weight for each feature. The bias is output as a float. The weights are output as a list of length n, with one weight per feature, in the order those features appear in the dataset.
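
As an illustration, here is a minimal Haskell sketch of the hypothesis, assuming the weights and a sample's feature values are given as plain lists (the function name is chosen for illustration only; the real LinReg code works on its own Matrix type):

-- Hypothesis of a linear model: bias plus the weighted sum of the features.
hypothesis :: Double -> [Double] -> [Double] -> Double
hypothesis b ws xs = b + sum (zipWith (*) ws xs)

-- e.g. hypothesis 1.0 [2.0, 3.0] [4.0, 5.0] == 1.0 + 2.0*4.0 + 3.0*5.0 == 24.0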

This README holds instructions detailing how to use LinReg. For a summarized version, see the shortened README.

Contents

  • How to Run the Program
  • Formatting the Input file
  • Verifying the Results
  • How to Tune Learning Rate and Epochs
  • Tests
  • About the Code
  • Encountered Challenge
  • Third Party Contributions

How to Run the Program

Step 1: Ensure your current directory is the "coursework" folder that this file is contained in. Then run:
runhaskell Main

Step 2: You will be greeted with a welcome message, and asked to type the name of the text file holding your dataset. Let's use an example dataset found in input.txt. Type:
input.txt

To use your own input file, ensure it is formatted correctly.

Step 3: You will be asked to enter the learning rate you wish to use for gradient descent. Type:
0.5

Step 4: You will then be asked to enter the number of epochs you wish the gradient descent algorithm to run for. Assuming your learning rate is small enough, the more epochs you run, the more accurate the result will be; this, however, comes at the cost of performance. Let's go with 1000 epochs for now. Type:
1000

Notice: The learning rate and number of epochs may need to be tuned for different datasets. For information on how to tune them, read below.

Result: After some time (proportional to the size of the dataset and the number of epochs chosen), LinReg will output a value for the bias and a list of weights. You can compare these results with those calculated by Scikit-Learn using the Python Verifier. As you can see, LinReg can achieve very accurate results!

Formatting the Input file

The input file will hold your dataset. This will be a text file (.txt) that you must place inside the "coursework" folder.

The dataset in the input file is a table, formatted using the following rules:

  • Every row of the table corresponds to a new line in the text file
  • Every column is separated with a single space
  • The first row is the header row, holding the title for each column. The final column is your target column.
  • For the functioning of the Python Verifier, the final column must be titled "y"
  • Apart from the first header row in the text file, all values in the dataset must be numerical

Feel free to check the sample input files (input.txt, input1.txt, input2.txt, input3.txt, and boston.txt) if you require further clarification.
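
For illustration only, a hypothetical input file with two features and a target column could look like the following (the values are made up; the sample files listed above are the authoritative reference):

x1 x2 y
1.0 2.0 5.1
2.0 1.5 6.4
3.0 0.5 7.2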

Verifying the Results

The output from LinReg may be evaluated by comparing it with results obtained using an established statistics library. For the examiner's convenience, I have included a "Verifier.py" file which makes use of Scikit-Learn for this purpose. This is not necessary for the functioning of LinReg, and the examiner is free to select another method for verification purposes.

To use Verifier.py, ensure you have Python, Numpy, Pandas, and Scikit-Learn installed. The script was tested with Python 3.8.3 in an Anaconda environment. Then follow the steps below.

Step 1: Ensure your current directory is the "coursework" folder. Run:
python Verifier.py

Step 2: You will be prompted to enter the name of the text file containing your dataset. Type that in. For example:
input.txt

Result: The Verifier will output a value for the bias, and an array of weights, using Scikit-Learn. You can compare these with the results obtained using LinReg.

How to Tune Learning Rate and Epochs

Since LinReg uses gradient descent to fit its parameters, the learning rate and the number of epochs to iterate for are the two hyperparameters that the user needs to tune.

The learning rate determines the step size of the gradient descent algorithm and is a number between 0 and 1. If the learning rate is too small, the algorithm learns slowly, so you will need more epochs to reach the optimal result. If the learning rate is too high, the algorithm will diverge and output incorrect parameters (when this happens, the output weights list usually contains NaN values). Ideally, you want the largest learning rate that does not cause the algorithm to diverge. I recommend starting with a learning rate of 0.5. If the output weights contain NaN values, try again with a smaller learning rate (e.g. 0.4), and repeat until the algorithm stops diverging.

The number of epochs is the number of gradient descent iterations you wish the algorithm to run for. Assuming your chosen learning rate is small enough, the more epochs you run, the more accurate the result will be. However, this comes at the cost of performance: more epochs mean a longer execution time. I recommend starting with 1000 epochs. After LinReg finishes executing, run it again with a greater number of epochs (e.g. 10000). If there is a significant disparity between the results of the two runs, repeat this process with an even greater number of epochs until the disparity is no longer significant.

Tests

I have tabulated the outputs of LinReg for each of the sample input files. The outputs obtained from Scikit-Learn on the same inputs are included for comparison.

input.txt (learning rate 0.5, epochs 1000)
  • LinReg bias: 4.880177
  • LinReg weights: [0.198294, 0.7188662, 0.8524282]
  • SciKit-Learn bias: 4.880165322068354
  • SciKit-Learn weights: [0.19829517, 0.71886712, 0.85242901]

input1.txt (learning rate 0.5, epochs 1000)
  • LinReg bias: -0.24000283
  • LinReg weights: [0.92800057]
  • SciKit-Learn bias: -0.24000767368750608
  • SciKit-Learn weights: [0.92800153]

input2.txt (learning rate 0.5, epochs 1000)
  • LinReg bias: 1.9693025
  • LinReg weights: [0.9346864, 0.89539075, 0.67222714]
  • SciKit-Learn bias: 1.969294057080127
  • SciKit-Learn weights: [0.93468698, 0.89539156, 0.67222764]

input3.txt (learning rate 0.5, epochs 1000)
  • LinReg bias: 0.16018465
  • LinReg weights: [1.1546878, 0.9532798, 0.9994971, 0.95564324, 0.9078755, 1.1276116, 1.0319495, 0.9178561, 0.8091981, 1.0298774]
  • SciKit-Learn bias: 0.16016152436949227
  • SciKit-Learn weights: [1.15468819, 0.95328017, 0.99949774, 0.95564382, 0.9078763, 1.12761249, 1.03194998, 0.91785616, 0.80919865, 1.02987786]

I then tried LinReg on a real dataset: the Boston house prices dataset (retrieved from Kaggle). After some tuning, I found that a good learning rate was around 0.4, and that gradient descent required around 10,000 epochs to reach a near-optimal result.

Warning: this test takes significant time to execute, given the size of the dataset and the number of epochs required to reach the results.

boston.txt (learning rate 0.4, epochs 10000)
  • LinReg bias: 36.146996
  • LinReg weights: [-0.107844375, 4.641067e-2, 2.0260321e-2, 2.6879923, -17.630888, 3.8292494, 5.622324e-4, -1.4714662, 0.30482042, -1.2311089e-2, -0.94805914, 9.349744e-3, -0.5237158]
  • SciKit-Learn bias: 36.45948838509019
  • SciKit-Learn weights: [-0.108011358, 4.64204584e-2, 2.05586264e-2, 2.68673382, -17.7666112, 3.80986521, 6.92224640e-4, -1.47556685, 0.306049479, -1.23345939e-2, -0.952747232, 9.31168327e-3, -0.524758378]

About the Code

The main function (in Main.hs) is the entry point to LinReg, and the only place where IO interaction occurs. The user chooses the text file holding the dataset, and selects their learning rate (lr) and the number of epochs (e) to run gradient descent for.

The text file is passed to parseInput (ParseInput.hs), which parses it into two matrices:

  • an m by n matrix for the features (where m is the number of samples, and n is the number of features)
  • an m by 1 matrix for the target (where m is the number of samples)

Every matrix, alongside its shape, is held by the Matrix data type (defined in Matrices.hs).

The two matrices are wrapped by the Dataset type (defined in Datasets.hs), and this dataset, alongside the user's chosen lr and e, is wrapped by the Input type (defined in Input.hs). This improves the modularity of the program. A rough sketch of how these types might fit together is shown below.
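
The field names in this sketch are assumptions for illustration only; the actual definitions live in Matrices.hs, Datasets.hs, and Input.hs:

-- Hypothetical field names, for illustration only.
data Matrix  = Matrix  { shape :: (Int, Int), entries :: [[Double]] }  -- a matrix alongside its shape
data Dataset = Dataset { features :: Matrix, target :: Matrix }        -- m x n features, m x 1 target
data Input   = Input   { dataset :: Dataset, lr :: Double, e :: Int }  -- dataset plus the user's lr and e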

The matrices are then passed to linReg (Regression.hs), which applies the linear regression algorithm using gradient descent. This makes use of various higher-order functions in Matrices.hs (e.g. eleWiseOp), as well as recursion (gradDescent in Regression.hs), to calculate the weights and bias. The weights and bias are stored as a Coefficients datatype (Coefficients.hs) and printed to the console.
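
To illustrate the idea (this is not the actual gradDescent implementation), here is a simplified, list-based sketch of batch gradient descent for linear regression; the real code operates on the Matrix type and its higher-order functions instead of plain lists:

import Data.List (transpose)

-- One gradient descent step over all samples, for the mean squared error cost.
-- samples: a list of (feature values, target) pairs; (b, ws): current bias and weights.
step :: Double -> [([Double], Double)] -> (Double, [Double]) -> (Double, [Double])
step lr samples (b, ws) = (b - lr * gradB, zipWith (\w g -> w - lr * g) ws gradWs)
  where
    m      = fromIntegral (length samples)
    errs   = [ b + sum (zipWith (*) ws xs) - y | (xs, y) <- samples ]  -- prediction errors
    gradB  = sum errs / m
    gradWs = [ sum (zipWith (*) errs col) / m | col <- transpose [ xs | (xs, _) <- samples ] ]

-- Repeat the step for e epochs, starting from some initial bias and weights.
descend :: Int -> Double -> [([Double], Double)] -> (Double, [Double]) -> (Double, [Double])
descend 0 _  _       params = params
descend e lr samples params = descend (e - 1) lr samples (step lr samples params)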

Encountered Challenge

A challenge I encountered was the performance of LinReg on datasets where there is a significant disparity between the scales of different columns. For example, in the Boston house prices dataset (boston.txt), the NOX column holds values between 0 and 1, whereas the TAX column holds values in the hundreds. In such cases, LinReg required an extremely small lr (around 1e-8) and an extremely large e (around 100,000) to reach near-optimal results, translating to extremely long execution times (nearly half an hour!).

I was able to drastically improve performance by employing feature scaling. This involves scaling each column by dividing the values therein by the maximum value in that column, which ensures that all columns hold values between 0 and 1. LinReg performs feature scaling on all inputs automatically, before passing the dataset to the linReg function (in Regression.hs). The output coefficients are then "unscaled" to remove the impact of feature scaling on the output.
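
A minimal sketch of this idea, assuming a feature column is given as a plain list (LinReg itself works on its Matrix type): each column is divided by its maximum, and the corresponding learned weight is later unscaled by dividing it by that same maximum.

-- Scale a column by its maximum value, returning the maximum so that the
-- corresponding weight can be unscaled afterwards.
scaleColumn :: [Double] -> (Double, [Double])
scaleColumn col = (colMax, map (/ colMax) col)
  where colMax = maximum col

-- Remove the effect of scaling from a learned weight:
-- w * (x / colMax) == (w / colMax) * x
unscaleWeight :: Double -> Double -> Double
unscaleWeight colMax w = w / colMax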

After feature scaling, a good result can be obtained for the Boston house prices dataset with an lr of 0.4 and just 10,000 epochs. While this still takes some time to process (given the size of the dataset, and the fact that I am using gradient descent instead of the normal equation), it represents a noteworthy improvement over the performance of LinReg without feature scaling.

Third Party Contributions