Update markdown instructions for Penguin classification exercise (1) #66
Conversation
" - Print ``data`` and recognise that ``load_penguins`` has returned the dataframe.\n", | ||
"- Analyse which features it might make sense to use in order to classify the species of the penguins.\n", | ||
" - You can print the column names using ``pd.DataFrame.keys()``\n", | ||
" - You can also obtain useful statical information on the dataset using ``pd.DataFrame.Series.describe()``" |
statistical
I think 'Consider' is probably fine here, but either is fine.
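For readers following along, the exploration described in this cell might look roughly like the sketch below. It assumes the data come from the `palmerpenguins` package (which provides `load_penguins`) and uses one column name purely for illustration; the exercise notebook may phrase this differently.

```python
from palmerpenguins import load_penguins  # assumed source of load_penguins

data = load_penguins()                      # returns a pandas DataFrame
print(data.head())                          # inspect the first few rows

print(data.keys())                          # column names, e.g. bill_length_mm, flipper_length_mm
print(data["bill_length_mm"].describe())    # summary statistics for a single column
```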
@@ -108,23 +108,25 @@
"source": [
"### Task 2: creating a ``torch.utils.data.Dataset``\n",
"\n",
"All PyTorch dataset objects are subclasses of the ``torch.utils.data.Dataset`` class. To make a custom dataset, create a class which inherits from the ``Dataset`` class, implement some methods (the Python magic (or dunder) methods ``__len__`` and ``__getitem__``) and supply some data.\n",
"To be able to use Pytorch functionalities, we need to make the dataset compatible with Pytorch. We do it using PyTorch's Dataset class called ``torch.utils.data.Dataset``. \n",
PyTorch
I still think we should keep the bit on `torch.utils.data.Dataset`. I really like the new comment, however. Perhaps this should be the first sentence:

"To be able to use Pytorch functionalities, we need to make the dataset compatible with Pytorch. We do it using PyTorch's Dataset class called ``torch.utils.data.Dataset``. \n",

Followed by:

"All PyTorch dataset objects are subclasses of the ``torch.utils.data.Dataset`` class. To make a custom dataset, create a class which inherits from the ``Dataset`` class, implement some methods (the Python magic (or dunder) methods ``__len__`` and ``__getitem__``) and supply some data.\n",
Sorry, I realise this is fine. It's just been restructured.
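To make that wording concrete, a minimal self-contained sketch of such a subclass is shown below. It is not the exercise's actual `PenguinDataset`; the class name, constructor arguments and data are made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ToyPenguinDataset(Dataset):
    """Minimal sketch of a custom dataset; the exercise's class differs."""

    def __init__(self, features, targets):
        self.features = features   # array-like of shape (num_items, num_features)
        self.targets = targets     # array-like of integer class labels

    def __len__(self):
        # how many items the dataset holds
        return len(self.features)

    def __getitem__(self, idx):
        # return one (input, target) pair, converted to tensors
        x = torch.as_tensor(self.features[idx], dtype=torch.float32)
        y = torch.as_tensor(self.targets[idx], dtype=torch.long)
        return x, y


data_set = ToyPenguinDataset(np.random.rand(10, 4), np.random.randint(0, 3, 10))
print(len(data_set), data_set[0])
```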
Some minor comments.
These changes are helpful for understanding the material. I've made a few comments here and there.
- Your git profile configuration is a bit messed up. Can you please fix this?
- You've probably already done this, but when committing the solutions, restart the kernel, clear the outputs and run everything through once more.
- Make sure to restart the kernel and clear the outputs for the exercises before committing.
- I think adding some steps (as python comments) in task 9, task 10 and task 11 would make it easier to complete. See how I've done this in PR Simpler PenguinDataset #71 for the first set of tasks. Maybe we should open another PR for these changes?
Some other things worth mentioning either in the session itself or perhaps in a later PR:
- What are the implications of the batch size (i.e. too large and the model is prone to overfitting / too small and we get inefficient use of hardware)?
- Task 6: We could mention feature scaling / normalisation using batch norm.
- Optimiser (could include the lr here).
- What `model.train()` and `model.eval()` are doing. Just noticed there's a mention of model.train() and model.eval() already.
- Whether the train & valid routines lead to good code or not (repetition - often this is packaged in a training class / alternatively use PyTorch Lightning).
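On the `model.train()` / `model.eval()` point, a tiny illustrative snippet (with a made-up model, not taken from the exercise) showing what the switch affects:

```python
import torch
from torch import nn

# Hypothetical model, only to illustrate the mode switch.
model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 3))

model.train()           # training mode: batch norm uses per-batch statistics, dropout is active
print(model.training)   # True

model.eval()            # evaluation mode: batch norm uses running (population) statistics
print(model.training)   # False
```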
" - We allow the model to learn directly from the training set—i.e. we fit the function to these data.\n", | ||
" - During training, we monitor the model's performance on the validation set in order to check how it's doing on unseen data. Normally, people use the validation performance to determine when to stop the training process.\n", | ||
"- Note: Here we create a training and validation set.\n", | ||
" - We allow the model to learn from the training setn i.e. we fit the function to these data.\n", |
setn
has this been fixed?
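On the training/validation split this cell describes, one simple way to produce the two splits from the raw dataframe is sketched below. This is an assumed approach for illustration only, not the exercise's actual code (which builds the split inside its dataset class).

```python
from palmerpenguins import load_penguins  # assumed source of load_penguins

data = load_penguins().dropna()

# Hold out roughly 20% of the rows as a validation set; the model never trains on these.
valid_df = data.sample(frac=0.2, random_state=42)
train_df = data.drop(valid_df.index)

print(len(train_df), len(valid_df))
```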
" - ``train``— A boolean variable determining if the model returns the training or validation split (``True`` for training).\n", | ||
" - ``x_tfms``— A ``Compose`` object with functions which will convert the raw input to a tensor. This argument is _optional_.\n", | ||
" - ``x_tfms``— A ``Compose`` object with functions which will convert the raw input to a tensor. This argument is _optional_. Remember Pytorch deals with tensors only.\n", |
How about: Recall that PyTorch deals with `torch.Tensor`s only.
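As a rough illustration of what an `x_tfms` pipeline does (assuming `Compose` behaves like `torchvision.transforms.Compose`; the callables and feature values below are made up):

```python
import pandas as pd
import torch
from torchvision.transforms import Compose  # assumption: a Compose like the one in the exercise

# Each step is a plain callable applied in order, ending in a float32 torch.Tensor,
# since PyTorch models operate on tensors only.
x_tfms = Compose([
    lambda row: row.to_numpy(dtype=float),                  # pandas row -> numpy array
    lambda arr: torch.as_tensor(arr, dtype=torch.float32),  # numpy array -> tensor
])

example_row = pd.Series([39.1, 18.7, 181.0, 3750.0])  # made-up feature values
print(x_tfms(example_row))   # tensor([  39.1000,   18.7000,  181.0000, 3750.0000])
```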
" - The number of items we supply at once is called the batch size.\n", | ||
" - The ``DataLoader`` can also randomly shuffle the data each epoch (when training).\n", | ||
" - It allows us to load different mini-batches in parallel, which can be very useful for larger datasets and images that can't all fit in memory at once.\n", | ||
" - The number of observations we pass at once is called the batch size.\n", |
I prefer 'items' and 'supply' here
@@ -323,19 +359,19 @@
"### Task 5: Creating ``DataLoaders``—and why\n",
"\n",
"- Once we have created a ``Dataset`` object, we wrap it in a ``DataLoader``.\n",
" - The ``DataLoader`` object allows us to put our inputs and targets in mini-batches, which makes for more efficient training.\n",
" - The ``DataLoader`` utility allows us to load and put inputs and targets in mini-batches, which makes iterating over the dataset during training more efficient.\n",
I think I prefer the old comment. It's slightly more succinct.
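Whichever wording is kept, the `DataLoader` usage itself is standard PyTorch; a small sketch with made-up data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative dataset: 100 items, 4 features each, 3 classes (numbers are made up).
data_set = TensorDataset(torch.randn(100, 4), torch.randint(0, 3, (100,)))

# Wrap the dataset in a DataLoader: mini-batches of 16 items, reshuffled every epoch.
train_loader = DataLoader(data_set, batch_size=16, shuffle=True)

for batch_inputs, batch_targets in train_loader:
    print(batch_inputs.shape, batch_targets.shape)   # torch.Size([16, 4]) torch.Size([16])
    break
```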
@@ -495,14 +534,18 @@
"source": [
"### Task 8: Selecting an optimiser\n",
"\n",
"Optimiser updates the model's weights and biases on the basis of the gradient of the loss function w.r.t each weight and bias. The aim of weight updation is to reduce the loss. The weight updation goes on until the loss is minimised as much as possible. \n",
- The optimiser updates the model's weights...
- I wouldn't use the word 'updation', use 'the weight update' or weight update process/procedure.
- again, it's a little vague. "The weight update procedure proceeds until the loss is minimised subject to some stopping criterion?"
"While we talked about stochastic gradient descent in the slides, most people use the so-called [Adam optimiser](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).\n", | ||
"\n", | ||
"You can think of it as a more complex and improved implementation of SGD." | ||
"You can think of it as a slightly more complex and improved implementation of SGD." |
I think I prefer the old comment.
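Whichever phrasing is kept, the optimiser setup follows the standard PyTorch pattern; a short sketch with a stand-in model (not the exercise's network):

```python
import torch
from torch import nn

model = nn.Linear(4, 3)   # stand-in model, for illustration only

# Adam, like SGD, updates each parameter using the gradient of the loss with respect to
# that parameter; the learning rate (lr) controls the size of each update step.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative update step:
loss = model(torch.randn(8, 4)).sum()   # dummy "loss" just to produce gradients
loss.backward()                         # compute gradients of the loss w.r.t. the parameters
optimiser.step()                        # update the weights and biases to reduce the loss
optimiser.zero_grad()                   # reset gradients before the next step
```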
" - This activation function is good for classifcation when the result is one of ``A or B or C``.\n", | ||
" - It's bad if you even want to assign two classification to one images—say a photo of a dog _and_ a cat.\n", | ||
" - It turns the raw outputs, or logits, into \"psuedo probabilities\", and we take our prediction to be the most probable class.\n", | ||
"\n", | ||
"\n", | ||
"- The purpose of training loop is that for each epoch, all the mini-batches are sequentially passed through the model to make predictions, compute the loss, and update the model parameters and repeat it until we receive a satisfactory performance.\n", |
The purpose of the training loop
and update the model parameters. This is repeated until we receive a satisfactory performance.
This sentence is useful but it needs to be reworded.
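However the sentence ends up worded, the loop it describes follows the usual PyTorch pattern; a compact sketch with a stand-in model, data and loss (none of it taken from the exercise):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(4, 3)                                     # stand-in model
loss_fn = nn.CrossEntropyLoss()                             # suits "A or B or C" classification
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 4), torch.randint(0, 3, (64,))), batch_size=16, shuffle=True
)

for epoch in range(5):                       # repeat until performance is satisfactory
    model.train()
    for inputs, targets in train_loader:     # pass each mini-batch through the model
        preds = model(inputs)                # forward pass: predictions (logits)
        loss = loss_fn(preds, targets)       # compute the loss
        optimiser.zero_grad()
        loss.backward()                      # gradients of the loss w.r.t. the parameters
        optimiser.step()                     # update the model parameters
```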
" - This activation function is good for classifcation when the result is one of ``A or B or C``.\n", | ||
" - It's bad if you even want to assign two classification to one images—say a photo of a dog _and_ a cat.\n", | ||
" - It turns the raw outputs, or logits, into \"psuedo probabilities\", and we take our prediction to be the most probable class.\n", | ||
"\n", | ||
"\n", | ||
"- The purpose of training loop is that for each epoch, all the mini-batches are sequentially passed through the model to make predictions, compute the loss, and update the model parameters and repeat it until we receive a satisfactory performance.\n", | ||
"- The validation loop works similarly, but the purpose is to validate the model training's performance without any weight updates. Hence, we are not calculating any gradient during validation. This is made sure using the ``no_grad`` decorator with validation loop.\n", |
The `@no_grad` decorator disables gradient tracking in the function. We could point out that it is optional.
We also don't need gradients in the `batch_level_accuracy()` function.
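A minimal sketch of a validation pass using `no_grad` as a decorator (the model, data and helper below are stand-ins; the exercise's `batch_level_accuracy()` is not reproduced here):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(4, 3)                      # stand-in model
loss_fn = nn.CrossEntropyLoss()
valid_loader = DataLoader(
    TensorDataset(torch.randn(32, 4), torch.randint(0, 3, (32,))), batch_size=16
)


@torch.no_grad()                             # optional, but disables gradient tracking: no weight updates
def validate_one_epoch(model, loader):
    model.eval()
    total_loss = 0.0
    for inputs, targets in loader:
        preds = model(inputs)                # forward pass only
        total_loss += loss_fn(preds, targets).item()
    return total_loss / len(loader)


print(validate_one_epoch(model, valid_loader))
```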
" - Print ``data`` and recognise that ``load_penguins`` has returned the dataframe.\n", | ||
"- Analyse which features it might make sense to use in order to classify the species of the penguins.\n", | ||
" - You can print the column names using ``pd.DataFrame.keys()``\n", | ||
" - You can also obtain useful statical information on the dataset using ``pd.DataFrame.Series.describe()``" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 'Consider' is probably fine here, but either is fine.
This will also need to be rebased on #71.
Thanks for the comments @ma595. I have made the changes you highlighted.
Missed a few other changes during the first review, and resolved some other requests as they were really good additions.
"\n", | ||
"\n", | ||
"Note: we are going to use batch normalisation layers in our network, which don't work if the batch size is one. This can happen on the last batch, if we don't choose a batch size that evenly divides the number of items in the data set. To avoid this, we can set the ``drop_last`` argument to ``True``. The last batch, which will be of size ``len(data_set) % batch_size`` gets dropped, and the data are reshuffled. This is only relevant during the training process - validation will use population statistics." | ||
"Note: we are going to use batch normalisation in our network, which doesn't work if the batch size is one. This can happen on the last batch, if we don't choose a batch size that evenly divides the number of items in the data set. To avoid this, we can set the ``drop_last`` argument to ``True``. The last batch, which will be of size ``len(data_set) % batch_size`` gets dropped, and the data are reshuffled. This is only relevant during the training process - validation will use population statistics." |
What is the reason for changing from `batch normalisation layers` to `batch normalisation`?
It is just to place more focus on batch normalisation as a concept. I can revert the change if that looks better.
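Either way, the `drop_last` behaviour described in this cell can be seen with a quick made-up example:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 33 items with a batch size of 16 would leave a final batch of one item (33 % 16 = 1),
# which batch normalisation cannot handle in training mode; drop_last=True discards it.
data_set = TensorDataset(torch.randn(33, 4), torch.randint(0, 3, (33,)))
train_loader = DataLoader(data_set, batch_size=16, shuffle=True, drop_last=True)

print(len(train_loader))   # 2 batches; the leftover single item is dropped
```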
@@ -520,17 +558,20 @@
"\n",
"- Before we jump in and write these loops, we must first choose an activation function to apply to the model's outputs.\n",
" - Here we are going to use the softmax activation function: see [the PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html).\n",
" - For those of you who've studied physics, you may be remininded of the partition function in thermodynamics.\n",
I quite like this comment and would like it to be kept.
Corrected
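For anyone unfamiliar with softmax, a tiny numeric illustration (made-up logits, not from the exercise):

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up raw model outputs for one item
probs = torch.softmax(logits, dim=1)        # "pseudo probabilities": non-negative, sum to 1
prediction = probs.argmax(dim=1)            # take the most probable class as the prediction

print(probs)        # roughly tensor([[0.786, 0.175, 0.039]])
print(prediction)   # tensor([0])
```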
Amendments to the Penguin classification exercise to aid readability.