This repository contains the code and files implementing various machine learning methods for the second project of the machine learning class. The project's main goal is to develop and evaluate different machine learning algorithms on a Twitter dataset for binary sentiment analysis. The repository contains files to create submissions for AIcrowd and to recreate the plots shown in the report.
- Aly Elbindary
- André Schakkal
- Peter Harmouch
Group Name: APASHE
The structure of the folder is the following:
ML-PROJECT2-2-APASHE/
├── data/
│ ├── test_data.txt
│ ├── train_neg.txt
│ ├── train_neg_full.txt
│ ├── train_pos.txt
│ └── train_pos_full.txt
├── plots/
│ ├── word_embeddings/
│ │ ├── cooc_files/
│ │ ├── encoded_dfs/
│ │ ├── saved_results/
│ │ ├── vocab_cut_files/
│ │ ├── vocab_full_files/
│ │ ├── vocab_pkl_files/
│ │ └── word_embeddings_plots.ipynb
│ ├── tf_idf/
│ │ └── tf_idf_plots.ipynb
│ ├── transformers/
│ │ └── transformers_plots.ipynb
│ └── general_plots.ipynb
├── submissions/
│ └── submission.csv
├── weights/
│ ├── best_model_weights_1.pt
│ ├── best_model_weights_2.pt
│ └── best_model_weights_3.pt
├── run.ipynb
├── train.ipynb
├── plots.ipynb
├── helpers.py
├── requirements.txt
└── README.md
The most important files and folders are the following:
- `run.ipynb`: This Jupyter notebook imports our best model (BERTweet) with its weights, applies it to the dataset, and creates a CSV file suitable for submission on AIcrowd.
- `train.ipynb`: This Jupyter notebook is dedicated to training our best-performing model. It creates a txt file that can then be imported in `run.ipynb`.
- `helpers.py`: This Python script contains utility functions used throughout the project.
- `plots/`: Since we explore three different NLP techniques in this project, we dedicate a subfolder to each method (namely word embeddings, TF-IDF, and Transformers), each containing a `*_plots.ipynb` notebook that generates the corresponding plots. These plots include hyperparameter-search plots, model-comparison plots, and visualization plots.
The project data should be placed in the folder called `data/`. You can download the dataset from the following URL: aicrowd text classification challenge dataset.
Since the best model was achieved by employing an ensemble of the three best BERTweet models, the weights of these three models need to be downloaded. The weights can be downloaded from here. Each weight file is approximately 527 MB, totaling 1.581 GB for all three models. Once downloaded, place each weight file into the `weights/` folder of your project directory.
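As a rough illustration of how such an ensemble combines its members, one common approach is to average the per-class logits produced by each model and take the argmax. The function and toy arrays below are a hypothetical sketch of this idea, not the exact code used in the notebooks:

```python
import numpy as np

def ensemble_predict(logits_list):
    """Average the logits of several models and take the per-sample argmax.

    logits_list: list of (n_samples, n_classes) arrays, one per model.
    Returns an array of predicted class indices.
    """
    avg_logits = np.mean(np.stack(logits_list), axis=0)
    return np.argmax(avg_logits, axis=1)

# Toy example with three "models" scoring two samples over two classes;
# the models mildly disagree on the second sample.
m1 = np.array([[2.0, -1.0], [0.5, 0.4]])
m2 = np.array([[1.5, -0.5], [0.1, 0.9]])
m3 = np.array([[1.0,  0.0], [0.2, 0.8]])
preds = ensemble_predict([m1, m2, m3])  # -> array([0, 1])
```

Averaging logits (rather than hard-voting on labels) lets a confident model outweigh two uncertain ones, which is why it is a popular default for small ensembles.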
To run the code in this project, follow these steps:
- Make sure you have the necessary libraries installed in your Python environment (see `requirements.txt` for the full list). You can set up a Conda environment with the required libraries using the following steps:

```shell
conda create --name ml-project python=3.9
conda activate ml-project
pip install -r requirements.txt
```
- If you only want to test our model, you can use an ensemble of the existing weights of the three best BERTweet pretrained models. Obtain the prediction file by running `run.ipynb`, making sure to run all the cells in order. Submit this file on AIcrowd to get the accuracy on the test set.
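For reference, a submission file of the kind produced by the run notebook can be sketched as below. The `Id,Prediction` header and the ±1 label convention are assumptions based on the usual AIcrowd text-classification format; the file name used here is illustrative:

```python
import csv

def write_submission(predictions, path):
    """Write an AIcrowd-style submission CSV: an Id column (1-based)
    and a Prediction column with labels in {-1, 1}."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, pred in enumerate(predictions, start=1):
            writer.writerow([i, pred])

# Toy usage with three dummy predictions.
write_submission([1, -1, 1], "submission_example.csv")
```

If the grader rejects the file, check the exact header names and label encoding required by the challenge page.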
- If you want to train the model yourself, open and run the `train.ipynb` notebook, making sure to run all the cells in order. After training, you can use the trained model by opening and running the `run.ipynb` notebook to create a submission file for AIcrowd.
- Use the notebooks in the `plots/` folder to generate the relevant plots.