This repository contains the code and methodology for the competition task of identifying whether an essay was written by a student or generated by a large language model (LLM). The dataset comprises approximately 10,000 essays in the training set and about 9,000 essays in the hidden test set.
- Data Preprocessing: Cleaned and preprocessed the text data for model training, including tokenization, stop-word removal, and punctuation handling.
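A minimal sketch of the preprocessing step. The exact pipeline lives in notebook.ipynb; the `preprocess` helper and the tiny stop-word list below are illustrative assumptions (a real run would likely use a fuller list, e.g. from NLTK).

```python
import re
import string

# Small illustrative stop-word set -- an assumption for this sketch,
# not the list used in the notebook.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Remove all ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The essay, in short, is about AI."))
# -> ['essay', 'short', 'about', 'ai']
```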
- Model Training: Fine-tuned a pre-trained BERT model on the training dataset, addressing class imbalance so the model would not simply favor the majority class.
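One common way to handle class imbalance when fine-tuning BERT is to weight the loss inversely to class frequency. The helper below is a sketch of that idea (the label encoding 0 = student, 1 = LLM is an assumption); the resulting weights could be passed to `torch.nn.CrossEntropyLoss(weight=...)` during fine-tuning.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c),
    where N is the sample count and K the number of classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical imbalanced label distribution for illustration:
# 0 = student-written, 1 = LLM-generated.
labels = [0] * 900 + [1] * 100
weights = class_weights(labels)
print(weights)  # the minority class gets the larger weight
```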
- Evaluation: Used cross-validation to estimate model performance and guide adjustments.
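The cross-validation scheme can be sketched as a plain k-fold index splitter. In practice the notebook would more likely use scikit-learn's `KFold` or `StratifiedKFold`; this pure-Python version just shows the mechanics.

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation
    over n samples, after a seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in kfold_indices(10, k=5):
    print(len(train), len(val))  # each fold holds out 1/k of the data
```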
- Prediction: Applied the trained model to the test set to classify each essay as student-written or LLM-generated.

Files
- notebook.ipynb: The main notebook containing the code for data preprocessing, model training, evaluation, and prediction.
- train_essays.csv: The training dataset of essays.
- test_essays.csv: A dummy test dataset for validation purposes.
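The final prediction step amounts to thresholding the model's per-essay probability of being LLM-generated. A minimal sketch, assuming a 0.5 threshold and the label encoding 1 = LLM-generated, 0 = student-written (both assumptions, not confirmed by the notebook):

```python
def classify(probs, threshold=0.5):
    """Map LLM-probability scores to hard labels:
    1 = LLM-generated, 0 = student-written (assumed encoding)."""
    return [1 if p >= threshold else 0 for p in probs]

# Hypothetical model outputs for three essays.
print(classify([0.92, 0.10, 0.55]))  # -> [1, 0, 1]
```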
This work leverages BERT, a pre-trained transformer language model developed by Google. Thanks to the competition organizers for providing the dataset and the opportunity to participate.
This project is licensed under the MIT License. See the LICENSE file for details.