Detecting-Generated-Text----BERT-Model

Introduction

This repository contains the code and methodology for the competition task of identifying whether an essay was written by a student or generated by a large language model (LLM). The dataset comprises approximately 10,000 essays in the training set and about 9,000 essays in the hidden test set.

Methodology

  1. Data Preprocessing: Cleaned and preprocessed the text data to prepare it for model training, including tokenization, stop-word removal, and punctuation handling.
  2. Model Training: Fine-tuned a pre-trained BERT model on the training dataset, addressing the class imbalance in the data to prevent overfitting to the majority class.
  3. Evaluation: Used cross-validation to evaluate model performance and guide adjustments.
  4. Prediction: Applied the trained model to the test set to classify each essay as either student-written or LLM-generated.

Files

notebook.ipynb: The main notebook containing the code for data preprocessing, model training, evaluation, and prediction.
train_essays.csv: The training dataset with essays.
test_essays.csv: The dummy test dataset for validation purposes.
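The preprocessing in step 1 can be sketched as follows. This is a minimal stand-alone illustration, not the notebook's exact code; the stop-word list here is a hypothetical placeholder (the actual list, e.g. NLTK's, would be larger):

```python
import re
import string

# Illustrative stop-word list only; the real pipeline would use a fuller set.
STOP_WORDS = {"a", "an", "the", "is", "was", "by", "or", "and", "of", "to"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The essay was written by a student."))
# → ['essay', 'written', 'student']
```

Note that when fine-tuning BERT itself, tokenization is normally delegated to the model's own WordPiece tokenizer; the cleaning above applies to any classical-NLP preprocessing done beforehand.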
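One common way to address the class imbalance mentioned in step 2 is inverse-frequency class weighting, where the loss penalizes errors on the rarer class more heavily. A small sketch of computing such weights (this is one standard technique, not necessarily the exact method used in the notebook):

```python
from collections import Counter

def class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights: weight(c) = N / (K * count(c)),
    so rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Hypothetical imbalance: 9 student-written essays (0) vs 1 generated (1).
weights = class_weights([0] * 9 + [1] * 1)
print(weights)  # class 1 gets a much larger weight than class 0
```

In a PyTorch training loop these weights would typically be passed to the loss function (e.g. as the `weight` argument of `CrossEntropyLoss`).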
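The cross-validation in step 3 partitions the training data into k folds, holding each fold out once for validation. A minimal index-splitting sketch (in practice a library helper such as scikit-learn's `KFold`/`StratifiedKFold` would likely be used instead):

```python
def kfold_indices(n: int, k: int = 5):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation.

    The first n % k folds get one extra sample so all n items are covered.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

for train_idx, val_idx in kfold_indices(10, k=5):
    print(len(train_idx), val_idx)  # 8 training indices, 2 validation indices
```

Each fold's validation score is then averaged to estimate generalization performance before fitting the final model on the full training set.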

Acknowledgments

This work leverages the BERT model, a powerful NLP tool developed by Google. Thanks to the competition organizers for providing the dataset and the opportunity to participate.

License

This project is licensed under the MIT License. See the LICENSE file for details.
