- Competition Here: https://www.kaggle.com/competitions/llm-detect-ai-generated-text
In recent years, Large Language Models (LLMs) have matured to the point where the text they generate is increasingly difficult to distinguish from human writing. The competition asked participants to develop a machine learning model capable of accurately detecting whether an essay was written by a student or generated by an LLM. The competition dataset included essays written by students and essays generated by various LLMs. This was a typical binary classification problem, with AUC as the evaluation metric.
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
- Official Dataset: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data?select=train_essays.csv
- Pile and Ultra: https://www.kaggle.com/datasets/canming/piles-and-ultra-data
- Human vs. LLM Text Corpus: https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus
- DAIGT V2 Train Dataset: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
-
First, a linear model built on a specific argumentative-essay dataset (DAIGT V2 Train Dataset) whose distribution is close to the competition data; a minimal sketch follows this list:
- a) Filtering out near-duplicate training data based on text similarity;
- b) Pre-training a tokenizer on the test set texts, then tokenizing both the training and test set texts so the statistical features share a consistent vocabulary;
- c) After tokenization, using TF-IDF to build n-gram (range 3 to 5) statistical feature vectors;
- d) Feeding these features into an ensemble classifier composed of MultinomialNB and SGDClassifier for training, then predicting the results.
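A minimal sketch of steps a) to d), not the exact solution code: file names, column names ("text", "label"), the tokenizer vocabulary size, and the classifier hyperparameters are all assumptions, and exact-duplicate removal stands in for the similarity-based filtering of step a).

```python
import pandas as pd
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# File and column names are assumptions, not the exact solution paths.
train = pd.read_csv("train_v2_drcat_02.csv")   # DAIGT V2 train dataset
test = pd.read_csv("test_essays.csv")          # competition test set

# a) filter repetitive data; exact-duplicate removal shown as a simple
#    stand-in for similarity-based filtering
train = train.drop_duplicates(subset="text").reset_index(drop=True)

# b) train a BPE tokenizer on the *test* texts so train and test features
#    share one vocabulary (vocab size is illustrative)
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel()
tok.train_from_iterator(
    test["text"].tolist(),
    trainer=trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"]))

def to_tokens(texts):
    # Return each text as a list of BPE tokens for the vectorizer below.
    return [enc.tokens for enc in tok.encode_batch(texts)]

# c) TF-IDF over token n-grams in the (3, 5) range; identity preprocessor
#    and tokenizer let us feed in pre-tokenized lists directly
vec = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True,
                      analyzer="word", tokenizer=lambda x: x,
                      preprocessor=lambda x: x, token_pattern=None)
X_train = vec.fit_transform(to_tokens(train["text"].tolist()))
X_test = vec.transform(to_tokens(test["text"].tolist()))

# d) soft-voting ensemble of MultinomialNB and a probabilistic SGD classifier
#    (modified_huber loss so SGDClassifier supports predict_proba)
clf = VotingClassifier(
    estimators=[("nb", MultinomialNB(alpha=0.02)),
                ("sgd", SGDClassifier(loss="modified_huber", max_iter=5000))],
    voting="soft")
clf.fit(X_train, train["label"])
linear_preds = clf.predict_proba(X_test)[:, 1]
```

Fitting the tokenizer on the test texts keeps the n-gram vocabulary aligned with exactly what appears at inference time, which is what step b) means by a consistent vocabulary.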
-
Second, an LLM fine-tuned on large datasets (Pile and Ultra, Human vs. LLM Text Corpus):
- a) Collecting open-source data from the internet, drawn from both human writing and LLM-generated dialogue;
- b) After light preprocessing of this large-scale data, fine-tuning the binary text classification model deberta-v3-small on it, saving the trained weights, and then running inference on Kaggle; a fine-tuning sketch follows this list.
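A minimal fine-tuning sketch under stated assumptions: the merged corpus file name, its columns ("text", "label"), and all hyperparameters are illustrative, not the values used in the solution.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Merged human/LLM corpus; file and column names are assumptions.
df = pd.read_csv("combined_corpus.csv")  # columns: text, label

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
ds = Dataset.from_pandas(df).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small", num_labels=2)

args = TrainingArguments(
    output_dir="deberta_detector",      # weights saved here for upload to Kaggle
    per_device_train_batch_size=16,     # illustrative hyperparameters
    num_train_epochs=2,
    learning_rate=2e-5,
)
Trainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer).train()
```

The weights saved in `output_dir` are then attached as a Kaggle dataset, since code competition notebooks run offline.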
-
Third, open-source language models (results shared by other participants, used mainly in the ensemble): models fine-tuned on third-party datasets, whose predictions are then produced on Kaggle; an inference sketch follows the link below.
- Reference Model Here: https://www.kaggle.com/code/mustafakeser4/train-detectai-distilroberta-0-927
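A sketch of offline inference with such a shared checkpoint; the local checkpoint path is hypothetical (weights must be attached as a Kaggle dataset), and class 1 is assumed to be the "generated" label.

```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "/kaggle/input/detectai-distilroberta/checkpoint"  # path assumed
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

test = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/test_essays.csv")

probs = []
with torch.no_grad():
    for i in range(0, len(test), 32):  # simple fixed-size batching
        batch = tokenizer(test["text"][i:i + 32].tolist(), truncation=True,
                          padding=True, max_length=512, return_tensors="pt")
        logits = model(**batch).logits
        # probability of class 1 (assumed: LLM-generated)
        probs.extend(torch.softmax(logits, dim=-1)[:, 1].tolist())
```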
-
Fourth, ensemble prediction: the predictions of the three modeling approaches are rank-scaled and then blended with weights tuned against the public leaderboard to produce the final prediction; a sketch follows.
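A minimal sketch of the rank-scaled weighted blend; the input names and the weights are placeholders (the actual weights were tuned on the leaderboard, and are not given in this writeup).

```python
import numpy as np
from scipy.stats import rankdata

def rank_scale(p: np.ndarray) -> np.ndarray:
    # Map predictions to (0, 1] by rank so differently calibrated models
    # contribute on a common scale.
    return rankdata(p) / len(p)

def blend(linear_preds, deberta_preds, distilroberta_preds,
          w=(0.6, 0.3, 0.1)):  # placeholder weights, tuned on the leaderboard
    # Weighted fusion of the three approaches' rank-scaled predictions.
    return (w[0] * rank_scale(linear_preds)
            + w[1] * rank_scale(deberta_preds)
            + w[2] * rank_scale(distilroberta_preds))
```

Because AUC depends only on the ordering of predictions, converting each model's scores to ranks before blending puts models with very different calibration on a common scale without changing any single model's AUC.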