Name: Tianzuo Zhang
My contact info: Twitter, LinkedIn, Kaggle; WeChat: dvzhangtz
I also upload my homework to GitHub.
Build an essay scoring system for English Language Learners.
As a Kaggle user (my account), I found a very interesting competition, and I hope to solve this problem as my homework.
The goal of this competition is to assess the language proficiency of 8th-12th grade English Language Learners (ELLs). Utilizing a dataset of essays written by ELLs will help to develop proficiency models that better support all students.
In the dataset given by the competition, every essay has been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5.
Our task is to predict the score for each of the six measures for the essays in the test set.
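For orientation, here is a minimal sketch of the targets and the evaluation. The column names follow the competition's train.csv, and the metric is MCRMSE (mean columnwise root mean squared error), the competition's official score; treat the exact file paths as assumptions about this repo's layout.

```python
import numpy as np
import pandas as pd

# The six analytic measures scored for every essay.
TARGETS = ["cohesion", "syntax", "vocabulary",
           "phraseology", "grammar", "conventions"]

def mcrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean columnwise RMSE: average the per-measure RMSEs."""
    per_column_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(np.mean(per_column_rmse))

train = pd.read_csv("input/feedback-prize-english-language-learning/train.csv")
print(train[TARGETS].describe())  # scores lie in [1.0, 5.0] in 0.5 steps

# Sanity check: MCRMSE of a constant prediction at the per-column mean.
baseline = np.tile(train[TARGETS].mean().to_numpy(), (len(train), 1))
print("constant-mean baseline MCRMSE:", mcrmse(train[TARGETS].to_numpy(), baseline))
```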
With the dataset in hand, we come to our method. Naturally, we use BERT or another Transformer-based model for this NLP problem.
Transformer models are pre-trained on general-domain corpora; RoBERTa, for example, is trained on BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories. The data distribution of such corpora may differ considerably from that of our task, essays written by English language learners.
What is more, this competition provides a very small training set; if I fine-tune my BERT model on it directly, it is very likely to overfit.
Therefore the idea is to further pre-train the Transformer on domain-specific data with the masked language modeling (MLM) and next sentence prediction (NSP) objectives.
As a result, we need some domain-specific data.
So we come to the other datasets. The first is one I scraped from Lang8, a multilingual language-learning platform where learners post blog entries written in the language they are studying.
The second dataset is from another Kaggle competition (feedback-prize-2021), which is very similar to this one.
Using these two datasets, I continue pre-training my BERT model and then fine-tune it on the dataset given by this competition.
conda create -n kaggle python=3.7
conda activate kaggle
pip install kaggle
pip install lxml
pip install IPython
pip install matplotlib
pip install scikit-learn
pip install iterative-stratification==0.1.7
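The iterative-stratification package is presumably here for cross-validation: with six score columns, a plain KFold can give folds with skewed score distributions, so we stratify over all six measures at once. A minimal sketch (the actual split logic lives in the training scripts; passing the raw scores directly is common Kaggle practice, binning them first is an alternative):

```python
import pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

TARGETS = ["cohesion", "syntax", "vocabulary",
           "phraseology", "grammar", "conventions"]
train = pd.read_csv("input/feedback-prize-english-language-learning/train.csv")

# Build 5 folds whose score distributions match across all six measures.
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train["fold"] = -1
for fold, (_, valid_idx) in enumerate(mskf.split(train, train[TARGETS])):
    train.loc[valid_idx, "fold"] = fold
print(train["fold"].value_counts())
```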
You can skip this part and use the already-downloaded dataset directly.
Otherwise, we download the data through the Kaggle API:
First, log in to Kaggle and go to Account.
Second, click "Create New API Token" to get a kaggle.json file.
Third, copy this file to ~/.kaggle (that is where I put it, since I use Ubuntu) and restrict its permissions with chmod 600 ~/.kaggle/kaggle.json.
The static dataset is stored in ./input/.
Do not forget to join the feedback-prize-english-language-learning and feedback-prize-2021 competitions first!
Again, you can skip this part and use the already-downloaded datasets directly. Downloading is easy:
mkdir -p input/feedback-prize-english-language-learning
mkdir -p input/feedback-prize-2021
cd input/feedback-prize-english-language-learning
kaggle competitions download -c feedback-prize-english-language-learning
unzip feedback-prize-english-language-learning.zip
cd ../feedback-prize-2021
kaggle competitions download -c feedback-prize-2021
unzip feedback-prize-2021.zip
The static dataset is stored in ./data/static.
python scraper.py --scrape
python scraper.py --static ./data/static/lang8.csv
python scraper.py
- Returns the complete scraped dataset.
- A kind reminder: scraping is very, very slow, since the website blocks crawlers' IP addresses and I did not use an IP pool, so the scraper sleeps about half a minute after crawling each page. If you must run it, use tmux to keep it running.
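For reference, the throttling described above amounts to something like the following; the URL list and parsing are placeholders, not the real internals of scraper.py.

```python
import time
import requests

def crawl_pages(urls, delay_seconds=30):
    """Fetch pages one at a time, sleeping between requests so the
    site does not block our IP (no IP pool is used)."""
    pages = []
    for url in urls:
        response = requests.get(url, timeout=30)
        if response.ok:
            pages.append(response.text)
        time.sleep(delay_seconds)  # ~half a minute per page, hence very slow
    return pages
```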
We should preprocess our data to make it usable for further pre-training.
python continuePretrainDataPre.py
This produces ./input/mlm_data.csv.
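Conceptually, the preprocessing pools the raw texts from all sources into a single text column that the MLM step can consume. A hedged sketch (the column names are my assumptions; continuePretrainDataPre.py is the authoritative version):

```python
import pandas as pd

# Pool essays and blog posts into one text column for MLM pre-training.
lang8 = pd.read_csv("data/static/lang8.csv")
essays = pd.read_csv("input/feedback-prize-english-language-learning/train.csv")

texts = pd.concat(
    [lang8["text"],          # assumed column name in the scraped data
     essays["full_text"]],   # essay text column in the competition data
    ignore_index=True,
).dropna().drop_duplicates()

texts.to_frame("text").to_csv("input/mlm_data.csv", index=False)
```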
Use the preprocessed data to further pre-train our model, and then fine-tune it:
python continuePretrain.py
python pretrainFtFeedback2.py
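continuePretrain.py performs the continued pre-training. Below is a minimal HuggingFace sketch of the same idea, covering only the MLM objective; the backbone name and hyper-parameters are illustrative, not the script's actual settings.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "roberta-base"  # illustrative choice of backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenize the pooled domain-specific texts produced above.
dataset = Dataset.from_pandas(pd.read_csv("input/mlm_data.csv"))
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

# The collator masks 15% of tokens on the fly (the MLM objective).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_ckpt", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm_ckpt")
tokenizer.save_pretrained("mlm_ckpt")
```

pretrainFtFeedback2.py then handles the fine-tuning stage. Its core is a regression model that predicts all six measures at once; a hedged sketch, again with illustrative names and without the training loop:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EssayScorer(nn.Module):
    """Transformer encoder plus a linear head for the six measures."""
    def __init__(self, checkpoint="mlm_ckpt"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 6)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Mean-pool over non-padding tokens, then regress the six scores.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)

tokenizer = AutoTokenizer.from_pretrained("mlm_ckpt")
model = EssayScorer("mlm_ckpt")
batch = tokenizer(["An example essay ..."], return_tensors="pt",
                  truncation=True, max_length=512)
scores = model(batch["input_ids"], batch["attention_mask"])  # shape (1, 6)
```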