dvzhang/feedback-prize-english-language-learning


Name: Tianzuo Zhang

My contact info: Twitter · LinkedIn · WeChat: dvzhangtz · Kaggle

I also upload my homework to GitHub.

0. Background

0.1. Goal:

Make an article scoring system for English Language Learners.

0.2. Motivation:

As a Kaggle user (my account), I found a very interesting competition. I hope to solve this problem in my homework.

The goal of this competition is to assess the language proficiency of 8th-12th grade English Language Learners (ELLs). Utilizing a dataset of essays written by ELLs will help to develop proficiency models that better support all students.

In the dataset given by the competition, every essay has been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5.
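Since the labels all lie on this fixed grid, raw model outputs can optionally be snapped onto it. A minimal sketch of that idea (the snapping itself is a modeling choice, not something the competition requires; `snap_to_scale` is a hypothetical helper name):

```python
import numpy as np

def snap_to_scale(preds):
    """Clip raw predictions into [1.0, 5.0] and round to the nearest 0.5,
    matching the label grid used by the six analytic measures."""
    preds = np.clip(np.asarray(preds, dtype=float), 1.0, 5.0)
    return np.round(preds * 2) / 2
```

For example, raw outputs of 0.3, 2.74, and 5.9 would snap to 1.0, 2.5, and 5.0.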

Our task is to predict the score of each of the six measures for the essays given in the test set.

0.3 Method description

With this dataset in hand, we come to our method. Naturally, we use BERT or another Transformer-based model to solve this NLP problem.

Transformer models are pre-trained on general-domain corpora, but our task's data distribution may differ from what a given model was trained on; e.g. RoBERTa is trained on BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories.

What is more, this competition provides a very small train set; if I fine-tune my BERT model on it directly, it will overfit.

Therefore the idea is: we can further pre-train the Transformer with the masked language model and next sentence prediction tasks on domain-specific data.


As a result, we need some domain-specific data.

This is where the other datasets come in. The first is the dataset I scraped from Lang-8, a multilingual language-learning platform where many language learners post blog entries written in the language they are learning.

The second dataset is from another Kaggle competition, which is very similar to this one.

Using these two datasets, I continue pre-training my BERT model and then fine-tune it on the dataset given by this competition.

1. Set up your env

conda create -n kaggle python=3.7
conda activate kaggle

pip install kaggle
pip install lxml
pip install IPython
pip install matplotlib 
pip install scikit-learn
pip install iterative-stratification==0.1.7


2. Dataset

You can skip this part and use the already-downloaded dataset directly.

2.1 Competition Dataset

We download it via the Kaggle API.

2.1.1 How to use the Kaggle API

First, log in to Kaggle and go to Account. Second, create a new API token, which gives you a kaggle.json file.

Third, copy this file to ~/.kaggle in your home directory. For example, I copy it to my ~/.kaggle, since I use Ubuntu.
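On Linux the steps above amount to something like the following (the download path is an assumption; adjust it to wherever your browser saved kaggle.json):

```shell
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json  # assumed download location
chmod 600 ~/.kaggle/kaggle.json                   # the Kaggle CLI warns if the token is world-readable
```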

The static dataset is stored in ./input/.

2.1.2 How to download data from Api

Do not forget to join the feedback-prize-english-language-learning and feedback-prize-2021 competitions first!

Or you can skip this part and use the already-downloaded dataset. Downloading is easy:

mkdir -p input/feedback-prize-english-language-learning
mkdir -p input/feedback-prize-2021

cd input/feedback-prize-english-language-learning
kaggle competitions download -c feedback-prize-english-language-learning
unzip feedback-prize-english-language-learning.zip

cd ../feedback-prize-2021
kaggle competitions download -c feedback-prize-2021
unzip feedback-prize-2021.zip

2.2 Data scraped from Lang-8 website

The static dataset is stored in ./data/static.

2.2.1 scraper.py --scrape

python scraper.py --scrape
  • This will scrape the data but return only 5 entries of each dataset.

2.2.2 scraper.py --static <path_to_dataset>

python scraper.py --static ./data/static/lang8.csv
  • This will return the static dataset scraped from the web and stored in a database or CSV file.

2.2.3 scraper.py

python scraper.py 
  • Return the complete scraped datasets.
  • Kind reminder: this is very slow, since the website blocks crawlers' IPs and I did not use an IP pool, so I sleep about half a minute after crawling each page. If you must run it, use tmux to keep it running.
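The throttling idea can be sketched as follows. `crawl` and `fetch` are hypothetical names for illustration; the real scraper.py parses Lang-8 pages rather than taking a generic fetch callback:

```python
import random
import time

def crawl(urls, fetch, delay=30.0):
    """Fetch each URL in turn, sleeping roughly half a minute between
    requests so the site's anti-crawler blocking is not triggered
    (no IP pool is used)."""
    results = []
    for url in urls:
        results.append(fetch(url))
        # Add a little jitter so the request timing is not perfectly regular.
        time.sleep(delay + random.uniform(0, delay * 0.1))
    return results
```

With delay=30.0 a full crawl of many pages takes hours, which is why running it under tmux is recommended.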

3. Further Pretrain

3.1 Data preparation

We should preprocess our data to make it suitable for further pretraining.

python continuePretrainDataPre.py

This produces ./input/mlm_data.csv.
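The preparation step boils down to merging the text columns of the Lang-8 scrape and the feedback-prize-2021 data into one single-column corpus file. A minimal sketch, assuming pandas; the function name and the column names passed in are assumptions, not the exact code of continuePretrainDataPre.py:

```python
import pandas as pd

def build_mlm_corpus(frames_and_cols, out_path=None):
    """Concatenate the text columns of several dataframes into one
    deduplicated single-column corpus for MLM pretraining."""
    texts = pd.concat(
        [df[col] for df, col in frames_and_cols],
        ignore_index=True,
    )
    texts = texts.dropna().astype(str).drop_duplicates()
    corpus = texts.to_frame(name="text")
    if out_path:
        corpus.to_csv(out_path, index=False)
    return corpus
```

A plausible invocation, if the Lang-8 CSV has a "text" column and the feedback-prize-2021 train file a "discourse_text" column (both assumptions): `build_mlm_corpus([(lang8_df, "text"), (fb2021_df, "discourse_text")], "./input/mlm_data.csv")`.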

3.2 Train

Use the preprocessed data to further pretrain our model.

python continuePretrain.py

4. Fine-tune the pretrained model on the competition dataset

python pretrainFtFeedback2.py
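One plausible shape for the fine-tuning model (an assumption about pretrainFtFeedback2.py, not its exact code): load the further-pretrained encoder with a six-output regression head, one output per analytic measure.

```python
from transformers import AutoModelForSequenceClassification

# The six analytic measures predicted by the regression head.
TARGETS = ["cohesion", "syntax", "vocabulary",
           "phraseology", "grammar", "conventions"]

def build_finetune_model(pretrained_dir="./mlm_out"):
    """Attach a 6-way regression head to the further-pretrained encoder;
    problem_type="regression" makes Transformers use an MSE loss."""
    return AutoModelForSequenceClassification.from_pretrained(
        pretrained_dir,
        num_labels=len(TARGETS),
        problem_type="regression",
    )
```

The head is then trained on the competition's train set, with the encoder initialized from the MLM checkpoint rather than from the generic pre-trained weights.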

5. Results

Using the competition's evaluation metric (MCRMSE), my score is 0.477671232111189.
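The competition scores submissions with MCRMSE, the mean columnwise root mean squared error: compute the RMSE for each of the six measures separately, then average the six values. A small implementation:

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE over arrays of shape (n_essays, 6):
    one RMSE per analytic measure, averaged across measures."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    col_rmse = np.sqrt(((y_true - y_pred) ** 2).mean(axis=0))
    return col_rmse.mean()
```

Lower is better; a perfect submission scores 0.0.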

The detailed results can be found in submission.csv.
