Skip to content

This is the source code of "OL4TeX: Adaptive Online Learning for Text Classification under Distribution Shifts".

License

Notifications You must be signed in to change notification settings

bigbases/online-learning-text

Repository files navigation

OL4TeX: Adaptive Online Learning for Text Classification under Distribution Shifts

Data Preparation

Datasets are used in the following file structure:

│adaptive-model-update/
├──data/
│  ├── cybersecurity
│  │   ├── cybersecurity_source.csv
│  │   ├── cybersecurity_target.csv
│  ├── disaster
│  │   ├── disaster_source.csv
│  │   ├── disaster_target.csv
│  ├── review
│  │   ├── hotel_review.csv
  • cs_source.csv: You can download it from: here
  • cs_target.csv: You can download it from: here
  • disaster_source.csv: You can download it from: here
  • disaster_target.csv: Please refer to emergency.csv file.
  • hotel_review.csv: You can download it from: here

Setups

All code was developed and tested on Nvidia RTX A4000 (48SMs, 16GB) the following environment.

  • Ubuntu 18.04
  • python 3.6.9
  • gensim 3.8.3
  • keras 2.6.0
  • numpy 1.19.5
  • pandas 1.1.5
  • tensorflow 2.6.2

Implementation

To pre-train the model, run the following script using command line:

sh run_pretrain_offline.sh

To adapt the model online, run the following script using command line:

sh run_update_online.sh

Hyperparameters

The following options can be passed to main.py

  • -dataset: Name of the dataset. (Supported names are cybersecurity, disaster, review)
  • -model: Neural architecture of the OnlineAdaptor. (Supported models are CNN, LSTM, Transformer)
  • -adjust_weight: Relative importance between learning efficiency and accuracy. Default is 0.5.
  • -epochs: Epochs for training model. Deault is 20.
  • -event_size: Size of streaming batches.
  • -batch_size: Size of batch to train the model.
  • -keyword_size: Size of keyword set to calculate the frequency indicator.
  • -embedding_size: Size of embedding layer.
  • -output_path: Path for the output results.
  • -token_path: Path for saving and loading tokenizer.
  • -model_path: Path for saving and loading machine learning-based OnlineAdaptor.
  • -ml_path: Path for saving and loading machine learning-based AccPredictor.
  • -pretrain: Execute the model pre-training in offline.
  • -update: Execute the model update in online.

About

This is the source code of "OL4TeX: Adaptive Online Learning for Text Classification under Distribution Shifts".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published