Main Research:
A new framework for spoken language understanding (SLU) in task-oriented dialogue systems that aims to handle the naturalness of human speech, i.e., to tackle its irregularities. This research project focuses on understanding human utterances in human-machine interaction, then forming a knowledge abstraction for downstream policy-learning tasks.
Given a dialogue turn, the framework returns the updated slot values and a satisfaction level.
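As a rough illustration of this per-turn interface (the names below are hypothetical, not the repo's actual API), the input/output contract could look like this:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TurnResult:
    """Hypothetical container for what the framework returns for one dialogue turn."""
    slot_values: Dict[str, str] = field(default_factory=dict)  # e.g. {"restaurant-area": "centre"}
    satisfaction: float = 0.0  # estimated user satisfaction, e.g. in [0, 1]

def process_turn(utterance: str, state: Dict[str, str]) -> TurnResult:
    """Hypothetical entry point: run NLU + DST for one user utterance."""
    updated = dict(state)
    # ... intent clustering + slot filling, then the state tracker, would update `updated` ...
    return TurnResult(slot_values=updated, satisfaction=0.5)
```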
It has the following features & aims:
- NLU:
  - Surface model:
    a. Multi-intent clustering framework
    b. Intent + slot-filling
  - Pertinence model:
    a. Evaluation framework
- DST:
  - State tracker
- Data: redefine labels in the dialogue dataset (some are unclear).
- Fine-tune: train a BERT model on single-sentence datasets and apply it to the dialogue datasets for clustering.
- Surface: check DCEC convergence (a sketch of the stopping criterion follows this list).
- Surface: attend to key words via a masking mechanism.
- Surface: train with as few labels as possible.
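For the DCEC convergence check, a common stopping criterion (following the DEC/DCEC papers) is the fraction of samples whose hard cluster assignment changed between updates; a minimal sketch, assuming NumPy arrays of hard assignments:

```python
import numpy as np

def assignment_change_rate(prev_labels: np.ndarray, curr_labels: np.ndarray) -> float:
    """Fraction of samples whose hard cluster assignment changed between updates."""
    return float(np.mean(prev_labels != curr_labels))

# Stop DCEC training once fewer than e.g. 0.1% of assignments change.
tol = 0.001
prev = np.array([0, 1, 1, 2])
curr = np.array([0, 1, 2, 2])
if assignment_change_rate(prev, curr) < tol:
    print("converged")
```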
Data preprocessing pipeline:
Associated files and folders:
- data/train_data.py
- data/dialogue_data.py
- Go to config.py to select the data type:
  - Single sentence: ATIS / semantic parsing dataset
  - Dialogue: MultiWOZ 2.1 dataset
- Run the corresponding script to generate raw_data.pkl for later use:
python data/train_data.py
python data/dialogue_data.py
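The structure of raw_data.pkl depends on the selected data type; assuming it is a standard pickle file (the path below is an assumption, adjust it to wherever the script writes), it can be inspected like this:

```python
import pickle

# Inspect the preprocessed data written by data/train_data.py or data/dialogue_data.py.
with open("raw_data.pkl", "rb") as f:  # assumed path; adjust to where the script writes it
    raw_data = pickle.load(f)

print(type(raw_data))  # check the structure before downstream use
```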
To extract contextualized representations, we fine-tune a BERT model and use it to generate pretrained sentence embeddings.
Associated files and folders:
- bert_finetune.py
- bert_nsp.py
- finetune_results/
- checkpoints/
To train on a single-sentence dataset (ATIS or TOP semantic parsing):
python bert_finetune.py train --datatype=atis
python bert_finetune.py train --datatype=semantic
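For orientation, fine-tuning BERT for intent classification can be sketched with the Hugging Face transformers API; this is an illustration, not the repo's bert_finetune.py, and NUM_INTENTS plus the example batch are placeholders:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_INTENTS = 26  # placeholder: set to the number of intent labels in the dataset

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_INTENTS
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

texts = ["show me flights from boston to denver"]  # placeholder batch
labels = torch.tensor([0])                         # placeholder intent ids

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss is computed from the labels
optimizer.zero_grad()
outputs.loss.backward()
optimizer.step()
```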
To train on the MultiWOZ dataset with next-sentence prediction:
python bert_nsp.py train
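For reference, next-sentence prediction on consecutive dialogue turns can be sketched as follows (again with the transformers API as an illustration, not the contents of bert_nsp.py):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Consecutive dialogue turns form a positive pair; label 0 = "B follows A".
turn_a = "i need a cheap restaurant in the centre"
turn_b = "there are several cheap restaurants in the centre , any cuisine preference ?"
enc = tokenizer(turn_a, turn_b, return_tensors="pt")
outputs = model(**enc, labels=torch.LongTensor([0]))
print(outputs.loss)
```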
There are three modes for testing the ATIS/TOP semantic BERT embeddings:
python bert_finetune.py test --mode=[mode_type]
mode_type:
- embedding: generates and stores the sentence embeddings for every training example.
- data: runs the original text classification on the dataset.
- user: lets you type in any sentence and classifies it with a specific label (sketched below).
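Conceptually, the user mode is a small interactive loop; a sketch assuming a fine-tuned BertForSequenceClassification checkpoint (the checkpoint path is a placeholder):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Placeholder path: load whatever checkpoint bert_finetune.py saves under checkpoints/.
model = BertForSequenceClassification.from_pretrained("checkpoints/atis")
model.eval()

while True:
    sentence = input("sentence> ")
    if not sentence:
        break
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    print("predicted label id:", int(logits.argmax(dim=-1)))
```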
To extract sentence-level BERT embeddings from the dialogue dataset:
python bert_nsp.py test --mode=embedding
This mode generates and stores the sentence embeddings for every training example.
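Storing the embeddings usually means taking the [CLS] (or pooled) representation from the fine-tuned encoder; a minimal sketch, where the checkpoint, example data, and output path are assumptions:

```python
import pickle
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")  # swap in the fine-tuned checkpoint
model.eval()

sentences = ["book a table for two", "what is the weather like"]  # placeholder data
embeddings = []
with torch.no_grad():
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt", truncation=True)
        out = model(**enc)
        embeddings.append(out.last_hidden_state[:, 0, :].squeeze(0))  # [CLS] vector

with open("finetune_results/embeddings.pkl", "wb") as f:  # assumed output location
    pickle.dump(torch.stack(embeddings), f)
```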
After obtaining the embeddings, we can use them for the surface model; please check here for more details:
- intent clustering (an illustrative sketch follows this list)
- intent + slot-filling
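As an example of the clustering step, plain k-means over the stored embeddings looks like this (the surface model may instead use DCEC; the file path and cluster count are assumptions carried over from the sketch above):

```python
import pickle
from sklearn.cluster import KMeans

with open("finetune_results/embeddings.pkl", "rb") as f:  # assumed location, see above
    embeddings = pickle.load(f)  # torch tensor of shape (num_sentences, hidden_size)

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)  # cluster count is a placeholder
cluster_ids = kmeans.fit_predict(embeddings.numpy())
print(cluster_ids[:5])
```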
After obtaining the embeddings, we can also use them for the pertinence model.