This is a Python 3.6.1 and PyTorch 1.0 implementation of the paper referenced below.
Source code of our EMNLP 2018 paper: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task.
@InProceedings{Yu&al.18.emnlp.syntax,
author = {Tao Yu and Michihiro Yasunaga and Kai Yang and Rui Zhang and Dongxu Wang and Zifan Li and Dragomir Radev},
title = {SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task},
year = {2018},
booktitle = {Proceedings of EMNLP},
publisher = {Association for Computational Linguistics},
}
Please look at Atakan_Okan_Text2SQL.pdf in the main directory.
- The code uses Python 3.7 and PyTorch 1.0.0 (GPU).
- Install Python dependencies:
pip install -r requirements.txt
- Download the dataset from the Spider task website (to be updated), and put tables.json, train.json, and dev.json under the data/ directory.
- Download the pretrained GloVe embeddings and put them at glove/glove.%dB.%dd.txt (a loading sketch follows this list).
- Download evaluation.py and process_sql.py from the Spider GitHub page.
- Download the preprocessed train/dev datasets and pretrained models from here. It contains:
  - generated_datasets/
    - generated_data for the original Spider training datasets; pretrained models can be found at generated_data/saved_models
    - generated_data_augment for the original Spider + augmented training datasets; pretrained models can be found at generated_data_augment/saved_models
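If you want to sanity-check the embeddings, here is a minimal, hedged sketch of loading a GloVe file into a dictionary; the 42B/300d file name below is just one instance of the glove.%dB.%dd.txt pattern and is not mandated by the code.

import numpy as np

def load_glove(path):
    # Each line of a GloVe file is: word v1 v2 ... vD (space separated).
    word_vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word_vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return word_vecs

glove = load_glove("glove/glove.42B.300d.txt")  # example file name
print(len(glove), "words loaded")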
You can find preprocessed train/dev data in generated_datasets/.

To generate them yourself, update the directories under TODO in preprocess_train_dev_data.py, and run the following command to generate training files for each module:
python preprocess_train_dev_data.py train|dev
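For example, to generate the files for both splits:

python preprocess_train_dev_data.py train
python preprocess_train_dev_data.py dev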
Folder/file description:
- data/ contains the raw train/dev/test data and the table file.
- generated_datasets/ is described above.
- models/ contains the code for each module.
- evaluation.py is for evaluation. It uses process_sql.py.
- train.py is the main file for training. Use train_all.sh to train all the modules (see below).
- test.py is the main file for testing. It uses supermodel.sh to call the trained modules and generate SQL queries. In practice, use test_gen.sh to generate SQL queries.
- generate_wikisql_augment.py is for cross-domain data augmentation.
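For orientation, the working directory implied by the steps above looks roughly like this (illustrative; only files mentioned in this README are listed):

data/                                   # tables.json, train.json, dev.json
glove/                                  # glove.%dB.%dd.txt
generated_datasets/
  generated_data/saved_models/
  generated_data_augment/saved_models/
models/
evaluation.py
process_sql.py
preprocess_train_dev_data.py
train.py
train_all.sh
test.py
test_gen.sh
supermodel.sh
generate_wikisql_augment.py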
Run train_all.sh to train all the modules. It looks like:
python train.py \
--data_root path/to/generated_data \
--save_dir path/to/save/trained/module \
--history_type full|no \
--table_type std|no \
--train_component <module_name> \
--epoch <num_of_epochs>
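For instance, a single-module run on the original Spider data might look like the following; the module name col and the epoch count are purely illustrative (see train.py for the exact --train_component choices):

python train.py \
    --data_root generated_datasets/generated_data \
    --save_dir generated_datasets/generated_data/saved_models \
    --history_type full \
    --table_type std \
    --train_component col \
    --epoch 300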
Run test_gen.sh to generate SQL queries. test_gen.sh looks like:
SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
--test_data_path path/to/raw/test/data \
--models path/to/trained/module \
--output_path path/to/print/generated/SQL \
--history_type full|no \
--table_type std|no
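As a concrete illustration (assuming test_gen.sh passes SAVE_PATH as --models; the output file name here is hypothetical):

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path data/dev.json \
    --models $SAVE_PATH \
    --output_path predicted_sql.txt \
    --history_type full \
    --table_type std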
The deployment below serves the model and runs it with question = "What are the maximum and minimum budget of the departments?" and database name = department_management.
- docker build -t model-app .
- docker login (enter your credentials)
- docker images (get the image ID of the model-app image)
- docker tag <your image id> <your docker hub id>/<app-name>
- docker push <your docker hub id>/<app-name>
After pushing the Docker image to Docker Hub and creating the Kubernetes cluster, run the following in Cloud Shell:
kubectl run model-app --image=atakanokan/model-app --port 5000
- Verify with kubectl get pods
- Expose the deployment: kubectl expose deployment model-app --type=LoadBalancer --port 80 --target-port 5000
- Run kubectl get service and note the cluster-ip; this is <your service IP> in the command below.
Then run the following from a local terminal:
curl -X GET 'http://<your service IP>/output?english_question=What+are+the+maximum+and+minimum+budget+of+the+departments%3F&database_name=department_management'
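The same request can be made from Python; this is a minimal sketch assuming only the /output endpoint and query parameters shown in the curl command above:

import requests

service_ip = "<your service IP>"  # from kubectl get service
params = {
    "english_question": "What are the maximum and minimum budget of the departments?",
    "database_name": "department_management",
}
response = requests.get(f"http://{service_ip}/output", params=params)
print(response.text)  # the generated SQL query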
Follow the general evaluation process on the Spider GitHub page.
You can find preprocessed augmented data at generated_datasets/generated_data_augment.
If you would like to run the data augmentation yourself, first download wikisql_tables.json and train_patterns.json from here, and then run python generate_wikisql_augment.py to generate more training data.
The implementation is based on SQLNet. Please cite it too if you use this code.