Training data (aggregate_paraphrase_corpus_0) #5

Open
ovesreinier opened this issue Sep 15, 2018 · 6 comments

Comments

@ovesreinier

Hello Victor.
First, I would like to thank you for your contribution.

I am trying to retrain your model, but aggregate_paraphrase_corpus_0 is missing.
Could you share the files, or explain their format?

Thanks

@jasonray716

I need this training data as well.
Could you share the download link, or explain how to create a dataset in this format?

@SeekPoint

How can I get the training dataset?

@tim5go

tim5go commented Dec 19, 2019

@vsuthichai
I would like to have the training data as well. Is it possible to share it with me privately?

@LBartolini

Hi, I know some time has passed since you asked about these files.
I'm not @vsuthichai, but I think I understand how to generate the training data.
First, download the data from the internet (just search for para-nmt-50m-demo).
Next, run "preprocess_data.py", passing as a parameter the downloaded file, "para-nmt-50m-small.txt".
This will create a bunch of files called "para-nmt-50m-small.txt + ".
The last thing you need to do is create the sentence embeddings (I still need to find out how to do this) and correct all the import strings where these files are used in the code.

Finally, you should be able to train your model. Make sure the dataset you use is formatted as "Source sentence" + "\t" + "final sentence".
Now I need to translate the whole dataset into Italian and try to train in Italian...
Wish me luck
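
For anyone who wants to check a file against the tab-separated format described above, here is a minimal Python sketch. It is not code from the repository: the helper names `read_pairs` and `write_pairs` are hypothetical, and it assumes one pair per line with the source sentence in the first column and the paraphrase in the second. Ignoring any extra columns is also an assumption.

```python
from typing import Iterable, List, Tuple


def read_pairs(path: str) -> List[Tuple[str, str]]:
    """Read (source, paraphrase) pairs from a tab-separated text file.

    Assumption: column 1 is the source sentence and column 2 is the
    paraphrase. Any further columns are ignored. Blank lines are skipped.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            fields = line.split("\t")
            if len(fields) < 2:
                raise ValueError(f"line {line_no}: expected a tab between two sentences")
            pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


def write_pairs(path: str, pairs: Iterable[Tuple[str, str]]) -> None:
    """Write pairs as "source" + "\t" + "paraphrase", one pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for source, paraphrase in pairs:
            if "\t" in source or "\t" in paraphrase:
                raise ValueError("sentences must not contain tab characters")
            f.write(f"{source}\t{paraphrase}\n")
```

Running `read_pairs` on a file before training (or after translating it) fails early on any line that lacks the tab separator, which is cheaper than discovering the problem partway through preprocessing.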

@kay312

kay312 commented Sep 5, 2020

Thanks for your comment. Have you succeeded? I'm doing a similar thing, translating the data into Chinese.

@LBartolini

Hi, I didn't really succeed. I tried to translate the training data into Italian. The problem is that the translations weren't good and the training dataset wasn't big enough (maybe because I only used para-nmt, whereas the author of the repository used several corpora). I tried to train anyway, but I didn't get good results.
