
Question about training a BERTax model for phylum to species taxonomy classification #10

Steven-GUHK opened this issue Jun 1, 2023 · 10 comments


@Steven-GUHK

Steven-GUHK commented Jun 1, 2023

Hi! I have read your paper about BERTax. It is wonderful and very inspiring. I'm interested in training a BERTax model for my own application: predict the phylum, class, order, family, genus, and species of a DNA sequence. Since I need to predict six labels, I plan to add three more taxonomy layers after the original BERTax taxonomy layers. Also, I need to use different training and testing datasets. Currently, my dataset looks like this:
species_1.fasta:

>sequence_1
ATCG...
>sequence_2
ATCG...
...

species_2.fasta:

>sequence_1
ATCG...
>sequence_2
ATCG...
...

species_n.fasta:

>sequence_1
ATCG...
>sequence_2
ATCG...
...
where each fasta file is one species with a corresponding taxonomy label (from phylum to species), and each fasta file may contain more than one sequence of that species.

I have read your instructions about how to prepare the data for training. I think I should convert my data into this format:
[screenshot]

Thank you very much if you could provide me with some suggestions about my task!

@f-kretschmer
Collaborator

Hi!

Since we only used the gene-model structure at the beginning of our development, more work would be required to adapt that training process to the new task. For the genomic model, however, probably not a lot has to be changed. Your data would need to be in the "fragment" type of structure, which can be generated from multi-fastas (https://github.com/f-kretschmer/bertax_training#multi-fastas-with-taxids). You would have to concatenate your data into a single file and adapt the header of each sequence in the following way:

>species_1_taxid 0 
ATCG....
>species_1_taxid 1
ATCG....

....

>species_1_taxid m
ATCG....
>species_2_taxid m + 1
ATCG....

...

>species_n_taxid x
ATCG....

The first value of the header is simply the NCBI TaxID (https://www.ncbi.nlm.nih.gov/taxonomy), from which the classes/ranks for each species can be retrieved. The second is a running index, so that each sequence in the multi-fasta has a different header. This file can then be converted to the "fragments" format with https://github.com/f-kretschmer/bertax_training/blob/master/preprocessing/fasta2fragments.py. For training (fine-tuning), the CLI argument --ranks can be used to specify which ranks to train (and thus which output layers to use). Don't hesitate to ask if you have further questions!
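For concreteness, here is a minimal sketch of that concatenation step (the taxid_map dictionary and file names are placeholders, not part of the BERTax repo; you would fill them in from your own NCBI TaxID lookup):

import glob

# Sketch: merge per-species multi-fastas into one file whose headers are
# ">NCBI-TaxID running-index", the layout expected by fasta2fragments.py.
taxid_map = {"species_1.fasta": 9606, "species_2.fasta": 10090}  # hypothetical lookup

index = 0
with open("all_species.fasta", "w") as out:
    for path in sorted(glob.glob("species_*.fasta")):
        taxid = taxid_map[path]
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    out.write(f">{taxid} {index}\n")
                    index += 1
                else:
                    out.write(line)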

@Steven-GUHK
Author

Steven-GUHK commented Jun 5, 2023


Thanks for your suggestion! Here is what I have done:

  1. I concatenated all the fasta files into one fasta file with the TaxID-and-index headers you suggested. Then I used fasta2fragments.py to generate two files: train_fragments.json and train_species_picked.txt.

  2. Because I don't have the ['Viruses', 'Archaea', 'Bacteria', 'Eukaryota'] classes, I just changed them to ['train']:

  [screenshot]

  3. I ran the command given on GitHub, python -m models.bert_nc fragments_root_dir --batch_size 32 --head_num 5 --transformer_num 12 --embed_dim 250 --feed_forward_dim 1024 --dropout_rate 0.05 --name bert_nc_C2 --epochs 10, and successfully trained a model.

  4. Then I went to fine-tune the model. Because I need to predict phylum to species, I modified bert_nc_finetune.py as follows:

  [screenshot]

  Then I ran the command python -m models.bert_nc_finetune bert_nc_C2.h5 fragments_root_dir --multi_tax --epochs 15 --batch_size 24 --save_name _small_trainingset_filtered_fix_classes_selection --store_predictions --nr_seqs 1000000000 and ran into a problem: [screenshot] It seems to take too much memory to load all the data at once. I deleted the np.array() operation and the error disappeared. [screenshot]

However, I encountered another problem:
fine-tune-error.log

Do you have any suggestions about my steps above? Thank you very much!

@f-kretschmer
Collaborator

I haven't seen this error before. Could you first check whether it also comes up if you change the np.array lines back, perhaps by using a smaller training dataset first (so it fits into memory)? Using np.asarray instead of np.array might also reduce memory usage. The data type (and also the size!) in your screenshot ("Unable to allocate 2.99 TiB for an array ..." with data type <U...) is quite strange: the data should all be numerical. Perhaps check that the three variables x, y, y_species all have the expected contents (before or at line 76 in bert_nc_finetune.py).
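For example, a quick way to check this without triggering the huge allocation (a sketch only, assuming x, y, and y_species are list- or array-like as loaded at that point in bert_nc_finetune.py) would be:

# Sketch: inspect the loaded data before any np.array() conversion.
# If x still contains raw sequence strings, np.array(x) builds a fixed-width
# unicode array (dtype "<U...") of roughly len(x) * max_len * 4 bytes, which
# would explain a TiB-sized allocation; tokenized data should be integer arrays.
for name, value in [("x", x), ("y", y), ("y_species", y_species)]:
    first = value[0]
    print(name, type(value).__name__, len(value),
          "first element:", type(first).__name__,
          getattr(first, "dtype", None),
          len(first) if hasattr(first, "__len__") else first)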

@Steven-GUHK
Author

Steven-GUHK commented Jun 5, 2023

I printed the number, type, and value of x, y, and y_species before the shuffle:
[screenshot]

The result is:
[screenshot]

and after the shuffle, the data type is:
[screenshot]

@Steven-GUHK
Author

Steven-GUHK commented Jun 6, 2023

Following up on the previous problem: I found that it doesn't matter whether I use np.array(x) in the load_fragments() function or not, because I use preprocessing.make_dataset.py to generate the train.tsv and test.tsv files, and the generated files are the same without np.array(x). Here is a screenshot:
[screenshot]

Then I used train.tsv and test.tsv to train the model with the --use_defined_train_test_set argument. However, the problem still exists.
fine-tune-error.log

I have uploaded the pre-trained model and the two files here: https://drive.google.com/drive/folders/1TUSTrjlGbtYqVBcUmybAVXxLEcvG8duT?usp=sharing
Because train.tsv is too large, you can use test.tsv twice.

I would really appreciate it if you could help me out 🙏

@Steven-GUHK
Author


Sorry to bother you again, but here is a strange thing:
To test whether the problem is on my side, I used the model provided in resources/bert_nc_C2_final.h5 for fine-tuning, with a small dataset so that there is no ArrayMemoryError. The only thing I changed is models/model.py, where I changed the file name:
[screenshot]

Here is my command to run bert_nc_finetune.py:
nohup python -m models.bert_nc_finetune resources/bert_nc_C2_final.h5 fragments_root_dir --multi_tax --epochs 15 --batch_size 24 --save_name small --store_predictions --nr_seqs 1000000000 > fine-tune.log 2> fine-tune-error.log

And these are the output logs:
fine-tune-error.log
fine-tune.log
The problem still exists.

I have uploaded the train_small.fasta, train_small_fragments.json, and train_small_species_picked.txt here:
https://drive.google.com/drive/folders/1TUSTrjlGbtYqVBcUmybAVXxLEcvG8duT?usp=sharing

Could you please help me check why? Is it because I changed the list of names? Thank you very much!

@f-kretschmer
Collaborator

Just a heads-up, and apologies that I haven't been able to look into it in detail yet. I can't see anything immediately wrong with your data or commands; the error might be related to tensorflow internals and caused by package version conflicts (keras-bert, which BERTax depends on, does not work with all versions of tensorflow or keras). I'll write back when I find something.

@Steven-GUHK
Author

Steven-GUHK commented Jun 15, 2023

Update: after several days of trying, I updated my tensorflow version to 2.12.0 and the model can now be trained normally. But I have to reduce batch_size to 1, otherwise there is a memory error. I wonder about the impact of batch_size on the final accuracy.

One more question: I found that the sample_weight in the pre-training script bert_nc.py only covers the superkingdom. As I want to train six ranks, do I need to make changes in bert_nc.py, or do I just need to change the code in bert_nc_finetune.py?

Also, there is a problem with testing after training:
[screenshots]
It seems that the model has two inputs but only received one. If the training process has no problem, how can testing fail, given that the train and test data have the same format?

@f-kretschmer
Collaborator

Good to hear that changing the tensorflow version solved the first issue!

  • Regarding the sample weights, you should be fine with only balancing the highest rank (as is done), since in pre-training the class labels do not get used anyway.
  • I am wondering about the testing problem as well. The input of the model consists of the tokenized sequence itself and segments, which are not really used for fine-tuning but stem from the pre-training tasks (next-sentence prediction). The segment input should just be "0"s with the length of the input sequence/tokens (see the sketch below). According to the error message, though, the function seems to be doing something fundamentally wrong if the input has shape (None, None); you might need to look into the code of that function. Sorry I can't be of more help right now.
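A rough illustration of that two-input structure (a sketch only; model and batch_tokens are placeholders, and the exact call in bert_nc_finetune.py may differ):

import numpy as np

# Sketch: the fine-tuned Keras model takes two inputs, the token IDs and the
# leftover "segment" IDs from pre-training; for prediction the segments can
# simply be an all-zero array with the same shape as the tokens.
tokens = np.asarray(batch_tokens)      # shape (batch_size, seq_len), integer token IDs
segments = np.zeros_like(tokens)       # all-zero segment input
predictions = model.predict([tokens, segments])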

@Steven-GUHK
Author

Thanks for the information. It is true that the generator returns a list containing tokens and segments, and the segments are all 0s. However, I don't know why the predict() function doesn't unpack the list, so I did it manually. Finally, I got the results, but they are awful: the accuracy at the phylum level is around 0.02.
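For reference, the manual unpacking presumably looks roughly like this (a sketch only; test_generator, model, and the exact batch structure are assumptions, not the actual BERTax code):

# Sketch of the manual unpacking: iterate the test generator, split each batch
# into its two model inputs, and call predict() on them directly.
all_predictions = []
for i in range(len(test_generator)):
    inputs, labels = test_generator[i]   # inputs assumed to be [tokens, segments]
    tokens, segments = inputs
    all_predictions.append(model.predict([tokens, segments]))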

I used DNA from 1075 species to pre-train and fine-tune the model. For each species, I chose 10 sequences that do not appear in training for testing, so there are 10750 sequences for testing. Here are three logs:
pre-training.log.zip
fine-tune.log.zip
test.log.zip

I find that the final losses are larger than the initial ones. Do you think I should pre-train the model myself, or should I just fine-tune your pre-trained model? Thanks!
