Problem creating seq files when running setup_clade_ap.py. #48

Open
teagerv opened this issue May 21, 2022 · 7 comments

@teagerv

teagerv commented May 21, 2022

Question: Where is the -s parameter (SEQGZFOLDER) for setup_clade_ap.py meant to point?

Issue: I'm having a problem populating the gzip directory with sequences. The .table file is fully populated from the NCBI db, but the sequences aren't being found. I'm not sure where the -s parameter is supposed to point. ~/ is where all the compressed NCBI files from phlawd_db_maker are.

snail@snailbuntu:~/PyPHLAWD/src$ python3 setup_clade_ap.py -t Architaenioglossa -b /media/snail/RED1/ncbi/inv.db -o ~/Desktop/ -s ~/ -l ~/Desktop/logfile
STARTING PYPHLAWD *。ヾ(。>v<。)ノ゙*。
MAKING TREE Architaenioglossa ٩(๑꒦ິȏ꒦ິ๑)۶
MAKING DIRS IN /home/snail/Desktop ヽ(*´∀`)ノ゙
PROBLEM CREATING /home/snail/Desktop/Architaenioglossa_75116 (´;ω;`)
POPULATING DIRS /home/snail/Desktop ヽ/❀o ل͜ o\ノ
Traceback (most recent call last):
  File "/home/snail/PyPHLAWD/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist) 
  File "/home/snail/PyPHLAWD/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
  File "/home/snail/PyPHLAWD/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
  File "/usr/lib/python3.8/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/snail//seqs.Viviparus subpurpureus voucher USNM 1292588 histone 3 (H3) gene, partial cds.'
CREATED TEMPDIR_44273/
CLUSTERING SINGLE /home/snail/Desktop/Architaenioglossa_75116/Cyclophoroidea_75117/Megalomastomatidae_928797/Acroptychia_928777 ヽ(。´・д・)ノ
Traceback (most recent call last):
  File "/home/snail/PyPHLAWD/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
IndexError: list index out of range
PYPHLAWD DONE ヽ(^□^。)ノ
Total time (H:M:S): 0:00:00.638717 ٩(º౪º๑)۶
(⌐■_■) 

Steps taken: Followed the steps on the Install page. Built phlawd_db_maker and all dependencies without errors. Built the database with phlawd_db_maker with no errors. Followed directions on the Runs page for a clustering analysis. Python version is 3.8.10

I know Python pretty well, so if I find a fix I'll make a pull request.
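
For anyone debugging the same traceback, here is a minimal sketch (not from the original post) of a check on the directory passed to -s. It only assumes what the traceback shows, namely that get_seqs_from_gz calls gzip.open on the -s directory joined with a filename. The helper names list_gz_files and can_open_gz are hypothetical and not part of PyPHLAWD.

```python
from pathlib import Path


def list_gz_files(seq_dir):
    """Return the sorted names of all .gz files directly inside seq_dir."""
    return sorted(p.name for p in Path(seq_dir).expanduser().glob("*.gz"))


def can_open_gz(seq_dir, filename):
    """Return True if seq_dir/filename exists and starts with the gzip magic bytes."""
    path = Path(seq_dir).expanduser() / filename
    if not path.is_file():
        return False
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"
```

Calling list_gz_files("~/") shows which compressed files are really in the folder, and can_open_gz lets you test the exact filename from the FileNotFoundError message.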

@hmarx

hmarx commented Jul 19, 2022

I'm having this same issue on Python 3.9.13. Have there been any updates?

@teagerv

teagerv commented Jul 23, 2022

Solution: I figured it out. If you're subsetting taxa, you have to make a file with the NCBI ids that you want to include, or it won't populate with any sequences (this is described in the 'Runs' doc). I don't know why I decided that wasn't relevant the last time I looked at this...

There is a helper script if you already have a file with all the names, but I just used a quick BioPython script to pull them and it's running now:

from Bio import Entrez

# Placeholder value, not from the original post: set this to at least the
# number of records you expect the search to return.
RETMAX = 10000

def main():
    Entrez.email = ""  # fill in an email address
    db_type = 'nucleotide'
    search_terms = '(Architaenioglossa[Orgn])'
    output_file = '/home/snail/Desktop/architaenioglossa_taxalist.txt'

    returned_ids = esearch(search_terms, db_type)
    make_taxalist(returned_ids, output_file)

def esearch(search_terms, db_type):
    # Query NCBI and return the list of matching IDs (accessions, via idtype="acc").
    handle = Entrez.esearch(db=db_type, term=search_terms, idtype="acc", retmax=RETMAX)
    record = Entrez.read(handle)
    handle.close()
    print('Search returned %s results.\n' % record["Count"])

    return record["IdList"]

def make_taxalist(ids, output):
    # Append one ID per line to the output file.
    with open(output, 'a') as fh:
        for i in ids:
            fh.write(f'{i}\n')

if __name__ == '__main__':
    main()

Just set your search terms to the subset you want, set retmax to at least the number of taxa, and put in a random email (not sure if this is required).
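
One thing to watch with the script above: it opens the output file in append mode, so running it twice writes every ID twice. A hedged sketch of an alternative writer that overwrites the file and drops blanks and repeats is below. The name write_unique_ids is hypothetical, and whether duplicate lines actually matter to setup_clade_ap.py is not something this thread establishes.

```python
def write_unique_ids(ids, output_path):
    """Write ids to output_path, one per line, dropping blanks and repeats.

    Returns the number of IDs written. The file is overwritten, not appended to.
    """
    seen = set()
    unique = []
    for raw in ids:
        value = str(raw).strip()
        if value and value not in seen:
            seen.add(value)
            unique.append(value)
    with open(output_path, "w") as fh:
        fh.writelines(f"{value}\n" for value in unique)
    return len(unique)
```

It could stand in for make_taxalist in the script, for example write_unique_ids(returned_ids, output_file).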

@YingyingYang2019

YingyingYang2019 commented Nov 14, 2022

Hi, I have the same problem! I have provided the taxalist, but it still doesn't work. Can anyone help? Thanks!
The command and its output are shown here:

yang@bdchxy-PowerEdge-M630-VRTX:~$ python application/PyPHLAWD-master/src/setup_clade_ap.py -t Fagales -b /storage/phlawd_db_maker-master/DB/pln.db -s /storage/phlawd_db_maker-master/DB -o application/PyPHLAWD-master/examples/clustered/ -l application/PyPHLAWD-master/examples/clustered/ -f ncbi_sp_ids_938.txt

STARTING PYPHLAWD (⌯꒪͒ ꌂ̇ ꒪͒)
LIMITING TO TAXA IN ncbi_sp_ids_938.txt
MAKING TREE Fagales (✧ ꒪◞౪◟꒪)
MAKING DIRS IN application/PyPHLAWD-master/examples/clustered ヾ(≧∪≦*)ノ〃
PROBLEM CREATING application/PyPHLAWD-master/examples/clustered/Fagales_3502 (゜´Д`゜)
POPULATING DIRS application/PyPHLAWD-master/examples/clustered ₊·◟(˶╹̆ꇴ╹̆˵)◜‧
Traceback (most recent call last):
  File "/home/yang/application/PyPHLAWD-master/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist)
  File "/home/yang/application/PyPHLAWD-master/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
  File "/home/yang/application/PyPHLAWD-master/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
  File "/home/yang/anaconda3/envs/python3.8/lib/python3.8/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/home/yang/anaconda3/envs/python3.8/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/storage/phlawd_db_maker-master/DB//seqs.Ticodendron incognitum chloroplast rbcL gene for ribulose-1,5-bisphosphate carboxylase large subunit, partial cds.'
CREATED TEMPDIR_69418/
CLUSTERING SINGLE application/PyPHLAWD-master/examples/clustered/Fagales_3502/Fagaceae_3503/Chrysolepis_21022 (ノ′Дヾ)
Traceback (most recent call last):
  File "/home/yang/application/PyPHLAWD-master/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
IndexError: list index out of range
PYPHLAWD DONE ٩(๑˃́ꇴ˂̀๑)۶
Total time (H:M:S): 0:00:06.033473 ◦°˚(*❛‿❛)/˚°◦ (⌐■_■)
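
To compare reports like the ones in this thread, a small sketch (not from any of the posters) can pull the missing path out of a saved log and split it into directory and filename. In the logs above, the filename part is seqs. followed by text that reads like a sequence record description rather than the name of a compressed file, which may be a useful clue. The function name missing_paths is hypothetical.

```python
import posixpath
import re

# Matches the quoted path at the end of a Python FileNotFoundError line.
_MISSING = re.compile(
    r"FileNotFoundError: \[Errno 2\] No such file or directory: '(.*)'"
)


def missing_paths(log_text):
    """Return (directory, filename) pairs for each missing file reported in log_text."""
    pairs = []
    for line in log_text.splitlines():
        match = _MISSING.search(line)
        if match:
            directory, filename = posixpath.split(match.group(1))
            pairs.append((directory, filename))
    return pairs
```

Feeding it the contents of the log file written by -l (or copied terminal output) gives the directory that was searched and the name that was looked up there.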

@bheimbu

bheimbu commented Jan 2, 2023

Hi and a happy new year,

I'm experiencing the same issue; any help would be highly appreciated!

It would also be nice if the website (https://fephyfofum.github.io/PyPHLAWD/) could be updated as there is no more setup_clade.py (which is now called setup_clade_ap.py).

Cheers Bastian

@YingyingYang2019

Hi bheimbu! Happy new year!
Regarding your issue: mine works with the old version of PyPHLAWD, so if you have an old version, you could try that. The new version doesn't work well at the moment. Good luck!

Yingyya

@bheimbu

bheimbu commented Jan 3, 2023

Hi @YingyingYang2019,

you made my day, it's working with the old version (downloaded as source code from here).

Cheers Bastian

@harsimranpadam

Hi. I would just like to add that I am having the same trouble. If you figure anything out, please keep me updated. I also couldn't understand how to have the genus & sequence for this. If that is possible, please let me know.
Here is the command I am running into trouble with:

python3 setup_clade_ap.py -t Laurales -b /Users/administrator_ge/Desktop/pln.db -s /Users/administrator_ge/Desktop/seq -o /Users/administrator_ge/Desktop/output -l /Users/administrator_ge/Desktop/logfile.md.gz -f /Users/administrator_ge/Desktop/taxalist.txt

STARTING PYPHLAWD ٩(⚙ȏ⚙)۶
LIMITING TO TAXA IN /Users/administrator_ge/Desktop/taxalist.txt
MAKING TREE Laurales ╰(✧∇✧)╯
MAKING DIRS IN /Users/administrator_ge/Desktop/output Σ(ノ°▽°)ノ
PROBLEM CREATING /Users/administrator_ge/Desktop/output/Laurales_3432 (;へ:)
POPULATING DIRS /Users/administrator_ge/Desktop/output Σ(*ノ´>ω<。`)ノ
Traceback (most recent call last):
  File "/Users/administrator_ge/apps/PyPHLAWD/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/administrator_ge/apps/PyPHLAWD/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/administrator_ge/apps/PyPHLAWD/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/administrator_ge/Desktop/seq//seqs.Hernandia nymphaeifolia trnL-trnF intergenic spacer region and trnF gene, partial sequence; chloroplast gene for chloroplast product.'
CREATED TEMPDIR_77128/
CLUSTERING SINGLE /Users/administrator_ge/Desktop/output/Laurales_3432/Hernandiaceae_22009/Gyrocarpus_13552 (ノдヽ)
Traceback (most recent call last):
  File "/Users/administrator_ge/apps/PyPHLAWD/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
PYPHLAWD DONE ୧༼✿ ͡◕ д ◕͡ ༽୨
Total time (H:M:S): 0:01:01.869942 ヽ(^o^)丿
(⌐■_■)
