You've made it far enough now that we can start chaining tools together into (small) pipelines to ask questions about the pangenome of species or isolates of interest. While we've spent most of our examples and problem sets working on S. aureus genomes, we're going to change it up and take a look at a bacterial species with an infamously large pangenome: Neisseria meningitidis. N. meningitidis is a common human commensal and one of the major causes of bacterial meningitis worldwide. If you're in the US, you've likely been vaccinated against the most common circulating serotypes (called A, C, W, and Y), and possibly one other serotype responsible for some particularly nasty infections (B).
In this problem set, you'll collect the genomes for 8 isolates from NCBI. This is going to include a reference serotype B strain, called MC58. This strain is very well characterized and one of the major lab strains studied in recent literature.
I'll be providing minimal instructions; do your best to work through this, and don't hesitate to ask us questions. When I do include code, I have left sections of it very generic. Some text will need to be replaced with the specifics for your device or strain.
Since we haven't discussed downloading public data in class, I made a short video showing how it's done:
Briefly, we navigate to NCBI's nucleotide database, paste our accession number into the search bar, and hit search. This will bring up the record for this isolate. You then click on "Send to" in the upper right corner of the white box, then click "File", and finally select "FASTA" from the Format drop-down box. This will automatically download the record to your computer. BE SURE TO RENAME THEM WITH THE PROPER IDENTIFIER WHEN YOU MOVE THEM INTO A FOLDER FOR THE PROBLEM SET.
Get the data.
1. Download the FASTA-formatted sequences of your genomes from the NCBI webpage or with a command-line tool (Hint: what about using the ncbi-datasets tool?). Create a problem_set_7 directory in your MGJW folder and move these files into that folder.
Below are the accession numbers if you will use the website:
NC_003112.2
NC_017516.1
NZ_CP009419.1
NZ_CP007668.1
NC_017517.1
NC_017515.1
NC_017505.1
NC_017512.1
Below are the accession numbers if you will use the ncbi-datasets tool:
GCF_000008805.1
GCF_000191445.1
GCF_000800275.1
GCF_001029835.1
GCF_000191485.1
GCF_000191505.1
GCF_000152165.1
GCF_000253215.1
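If you go the command-line route, the download can be looped over the assembly accessions above. This is a minimal sketch, not the official solution. It assumes a recent release of the ncbi-datasets CLI (the `--include genome` and `--filename` flags may differ in older versions). The function name `download_assemblies` and the `DATASETS` variable are hypothetical names made up for this example.

```shell
# Hypothetical helper: download one zip archive per assembly accession.
# Assumption: a recent ncbi-datasets CLI is installed and on your PATH.
# DATASETS can be overridden (e.g. DATASETS=echo) to preview the commands
# without downloading anything.
DATASETS="${DATASETS:-datasets}"

download_assemblies() {
  local acc
  for acc in "$@"; do
    # One archive per accession, named after the accession itself.
    "$DATASETS" download genome accession "$acc" \
      --include genome --filename "$acc.zip"
  done
}
```

You would call it as `download_assemblies GCF_000008805.1 GCF_000191445.1` (and so on for all eight), then unzip each archive. The FASTA file should sit somewhere under the unzipped `ncbi_dataset` folder, but check the layout yourself, and remember to rename each file with the proper identifier.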
Part 1
Prokka
2. Create a subdirectory in your problem set 7 folder for the Prokka output files.
3. Standardize the annotations by running Prokka on these FASTA files. Feel free to do this with a shell script if you're comfortable. HINT: Be sure you are in the proper Conda environment.
How many CDS did Prokka identify? Can you figure out how many were identified when the submitter ran PGAP during the upload process? Hint: Check the genome record page in the nucleotide database where you downloaded everything from.
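If you want to script step 3, one possible shape is a loop that runs Prokka once per genome. This is a sketch under assumptions: your files end in `.fasta`, each file name (minus the extension) is the identifier you want as the prefix, and the function name `annotate_all` and the `PROKKA` variable are hypothetical. Only the `--outdir` and `--prefix` options are used here; add any others you need.

```shell
# Hypothetical helper: run Prokka once per FASTA file in a directory.
# PROKKA can be overridden (e.g. PROKKA=echo) for a dry run that only
# prints the commands instead of running them.
PROKKA="${PROKKA:-prokka}"

annotate_all() {
  local fasta_dir="$1" out_dir="$2"
  local fasta name
  for fasta in "$fasta_dir"/*.fasta; do
    [ -e "$fasta" ] || continue          # skip if the glob matched nothing
    name="$(basename "$fasta" .fasta)"   # e.g. MC58.fasta -> MC58
    # Each genome gets its own output folder and file prefix.
    "$PROKKA" --outdir "$out_dir/$name" --prefix "$name" "$fasta"
  done
}
```

For example, `annotate_all . ./prokka` would annotate every `.fasta` file in the current folder. Running it first with `PROKKA=echo` lets you check the commands before committing the compute time.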
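One way to get the Prokka side of this comparison is to count the CDS features directly in each GFF file. The sketch below assumes a standard tab-separated GFF3 file, where the third column holds the feature type. The function name `count_cds` is hypothetical. Prokka's summary `.txt` file should report a similar number, so it is worth comparing the two.

```shell
# Hypothetical helper: count CDS features in a GFF3 file.
count_cds() {
  # Column 3 of a GFF3 feature line is the feature type. Header lines and
  # the sequence section at the end have no third tab-separated column,
  # so they never match.
  awk -F'\t' '$3 == "CDS" { n++ } END { print n + 0 }' "$1"
}
```

Calling `count_cds prokka/MC58/MC58.gff` (a made-up path) prints a single number that you can set beside the PGAP count from the NCBI record.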
Part 2
Roary
4. Collect the gff files from Prokka (again, either by hand or with a script if you are able) into a new subdirectory called <./Roary/>.
5. Run Roary. What do all of these options I'm having you include mean?
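Going back to step 4, gathering the gff files can be scripted rather than done by hand. This is a minimal sketch, assuming each isolate's Prokka output sits in its own subdirectory of a single Prokka folder. The function name `collect_gffs` and both paths are hypothetical.

```shell
# Hypothetical helper: copy every Prokka .gff file into one folder for Roary.
collect_gffs() {
  local prokka_dir="$1" roary_dir="$2"
  mkdir -p "$roary_dir"
  # One level of subdirectories: <prokka_dir>/<isolate>/<isolate>.gff
  cp "$prokka_dir"/*/*.gff "$roary_dir"/
}
```

For example, `collect_gffs ./prokka ./Roary` copies the files and leaves the Prokka output untouched, so a mistake here is easy to undo.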