MGJW Problem set 7

Background:

You've made it far enough now that we can start chaining tools together into (small) pipelines to ask questions about the pangenome of species or isolates of interest. While we've spent most of our examples and problems sets working on S. aureus genomes, we're going to change it up and take a look at a bacterial species with an infamously large pangenome: Neisseria meningitidis. N. meningitidis is a common human commensal, and one of the major causes of bacterial meningitis world-wide. If you're in the US, you've likely been vaccinated against the most common circulating serotypes (called A, C, W, and Y), and possibly one other serotype responsible for some particularly nasty infections (B).

In this problem set, you'll collect the genomes for 8 isolates from NCBI. This is going to include a reference serotype B strain, called MC58. This strain is very well characterized and one of the major lab strains studied in recent literature.

I'll be providing minimal instructions; do your best to work through this and do hesitate to ask us questions. When I do include code, I have left sections it very generic. Some text will need to be replaced with the specifics for your device or strain.

Since we haven't discussed downloading public data in class, I made a short video showing how its done:

Video_instructions

Breifly, we navigate to NCBI's nucleotide database, paste in our accession number in the search bar, and hit search. This will bring up the record for this isolate. You then click on "Send to" in the upper right corner of the white box, then click "File", and finally select "FASTA" from the Format drop down box. This will automatically download the record to your computer. BE SURE TO RENAME THEM WITH THE PROPER IDENTIFIER WHEN YOU MOVE THEM INTO A FOLDER FOR THE PROBLEM SET.

Get the data.

1. Download the FASTA formatted sequences of your genomes from NCBI webpage or using a command-line tool (Hint: what about using ncbi-datasets tool?). Create a problem_set_7 directory in your MGJW folder and move these files into that folder.

Below are the accession numbers if you will use the website:

NC_003112.2
NC_017516.1
NZ_CP009419.1
NZ_CP007668.1
NC_017517.1
NC_017515.1
NC_017505.1
NC_017512.1

Below are the accession numbers if you will use the ncbi-datasets tool:

GCF_000008805.1
GCF_000191445.1
GCF_000800275.1
GCF_001029835.1
GCF_000191485.1
GCF_000191505.1
GCF_000152165.1
GCF_000253215.1

Part 1

Prokka

2. Create a subdirectory in your problem set 7 folder for the Prokka output files.

3. Standardize the annotations by running Prokka on these fasta files. Feel free to do this with a shell script if you're comfortable. HINT Be sure you are in the proper Conda environment.

Conda:

prokka --outdir <STRAIN NAME> --kingdom Bacteria --prefix <STRAIN NAME> --gcode 11 --genus Neisseria <file>.fasta

Docker:

Pull the docker image

docker pull qianxuanshe/mgjw-sep2023-s7:latest

Run the docker image

docker run -it qianxuanshe/mgjw-sep2023-s7

Make directory for Prokka output and go to the directory

mkdir prokka_results
cd prokka_results

Run Prokka

prokka --outdir /session7/prokka_results/NC_003112.2 --kingdom Bacteria --locustag NC_003112.2 --genus Neisseria --gcode 11 --species meningitidis  --prefix NC_003112.2 /session7/fna/NC_003112.2.fna
prokka --outdir /session7/prokka_results/NC_017516.1 --kingdom Bacteria --locustag NC_017516.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NC_017516.1 /session7/fna/NC_017516.1.fna
prokka --outdir /session7/prokka_results/NZ_CP009419.1 --kingdom Bacteria --locustag NZ_CP009419.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NZ_CP009419.1 /session7/fna/NZ_CP009419.1.fna
prokka --outdir /session7/prokka_results/NZ_CP007668.1 --kingdom Bacteria --locustag NZ_CP007668.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NZ_CP007668.1 /session7/fna/NZ_CP007668.1.fna
prokka --outdir /session7/prokka_results/NC_017517.1 --kingdom Bacteria --locustag NC_017517.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NC_017517.1 /session7/fna/NC_017517.1.fna
prokka --outdir /session7/prokka_results/NC_017515.1 --kingdom Bacteria --locustag NC_017515.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NC_017515.1 /session7/fna/NC_017515.1.fna
prokka --outdir /session7/prokka_results/NC_017505.1 --kingdom Bacteria --locustag NC_017505.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NC_017505.1 /session7/fna/NC_017505.1.fna
prokka --outdir /session7/prokka_results/NC_017512.1 --kingdom Bacteria --locustag NC_017512.1 --genus Neisseria --gcode 11 --species meningitidis  --prefix NC_017512.1 /session7/fna/NC_017512.1.fna

How many CDS did prokka identify? Can you figure out how many were identified when the submitted ran PGAP during the upload process? Hint Check the genome record page in the nucleotide database where you downloaded everything from.

Part 2

Roary

4. Collect the gff files from Prokka (again either by hand or with a script if you are able), and create a subdirectory called <./Roary/>.

5. Run Roary. What do all of these options I'm having you include mean?

Conda:

roary -f <OUTPUT DIRECTORY> -e -n -i 95 -cd 99 -r -s *.gff

Docker:

roary -f <OUTPUT DIRECTORY> -e -n -i 95 -cd 99 -r -s /session7/gff/*.gff

6. Run Roary again, changing the cut-off scores (-i) to 75, then again with 50. How do the results change?

7. Run Roary one last time:

Conda:

roary -f <OUTPUT DIRECTORY> -e -n -i 95 -cd 90 -r -s *.gff

Docker:

roary -f <OUTPUT DIRECTORY> -e -n -i 95 -cd 99 -r -s /session7/gff/*.gff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem_Set7.md

Problem_Set7.md

MGJW Problem set 7

Background:

In this problem set, you'll collect the genomes for 8 isolates from NCBI. This is going to include a reference serotype B strain, called MC58. This strain is very well characterized and one of the major lab strains studied in recent literature.

I'll be providing minimal instructions; do your best to work through this and do hesitate to ask us questions. When I do include code, I have left sections it very generic. Some text will need to be replaced with the specifics for your device or strain.

Since we haven't discussed downloading public data in class, I made a short video showing how its done:

Get the data.

1. Download the FASTA formatted sequences of your genomes from NCBI webpage or using a command-line tool (Hint: what about using ncbi-datasets tool?). Create a problem_set_7 directory in your MGJW folder and move these files into that folder.

Below are the accession numbers if you will use the website:

Below are the accession numbers if you will use the ncbi-datasets tool:

Part 1

Prokka

2. Create a subdirectory in your problem set 7 folder for the Prokka output files.

3. Standardize the annotations by running Prokka on these fasta files. Feel free to do this with a shell script if you're comfortable. HINT Be sure you are in the proper Conda environment.

Conda:

Docker:

Pull the docker image

Run the docker image

Make directory for Prokka output and go to the directory

Run Prokka

How many CDS did prokka identify? Can you figure out how many were identified when the submitted ran PGAP during the upload process? Hint Check the genome record page in the nucleotide database where you downloaded everything from.

Part 2

Roary

4. Collect the gff files from Prokka (again either by hand or with a script if you are able), and create a subdirectory called <./Roary/>.

5. Run Roary. What do all of these options I'm having you include mean?

Conda:

Docker:

6. Run Roary again, changing the cut-off scores (-i) to 75, then again with 50. How do the results change?

7. Run Roary one last time:

Conda:

Docker:

How did this change the results?

Files

Problem_Set7.md

Latest commit

History

Problem_Set7.md

File metadata and controls

MGJW Problem set 7

Background:

In this problem set, you'll collect the genomes for 8 isolates from NCBI. This is going to include a reference serotype B strain, called MC58. This strain is very well characterized and one of the major lab strains studied in recent literature.

I'll be providing minimal instructions; do your best to work through this and do hesitate to ask us questions. When I do include code, I have left sections it very generic. Some text will need to be replaced with the specifics for your device or strain.

Since we haven't discussed downloading public data in class, I made a short video showing how its done:

Get the data.

1. Download the FASTA formatted sequences of your genomes from NCBI webpage or using a command-line tool (Hint: what about using ncbi-datasets tool?). Create a problem_set_7 directory in your MGJW folder and move these files into that folder.

Below are the accession numbers if you will use the website:

Below are the accession numbers if you will use the ncbi-datasets tool:

Part 1

Prokka

2. Create a subdirectory in your problem set 7 folder for the Prokka output files.

3. Standardize the annotations by running Prokka on these fasta files. Feel free to do this with a shell script if you're comfortable. HINT Be sure you are in the proper Conda environment.

Conda:

Docker:

Pull the docker image

Run the docker image

Make directory for Prokka output and go to the directory

Run Prokka

How many CDS did prokka identify? Can you figure out how many were identified when the submitted ran PGAP during the upload process? Hint Check the genome record page in the nucleotide database where you downloaded everything from.

Part 2

Roary

4. Collect the gff files from Prokka (again either by hand or with a script if you are able), and create a subdirectory called <./Roary/>.

5. Run Roary. What do all of these options I'm having you include mean?

Conda:

Docker:

6. Run Roary again, changing the cut-off scores (-i) to 75, then again with 50. How do the results change?

7. Run Roary one last time:

Conda:

Docker:

How did this change the results?