Make array of VCF info and GFF group field. #6

Open
ghuls opened this issue Jun 20, 2012 · 1 comment
Comments

ghuls commented Jun 20, 2012

GTF file:

$ zcat ./test.gtf.gz
chr21   hg19_knownGene  exon    33026871    33027740    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33030247    33030540    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33031710    33031813    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  start_codon 33032083    33032085    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33032083    33032154    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33031935    33032154    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33036103    33036199    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33036103    33036199    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33038762    33038831    0.000000    +   2   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33038762    33038831    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33039571    33039688    0.000000    +   1   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33039571    33039688    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33040784    33040888    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  stop_codon  33040889    33040891    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33040784    33041243    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

List some fields of GTF with bioawk:

$ ./bioawk -c gff '{ print $feature,$start,$group }' ./test.gtf.gz
exon    33026871    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33030247    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33031710    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
start_codon 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33031935    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
stop_codon  33040889    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

It would be nice if the GFF group field and the VCF info field ($group and $info)
were arrays keyed by the name of each subfeature:

$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
exon    33026871    uc002yoz.1  uc002yoz.1
exon    33030247    uc002yoz.1  uc002yoz.1
exon    33031710    uc002yoz.1  uc002yoz.1
start_codon 33032083    uc002ypa.3  uc002ypa.3
CDS 33032083    uc002ypa.3  uc002ypa.3
exon    33031935    uc002ypa.3  uc002ypa.3
CDS 33036103    uc002ypa.3  uc002ypa.3
exon    33036103    uc002ypa.3  uc002ypa.3
CDS 33038762    uc002ypa.3  uc002ypa.3
exon    33038762    uc002ypa.3  uc002ypa.3
CDS 33039571    uc002ypa.3  uc002ypa.3
exon    33039571    uc002ypa.3  uc002ypa.3
CDS 33040784    uc002ypa.3  uc002ypa.3
stop_codon  33040889    uc002ypa.3  uc002ypa.3
exon    33040784    uc002ypa.3  uc002ypa.3

At the moment it is not supported and I get the following error:

./bioawk: can't assign to group; it's an array name.
 source line number 1
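
As far as I can tell, the error arises because awk parses `$group[...]` as `$(group[...])`, which makes `group` an array, while `-c gff` appears to define `group` as a plain variable holding the column number. Until something like the above is supported, a possible workaround is to split the attribute column by hand. This is only a sketch in plain awk: `gtf_ids` and `attr` are names I made up, it assumes tab-separated input with the attributes in column 9, and it would break on values that themselves contain a `;`.

```shell
# Hypothetical helper (not part of bioawk): reads tab-separated GTF on stdin
# and prints feature, start, gene_id and transcript_id.
gtf_ids() {
  awk -F'\t' -v OFS='\t' '
    {
      split("", attr)                  # clear the per-record map (portable idiom)
      n = split($9, parts, ";")        # column 9: gene_id "x"; transcript_id "y";
      for (i = 1; i <= n; i++) {
        item = parts[i]
        gsub(/^ +| +$/, "", item)      # trim surrounding spaces
        sp = index(item, " ")          # key and value are separated by one space
        if (sp == 0) continue          # skip empty or malformed entries
        key = substr(item, 1, sp - 1)
        val = substr(item, sp + 1)
        gsub(/"/, "", val)             # drop the double quotes around the value
        attr[key] = val
      }
      print $3, $4, attr["gene_id"], attr["transcript_id"]
    }'
}
```

It could be used as `zcat ./test.gtf.gz | gtf_ids`. The same loop should also work inside a bioawk program, as long as the array is not named after one of the `-c gff` column variables.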

ghuls commented Jun 26, 2012

This:

$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz

should be:

$ ./bioawk -c gff '{ print $feature,$start,$group["gene_id"],$group["transcript_id"] }' ./test.gtf.gz

of course.
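
For anyone wondering why the quotes matter: in awk an unquoted `gene_id` is an uninitialized variable, so it evaluates to the empty string and the subscript becomes `""` rather than `"gene_id"`. A small plain-awk illustration (the function name `subscript_demo` and the array `a` are made up for this example):

```shell
# Hypothetical demo: compare an unquoted and a quoted array subscript in awk.
subscript_demo() {
  awk 'BEGIN {
    a["gene_id"] = "uc002yoz.1"
    print "unquoted: [" a[gene_id] "]"     # gene_id is an empty variable here
    print "quoted: [" a["gene_id"] "]"     # string literal key
  }'
}
```

Calling `subscript_demo` prints an empty pair of brackets for the unquoted form and the stored value for the quoted one.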

ghuls closed this as completed Jun 26, 2012
ghuls reopened this Jun 26, 2012