Skip to content
Pierre Lindenbaum edited this page Apr 6, 2016 · 17 revisions

##Motivation

javascript version of awk for bioinformatics

##Compilation

Requirements / Dependencies

Download and Compile

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make bioalcidae

by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes on the command line:

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make bioalcidae standalone=yes

The required libraries will be downloaded and installed in the dist directory.

edit 'local.mk' (optional)

The a file local.mk can be created edited to override/add some paths.

For example it can be used to set the HTTP proxy:

http.proxy.host=your.host.com
http.proxy.port=124567

##Synopsis

$ java -jar dist/bioalcidae.jar  [options] (stdin|file) 

Options

  • -o|--output (OUTPUT-FILE) Output file. Default:stdout
  • -f|--jsfile (SCRIPT.JS) Javascript file. Use either javascript file or javascript expression.
  • -e|--jsexpr (SCRIPT) Javascript expression. Use either javascript file or javascript expression.
  • -F|--format (FORMAT) force format: one of VCF BAM SAM FASTQ FASTA
  • -h|--help print help
  • -version|--version show version and exit

##Source Code

Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/bioalcidae/BioAlcidae.java

Bioinformatics file javascript-based reformatter ( java engine http://openjdk.java.net/projects/nashorn/ ). Something like awk for VCF, BAM, SAM, FASTQ, FASTA etc...

The program injects the following variables:

VCF

for VCF , the program injects the following variables:

Fasta

  • iter a java.util.Iterator

	public class Fasta 
		{
		public String getSequence();
		public String getName();
		public void print();
		public int getSize();
		public char charAt(int i);
		}

BAM or SAM

FASTQ

Example

BAM

getting an histogram of the length of the reads


L={};
while(iter.hasNext()) {
	var rec=iter.next();
	if( rec.getReadUnmappedFlag() || rec.isSecondaryOrSupplementary()) continue;
	var n= rec.getReadLength();
	if(n in L) {L[n]++;} else {L[n]=1;}
	}
for(var i in L) {
	out.println(""+i+"\t"+L[i]);
	}

Fasta

"Creating a consensus based on 'x' number of fasta files" ( https://www.biostars.org/p/185162/#185168)


$ echo -e ">A_2795\nTCAGAAAGAACCTC\n>B_10\nTCAGAAAGCACCTC\n>C_3\nTCTGAAAGCACTTC" |\
java -jar ~/src/jvarkit-git/dist/bioalcidae.jar -F fasta -e 'var consensus=[];while(iter.hasNext()) { var seq=iter.next();out.printlnseq.name+"\t"+seq.sequence);for(var i=0;i< seq.length();++i) {while(consensus.length <= i) consensus.push({}); var b = seq.charAt(i);if(b in consensus[i]) {consensus[i][b]++;} else {consensus[i][b]=1;} } } out.print("Cons.\t"); for(var i=0;i< consensus.length;i++) {var best=0,base="N"; for(var b in consensus[i]) {if(consensus[i][b]>best) { best= consensus[i][b];base=b;}} out.print(base);} out.println();' 


A_2795      TCAGAAAGAACCTC
B_10         TCAGAAAGCACCTC
C_3           TCTGAAAGCACTTC
Cons.        TCAGAAAGCACCTC

VCF

Reformating a VCF we want to reformat a VCF with header


CHROM POS REF ALT GENOTYPE_SAMPLE1 GENOTYPE_SAMPLE2 ... GENOTYPE_SAMPLEN

we use the following javascript file:


var samples = header.sampleNamesInOrder;
out.print("CHROM\tPOS\tREF\tALT");
for(var i=0;i< samples.size();++i)
	{
	out.print("\t"+samples.get(i));
	}
out.println();

while(iter.hasNext())
	{
	var ctx = iter.next();
	if(ctx.alternateAlleles.size()!=1) continue;
	out.print(ctx.chr +"\t"+ctx.start+"\t"+ctx.reference.displayString+"\t"+ctx.alternateAlleles.get(0).displayString);
	for(var i=0;i< samples.size();++i)
		{
		var g = ctx.getGenotype(samples.get(i));

		out.print("\t");

		if(g.isHomRef())
			{
			out.print("0");
			}
		else if(g.isHomVar())
			{
			out.print("2");
			}
		else if(g.isHet())
			{
			out.print("1");
			}
		else
			{
			out.print("-9");
			}
		}
	out.println();
	}


$ curl -s  "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" | \
gunzip -c | java -jar ./dist/bioalcidae.jar -f jeter.js -F vcf | head -n 5 | cut -f 1-10

CHROM	POS	REF	ALT	HG00096	HG00097	HG00099	HG00100	HG00101	HG00102
22	16050075	A	G	0	0	0	0	0	0
22	16050115	G	A	0	0	0	0	0	0
22	16050213	C	T	0	0	0	0	0	0
22	16050319	C	T	0	0	0	0	0	0


for 1000 genome data, print CHROM/POS/REF/ALT/AF(europe):


$ curl  "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf.gz" |  gunzip -c |\
java -jar dist/bioalcidae.jar  -F VCF -e 'while(iter.hasNext()) {var ctx=iter.next(); if(!ctx.hasAttribute("EUR_AF") || ctx.alternateAlleles.size()!=1) continue; out.println(ctx.chr+"\t"+ctx.start+"\t"+ctx.reference.displayString+"\t"+ctx.alternateAlleles.get(0).displayString+"\t"+ctx.getAttribute("EUR_AF"));}' 

1	10177	A	AC	0.4056
1	10235	T	TA	0
1	10352	T	TA	0.4264
1	10505	A	T	0
1	10506	C	G	0
1	10511	G	A	0
1	10539	C	A	0.001
1	10542	C	T	0
1	10579	C	A	0
1	10616	CCGCCGTTGCAAAGGCGCGCCG	C	0.994
(...)

See also

Contribute

License

The project is licensed under the MIT license.

Citing

Should you cite bioalcidae ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

http://dx.doi.org/10.6084/m9.figshare.1425030

Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030

Clone this wiki locally