Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam: keep otherTags when adding a duplicate reference on BAM read #154

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ckingsford
Copy link

This PR fixes an issue where the BAM reader (via bam.DecodeBinary) did not preserve additional tags for reference entries in the header. If, for example, a reference had a DS:foo tag, it would not be preserved in the final header after decoding.

This occurred because the BAM format lists the references in two places: the "text" portion and a binary-encoded list of references. The reading of the binary-encoded list (in bam.DecodeBinary) overrode some parts of the references read via the text portion and did not preserve the otherTags field of references that had already been read in the text portion of the header.

The fix is in sam.Header.AddReference: when adding a reference that has already been seen, if the reference being added has a nil otherTags, use the existing tags. This follows the pattern already used in AddReference for the md5, assemID, etc. fields.

@kortschak
Copy link
Member

Thanks for sending this. Can you post a small reproducer so that I can understand the issue better?

@ckingsford
Copy link
Author

Sure, here: mappings.bam.gz is a .bam file with just the header where references each have a DS tag. Output of samtools view -H | head is:

@HD     VN:1.0  SO:unknown
@SQ     SN:ENST00000456328.2    LN:1657 DS:T
@SQ     SN:ENST00000450305.2    LN:632  DS:T
@SQ     SN:ENST00000488147.1    LN:1351 DS:T
@SQ     SN:ENST00000619216.1    LN:68   DS:T
@SQ     SN:ENST00000473358.1    LN:712  DS:T
@SQ     SN:ENST00000469289.1    LN:535  DS:T
@SQ     SN:ENST00000607096.1    LN:138  DS:T
@SQ     SN:ENST00000417324.1    LN:1187 DS:T
@SQ     SN:ENST00000461467.1    LN:590  DS:T

It's the DS:T (or any other non-length tags) that is not saved in the hts lib reading of the header. A small example to see that is:

package main

import (
        "fmt"
        "os"

        "github.com/biogo/hts/bam"
        "github.com/biogo/hts/sam"
)

func main() {
        f, _ := os.Open("mappings.bam")
        b, _ := bam.NewReader(f, 4)
        tag := sam.NewTag("DS")

        for _, r := range b.Header().Refs() {
                // expected: NAME LEN (D) or NAME LEN (T) when run on attached .bam (see samtools view -H mappings.bam)
                // result: NAME LEN ()
                fmt.Printf("%s %d (%s)\n", r.Name(), r.Len(), r.Get(tag))
        }
}

@kortschak
Copy link
Member

OK. I think this is correct fix for the fast path that's used in deserialisation, but I don't think it properly handles the case when a user is adding a constructed Reference where the tags are not empty. In that case the header and reference tags should be merged.

Looking at AddReference with some distance now, I'm concerned about how I clobber the reference table's MD5, assembly ID, species and URI. I think it's too late to fix this, but I should add some docs explaining what happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants