Return sparse representation for genotype #310

Open
quattro opened this issue Aug 30, 2024 · 0 comments

Hi @brentp,

thanks again for developing such a fantastic tool. We use it for nearly every project in my group!

I'm curious if it would be possible to add a new property/method to VariantInfo that returns a sparse representation of genotypes. Ideally, something like var.sparse_genotypes that returns a (values, indices) pair: the non-zero genotype values and the sample indices where they occur.

This is already achievable with numpy filtering of var.gt_types, but it is somewhat slow, and I'm curious whether doing this in Cython would be faster.
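
For reference, a minimal sketch of the numpy workaround might look like the following. It assumes the VCF was opened with gts012=True, so that gt_types is encoded as 0 = HOM_REF, 1 = HET, 2 = HOM_ALT, 3 = UNKNOWN. The helper name sparse_from_gt_types is hypothetical and not part of cyvcf2, and a toy array stands in for var.gt_types.

```python
import numpy as np

UNKNOWN = 3  # missing-genotype code, assuming the VCF was opened with gts012=True


def sparse_from_gt_types(gt_types, include_missing=False):
    """Return (values, sample_indices) for the non-zero entries of gt_types.

    Hypothetical helper, not part of cyvcf2.
    """
    gt_types = np.asarray(gt_types)
    mask = gt_types != 0  # drop HOM_REF calls
    if not include_missing:
        mask &= gt_types != UNKNOWN  # also drop missing calls
    idxs = np.flatnonzero(mask)
    return gt_types[idxs], idxs


# Toy array standing in for var.gt_types of one variant across six samples.
values, idxs = sparse_from_gt_types(np.array([0, 1, 3, 2, 0, 1]))
```

This allocates a boolean mask and two new arrays per variant, which is the overhead a Cython-side implementation could presumably avoid.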

The overall goal is to be able to build a sparse genotype matrix across all variants, which would look something like:

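import numpy as np
from cyvcf2 import VCF
from scipy.sparse import coo_matrix

vcf = VCF(...)
data = []
indices = []
for vdx, var in enumerate(vcf):
    # proposed API: non-zero genotype values and the sample indices where they occur
    _data, _idxs = var.sparse_genotypes(include_missing=False)
    # construct (sample index, variant index) pairs for this variant
    _idx = np.column_stack((_idxs, np.ones_like(_idxs) * vdx))
    data.append(_data)
    indices.append(_idx)

data = np.concatenate(data)
indices = np.concatenate(indices)

n = len(vcf.samples)
p = vdx + 1  # number of variants (vdx is the last 0-based index)
# coo_matrix expects (data, (row_indices, col_indices))
sp_geno_mat = coo_matrix((data, (indices[:, 0], indices[:, 1])), shape=(n, p))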