H5MANIPULATOR requires libraries for HDF5 files. On Windows, these are bundled with the rhdf5
package. On Linux/Unix, you will need to first install hdf5 libraries.
This can usually be accomplished with:
sudo apt-get install libhdf5-dev
or
sudo yum install hdf5-devel
Once the HDF5 libraries are available, you can proceed to installing rhdf5.
The rhdf5 package is provided through Bioconductor, and can be installed using:
if(!"BiocManager" %in% .packages(all.available = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("rhdf5")
H5MANIPULATOR also requires the data.table, ids, and Matrix packages, which are available on CRAN and should be installed automatically by install_github().
This package can be installed from GitHub using the devtools package.
You may first need to register your GitHub PAT, as this is a private repository.
To get an access token from GitHub:
- Navigate to Settings / Developer settings
- Click Personal access tokens
- Generate new token (or re-generate if you have an existing one but you didn't copy it to your password manager).
- Under Select scopes, give the token the repo scope
GitHub will then remind you: Make sure to copy your new personal access token now. You won’t be able to see it again!
Then set the token in your R session and install the package:
Sys.setenv(GITHUB_PAT = "your-access-token-here")
devtools::install_github("bwh-bioinformatics-hub/H5MANIPULATOR")
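Because install_github() needs the token for a private repository, it can help to confirm that GITHUB_PAT is actually set before installing. The following is a minimal sketch in base R; has_github_pat() is a hypothetical helper written for this example and is not part of H5MANIPULATOR or devtools.

```r
# Hypothetical helper (not part of H5MANIPULATOR or devtools):
# returns TRUE when the GITHUB_PAT environment variable is set and non-empty.
has_github_pat <- function() {
  nzchar(Sys.getenv("GITHUB_PAT"))
}

if (!has_github_pat()) {
  message("GITHUB_PAT is not set; install_github() may fail for a private repository.")
}
```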
The .h5 files produced here include richer metadata and additional results.
We can read an .h5 file directly into a Seurat object using read_h5_seurat():
library(H5MANIPULATOR)
so <- read_h5_seurat(h5_file)
This function places the RNA-seq counts in the "RNA" assay.
Likewise, we can read directly into a SingleCellExperiment object for use with Bioconductor packages using read_h5_sce():
library(H5MANIPULATOR)
sce <- read_h5_sce(h5_file)
Note that this requires a recent version of SingleCellExperiment (>= 1.8.0).
There is a convenience function to directly read the main cell x gene matrix from the HDF5 file, read_h5_dgCMatrix():
library(H5MANIPULATOR)
mat <- read_h5_dgCMatrix(h5_file)
Note: By default, this matrix will be 1-indexed for convenient use in R. If you would rather retrieve a 0-indexed matrix, set the index1 parameter to FALSE:
mat <- read_h5_dgCMatrix(h5_file,
                         index1 = FALSE)
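To illustrate what 0-based versus 1-based indexing means for a sparse matrix, here is a minimal sketch using Matrix::sparseMatrix() on toy data. It does not call H5MANIPULATOR; the vectors indices, indptr, and values are made up for this example, and how read_h5_dgCMatrix() applies index1 internally is not shown here.

```r
library(Matrix)

# Made-up toy data in compressed sparse column layout.
indices <- c(0L, 2L, 1L)  # 0-based row index of each non-zero value
indptr  <- c(0L, 2L, 3L)  # column pointers (always 0-based in sparseMatrix())
values  <- c(5, 1, 3)

# Treat the row indices as 0-based.
mat0 <- sparseMatrix(i = indices, p = indptr, x = values,
                     dims = c(3, 2), index1 = FALSE)

# Same matrix, with the row indices shifted to R's usual 1-based convention.
mat1 <- sparseMatrix(i = indices + 1L, p = indptr, x = values,
                     dims = c(3, 2))

# Both calls describe the same 3 x 2 matrix.
stopifnot(identical(as.matrix(mat0), as.matrix(mat1)))
```

Either way the resulting object is indexed from 1 when subset in R; the index1 argument of sparseMatrix() only controls how the supplied indices are interpreted.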
A convenience function is provided to retrieve all cell/observation-based metadata, read_h5_cell_meta():
cell_meta <- read_h5_cell_meta(h5_file)
Note that for the test dataset, this is only the cell barcodes, as additional metadata are not present.
A similar function is also provided for gene/feature-based metadata, read_h5_feature_meta():
feat_meta <- read_h5_feature_meta(h5_file)
To read the entirety of an HDF5 file as a list object, use h5dump():
library(H5MANIPULATOR)
h5_list <- h5dump(h5_file)
str(h5_list)
This is a very raw representation of the contents of these HDF5 files. You may want to convert the major components to a sparse matrix (for cell x gene counts) and a data.frame (for metadata):
h5_list <- h5_list_convert_to_dgCMatrix(h5_list,
                                        target = "matrix")
mat <- h5_list$matrix_dgCMatrix
feature_metadata <- as.data.frame(h5_list$matrix$features[-1])
Now, mat will consist of a dgCMatrix with genes as rows and barcodes as columns, and feature_metadata will be a data.frame with genes as rows and various metadata as columns.
For this test dataset, there isn't any cell metadata. However, files that are generated by our pipeline will include a substantial metadata set stored in matrix/observations. This can be retrieved with:
cell_metadata <- cbind(data.frame(barcodes = h5_list$matrix$barcodes),
                       as.data.frame(h5_list$matrix$observations))
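As a rough illustration of what this cbind() produces, the sketch below builds a small stand-in for h5_list by hand. The barcodes and the observation fields n_umis and well_id are hypothetical values chosen for the example; the fields in a real file depend on the pipeline that wrote it.

```r
# Hand-built stand-in for the list returned by h5dump(); all values are hypothetical.
h5_list <- list(
  matrix = list(
    barcodes = c("AAACCTGA-1", "AAACGGGT-1"),
    observations = list(
      n_umis  = c(1200L, 950L),
      well_id = c("W1", "W1")
    )
  )
)

# Same pattern as above: one row per barcode, one column per observation field.
cell_metadata <- cbind(data.frame(barcodes = h5_list$matrix$barcodes),
                       as.data.frame(h5_list$matrix$observations))

str(cell_metadata)
```

This works because every element of observations has one value per barcode, so as.data.frame() can line the columns up row by row.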