diff --git a/docs/geniml/code/create-consensus-peaks-python.md b/docs/geniml/code/create-consensus-peaks-python.md
new file mode 100644
index 0000000..5cacdc5
--- /dev/null
+++ b/docs/geniml/code/create-consensus-peaks-python.md
@@ -0,0 +1,15268 @@
+create-consensus-peaks-python
diff --git a/docs/geniml/notebooks/create-consensus-peaks-python.ipynb b/docs/geniml/notebooks/create-consensus-peaks-python.ipynb
new file mode 100644
index 0000000..cb386d7
--- /dev/null
+++ b/docs/geniml/notebooks/create-consensus-peaks-python.ipynb
@@ -0,0 +1,504 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "66137bf9-4f78-4b89-aa64-6734400df15e",
+   "metadata": {},
+   "source": [
+    "# How to build a new universe?\n",
+    "\n",
+    "## Data preprocessing\n",
+    "This is a Jupyter version of the CLI tutorial that can be found [here](../tutorials/create-consensus-peaks.md). Here you will use Python functions instead of the CLI to build and assess different universes. Files that you will use here can be downloaded from XXX. In there you will find a compressed folder:\n",
+    "\n",
+    "```\n",
+    "consensus:\n",
+    " - raw\n",
+    "    test_1.bed\n",
+    "    test_2.bed\n",
+    "    test_3.bed\n",
+    "    test_4.bed\n",
+    " file_list.txt\n",
+    " chrom.sizes\n",
+    "```\n",
+    "\n",
+    "The raw folder contains the example BED files used in this tutorial, and file_list.txt lists the names of the files you will analyze. Additionally, there is a file with chromosome sizes, which you will use to preprocess the data. \n",
+    "\n",
+    "Here we assume that you already have files with the genome coverage by the analyzed collection. An example of how to create them can be found [here](../tutorials/create-consensus-peaks.md)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a8bc1fc3-0c43-43af-908e-0798afbbe459",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Coverage cutoff universe\n",
+    "\n",
+    "First, you will create a coverage cutoff (CC) universe. This is the simplest type of universe: it includes only genomic positions with coverage greater than or equal to a cutoff *x*. By default, this cutoff is calculated using a simple likelihood model of the probability of a position appearing in the collection. The universe can be built based only on the genome coverage:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "141886ce-3d97-4edb-8a9b-bdd5dff53bce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from geniml.universe.cc_universe import cc_universe\n",
+    "cc_universe(\"coverage/\", file_out=\"universe_cc.bed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "56af8029-97b1-4bd6-81a9-af86a6b6c83e",
+   "metadata": {},
+   "source": [
+    "Depending on the task, the universe can be smoothed by setting the `merge` option to the distance below which peaks should be merged together and \n",
+    "`filter_size` to the minimum size of a peak that should be part of the universe. Instead of using the maximum likelihood cutoff, one can also define the cutoff with the `cutoff` option. If it is set to 1, the result is the union universe; if it is set to the number of files, it will produce the intersection universe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "6e1c433d-e2f4-4fdd-9377-1d417f64102b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cc_universe(\"coverage/\", file_out=\"universe_union.bed\", cutoff=1)\n",
+    "cc_universe(\"coverage/\", file_out=\"universe_intersection.bed\", cutoff=4)"
+   ]
+  },
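+  {
+   "cell_type": "markdown",
+   "id": "aaaaaaaa-0000-4000-8000-000000000001",
+   "metadata": {},
+   "source": [
+    "For example, a smoothed CC universe could be built by combining the options described above in a single call. This is only an illustrative sketch: the parameter values and the output file name are made up, and the keyword names simply follow the option names mentioned above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aaaaaaaa-0000-4000-8000-000000000002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative only: merge peaks closer than 100 bp and drop peaks shorter than 20 bp\n",
+    "cc_universe(\"coverage/\", file_out=\"universe_cc_smoothed.bed\",\n",
+    "            merge=100, filter_size=20)"
+   ]
+  },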
+  {
+   "cell_type": "markdown",
+   "id": "f6d63171-4784-433d-a3d2-875c2ecd7be3",
+   "metadata": {},
+   "source": [
+    "## Coverage cutoff flexible universe\n",
+    "A more complex version of the coverage cutoff universe is the coverage cutoff flexible (CCF) universe. In contrast to its fixed version, it produces flexible universes. It uses two cutoffs calculated from the maximum likelihood cutoff, forming a confidence interval around the optimal cutoff value. Although the CCF universe is more complex, it is built using the same input as the CC universe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "4f431b2a-49e9-4ad5-8fcf-d77fa343e818",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from geniml.universe.ccf_universe import ccf_universe\n",
+    "\n",
+    "ccf_universe(\"coverage/\", file_out=\"universe_ccf.bed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66a7b97d-b2ac-449d-b162-5b71b15ab358",
+   "metadata": {},
+   "source": [
+    "## Maximum likelihood universe\n",
+    "In the previous examples both the CC and CCF universes used a simple likelihood model to calculate the cutoff. However, we also developed a more complex likelihood model that takes into account the positions of starts and ends of the regions in the collection. This LH model can be built based on the coverage files:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "5d1cd9be-9ba8-42a3-8543-bfbc1fcfa153",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Function 'main' executed in 0.0001min\n"
+     ]
+    }
+   ],
+   "source": [
+    "from geniml.likelihood.build_model import main\n",
+    "\n",
+    "main(\"model.tar\", \"coverage/\",\n",
+    "     \"all\",\n",
+    "     file_no=4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d05ebdb6-cf66-4b0c-abbd-2fc75f093350",
+   "metadata": {},
+   "source": [
+    "The resulting tar archive contains the LH model. This model can be used as a scoring function that assigns to each position the probability of it being a start, core, or end of a region. It can be used both for universe assessment and for universe building. Combining the LH model with an optimization algorithm for building flexible universes results in the maximum likelihood (ML) universe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "d50948d0-f46b-4eb0-a2ab-9121f315ac21",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from geniml.universe.ml_universe import ml_universe\n",
+    "\n",
+    "ml_universe(\"model.tar\",\n",
+    "            \"coverage\",\n",
+    "            \"all\",\n",
+    "            \"universe_ml.bed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "220c0272-4c83-43b0-83a2-93cd7f3292a9",
+   "metadata": {},
+   "source": [
+    "## HMM \n",
+    "The fourth presented method of creating universes utilizes Hidden Markov Models. In this approach the parts of flexible regions are hidden states of the model, while the genome coverage by the collection provides the emissions. The resulting universe is called the Hidden Markov Model universe. It can be built based only on the genome coverage by the collection:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "0bc3ea52-25cc-4d62-8727-12ea9f928c24",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from geniml.universe.hmm_universe import hmm_universe\n",
+    "\n",
+    "hmm_universe(\"coverage/\",\n",
+    "             \"universe_hmm.bed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bab159ce-aa6f-4b56-9696-59f7db7c81b2",
+   "metadata": {},
+   "source": [
+    "# How to assess a new universe?\n",
+    "\n",
+    "So far you have used many different methods for creating new universes. But choosing which universe represents the data best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collection: a base-level overlap score (F10), a region boundary score (RBS), and a likelihood score (LH). 
Here we present an example, which calculates all these metrics for HMM universe:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "6b36ce9d-5412-4ba8-afe7-964909806e0d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe F10: 0.93'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_f_10_score\n", + "\n", + "f10 = get_f_10_score(\n", + " \"raw/\",\n", + " 'file_list.txt',\n", + " \"universe_hmm.bed\",\n", + " 1)\n", + "\n", + "f\"Universe F10: {f10:.2f}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "7b29421c-8543-4fc9-ac83-ad6f7d0f70df", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe RBS: 0.77'" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_mean_rbs\n", + "rbs = get_mean_rbs(\"raw/\",\n", + " 'file_list.txt',\n", + " \"universe_hmm.bed\", 1)\n", + "f\"Universe RBS: {rbs:.2f}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "711fbb4f-4502-499e-8b5d-e879e26a0124", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe LH: -127156.87'" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_likelihood\n", + "lh = get_likelihood(\n", + " \"model.tar\",\n", + " \"universe_hmm.bed\",\n", + " \"coverage/\"\n", + ")\n", + "f\"Universe LH: {lh:.2f}\" " + ] + }, + { + "cell_type": "markdown", + "id": "171e1240-e12a-450a-9df9-ad1f0d97e398", + "metadata": {}, + "source": [ + "Both region boundary score and likelihood can be also calculated taking into account universe flexibility:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "fa597da1-d452-4583-973d-92263512b38e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe flexible RBS: 0.98'" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_mean_rbs\n", + "rbs_flex = get_mean_rbs(\n", + " \"raw/\",\n", + " 'file_list.txt',\n", + " \"universe_hmm.bed\",\n", + " 1,\n", + " flexible=True)\n", + "f\"Universe flexible RBS: {rbs_flex:.2f}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "315a1dce-e045-41a0-b1ed-e06d117bebaa", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe flexible LH: -127156.87'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lh_flex = get_likelihood(\n", + " \"model.tar\",\n", + " \"universe_hmm.bed\",\n", + " \"coverage/\"\n", + ")\n", + "f\"Universe flexible LH: {lh_flex:.2f}\" " + ] + }, + { + "cell_type": "markdown", + "id": "ece5e3df-647f-46ab-bd95-4add00ebdfd5", + "metadata": {}, + "source": [ + "In CLI version of this [tutorial](../tutorials/create-consensus-peaks.md) it was shown how to calculate an assessment file with all the metrics. This file can be further summarized into specific metrics assessing the fit of a universe to a whole collection. " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "462ffa69-3867-42b3-84f9-0cbe394dbd20", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fileunivers/filefile/universeuniverse&filemedian_dist_file_to_universemedian_dist_file_to_universe_flexmedian_dist_universe_to_filemedian_dist_universe_to_file_flex
0test_1.bed2506403363027.00.076.50.0
1test_2.bed1803146433327.00.070.07.5
2test_3.bed29490318728.00.0225.0224.5
3test_4.bed2071546406527.00.0116.5105.5
\n", + "
" + ], + "text/plain": [ + " file univers/file file/universe universe&file \\\n", + "0 test_1.bed 2506 403 3630 \n", + "1 test_2.bed 1803 146 4333 \n", + "2 test_3.bed 2949 0 3187 \n", + "3 test_4.bed 2071 546 4065 \n", + "\n", + " median_dist_file_to_universe median_dist_file_to_universe_flex \\\n", + "0 27.0 0.0 \n", + "1 27.0 0.0 \n", + "2 28.0 0.0 \n", + "3 27.0 0.0 \n", + "\n", + " median_dist_universe_to_file median_dist_universe_to_file_flex \n", + "0 76.5 0.0 \n", + "1 70.0 7.5 \n", + "2 225.0 224.5 \n", + "3 116.5 105.5 " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_rbs_from_assessment_file, get_f_10_score_from_assessment_file\n", + "import pandas as pd\n", + "\n", + "assessment_file_path = \"test_assess_data.csv\"\n", + "df = pd.read_csv(assessment_file_path)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "4f9f3a13", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Universe \n", + "F10: 0.93\n", + "RBS: 0.77\n", + "flexible RBS: 0.98\n" + ] + } + ], + "source": [ + "rbs = get_rbs_from_assessment_file(assessment_file_path)\n", + "f_10 = get_f_10_score_from_assessment_file(assessment_file_path)\n", + "rbs_flex = get_rbs_from_assessment_file(assessment_file_path, flexible=True)\n", + "print(f\"Universe \\nF10: {f_10:.2f}\\nRBS: {rbs:.2f}\\nflexible RBS: {rbs_flex:.2f}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76201732-509f-46ca-a703-ec804a03e097", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/geniml/tutorials/create-consensus-peaks.md b/docs/geniml/tutorials/create-consensus-peaks.md index 88768ab..4eba331 100644 --- a/docs/geniml/tutorials/create-consensus-peaks.md +++ b/docs/geniml/tutorials/create-consensus-peaks.md @@ -1,151 +1,107 @@ -# How to build a new universe - -We will start with simple example of how to create a consensus peak set from a -collection of files. Example data can be found in `tests/consenus/raw`. +# How to build a new universe? ## Data preprocessing +In this tutorial, you will use CLI of geniml package to build different types of universes from example files, which can be downloaded from XXX. In there you will find a compressed folder: -First step of analysis is creating three tracks with genome coverage by peaks, -their starts and ends. To do that we have to: - -1. install [uniwig](https://github.com/databio/uniwig/tree/smoothing), make sure to use branch dev -2. 
use [create_unsorted.sh](https://github.com/databio/uniwig/blob/smoothing/create_unsorted.sh) to make three bigWig:
- - {prefix}_start.bw - with smoothed coverage of genome by starts
- - {prefix}_core.bw - with coverage of genome by peaks
- - {prefix}_start.bw - with smoothed coverage of genome by ends
+```
+consensus:
+ - raw
+    test_1.bed
+    test_2.bed
+    test_3.bed
+    test_4.bed
+ file_list.txt
+ chrom.sizes
+```
-In this tutorial we will use prefix "all" as it is a default prefix in
-`geniml` module
+The raw folder contains the example BED files used in this tutorial, and file_list.txt lists the names of the files you will analyze. Additionally, there is a file with chromosome sizes, which you will use to preprocess the data.
-## Coverage cutoff universe
+To build any kind of universe you need bigWig files with the genome coverage by the analyzed collection, which can be made using [uniwig](https://github.com/databio/uniwig/). First, you have to combine all the analyzed files into one BED file:
-We will start by making a coverage universe with cutoff that results in maximum
-likelihood universe. We can build it through CLI:
+```
+cat raw/* > raw/combined_files.bed
+```
-```console
- geniml build-universe cc --coverage-folder tests/consenus/coverage/ \
- --output-file tests/consenus/universe/universe.bed
+This combined file can next be used to prepare the genome coverage tracks, with the window size for smoothing of breakpoints set to 25:
-```
+```
+$UNIWIG_PATH/bin/uniwig -m 25 raw/combined_files.bed chrom.sizes coverage/all
+```
-Where:
+This will create three files: `coverage/all_start.bw`, `coverage/all_core.bw`, `coverage/all_end.bw`, with the coverage of the genome by regions' starts, whole regions, and regions' ends, respectively. These files can be loaded into a genome browser for visualization.
-- `--coverage-folder`, takes the path to bigWig folder with genome coverage by collection
-- `--output-file`, takes the path to output file
 ## Coverage cutoff universe
-Or we can import it directly into Python:
+First, you will create a coverage cutoff (CC) universe. This is the simplest type of universe: it includes only genomic positions with coverage greater than or equal to a cutoff *x*. By default, this cutoff is calculated using a simple likelihood model of the probability of a position appearing in the collection. The universe can be built based only on the genome coverage:
 ```
-from geniml.universe.cc_universe import cc_universe
+geniml build-universe cc --coverage-folder coverage/ \
+                         --output-file universe_cc.bed
-cc_universe("tests/consenus/coverage/all_core.bw",
-    file_out="tests/consenus/universe/universe.bed")
-```
+
 ```
-Depending on the task we can also smooth the output universe by setting `--merge`
+Depending on the task, the universe can be smoothed by setting the `--merge`
 flag to the distance below which peaks should be merged together and
-`--filter-size` with minimum size of peak that should be part of the universe. We can also not use the maximum likelihood cut-off and instead of it use user defined cutoff. For that we have to set `--cutoff` . If we set it to 1 we get union universe, and when to number of files we will get intersection universe.
+`--filter-size` to the minimum size of a peak that should be part of the universe. Instead of using the maximum likelihood cutoff, one can also define the cutoff with the `--cutoff` flag. If it is set to 1, the result is the union universe; if it is set to the number of analyzed files, it will produce the intersection universe.
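+For example, a smoothed version of the CC universe could be built by adding these options to the same call (the flag values and the output file name below are only illustrative):
+
+```
+geniml build-universe cc --coverage-folder coverage/ \
+                         --output-file universe_cc_smoothed.bed \
+                         --merge 100 \
+                         --filter-size 20
+```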
 ## Coverage cutoff flexible universe
-Next presented universe is coverage cutoff flexible universe. We can do it through CLI:
+A more complex version of the coverage cutoff universe is the coverage cutoff flexible (CCF) universe. In contrast to its fixed version, it produces a flexible universe. It builds a confidence interval around the maximum likelihood cutoff, which results in two values: one cutoff for the region boundaries and another for the region core. Although the CCF universe is more complex, it is built using the same input as the CC universe:
 ```
- geniml build-universe ccf --coverage-folder tests/consenus/coverage/ \
- --output-file tests/consenus/universe/universe.bed
+geniml build-universe ccf --coverage-folder coverage/ \
+                          --output-file universe_ccf.bed
 ```
-Where:
-
-- `--coverage-folder`, takes the path to bigWig file with genome coverage by collection
-- `--output-file`, takes the path to output file
-
-Or we can import it directly into python:
-```
-from geniml.universe.ccf_universe import ccf_universe
-
-ccf_universe("tests/consenus/coverage/all_core.bw",
-    file_out="tests/consenus/universe/universe.bed")
-```
-
 ## Maximum likelihood universe
-Another type of universe that we can make is maximum likelihood flexible universe. To make it first we have to have a likelihood model of genome coverage by collection of files.
-
-#### Making likelihood model:
-To make a likelihood model we can use this CLI:
+In the previous examples both the CC and CCF universes used a simple likelihood model to calculate the cutoff. However, we also developed a more complex likelihood model that takes into account the positions of starts and ends of the regions in the collection. This LH model can be built based on the coverage files and the number of analyzed files:
 ```
-geniml lh build_model --model-file tests/consenus/model.tar \
- --coverage-folder tests/consenus/coverage/ \
- --file-no 4
-```
-
-Where:
-
-- `--model-file`, takes the name of tar archive that will contain the likelihood model
-- `--file-no`, number of files used in analysis
-- `--coverage-folder` path to folder with coverage tracks
-
-Or, we can do it directly in python:
-
-```
-from geniml.likelihood.build_model import main
-
-main("tests/consenus/model.tar", "tests/consesnus/coverage",
- "all",
- file_no=4)
+geniml lh build_model --model-file model.tar \
+                      --coverage-folder coverage/ \
+                      --file-no $(wc -l < file_list.txt)
 ```
-#### Making universe:
-Now that we have the model we make the universe:
+The resulting tar archive contains the LH model. This model can be used as a scoring function that assigns to each position the probability of it being a start, core, or end of a region. It can be used both for universe assessment and for universe building (an assessment example is sketched below). Combining the LH model with an optimization algorithm for building flexible universes results in the maximum likelihood (ML) universe:
 ```
-geniml build-universe ml --model-file tests/consenus/model.tar \
- --output-file tests/consenus/universe/universe.bed \
- --coverage-folder tests/consesnus/coverage
+geniml build-universe ml --model-file model.tar \
+                         --coverage-folder coverage/ \
+                         --output-file universe_ml.bed
 ```
-Where:
-
-- `--model-file`, takes the name of tar archive that contains the likelihood model
-- `--output-file`, takes the path to output file
-- `--coverage-folder` path to folder with coverage tracks
-
-Similarly, we can do it in python:
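+Since the LH model doubles as a scoring function, it can also be used to check how well an already-built universe fits the collection. A minimal sketch, using the Python helper demonstrated in the notebook version of this tutorial (the function and its argument order are taken from that notebook; scoring the ML universe built above is just one example):
+
+```
+from geniml.assess.assess import get_likelihood
+
+# Score the ML universe against the likelihood model built from the coverage tracks
+lh = get_likelihood("model.tar", "universe_ml.bed", "coverage/")
+print(f"Universe LH: {lh:.2f}")
+```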
+## HMM
+The fourth presented method of creating universes utilizes Hidden Markov Models (HMM). In this approach the parts of flexible regions are hidden states of the model, while the genome coverage by the collection provides the emissions. The resulting universe is called the Hidden Markov Model universe. It can be built based only on the genome coverage by the collection:
 ```
-from geniml.universe.ml_universe import ml_universe
+geniml build-universe hmm --coverage-folder coverage/ \
-ml_universe("tests/consesnus/model.tar",
- "/home/hee6jn/Documents/geniml/tests/consesnus/coverage",
- "all",
- "tests/consenus/universe/universe.bed")
+                          --output-file universe_hmm.bed
 ```
-## HMM
-Another approach to making flexible universes is using Hidden Markov Models.
-We can do it for example with:
+# How to assess a new universe?
-```
-geniml build-universe hmm --out-file tests/consenus/universe/universe.bed \
- --coverage-folder tests/consenus/coverage/ \
- --save-max-cove
-```
-
-Where:
+So far you have used many different methods for creating new universes. But choosing which universe represents the data best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collection: a base-level overlap score, a region boundary score, and a likelihood score. The first two metrics can be calculated separately for each file in the collection and then summarized. To calculate them you need the raw files as well as the analyzed universe. It is also necessary to choose at least one assessment metric to be calculated:
-- `--output-file`, takes the path to output file
-- `--coverage-folder`, path to folder with coverage tracks
-- `--coverage-prefix` prefix used in uniwig for making files, default is "all"
-- `--not-normlaize`, is a flag that specifies whether not to normalize tracks before running HMM
-- `--save-max-cove`, is a flag that specifies whether to save maximum coverage of each output peak
+* `--overlap` - to calculate the base-pair overlap between the universe and the regions in the file, the number of base pairs only in the universe, and the number of base pairs only in the file, which can be used to calculate the F10 score;
+* `--distance` - to calculate the median distance from regions in the raw file to the universe;
+* `--distance-universe-to-file` - to calculate the median distance from the universe to regions in the raw file;
+* `--distance-flexible` - to calculate the median distance from regions in the raw file to the universe, taking into account universe flexibility;
+* `--distance-flexible-universe-to-file` - to calculate the median distance from the universe to regions in the raw file, taking into account universe flexibility.
-Similarly, we can do it in python:
+Here we present an example that calculates all possible metrics for the HMM universe:
 ```
-from geniml.universe.hmm_universe import hmm_universe
-
-hmm_universe("tests/consenus/coverage/",
- "tests/consenus/universe/universe.bed",
- save_max_cove=True)
-```
\ No newline at end of file
+ geniml assess-universe --raw-data-folder raw/ \
+                        --file-list file_list.txt \
+                        --universe universe_hmm.bed \
+                        --folder-out . \
+                        --pref test_assess \
+                        --overlap \
+                        --distance \
+                        --distance-universe-to-file \
+                        --distance-flexible \
+                        --distance-flexible-universe-to-file
+```
+The resulting file is called test_assess_data.csv and contains columns with the raw calculated metrics for each file: *file*, *univers/file*, *file/universe*, *universe&file*, *median_dist_file_to_universe*, *median_dist_file_to_universe_flex*, *median_dist_universe_to_file*, *median_dist_universe_to_file_flex*.
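+These per-file columns can also be summarized by hand. Below is a minimal sketch in Python, assuming that *universe&file* counts base pairs shared by the universe and the file, *univers/file* counts base pairs present only in the universe, *file/universe* counts base pairs present only in the file, and that the F10 score is the F-beta score with beta = 10; these are assumptions about the column semantics, and the helpers in `geniml.assess` shown in the notebook version of this tutorial remain the authoritative way to compute the summary metrics:
+
+```
+import pandas as pd
+
+df = pd.read_csv("test_assess_data.csv")
+
+# Base-pair recall and precision of the universe with respect to each raw file
+recall = df["universe&file"] / (df["universe&file"] + df["file/universe"])
+precision = df["universe&file"] / (df["universe&file"] + df["univers/file"])
+
+# F-beta with beta = 10 (assumed definition of F10), averaged over files
+beta = 10
+f10 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
+print(f"Mean F10: {f10.mean():.2f}")
+```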
+More information about assessing the fit of a universe to a collection of files can be found in the Jupyter notebook version of this tutorial, available [here](../code/create-consensus-peaks-python.md).
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 2af8af6..6b7faac 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -97,8 +97,8 @@ nav:
       - Populate a vector store: geniml/tutorials/load-qdrant-with-cell-embeddings.md
       - Predict cell-types using KNN: geniml/tutorials/cell-type-annotation-with-knn.md
       - Evaluate embeddings: geniml/tutorials/evaluation.md
-      - Create consensus peaks: geniml/tutorials/create-consensus-peaks.md
-      - Assess universe fit: geniml/tutorials/assess-universe.md
+      - Create consensus peaks with CLI: geniml/tutorials/create-consensus-peaks.md
+      - Create consensus peaks with Python: geniml/code/create-consensus-peaks-python.md
       - Fine-tune embeddings: geniml/tutorials/fine-tune-region2vec-model.md
       - Randomize bed files: geniml/tutorials/bedshift.md
       - Create evaluation dataset with bedshift: geniml/tutorials/bedshift-evaluation-guide.md