From e682eb25c240a2d9bec427f94675aeacca41757b Mon Sep 17 00:00:00 2001
From: Olga Botvinnik
Date: Fri, 22 May 2015 18:10:17 -0700
Subject: [PATCH] add proper H2 level titles to the notebook

---
 doc/tutorial/getting_started_brainspan.ipynb | 1648 +++++++++---------
 1 file changed, 823 insertions(+), 825 deletions(-)

diff --git a/doc/tutorial/getting_started_brainspan.ipynb b/doc/tutorial/getting_started_brainspan.ipynb
index 699ff484..6544abb9 100644
--- a/doc/tutorial/getting_started_brainspan.ipynb
+++ b/doc/tutorial/getting_started_brainspan.ipynb
@@ -1,4 +1,823 @@
 {
+ "cells": [
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    ".. _getting_started_brainspan:\n",
+    "\n",
+    ".. currentmodule:: flotilla"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "==========================================================================================\n",
+    "Correcting for batch effects in the Allen Brain Institute Developmental Transcriptome data\n",
+    "==========================================================================================\n",
+    "\n",
+    "This tutorial will go in-depth into how to download and format data for ``flotilla``, as well as how to look for and correct batch effects in your gene expression data.\n",
+    "\n",
+    "`BrainSpan `_ is an interface to data from the developing human brain. They have samples extracted from many post-mortem brains, and the dataset I'm most interested in is the developmental transcriptome, which has tissues collected from 42 donors of ages ranging from post-conception week 8 to 40 years old. This is an incredible resource for anyone interested in studying changes in gene expression within brain regions across developmental time.\n",
+    "\n",
+    "One major issue is the \"batch effects\" that skew the data because of factors unrelated to the biological phenomenon. For example, each donor's tissue could have been collected at a different time, with a different post-mortem interval, from either fresh or frozen tissue, at a different facility, sectioned by a different scientist (who has their own unique style of dissecting tissue), and prepared for RNA-sequencing libraries differently.\n",
+    "\n",
+    "For BrainSpan, there is a detailed `white paper `_ describing every step of their process, and they've controlled as many variables as they could, which is everything downstream of sectioning the tissue. Different people could have sectioned the different donors' tissues, and some tissues were frozen (which had to be thawed) and some were fresh (which had to be chilled), so there are a lot of factors that could affect the results that have nothing to do with the actual biology of the human brain.\n",
+    "\n",
+    "A commonly used software package for adjusting for batch effects is called ComBat, originally designed for microarray batches (`Johnson and Li, *Biostatistics* 2007 `_). Their software is available as part of an R package called ```sva`` `_ on `Bioconductor <http://www.bioconductor.org/>`_. I wanted to use it for this BrainSpan data, because the donor-specific signals are much stronger than the tissue-specific signals, but I knew I'd have to use ``rpy2`` to call R and didn't really want to do that. Thankfully, an `awesome human being `_ ported the contents of ``combat`` to Python to create ```combat.py`` `_!\n",
+    "\n",
+    "In this tutorial, I will use the BrainSpan data as an example of how to use ``combat.py``. 
As a bonus, this feature has been added to the large-scale RNA-seq analysis software `flotilla `_ so you can normalize your batch effects any way you like! This hasn't yet been completely built into ``flotilla`` with an interactive widget and all, because I'm still figuring out the implementation, like how you would specify the linear model to batch-effect correct against, e.g. batch correct for the age, gender, post-mortem interval and RNA integrity number all at once.\n",
+    "\n",
+    "First, let's import all the Python packages we'll need."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import StringIO\n",
+    "\n",
+    "import flotilla\n",
+    "from flotilla.external.combat import combat\n",
+    "\n",
+    "# Turn on inline plots\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Obtaining data\n",
+    "==============\n",
+    "\n",
+    "Downloading the data is fairly straightforward: it's all contained in a zip archive online. This is the *\"RNA-Seq Gencode v10 summarized to genes\"* download link from BrainSpan's `Downloads `_ page."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "%%bash \n",
+    "wget http://www.brainspan.org/api/v2/well_known_file_download/267666525"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now we'll unzip the file on the command line."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "unzip 267666525"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ooh, there's a readme! Let's take a look at what it says."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "cat readme.txt"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Okay awesome, so when we read in ``expression_matrix.csv``, we'll set the first column as the index and make sure not to use a header. We'll have to match up the genes described in ``rows_metadata.csv`` with the rows in ``expression_matrix.csv``. And we'll use ``columns_metadata.csv`` to get the information about the samples.\n",
+    "\n",
+    "Finally, we'll transpose ``expression_matrix.csv`` because we want our matrix to be oriented as (samples, features), as is standard in machine learning."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "expression = pd.read_csv('expression_matrix.csv', index_col=0, header=None)\n",
+    "print(expression.shape)\n",
+    "expression.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Great, we have the expression matrix. Let's take a look at the metadata about the samples."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata = pd.read_csv('columns_metadata.csv', index_col=0)\n",
+    "print(columns_metadata.shape)\n",
+    "columns_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "We need a column to be the sample ID, which we will create as a combination of the ``donor_name`` and the ``structure_acronym``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata['sample_id'] = columns_metadata.donor_name + '_' + columns_metadata.structure_acronym\n",
+    "print(columns_metadata['sample_id'].unique().shape)\n",
+    "columns_metadata['sample_id'].head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now we can use this newly created column as the ``index`` (rownames) of ``columns_metadata``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata = columns_metadata.set_index('sample_id')\n",
+    "columns_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Similarly, we can set the ``columns`` of ``expression`` to this ``columns_metadata.index``, since they're exactly the same length."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "expression.columns = columns_metadata.index\n",
+    "expression.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Read expression feature data\n",
+    "============================\n",
+    "\n",
+    "This package came with a ``rows_metadata`` file, which is the metadata for the rows, or the features (a.k.a. the genes), of the expression matrix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "rows_metadata = pd.read_csv('rows_metadata.csv', index_col=0)\n",
+    "rows_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Let's set the ``ensembl_gene_id`` as the index, since they are unique (unlike ``gene_symbol``) and match up with other tables that we have available in ``flotilla``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "expression_feature_data = rows_metadata.set_index('ensembl_gene_id')\n",
+    "expression_feature_data.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now let's set the expression matrix's rows as the ``ensembl_gene_id``s from ``expression_feature_data``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "expression.index = expression_feature_data.index\n",
+    "expression.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Reformat sample metadata\n",
+    "========================\n",
+    "\n",
+    "The sample metadata has the individual patient age, but not the age ranges as documented in the `white paper `_ describing the sample harvesting and RNA sequencing procedures. We'll use the age ranges provided in the document, and add a column of age ranges to the ``columns_metadata`` dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# From the technical white paper: \n",
+    "# http://help.brain-map.org/download/attachments/3506181/Transcriptome_Profiling.pdf?version=1&modificationDate=1382036562736\n",
+    "development_string = \"\"\"Stage,Age,Developmental Period\n",
+    "1,4-7 pcw,Embryonic\n",
+    "2A,8-9 pcw,Early prenatal\n",
+    "2B,10-12 pcw,Early prenatal\n",
+    "3A,13-15 pcw,Early mid-prenatal\n",
+    "3B,16-18 pcw,Early mid-prenatal\n",
+    "4,19-24 pcw,Late_mid-prenatal\n",
+    "5,25-38 pcw,Late_prenatal\n",
+    "6,Birth-5 months,Early_infancy\n",
+    "7,6-18 months,Late_infancy\n",
+    "8,19 months-5 yrs,Early_childhood\n",
+    "9,6-11 yrs,Late_childhood\n",
+    "10,12-19 yrs,Adolescence\n",
+    "11,20-60+ yrs,Adulthood\n",
+    "\"\"\"\n",
+    "\n",
+    "development = pd.read_csv(StringIO.StringIO(development_string))\n",
+    "development"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "First of all, what are the unique ages in our ``columns_metadata`` dataframe?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata.age.unique()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "There's quite the mixture of units ... ``pcw`` for \"post-conception week,\" ``mos`` for \"months\" and ``yrs`` for years. This will take a bit of string manipulation to convert individual ages to a range.\n",
+    "\n",
+    "First, let's convert age ranges in ``development`` to lists of single values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def age_range_to_single_values(x):\n",
+    "    if x == 'Birth-5 months':\n",
+    "        return ['{} mos'.format(age) for age in range(1, 6)]\n",
+    "    elif x == '19 months-5 yrs':\n",
+    "        return ['{} mos'.format(age) for age in range(19, 24)] + ['{} yrs'.format(age) for age in range(2, 6)]\n",
+    "    else:\n",
+    "        age_range, time_unit = x.split()\n",
+    "        start, stop = map(int, age_range.rstrip('+').split('-'))\n",
+    "        if time_unit == 'months':\n",
+    "            time_unit = 'mos'\n",
+    "        age_range = range(start, stop+1)\n",
+    "        if time_unit == 'mos':\n",
+    "            time_units = ['mos' if a % 12 != 0 else 'yrs' for a in age_range]\n",
+    "            age_range = [a if a % 12 != 0 else a/12 for a in age_range]\n",
+    "        else:\n",
+    "            time_units = [time_unit] * len(age_range)\n",
+    "        return ['{} {}'.format(age, unit) for age, unit in zip(age_range, time_units)]\n",
+    "\n",
+    "development.Age.map(age_range_to_single_values)"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now we can create a mapping of age ranges to single values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "range_to_single_value = dict(zip(development.Age, development.Age.map(age_range_to_single_values)))"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now we can reverse this dictionary to get the mapping of individual ages to an age range."
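+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "As a quick illustration of the forward mapping first (picking one range arbitrarily), here's what a single age range expands to:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# One concrete example: the '6-11 yrs' range expands to its single ages\n",
+    "range_to_single_value['6-11 yrs']"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now let's actually invert the dictionary."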
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "single_value_to_range = {}\n",
+    "for k, values in range_to_single_value.iteritems():\n",
+    "    for v in values:\n",
+    "        single_value_to_range[v] = k\n",
+    "single_value_to_range = pd.Series(single_value_to_range)\n",
+    "single_value_to_range.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now let's make a dataframe of the mappings of a single age value to the age range, so we can then merge it with our `development` dataframe. After resetting the index, we'll name the columns ``'age_single_value'`` and ``'age_range'``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "single_value_to_range_df = single_value_to_range.to_frame()\n",
+    "single_value_to_range_df = single_value_to_range_df.reset_index()\n",
+    "single_value_to_range_df.columns = ['age_single_value', 'age_range']\n",
+    "single_value_to_range_df.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now we can add this column to our growing ``development`` dataframe by using ``merge``, which matches up the values in the columns that we specified (like a ``VLOOKUP`` in Microsoft Excel), and merges the tables. We want to use an ``'outer'`` merge specifically, because otherwise the \"VLOOKUP\" will stop when it reaches the first single age and we'll only get one single age value per age range."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "development = development.merge(single_value_to_range_df, left_on='Age', right_on='age_range', how='outer')\n",
+    "development.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now we merge the `development` dataframe with the `columns_metadata` dataframe. We first need to `reset_index` of `columns_metadata`, because otherwise the row names that we worked so hard to create would disappear after the `merge`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata = columns_metadata.reset_index()\n",
+    "columns_metadata = columns_metadata.merge(development, left_on='age', right_on='age_single_value')\n",
+    "columns_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Let's remove the ``Age`` and ``age_single_value`` columns, since it's confusing to have three \"age-type\" columns. We'll just keep the first one, `age`, which was in the dataframe originally."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata = columns_metadata.drop(['Age', 'age_single_value'], axis=1)"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now let's set the ``sample_id`` as the index, so it matches up with the ``expression`` dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "columns_metadata = columns_metadata.set_index('sample_id')\n",
+    "columns_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Create a ``flotilla`` Study!\n",
+    "============================\n",
+    "\n",
+    "Now we are ready to create a ``flotilla`` study! 
Some notes about the arguments:\n",
+    "\n",
+    "* ``columns_metadata``: This is the only required argument for all ``flotilla`` studies: the metadata on the samples. It's positional, so it doesn't need a keyword (you don't need to specify it with ``keyword=value`` like the other arguments).\n",
+    "* ``metadata_phenotype_col``: This is the column in the metadata to use as the \"phenotype,\" which means that the samples will be colored by their value in this column. In our case, with ``\"structure_name\"``, the samples will be colored by the area of the brain from which they were harvested.\n",
+    "* ``metadata_ignore_subset_cols`` (as of ``v0.2.8``): This indicates which columns of the metadata we don't want to use for creating subsets of samples. I picked ``donor_id`` since it's redundant with ``donor_name``; ``structure_id`` and ``structure_acronym`` since they're redundant with ``structure_name``; and ``Stage`` since it's redundant with ``Developmental Period``, and we don't want to clutter our options.\n",
+    "* ``metadata_minimum_samples``: This is the minimum number of samples that a feature must be detected in for it to pass our filter. In the case of gene expression-only data, we may not actually remove anything, **unless** we set the ``expression_thresh`` keyword. If it is set, then genes that exceed the threshold in fewer than the minimum number of samples will be discarded. For example, we have ``metadata_minimum_samples=3`` and ``expression_thresh=1``, which means that only genes with expression of at least 1 in at least three samples will be kept.\n",
+    "* ``expression_data``: This is a (samples, features) expression matrix. Since our ``expression`` matrix was (features, samples), we transposed it on input.\n",
+    "* ``expression_thresh``: The minimum expression value a gene must reach, in at least ``metadata_minimum_samples`` samples, for it to be counted. This is applied to the **original, non-log-transformed, non-plus-one'd data**.\n",
+    "* ``expression_plus_one``: If True, add 1 to all expression values. This makes log-transforming the data easier since there are no zeros, and there won't be any values smaller than 1, so our logged data won't have negative values, which is nice. Plus, it makes our data a lot easier to interpret. If we let $\\log_2(0) = $ NA, then we'd have differences in values from, say, -8 to 8, which actually represent $2^{-8}$ to $2^8$. Personally, I'd like everything less than 1 to be ignored, so I add 1 to the expression value and then $\\log$-transform with abandon. (There's a quick demo of this transform right after this list.)\n",
+    "* ``expression_log_base``: This is the base of the log you'd like your data to be transformed with. For example, here we say to use ``2`` so our expression data will be $\\log_2$. We could also use ``10``, ``np.e``, ``313``, whatever we wanted.\n",
+    "* ``species``: Because we're using `ENSEMBL `_ IDs, we can use some built-in helpfulness of ``flotilla``, which will automatically download a `\"species datapackage\" `_ that I curated, from Amazon S3. This has annotations like transcription factors, RNA-binding proteins, and more. Be warned: this only works for ``\"hg19\"`` and ``\"mm10\"``."
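+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Here's the quick demo promised above: a sketch on a made-up two-sample matrix (not ``flotilla``'s internal code) of what ``expression_plus_one=True`` followed by ``expression_log_base=2`` does. Zeros map to zero, and nothing comes out negative."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# A made-up (samples, features) matrix to show the plus-one/log2 transform\n",
+    "toy = pd.DataFrame([[0.0, 0.5, 4.0], [1.0, 7.0, 0.0]],\n",
+    "                   index=['sample1', 'sample2'],\n",
+    "                   columns=['geneA', 'geneB', 'geneC'])\n",
+    "# log2(x + 1): 0 -> 0, and every result is >= 0\n",
+    "np.log2(toy + 1)"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Without further ado, let's get started!"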
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "study_uncorrected = flotilla.Study(columns_metadata,\n",
+    "                                   metadata_phenotype_col='structure_name',\n",
+    "                                   metadata_ignore_subset_cols=['donor_id', 'structure_id', 'structure_acronym', 'Stage'],\n",
+    "                                   metadata_minimum_samples=3,\n",
+    "                                   expression_data=expression.T,\n",
+    "                                   expression_thresh=1,\n",
+    "                                   expression_plus_one=True, \n",
+    "                                   expression_log_base=2,\n",
+    "                                   species='hg19')"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Let's take a look at the shapes of the original expression data (under `.expression.data_original`) versus the filtered expression data (under `.expression.data`) and see if any genes were removed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(study_uncorrected.expression.data_original.shape)\n",
+    "print(study_uncorrected.expression.data.shape)"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Wow, about 22,000 genes were removed! That means around 22,000 genes weren't appreciably expressed in our data. When we ``study_uncorrected.save()`` our ``Study`` object, we'll always save the original, uncorrected data along with all the modifications we made, so we can always get back to the originals.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "study_uncorrected.save('brainspan_uncorrected')"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Let's take a look at this ``datapackage.json``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "! cat $HOME/flotilla_packages/brainspan_uncorrected/datapackage.json"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Principal Component Analysis\n",
+    "============================\n",
+    "\n",
+    "The first thing I do when I check out a new dataset is run Principal Component Analysis (PCA) on it, which projects the samples onto the directions of greatest variance. For this data, I'd like to do analyses which explore differences between areas of the brain, so I expect similar areas of the brain to cluster together. Since we set our ``metadata_phenotype_col`` to ``\"structure_name\"``, the scatterplot will be colored by the brain region each sample came from."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "study_uncorrected.plot_pca()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "Uh-oh! This looks like confetti was thrown on the ground :(\n",
+    "\n",
+    "That means that the samples aren't clustering by their structures like we expected. Maybe the variance is coming from the age instead... what if we colored by the ``\"Stage\"`` (an ordered version of the age range) instead? Let's also turn off the vectors for now, because we just want to look at the shape of the PCA."
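+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "One detail first: the ``Stage`` values like \"2A\" and \"10\" are strings, so by default they'd sort alphabetically (\"10\" before \"2A\"). Casting the column to an ordered categorical puts the stages in their biological order. Note that the next cell only displays the result; it doesn't modify the metadata."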
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "pd.Categorical(study_uncorrected.metadata.data.Stage, categories=['2A','2B'] + map(str, range(3, 12)), ordered=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "study_uncorrected.plot_pca(color_samples_by='Stage', show_vectors=False)"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Hmm, that does seem to work better, because the colors are clustering together. But the ordering doesn't seem to make sense: the \"2A\" stage isn't clustering near the \"2B\" stage, and \"3A\" and \"3B\" are in between. Let's try coloring by the donor name instead."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "study_uncorrected.plot_pca(color_samples_by='donor_name', show_vectors=False)"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Uh oh... looks like the samples are clustering by donor. Smells like we need some batch correction if you ask me. Well, this tutorial wouldn't have been titled \"batch effects\" for nothing :)\n",
+    "\n",
+    "Batch correction using COMBAT\n",
+    "=============================\n",
+    "\n",
+    "To use ``combat`` for batch correction, we'll need to do a bit of preprocessing. First, we'll use the non-original data, i.e. the data that's already been *plus-oned*, *log-transformed*, and *min-sampled*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "data = study_uncorrected.expression.data"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Second, ``combat`` uses the variance and mean within batches to correct for the batch effects. Let's say three samples is the minimum. Since we'll be correcting by donor name, we'll need to make sure that we're only using data from donors with at least three brain structures. We'll do this using a ``groupby`` on the donor name, which will get us all samples from the same donor, and filter for at least three of these samples. The metadata's underlying data is stored as ``study_uncorrected.metadata.data``, and we can access the ``donor_name`` column using ``study_uncorrected.metadata.data.donor_name``."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "data = data.groupby(study_uncorrected.metadata.data.donor_name).filter(lambda x: len(x) >= 3)\n",
+    "data.shape"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Huh, we had 524 samples originally, so this really didn't remove that much.\n",
+    "\n",
+    "We'll need to use the transpose of the data, because ``combat`` expects a (features, samples) matrix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "data = data.T\n",
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Next, we'll use the donor name as the categorical variable for the batch. Notice that the sample names are now the columns, so we access ``data.columns`` instead of ``data.index``."
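+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check (plain pandas, nothing ``flotilla``- or ``combat``-specific), let's confirm that every donor left after the filter really does contribute at least three samples, since ComBat estimates the mean and variance within each batch:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Count how many samples (columns) each donor contributes; the minimum should be >= 3\n",
+    "donor_counts = study_uncorrected.metadata.data.donor_name[data.columns].value_counts()\n",
+    "donor_counts.min()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Now let's pull out the donor name for each remaining sample to use as the batch labels."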
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "batch = study_uncorrected.metadata.data.donor_name[data.columns]" + ] + }, + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "Finally, we're ready for the batch correction! We'll run ``combat`` with our data, the column indicating the batch, and the ``mod`` argument (model matrix) as ``False``, because we want to do a very coarse correction." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "expression_batch_corrected = combat(data, batch, False)\n", + "expression_batch_corrected.head()" + ] + }, + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "Great, now let's make a brand new, corrected study! This is very similar code to before, except we won't do any of the expression transformations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "study_corrected = flotilla.Study(columns_metadata,\n", + " metadata_phenotype_col='structure_name',\n", + " metadata_ignore_subset_cols=['donor_id', 'structure_id', 'structure_name', 'Stage'],\n", + " metadata_minimum_samples=3,\n", + " expression_data=expression_batch_corrected.T,\n", + " species='hg19')\n", + "study_corrected.plot_pca()" + ] + }, + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "Yay, now it's clustering by structure type! Quick, let's save this before we forget!!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "study_corrected.save('brainspan_batch_corrected')" + ] + } + ], "metadata": { "kernelspec": { "display_name": "Python 2", @@ -16,829 +835,8 @@ "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" - }, - "name": "" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ - { - "cells": [ - { - "cell_type": "raw", - "metadata": {}, - "source": [ - ".. _getting_started_brainspan:\n", - "\n", - ".. currentmodule:: flotilla" - ] - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "==========================================================================================\n", - "Correcting for batch effects in the Allen Brain Institute Developmental Transcriptome data\n", - "==========================================================================================\n", - "\n", - "This tutorial will go in-depth into how to download and format data for ``flotilla``, as well as how to look for and correct batch effects in your gene expression data.\n", - "\n", - "`BrainSpan `_ is an interface to data from the developing human brain. They have samples extracted from many post-mortem brains, and the dataset I'm most interested in is the developmental transcriptome, which has tissues collected from 42 donors of ages ranging from post conception week 8 to 40 years old. This is an incredible resource for anyone interested in studying changes in gene expression within brain regions across developmental time. \n", - "\n", - "One major issue is the \"batch effects\" that occur and skew the data because of factors unrelated to the biological phenomenon. 
For example, each donor was could have been collected at a different time, at different post-mortem times, from either fresh or frozen tissue, collected at different facilities, sectioned by a different scientist (who has their own unique style of dissecting tissue), and prepared tissue samples for sequencing libraries for RNA-sequencing differently. \n", - "\n", - "For BrainSpan, they have a detailed `white paper `_ describing every step of their process, and they've controlled as many variables as they could, which is everything downstream of sectioning the tissue. Different people could have sectioned the different donor's tissues, and some tissues were frozen (which had to be thawed) and some were fresh (which had to be chilled), so there's a lot of factors that could affect the results that have nothing to do with the actual biology of the human brain.\n", - "\n", - "A commonly used software package for adjusting for batch effects is called ComBat, originally designed for microarray batches (`Johnson and Li, *Biostatistics* 2007 `_). Their software is available as part of an R package called ```sva`` `_ on [BioconductoR](http://www.bioconductor.org/). I wanted to use it for this BrainSpan data, because the donor-specific signals are much stronger than the tissue-specific signals, but I knew I'd have to use ``rpy2`` to call R and didn't really want to do that. Thankfully, an `awesome human being `_ ported the contents of ``combat`` to Python to create ```combat.py`` `_!\n", - "\n", - "In this blog post, I will use the BrainSpan data as an example of how to use ``combat.py``. As a bonus, this feature has been added to the large-scale RNA-seq analysis software `flotilla `_ so you can normalize your batch effects any way you like! This hasn't yet been completely built into ``flotilla`` with an interactive widget and all, because I'm still figuring out the implementation, like how you would specify the linear model to batch effect correct against, e.g. batch correct for the the age, gender, post-mortem interval and RNA integrity number all at once.\n", - "\n", - "First, let's import all the python packages we'll need." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "import numpy as np\n", - "import pandas as pd\n", - "import StringIO\n", - "\n", - "import flotilla\n", - "from flotilla.external.combat import combat\n", - "\n", - "# Turn on inline plots\n", - "%matplotlib inline" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Obtaining data\n", - "==============\n", - "\n", - "Downloading the data is fairly straightforward, it's all contained in a zip archive online. This is the *\"RNA-Seq Gencode v10 summarized to genes\"* download link from BrainSpan's `Downloads `_ page." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%bash \n", - "wget http://www.brainspan.org/api/v2/well_known_file_download/267666525" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we'll unzip the file on the command line." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%bash\n", - "unzip 267666525" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ooh, there's a readme! Let's take a look at what it says." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "cat readme.txt" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Okay awesome, so when we read in the ``expression_matrix.csv``, we'll set the first column as the index and make sure to not use a header. We'll have to match up the genes described in the ``rows_metadata.csv`` with the rows in ``expression_matrix.csv``. And we'll use ``columns_metadata.csv`` to get the information about the samples. \n", - "\n", - "Finally, we'll transpose the ``expression_matrix.csv`` because we want our matrix to be oriented in (samples, features), as is standard in machine learning." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "expression = pd.read_csv('expression_matrix.csv', index_col=0, header=None)\n", - "print(expression.shape)\n", - "expression.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Great, we have the expression matrix. Let's take a look at the metadata about the samples." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata = pd.read_csv('columns_metadata.csv', index_col=0)\n", - "print(columns_metadata.shape)\n", - "columns_metadata.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": { - "collapsed": false - }, - "source": [ - "We need a column to be the sample id, which we will create as a combination of the ``donor_name`` and the ``structure_acronym``" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata['sample_id'] = columns_metadata.donor_name + '_' + columns_metadata.structure_acronym\n", - "print(columns_metadata['sample_id'].unique().shape)\n", - "columns_metadata['sample_id'].head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we can use this newly created column as the ``index`` (rownames) of ``columns_metadata``." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata = columns_metadata.set_index('sample_id')\n", - "columns_metadata.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Similarly, we can set the ``columns`` of ``expression`` with this ``columns_metadata.index`` since they're exactly the same length" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "expression.columns = columns_metadata.index\n", - "expression.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Read expression feature data\n", - "----------------------------\n", - "\n", - "This package came with a ``rows_metadata`` file which is the metadata for the rows or the features aka the genes of the expression matrix." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "rows_metadata = pd.read_csv('rows_metadata.csv', index_col=0)\n", - "rows_metadata.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Let's set the ``ensembl_gene_id`` as the index, since they are unique (unlike ``gene_symbol``) and match up with other tables that we have available in ``flotilla``." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "expression_feature_data = rows_metadata.set_index('ensembl_gene_id')\n", - "expression_feature_data.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now let's set the expression matrix's rows as the ``ensembl_gene_id``s from ``expression_feature_data``." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "expression.index = expression_feature_data.index\n", - "expression.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Reformat sample metadata\n", - "---------------------\n", - "\n", - "The sample metadata has the individual patient age, but not the age ranges as documented in the `white paper `_ describing the sample harvesting and RNA sequencing procedures. We'll use the age ranges provided in the document, and add a column of age ranges to the ``columns_metadata`` dataframe." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# From the technical white paper: \n", - "# http://help.brain-map.org/download/attachments/3506181/Transcriptome_Profiling.pdf?version=1&modificationDate=1382036562736\n", - "development_string = \"\"\"Stage,Age,Developmental Period\n", - "1,4-7 pcw,Embryonic\n", - "2A,8-9 pcw,Early prenatal\n", - "2B,10-12 pcw,Early prenatal\n", - "3A,13-15 pcw,Early mid-prenatal\n", - "3B,16-18 pcw,Early mid-prenatal\n", - "4,19-24 pcw,Late_mid-prenatal\n", - "5,25-38 pcw,Late_prenatal\n", - "6,Birth-5 months,Early_infancy\n", - "7,6-18 months,Late_infancy\n", - "8,19 months-5 yrs,Early_childhood\n", - "9,6-11 yrs,Late_childhood\n", - "10,12-19 yrs,Adolescence\n", - "11,20-60+ yrs,Adulthood\n", - "\"\"\"\n", - "\n", - "development = pd.read_csv(StringIO.StringIO(development_string))\n", - "development" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "First of all, what are the unique ages in our ``columns_metadata`` dataframe?" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata.age.unique()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "There's quite the mixture of units ... ``pcw`` for \"post-conception week,\" ``mos`` for \"months\" and ``yrs`` for years. This will take a bit of string manipulation to convert individual ages to a range.\n", - "\n", - "First, lets convert age ranges in ``development`` to lists of single values." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def age_range_to_single_values(x):\n", - " if x == 'Birth-5 months':\n", - " return ['{} mos'.format(age) for age in range(1, 6)]\n", - " elif x == '19 months-5 yrs':\n", - " return ['{} mos'.format(age) for age in range(19, 24)] + ['{} yrs'.format(age) for age in range(2, 6)]\n", - " else:\n", - " age_range, time_unit = x.split()\n", - " start, stop = map(int, age_range.rstrip('+').split('-'))\n", - " if time_unit == 'months':\n", - " time_unit = 'mos'\n", - " age_range = range(start, stop+1)\n", - " if time_unit == 'mos':\n", - " time_units = ['mos' if a % 12 != 0 else 'yrs' for a in age_range]\n", - " age_range = [a if a % 12 != 0 else a/12 for a in age_range]\n", - " else:\n", - " time_units = [time_unit] * len(age_range)\n", - " return ['{} {}'.format(age, unit) for age, unit in zip(age_range, time_units)]\n", - "\n", - "development.Age.map(age_range_to_single_values)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we can create a mapping of age ranges to single values." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "range_to_single_value = dict(zip(development.Age, development.Age.map(age_range_to_single_values)))" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we can reverse this dictionary to get the mapping of individual ages to an age range." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "single_value_to_range = {}\n", - "for k, values in range_to_single_value.iteritems():\n", - " for v in values:\n", - " single_value_to_range[v] = k\n", - "single_value_to_range = pd.Series(single_value_to_range)\n", - "single_value_to_range.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now lets make a dataframe of the mappings of a single age value to the age range, so we can then merge it with our `development` dataframe. We'll need to assign a name to the series first, let's use ``\"age_range\"``." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "single_value_to_range_df = single_value_to_range.to_frame()\n", - "single_value_to_range_df = single_value_to_range_df.reset_index()\n", - "single_value_to_range_df.columns = ['age_single_value', 'age_range']\n", - "single_value_to_range_df.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we can add this column to our growing ``develompent`` dataframe by using ``merge``, which matches up the values in the columns that we specifed (like a ``VLOOKUP`` in Microsoft Excel), and merges the tables. We want to use an ``'outer'`` merge specifically because otherwise the \"VLOOKUP\" will stop when it reaches the first single age and we'll only get one single age value per age range." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "development = development.merge(single_value_to_range_df, left_on='Age', right_on='age_range', how='outer')\n", - "development.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we merge the `development` dataframe with the `columns_metadata` dataframe. We first need to `reset_index` of `columns_metadata` because otherwise the row names that we worked so hard to create will disappear after the `merge`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata = columns_metadata.reset_index()\n", - "columns_metadata = columns_metadata.merge(development, left_on='age', right_on='age_single_value')\n", - "columns_metadata.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Let's remove the ``Age`` and ``age_single_value`` columns since it's confusing that there's three \"age-type\" columns. We'll just keep the first one, `age` which was in the dataframe originally." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata = columns_metadata.drop(['Age', 'age_single_value'], axis=1)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now let's set the ``sample_id`` as the index, so it matches up with the ``expression`` dataframe." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "columns_metadata = columns_metadata.set_index('sample_id')\n", - "columns_metadata.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Now we are ready to create a ``flotilla`` study! Some notes about the arguments:\n", - "\n", - "* ``columns_metadata``: This is the only required argument for all ``flotilla`` studies, which is the metadata on the samples. This one doesn't need a keyword (don't need to specify it with ``keyword=value`` like the other ones)\n", - "* ``metadata_phenotype_col``: This is which column in the metadata to use as the \"phenotype,\" which means that the samples will be colored by their value in this column. In our case, the samples will be colored by area of the brain from which they were harvested, with ``\"structure_name\"``.\n", - "* ``metadata_ignore_subset_columns`` (as of ``v0.2.8``): This indicates which columns of the metadata we don't want to use for creating subsets of samples. I picked ``donor_id`` since it's redundant with the ``donor_name``, and ``structure_id`` and ``structure_name`` are redundant with ``structure_acronym``, and ``Stage`` is redundant with ``Developmental Period``, and we don't want to clutter our options.\n", - "* ``metadata_minimum_samples``: This is the minimum number of samples that a feature must be detected in for it to pass our filter. In the case of gene expression only data, we may not actually remove anything, **unless** we set the ``expression_thresh`` keyword. If this is set, then genes which have expression less than the threshold in fewer than the minimum number of samples will be discarded. For example, we have ``metadata_minimum_samples=3`` and ``expression_thresh=1``. 
That means that only genes with expression at least 1 in at least three samples will be left in.\n", - "* ``expression_data``: This is a (samples, features) expression matrix. Since our ``expression`` matrix was (features, samples), we transposed it on the input\n", - "* ``expresion_thresh``: Minimum value of expression threshold of a gene so it will be counted, in at least ``metadata_minimum_samples``. This is on the **original, non-log-transformed, non-plus-one'd data**.\n", - "* ``expression_plus_one``: If True, add 1 to all expression values. This makes log-transforming the data easier since there are no zeros, and there won't be any values smaller than 1, so our logged data won't have negative values, which is nice. Plus it makes our data a lot easier to interpret. If we let $\\log_2(0) = $ NA, then we'd have have differences in values from say -8 to 8, which actually represent $2^{-8}$ to $2^8$. Personally, I'd like everything less than 1 to be ignored, so I add 1 to the expression value and then $\\log$-transform with abandon.\n", - "* ``expression_log_base``: This is the base of the log you'd like your data to be transformed with. For example, here we say to use ``2`` so our expression data will be $\\log_2$. We could also use ``10`, ``np.e``, ``313``, whatever we wanted.\n", - "* ``species``: Because we're using `ENSEMBL `_ ids, we can use some built-in helpfulness of ``flotilla``, which will automatically download a `\"species datapackage\" `_ that I curated, from Amazon S3. This has annotations like transcription factors, rna binding proteins, and more. Be warned: this only works for ``\"hg19\"`` and ``\"mm10\"``.\n", - "\n", - "Without further ado, let's get started!" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_uncorrected = flotilla.Study(columns_metadata,\n", - " metadata_phenotype_col='structure_name',\n", - " metadata_ignore_subset_cols=['donor_id', 'structure_id', 'structure_acronym', 'Stage'],\n", - " metadata_minimum_samples=3,\n", - " expression_data=expression.T,\n", - " expression_thresh=1,\n", - " expression_plus_one=True, \n", - " expression_log_base=2,\n", - " species='hg19')" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Let's take a look at the shapes of the original expression data (under `.expression.data_original`) versus the filtered expression data (under `.expression.data`) and see if any genes were removed." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print(study_uncorrected.expression.data_original.shape)\n", - "print(study_uncorrected.expression.data.shape)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Wow, about 22,000 genes were removed! That means around 22,000 genes weren't appreciably expressed in our data. When we ``study_uncorrected.save()`` our ``Study`` object, we'll actually always save the original, uncorrected data, along with all the modifications we made so we can always get back to the originals.\n" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_uncorrected.save('brainspan_uncorrected')" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Let's take a look at this ``datapackage.json``." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "! cat $HOME/flotilla_packages/brainspan_uncorrected/datapackage.json" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Principal Component Analysis\n", - "----------------------------\n", - "\n", - "The first thing I do when I check out a new dataset is do Principal Component Analysis (PCA) on it, which shows the variance of all samples. For this data, I'd like to do analyses which explore differences between areas of the brain, so I expect for similar areas of the brain to cluster together. Since we set our ``metadata_phenotype_col`` as ``\"structure_name\"``, so the scatterplot will be colored by where in the brain they come from." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_uncorrected.plot_pca()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": { - "collapsed": true - }, - "source": [ - "Uh-oh! This looks like confetti was thrown on the ground :(\n", - "\n", - "That means that the samples aren't clustering by their structures like we suspected. Maybe the variance is coming from the age instead... what if we colored by the ``\"Stage\"`` (an ordered version of the age range) instead? Let's also turn off the vectors for now because we just want to look at the shape of the PCA." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "pd.Categorical(study_uncorrected.metadata.data.Stage, categories=['2A','2B'] + map(str, range(3, 12)), ordered=True)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_uncorrected.plot_pca(color_samples_by='Stage', show_vectors=False)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Hmm, that does seem to work better, because the colors are clustering together. But it doesn't seem to make sense by the ordering. The \"2A\" stage isn't clustering near the \"2B\" stage, and the \"3A\" and \"3B\" are in between. Let's try clustering by the donor name instead." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_uncorrected.plot_pca(color_samples_by='donor_name', show_vectors=False)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Uh oh... looks like the samples are clustering by donor. Smells like we need some batch correction if you ask me. Well, this tutorial wouldn't have been titled \"batch effects\" for nothing :)\n", - "\n", - "Batch correction using COMBAT\n", - "-----------------------------\n", - "\n", - "To use ``combat`` for batch correction, we'll need to do a bit of preprocessing. First, we'll use the non-original data, i.e. the data that's already been *plus-oned*, *log-transformed*, and *min-sampled*. " - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "data = study_uncorrected.expression.data" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Second, combat uses the variance and mean within batches to correct for the batch effects. 
Let's say three samples is a minimum. Since we'll be correcting by donor name, we'll need to make sure that we're only using data from donors with at least three brain structures. We'll do this using a ``groupby`` on the donor name, which will get us all samples from the same donor, and filter for at least three of these samples. The metadata's underlying data is stored as ``study_uncorrected.metadata.data``, and we can access the ``donor_name`` column using ``study_uncorrected.metdata.data.donor_name``." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "data = data.groupby(study_uncorrected.metadata.data.donor_name).filter(lambda x: len(x) >= 3)\n", - "data.shape" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Huh, we had 524 samples originally, so this really didn't remove that much.\n", - "\n", - "We'll need to use the transpose of the data because ``combat`` expects a (features, samples) matrix." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "data = data.T\n", - "data.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Next, we'll use the donor name as the categorical variable for the batch. Notice that the sample names are one columns now so we access ``data.columns`` instead of ``data.index``." - ] - }, - { - "cell_type": "code", - "collapsed": true, - "input": [ - "batch = study_uncorrected.metadata.data.donor_name[data.columns]" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Finally, we're ready for the batch correction! We'll run ``combat`` with our data, the column indicating the batch, and the ``mod`` argument (model matrix) as ``False``, because we want to do a very coarse correction." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "expression_batch_corrected = combat(data, batch, False)\n", - "expression_batch_corrected.head()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Great, now let's make a brand new, corrected study! This is very similar code to before, except we won't do any of the expression transformations." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_corrected = flotilla.Study(columns_metadata,\n", - " metadata_phenotype_col='structure_name',\n", - " metadata_ignore_subset_cols=['donor_id', 'structure_id', 'structure_name', 'Stage'],\n", - " metadata_minimum_samples=3,\n", - " expression_data=expression_batch_corrected.T,\n", - " species='hg19')\n", - "study_corrected.plot_pca()" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "Yay, now it's clustering by structure type! Quick, let's save this before we forget!!" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "study_corrected.save('brainspan_batch_corrected')" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": null - } - ], - "metadata": {} } - ] -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +}