Load real UK Biobank data
- Table of Contents
- Duplicated data-fields
- Unicode decoding errors
- Data-fields codings
- Loading other types of data
- Load withdrawals
If you are loading several CSV files (from different datasets, for instance, data refreshes or new data requests) and they contain duplicated data-fields (for example, data-field 50 is present in both dataset 1 and dataset 2), ukbREST will load the one from the latest dataset. To determine which dataset is the latest, it uses the number in each CSV file name (the dataset ID). So if you have three files, ukb00.csv, ukb01.csv and ukb50.csv, they will be loaded in this order: ukb50.csv, ukb01.csv and ukb00.csv.
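The ordering rule above can be sketched in a few lines of Python. This is only an illustration, not ukbREST's actual code: the function names `dataset_id` and `loading_order` are hypothetical, and the sketch assumes the dataset ID is the last run of digits in the file name.

```python
import re
from pathlib import Path


def dataset_id(csv_path):
    """Return the dataset ID, assumed here to be the last run of digits in the file stem."""
    digits = re.findall(r"\d+", Path(csv_path).stem)
    if not digits:
        raise ValueError(f"No dataset ID found in file name: {csv_path}")
    return int(digits[-1])


def loading_order(csv_paths):
    """Sort CSV files so that the latest dataset (highest ID) comes first."""
    return sorted(csv_paths, key=dataset_id, reverse=True)


print(loading_order(["ukb00.csv", "ukb01.csv", "ukb50.csv"]))
# ['ukb50.csv', 'ukb01.csv', 'ukb00.csv']
```

Running a check like this on your phenotype folder before loading can help you predict which file a duplicated data-field will come from.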
If duplicated data-fields are found in the loading stage, you will see messages like these:
...
2018-11-25 06:34:49,054 - ukbrest - WARNING - Column c25756_2_0 already loaded from /var/lib/phenotype/ukb24989.csv. Skipping.
2018-11-25 06:34:49,055 - ukbrest - WARNING - Column c25757_2_0 already loaded from /var/lib/phenotype/ukb24989.csv. Skipping.
...
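If you save the loading log to a file, a small script can list which columns were skipped and where they were already loaded from. The sketch below is a hypothetical helper (not part of ukbREST) and assumes the warning text has exactly the format shown above.

```python
import re

# Pattern based on the warning lines shown above (assumed to be stable).
SKIPPED_COLUMN = re.compile(r"Column (\S+) already loaded from (\S+)\. Skipping\.")


def skipped_columns(log_text):
    """Return (column, source_file) pairs for every 'already loaded' warning in the log."""
    return SKIPPED_COLUMN.findall(log_text)


log = (
    "2018-11-25 06:34:49,054 - ukbrest - WARNING - "
    "Column c25756_2_0 already loaded from /var/lib/phenotype/ukb24989.csv. Skipping."
)
print(skipped_columns(log))
# [('c25756_2_0', '/var/lib/phenotype/ukb24989.csv')]
```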
When loading real UK Biobank data, you could find this error:
2018-08-01 23:53:52,219 - ukbrest - INFO - Working on /var/lib/phenotype/example15_00.csv
[...]
2018-08-01 23:53:52,378 - ukbrest - WARNING - No encodings.txt found, assuming utf-8
2018-08-01 23:53:52,530 - ukbrest - ERROR - Unicode decoding error when reading CSV file. Activate debug to show more details.
This means the CSV file uses a different character encoding (ukbREST uses utf-8 by default). To fix it, specify the correct encoding for that file in a text file named encodings.txt in your phenotype folder (where you have your CSV/HTML files). For the example message above (where the file being loaded is example15_00.csv), the content of your encodings.txt file should be:
example15_00.csv latin1
The encodings.txt file has one line per CSV file. If you run into this issue, you can try different encodings like latin1 or cp1252 (the Python documentation lists all supported encodings) or use a tool such as uchardet to detect it. You only need to specify an encoding for the files that trigger this error; utf-8 is used for the rest.
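As a rough way to try several encodings, the following sketch returns the first candidate that decodes some bytes without errors and formats the matching encodings.txt line. The helper names are hypothetical and this is not how ukbREST itself detects encodings. Note that latin1 accepts any byte sequence, so it is tried last, and a successful decode does not guarantee that the text is interpreted correctly.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate encoding that decodes the bytes without errors."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


def encodings_line(csv_name, encoding):
    """Format one line of encodings.txt: file name, a space, then the encoding."""
    return f"{csv_name} {encoding}"


sample = b"caf\xe9"  # 0xE9 on its own is not valid utf-8
print(encodings_line("example15_00.csv", first_working_encoding(sample)))
# example15_00.csv cp1252
```

For a real file you would pass its contents, for example `pathlib.Path("example15_00.csv").read_bytes()`; for very large CSV files, reading everything into memory may not be practical.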
ukbREST allows you to load data-field codings. By default, when using the --load-codings parameter of the Docker image, ukbREST will load several data codings that are publicly available from the UK Biobank Data Showcase. However, your application may include data-fields whose coding is not loaded by default. To load the exact list of codings for your application data, follow this procedure.
Once the loading process finishes, you can get all the data-field codings in your data by connecting to the PostgreSQL database and exporting a list of codings:
\copy (select distinct coding from fields where coding is not null) to /tmp/all_codings.txt (format csv)
The file /tmp/all_codings.txt is just a list of coding numbers, one per line, that you can use to download all coding files with the download_codings.sh script (which you can get from this repository):
$ mkdir /tmp/codings && cd /tmp/codings
$ [UKBREST_CODE]/utils/scripts/download_codings.sh /tmp/all_codings.txt
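After the download, it can be worth checking that every coding in the list actually has a file. The sketch below assumes the downloaded files are named coding_<number>.tsv (for example, coding_100329.tsv); `missing_codings` is a hypothetical helper, not part of ukbREST.

```python
def missing_codings(codings_list_text, downloaded_file_names):
    """Return coding numbers from the list that have no coding_<number>.tsv file."""
    wanted = [line.strip() for line in codings_list_text.splitlines() if line.strip()]
    present = set(downloaded_file_names)
    return [coding for coding in wanted if f"coding_{coding}.tsv" not in present]


# Example with in-memory data; for real use, read /tmp/all_codings.txt
# and pass os.listdir("/tmp/codings") as the second argument.
print(missing_codings("19\n100329\n", ["coding_19.tsv"]))
# ['100329']
```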
Once you have downloaded all coding files (with names like coding_100329.tsv for coding 100329), place them in a folder, for example /tmp/codings, and run this command:
$ docker run --rm --net ukb \
-v /tmp/codings:/var/lib/codings \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
hakyimlab/ukbrest --load-codings
You'll see an output like this one:
2018-07-09 19:19:50,353 - ukbrest - INFO - Loading codings from /var/lib/codings
2018-07-09 19:19:51,121 - ukbrest - INFO - Processing coding file: coding_489.tsv
2018-07-09 19:19:51,190 - ukbrest - INFO - Processing coding file: coding_238.tsv
[...]
Once finished, your database will contain a table called codings that lets you link your data with, for instance, ICD10 codes (through data-coding 19 in this case).
You can load other types of sample data, like Sample-QC and relatedness (see this page for more information). For example, to load Sample-QC and relatedness data, create a subfolder named samples_data in your phenotype directory and copy the Sample-QC file (ukb_sqc_vZ.txt) into it with the new file name samplesqc.txt (note that this file does not have a sample ID column, so you must add this column using the .fam file from your application; read more about that here). Also copy the relatedness file (ukbA_rel_sP.txt) with the name relatedness.txt. Although the names samplesqc.txt and relatedness.txt are not mandatory, the files must have the .txt extension so that ukbREST can find and load them. Finally, run this command:
$ docker run --rm --net ukb \
-v /full/path/to/phenotype/folder/:/var/lib/phenotype \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
hakyimlab/ukbrest --load-samples-data --identifier-columns relatedness.txt:ID1,ID2
2018-08-06 22:43:00,179 - ukbrest - INFO - Loading samples data from file: samplesqc.txt
2018-08-06 22:48:28,681 - ukbrest - INFO - Adding primary key
2018-08-06 22:48:29,147 - ukbrest - INFO - Adding columns to 'fields' table
2018-08-06 22:48:29,180 - ukbrest - INFO - Loading samples data from file: relatedness.txt
2018-08-06 22:48:52,616 - ukbrest - INFO - Adding primary key
2018-08-06 22:48:52,682 - ukbrest - INFO - Adding columns to 'fields' table
A new table is created for each file, which you can later use in your queries.
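Regarding the sample ID column that has to be added to the Sample-QC file, one possible approach is sketched below. It rests on assumptions you should verify against the UK Biobank documentation for your release: that the rows of the Sample-QC file are in the same order as the rows of the .fam file, and that both files are whitespace-separated. The function name `add_sample_ids` is hypothetical, and header handling is deliberately left out.

```python
def add_sample_ids(fam_lines, sqc_lines):
    """Prefix each Sample-QC row with the sample ID from the matching .fam row."""
    if len(fam_lines) != len(sqc_lines):
        raise ValueError("The .fam and Sample-QC files must have the same number of rows")
    merged = []
    for fam_line, sqc_line in zip(fam_lines, sqc_lines):
        sample_id = fam_line.split()[1]  # second .fam column: individual ID
        merged.append(f"{sample_id} {sqc_line.strip()}")
    return merged


fam = ["1001 1001 0 0 1 -9", "1002 1002 0 0 2 -9"]
sqc = ["A 1", "B 2"]
print(add_sample_ids(fam, sqc))
# ['1001 A 1', '1002 B 2']
```

The row-count check is there because a silent misalignment between the two files would attach quality-control values to the wrong participants.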
With this method you can load other kinds of sample data. Just put the files in the samples_data folder with the .txt extension and then run the command above. You can specify the ID columns with --identifier-columns (the format is file1.txt:column1 file2.txt:column2), skip some columns with --skip-columns (the format is file1.txt:column1 file2.txt:column2,column3), and specify file separators with --separators (file1.txt:, file2.txt:;).
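If you generate these commands from a script, the per-file option format described above (file name, a colon, then comma-separated columns) can be built from a mapping. This is a small illustrative sketch; `file_options` is a hypothetical helper and only reproduces the argument format shown in this page.

```python
def file_options(columns_by_file):
    """Build 'file.txt:col1,col2' tokens from a {file name: [columns]} mapping."""
    return [f"{name}:{','.join(columns)}" for name, columns in columns_by_file.items()]


identifier_args = file_options({"relatedness.txt": ["ID1", "ID2"]})
command_tail = ["--load-samples-data", "--identifier-columns", *identifier_args]
print(" ".join(command_tail))
# --load-samples-data --identifier-columns relatedness.txt:ID1,ID2
```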
You can also load a list of IDs of participants who have withdrawn consent to continue participating in the study.
You get this list from the UK Biobank as a CSV file (in fact, these files contain just one ID per line, with no header); place all these files in a folder, for example, ~/withdrawals, and run this command:
$ docker run --rm --net ukb \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
-v ~/withdrawals:/var/lib/withdrawals \
hakyimlab/ukbrest --load-withdrawals
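Before loading, you may want to confirm that a withdrawal file really has the expected shape (one ID per line, no header). The sketch below is a hypothetical check, not part of ukbREST, and assumes participant IDs are plain positive integers.

```python
def parse_withdrawals(text):
    """Return the participant IDs in a withdrawal file (one integer ID per line, no header)."""
    ids = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        value = line.strip()
        if not value:
            continue  # tolerate blank lines
        if not value.isdecimal():
            raise ValueError(f"Line {line_number} is not a participant ID: {value!r}")
        ids.append(int(value))
    return ids


print(parse_withdrawals("1000010\n1000025\n"))
# [1000010, 1000025]
```

A header line or any other non-numeric content raises a ValueError that names the offending line.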
ukbREST currently supports primary care records as well as certain hospital inpatient record datasets. The supported datasets are:
- gp_clinical: clinical records of primary care events
- gp_registrations: health system registration dates
- gp_scripts: prescriptions resulting from primary care visits
- hesin: main UK Biobank hospital inpatient record table detailing hospital episode-level data
- hesin_diag: ICD codes for diagnoses delivered during inpatient care
These records can be downloaded from the UK Biobank Data Showcase as tab-separated text files. Suppose the primary care tables are located in ~/primary_care/ and the hospital inpatient records are in ~/hospital_inpatient/. Then this command will load the files:
$ docker run --rm --net ukb \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
-v ~/hospital_inpatient:/var/lib/hospital_inpatient \
-v ~/primary_care:/var/lib/primary_care