5.1. Available Data

Xuan Mai PHAM edited this page Aug 22, 2024 · 163 revisions

This section provides a summary of the data available through the Digital Research Alliance of Canada (the Alliance).

All data is stored on the Beluga cluster under the rpp-aevans-ab allocation in the following directory:

/lustre03/project/6008063/neurohub/UKB

Tabular

The tabular files contain data that can be summarized by a few entries (e.g. age, blood pressure). The available data fields are summarized in the Data Dictionary (right-click -> save -> open .html file in your browser), and the most recent version of the data is stored in Tabular/. The Unique Data Identifier (UDI) for each piece of data consists of three parts: [Datafield]-[Instance Index].[Array Index]:

  • Datafield refers to the type of data: 2207 is the datafield for whether the subject wears glasses or contact lenses.
  • Instance Index refers to the instance of data acquisition. 0-1 are the initial and repeat assessments. 2-3 are the imaging and repeat imaging visits.
  • Array Index refers to the index within an array. Some datafields have multiple values (e.g. 3060-0.0, 3060-0.1, 3060-0.2), and these are stored separately.
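
As an illustration of the UDI layout, the following shell function splits a UDI into its three parts. It is a minimal sketch: parse_udi is a hypothetical helper name, not part of any UK Biobank or NeuroHub tooling, and it assumes the UDI is written exactly as [Datafield]-[Instance Index].[Array Index].

```shell
# Hypothetical helper: split a UDI such as 3060-0.1 into its parts.
# Prints "datafield instance array" separated by spaces.
parse_udi() {
  local udi="$1"
  # Reject strings that do not look like field-instance.array
  case "$udi" in
    *-*.*) ;;
    *) echo "not a UDI: $udi" >&2; return 1 ;;
  esac
  local field="${udi%%-*}"      # text before the first "-"
  local rest="${udi#*-}"        # text after the first "-"
  local instance="${rest%%.*}"  # text before the first "."
  local array="${rest#*.}"      # text after the first "."
  printf '%s %s %s\n' "$field" "$instance" "$array"
}
```

For example, parse_udi 3060-0.1 prints 3060 0 1.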

The data and its documentation are provided in several file formats:

  • .csv:
    Comma-separated values. General-purpose format. Each row describes a subject; each column describes a datafield.
  • .sas / .sd2:
    Format for SAS statistical analysis package.
  • .stata:
    Format for the Stata statistical analysis package.
  • .r / .tab:
    Format for use with R.
  • .txt:
    Tab-delimited values. Similar to .csv.
  • .bulk:
    List of bulk fields per participant.
  • .html:
    Documentation about the field data dictionary and encodings.
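
Because each column of the .csv file corresponds to one UDI, a single datafield can span several columns (one per instance and array index). The sketch below lists the columns that belong to a given datafield. It assumes that the first line of the .csv file is a header of comma-separated, optionally double-quoted UDIs; check the header of your own file first, since this layout is an assumption. list_field_columns is a hypothetical helper name.

```shell
# Hypothetical helper: list "column number: UDI" for every column of a
# datafield in a CSV header. Usage: list_field_columns FILE DATAFIELD
list_field_columns() {
  # Take the header line, drop the quotes, put one column name per line,
  # then keep the names that start with "DATAFIELD-".
  head -n 1 "$1" | tr -d '"' | tr ',' '\n' |
    awk -v field="$2" 'index($0, field "-") == 1 { print NR ": " $0 }'
}
```

For a header containing 3060-0.0 and 3060-0.1 in the second and third columns, list_field_columns FILE 3060 prints 2: 3060-0.0 and 3: 3060-0.1 on separate lines.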

Detailed information on how to explore the fields and columns is available in the README document in the Tabular folder:

/lustre03/project/6008063/neurohub/UKB/Tabular/README

On Beluga, you can open the HTML file (the data dictionary) with a tool like w3m, which lets you search the metadata (including field ID and column number):

w3m current.html

To explore the tab-delimited version of this data, you can use a command-line tool like awk. For example, the following command prints the 2000th column so that you can inspect its contents:

awk -F'\t' '{ print $2000 }' current.tab | more
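
If you know a column's header name rather than its position, you can look the position up first. The following is a minimal sketch, assuming a tab-delimited file whose first line is a header row. col_index and print_column are hypothetical helper names, and the header name in the usage note is only a placeholder: check how the columns are actually named in your file.

```shell
# Hypothetical helper: print the 1-based index of the column whose header
# matches NAME exactly. Usage: col_index FILE NAME
col_index() {
  awk -F'\t' -v name="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++)
        if (($i "") == (name "")) { print i; exit }  # compare as strings
      exit                                           # header read, stop here
    }
  ' "$1"
}

# Hypothetical helper: print a whole column (header included) by name.
# Usage: print_column FILE NAME
print_column() {
  local idx
  idx="$(col_index "$1" "$2")"
  if [ -z "$idx" ]; then
    echo "column not found: $2" >&2
    return 1
  fi
  awk -F'\t' -v c="$idx" '{ print $c }' "$1"
}
```

A call such as print_column current.tab f.2207.0.0 | more would then page through that column, provided a column with that exact header exists.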

Because the UK Biobank tabular data form a massive table of more than 20,000 columns for about 500,000 participants, the Computational Brain Anatomy Laboratory at the Douglas Institute has developed a tool to manage this data effectively. The tool helps you query the data by transforming the raw tabular data supplied by the UKBB into a format more suitable for analysis in Python, R, or MATLAB.

This tool is a useful complement to the NeuroHub LORIS DQT.

More information and scripts on how to handle the vast tabular data can be found here.

Versions

The data is periodically updated. Old versions are stored in tabular/archive/, with the directories identified by the code for the basket. In the case of subject withdrawals, old versions are also purged and researchers are expected to remove withdrawn subjects from any local subsets.
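
One possible way to remove withdrawn subjects from a local subset is sketched below. It assumes that the withdrawn participant IDs are stored one per line in a text file, and that the local subset is tab-delimited with a header row and the participant ID in the first column. Both layouts are assumptions, and drop_withdrawn is a hypothetical helper name.

```shell
# Hypothetical helper: print TABLE without the rows of withdrawn subjects.
# Usage: drop_withdrawn WITHDRAWN_IDS TABLE > cleaned_table
drop_withdrawn() {
  awk -F'\t' '
    # First file: remember every withdrawn ID
    FILENAME == ARGV[1] { withdrawn[$1] = 1; next }
    # Second file: keep the header row and every row whose ID is not listed
    FNR == 1 || !($1 in withdrawn)
  ' "$1" "$2"
}
```

The function writes the filtered table to standard output and leaves the input files untouched, so you can compare row counts before replacing the original subset.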

Latest data field and category additions

The latest data available on Beluga have been downloaded from the UK Biobank’s cloud-based Research Analysis Platform (UKB-RAP).

You can access them in the following directory:

/lustre03/project/6008063/neurohub/UKB/Tabular/RAP

Category Datafield Instance Description
2406 131022 NA Date G20 first reported (parkinson's disease)
2406 131023 NA Source of report of G20 (parkinson's disease)
100 24419 2-3 Measure of head motion in T1 structural image
1839 30900 0-2-3 Number of proteins measured
110 26500 2-3 T2-FLAIR used (in addition to T1) to run FreeSurfer
154 120090 NA Scale of how much relief pain treatments or medications have provided in the last 24 hours
100094 22189 NA Townsend deprivation index at recruitment
301 26206 NA Standard PRS for alzheimer's disease (AD)
302 26207 NA Enhanced PRS for alzheimer's disease (AD)
999 41000 NA Case-control status for COVID19 imaging repeat
999 41001 NA Source of positive COVID test result
157 24100 2-3 LV end diastolic volume
157 24101 2-3 LV end systolic volume
157 24102 2-3 LV stroke volume
157 24103 2-3 LV ejection fraction
157 24105 2-3 LV myocardial mass
157 24110 2-3 LA maximum volume
157 24111 2-3 LA minimum volume
157 24112 2-3 LA stroke volume
157 24113 2-3 LA ejection fraction
157 24118 2-3 Ascending aorta maximum area
157 24119 2-3 Ascending aorta minimum area
157 24120 2-3 Ascending aorta distensibility
157 24121 2-3 Descending aorta maximum area
157 24122 2-3 Descending aorta minimum area
157 24123 2-3 Descending aorta distensibility
112 24485 2-3 Total volume of peri-ventricular white matter hyperintensities
112 24486 2-3 Total volume of deep white matter hyperintensities
162 31060 NA Left ventricular ejection fraction
162 31061 NA Left ventricular end diastolic volume
162 31062 NA Left ventricular end systolic volume
162 31063 NA Left ventricular mass
162 31064 NA Left ventricular stroke volume
162 31068 NA Right ventricular ejection fraction
2003 41286 NA Mother age on date of delivery

Category Description
1020 Derived accelerometry
100079 Derived OCT (optical coherence tomography) measures
301 Standard PRS
302 Enhanced PRS
107 Diffusion brain MRI
119 Arterial spin labelling brain MRI
220 NMR metabolomics
1039 Food (and other) preferences
506 Paired associate learning
109 Susceptibility weighted brain MRI
100094 Baseline characteristics
100117 Estimated food nutrients yesterday
100118 Total weight by food group yesterday
100016 Retinal optical coherence tomography

Note for category 220:
  • 168 fields are available in /lustre03/project/6008063/neurohub/UKB/Tabular/672635
  • 82 are available in /lustre03/project/6008063/neurohub/UKB/Tabular/RAP

Working with CSV Files

awk is a good general-purpose tool for slicing and dicing CSV files, but xsv is simpler and faster. To install xsv on Alliance resources:

module load rust
cargo install xsv
export PATH=~/.cargo/bin:$PATH

Imaging

Multiple modalities are available from the UK Biobank and the data can be found on Beluga in the following directory:

/lustre03/project/6008063/neurohub/UKB/Bulk

The following table summarizes the current status on Alliance:

Category Datafield Instance Description Status
106 20217 2-3 MR - Task functional brain MRI - DICOM Available (raw)
107 20218 2-3 MR - Diffusion brain MRI - DICOM Available (raw)
109 20219 2-3 MR - Susceptibility weighted brain images - DICOM Available (raw)
108 20224 2-3 MR - Phoenix - DICOM Available (raw)
111 20225 2-3 MR - Functional brain images - resting - DICOM Available (raw)
111 20227 2-3 MR - Resting-state fMRI Available (raw)
106 20249 2-3 MR - Task fMRI Available (raw)
107 20250 2-3 MR - Diffusion Available (raw)
109 20251 2-3 MR - SWI Available (raw)
110 20252 2-3 MR - T1-weighted Available (raw)
112 20253 2-3 MR - FLAIR Available (raw)
119 20266 2-3 MR - Arterial spin labelling brain images - DICOM Available (raw)
111 25750 2-3 MR - Resting functional MRI full correlation matrix, dimension 25 Available (raw)
111 25751 2-3 MR - Resting functional MRI full correlation matrix, dimension 100 Available (raw)
111 25752 2-3 MR - Resting partial correlation matrix, dimension 25 Available (raw)
111 25753 2-3 MR - Resting partial correlation matrix, dimension 100 Available (raw)
111 25754 2-3 MR - Resting component amplitudes, dimension 25 Available (raw)
111 25755 2-3 MR - Resting component amplitudes, dimension 100 Available (raw)

Latest data field additions

Category Datafield Instance Description Status
109 20219 2-3 Susceptibility weighted brain images - DICOM Available (raw)
119 20266 2-3 Arterial spin labelling brain images - DICOM Available (raw)

Physical measures

This category contains information from physical measurements done at the Assessment Centre. Currently, the following data field is available on Beluga:

/lustre03/project/6008063/neurohub/UKB/Bulk/20205

Category Datafield Instance Description Status
104 20205 2-3 ECG at rest Available (raw)

Physical activity

This category provides measurements recorded via a wrist-worn accelerometer. The main data collection (for 100,000 participants) was between June 2013 and January 2016. In 2018, a subset of participants was asked to repeat the exercise up to four times each on a quarterly basis to examine the influence of seasonal effects on the measurements. These seasonal repeats are currently ongoing. The following data fields are currently available on Beluga:

/lustre03/project/6008063/neurohub/UKB/Bulk/9000x

Category Datafield Instance Description Status
727 90001 0-1-2-3-4 Acceleration data (cwa, raw format) Available (raw)
727 90004 NA Acceleration intensity time-series (Epoch) Available (raw)

Genetics

Multiple types of genetics data were acquired from UKB subjects.

You can find the data in the Genotype_Results/ and Imputation/ directories under:

/lustre03/project/6008063/neurohub/UKB/Bulk

The following sections summarize the current status on Alliance:

Genotype

Category Datafield Description Status
100313 22002 Genotyping process and sample QC - CEL Files Available
100315 22418 Calls Available
100315 22419 Genotype confidences Available
100315 22437 Copy number variants B-allele frequencies Available
100315 22431 Copy number variants, log2ratios Available
100315 22430 Intensities Available
100319 22438 Haplotypes Available
100319 22828 Imputation from genotype (WTCHG) Available

Category Description Status
263 Genotypes Available

The above category is available in the following file:

/lustre03/project/6008063/neurohub/UKB/Bulk/488127/ukb_rel_a45551_s488127.dat

Exome

You can find the Exome data in the genetics/ directory:

/lustre03/project/6008063/neurohub/ukbb/genetics/exome

Category Datafield Description Status
171 23151 Variant call files Available
171 23152 Variant call files indices Available
171 23153 CRAM files Available
171 23154 CRAM indices Available
171 23155 Population-level variants (PLINK) Available
171 23156 Population-level variants (pVCF) Available

Preprocessed data

NeuroHub users have access to the following types of preprocessed data:

1. Diffusion-weighted imaging

The data are processed with TractoFlow and are available on Beluga at the following path:

/lustre03/project/rpp-aevans-ab/neurohub/UKB/Derived/tractoflow_out

Documentation on the TractoFlow UKBiobank process is available for your reference.

2. CIVET output

The CIVET outputs were generated from the MINC files of the anatomical T1 scans of roughly 43,000 UK Biobank subjects.

Documentation on the CIVET output process for the UK Biobank preprocessing, through CBRAIN and the LORIS DQT, is available for your reference.

3. CIVET manual QC (Quality Control)

Thanks to the manual quality control (QC) of CIVET outputs carried out by the Computational Brain Anatomy Laboratory at the Douglas Institute for 40,000 UK Biobank participants, the CIVET QC data are now available through CBRAIN and will soon be available on the LORIS DQT. Documentation on CIVET manual QC access is available for your reference.

4. UKBiobank fMRIprep outputs

All UKBiobank fMRIprep outputs are now available in CBRAIN and on Beluga. More information about the UKBiobank fMRIprep outputs is available for your reference.

5. T1w and FLAIR scans processed by NIST-MNI minipipeline

The outputs are anatomical MRI scans from the UK Biobank imaging data, processed with in-house tools based on minc-toolkit-v2 version 1.9.17.

The files are available on Beluga at the following path:

/lustre03/project/6008063/neurohub/UKB/Derived/vfonov/out

Documentation about the NIST-MNI-minipipeline is available for your reference.

Food Preference Questionnaire and Paired Associate Learning

The Food Preference Questionnaire fields are available as a separate CSV table. The questionnaire includes 150 items, which cover both sensory preferences (bitter, sweet, etc.) and foodstuff preferences (fruit, vegetables, meat, etc.). More information can be found here.

The Paired Associate Learning fields are also available as a separate CSV table. In this test, participants are shown 12 pairs of words and then, after an interval, are presented with the first word of 10 of these pairs and asked to select the matching second word from a choice of 4 alternatives. More information can be found here.

Both categories are available at the following path on Beluga:

/lustre03/project/6008063/neurohub/UKB/Tabular/RAP

Status Codes

Status Meaning
Available Data is accessible to NeuroHub users
Available (raw) The raw data is available, but derivatives may not be.
Deploying Data will soon be available.
Fetching Data is currently being transferred to Beluga
Processing Data is undergoing processing (e.g. format, structure).
Not on Beluga Data has not been downloaded.

Requests

Are there new datafields or return datasets that you would like us to make available? Email support@neurohub.ca and attach a text file that lists the following in a single column:

  • Datafields, prepended with 'F' (e.g. "F20252")
  • Return datasets, prepended with 'R' (e.g. "R123")
  • SNP IDs, prepended with 'S' (e.g. "S456")
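
A single-column request file in this format can be assembled on the command line. The sketch below is only one way to do it: prefix_ids is a hypothetical helper name, request.txt is an arbitrary file name, and the IDs are the illustrative ones from the list above.

```shell
# Hypothetical helper: print each ID on its own line with a one-letter
# prefix in front. Usage: prefix_ids PREFIX ID [ID ...]
prefix_ids() {
  local prefix="$1"
  shift
  local id
  for id in "$@"; do
    printf '%s%s\n' "$prefix" "$id"
  done
}
```

For example, running { prefix_ids F 20252 20253; prefix_ids R 123; prefix_ids S 456; } > request.txt writes the lines F20252, F20253, R123, and S456 to request.txt.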

Is there bulk data that is already authorized but unavailable? Let us know at support@neurohub.ca; we prioritize data with community interest.