5.1. Available Data

This section provides a summary of the data available through the Digital Research Alliance of Canada (the Alliance).

All data is stored on the Beluga cluster under the rpp-aevans-ab allocation in the following directory:

/lustre03/project/6008063/neurohub/UKB

Tabular

The tabular files contain data that can be summarized by a few entries (e.g. age, blood pressure). The available data fields are summarized in the Data Dictionary (right-click -> save -> open .html file in your browser), and the most recent version of the data is stored in Tabular/. The Unique Data Identifier (UDI) for each piece of data consists of three parts: [Datafield]-[Instance Index].[Array Index]:

Datafield refers to the type of data: 2207 is the datafield for whether the subject wears glasses or contact lenses.
Instance Index refers to the instance of data acquisition. 0-1 are the initial and repeat assessments. 2-3 are the imaging and repeat imaging visits.
Array Index refers to the index within an array. Some datafields have multiple values (e.g. 3060-0.0, 3060-0.1, 3060-0.2), and these are stored separately.
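As an illustration, a UDI can be split into its three parts with plain shell parameter expansion. This is a minimal sketch; parse_udi is a hypothetical helper name, not part of any UK Biobank or NeuroHub tooling.

```shell
# Split a UDI such as "3060-0.1" into datafield, instance index and array index.
# parse_udi is a hypothetical helper name used only for this illustration.
parse_udi() {
    udi=$1
    datafield=${udi%%-*}   # text before the first "-"
    rest=${udi#*-}         # text after the first "-"
    instance=${rest%%.*}   # text before the first "."
    array=${rest#*.}       # text after the first "."
    printf '%s %s %s\n' "$datafield" "$instance" "$array"
}

parse_udi "3060-0.1"   # prints: 3060 0 1
```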

A number of formats are available: csv, sas, stata, and r.

.csv:
Comma-separated values. General-purpose format. Each row describes a subject; each column describes a datafield.
.sas / .sd2:
Format for SAS statistical analysis package.
.stata:
Format for the Stata statistical analysis package.
.r / .tab:
Format for use with R.
.txt:
Tab-delimited values. Similar to .csv.
.bulk:
List of bulk fields per participant.
.html:
Documentation about the field data dictionary and encodings.

Detailed information on how to explore the fields and columns is available in the README document in the Tabular folder:

/lustre03/project/6008063/neurohub/UKB/Tabular/README

On Beluga, you can open the HTML file (the data dictionary) with a tool like w3m, which lets you search the metadata (including field ID and column number):

w3m current.html

To explore the tab-delimited version of this data, you can use a command-line tool such as awk. For example, to print the contents of the 2000th column:

awk -F'\t' '{ print $2000 }' current.tab | more
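Counting columns by hand is error-prone in a table this wide. The sketch below looks up a column number by its header name instead. It assumes the first line of the file is a header row and that you know the exact header spelling, which may differ between formats (check the README). The helper name find_column and the small demo file are hypothetical.

```shell
# Print the 1-based column number whose header exactly equals the given name.
# find_column is a hypothetical helper; the header spelling is an assumption.
find_column() {
    # $1 = header name to look for, $2 = tab-delimited file with a header row
    awk -F'\t' -v name="$1" 'NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == name) { print i; exit }
        exit 1   # header not found
    }' "$2"
}

# Tiny stand-in file so the sketch runs anywhere; use current.tab in practice.
demo=$(mktemp)
printf 'eid\t31-0.0\t2207-0.0\n1000001\t1\t0\n' > "$demo"

col=$(find_column "2207-0.0" "$demo")
# Print the header and values of the column that was found.
awk -F'\t' -v c="$col" '{ print $c }' "$demo"
rm -f "$demo"
```

Passing the column number through -v keeps the awk program free of shell quoting issues.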

UK Biobank Tabular Preprocessing tool

Because the UK Biobank tabular data is a massive table of over 20 thousand columns for 0.5 million participants, the Computational Brain Anatomy Laboratory at the Douglas Institute has developed a tool to manage it effectively. The tool helps you query the data by transforming the raw tabular data supplied by the UKBB into a format more suitable for analysis in Python, R, or MATLAB.

The tool is also a useful complement to the NeuroHub LORIS DQT.

More information and scripts on how to handle the vast tabular data can be found here.

Versions

The data is periodically updated. Old versions are stored in tabular/archive/, with the directories identified by the code for the basket. In the case of subject withdrawals, old versions are also purged and researchers are expected to remove withdrawn subjects from any local subsets.
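Removing withdrawn subjects from a local subset can be sketched as follows. This assumes the withdrawal list is a plain text file with one participant ID per line, and that the subset is a comma-separated file with a header row whose first column holds the participant ID, optionally double-quoted. The helper name remove_withdrawn and both file layouts are illustrative assumptions, not a NeuroHub-provided tool.

```shell
# Drop rows whose first column (assumed to be the participant ID) appears in
# a withdrawal list. remove_withdrawn is a hypothetical helper name.
remove_withdrawn() {
    # $1 = withdrawal list (one ID per line), $2 = comma-separated subset
    awk -F',' -v list="$1" '
        BEGIN { while ((getline id < list) > 0) gone[id] }   # load withdrawn IDs
        { key = $1; gsub(/"/, "", key) }                      # strip optional quotes
        FNR == 1 || !(key in gone)                            # keep header and remaining rows
    ' "$2"
}

# Example usage (file names are placeholders):
# remove_withdrawn withdrawals.txt my_subset.csv > my_subset_clean.csv
```

The sketch splits on commas only to read the first column and prints each kept line unchanged, so it does not handle quoted fields that contain line breaks.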

Latest data fields and category addition

The latest data available on Beluga have been downloaded from the UK Biobank’s cloud-based Research Analysis Platform (UKB-RAP).

You can access them in the following directory:

/lustre03/project/6008063/neurohub/UKB/Tabular/RAP

Category	Data field	instance	Description
2406	131022	NA	Date G20 first reported (parkinson's disease)
2406	131023	NA	Source of report of G20 (parkinson's disease)
100	24419	2-3	Measure of head motion in T1 structural image
1839	30900	0-2-3	Number of proteins measured
110	26500	2-3	T2-FLAIR used (in addition to T1) to run FreeSurfer
154	120090	NA	Scale of how much relief pain treatments or medications have provided in the last 24 hours
100094	22189	NA	Townsend deprivation index at recruitment
301	26206	NA	Standard PRS for alzheimer's disease (AD)
302	26207	NA	Enhanced PRS for alzheimer's disease (AD)
999	41000	NA	Case-control status for COVID19 imaging repeat
999	41001	NA	Source of positive COVID test result
157	24100	2-3	LV end diastolic volume
157	24101	2-3	LV end systolic volume
157	24102	2-3	LV stroke volume
157	24103	2-3	LV ejection fraction
157	24105	2-3	LV myocardial mass
157	24110	2-3	LA maximum volume
157	24111	2-3	LA minimum volume
157	24112	2-3	LA stroke volume
157	24113	2-3	LA ejection fraction
157	24118	2-3	Ascending aorta maximum area
157	24119	2-3	Ascending aorta minimum area
157	24120	2-3	Ascending aorta distensibility
157	24121	2-3	Descending aorta maximum area
157	24122	2-3	Descending aorta minimum area
157	24123	2-3	Descending aorta distensibility
112	24485	2-3	Total volume of peri-ventricular white matter hyperintensities
112	24486	2-3	Total volume of deep white matter hyperintensities
162	31060	NA	Left ventricular ejection fraction
162	31061	NA	Left ventricular end diastolic volume
162	31062	NA	Left ventricular end systolic volume
162	31063	NA	Left ventricular mass
162	31064	NA	Left ventricular stroke volume
162	31068	NA	Right ventricular ejection fraction
2003	41286	NA	Mother age on date of delivery

Category	Description
1020	Derived accelerometry
100079	Derived OCT (optical coherence tomography) measures
301	Standard PRS
302	Enhanced PRS
107	Diffusion brain MRI
119	Arterial spin labelling brain MRI
220	NMR metabolomics
1039	Food (and other) preferences
506	Paired associate learning
109	Susceptibility weighted brain MRI
100094	Baseline characteristics
100117	Estimated food nutrients yesterday
100118	Total weight by food group yesterday
100016	Retinal optical coherence tomography

Note for category 220:

168 fields are available in /lustre03/project/6008063/neurohub/UKB/Tabular/672635
82 are available in /lustre03/project/6008063/neurohub/UKB/Tabular/RAP

Working with CSV Files

awk is a good general-purpose tool for slicing and dicing CSV files, but XSV is simpler and faster for many tasks. To install XSV on Alliance resources:

module load rust
cargo install xsv
export PATH=~/.cargo/bin:$PATH

Imaging

Multiple modalities are available from the UK Biobank and the data can be found on Beluga in the following directory:

/lustre03/project/6008063/neurohub/UKB/Bulk

The following table summarizes the current status on Alliance:

Category	Datafield	Instance	Description	Status
106	20217	2-3	MR - Task functional brain MRI - DICOM	Available (raw)
107	20218	2-3	MR - Diffusion brain MRI - DICOM	Available (raw)
109	20219	2-3	MR - Susceptibility weighted brain images - DICOM	Available (raw)
108	20224	2-3	MR - Phoenix - DICOM	Available (raw)
111	20225	2-3	MR - Functional brain images - resting - DICOM	Available (raw)
111	20227	2-3	MR - Resting-state fMRI	Available (raw)
106	20249	2-3	MR - Task fMRI	Available (raw)
107	20250	2-3	MR - Diffusion	Available (raw)
109	20251	2-3	MR - SWI	Available (raw)
110	20252	2-3	MR - T1-weighted	Available (raw)
112	20253	2-3	MR - FLAIR	Available (raw)
119	20266	2-3	MR - Arterial spin labelling brain images - DICOM	Available (raw)
111	25750	2-3	MR - Resting functional MRI full correlation matrix, dimension 25	Available (raw)
111	25751	2-3	MR - Resting functional MRI full correlation matrix, dimension 100	Available (raw)
111	25752	2-3	MR - Resting partial correlation matrix, dimension 25	Available (raw)
111	25753	2-3	MR - Resting partial correlation matrix, dimension 100	Available (raw)
111	25754	2-3	MR - Resting component amplitudes, dimension 25	Available (raw)
111	25755	2-3	MR - Resting component amplitudes, dimension 100	Available (raw)

The latest data field addition

Category	Datafield	Instance	Description	Status
109	20219	2-3	Susceptibility weighted brain images - DICOM	Available (raw)
119	20266	2-3	Arterial spin labelling brain images - DICOM	Available (raw)

Physical measures

This category contains information from physical measurements done at the Assessment Centre. Currently, the following data field is available on Beluga:

/lustre03/project/6008063/neurohub/UKB/Bulk/20205

Category	Datafield	Instance	Description	Status
104	20205	2-3	ECG at rest	Available (raw)

Physical activity

This category provides measurements recorded via a wrist-worn accelerometer. The main data collection (for 100,000 participants) was between June 2013 and January 2016. In 2018, a subset of participants was asked to repeat the exercise up to four times each on a quarterly basis to examine the influence of seasonal effects on the measurements. These seasonal repeats are currently ongoing. The following data fields are currently available on Beluga:

/lustre03/project/6008063/neurohub/UKB/Bulk/9000x

Category	Datafield	Instance	Description	Status
727	90001	0-1-2-3-4	Acceleration data (cwa, raw format)	Available (raw)
727	90004	NA	Acceleration intensity time-series (Epoch)	Available (raw)

Genetics

Multiple types of genetics data were acquired from UKB subjects.

You can find the data in the Genotype_Results/ and Imputation/ directories in:

/lustre03/project/6008063/neurohub/UKB/Bulk

The following sections summarize the current status on Alliance:

Genotype

Category	Datafield	Description	Status
100313	22002	Genotyping process and sample QC - CEL Files	Available
100315	22418	Calls	Available
100315	22419	Genotype confidences	Available
100315	22437	Copy number variants B-allele frequencies	Available
100315	22431	Copy number variants, log2ratios	Available
100315	22430	Intensities	Available
100319	22438	Haplotypes	Available
100319	22828	Imputation from genotype (WTCHG)	Available

Category	Description	Status
263	Genotypes	Available

The above category is available in the directory:

/lustre03/project/6008063/neurohub/UKB/Bulk/488127/ukb_rel_a45551_s488127.dat

Exome

You can find the Exome data in the genetics/ directory:

/lustre03/project/6008063/neurohub/ukbb/genetics/exome

Category	Datafield	Description	Status
171	23151	Variant call files	Available
171	23152	Variant call files indices	Available
171	23153	CRAM files	Available
171	23154	CRAM indices	Available
171	23155	Population-level variants (PLINK)	Available
171	23156	Population-level variants (pVCF)	Available

Preprocessed data

NeuroHub users have access to the following types of preprocessed data:

1. Diffusion-weighted imaging

The data are processed with TractoFlow and are available on Beluga at the following path:

/lustre03/project/rpp-aevans-ab/neurohub/UKB/Derived/tractoflow_out

Documentation on the TractoFlow UKBiobank process is available for your reference.

2. CIVET output

The CIVET outputs were generated from the MINC files of roughly 43,000 UK Biobank subjects, corresponding to anatomical T1 scans.

Documentation on the CIVET processing of the UK Biobank data through CBRAIN and the LORIS DQT is available for your reference.

3. CIVET manual QC (Quality Control)

Thanks to the work of the Computational Brain Anatomy Laboratory at the Douglas Institute on manual quality control of the CIVET outputs for about 40,000 UK Biobank subjects, the CIVET QC data are now available through CBRAIN and will soon be available on the LORIS DQT. Documentation on CIVET manual QC access is available for your reference.

4. UKBiobank fMRIprep outputs

All UKBiobank fMRIprep outputs are now available in CBRAIN and on Beluga. More information about the UKBiobank fMRIprep outputs is available for your reference.

5. T1w and FLAIR scans processed by NIST-MNI minipipeline

The outputs are anatomical MRI scans from the UK Biobank imaging data, processed with in-house tools based on minc-toolkit-v2 version 1.9.17.

The files are available on Beluga at the following path:

/lustre03/project/6008063/neurohub/UKB/Derived/vfonov/out

Documentation about the NIST-MNI-minipipeline is available for your reference.

Food Preference Questionnaire and Paired Associate Learning

Category 1039

The Food Preference Questionnaire fields are available as a separate CSV table. The questionnaire includes 150 items, which comprise food items that reflect both sensory preferences (bitter, sweet etc.) and foodstuff preferences (fruit, vegetables, meat, etc.). More information can be found here.

Category 506

In addition, the Paired Associate Learning fields are available as a separate CSV table. In the test, participants are shown 12 pairs of words and then, after an interval, are presented with the first word of 10 of these pairs and asked to select the matching second word from a choice of 4 alternatives. More information can be found here.

Both categories are available at the following path on Beluga:

/lustre03/project/6008063/neurohub/UKB/Tabular/RAP

Status Codes

Status	Meaning
Available	Data is accessible to NeuroHub users
Available (raw)	The raw data is available, but derivatives may not be.
Deploying	Data will soon be available.
Fetching	Data is currently being transferred to Beluga
Processing	Data is undergoing processing (e.g. format, structure).
Not on Beluga	Data has not been downloaded.

Requests

Are there new datafields or return datasets that you'd like to be made available? Email us at support@neurohub.ca and attach a text file with the following information in a single column:

Datafields, prepended with 'F' (e.g. "F20252")
Return datasets, prepended with 'R' (e.g. "R123")
SNP IDs, prepended with 'S' (e.g. "S456")
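A request file in this format can be assembled with a few lines of shell. This is a sketch only: write_requests is a hypothetical helper, request.txt is an arbitrary file name, and the IDs are the placeholder examples from the list above rather than a real request.

```shell
# Emit one prefixed ID per line, giving a single-column request list.
# write_requests is a hypothetical helper name used only for this illustration.
write_requests() {
    # $1 = prefix (F, R or S); remaining arguments = IDs
    prefix=$1
    shift
    for id in "$@"; do
        printf '%s%s\n' "$prefix" "$id"
    done
}

# Placeholder IDs taken from the examples above.
{
    write_requests F 20252   # datafields
    write_requests R 123     # return datasets
    write_requests S 456     # SNP IDs
} > request.txt
```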

Is there bulk data that is already authorized but unavailable? Let us know at support@neurohub.ca; we prioritize data with community interest.

Questions or Issues?

For any questions or issues related to the registration process for accessing the UK Biobank data through the McGill agreement, please contact the NeuroHub team by email at support@neurohub.ca.

For any questions or issues specifically related to accessing the NeuroHub portal or using any of the NeuroHub features and functionalities, please contact the NeuroHub team by email at support@neurohub.ca.
