Peax

A Visual Pattern Explorer For Epigenomic Data Using Unsupervised Deep Representation Learning

Peax's UI

Epigenomic data expresses a rich body of diverse patterns that help to identify regulatory elements like promoters, enhancers, etc. But finding these patterns reliably genome-wide is challenging. Peax is a tool for interactive visual pattern search and exploration of epigenomic patterns based on unsupervised representation learning with convolutional autoencoders. The visual search is driven by manually labeled genomic regions, which are used to actively learn a classifier that reflects your notion of interestingness.

Citation: Lekschas et al., Peax: Interactive Visual Pattern Search in Sequential Data Using Unsupervised Deep Representation Learning, Computer Graphics Forum, 2020, doi: 10.1111/cgf.13971.

More Details: peax.lekschas.de

Installation

Requirements:

Install:

git clone https://github.com/Novartis/peax && cd peax
make install

Do not fear, make install is just a convenience target that sets up the conda environment and installs the npm packages.

Notes:

  • If you're a macOS user you might need to brew install libpng and openssl for the pybbi package (see here) and xz for pysam (if you see an error related to lzma.h).

Overview

Peax consists of three main parts:

  1. A server application for serving genomic and autoencoded data on the web. [/server]
  2. A user interface for exploring, visualizing, and interactively labeling genomic regions. [/ui]
  3. A set of examples showing how to configure Peax and build your own. [/examples]

Data

We provide 6 autoencoders trained on 3 kb, 12 kb, and 120 kb window sizes (with 25, 100, and 1000 bp binning) on DNase-seq and histone mark ChIP-seq data (H3K4me1, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H3K9me3, and H3K36me).
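
As a quick sanity check on these numbers, dividing each window size by its bin size gives the number of bins per window. The sketch below is plain Python with illustrative names (`WINDOW_CONFIGS` and `bins_per_window` are not part of Peax); it only restates the arithmetic and shows that all three setups yield the same number of bins.

```python
# Window size and bin size (both in base pairs) as listed above.
# The dictionary and function names are illustrative, not Peax identifiers.
WINDOW_CONFIGS = {
    "3 kb": (3_000, 25),
    "12 kb": (12_000, 100),
    "120 kb": (120_000, 1_000),
}


def bins_per_window(window_size: int, resolution: int) -> int:
    """Return the number of bins a window of `window_size` bp is split into."""
    if window_size % resolution != 0:
        raise ValueError("window_size must be a multiple of resolution")
    return window_size // resolution


for label, (window_size, resolution) in WINDOW_CONFIGS.items():
    # Each of the three configurations results in 120 bins per window.
    print(label, bins_per_window(window_size, resolution))
```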

You can find detailed descriptions of the autoencoders at zenodo.org/record/2609763. When you follow the Quick Start instructions, you will automatically download the related autoencoders.

Quick start

Peax comes with 6 autoencoders for DNase-seq and histone mark ChIP-seq data and several example configurations for which we provide convenience scripts to get you started as quickly as possible.

For instance, run one of the following commands to start Peax with a DNase-seq track for 3 kb, 12 kb, and 120 kb genomic windows.

| Command | Window Size | Step Freq. | Chromosomes |
| --- | --- | --- | --- |
| make example-3kb | 3 kb | 2 | 21 |
| make example-12kb | 12 kb | 3 | 20-21 |
| make example-120kb | 120 kb | 6 | 17-21 |

Note: The first time Peax is started it will precompute the datasets for exploration. This can take a few minutes depending on your hardware. Also, these demos will only prepare the chromosomes listed above, so don't try to search for patterns on another chromosome. It won't work! For your own data you can of course configure this freely.

The scripts will download test ENCODE tracks and use the matching configuration to start the server. More examples are described in /examples.

Get Started

In the following we describe how you can configure Peax for your own data.

Configure Peax with your data

To use your own data, you need to tell Peax which tracks you want to visualize in HiGlass and which of those tracks are encodable using an (auto)encoder.

The fastest way to get started is to copy the example config:

cp config.json.sample config.json

The config file has 10 top-level properties:

| Field | Description | Dtype |
| --- | --- | --- |
| encoders | List of encoders. | list |
| datasets | List of tracks. | list |
| coords | Genome coordinates. Peax currently supports hg19, hg38, mm9, and mm10. | str |
| chroms | Chromosomes to be searched. If omitted, all chromosomes will be prepared for searching. | list |
| step_freq | Step frequency of the sliding window approach. E.g., given an encoder with window size 12 kb, a step frequency of 6 means that every 2 kb a 12 kb window will be extracted from the bigWig. | int |
| db_path | Relative path to the sqlite db for storing searches. | str |
| normalize_tracks | If true, the y-scale of tracks within a window will be normalized to the minimum and maximum value. This is useful for exploring differential signal. | bool |
| variable_target | If true, the window with the highest prediction probability will be shown in the query view. | bool |
| classifier | The class name of a scikit-learn classifier. | str |
| classifier_params | A dictionary of parameters to customize the classifier. | obj |
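
To make the step_freq arithmetic concrete, here is a minimal sketch of how window start positions could be derived from a window size and a step frequency. It is not Peax's implementation: the function name `window_starts`, the requirement that step_freq divides the window size evenly, and the handling of the region end are all assumptions made for illustration.

```python
def window_starts(region_start, region_end, window_size, step_freq):
    """List the start positions of sliding windows inside a region.

    Hypothetical helper: the step size is window_size / step_freq, and only
    windows that fit completely into [region_start, region_end) are kept.
    """
    if window_size % step_freq != 0:
        # Assumption for this sketch: step_freq divides window_size evenly.
        raise ValueError("window_size must be a multiple of step_freq")
    step_size = window_size // step_freq
    last_start = region_end - window_size
    return list(range(region_start, last_start + 1, step_size))


# A 12 kb window with a step frequency of 6 moves forward by 2 kb each step.
starts = window_starts(0, 24_000, window_size=12_000, step_freq=6)
print(starts)
```

For a 24 kb region this gives seven overlapping 12 kb windows, starting at 0, 2000, 4000, and so on up to 12000.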

The main parts to adjust are encoders and datasets. encoders is a list of (auto)encoder definitions for different datatypes. There are two ways to configure an (auto)encoder: (a) point to a pre-defined autoencoder or (b) configure from scratch.

Assuming you want to use predefined encoders, all you have to do is to specify the path to the encoder configuration.

Example:

{
  "encoders": "examples/encoders.json"
}

The encoder configuration file is a dictionary with the top level keys acting as the identifier. Given the example from above the file could look like this:

{
  "histone-mark-chip-seq-3kb": {},
  "dnase-seq-3kb": {}
}

See [encoders.json](encoders.json) for an example. The specific definition of an autoencoder is the same as described in the following.
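
The two ways of specifying encoders (a path to an encoder configuration file, or inline definitions) could be resolved along the following lines. This is only a sketch under stated assumptions, not how the Peax server actually loads its config: `resolve_encoders` is a hypothetical helper, and storing the top-level key under an `id` field is an illustrative choice.

```python
import json
import tempfile
from pathlib import Path


def resolve_encoders(config, base_dir):
    """Return a list of encoder definitions from a config dictionary.

    Hypothetical helper: if "encoders" is a string, it is treated as a path
    (relative to base_dir) to a JSON file whose top-level keys are identifiers.
    """
    encoders = config.get("encoders", [])
    if isinstance(encoders, str):
        with open(Path(base_dir) / encoders) as f:
            by_id = json.load(f)
        # Keep the identifier next to each definition ("id" is an assumption).
        return [{"id": key, **definition} for key, definition in by_id.items()]
    return list(encoders)


# Demo with a temporary copy of the two-entry example file from above.
with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp)
    (base_dir / "examples").mkdir()
    (base_dir / "examples" / "encoders.json").write_text(
        json.dumps({"histone-mark-chip-seq-3kb": {}, "dnase-seq-3kb": {}})
    )
    encoders = resolve_encoders({"encoders": "examples/encoders.json"}, base_dir)

print([encoder["id"] for encoder in encoders])
```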

To configure an autoencoder from scratch you need to provide a dictionary with the following required format:

| Field | Description | Defaults | Dtype |
| --- | --- | --- | --- |
| autoencoder | Relative path to your saved autoencoder model (HDF5 file). | | str |
| encoder | Relative path to your saved encoder model (HDF5 file). | | str |
| decoder | Relative path to your saved decoder model (HDF5 file). | | str |
| content_type | Unique string describing the content this autoencoder can handle. Data tracks with the same content type will be encoded by this autoencoder. | | str |
| window_size | Window size in base pairs used for training the autoencoder. | | int |
| resolution | Resolution or bin size of the window in base pairs. | | int |
| latent_dim | Number of latent dimensions of the encoded windows. | | int |
| input_dim | Number of input dimensions for Keras. For 1D data these are 3: samples, data length (which is window_size / resolution), and channels. | 3 | int |
| channels | Number of channels of the input data. This is normally 1. | 1 | int |
| model_args | List of arguments passed to a custom encoder model. | 1 | int |

Note that if you have specified an autoencoder you do not need to provide separate encoder and decoder models.

Example:

{
  "encoder": "path/to/my-12kb-chip-seq-encoder.h5",
  "decoder": "path/to/my-12kb-chip-seq-decoder.h5",
  "content_type": "histone-mark-chip-seq",
  "window_size": 12000,
  "resolution": 100,
  "channels": 1,
  "input_dim": 3,
  "latent_dim": 12
}
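
Before starting the server, it can help to sanity-check such a definition. The sketch below is a hypothetical validator, not part of Peax: it assumes that the fields without a default in the table above are required, that either autoencoder or both encoder and decoder must be present, and that window_size should be a multiple of resolution. It returns the data length (window_size / resolution) that the model input would have.

```python
# Fields assumed to be required because the table lists no default for them.
REQUIRED_FIELDS = ("content_type", "window_size", "resolution", "latent_dim")


def check_encoder_definition(definition):
    """Validate an encoder definition and return its data length in bins."""
    has_autoencoder = "autoencoder" in definition
    has_pair = "encoder" in definition and "decoder" in definition
    if not (has_autoencoder or has_pair):
        raise ValueError(
            "provide either 'autoencoder' or both 'encoder' and 'decoder'"
        )
    for field in REQUIRED_FIELDS:
        if field not in definition:
            raise ValueError(f"missing required field: {field}")
    window_size = definition["window_size"]
    resolution = definition["resolution"]
    if window_size % resolution != 0:
        raise ValueError("window_size must be a multiple of resolution")
    return window_size // resolution


# The example definition from above.
example = {
    "encoder": "path/to/my-12kb-chip-seq-encoder.h5",
    "decoder": "path/to/my-12kb-chip-seq-decoder.h5",
    "content_type": "histone-mark-chip-seq",
    "window_size": 12000,
    "resolution": 100,
    "channels": 1,
    "input_dim": 3,
    "latent_dim": 12,
}
data_length = check_encoder_definition(example)
print(data_length)  # 12000 / 100 = 120 bins per window
```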

Datasets require the following format: