This repository has been archived by the owner on May 29, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 53
/
500_Effectively_Using_the_DelayedArray_Framework.Rmd
908 lines (671 loc) · 42.9 KB
/
500_Effectively_Using_the_DelayedArray_Framework.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
```{r include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE,
cache.lazy = FALSE
)
```
# 500: Effectively using the DelayedArray framework to support the analysis of large datasets
Authors:
Peter Francis Hickey^[Department of Biostatistics, Johns Hopkins University],
<br/>
Last modified: 19 June, 2018.
## Overview
### Description
This workshop will teach the fundamental concepts underlying the DelayedArray framework and related infrastructure.
It is intended for package developers who want to learn how to use the DelayedArray framework to support the analysis of large datasets, particularly through the use of on-disk data storage.
The first part of the workshop will provide an overview of the DelayedArray infrastructure and introduce computing on DelayedArray objects using delayed operations and block-processing.
The second part of the workshop will present strategies for adding support for DelayedArray to an existing package and extending the DelayedArray framework.
Students can expect a mixture of lecture and question-and-answer session to teach the fundamental concepts.
There will be plenty of examples to illustrate common design patterns for writing performant code, although we will not be writing much code during the workshop.
### Pre-requisites
* Solid understanding of R
* Familiarity with common operations on arrays (e.g., `colSums()` and those available in the **matrixStats** package)
* Familiarity with object oriented programming, particularly S4, will be useful but is not essential
* No familiarity required of technical details of particular data storage backends (e.g., HDF5, sparse matrices)
* No familiarity required of particular biological techniques (e.g., single-cell RNA-seq)
### Participation
Questions and discussion are encouraged!
This will be especially important to guide the second half of the worshop which focuses on integrating DelayedArray into an existing or new Bioconductor package.
Students will be expected to be able to follow and reason about example R code.
### _R_ / _Bioconductor_ packages used
* DelayedArray
* HDF5Array
* SummarizedExperiment
* DelayedMatrixStats
* beachmat
### Time outline
| Activity | Time |
|---------------------------------------------------|------|
| Introductory slides | 15m |
| Part 1: Overview of DelayedArray framework | 45m |
| Part 2: Incoporating DelayedArray into a package | 45m |
| Questions and discussion | 15m |
### Workshop goals and objectives
#### Learning goals
* Identify when it is useful to use a *DelayedArray* instead of an ordinary array or other array-like data structure.
* Become familiar with the fundamental concepts of delayed operations, block-processing, and realization.
* Learn of existing functions and packages for constructing and computing on DelayedArray objects, avoiding the need to re-invent the wheel.
* Learn common design patterns for writing performant code that operates on a DelayedArray.
* Evaluate whether an existing function that operates on an ordinary array can be readily adapted to work on a DelayedArray.
* Reason about potential bottlenecks in algorithms operating on DelayedArray objects.
#### Learning objectives
* Understand the differences between a *DelayedArray* instance and an instance of a subclass (e.g., *HDF5Array*, *RleArray*).
* Know what types of operations 'degrade' an instance of a *DelayedArray* subclass to a *DelayedArray*, as well as when and why this matters.
* Construct a *DelayedArray*:
* From an in-memory array-like object.
* From an on-disk data store (e.g., HDF5).
* From scratch by writing data to a *RealizationSink*.
* Take a function that operates on rows or columns of a matrix and apply it to a DelayedMatrix.
* Use block-processing on a *DelayedArray* to compute:
* A univariate (scalar) summary statistic (e.g., `max()`).
* A multivariate (vector) summary statistic (e.g., `colSums()` or `rowMeans()`).
* A multivariate (array-like) summary statistic (e.g., `rowRanks()`).
* Design an algorithm that imports data into a DelayedArray.
## Introductory material
Data from a high-throughput biological assay, such as single-cell RNA-sequencing (scRNA-seq), will often be summarised as a matrix of counts, where rows correspond to features and columns to samples^[Higher-dimensional arrays may be appropriate for some types of assays.].
Within **Bioconductor**, the *SummarizedExperiment* class is the recommended container for such data, offering a rich interface that tightly links assay measurements to data on the features and the samples.
![The _SummarizedExperiment_ class is used to store rectangular arrays of experimental results (_assays_). Although each _assay_ is here drawn as a matrix, higher-dimensional arrays are also supported.](100_Morgan_RBiocForAll/SummarizedExperiment.png)
Traditionally, the assay data are stored in-memory as an ordinary *array* object^[In R, a *matrix* is just a 2-dimensional *array*].
Storing the data in-memory becomes a real pain with the ever-growing size of 'omics datasets; it is now not uncommon to collect $10,000-100,000,000$ measurements on $100 - 1,000,000$ samples, which would occupy $10-1,000$ gigabytes (GB) if stored in-memory as ordinary R arrays!
Let's take as an example some single-cell RNA-seq data on 1.3 million brain cells from embryonic mice, generated by 10X Genomics^[These data are available in the **TENxBrainData** Bioconductor package].
```{r, message = FALSE}
library(TENxBrainData)
# NOTE: This will download the data and may take a little while on the first
# run. The result will be cached, however, so subsequent runs are near
# instantaneous.
tenx <- TENxBrainData()
# The data are stored in a SingleCellExperiment, an extension of the
# SummarizedExperiment class.
class(tenx)
dim(tenx)
# How big much memory do the counts data use?
counts <- assay(tenx, "counts", withDimnames = FALSE)
print(object.size(counts))
```
The counts data only require a tiny amount of RAM despite it having `r nrow(tenx)` rows and `r ncol(tenx)` columns.
And the `counts` object still "feels" like an ordinary R matrix:
```{r}
# Oooh, pretty-printing.
counts
# Let's take a subset of the data
counts[1:10, 1:10]
# Let's compute column sums (crude library sizes) for the first 100 samples.
# TODO: DelayedArray:: prefix shouldn't be necessary
DelayedArray::colSums(counts[, 1:100])
```
The reason for the small memory footprint and matrix-like "feel" of the `counts` object is because the counts data are in fact stored on-disk in a Hierarchical Data Format (**HDF5**) file and we are interacting with the data via the DelayedArray framework.
### Discussion
**TODO:** Make these 'discussions/challenges' into a boxes that visually break out
- Talk with your neighbour about the sorts of 'big data' you are analysing, the challenges you've faced, and strategies you're using to tackle this.
- **TODO:** (keep this?) Play around with the `tenx` and `counts` data. Could use this to demonstrate the upcoming challenges of chunking (e.g., compute row-wise summary).
**TABLE** gives some examples of contemporary experiments and the memory requirements if these data are to be stored in-memory as ordinary R arrays.
```{r, echo = FALSE}
# TODO: Add WGBS (mCA)
# TODO: An example of an RleArray-type dataset (ChIP-seq)
x <- data.frame(
Assay = c("scRNA-seq", "WGBS (mCG)"),
Size = c(
as.character(
memuse::howbig(27998, 1306127, type = "integer", prefix = "SI")),
as.character(
2 * memuse::howbig(29307073, 145, type = "integer", prefix = "SI") +
memuse::howbig(29307073, 145, type = "double", prefix = "SI"))),
"`nrow`" = c("27,998", "29,307,073"),
"`ncol`" = c("1,306,127", "145"),
"Number of assays" = c(1, 3),
Type = c("sparse integer", "2 x dense integer, 1 x dense double"),
Reference = c(
"https://bioconductor.org/packages/TENxBrainData/",
"eGTEx (unpublished)"),
check.names = FALSE,
stringsAsFactors = FALSE)
# TODO: Manually update order to be increasing in x$Size
knitr::kable(x[2:1, ], row.names = FALSE)
```
There are various strategies for handling such large data in R.
For example, we could use:
1. Sparse matrices for single-cell RNA-seq data.
2. Run length encoded vectors for ChIP-seq coverage profiles.
3. Storing data on disk, and only bringing into memory as required, for whole genome methylation data^[This strategy can also be effectively applied to sparse and repetitive data by using on-disk compression of the data.].
Each of these approaches has its strengths, as well as weaknesses and idiosyncrasies.
For example,
- **Matrix** doesn't support long vectors
- Sparsity is readily lost (e.g., normalization often destroys it)
- etc.
**TODO:** Lead discussion of limitations
### Why learn DelayedArray?
The high-level goals of the DelayedArray framework are:
- Provide a common R interface to array-like data, where the data may be in-memory or on-disk.
- Support delayed operations, which avoid doing any computation until the result is required.
- Support block-processing of the data, which enables bounded-memory and parallel computations.
From the **DelayedArray** `DESCRIPTION`:
**TODO:** Use **desc** to extract Description field?
> Wrapping an array-like object (typically an on-disk object) in a DelayedArray object allows one to perform common array operations on it without loading the object in memory. In order to reduce memory usage and optimize performance, operations on the object are either delayed or executed using a block processing mechanism. Note that this also works on in-memory array-like objects like DataFrame objects (typically with Rle columns), Matrix objects, and ordinary arrays and data frames. (https://bioconductor.org/packages/release/bioc/html/DelayedArray.html)
These goals are similar to that of the **tibble** and **dplyr** packages:
> A tibble, or `tbl_df`, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly (http://tibble.tidyverse.org/#overview)
> dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. (https://dplyr.tidyverse.org/#overview)
An important feature of the DelayedArray framework is that it supports all these strategies, and more, with a common interface that aims to feel like the interface to ordinary arrays, which provides familiarity to R users.
#### Learning goal
- Identify when it is useful to use a *DelayedArray* instead of an ordinary array or other array-like data structure.
## Overview of DelayedArray framework
The core of DelayedArray framework is implemented in the **DelayedArray** Bioconductor package.
Other packages extend the framework or simply use it as is to enable analyses of large datasets.
### The **DelayedArray** package
The **DelayedArray** package defines the key classes, generics, and methods^[The **DelayedArray** package, like all core Bioconductor packages, uses the [S4 object oriented programming system.](http://adv-r.had.co.nz/OO-essentials.html#s4)], as well as miscellaneous helper functions, that implement the DelayedArray framework.
The reverse dependencies of **DelayedArray** are shown below:
```{r, echo = FALSE, message = FALSE, dev = 'png'}
library(BiocPkgTools)
library(tidygraph)
library(ggraph)
`%>%` <- dplyr::`%>%`
# TODO: Re-write so as to avoid library() calls that clobbers the NAMESPACE
# NOTE: From http://lazappi.id.au/2018/06/exploring-the-sce-verse/
get_bioc_deps <- function(bpi, pkg, reverse) {
deps <- bpi %>%
dplyr::filter(Package == pkg)
if (reverse) {
deps <- deps %>%
dplyr::select(depends = dependsOnMe, imports = importsMe,
suggests = suggestsMe)
} else {
deps <- deps %>%
dplyr::select(depends = Depends, imports = Imports,
suggests = Suggests)
}
deps <- deps %>%
tidyr::gather(key = "type", value = "package") %>%
tidyr::separate_rows() %>%
dplyr::filter(!is.na(package))
if (reverse) {
deps <- deps %>%
dplyr::mutate(package2 = pkg) %>%
dplyr::rename(package1 = package)
} else {
deps <- deps %>%
dplyr::mutate(package1 = pkg) %>%
dplyr::rename(package2 = package)
}
deps <- deps %>% dplyr::select(package1, uses = type, package2)
}
# NOTE: From http://lazappi.id.au/2018/06/exploring-the-sce-verse/
get_cran_deps <- function(pkg, db, reverse) {
types <- c("Depends", "Imports", "Suggests")
deps <- sapply(types, function(type) {
deps <- tools::package_dependencies(pkg, db, which = type,
reverse = reverse)
c(type = type, package = paste(deps[[1]], collapse = ", "))
})
deps <- deps %>%
t() %>%
dplyr::as_data_frame() %>%
dplyr::mutate(type = tolower(type)) %>%
dplyr::filter(package != "") %>%
tidyr::separate_rows(package)
if (nrow(deps) == 0) {
return(dplyr::tibble(package1 = character(), uses = character(),
package2 = character()))
}
if (reverse) {
deps <- deps %>%
dplyr::mutate(package2 = pkg) %>%
dplyr::rename(package1 = package)
} else {
deps <- deps %>%
dplyr::mutate(package1 = pkg) %>%
dplyr::rename(package2 = package)
}
deps <- deps %>% dplyr::select(package1, uses = type, package2)
}
bpi <- getBiocPkgList()
db <- available.packages(repos = "http://cran.r-project.org")
bioc_revdeps <- get_bioc_deps(bpi, "DelayedArray", reverse = TRUE)
cran_revdeps <- get_cran_deps("DelayedArray", db, reverse = TRUE)
nodes <- bioc_revdeps %>%
dplyr::bind_rows(cran_revdeps) %>%
dplyr::select(-uses) %>%
tidyr::gather(key = id, value = package) %>%
dplyr::select(-id) %>%
dplyr::distinct() %>%
dplyr::mutate(
repo = dplyr::if_else(package %in% bpi$Package, "Bioconductor", "CRAN"))
edges <- bioc_revdeps %>%
dplyr::bind_rows(cran_revdeps) %>%
dplyr::rename(from = package1, to = package2)
graph <- tbl_graph(nodes = nodes, edges = edges)
ggraph(graph, layout = "fr") +
geom_edge_fan(aes(colour = uses),
arrow = arrow(length = unit(4, 'mm')),
end_cap = circle(3, 'mm')) +
geom_node_point(aes(colour = repo)) +
geom_node_text(aes(label = package, colour = repo), repel = TRUE) +
scale_color_brewer(palette = "Set1") +
scale_edge_color_brewer(palette = "Dark2") +
theme_graph()
```
The above figures includes packages that extend the DelayedArray framework in various ways, as well as those that simply use the DelayedArray framework to analyse specific types of 'omics data. We briefly discuss some of these:
#### Packages that extend **DelayedArray**
There are two ways a package may extend the DelayedArray framework.
The first kind of package adds support for a new _realization backend_ (**TODO** Define?). Examples of this are:
- The **HDF5Array** package adds the *HDF5Array* realization backend for accessing and creating data stored on-disk in an HDF5 file. This is typically used for on-disk representation of multidimensional (numeric) arrays.
- The **GDSArray** package adds the *GDSArray* backend for accessing and creating data stored on-disk in a GDS file. This is typically used for on-disk representation of genotyping or sequence data.
- The **rhdf5client** packages adds the *H5S_Array* realization backend for accessing and creating data stored on a HDF Server, a Python-based web service that can be used to send and receive HDF5 data using an HTTP-based REST interface. This is typically used for on-disk representation of multidimensional (numeric) arrays that need to be shared with multiple users from a central location.
The second kind of package adds methods for computing on _DelayedArray_ instances. Examples of this are:
- The **DelayedMatrixStats** package provides methods for computing commonly used row- and column-wise summaries of a 2-dimensional *DelayedArray*.
- The **beachmat** package provides a consistent C++ class interface for a variety of commonly used matrix types, including ordinary R arrays, sparse matrices, and *DelayedArray* with various backends.
- The **kmknn** package provides methods for performing k-means for k-nearest neighbours of data stored in a *DelayedArray*.
- Packages for performing matrix factorizations, (generalized) linear regression, , etc. (work in progress)
#### Packages that use **DelayedArray**
- The **bsseq** package uses the DelayedArray framework to support the analysis of large whole-genome bisulfite methylation sequencing experiments.
- **TODO**: minfi, scater, scran, others
### The *DelayedArray* class
The *DelayedArray* class is the key data structure that end users of the DelayedArray framework will interact with.
A *DelayedMatrix* is the same thing as a two-dimensional *DelayedArray*, just as a *matrix* is the same thing as a two-dimensional *array*.
The *DelayedArray* class has a single slot called the *seed*.
This name is evocative, it is the core of the object.
A package developer may create a subclass of *DelayedArray*; we will make extensive use of the *HDF5Array* class, for example.
```{r}
# TODO: Graphical representation of this network?
# TODO: Illustrate How classes depend on one another
# a. with DelayedArray as the root
# b. With DelayedArray as the leaf
showClass(getClass("DelayedArray", where = "DelayedArray"))
showClass(getClass("HDF5Array", where = "HDF5Array"))
```
#### Learning objectives
In this section, we'll learn how an end user can construct a *DelayedArray* from:
* An in-memory array-like object.
* A local, on-disk data store, such as an HDF5 file.
* A remote data store, such as from a HDF Server.
We'll also learn:
* What defines the "seed contract"
* What types of operations 'degrade' an instance of a *DelayedArray* subclass to a *DelayedArray*, as well as when and why this matters (**TODO** Save the when and why it matters to later?).
#### The seed of a *DelayedArray*
From an end user's perspective, there are a two broad categories of seed:
1. In-memory seeds
2. Out-of-memory seeds
a. Local, on-disk seeds
b. Remote (in-memory or on-disk) seeds
**TODO:** Box this out as an aside
> For a developer's perspective, there's a third category of seed that is defined by the *DelayedOp* class.
Instances of the *DelayedOp* class are not intended to be manipulated directly by the end user, but they are central to the concept of delayed operations, which we will learn about later (**TODO:** Link to section).
A seed must implement the "seed contract"^[For further details, see the vignette in the **DelayedArray** package (available via `vignette("02-Implementing_a_backend", "DelayedArray")`)]:
- `dim(x)`: Return the dimensions of the seed.
- `dimnames(x)`: Return the (possibly `NULL`) dimension names of the seed.
- `extract_array(x, index)`: Return a slice, specified by `index`, of the seed as an ordinary R *array*.
#### In-memory seeds
To begin, we'll consider a *DelayedArray* instance with the simplest *seed*, an in-memory array:
```{r}
library(DelayedArray)
mat <- matrix(rep(1:20, 1:20), ncol = 2)
da_mat <- DelayedArray(seed = mat)
da_mat
```
We can use other, more complex, array-like objects as the *seed*, such as *Matrix* objects from the **Matrix** package:
```{r}
library(Matrix)
Mat <- Matrix(mat)
da_Mat <- DelayedArray(seed = Mat)
# NOTE: The type is "double" because of how the Matrix package stores the data.
da_Mat
```
We can even use data frames as the *seed* of a two-dimensional *DelayedArray*.
```{r}
df <- as.data.frame(mat)
da_df <- DelayedArray(seed = df)
# NOTE: This inherits the (default) column names of the data.frame.
da_df
library(tibble)
tbl <- as_tibble(mat)
da_tbl <- DelayedArray(seed = tbl)
# NOTE: This inherits the (default) column names of the tibble.
da_tbl
DF <- as(mat, "DataFrame")
da_DF <- DelayedArray(seed = DF)
# NOTE: This inherits the (default) column names of the DataFrame.
da_DF
```
A package developer can also implement a novel in-memory seed.
For example, the **DelayedArray** package defines the *RleArraySeed* class^[In fact, the *RleArraySeed* class is a virtual class, with concrete subclasses *SolidRleArraySeed* and *ChunkedRleArraySeed*].
This can be used as the seed of an *RleArray*, a *DelayedArray* subclass for storing run-length encoded data.
```{r}
# NOTE: The DelayedArray package does not expose the RleArraySeed() constructor.
# Instead, we directly call the RleArray() constructor on the run-length
# encoded data.
da_Rle <- RleArray(rle = Rle(mat), dim = dim(mat))
da_Rle
```
The *RleArray* examples illustrates some important concepts in the DelayedArray class hierarchy that warrants reiteration and expansion.
#### Degrading DelayedArray subclasses
The `da_Rle` object is an *RleMatrix*, a direct subclass of *RleArray* and a direct subclass of *DelayedMatrix*.
Both *RleArray* and *DelayedMatrix* are direct subclasses of a *DelayedArray*.
As such, in accordance with properties of S4 class inheritance, **an *RleMatrix* is a *DelayedArray***.
```{r}
da_Rle
is(da_Rle, "DelayedArray")
showClass(getClass("RleMatrix", where = "DelayedArray"))
```
However, if we do (almost) anything to the *RleMatrix*, the result is 'degraded' to a *DelayedMatrix*.
This 'degradation' isn't an issue in and of itself, but it can be surprising and can complicate writing functions that expect a certain type of input (e.g., S4 methods).
```{r}
# NOTE: Adding one to each element will 'degrade' the result.
da_Rle + 1L
is(da_Rle + 1L, "RleMatrix")
# NOTE: Subsetting will 'degrade' the result.
da_Rle[1:10, ]
is(da_Rle[1:10, ], "RleMatrix")
# NOTE: Changing the dimnames will 'degrade' the result.
da_Rle_with_dimnames <- da_Rle
colnames(da_Rle_with_dimnames) <- c("A", "B")
da_Rle_with_dimnames
is(da_Rle_with_dimnames, "RleMatrix")
# NOTE: Transposing will 'degrade' the result.
t(da_Rle)
is(t(da_Rle), "RleMatrix")
# NOTE: Even adding zero (conceptually a no-op) will 'degrade' the result.
da_Rle + 0L
```
There are some exceptions to this rule, when the DelayedArray framework can recognise/guarantee that the operation is a no-op that will leave the object in its current state:
```{r}
# NOTE: Subsetting to select the entire object does not 'degrade' the result.
da_Rle[seq_len(nrow(da_Rle)), seq_len(ncol(da_Rle))]
is(da_Rle[seq_len(nrow(da_Rle)), seq_len(ncol(da_Rle))], "RleMatrix")
# NOTE: Transposing and transposing back does not 'degrade' the result.
t(t(da_Rle))
is(t(t(da_Rle)), "RleMatrix")
```
A particularly important example of this 'degrading' to be aware of is when accessing the assays of a *SummarizedExperiment* via the `assay()` and `assays()` getters.
Each of these getters has an argument `withDimnames` with a default value of `TRUE`; this copies the dimnames of the *SummarizedExperiment* to the returned assay. Consequently, this may 'degrade' the object(s) returned by `assay()` and `assays()` to *DelayedArray*.
To avoid this, use `withDimnames = FALSE` in the call to `assay()` and `assays()`.
```{r}
library(SummarizedExperiment)
# Construct a SummarizedExperiment with column names 'A' and 'B' and da_Rle as
# the assay data.
se <- SummarizedExperiment(da_Rle, colData = DataFrame(row.names = c("A", "B")))
se
# NOTE: dimnames are copied, so the result is 'degraded'.
assay(se)
is(assay(se), "RleMatrix")
# NOTE: dimnames are not copied, so the result is not 'degraded'.
assay(se, withDimnames = FALSE)
is(assay(se, withDimnames = FALSE), "RleMatrix")
```
As noted up front, the degradation isn't in and of itself a problem.
However, if you are passing an object to a function that expects an *RleMatrix*, for example, some care needs to be taken to ensure that the object isn't accidentally 'degraded' along the way to a *DelayedMatrix*.
To summarise, the lessons here are:
- Modifying an instance of a *DelayedArray* subclass will almost always 'degrade' it to a *DelayedArray*.
- It is very easy to accidentally trigger this 'degradation'.
- This 'degradation' can complicate method dispatch.
#### Out-of-memory seeds
The DelayedArray framework really shines when working with out-of-memory seeds.
It can give the user the "feel" of interacting with an ordinary R array but allows for the data to be stored on a local disk or even on a remote server, thus reducing (local) memory usage.
##### Local on-disk seeds
The **HDF5Array** package defines the *HDF5Array* class, a *DelayedArray* subclass for data stored on disk in a HDF5 file.
The *seed* of an *HDF5Array* is a *HDF5ArraySeed*.
It is important to note that creating a *HDF5Array* does not read the data into memory!
The data remain on disk until requested (we'll see how to do this later on in the workshop).
```{r}
library(HDF5Array)
hdf5_file <- file.path(
"500_Effectively_Using_the_DelayedArray_Framework",
"hdf5_mat.h5")
# NOTE: We can use rhdf5::h5ls() to take a look what is in the HDF5 file.
# This is very useful when working interactively!
rhdf5::h5ls(hdf5_file)
# We can create the HDF5Array by first creating a HDF5ArraySeed and then
# creating the HDF5Array.
hdf5_seed <- HDF5ArraySeed(filepath = hdf5_file, name = "hdf5_mat")
da_hdf5 <- DelayedArray(seed = hdf5_seed)
da_hdf5
# Alternatively, we can create this in one go using the HDF5Array() constructor.
da_hdf5 <- HDF5Array(filepath = hdf5_file, name = "hdf5_mat")
da_hdf5
```
Other on-disk seeds are possible such as **fst**, **bigmemory**, **ff**, or **matter**.
##### Remote seeds
The **rhdf5client** packages defines the *H5S_Array* class, a *DelayedArray* subclass for data stored on a remote HDF Server.
The *seed* of an *H5S_Array* is a *H5S_ArraySeed*.
It is important to note that creating a *H5S_Array* does not read the data into memory!
The data remain on the server until requested.
```{r}
library(rhdf5client)
da_h5s <- HSDS_Matrix(URL_hsds(), "/home/stvjc/hdf5_mat.h5")
da_h5s
```
#### So what seed should I use?
Notably, `da_mat`, `da_Mat`, `da_tbl`, `da_df`, `da_Rle`, `da_hdf5`, and `da_h5s` all "look" and "feel" much the same.
The *DelayedArray* is a very light wrapper around the *seed* that formalises this consistent "look" and "feel".
If your data always fit in memory, and it's still comfortable to work with it interactively, then you should probably stick with using ordinary *matrix* and *array* objects.
If, however, your data sometimes don't fit in memory or it becomes painful to work with them when they are in-memory, you should consider a disk-backed *DelayedArray*.
If you need to share a large dataset with a number of uses, and most users only need slices of it, you may want to consider a remote server-backed *DelayedArray*.
Not suprisingly, working with in-memory seeds will typically be faster than disk-backed or remote seeds; you are trading off memory-usage and data access times to obtain faster performance.
Below are some further opinionated recommendations.
Although the *DelayedArray* is a light wrapper, it does introduce some overhead.
For example, operations on a *DelayedArray* with an *array* or *Matrix* seed will be slower than operating on the *array* or *Matrix* directly.
For some operations this is almost non-existant (e.g., seed-aware operations in the **DelayedMatrixStats** package), but for others this overhead may be noticeable.
Conversely, a potential upside to wrapping a *Matrix* in a *DelayedArray* is that it can now leverage the block-processing strategy of the DelayedArray framework.
If you need a disk-backed *DelayedArray*, I would recommend using *HDF5Array* as the backend at this time.
HDF5 is a well-established scientific data format and the **HDF5Array** package is developed by the core Bioconductor team.
Furthermore, several bioinformatics tools natively export HDF5 files (e.g., **kallisto**, **CellRanger** from 10x Genomics).
One potential downside of HDF5 is the recent split into "Community" and "Enterprise Support" editions.
It's not yet clear how this will affect the HDF5 library or its community.
Other on-disk backends may not yet have well-established formats or libraries (e.g., **TileDB**).
A topic of on-going research in the Bioconductor community is to compare the performance of HDF5 to other disk-backed data stores.
Another topic is how to best choose the layout of the data on disk, often referred to as the "chunking" of the data.
For example, if the data is stored in column-major order then it might be expected that row-wise data access suffers.
One final comment.
If you are trying to support both in-memory and on-disk data, it may be worth considering using *DelayedArray* for everything rather than trying to support both *DelayedArray* and ordinary arrays.
This is particularly true for internal, non-user facing code.
Otherwise your software may need separate code paths for *array* and *DelayedArray* inputs, although the need for separate branches is reducing.
For example, when I initially added support for the DelayedArray framework in **bsseq** (~2 years ago), I opted to remove support for ordinary arrays.
However, more recently I added support for the DelayedArray framework in **minfi**, and there I opted to maintain support for ordinary arrays.
I've found that as the DelayedArray framework matures and expands, it is increasingly common that the same code "just works" for both *array* and *DelayedArray* objects.
Maintaining support for ordinary arrays can also be critical for popular Bioconductor packages, although this can be handled by having functions that accept an *array* as input simply wrapping them in a *DelayedArray* for internal computations.
### Operating on *DelayedArray* objects
Now that we know what a *DelayedArray* is, it's various subclasses, and how to construct these, we will discuss how to use operate on *DelayedArray* objects.
We'll cover three fundamental concepts:
1. Delayed operations
2. Block-processing
3. Realization
These three concepts work hand-in-hand.
Delayed operations are exactly that, the operation is recorded but performing the operation is delayed until the result is required; *realization* is the process of executing the delayed operations; and *block-processing* is used to perform operations, including *realization*, on blocks of the *DelayedArray*.
#### Learning goals
* Become familiar with the fundamental concepts of delayed operations, block-processing, and realization.
* **TODO** Construct a *DelayedArray* from scratch by writing data to a *RealizationSink*.
#### Delayed operations
A delayed operation is one that is not actually performed until the result is required.
Here's a simple example of a delayed operations: taking the negative of every element of a *HDF5Matrix*.
```{r}
da_hdf5
-da_hdf5
```
It may look like running `-da_hdf5` has taken the negative of every element.
However, we can see this is in fact not the case by using the `showtree()` function to inspect the internal state of these objects^[You may like to use the `str()` function for more detailed and verbose output]:
```{r}
showtree(da_hdf5)
# NOTE: The "Unary iso op" line is 'recording' the `-` as a delayed operation.
showtree(-da_hdf5)
```
To further illustrate the idea, let's perform some delayed operations on a large *HDF5Array* and compare them to performing the operations on the equivalent *array*.
```{r, cache.lazy=FALSE}
library(h5vcData)
tally_file <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")
x_h5 <- HDF5Array::HDF5Array(tally_file, "/ExampleStudy/16/Coverages")
x_h5
x <- as.array(x_h5)
# Delayed operations are fast compared to ordinary operations!
system.time(x_h5 + 100L)
system.time(x + 100L)
# Delayed operations can be chained
system.time(t(x_h5[1, , 1:100] + 100L))
```
Rather than modifying the data stored in the HDF5 file, which can be costly for large datasets, we've recorded the 'idea' of these operations as a tree of *DelayedOp* objects.
Each node in the tree is represented by a *DelayedOp* object, of which there are 6 concrete subclasses:
| Node type | Out-degree | Operation |
|---------------------|------------|-----------------------------------------------------------|
| *DelayedSubset* | 1 | Multi-dimensional single bracket subsetting |
| *DelayedAperm* | 1 | Extended `aperm()` (can drop dimensions) |
| *DelayedUnaryIsoOp* | 1 | Unary op that preserves the geometry (e.g., `-`, `log()`) |
| *DelayedDimnames* | 1 | Set dimnames |
| *DelayedNaryIsoOp* | N | N-ary op that preserves the geometry |
| *DelayedAbind* | N | `abind()` | |
#### Block-processing
Block-processing allows you to iterate over blocks of a *DelayedArray* in a manner that abstracts over the backend.
The following cartoon illustrates the basic idea of block-processing:
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide1.png)
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide2.png)
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide3.png)
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide4.png)
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide5.png)
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide6.png)
![](500_Effectively_Using_the_DelayedArray_Framework/block_processing/Slide7.png)
To implement block-processing, we first construct an *ArrayGrid* over the *DelayedArray*.
Each element of the *ArrayGrid* is called an *ArrayViewport*.
We iterate over the *ArrayGrid*, extract the corresponding *ArrayViewport*, and load the data for that block into memory as an ordinary R *array* where it can be processed using ordinary R functions.
In pseudocode, we might implement block processing as follows:
```{r, eval = FALSE}
# Construct an ArrayGrid over 'x'.
# NOTE: blockGrid() is explained below.
grid <- blockGrid(x)
for (b in seq_along(grid)) {
# Construct an ArrayViewPort using the b-th element of the ArrayGrid.
viewport <- grid[[b]]
# Read the block of data using the current 'viewport' applied to 'x'.
block <- read_block(x, viewport)
# Apply our function to 'block' (which is an ordinary R array).
FUN(block)
}
```
In practice, when constructing the *ArrayGrid* for a particular operation, we need to consider a few issues:
1. Constraints on how much data we can bring into memory (the 'maximum block length').
2. The geometry of the *ArrayGrid*, which will control the order in which we access elements.
3. How the data are physically stored ('chunking' of the data on disk).
##### Maximum block length
There is a global option `getOption("DelayedArray.block.size")` for controlling how much data is brough into memory with default value corresponding to each block containing a maximum of `r getOption("DelayedArray.block.size") / 10 ^ 6` Mb of data.
It's good practice to respect this value, however, some algorithms may require this be ignored (e.g., if you require a full column's worth of data for each block).
##### Chunking and data structure
Regardless of the backend, the data will be stored with a particular structure.
For example, ordinary R arrays store the data in column-major order.
The geometry of the data structure is especially relevant for HDF5 files.
The HDF5 format supports 'chunking' of data to maximize data access performance.
From [https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html](https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html):
> Datasets in HDF5 can represent arrays with any number of dimensions (up to 32). However, in the file this dataset must be stored as part of the 1-dimensional stream of data that is the low-level file. The way in which the multidimensional dataset is mapped to the serial file is called the layout. The most obvious way to accomplish this is to simply flatten the dataset in a way similar to how arrays are stored in memory, serializing the entire dataset into a monolithic block on disk, which maps directly to a memory buffer the size of the dataset. This is called a contiguous layout.
> An alternative to the contiguous layout is the chunked layout. Whereas contiguous datasets are stored in a single block in the file, chunked datasets are split into multiple chunks which are all stored separately in the file. The chunks can be stored in any order and any position within the HDF5 file. Chunks can then be read and written individually, improving performance when operating on a subset of the dataset.
Although this sounds similar to the blocks used in block-processing, the two are distinct.
**Chunks are used to determine how the data are stored, blocks are used to determine how the data are accessed via the *ArrayGrid*.**
That is, you can use a different *ArrayGrid* geometry to the chunk geometry, but performance may be suboptimal.
##### Geometry of the *ArrayGrid*
The *ArrayGrid* could be a regularly spaced grid (a *RegularArrayGrid*) or a grid with arbitrary spacing (an *ArbitraryArrayGrid*).
The **DelayedArray** package includes some helper functions for setting up *ArrayGrid* instances with particular geometries and other properties.
The `blockGrid()` function returns the "optimal" *ArrayGrid* for block-processing.
The grid is "optimal" in the sense that:
- It's "compatible" with the chunk grid (i.e. with `chunkGrid(x)` or with the chunk grid supplied via the `chunk.grid` argument). That is, the chunks are contained in the blocks. In other words, chunks never cross block boundaries.
- Its "resolution" is such that the blocks have a length that is as close as possible to (but does not exceed) `block.maxlength`. An exception is when some chunks are already >= `block.maxlength`, in which case the returned grid is the same as the chunk grid.
##### Block-processing in practice
In practice, most block-processing algorithms can be implemented by constructing an *ArrayGrid* with a suitable geometry and then using `DelayedArray::blockApply()` or `DelayedArray::blockReduce()`.
As an example, let's compute row and column medians of `da_hdf5`.
We'll start by constructing *ArrayGrid* instances over the rows and columns of `da_hdf5`.
```{r}
# NOTE: Making block-processing verbose to show what's going on under the hood.
DelayedArray:::set_verbose_block_processing(TRUE)
system.time(
row_medians <- blockApply(
x = da_hdf5,
FUN = median,
grid = RegularArrayGrid(
refdim = dim(da_hdf5),
spacings = c(1L, ncol(da_hdf5)))))
head(row_medians)
system.time(
col_medians <- blockApply(
x = da_hdf5,
FUN = median,
grid = RegularArrayGrid(
refdim = dim(da_hdf5),
spacings = c(nrow(da_hdf5), 1L))))
head(col_medians)
```
That works, but is somewhat slow for computing row maximums because we read from the HDF5 file `nrow(da_hdf5)` ( `r nrow(da_hdf5)`) times.
We would be better off loading multiple rows of data per block.
There are several ways to do this; here, we'll use `DelayedArray::blockGrid()` to construct an optimal *ArrayGrid* and `matrixStats::rowMedians()` and `matrixStats::colMedians()` to compute the row and column maximums.
```{r}
system.time(
row_medians <- blockApply(
x = da_hdf5,
FUN = matrixStats::rowMedians,
grid = blockGrid(
x = da_hdf5,
block.shape = "first-dim-grows-first")))
head(row_medians)
system.time(
col_medians <- blockApply(
x = da_hdf5,
FUN = matrixStats::colMedians,
grid = blockGrid(
x = da_hdf5,
block.shape = "last-dim-grows-first")))
head(col_medians)
```
For larger problems, we can improve performance by processing blocks in parallel by passing an appropriate *BiocParallelParam* object via the `bpparam` argument of `blockApply()` and `blockReduce()`.
Although `blockApply()` and `blockReduce()` cover most of the block-processing tasks, sometimes you may need to implement the block-processing at a lower level.
For example, you may need to iterate over multiple *DelayedArray* objects or your `FUN` returns an object equally large (or larger) than the `block`.
The details of these abstractions are still being worked out, but some likely candidates include methods that conceptually do:
- `blockMapply()`
- `blockApplyWithRealization()`
- `blockMapplyWithRealization()`
#### Realization
Realization is the process of taking a *DelayedArray*, executing any delayed operations, and returning the result as a new *DelayedArray* with the appropriate backend.
Returning to our earlier example with data from the **h5vcData** package:
```{r}
x_h5
showtree(x_h5)
y <- t(x_h5[1, , 1:10000000] + 100L)
showtree(y)
# Realize the result in-memory
z <- realize(y, BACKEND = NULL)
# NOTE: 'z' does not carry any delayed operations.
showtree(z)
# Realize the result on-disk in an autogenerated HDF5 file
z_h5 <- realize(y, BACKEND = "HDF5Array")
# NOTE: 'z_h5' does not carry any delayed operations.
showtree(z_h5)
path(z_h5)
# NOTE: The show() method performs realization on the first few and last few
# elements in order to preview the result
y
```
Realization uses block-processing under the hood:
```{r}
DelayedArray:::set_verbose_block_processing(TRUE)
realize(y, BACKEND = "HDF5Array")
DelayedArray:::set_verbose_block_processing(FALSE)
```
## What's out there already?
**UP TO HERE**
- **DelayedMatrixStats**
- **beachmat**
#### Learning goal
* Learn of existing functions and packages for constructing and computing on DelayedArray objects, avoiding the need to re-invent the wheel.
## Incoporating DelayedArray into a package
### Writing algorithms to process *DelayedArray* instances
#### Learning goals
* Learn common design patterns for writing performant code that operates on a DelayedArray.
* Evaluate whether an existing function that operates on an ordinary array can be readily adapted to work on a DelayedArray.
* Reason about potential bottlenecks in algorithms operating on DelayedArray objects.
#### Learning objectives
* Take a function that operates on rows or columns of a matrix and apply it to a DelayedMatrix.
* Use block-processing on a *DelayedArray* to compute:
* A univariate (scalar) summary statistic (e.g., `max()`).
* A multivariate (vector) summary statistic (e.g., `colSum()` or `rowMean()`).
* A multivariate (array-like) summary statistic (e.g., `rowRanks()`).
* Design an algorithm that imports data into a DelayedArray.
## Questions and discussion
This section will be updated to address questions and to summarise the discussion from the presentation of this workshop at BioC2018.
## TODOs
- Use **BiocStyle**?
- Show packages depend on one another, with HDF5Array as the root (i.e. explain the HDF5 stack)
- Use `suppressPackageStartupMessages()` or equivalent.
- Note that we'll be focusing on numerical array-like data, i.e. no real discussion of **GDSArray**.
- Remove **memuse** dependency
- Link to my useR talk and slides