[PUBLISHER] Merge #14
* PUSH NOTE : Joaquin Fontbona.md

* PUSH NOTE : Javier Maass Martinez.md

* PUSH NOTE : Jan E. Gerken.md

* PUSH NOTE : Symmetries in Overparametrized Neural Networks - A Mean-Field View.md

* PUSH NOTE : Emergent Equivariance in Deep Ensembles.md

* PUSH NOTE : Color Space Transformation Network.md

* PUSH NOTE : Block Transformer - Global-to-Local Language Modeling for Fast Inference.md

* PUSH NOTE : Provably Strict Generalisation Benefit for Equivariant Models.md

* PUSH NOTE : In Search of Projectively Equivariant Networks.md

* PUSH NOTE : University of Chile.md

* PUSH NOTE : Chalmers University of Technology.md
dgcnz committed Jun 6, 2024
1 parent 13bc7df commit 620c2f4
Showing 11 changed files with 108 additions and 0 deletions.
@@ -0,0 +1,21 @@
---
authors:
- "[[Namgyu Ho|Namgyu Ho]]"
- "[[Sangmin Bae|Sangmin Bae]]"
- "[[Taehyeon Kim|Taehyeon Kim]]"
- "[[Hyunjik Jo|Hyunjik Jo]]"
- "[[Yireun Kim|Yireun Kim]]"
- "[[Tal Schuster|Tal Schuster]]"
- "[[Adam Fisch|Adam Fisch]]"
- "[[James Thorne|James Thorne]]"
- "[[Se-Young Yun|Se-Young Yun]]"
year: 2024
tags:
- efficient_dl
- transformers
url: https://arxiv.org/abs/2406.02657
share: true
---
> [!info] Abstract
> This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context, therefore we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through novel application of global-to-local modeling. Code is available at https://github.com/itsnamgyu/block-transformer.
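
The global-to-local split described above can be sketched in a few lines of PyTorch. Everything below (module names, the concatenation-based block embedder, layer counts, and the omission of causal masking) is an illustrative simplification, not the authors' released implementation; see the linked repository for the real one.

```python
import torch
import torch.nn as nn

class BlockTransformerSketch(nn.Module):
    """Toy global-to-local decoder: coarse attention over block embeddings,
    then local attention within each block conditioned on its context."""
    def __init__(self, vocab_size=32000, d_model=512, block_len=4, n_heads=8):
        super().__init__()
        self.block_len = block_len
        self.embed = nn.Embedding(vocab_size, d_model)
        # Embedder: aggregate each block of `block_len` tokens into one embedding.
        self.block_proj = nn.Linear(block_len * d_model, d_model)
        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Global block decoder: self-attention only over coarse block embeddings.
        self.block_decoder = nn.TransformerEncoder(layer(), num_layers=2)
        # Local token decoder: attends only within a block plus its context slot,
        # so no global KV cache is touched here. Causal masking omitted for brevity.
        self.token_decoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (B, T), T % block_len == 0
        B, T = tokens.shape
        L, D = self.block_len, self.embed.embedding_dim
        x = self.embed(tokens)                          # (B, T, D)
        blocks = self.block_proj(x.view(B, T // L, L * D))   # (B, T/L, D)
        ctx = self.block_decoder(blocks)                # coarse global context per block
        # Prepend each block's context embedding to its own token embeddings.
        local = torch.cat([ctx.unsqueeze(2), x.view(B, T // L, L, D)], dim=2)
        out = self.token_decoder(local.view(B * (T // L), L + 1, D))[:, 1:, :]
        return self.lm_head(out.reshape(B, T, D))       # next-token logits
```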
@@ -0,0 +1,11 @@
---
authors:
- "[[Alexandros Karargyris|Alexandros Karargyris]]"
year: 2015
tags:
- cnn
url: https://arxiv.org/abs/1511.01064
share: true
---
> [!info] Abstract
> Deep networks have become very popular over the past few years. The main reason for this widespread use is their excellent ability to learn and predict knowledge in a very easy and efficient way. Convolutional neural networks and auto-encoders have become the normal in the area of imaging and computer vision achieving unprecedented accuracy levels in many applications. The most common strategy is to build and train networks with many layers by tuning their hyper-parameters. While this approach has proven to be a successful way to build robust deep learning schemes it suffers from high complexity. In this paper we introduce a module that learns color space transformations within a network. Given a large dataset of colored images the color space transformation module tries to learn color space transformations that increase overall classification accuracy. This module has shown to increase overall accuracy for the same network design and to achieve faster convergence. It is part of a broader family of image transformations (e.g. spatial transformer network).
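
A plausible minimal reading of such a module is a learnable per-pixel channel mixer placed in front of an ordinary classifier. The sketch below assumes a small 1x1-convolution parametrization, which may differ from the paper's exact design.

```python
import torch.nn as nn

class ColorTransform(nn.Module):
    """Learnable per-pixel color-space transformation (channel mixing only)."""
    def __init__(self, hidden=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=1),  # 1x1 conv: mixes RGB channels, no spatial context
            nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=1),  # map back to a 3-channel "color space"
        )

    def forward(self, x):                         # x: (B, 3, H, W)
        return self.net(x)

# Trained end-to-end in front of any classifier, e.g.:
model = nn.Sequential(
    ColorTransform(),
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
```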
@@ -0,0 +1,13 @@
---
authors:
- "[[Jan E. Gerken|Jan E. Gerken]]"
- "[[Pan Kessel|Pan Kessel]]"
year: 2024
tags:
- equivariance
- dl_theory
url: https://arxiv.org/abs/2403.03103
share: true
---
> [!info] Abstract
> We demonstrate that deep ensembles are secretly equivariant models. More precisely, we show that deep ensembles become equivariant for all inputs and at all training times by simply using data augmentation. Crucially, equivariance holds off-manifold and for any architecture in the infinite width limit. The equivariance is emergent in the sense that predictions of individual ensemble members are not equivariant but their collective prediction is. Neural tangent kernel theory is used to derive this result and we verify our theoretical insights using detailed numerical experiments.
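
The flavor of the claim can be checked numerically for a finite group. The toy measurement below (random, untrained CNNs, 90-degree rotations, 50 members — all illustrative choices) only shows the kind of gap one would measure; the paper's actual statement concerns data-augmented training and the infinite-width NTK limit.

```python
import torch
import torch.nn as nn

def make_member():
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

torch.manual_seed(0)
ensemble = [make_member() for _ in range(50)]
x = torch.randn(1, 1, 28, 28)
x_rot = torch.rot90(x, k=1, dims=(2, 3))        # group action: 90-degree rotation

with torch.no_grad():
    preds = torch.stack([m(x) for m in ensemble])        # (50, 1, 10)
    preds_rot = torch.stack([m(x_rot) for m in ensemble])
    member_gap = (preds - preds_rot).abs().mean()        # typical single-member violation
    ensemble_gap = (preds.mean(0) - preds_rot.mean(0)).abs().mean()  # collective violation
    print(f"member gap {member_gap:.4f}  vs  ensemble gap {ensemble_gap:.4f}")
```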
@@ -0,0 +1,14 @@
---
authors:
- "[[Georg Bokman|Georg Bokman]]"
- "[[Axel Flinth|Axel Flinth]]"
- "[[Fredrik Kahl|Fredrik Kahl]]"
year: 2022
tags:
- equivariance
- dl_theory
url: https://arxiv.org/abs/2209.14719
share: true
---
> [!info] Abstract
> Equivariance of linear neural network layers is well studied. In this work, we relax the equivariance condition to only be true in a projective sense. We propose a way to construct a projectively equivariant neural network through building a standard equivariant network where the linear group representations acting on each intermediate feature space are "multiplicatively modified lifts" of projective group representations. By theoretically studying the relation of projectively and linearly equivariant linear layers, we show that our approach is the most general possible when building a network out of linear layers. The theory is showcased in two simple experiments.
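
For orientation, "equivariant in a projective sense" roughly means equivariant up to a nonzero scalar that may depend on the group element; with input and output group representations rho_1 and rho_2, the relaxed condition can be sketched (in my notation, not necessarily the paper's) as:

```latex
% Ordinary equivariance:        f(\rho_1(g) x) = \rho_2(g) f(x)
% Projective ("up to scale"):   equality only up to a nonzero scalar c(g)
f\bigl(\rho_1(g)\,x\bigr) = c(g)\,\rho_2(g)\,f(x), \qquad c(g) \neq 0 .
```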
@@ -0,0 +1,14 @@
---
authors:
- "[[Bryn Elesedy|Bryn Elesedy]]"
- "[[Sheheryar Zaidi|Sheheryar Zaidi]]"
year: 2021
tags:
- dl_theory
- equivariance
url: https://arxiv.org/abs/2102.10333
share: true
---
> [!info] Abstract
> It is widely believed that engineering a model to be invariant/equivariant improves generalisation. Despite the growing popularity of this approach, a precise characterisation of the generalisation benefit is lacking. By considering the simplest case of linear models, this paper provides the first provably non-zero improvement in generalisation for invariant/equivariant models when the target distribution is invariant/equivariant with respect to a compact group. Moreover, our work reveals an interesting relationship between generalisation, the number of training examples and properties of the group action. Our results rest on an observation of the structure of function spaces under averaging operators which, along with its consequences for feature averaging, may be of independent interest.
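
The averaging operator mentioned at the end is easy to write down for a finite group. The sketch below uses cyclic coordinate shifts acting on R^d and a plain linear predictor — illustrative choices, not the paper's setting, which covers general compact groups.

```python
import numpy as np

d = 6
rng = np.random.default_rng(0)
w = rng.normal(size=d)                 # an arbitrary linear predictor f(x) = w @ x

def shift(x, k):                       # group action: cyclic shift of coordinates by k
    return np.roll(x, k)

def averaged_predictor(x):
    # (Q f)(x) = (1/|G|) * sum_g f(g . x): the orbit-averaged (invariant) part of f.
    return np.mean([w @ shift(x, k) for k in range(d)])

x = rng.normal(size=d)
# The averaged predictor is exactly invariant to the group action:
print(averaged_predictor(x), averaged_predictor(shift(x, 2)))
```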
@@ -0,0 +1,14 @@
---
authors:
- "[[Javier Maass Martinez|Javier Maass Martinez]]"
- "[[Joaquin Fontbona|Joaquin Fontbona]]"
year: 2024
tags:
- dl_theory
- equivariance
url: https://arxiv.org/abs/2405.19995
share: true
---
> [!info] Abstract
> We develop a Mean-Field (MF) view of the learning dynamics of overparametrized Artificial Neural Networks (NN) under data symmetric in law wrt the action of a general compact group G. We consider for this a class of generalized shallow NNs given by an ensemble of N multi-layer units, jointly trained using stochastic gradient descent (SGD) and possibly symmetry-leveraging (SL) techniques, such as Data Augmentation (DA), Feature Averaging (FA) or Equivariant Architectures (EA). We introduce the notions of weakly and strongly invariant laws (WI and SI) on the parameter space of each single unit, corresponding, respectively, to G-invariant distributions, and to distributions supported on parameters fixed by the group action (which encode EA). This allows us to define symmetric models compatible with taking N→∞ and give an interpretation of the asymptotic dynamics of DA, FA and EA in terms of Wasserstein Gradient Flows describing their MF limits. When activations respect the group action, we show that, for symmetric data, DA, FA and freely-trained models obey the exact same MF dynamic, which stays in the space of WI laws and minimizes therein the population risk. We also give a counterexample to the general attainability of an optimum over SI laws. Despite this, quite remarkably, we show that the set of SI laws is also preserved by the MF dynamics even when freely trained. This sharply contrasts the finite-N setting, in which EAs are generally not preserved by unconstrained SGD. We illustrate the validity of our findings as N gets larger in a teacher-student experimental setting, training a student NN to learn from a WI, SI or arbitrary teacher model through various SL schemes. We last deduce a data-driven heuristic to discover the largest subspace of parameters supporting SI distributions for a problem, that could be used for designing EA with minimal generalization error.
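
The three symmetry-leveraging schemes the abstract compares (DA, FA, EA) differ mainly in where the group average is taken. A rough sketch for a shallow network and the group Z_2 acting by sign flips — the group, shapes, targets, and the crude "equivariant" stand-in are all my assumptions:

```python
import torch
import torch.nn as nn

group = [1.0, -1.0]                              # Z_2 acting on inputs by x -> g * x
net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss = lambda pred, y: ((pred - y) ** 2).mean()
x, y = torch.randn(32, 10), torch.randn(32, 1)   # pretend y is G-invariant

# Data augmentation (DA): average the loss over group-transformed inputs.
loss_da = torch.stack([loss(net(g * x), y) for g in group]).mean()

# Feature averaging (FA): average the prediction over the group, then take the loss.
loss_fa = loss(torch.stack([net(g * x) for g in group]).mean(0), y)

# Equivariant architecture (EA): invariance built into the parametrization itself;
# here a crude stand-in that first maps x to the invariant feature |x|.
ea_net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss_ea = loss(ea_net(x.abs()), y)
```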
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Jan E. Gerken.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[Chalmers University of Technology|Chalmers University of Technology]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Javier Maass Martinez.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[University of Chile|University of Chile]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Joaquin Fontbona.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[University of Chile|University of Chile]]"
share: true
---
@@ -0,0 +1,3 @@
---
share: true
---
@@ -0,0 +1,3 @@
---
share: true
---
