[PUBLISHER] Merge #14
* PUSH NOTE : Joaquin Fontbona.md

* PUSH NOTE : Javier Maass Martinez.md

* PUSH NOTE : Jan E. Gerken.md

* PUSH NOTE : Symmetries in Overparametrized Neural Networks - A Mean-Field View.md

* PUSH NOTE : Emergent Equivariance in Deep Ensembles.md

* PUSH NOTE : Color Space Transformation Network.md

* PUSH NOTE : Block Transformer - Global-to-Local Language Modeling for Fast Inference.md

* PUSH NOTE : Provably Strict Generalisation Benefit for Equivariant Models.md

* PUSH NOTE : In Search of Projectively Equivariant Networks.md

* PUSH NOTE : University of Chile.md

* PUSH NOTE : Chalmers University of Technology.md
dgcnz committed Jun 6, 2024
1 parent 13bc7df commit 620c2f4
Showing 11 changed files with 108 additions and 0 deletions.
@@ -0,0 +1,21 @@
---
authors:
- "[[Namgyu Ho|Namgyu Ho]]"
- "[[Sangmin Bae|Sangmin Bae]]"
- "[[Taehyeon Kim|Taehyeon Kim]]"
- "[[Hyunjik Jo|Hyunjik Jo]]"
- "[[Yireun Kim|Yireun Kim]]"
- "[[Tal Schuster|Tal Schuster]]"
- "[[Adam Fisch|Adam Fisch]]"
- "[[James Thorne|James Thorne]]"
- "[[Se-Young Yun|Se-Young Yun]]"
year: 2024
tags:
- efficient_dl
- transformers
url: https://arxiv.org/abs/2406.02657
share: true
---
> [!info] Abstract
> This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context, therefore we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through novel application of global-to-local modeling. Code is available at https://github.com/itsnamgyu/block-transformer.
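
The global-to-local split described above can be sketched in a few lines of PyTorch. Everything below (module names, the concatenation-based block embedder, layer counts, and the omission of causal masking) is an illustrative simplification, not the authors' released implementation; see the linked repository for the real one.

```python
import torch
import torch.nn as nn

class BlockTransformerSketch(nn.Module):
    """Toy global-to-local decoder: coarse attention over block embeddings,
    then local attention within each block conditioned on its context."""
    def __init__(self, vocab_size=32000, d_model=512, block_len=4, n_heads=8):
        super().__init__()
        self.block_len = block_len
        self.embed = nn.Embedding(vocab_size, d_model)
        # Embedder: aggregate each block of `block_len` tokens into one embedding.
        self.block_proj = nn.Linear(block_len * d_model, d_model)
        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Global block decoder: self-attention only over coarse block embeddings.
        self.block_decoder = nn.TransformerEncoder(layer(), num_layers=2)
        # Local token decoder: attends only within a block plus its context slot,
        # so no global KV cache is touched here. Causal masking omitted for brevity.
        self.token_decoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (B, T), T % block_len == 0
        B, T = tokens.shape
        L, D = self.block_len, self.embed.embedding_dim
        x = self.embed(tokens)                          # (B, T, D)
        blocks = self.block_proj(x.view(B, T // L, L * D))   # (B, T/L, D)
        ctx = self.block_decoder(blocks)                # coarse global context per block
        # Prepend each block's context embedding to its own token embeddings.
        local = torch.cat([ctx.unsqueeze(2), x.view(B, T // L, L, D)], dim=2)
        out = self.token_decoder(local.view(B * (T // L), L + 1, D))[:, 1:, :]
        return self.lm_head(out.reshape(B, T, D))       # next-token logits
```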
@@ -0,0 +1,11 @@
---
authors:
- "[[Alexandros Karargyris|Alexandros Karargyris]]"
year: 2015
tags:
- cnn
url: https://arxiv.org/abs/1511.01064
share: true
---
> [!info] Abstract
> Deep networks have become very popular over the past few years. The main reason for this widespread use is their excellent ability to learn and predict knowledge in a very easy and efficient way. Convolutional neural networks and auto-encoders have become the normal in the area of imaging and computer vision achieving unprecedented accuracy levels in many applications. The most common strategy is to build and train networks with many layers by tuning their hyper-parameters. While this approach has proven to be a successful way to build robust deep learning schemes it suffers from high complexity. In this paper we introduce a module that learns color space transformations within a network. Given a large dataset of colored images the color space transformation module tries to learn color space transformations that increase overall classification accuracy. This module has shown to increase overall accuracy for the same network design and to achieve faster convergence. It is part of a broader family of image transformations (e.g. spatial transformer network).
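
A plausible minimal reading of such a module is a learnable per-pixel channel mixer placed in front of an ordinary classifier. The sketch below assumes a small 1x1-convolution parametrization, which may differ from the paper's exact design.

```python
import torch.nn as nn

class ColorTransform(nn.Module):
    """Learnable per-pixel color-space transformation (channel mixing only)."""
    def __init__(self, hidden=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=1),  # 1x1 conv: mixes RGB channels, no spatial context
            nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=1),  # map back to a 3-channel "color space"
        )

    def forward(self, x):                         # x: (B, 3, H, W)
        return self.net(x)

# Trained end-to-end in front of any classifier, e.g.:
model = nn.Sequential(
    ColorTransform(),
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
```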
@@ -0,0 +1,13 @@
---
authors:
- "[[Jan E. Gerken|Jan E. Gerken]]"
- "[[Pan Kessel|Pan Kessel]]"
year: 2024
tags:
- equivariance
- dl_theory
url: https://arxiv.org/abs/2403.03103
share: true
---
> [!info] Abstract
> We demonstrate that deep ensembles are secretly equivariant models. More precisely, we show that deep ensembles become equivariant for all inputs and at all training times by simply using data augmentation. Crucially, equivariance holds off-manifold and for any architecture in the infinite width limit. The equivariance is emergent in the sense that predictions of individual ensemble members are not equivariant but their collective prediction is. Neural tangent kernel theory is used to derive this result and we verify our theoretical insights using detailed numerical experiments.
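
The flavor of the claim can be checked numerically for a finite group. The toy measurement below (random, untrained CNNs, 90-degree rotations, 50 members — all illustrative choices) only shows the kind of gap one would measure; the paper's actual statement concerns data-augmented training and the infinite-width NTK limit.

```python
import torch
import torch.nn as nn

def make_member():
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

torch.manual_seed(0)
ensemble = [make_member() for _ in range(50)]
x = torch.randn(1, 1, 28, 28)
x_rot = torch.rot90(x, k=1, dims=(2, 3))        # group action: 90-degree rotation

with torch.no_grad():
    preds = torch.stack([m(x) for m in ensemble])        # (50, 1, 10)
    preds_rot = torch.stack([m(x_rot) for m in ensemble])
    member_gap = (preds - preds_rot).abs().mean()        # typical single-member violation
    ensemble_gap = (preds.mean(0) - preds_rot.mean(0)).abs().mean()  # collective violation
    print(f"member gap {member_gap:.4f}  vs  ensemble gap {ensemble_gap:.4f}")
```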
@@ -0,0 +1,14 @@
---
authors:
- "[[Georg Bokman|Georg Bokman]]"
- "[[Axel Flinth|Axel Flinth]]"
- "[[Fredrik Kahl|Fredrik Kahl]]"
year: 2022
tags:
- equivariance
- dl_theory
url: https://arxiv.org/abs/2209.14719
share: true
---
> [!info] Abstract
> Equivariance of linear neural network layers is well studied. In this work, we relax the equivariance condition to only be true in a projective sense. We propose a way to construct a projectively equivariant neural network through building a standard equivariant network where the linear group representations acting on each intermediate feature space are "multiplicatively modified lifts" of projective group representations. By theoretically studying the relation of projectively and linearly equivariant linear layers, we show that our approach is the most general possible when building a network out of linear layers. The theory is showcased in two simple experiments.
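
For orientation, "equivariant in a projective sense" roughly means equivariant up to a nonzero scalar that may depend on the group element; with input and output group representations rho_1 and rho_2, the relaxed condition can be sketched (in my notation, not necessarily the paper's) as:

```latex
% Ordinary equivariance:        f(\rho_1(g) x) = \rho_2(g) f(x)
% Projective ("up to scale"):   equality only up to a nonzero scalar c(g)
f\bigl(\rho_1(g)\,x\bigr) = c(g)\,\rho_2(g)\,f(x), \qquad c(g) \neq 0 .
```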
@@ -0,0 +1,14 @@
---
authors:
- "[[Bryn Elesedy|Bryn Elesedy]]"
- "[[Sheheryar Zaidi|Sheheryar Zaidi]]"
year: 2021
tags:
- dl_theory
- equivariance
url: https://arxiv.org/abs/2102.10333
share: true
---
> [!info] Abstract
> It is widely believed that engineering a model to be invariant/equivariant improves generalisation. Despite the growing popularity of this approach, a precise characterisation of the generalisation benefit is lacking. By considering the simplest case of linear models, this paper provides the first provably non-zero improvement in generalisation for invariant/equivariant models when the target distribution is invariant/equivariant with respect to a compact group. Moreover, our work reveals an interesting relationship between generalisation, the number of training examples and properties of the group action. Our results rest on an observation of the structure of function spaces under averaging operators which, along with its consequences for feature averaging, may be of independent interest.
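
The averaging operator mentioned at the end is easy to write down for a finite group. The sketch below uses cyclic coordinate shifts acting on R^d and a plain linear predictor — illustrative choices, not the paper's setting, which covers general compact groups.

```python
import numpy as np

d = 6
rng = np.random.default_rng(0)
w = rng.normal(size=d)                 # an arbitrary linear predictor f(x) = w @ x

def shift(x, k):                       # group action: cyclic shift of coordinates by k
    return np.roll(x, k)

def averaged_predictor(x):
    # (Q f)(x) = (1/|G|) * sum_g f(g . x): the orbit-averaged (invariant) part of f.
    return np.mean([w @ shift(x, k) for k in range(d)])

x = rng.normal(size=d)
# The averaged predictor is exactly invariant to the group action:
print(averaged_predictor(x), averaged_predictor(shift(x, 2)))
```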
@@ -0,0 +1,14 @@
---
authors:
- "[[Javier Maass Martinez|Javier Maass Martinez]]"
- "[[Joaquin Fontbona|Joaquin Fontbona]]"
year: 2024
tags:
- dl_theory
- equivariance
url: https://arxiv.org/abs/2405.19995
share: true
---
> [!info] Abstract
> We develop a Mean-Field (MF) view of the learning dynamics of overparametrized Artificial Neural Networks (NN) under data symmetric in law wrt the action of a general compact group G. We consider for this a class of generalized shallow NNs given by an ensemble of N multi-layer units, jointly trained using stochastic gradient descent (SGD) and possibly symmetry-leveraging (SL) techniques, such as Data Augmentation (DA), Feature Averaging (FA) or Equivariant Architectures (EA). We introduce the notions of weakly and strongly invariant laws (WI and SI) on the parameter space of each single unit, corresponding, respectively, to G-invariant distributions, and to distributions supported on parameters fixed by the group action (which encode EA). This allows us to define symmetric models compatible with taking N→∞ and give an interpretation of the asymptotic dynamics of DA, FA and EA in terms of Wasserstein Gradient Flows describing their MF limits. When activations respect the group action, we show that, for symmetric data, DA, FA and freely-trained models obey the exact same MF dynamic, which stays in the space of WI laws and minimizes therein the population risk. We also give a counterexample to the general attainability of an optimum over SI laws. Despite this, quite remarkably, we show that the set of SI laws is also preserved by the MF dynamics even when freely trained. This sharply contrasts the finite-N setting, in which EAs are generally not preserved by unconstrained SGD. We illustrate the validity of our findings as N gets larger in a teacher-student experimental setting, training a student NN to learn from a WI, SI or arbitrary teacher model through various SL schemes. We last deduce a data-driven heuristic to discover the largest subspace of parameters supporting SI distributions for a problem, that could be used for designing EA with minimal generalization error.
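
The three symmetry-leveraging schemes the abstract compares (DA, FA, EA) differ mainly in where the group average is taken. A rough sketch for a shallow network and the group Z_2 acting by sign flips — the group, shapes, targets, and the crude "equivariant" stand-in are all my assumptions:

```python
import torch
import torch.nn as nn

group = [1.0, -1.0]                              # Z_2 acting on inputs by x -> g * x
net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss = lambda pred, y: ((pred - y) ** 2).mean()
x, y = torch.randn(32, 10), torch.randn(32, 1)   # pretend y is G-invariant

# Data augmentation (DA): average the loss over group-transformed inputs.
loss_da = torch.stack([loss(net(g * x), y) for g in group]).mean()

# Feature averaging (FA): average the prediction over the group, then take the loss.
loss_fa = loss(torch.stack([net(g * x) for g in group]).mean(0), y)

# Equivariant architecture (EA): invariance built into the parametrization itself;
# here a crude stand-in that first maps x to the invariant feature |x|.
ea_net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss_ea = loss(ea_net(x.abs()), y)
```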
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Jan E. Gerken.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[Chalmers University of Technology|Chalmers University of Technology]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Javier Maass Martinez.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[University of Chile|University of Chile]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Joaquin Fontbona.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[University of Chile|University of Chile]]"
share: true
---
@@ -0,0 +1,3 @@
---
share: true
---
@@ -0,0 +1,3 @@
---
share: true
---
