A layer in a neural network can be seen as a function that takes a multi-dimensional input and produces an output. For simplicity, let's assume the input and output dimensions are the same.

Fully-connected Layer

In a fully-connected layer, each output is just a linear combination of all the inputs.

$\textcolor{FF7800}{\textbf{y}} = \textcolor{9966FF}{W} \textcolor{2EC27E}{\textbf{x}} $

In a fully connected layer, each output depends on (and is connected to) all inputs

In fact, in PyTorch, a fully-connected layer is just represented by the Linear layer:

import torch
torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)

The weight matrix $\textcolor{9966FF}{W}$ is the learnable parameter; it learns how strongly each output is connected to each of the inputs.
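
As a minimal sketch (the 8-dimensional sizes below are arbitrary), the weight matrix has one learnable entry for every (output, input) pair:

import torch

fc = torch.nn.Linear(in_features=8, out_features=8)

x = torch.randn(8)         # input vector x
y = fc(x)                  # y = W x + b

print(fc.weight.shape)     # torch.Size([8, 8]): one weight per (output, input) pair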

The problem with the fully-connected layer is that it requires a large number of learnable parameters. Since each output is connected to all inputs, many weights have to be learned. But for most input types, connecting to all inputs is wasteful, since dependencies are often local (e.g. spatio-temporal neighbors) rather than global.

In principle, connecting to all inputs shouldn't be a problem: given enough training samples and enough time to train, the network should be able to learn to assign meaningful (and sparse) weights to some locations (like the spatio-temporal neighbors) rather than to the rest, which could then be exploited by post-processing steps such as pruning to speed up inference.

But, especially for data such as images, where we know the inherent local dependencies of pixels on their spatial neighbors, it is a good idea to limit the layer to such a neighborhood.
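
To make the cost concrete, here is a rough comparison (the 64×64 single-channel image and the 3×3 kernel are arbitrary choices, and the convolutional layer is the topic of the next section): a fully-connected layer mapping the flattened image to an output of the same size needs millions of weights, while a layer that only connects each output to its 3×3 neighborhood needs ten.

import torch

fc = torch.nn.Linear(64 * 64, 64 * 64)                   # every pixel connected to every output
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)   # each output connected to a 3x3 neighborhood

print(sum(p.numel() for p in fc.parameters()))     # 16781312 (4096 * 4096 weights + 4096 biases)
print(sum(p.numel() for p in conv.parameters()))   # 10 (3 * 3 weights + 1 bias)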

Convolution

Convolutional neural networks (CNNs) are typically used for spatial data such as images, where there is a spatial relationship in the data, or for temporal data such as audio. For example, neighboring pixels (in the X or Y direction) are related to each other. A convolutional filter is applied to such data to extract features such as edges and textures in images.

A convolutional layer only attends to its neighbors
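
A minimal sketch of a single convolutional filter (the 3×3 Laplacian-style kernel below is hand-set for illustration; in a CNN the kernel weights would be learned): each output pixel depends only on its 3×3 neighborhood in the input.

import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 64, 64)            # (batch, channels, height, width)

kernel = torch.tensor([[[[ 0., -1.,  0.],    # a hand-set edge-detection kernel;
                         [-1.,  4., -1.],    # in a CNN these 9 weights would be learned
                         [ 0., -1.,  0.]]]])

edges = F.conv2d(image, kernel, padding=1)   # every output pixel sees only a 3x3 neighborhood
print(edges.shape)                           # torch.Size([1, 1, 64, 64])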

Attention

Transformers, on the other hand, are typically used for sequential data such as text and natural language, where both short-term and long-term dependencies are present. The actual dependencies are not explicit in this case. For example, in the sentence "Alice had gone to the supermarket to meet Bob", the verb "meet" is located far away from the subject "Alice", and this dependency is not spatial; its distance varies from sentence to sentence. This is even more pronounced for longer inputs with multiple paragraphs, where the final sentence could depend on a sentence somewhere in the beginning. Transformers are based on the so-called attention mechanism, which learns these relationships between the elements in the sequence.

Attention is the key component of Transformers and has been successfully applied to image data. Image: Odontodactylus scyllarus eyes, Davide Coccomini, CC BY-SA 4.0, via Wikimedia Commons

Basic Self-attention

The basic idea of self-attention is to assign different importance to the inputs based on the inputs themselves. In comparison to convolution, self-attention allows the receptive field to span all spatial locations.

An attention layer assigns importance to the inputs based on the inputs themselves

The weights are computed from the dot-product similarity of the inputs $\textcolor{2EC27E}{\textbf{x}}$ with themselves:

$\textcolor{9966FF}{W} = \textcolor{2EC27E}{\textbf{x}} \textcolor{2EC27E}{\textbf{x}^T}$
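
A minimal sketch of this parameter-free form, for an arbitrary sequence of 5 inputs of dimension 8 (in practice the rows of W are usually normalized, e.g. with a softmax, before the weighted sum):

import torch

x = torch.randn(5, 8)    # a sequence of 5 inputs, each 8-dimensional

w = x @ x.T              # W = x x^T: similarity of every input with every other input
y = w @ x                # y = W x: each output is a similarity-weighted sum of all inputs

print(w.shape, y.shape)  # torch.Size([5, 5]) torch.Size([5, 8])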

Self-attention

Note that this basic version of self-attention does not include any learnable parameters. For this reason, the "Attention is all you need"[1] variation of self-attention includes three learnable weight matrices (key $\textcolor{9966FF}{W_K}$, query $\textcolor{9966FF}{W_Q}$ and value $\textcolor{9966FF}{W_V}$), but the basic principle remains the same. The key $\textcolor{9966FF}{W_K}$ and query $\textcolor{9966FF}{W_Q}$ matrices are used to transform the input into a key $\textcolor{2EC27E}{\textbf{k}} = \textcolor{9966FF}{W_K} \textcolor{2EC27E}{\textbf{x}}$ and a query $\textcolor{2EC27E}{\textbf{q}} = \textcolor{9966FF}{W_Q} \textcolor{2EC27E}{\textbf{x}}$, whose similarity $\textcolor{2EC27E}{\textbf{q}} \textcolor{2EC27E}{\textbf{k}^\mathrm{T}}$ weighs the value, $\textcolor{FF7800}{\textbf{v}} = \textcolor{9966FF}{W_V} \textcolor{2EC27E}{\textbf{x}}$.[2]

$\textcolor{FF7800}{\textbf{y}} = \text{softmax}\left(\frac{\textcolor{2EC27E}{\textbf{q}} \textcolor{2EC27E}{\textbf{k}^\mathrm{T}}}{\sqrt{d_k}}\right)\textcolor{FF7800}{\textbf{v}}$

Also note that in the basic version, the self-similarity of the inputs means the diagonal entries always have the highest similarity, and it makes the weight matrix symmetric. This problem is alleviated by transforming the same input with two separate learnable weight matrices, $\textcolor{9966FF}{W_K}$ and $\textcolor{9966FF}{W_Q}$.
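
A minimal single-head sketch of this formulation (dimensions arbitrary, projections untrained), implementing the scaled dot-product attention of the equation above:

import math
import torch

d_k = 8
x = torch.randn(5, d_k)                       # a sequence of 5 inputs

w_q = torch.nn.Linear(d_k, d_k, bias=False)   # W_Q
w_k = torch.nn.Linear(d_k, d_k, bias=False)   # W_K
w_v = torch.nn.Linear(d_k, d_k, bias=False)   # W_V

q, k, v = w_q(x), w_k(x), w_v(x)              # query, key and value projections of the same input

scores = q @ k.T / math.sqrt(d_k)             # q k^T / sqrt(d_k); not symmetric, since W_Q != W_K
attn = torch.softmax(scores, dim=-1)          # attention weights
y = attn @ v                                  # weighted sum of the values

print(y.shape)                                # torch.Size([5, 8])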

Convolution vs. Attention: Which is better?

Although attention-based models such as vision transformers have been shown to outperform CNN-based methods, a careful analysis of the two shows comparable performance.[3]

In the early layers of a neural network for images, spatial relations can be captured by convolutions, and the later layers can benefit from the long-range receptive fields offered by attention. Hence, both can be combined.[4]
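
A minimal sketch of such a hybrid (an illustration of the idea only, not the actual CoAtNet architecture): a small convolutional stage extracts local features, whose spatial positions are then related globally by a multi-head self-attention layer.

import torch
import torch.nn as nn

class ConvThenAttention(nn.Module):
    """Illustrative hybrid: convolutions early, self-attention later (not CoAtNet itself)."""

    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(                    # early stage: local spatial features
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, img):                           # img: (batch, 3, H, W)
        feats = self.conv(img)                        # (batch, channels, H, W)
        tokens = feats.flatten(2).transpose(1, 2)     # (batch, H*W, channels): one token per location
        out, _ = self.attn(tokens, tokens, tokens)    # later stage: global receptive field
        return out

y = ConvThenAttention()(torch.randn(1, 3, 16, 16))
print(y.shape)                                        # torch.Size([1, 256, 32])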

Summary

  • Different ways of connecting inputs to each other were discussed. A fully-connected layer connects every output to every input. This leads to a rapid, quadratic growth of network parameters and computational complexity with the layer width. While the network can learn to assign different weights, this can take a lot of data and prolonged training.

  • A convolutional layer incorporates desirable inductive biases about the data to reduce computation and connects only to the neighbors. Spatial and temporal data benefit from this. Convolution is translation equivariant. However, the dimensions of the outputs of a convolution depend on the input dimensions.

  • A self-attention layer assigns importance to inputs based on their similarity. For example, in the sentence "Alice is adventurous and she is in wonderland.", the word "she" refers to "Alice", and ideally their embeddings should be similar, which can be used by the self-attention layer to determine context. As in a fully-connected layer, far-away connections can be established if the input features or embeddings are similar. However, not having enough data may lead to overfitting.

  • In the early layers of a neural network for images, spatial relations can be captured by convolutions, and the later layers can benefit from the long-range receptive fields offered by attention. Hence, both can be combined. Works such as CoAtNet[5] offer layers combining the two.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. 

  2. https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)#Scaled_dot-product_attention 

  3. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11966-11976. doi: 10.1109/CVPR52688.2022.01167.

  4. Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., & Girshick, R.B. (2021). Early Convolutions Help Transformers See Better. Neural Information Processing Systems. 

  5. Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965-3977.