A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is”. – Source
Beyond the Streetlight – I believe this same phenomenon has occurred for our understanding of the cerebellum. Deficits in fine motor function are very easy to spot and were the simplest function to attribute to cerebellar lesions. This is still what is taught in most textbooks.
However, it is less well known that the cerebellum contains ~70% of all neurons in the brain and is ubiquitous across organisms as varied as humans, fruit flies, and electric fish. Over the last ~20 years in particular, evidence has been accumulating that the cerebellum, first named by Leonardo da Vinci for "little brain," is more important than its size may suggest.
A prominent neuroscientist once said to me:
A dirty little secret in cognitive science is that the cerebellum lights up for almost every task.
Taking an excerpt from a recent paper:
A review of 275 positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies revealed that cerebellar activation was observed during a broad range of tasks, including orienting attention, olfaction, spoken and written language, verbal working memory, problem solving, spatial memory, episodic memory, skill learning, and associative learning (Cabeza & Nyberg, 2000).
A broad range of neuropsychological deficits has also been documented following localized cerebellar pathology, with deficits across tasks of attention, working memory, language and naming, counting, visuospatial processing, planning, and abstract reasoning reported (Kalashnikova, Zveva, Pugacheva, & Korsakova, 2005).
Cerebellar ubiquity – The cerebellum-like structure found in many insects including fruit flies, ants, and bees is the Mushroom Body (MB). The MB is responsible for associative learning and is the primary region of the Drosophila brain that is not genetically pre-wired, instead containing many random connections that undergo learning. Cerebellar equivalents have also been discovered in families of crustaceans and flatworms through likely shared ancestral inheritance and cephalopods through potentially convergent evolution.^{1}
Cerebellar intelligence – Just as a positive correlation has been found between the size of the MB and the intellect of various insects, it has recently been discovered that the human cerebellum is larger than previously thought, with a surface area equal to 78% of that of our neocortex. This ratio is much larger than in other primates; for example, in the macaque monkey the cerebellum is only 33% of the surface area of its neocortex.
Sans cerebellum – Another critique of cerebellar importance is that people can live despite being born without one. This phenomenon is not unique to the cerebellum: people can be born missing entire cortical lobes and show no phenotypic deficits. For example, patient EG, a highly intelligent lawyer who taught herself Russian later in life and scored in the 98th percentile for vocabulary, was only found at the age of 25 to be missing her entire left temporal lobe – the canonical center of language processing! Her right temporal lobe was found to have entirely compensated.^{2}
Being familiar with cases like patient EG, it is reasonable to assume the same compensation and lack of phenotypic deficiency would occur with a missing cerebellum. However, this turns out not to be the case: its absence permanently harms much more than fine motor control, including language development and emotion. Here are excerpts from an interview with Jonathan, who was born without a cerebellum:
“All his milestones were late: sitting up, walking, talking.” […] He also lacks the balance to ride a bicycle.
“Reaction time, not my strong suit,” Jonathan says, adding that he doesn’t drive anymore. Emotional complexity is another challenge for Jonathan, says his sister, Sarah Napoline. She says her brother is a great listener, but isn’t introspective.
“He doesn’t really get into this deeper level of conversation that builds strong relationships, things that would be the foundation for a romantic relationship or deep enduring friendships,” she says. Jonathan, who is sitting beside her, says he agrees. – Source
Summary – The fact the same fundamental cerebellar architecture appears across such diverse species, the growing evidence of its involvement in most cognitive functions, and the correlation between its size and intelligence, all indicate that cerebellum-like neuronal architectures are performing a crucial and differentiated cognitive operation that spans far beyond motor control.
(Shameless plug: Sparse Distributed Memory is a theory for cerebellar function that is also a close approximation to the state of the art Transformer deep learning architecture: Attention Approximates Sparse Distributed Memory.)
Thanks to Jamie Simon for spotting a typo. All remaining errors are mine and mine alone.
Note that the extent to which the cephalopod vertical lobe approximates the cerebellum is contested (pro convergence; contra convergence). ↩
It is wild to me that there is no consequence of missing an entire lobe in the brain. Across animals, the number of neurons (as a ratio of body size) matters a lot for intelligence. Patient EG is highly intelligent but it seems plausible that she could have been even smarter if she was not missing a large number of neurons (computational units). ↩
If you found this post useful for your research please use this citation:
@misc{CerebellumBeyondMotorControl,
title={The Cerebellum Beyond Motor Control},
url={https://www.trentonbricken.com/Cerebellum-Beyond-Motor-Control/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={November}}
A paper, “Recasting Self-Attention with Holographic Reduced Representations”, was recently posted that claims to use Holographic Reduced Representations (HRRs) to “recast” Transformer Self Attention. While the paper shows some interesting empirical results, I explain why I think the work is flawed in its theoretical underpinnings.
I’m taking the time to write this critique because I believe it is a critical period for Vector Symbolic Architectures (VSAs) to interface with Deep Learning and that this work represents VSAs poorly.
For some background, the Transformer Self Attention equation for a single query vector is the following:
\[V \text{softmax}(K^T \mathbf{q}_t)\]
where our values and keys are vectors of dimension \(d\) stored columnwise in matrices \(K \in \mathbb{R}^{d\times T}\), \(V \in \mathbb{R}^{d\times T}\), and we are only considering a single query vector \(\mathbf{q}_t \in \mathbb{R}^{d}\). \(T\) is the number of tokens in the receptive field of the model and the \(t\) subscript is the current time point from which we are predicting the next token. This time point determines the current query and will become crucial later.
We will write:
\[\mathbf{\hat{a}}_t = K^T \mathbf{q}_t = [ \mathbf{k}_1^T \mathbf{q}_t, \mathbf{k}_2^T \mathbf{q}_t , \dots, \mathbf{k}_T^T \mathbf{q}_t ]^T,\]
to be the attention vector before the softmax operation. Creating this vector takes \(O(dT)\) compute (\(T\) dot products, each between \(d\)-dimensional vectors).
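For concreteness, here is a minimal numpy sketch (my own illustration, not the paper's code) of this single-query attention read; the toy values and the use of orthonormal keys are assumptions for the demonstration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(K, V, q):
    """Single-query attention V softmax(K^T q): O(dT) for the T dot products."""
    a_hat = K.T @ q          # pre-softmax attention vector of length T
    return V @ softmax(a_hat)

# Toy check with orthonormal keys: a query aligned with key 2 retrieves value 2.
rng = np.random.default_rng(0)
d, T = 8, 5
K, _ = np.linalg.qr(rng.normal(size=(d, T)))  # orthonormal key columns
V = rng.normal(size=(d, T))
out = attend(K, V, 20 * K[:, 2])  # strong match on key 2
assert np.allclose(out, V[:, 2], atol=1e-3)
```

The softmax turns the key-query similarities into a convex combination over the value columns, which is why a strongly matching key dominates the output.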
Now here is what the paper is doing:
#1. Bind keys and values across the sequence together using the VSA bind operator \(\otimes\) to create the superposition vector \(\mathbf{s}_{kv}\):
\[\mathbf{s}_{kv} = \sum_i^T \mathbf{k}_i \otimes \mathbf{v}_i\]
All you need to know about the bind operator is that it produces another \(d\)-dimensional vector and is invertible, where \((\mathbf{a} \otimes \mathbf{b}) \otimes \mathbf{a}^{-1} = \mathbf{b}\).
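In HRRs specifically, binding is circular convolution and the (approximate) inverse is an index involution. A small numpy sketch, my own illustration rather than the paper's code, shows that unbinding recovers the bound value only approximately once other bound pairs sit in superposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2048  # high dimensionality keeps the crosstalk noise manageable

def bind(a, b):
    # HRR binding: circular convolution, computed via the FFT.
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n)

def inv(a):
    # Index involution [a_0, a_{n-1}, ..., a_1]: HRR's approximate inverse.
    return np.concatenate(([a[0]], a[:0:-1]))

def rand_vec():
    # i.i.d. N(0, 1/n) entries give approximately unit-norm vectors.
    return rng.normal(scale=1 / np.sqrt(n), size=n)

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

k = [rand_vec() for _ in range(3)]
v = [rand_vec() for _ in range(3)]
s_kv = sum(bind(ki, vi) for ki, vi in zip(k, v))  # superposition of bound pairs

z = bind(s_kv, inv(k[0]))  # unbind with k_0^{-1}: v_0 plus a noise term
assert cos(z, v[0]) > 0.3       # recognizable match to the bound value...
assert abs(cos(z, v[1])) < 0.2  # ...while other values look like noise
```

Note the recovered vector is already noisy with only three pairs in superposition; this crosstalk is the \(\epsilon\) term that appears below.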
#2. Create a superposition of all queries across the sequence:
\[\mathbf{s}_q = \sum_i^T \mathbf{q}_i\]
#3. Unbind the query superposition from the key value superposition (this computes the query key dot products between all queries and keys but in superposition):
\[\begin{align} \mathbf{z} &= \mathbf{s}_{kv} \otimes \mathbf{s}_q^{-1} = \Big ( \sum_i^T ( \mathbf{k}_i \otimes \mathbf{v}_i ) \Big ) \otimes \Big (\sum_i^T \mathbf{q}_i \Big )^{-1} \\ &= \mathbf{q}_1^{-1} \otimes \Big ( \mathbf{k}_1 \otimes \mathbf{v}_1 + \dots + \mathbf{k}_T \otimes \mathbf{v}_T \Big ) + \dots + \mathbf{q}_T^{-1} \otimes \Big ( \mathbf{k}_1 \otimes \mathbf{v}_1 + \dots + \mathbf{k}_T \otimes \mathbf{v}_T \Big ) \end{align}\]
#4. Extract the attention weights by taking a cosine similarity (CS) between each value vector and \(\mathbf{z}\), where \(\epsilon\) is a noise term for everything without the corresponding \(\mathbf{v}_i\) match.
\[\begin{align} \mathbf{\tilde{a}}_t &= [ \text{CS}(\mathbf{v}_1, \mathbf{z}), \dots, \text{CS}(\mathbf{v}_T, \mathbf{z}) ]^T \\ &= [ \text{CS}(\mathbf{v}_1, \mathbf{v}_1 \otimes \mathbf{k}_1 \otimes \sum_i^T \mathbf{q}_i^{-1} +\epsilon ), \dots, \text{CS}(\mathbf{v}_T, \mathbf{v}_T \otimes \mathbf{k}_T \otimes \sum_i^T \mathbf{q}_i^{-1} +\epsilon) ]^T \\ &\approx [ \sum_i^T \mathbf{k}_1^T \mathbf{q}_i +\epsilon, \dots, \sum_i^T \mathbf{k}_T^T \mathbf{q}_i+\epsilon ]^T \\ \end{align}\]
Can you spot the difference between this \(\mathbf{\tilde{a}}_t\) and the original Self Attention \(\mathbf{\hat{a}}_t\)?
\(\mathbf{\tilde{a}}_t\) computes the dot product between the key vector and every query! Not just the current query \(\mathbf{q}_t\), which should be the only query used to predict the next token. This means that every attention weight vector is the same across the entire sequence: \(\mathbf{\tilde{a}}_i = \mathbf{\tilde{a}}_j \;\; \forall i,j \in [1,T]\).
There are two ways to modify this approach so that it is a true recasting of Attention; however, both of them remove the speedup claimed by the paper, leaving only the increase in noise from \(\epsilon\)!
First, if a masked language setting is implemented correctly^{1}, at e.g. \(t=5\) we don’t have access to the keys, queries and values for \(t>5\). This means that as we move across the sequence, we incrementally add queries to our query superposition and keys/values to our key+value superposition and compute all of the above equations (#1-#4) each time. This means we have \(O(dT^2)\) where \(d\) is the dimensionality and \(T\) is the sequence length (\(dT\) operations for a single query because we compute cosine similarity using every value vector and we repeat this for every incremental query in the sequence).
Second, rather than adding more vectors to the superposition, making it noisier, we can keep each query separate when we perform the above operations. However, this is again \(O(dT^2)\) complexity and reveals how using VSAs here doesn’t make sense. We bind together every key and value vector to compute a noisy dot product with the query in superposition, only to then unbind all of them again? This is more expensive and noisier than merely doing a dot product between every key and query as in the original attention operation!
To conclude, while I share with the paper's authors the desire to integrate VSAs into Deep Learning, the way it has been done here is ineffective and misleading. What is created is not a recasting of the Attention operation. It is surprising that it does better than baselines on some idiosyncratic benchmarks, and this may also be due to an implementation error.
Please reach out if I am missing something about this paper as I am happy to discuss it and revise this blog post.
Thanks to Denis Kleyko for helpful comments and discussion. All remaining errors are mine and mine alone.
I am concerned that the results in the paper that beat benchmarks are the result of incorrect masking. ↩
If you found this post useful for your research please use this citation:
@misc{HRRsRecastingAttention,
title={HRRs Can't Recast Self Attention},
url={https://www.trentonbricken.com/Contra-Recasting-Attention/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={November}}
Going off citation counts for their original, seminal papers, Hopfield Networks are ~18x more popular than Sparse Distributed Memory (SDM) (24,362 citations versus 1,337). I think this is a shame because SDM can not only be viewed as a generalization of both the original and more modern Hopfield Networks but also passes a higher bar for biological plausibility – having a one-to-one mapping to the circuitry of the cerebellum. Additionally, like Hopfield Networks, SDM has been shown to closely relate to the powerful Transformer deep learning architecture.
In this blog post, we first provide background on Hopfield Networks. We then review how Sparse Distributed Memory (SDM) is a more general form of the original Hopfield Network. Finally, we provide insight into how modern improvements to the Hopfield Network modify the weighting of patterns, making them even more convergent with SDM. (In other words, SDM can be seen as pre-empting modern Hopfield Net innovations).
The fundamental difference between SDM and Hopfield Networks lies in the primitives they use. In SDM, the core primitive is neurons that patterns are written into and read from. Hopfield Networks do a figure-ground inversion, where the core primitive is patterns and it is from their storage/retrieval that neurons implicitly appear.
To make this more concrete, we first provide a quick background on how SDM works:
If you like videos then watch the first 10 mins of this talk I gave on how SDM works and skip the rest of this section.
To keep things simple, we will introduce the continuous version of SDM, where all neurons and patterns exist on the \(L^2\) unit norm hypersphere and cosine similarity is our distance metric. The original version of SDM used binary vectors and the Hamming distance metric.
SDM randomly initializes the addresses of \(r\) neurons on the \(L^2\) unit hypersphere in an \(n\) dimensional space. These neurons have addresses that each occupy a column in our address matrix \(X_a \in (L^2)^{n\times r}\), where \(L^2\) is shorthand for all \(n\)-dimensional vectors existing on the \(L^2\) unit norm hypersphere. Each neuron also has a storage vector used to store patterns represented in the matrix \(X_v \in \mathbb{R}^{o\times r}\), where \(o\) is the output dimension.
Patterns also have addresses constrained on the \(n\)-dimensional \(L^2\) hypersphere that are determined by their encoding; pattern encodings can be as simple as flattening an image into a vector or as complex as preprocessing the image through a deep convolutional network.
Patterns are stored by activating all nearby neurons within a cosine similarity threshold \(c\) and performing an elementwise summation with the activated neurons' storage vectors. Depending on the task at hand, patterns write themselves into the storage vector (e.g., during a reconstruction task) or write another pattern, possibly of a different dimension (e.g., writing in their one-hot label for a classification task).
Because in most cases we have fewer neurons than patterns, the same neuron will be activated by multiple different patterns. This is handled by storing the pattern values in superposition via the aforementioned elementwise summation operation. The fidelity of each pattern stored in this superposition is a function of the vector orthogonality and dimensionality \(n\).
Using \(m\) to denote the number of patterns, matrix \(P_a \in (L^2)^{n\times m}\) for the pattern addresses, and matrix \(P_v \in \mathbb{R}^{o\times m}\) for values the patterns want to write, the SDM write operation is:
\begin{align}
\label{eq:SDMWriteMatrix}
X_v = P_v b_c \big ( P_a^T X_a \big ), \qquad
b_c(e)=
\begin{cases}
1, & \text{if}\ e \geq c \\
0, & \text{else}
\end{cases}
\end{align}
where \(b_c(e)\) performs an element-wise binarization of its input to determine which pattern and neuron addresses are within the cosine similarity threshold \(c\) of each other.
Having written patterns into our neurons, we read from the system by inputting a query \(\boldsymbol{\xi}\), which again activates nearby neurons. Each activated neuron outputs its storage vector, and these are summed elementwise to give a final output \(\mathbf{y}\). The output \(\mathbf{y}\) can be interpreted as an updated query and optionally \(L^2\) normalized again as a post-processing step:
\begin{align} \label{eq:SDMReadMatrix} \mathbf{y} = X_v b_c \big ( X_a^T \boldsymbol{\xi} \big). \end{align}
Intuitively, SDM's query will update towards the values of the patterns with the closest addresses. This is because the closest patterns will have written their values into more of the neurons the query reads from than any competing pattern. For example, in the summary figure above, the blue pattern address is the closest to the query, meaning it appears most often in the nearby neurons the query reads from.
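Putting the write and read equations together, here is a minimal numpy sketch of continuous SDM; the dimensions and the cosine threshold \(c\) are my own illustrative choices, not values from this post:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m, c = 64, 2000, 5, 0.25  # dims, neurons, patterns, cosine threshold

def unit(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)  # L2-normalize columns

X_a = unit(rng.normal(size=(n, r)))  # random neuron addresses on the hypersphere
P_a = unit(rng.normal(size=(n, m)))  # pattern addresses
P_v = P_a                            # autoassociative: patterns store themselves

# Write: X_v = P_v b_c(P_a^T X_a). Each pattern adds its value vector into
# every neuron whose address is within cosine similarity c of its own.
X_v = P_v @ (P_a.T @ X_a >= c)

# Read: y = X_v b_c(X_a^T xi). Activated neurons' storage vectors are summed.
def read(xi):
    y = X_v @ (X_a.T @ xi >= c)
    return y / np.linalg.norm(y)  # optional L2 post-processing step

# A noisy query near pattern 0 is updated back toward that pattern.
p0 = P_a[:, 0]
xi = p0 + 0.05 * rng.normal(size=n)
xi /= np.linalg.norm(xi)
y = read(xi)
assert y @ p0 > xi @ p0  # the query moved closer to the stored pattern
```

The query reads from neurons that mostly lie inside the write-cap of its nearest pattern, which is why the output leans toward that pattern's value.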
Another potentially useful perspective is to see SDM as a single hidden layer MLP with a few modifications:
In a recent paper, we resolve a number of these modifications to make SDM more compatible with Deep Learning. This results in a model that is very good at continual learning!
In another paper, we show that the SDM update rule is closely approximated by the softmax update rule used by Transformer Attention. This will be relevant later with newer versions of Hopfield Networks that also show this relationship.
Hopfield Networks before the modern continuous version all use bipolar vectors \(\bar{\mathbf{a}} \in \{-1,1\}^n\) where the bar denotes bipolarity. The Hopfield Network update rule in matrix form, in its typical auto-associative format where \(P_v=P_a\) is:^{1}
\begin{equation}
\label{eq:HopfieldUpdateRule}
\bar{\mathbf{y}} = \bar{g}( \bar{P}_a \bar{P}_a^T \bar{\boldsymbol{\xi}} ), \qquad \bar{g}(e)=
\begin{cases}
1, & \text{if}\ e > 0 \\
-1, & \text{else}
\end{cases}
\end{equation}
Interpreting this equation from the perspective of SDM, we first compute a re-scaled and approximate version of the cosine similarity between the query and pattern addresses, \(\bar{P}_a^T \bar{\boldsymbol{\xi}}\). This is done by taking the dot product between bipolar vectors and dividing by \(n\), converting the interval our bipolar values can take from \([-n, n]\) to the \([-1, 1]\) of cosine similarity:
\begin{equation} \label{eq:BipolarHammConversion} \mathbf{x}^T \mathbf{y} \approx \frac{\bar{\mathbf{x}}^T\bar{\mathbf{y}}}{n} \end{equation}
This relationship is exact between binary SDM and bipolar Hopfield, but we leave the details to Appendix B.6 of our paper.
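As a quick sanity check (my own illustration, not from the paper): any bipolar vector has \(L^2\) norm \(\sqrt{n}\), so within the bipolar domain dividing the dot product by \(n\) gives the cosine similarity exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.choice([-1, 1], size=n)
y = rng.choice([-1, 1], size=n)

# ||x|| = ||y|| = sqrt(n) for bipolar vectors, so cos(x, y) = (x . y) / n.
cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos, (x @ y) / n)
```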
We use this distance metric to weight each pattern before mapping back into our bipolar space with \(\bar{g}(\cdot)\).
Instead of first multiplying the query with the pattern addresses (\(\bar{P}_a^T \bar{\boldsymbol{\xi}}\)) like in SDM, Hopfield Networks instead typically perform \(\bar{P}_a \bar{P}_a^T=M\) which gives us a symmetric, \(n \times n\) dimensional matrix. We can interpret this symmetric matrix \(M\) as containing \(n\) neurons where each neuron’s address is a row and its value vector is the corresponding column, which by symmetry is the row transpose. Therefore, the number of neurons is defined by the pattern dimensionality \(n\) and the neuron address and value vectors are derived from \(\bar{P}_a \bar{P}_a^T\). This is how neurons emerge from the patterns in Hopfield Networks.
What is most distinct about this operation in comparison with SDM is that there is no activation threshold between the patterns and query (\(\bar{P}_a^T \bar{\boldsymbol{\xi}}\)). As a result, every pattern has an effect on the update rule including positive attraction and negative repulsion forces. We will see how more modern versions of the Hopfield Network have in fact re-implemented activation thresholds that are increasingly reminiscent of that used by SDM.
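A minimal numpy sketch of this classical update (dimensions and pattern counts are my own illustrative choices) shows both that every pattern contributes with no threshold and that the network still performs pattern completion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 128, 5
P = rng.choice([-1, 1], size=(n, m))  # bipolar patterns stored columnwise

def hopfield_update(xi):
    # y = g(P P^T xi): every stored pattern attracts (or repels) the query,
    # with no activation threshold cutting out dissimilar patterns.
    return np.where(P @ (P.T @ xi) >= 0, 1, -1)

# Flip 10% of the bits of pattern 0; one synchronous update restores it.
xi = P[:, 0].copy()
flip = rng.choice(n, size=n // 10, replace=False)
xi[flip] *= -1
assert np.array_equal(hopfield_update(xi), P[:, 0])
```

With few stored patterns relative to \(n\), the crosstalk from the other patterns is small enough that the sign function snaps every bit back to the stored pattern.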
It was first established by Keeler that SDM is a generalization of Hopfield Networks. Hopfield Networks can be represented with SDM's neuron primitives as a special case where there are no distributed reading or writing operations. The weighting of patterns in the read operation then becomes the bipolar re-scaling of their cosine similarity to the query.
The Hopfield version of SDM has neurons centered at each pattern so \(r=m\) and \(X_a=P_a\). Distributed writing and reading are removed by setting the cosine similarity threshold for writing to \(d_\text{write}=1\) and for reading to \(d_\text{read}=-1\).
This means patterns are only written to the neurons centered at them and the query reads from every neuron. In addition, for reading, rather than binary neuron activations using the Heaviside function \(b(\cdot)\), neurons are weighted by the bipolar version of the cosine similarity given by \(\bar{\mathbf{x}}^T\bar{\mathbf{y}}\). Writing out the SDM read operation in full with bipolar vectors that make our normalizing constant unnecessary, we have:
\[\bar{\mathbf{y}} = \bar{g} \Big ( \bar{X}_v b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{\boldsymbol{\xi}}}{n} \big ) \Big ) = \bar{g} \Big ( \underbrace{ \bar{P}_v b_{d_\text{write}}( \frac{\bar{X}_a^T \bar{P}_a}{n})}_{\text{Write Patterns}} \underbrace{ b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{\boldsymbol{\xi}}}{n} \big ) }_{\text{Read Patterns}} \Big )\]
Looking first at the write operation where \(d_\text{write}=1\):
\[X_v = P_v b_c \big ( P_a^T X_a \big ) = \bar{P}_v b_{d_\text{write}}(\frac{\bar{X}_a^T\bar{P}_a}{n})=\bar{P}_v I = \bar{P}_v=\bar{P}_a\]
where \(I\) is the identity matrix and \(P_v=P_a\) in the typical autoassociative setting. For the read operation, we remove the threshold and the cosine similarity re-scaling:
\[b_{d_\text{read}}(\frac{\bar{X}_a^T\bar{\boldsymbol{\xi}} }{n}) = \bar{X}_a^T \bar{\boldsymbol{\xi}} =\bar{P}_a^T \bar{\boldsymbol{\xi}}.\]
Together, these modifications turn SDM using neuron representations into the original Hopfield Network:
\[\bar{g} \Big ( \underbrace{ \bar{P}_v b_{d_\text{write}}( \frac{\bar{X}_a^T \bar{P}_a}{n})}_{\text{Write Patterns}} \underbrace{ b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{\boldsymbol{\xi}}}{n} \big ) }_{\text{Read Patterns}} \Big ) = \bar{g}( \bar{P}_a \bar{P}_a^T \bar{\boldsymbol{\xi}} )\]
While Hopfield Networks have traditionally been presented and used in an autoassociative fashion, by using a synchronous update rule they can also be heteroassociative, \(\bar{\mathbf{y}} = \bar{g}( \bar{P}_v \bar{P}_a^T \bar{\boldsymbol{\xi}} )\), but this does not work as well. One solution is to operate autoassociatively but concatenate together the pattern address and pointer, as was introduced here. However, this concatenation is less biologically plausible than SDM's solution, which has separate input lines capable of carrying different key and value vectors.
A second and more important difference between SDM and Hopfield Networks is that the latter are biologically implausible because of the weight transport problem, whereby the afferent and efferent weights must be symmetric. At the expense of biological plausibility, these symmetric weights allow for the derivation of an energy landscape that can be used for convergence proofs and to solve optimization problems like the Traveling Salesman Problem. Meanwhile, SDM is not only free from weight symmetry but also has its mapping onto the cerebellum outlined here.
A third difference between SDM and Hopfield Networks lies in how they weight their patterns. We can interpret the Hopfield update as computing the similarity \(P_a^T \boldsymbol{\xi}\) which, because of the bipolar values, has a maximum of \(n\), a minimum of \(-n\), and moves in increments of 2 (flipping a +1 to a -1 or vice versa). This distance metric between each pattern and the query results in the query being attracted to similar patterns and repulsed from dissimilar ones. Distinctly, all patterns aside from those that are completely orthogonal contribute to the query update, while in SDM it has been shown that nearby patterns receive an approximately exponential weighting and patterns further away than \(d(\boldsymbol{\xi},\mathbf{p_a})>2d\) receive no weighting at all. We will expand upon this weighting and how modern improvements to Hopfield Networks have made them a closer relation to SDM.
Finally, both Hopfield Networks and SDM, while using different parameters, have been shown to have the same information content as a function of their parameter count. Yet, these parameters are used in different ways because of SDM’s neuron primitive. Notably, SDM can increase its memory capacity up to a limit by increasing the number of neurons \(r\) rather than needing to increase \(n\), the dimensionality of the patterns being stored.
The binary modern Hopfield Network introduced by Krotov and Hopfield showed that using higher order polynomials to assign new pattern weightings in the read operation increases capacity. In particular, odd and rectified polynomials put more weight on memories that are closer to the query, better separating the signal of the target pattern from the noise of all others. The energy function to be minimized is:
\[E=-\sum_{\bar{\mathbf{p}}\in \bar{P}} K_a \Big( \bar{\mathbf{p}}_a^T \bar{\boldsymbol{\xi}} \Big )=-\sum_{\bar{\mathbf{p}}\in \bar{P}} K_a \Big( \sum_i^n [\bar{\mathbf{p}}_a]_i \bar{\boldsymbol{\xi}}_i \Big )\]
where:
\[K_a(x)= \begin{cases} x^a, & x \geq 0 \\ 0, & x < 0 \end{cases}\]
with the \(x<0\) rectification being optional and \(a\) being the order of the polynomial. It can be shown that the original Hopfield Network energy function used a second order polynomial, \(a=2\). The query updates its bit in position \(i\) by comparing the difference in energy if this bit were "on" or "off" (1 or -1 in the bipolar setting):
\[\label{eq:ModernHopfieldEnergyEquation} \bar{\mathbf{y}}_i = \text{Sign}\Bigg[ \sum_{\bar{\mathbf{p}}\in \bar{P}} \Big ( K \big( [\bar{\mathbf{p}}_a]_i + \sum_{j \neq i} [\bar{\mathbf{p}}_a]_j \bar{\boldsymbol{\xi}}_j \big ) - K \big( - [\bar{\mathbf{p}}_a]_i + \sum_{j \neq i} [\bar{\mathbf{p}}_a]_j \bar{\boldsymbol{\xi}}_j \big ) \Big ) \Bigg]\]
Whichever of the "on" and "off" configurations gives the higher output from \(K(x)\), corresponding to a lower energy state, will be updated towards.
Using even polynomials in the energy difference means that attraction to similar patterns (closer than orthogonal) is rewarded as much as pushing opposite patterns (further than orthogonal) further away. Odd polynomials also reward attraction to similar patterns, but instead reward reducing the distance to opposite patterns, pulling them toward orthogonality. In other words, an even polynomial would rather have a vector be opposite than orthogonal, while it is the other way around for an odd polynomial. Meanwhile, rectification of the polynomial, which empirically resulted in a higher memory capacity, simply ignores all opposite patterns. Finally, the capacity of this network was further improved by replacing the polynomial with an exponential function \(K(x)=\exp(x)\) here.
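To make these weighting behaviors concrete, here is a small illustration using made-up rescaled similarity scores (a near match, a weakly similar pattern, and a near-opposite pattern); the normalization and the exponential's scale factor are my own choices:

```python
import numpy as np

# Hypothetical rescaled query-pattern similarities in [-1, 1].
sims = np.array([0.9, 0.2, -0.9])

def weights(K):
    w = K(sims)
    return w / np.abs(w).sum()  # normalize magnitudes for comparison

quadratic = weights(lambda x: x ** 2)                      # original Hopfield, a=2
rectified = weights(lambda x: np.where(x > 0, x ** 3, 0))  # rectified odd polynomial
expo = weights(lambda x: np.exp(5 * x))                    # exponential K(x)

assert quadratic[0] == quadratic[2]  # even: opposite weighted like the match
assert rectified[2] == 0             # rectified: opposite pattern ignored
assert expo[0] > 0.9                 # exponential: weight piles on the near match
```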
Fundamentally, these odd, rectified, and exponential \(K(x)\) functions make the Hopfield Network more similar to SDM by introducing activation thresholds around the query, either ignoring (in the rectified and exponential cases) or penalizing (in the odd polynomial case) vectors that are greater than orthogonal distance away. The weighting of vectors still differs between the polynomials and the exponential, such that they have different information storage capacities. However, by introducing these de facto cosine similarity thresholds, these Hopfield variants all converge towards the read operation of SDM. This is particularly the case for the exponential variant, because SDM weights its patterns in an approximately exponential fashion.
Beyond the convergence in pattern weightings, we note that the last step in turning the exponential variant of the modern Hopfield Network into Transformer Attention is to make it continuous, as done in the paper "Hopfield Networks is All You Need". Making SDM continuous is the same step taken in our work that results in the Attention approximation to SDM. This is done by modifying the energy function so that it enforces a vector norm (SDM does this too) and then using the Concave Convex Optimization Procedure (CCP) to find local minima (flipping bits is no longer an option).
The recent paper “Universal Hopfield Networks” (disclaimer: I am in the acknowledgements for this paper and the first author is a friend) makes a similar point relating different activation functions that have emerged in Hopfield Networks to Attention but does not dive into the relationship between SDM and Hopfield Networks and their chronological evolution.
Finally, there are weak indications of convergence between the range of optimal SDM cosine similarity thresholds \(d\) and the optimal polynomial orders \(a\) in the polynomial Hopfield Network. When pattern representations were learnt, it was found that the optimal polynomial for maximizing accuracy in a classification task was neither too small nor too large. This is also the case for optimal \(d\); the two are related via their effect on the pattern weightings. The authors give a useful interpretation of their system where some vectors can serve as prototypes that lie very close to the solution, while other vectors represent features shared across different components. The best solution interpolated between these two approaches.
One can view the prototypes as anchoring the solution and the features as providing generalization ability. The same intuition may apply from SDM's neuron perspective, where it can be advantageous to read prototypes from nearby neurons while also collecting features from related patterns stored in more distant neurons. This overview of SDM gives an example of noisy patterns being stored such that their combination is a noise-free version, which is related to this line of thinking on using features from related patterns.
If you found this post useful for your research please use this citation:
@misc{SDMHopfieldNetworks,
title={Sparse Distributed Memory and Hopfield Networks},
url={https://www.trentonbricken.com/SDM-And-Hopfield-Networks/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={October}}
Thanks to the NSF and the Kreiman Lab for making this research possible. Also to Dmitry Krotov and Beren Millidge for useful conversations. All remaining errors are mine and mine alone.
A number of influential continual learning algorithms like Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI) protect neural network weights important for previous tasks from being updated by newer tasks. Otherwise, these network weights are overwritten by the next task, resulting in catastrophic forgetting.
It turns out that existing implementations of these algorithms have, at least for some tasks, been significantly underestimating their performance. This is because they need a small modification to be able to actually learn which network weights need protecting. In a Split MNIST class incremental benchmark, this modification leads to 43% higher validation accuracy.
These weight regularization methods use the magnitude of gradients during the backwards pass to infer what weights are important for a particular task. They then use a regularization term in the loss function to penalize the model from updating these weights during new tasks. However, in cases where the model performs very well on all training data within a task, there is almost no gradient for the model to be able to learn what weights are important!
In order to restore gradient information, we introduce a \(\beta\) coefficient into the cross-entropy loss function when learning weight importances. This is used to make the model less confident in its predictions. A hyperparameter sweep found that \(\beta = 0.005\) worked best for both EWC and SI, leading to 43% and 15% performance gains respectively, as shown in the figure below.
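As a sketch of the idea (the paper's exact formulation may differ; here I assume \(\beta\) simply rescales the logits before the softmax), reducing confidence revives the gradient on examples the model already classifies perfectly:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad(logits, target, beta=1.0):
    # Gradient of cross-entropy w.r.t. the logits when they are scaled by beta.
    # A small beta flattens the softmax, so even confident predictions still
    # produce a usable gradient for estimating weight importances.
    p = softmax(beta * logits)
    onehot = np.eye(len(logits))[target]
    return beta * (p - onehot)

logits = np.array([10.0, -5.0, -5.0])  # confidently correct prediction
g_plain = ce_grad(logits, 0, beta=1.0)
g_small = ce_grad(logits, 0, beta=0.005)

assert np.abs(g_plain).max() < 1e-5                         # gradient vanished
assert np.abs(g_small).max() > 100 * np.abs(g_plain).max()  # signal restored
```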
This result is presented in the paper Sparse Distributed Memory is a Continual Learner, which relies upon a version of MNIST where the images use pixel values [0,255] that are not rescaled to [0,1] or normalized to have a mean of 0. We believe this makes the MNIST classes more orthogonal and easier for continual learning.^{1} However, even when the pixels are rescaled, we still see a performance boost (going from 27% with \(\beta=1.0\) to 54% with \(\beta=0.005\)), and also performance gains for Split CIFAR10.
The \(\beta\) coefficient will not increase performance independently of other parameters. In both of the above cases, we also use gradient clipping, SGD with momentum, and a single hidden layer of neurons. However, with these other modifications in place, \(\beta\) remains worth experimenting with to potentially get significant performance gains. Our results are better than those of any other paper we have seen that uses EWC or SI as baselines, including Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines.
There are two ways to reproduce our results: 1. Clone this GitHub repo which is a forked version of the original continual learning baseline and run the following command (after installing the requirements):
python -u iBatchLearn.py --gpuid 0 \
--repeat 1 --incremental_class --optimizer SGD \
--momentum 0.9 --weight_decay 0.0 --force_out_dim 10 \
--no_class_remap --first_split_size 2 --other_split_size 2 \
--schedule 4 --batch_size 512 --model_name MLP1000 \
--agent_type customization --agent_name EWC_mnist \
--lr 0.03 --reg_coef 20000 --use_beta_coef True \
--beta_coef 0.005
This uses the version of Split MNIST with [0,1] pixel values.
(If you are running on CPU, set --gpuid -1.)
2. In test_runner.py, set model_style = ModelStyles.CLASSIC_FFN and cl_baseline = "EWC-MEMORY", then call: python test_runner.py
This uses the version of Split MNIST with [0,255] pixel values.
If you found this post useful for your research please cite the paper: Sparse Distributed Memory is a Continual Learner.
@misc{SDMContinualLearning,
title={Sparse Distributed Memory is a Continual Learner},
url={https://openreview.net/forum?id=JknGeelZJpHP},
journal={ICLR Submission},
author={Anonymous},
year={2022}, month={September}}
This may explain why careful tuning of our regularization coefficient for the algorithms Memory Aware Synapses and L2 Regularization also led to slightly higher performance than reported in the baselines paper. ↩
Thanks to the NSF Foundation and the Kreiman Lab for making this research possible. All remaining errors are mine and mine alone.
Overall I am super happy and feel very lucky to be in my PhD program. On an average week I have ~2 hours of commitments in total. The rest of the time is totally unstructured where I am free to learn and pursue whatever it is that I’m interested in. My interests are centered enough around my research that I manage to keep my supervisor and collaborators happy (at least to date!). But I still have lots of free time to explore and learn about so many other things. I doubt I’ll ever have this degree of freedom and flexibility again in my life.
Lessons:
His years at sea had taught him that if you don’t fix something when you first see it beginning to fail, it is very likely to finish failing just when it is the most dangerous and the hardest to deal with, such as in the midst of a storm. Source
What I cannot create I do not understand - Richard Feynman
My favourite science quote:
The Most Exciting Phrase in Science Is Not ‘Eureka!’ But ‘That’s funny …’ - Isaac Asimov
Oh and if you haven’t read You and Your Research then you should.
Goals:
Thanks to Max Farrens for reading drafts of this piece. All remaining errors are mine and mine alone.
Maybe it is because I think so highly of Tyler Cowen that I expected the long-anticipated Talent to be harder hitting and more insightful. A couple of the interview questions are interesting, along with a few other factoids that I’ve tried to aggregate below. But otherwise the book is basically a Psych 101 introduction to the Big Five personality traits, covered more carefully and thoroughly by its Wikipedia entry.
In classic Straussian fashion, the main takeaways of the book in the foreword are trite while the actual ones are:
The discussion of interview questions and strategies is the strongest part of the book, but it is also only one chapter.
Interview questions:
Questions I want to add to the list are:
Behind the scenes during the interview you want to:
On breaking the ice and how anything goes:
On the topic of references: they are very important, but referees often don’t have much time and don’t want to say bad things. Try to get things into conversation mode and ask for objective comparisons:
The first question, reformulated as “would you be okay with this person being your boss?”, is a great one to ask yourself about any hiring decision too.
The meat of the book talks about IQ and the Big Five personality traits but covers these in a cursory and not particularly precise way. For example, they give very loose definitions of the Big Five traits: under extraversion they list “friendliness,” but should this not fall under agreeableness? For neuroticism they define it as:
A general tendency to experience negative emotions and negative affect, including fear, sadness, embarrassment, anger, guilt, and disgust.
But how does this relate to depression, and also to just being less emotionally stable? They go on to talk about neurotics as if they are just people who often complain and lead social movements, giving John Calvin and Gandhi as examples of “pests” or “prickly individuals.”
They define conscientiousness as:
High conscientious individuals have high self-control, are very responsible, have a strong sense of duty, and usually are good at planning and organizing, due to their reliability.
But this mixes both morals and being a hard worker.
Because the Big Five traits are never really defined, things get particularly confusing when they summarize studies showing that conscientiousness is not actually what you might otherwise think it is. This includes interesting results that South Koreans work long hours but score low in conscientiousness, and that conscientiousness was uncorrelated with COVID mask-wearing obedience in Italy. But we were never really told what we were supposed to think conscientiousness is to begin with!
There is a summary of psychology/psychometrics on how the Big Five impacts job performance and earnings. While they do a good job hedging by talking about the replication crisis and needing to take any of these results with a grain of salt, they simultaneously cherry pick a subset of studies that readers will likely over update on.
For example, a study found 20% of the variance in achievement for scientists was due to personality after adjusting for scientific potential and intelligence, but my guess is that this varies dramatically across subfields, depending in large part upon how collaborative the field is (think a large biology lab versus a mathematician with his chalkboard).
In the section on IQ they paint a picture where it is somewhat useful but not all that important. While I largely agree, I again wish they had been more precise and covered more ground. For example, they fail to acknowledge g, instead making sweeping statements like “Intelligence is context dependent”. They also fail to discuss the Flynn effect, don’t acknowledge that IQ tests are legally fraught for hiring in the US, and at times seem to blur IQ with creativity and the quality of one’s ideas.
A key citation they use looked at the IQ of CEOs in Sweden. This study found that “the small company CEO is above 66% of the Swedish population in cognitive ability, and the median large-company CEO is above 83% of the Swedish population in cognitive ability.” Yes, this may be good evidence that there is more to success than IQ, but having more data on exactly how much it matters across a broad swathe of outcomes would have been useful; for example, I have seen this table before in other places on the interwebs:
They also seem to get things wrong when they say there is “evidence that autistics have strong performance on Ravens IQ tests”: this is not true on average, based on my reading of NeuroTribes, and a cursory Google search of top-cited papers seems to back this up, for example here. I also wish they had clarified other results I have seen around polygenic scoring and genetic predictors of personality type. See, for example, Top 10 Replicated Findings from Behavioral Genetics. And that they had brought up other interesting phenomena such as birth order effects.
At times the book seems to lose track of its audience, jumping without any clear delineation between being a self-promoting biography of Tyler and Daniel, a management book, and a self-help book. The personal anecdotes are often lengthy and the third-person writing can be a bit much at times. The book also deviates into the occasional grandiose motivational self-help sales pitch, with phrases like:
Do you wish to be part of such trends for mobilizing the talents of strongly unique people, or are you going to let others eclipse you in the search for talent?
And management advice such as when discussing the move to remote work saying:
Those methods will reward non-paranoid leaders who are okay with giving up some sense of control in the moment, and you will need to adjust your style more in that direction, if you have not already.
Even on the back cover it says:
Identifying underrated, brilliant individuals is one of the simplest ways to give yourself an organizational edge.
But if it is so simple then why is an entire book trying to discuss the nuances and how it all comes down to context?
They also move between focusing on how to hire real outliers to how to hire for standard positions. For example, in the interview section, early on they say they will discuss unstructured interviews typically used for more senior positions instead of structured interviews where the same question is asked to every candidate. But later when discussing the Big Five personality types they turn their attention towards hiring mediocre candidates for entry level jobs.
Maybe the problem is me getting my hopes up too much for this book both in terms of what I expect from Tyler Cowen and also how hard it is to make any meaningful statements about something as context dependent and difficult as finding talent? And while the book tries to be prescriptive, they correctly acknowledge that at the end of the day it is all context.
I’m now going to transition into notes from the book that I found interesting:
Thiel anecdotes:
On the talent spotting abilities of Thiel:
Peter Thiel found and helped to mobilize the talents of Elon Musk, Reid Hoffman, Max Levchin, Mark Zuckerberg, and others, including Steve Chen, … . His approach is not well described by any kind of mechanical formula, and Peter’s own background is in the humanities – philosophy and law – rather than science or tech. Many of his current interests concern religion, as he studied the Bible under French anthropologist and philosopher Rene Girard, who was a professor of Peter’s at Stanford. We understand Peter as applying a very serious philosophical and indeed even moral test to people. … In our view, Peter actually asks whether you deserve to succeed, as he understands that concept, and he derives additional information from that interior and indeed deeply emotional line of inquiry.
How Thiel is compelling:
The first time each of us met Peter Thiel, for instance, we noticed how engrossed he was in his explanations and, furthermore, how quickly and effectively he pulled people into his worldview, introducing and applying concepts such as “technological stagnation,” “the inability to imagine a future very different from the present,” “Georgist economics,” and “the Girardian sacrificial victim.” Maybe you don’t know what all of those concepts refer to and maybe Peter’s audience doesn’t either, but that is not the point. There is a logic to his argument, and Peter communicates that logic with the utmost conviction; the audience correctly senses a coherent underlying worldview.
Another Thiel anecdote I love that is not in the book but is just absurd: his Roth IRA is worth $5B (you can only deposit into these when your income is low, at a maximum of $6K per year, and the gains are not taxed!).
Altman anecdotes:
Focus not on where someone currently is but their rate of growth.
I found it amusing that Altman on talent says:
I look for founders who are scrappy and formidable at the same time (a rarer combination than it sounds); mission-oriented, obsessed with their companies, relentless, and determined; extremely smart (necessary but certainly not sufficient); decisive, fast-moving, and willful; courageous, high-conviction, and willing to be misunderstood; strong communicators and infectious evangelists; and capable of becoming tough and ambitious.
Should we call in a superhero? Isn’t he just listing every possible desirable characteristic? Isn’t the value of VC in finding “alpha” by investing in people who don’t fit all of these straightforward criteria?
On the importance of speed and being proactive:
Years ago I wrote a little program to look at this, like how quickly our best founders – the founders that run billion-dollar plus companies – answer my emails versus our bad founders. I don’t remember the exact data, but it was mind-blowingly different. It was a difference of minutes versus days on average response times.
One point they did not state, but which stuck with me from Altman’s podcast with Tyler, was that the most successful founders all came from stable, middle-class families. This may have something to do with their risk tolerance?
The later sections covered Disabilities, Minorities, and Gender. I thought the sections on minorities and hiring for diversity are important and broadly well stated. However, I have already highlighted my issue with the claim about autism being positively correlated with Raven’s Matrices. Here are some of the more striking points from the section on gender:
Miscellaneous:
Thanks to Davis Brown for reading drafts of this piece. All remaining errors are mine and mine alone.
Disclaimer: I am not a philosopher, I very much welcome comments and debate on this piece. I am trying to not let perfect be the enemy of good and share this piece somewhat unfinished rather than continuing to sit on it.
Derek Parfit in Reasons and Persons introduces the “Repugnant Conclusion”, an unsettling answer to the question: “should we have more people or happier people?” Parfit persuasively argues from a few simple axioms that the answer is always quantity over quality: we should have as many people alive as possible, such that everyone is living right on the threshold of life not being worth living. In other words:
“For any perfectly equal population with very high positive welfare, there is a population with very low positive welfare which is better, other things being equal.” - Derek Parfit
Many, myself included, find the idea of this subsistence-level living repugnant, hence the name of Parfit’s conclusion. However, I think there is actually a simple solution to the Repugnant Conclusion, which I will outline after formalizing it a bit more.
Parfit considers the utility function of individuals as either linear or non-linear with diminishing returns:
The diminishing returns case is the most realistic and in this case any increase in resources on the x axis can give only equal or less than equal returns on quality of life. We can use this utility function to consider the utility of a whole population, deciding the amount of resources that each person gets and summing together their quality of life:
Comparing the areas of the rectangles on the right:
We want the one with the largest area and due to our utility function having diminishing returns, the way to maximize area is by having everyone live just above subsistence:
This always results in the repugnant conclusion where we choose quantity over quality for the sake of maximizing total utility.
However, I think there is a solution that isn’t repugnant and in fact leverages the very nature of us seeing this conclusion as repugnant – there is a point where a small increase in resources leads to an even larger increase in life quality. In other words, when life goes from glass half empty to glass half full; when you are sufficiently far above subsistence that life gets a lot more enjoyable. Exactly where this point occurs and how large this non-linear increase in quality of life as a function of resources is remains up for debate. Yet as long as any non-linearity of this form exists, the repugnant conclusion will be avoided. Formally, as long as this non-linearity with a positive 2nd derivative exists and the utility function is monotonically increasing, it dissolves the repugnant conclusion.
This is because at this non-linearity, for a decrease in resources, there is an even greater decrease in life quality. This means that in order to maximize the utility of a population, nobody’s resource allocation and quality of life should drop below this non-linearity.
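To make this concrete, here is a toy numerical sketch (the utility functions are illustrative assumptions, not taken from Parfit or the literature). With a purely concave utility, total utility is maximized by splitting a fixed resource pool among as many people as possible; adding a convex region just above subsistence moves the optimum to fewer, better-off people, each living at or above the non-linearity:

```python
import numpy as np

# Toy illustration: a fixed resource pool R split equally among n people.
# Both utility functions are illustrative assumptions, not derived from data.

def concave(r):
    # diminishing returns everywhere -> repugnant conclusion
    return np.log1p(r)

def sigmoid(r):
    # convex just above subsistence (r < 3), concave afterwards
    return 4 / (1 + np.exp(-(r - 3)))

R = 120.0  # total resources to divide
for name, u in [("concave", concave), ("sigmoid", sigmoid)]:
    totals = {n: n * u(R / n) for n in (4, 12, 40, 120)}
    best_n = max(totals, key=totals.get)
    print(name, best_n)  # concave -> 120 (subsistence for all); sigmoid -> 40
```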
I am curious to know if this non-linearity already exists in the literature and what might be wrong with it. It is such a simple modification, and I was frustrated that Parfit never addressed it in his magnum opus. Tamay noted that one known occurrence of this sort of asymmetry in utility exists in Prospect Theory, where losses hurt more than wins.
What Do We Learn from the Repugnant Conclusion? by Tyler Cowen reviews many papers that came out after Reasons and Persons introduced the Repugnant Conclusion. Most related is his section titled “Asymmetric Treatment for Low-Utility Individuals”. However, this section discusses placing bounds on the utility function which is not done here.
In other places there is discussion of having asymptotically declining utilities that tend towards zero and may be non-linear. This violates axiom 4 of the repugnant conclusion:
Axiom (4) - No value should become infinitely small in importance at the margin. A very large addition to that value, all other things being held equal, should never translate into an asymptotically insignificant contribution to the social welfare function. I call this the non-vanishing value axiom.
However, the solution proposed does not rely upon an asymptotic decline of the utility function to a value of 0. Instead, it could intersect the y-axis at a value higher up, the only thing that matters is that non-linearity with a positive 2nd derivative exists somewhere.
Cowen also summarizes work modeling interaction effects, where a decrease in resources per person leads to a reduction in, say, dignity, and it is the effect on dignity itself that decreases utility such that we should avoid the repugnant conclusion. He provides a number of attacks against these kinds of interaction effects. However, there is clearly an interaction between resources and life quality, and life quality itself is the sum of many components; an interaction between resources and these components therefore seems inevitable? Yet more work needs to be done to suggest where the non-linearity I propose actually comes from.
For the utility function that describes the average person’s life, as long as there exists a point where a one-unit increase in resources leads to a greater than one-unit increase in life quality, the Repugnant Conclusion is not reached. The source of this non-linearity is unclear but, at risk of being tautological, may exist due to the very fact that the Repugnant Conclusion feels so repugnant.
Thanks to Tamay Besiroglu and Davis Brown for reading drafts of this piece. All remaining errors are mine and mine alone.
I have just finished both of Ted Chiang’s collections of short stories, “Stories of Your Life and Others” and “Exhalation”, which were both excellent. Chiang’s work is steeped in past and present scientific ideas spanning fields including physics, computer science, and biology. For example, Chiang considers worlds where we can turn off the part of the brain that recognizes beauty in faces; where children are created from preformed humans inside sperm; where we can do forms of time travel that don’t violate our current laws of physics; where the Everettian many-worlds interpretation of quantum physics is real and we can communicate across it; reflections on the heat death of the universe; and the effects of glasses that record and allow for immediate recall of your past experiences.
I really like how Chiang makes salient slippery topics like the progression of technology, Chesterton’s fence, free will, morality, the meaning of life. He provides novel angles to view these topics through and handles the ideas subtly. The stories leave many more questions than answers but are stimulating and beautiful.
—
Questions the stories prompted that I will keep thinking about:
When is technology beneficial?
Technology gives us more power and optionality, allowing us to do things that were never possible before. This forces us to reconsider what about our status quo that evolution gave us is desirable to keep and what is not. Paul Graham notes a growing divergence in The Acceleration of Addictiveness between what is “normal” in the sense that cavemen also did it and “normal” in the sense that the majority of people do it now.
Evolution is simultaneously a “blind idiot God”, responsible for vast amounts of unnecessary suffering and a gargantuan Chesterton’s fence, creating 12 stage cassava processing techniques to remove cyanide. How can we fix evolution’s shortcomings while not poisoning ourselves with cyanide? Moreover, figuring out what is best for us is made all the more tricky because our very desires are programmed by evolution. Moreover, how can we use technology to restore the very things that we lost because of other technologies and where does this end? For example, in the story “Liking What You See: A Documentary” technology, including better cosmetic surgery, leads to superstimuli that hijack our natural bias to treat more beautiful people better. A counter response to this is a non-invasive brain modification that makes one “blind” to human beauty. The story provides a back and forth debate for and against this technological arms race.
The utility of memory?
By default I buy into wanting to remember everything and the importance of objective fact. However, Chiang in “The Truth of Fact, the Truth of Feeling” reveals how even a memory device as simple as writing can affect our social dynamics and outcomes. There is a distinction made between what is factually correct and what it is best to believe in order to make the right decision. This is closely related to Elephant in the Brain, which puts forward the hypothesis that our inner mind hides its intentions from our conscious mind so as to best reach our ends by being optimally deceptive – the best liar is the person who doesn’t even know they are lying! Chiang projects our external memory tools and their consequences into the future where we all wear video cameras and can effortlessly query any previous memory. This extension takes our external memory abilities beyond just the factual (e.g. Googling what is the capital of France?) and into the personal (e.g. What did I say to Alice four weeks ago at that party?).
In this story the protagonist learns that he was in fact misremembering previous interactions and, embarrassed, concludes the recording device can help him become a better person. However, to what extent are we as humans already too self-effacing? And given that we are terrible at holding both good and bad things in mind at the same time (the affect heuristic) is it a bad thing that we are constantly overwhelmed in nuance and who the good guy is versus the bad guy? At what point do we hit epistemic learned helplessness? If this all sounds interesting, Symbolic Species takes this argument about memory and fact even further with the costs and benefits of symbolic thought and language itself.
How can we establish the rights and consciousness of digital minds?
“The Lifecycle of Software Objects” short story is timely on a number of fronts. Digital beings (digients) are created that run on artificial neural networks and develop analogously to children over time. Their owners grapple with figuring out just how intelligent the digients are (we are currently doing this with our largest AI language models). The digients also desire to have autonomy and be incorporated as independent entities raising tricky legal issues and getting to the core of free will and agency. We will soon face something similar with driverless vehicles. Interestingly, the story assumes that digients are conscious and can suffer by, for example, being tortured. This assumption and the implications of being able to create vast amounts of suffering with mere computer code was troubling but again timely with the debates this month on whether or not Google’s LaMBDA language model is conscious.
What’s the point of it all?
Exhalation poetically captures the ultimate heat death of the universe when all life and existence will inevitably come to a standstill. Even in light of this ultimate extinction, there is a very Buddhist perspective of enjoying the present and existence itself. A number of Chiang’s other stories also touch on these sorts of realizations and existential crises including: Omphalos, Division by Zero, and Tower of Babylon.
If you made it through the above, some of these points may sound trite, especially the last one. This is where I believe Chiang is at his best, weaving together deep ideas with imagery and emotion that resonate and feel more profound than I can hope to do justice to here. Go and read the originals :P
—
My favorite short stories, some of which I have already mentioned, in rough order of enjoyment were:
Most of these came from the second collection of short stories: “Exhalation”.
As a final note, I love that there are story notes at the end of each book where Chiang shares his inspiration for each of the stories, providing a different and richer perspective on their origins.
If you have read Ted Chiang’s work then reach out and let me know your thoughts!
There has been recent discussion on StackOverflow and Twitter on the full memory requirements of training a Transformer.
Because I am in the process of training Transformers myself and scaling to multiple GPUs, I became interested in this question. Misha Laskin provides some back-of-the-envelope calculations for why batch size and sequence length dominate over model size that are interesting but off by approximately 4x for the model parameters and 2x for the activations.
I have thrown together a more detailed calculator as a Colab notebook here, and I outline my reasoning below. I’ve tested this on the “small” 124M parameter and “medium” 345M parameter GPT2 models and get close to the real values.
Here is the GPT2 model architecture (image taken from my paper):
See the Illustrated GPT2 for a full explanation of how GPT2 works.
Here are the equations and notation in full:
L = 12 # number of blocks
N = 12 # number of heads
E = 768 # embedding dimension
B = 8 # batch size
T = 1024 # sequence length
TOKS = 50257 # number of tokens in the vocab
param_bytes = 4 # float32 uses 4 bytes
bytes_to_gigs = 1_000_000_000 # 1 billion bytes in a gigabyte
model_params = (TOKS*E)+ L*( 4*E**2 + 2*E*4*E + 4*E)
act_params = B*T*(2*TOKS+L*(14*E + N*T ))
backprop_model_params = 3*model_params
backprop_act_params = act_params
# equivalently: 4*model_params + 2*act_params
total_params = model_params + act_params + backprop_model_params + backprop_act_params
gigabytes_used = total_params*param_bytes/bytes_to_gigs
For the “small” GPT2 model with 124M parameters (that uses the above values for each parameter) we get:
model_params = 123,568,896
act_params = 3,088,334,848
gigabytes_used = 26.6 Gb
While running the Hugging Face GPT2 we get 27.5Gb.
If our batch size is 1 then we undershoot again: memory is predicted to be 5.1Gb but in reality it is 6.1Gb.
For the medium-sized 345M parameter model and a batch size of 1, our equation predicts it will use 12.5Gb while empirically it is 13.4Gb. The 1Gb gap remains. I learned that this 1Gb gap comes from loading the GPU kernels into memory! See here.
The model parameter equation comes from:
(TOKS*E) [embedding layer ]+ L [number of blocks]*( 4*E**2 [Q,K,V matrices and the linear projection after Attention] + 2*E*4*E [the MLP layer that projects up to 4*E hidden neurons and then back down again] + 4*E [Two layer norms and their scale and bias terms])
Where we ignore the bias terms and positional embedding.
The activation parameter equation comes from:
B[batch]*T[seq. length]*(2*TOKS [one hot vectors at input and output]+L[number of blocks]*(3E [K,Q,V projections] + N*T [Attention Heads softmax weightings] + E [value vector] + E [linear projection] + E [residual connection] + E [LayerNorm] +4E [MLP activation]+E [MLP projection down]+E[residual]+E[LayerNorm] ))
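For convenience, the calculation can be wrapped in a small helper (a sketch using the same constants and equations as above):

```python
def gpt2_memory_gb(B=8, T=1024, L=12, N=12, E=768, TOKS=50257, param_bytes=4):
    """Estimate training memory (in GB) for a GPT2-style model.

    Counts model parameters, activations, and the backprop terms
    (3x model params for gradients plus optimizer state, 1x activations).
    """
    model_params = TOKS * E + L * (4 * E**2 + 2 * E * 4 * E + 4 * E)
    act_params = B * T * (2 * TOKS + L * (14 * E + N * T))
    total_params = 4 * model_params + 2 * act_params
    return total_params * param_bytes / 1_000_000_000

print(gpt2_memory_gb(B=8))  # ~26.7 GB predicted vs. 27.5 GB observed
print(gpt2_memory_gb(B=1))  # ~5.1 GB predicted vs. 6.1 GB observed
```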
When I turn on floating point 16 the memory for 1 batch only drops from 6.1Gb to 5.8Gb. Meanwhile for a model with a batch of 8, it goes from 27.5 to 21.8Gb. Why are there not larger memory savings? Is this because it is mixed precision and the model decides that it needs high precision for many of its components?
Thanks to Miles Turpin and Misha Laskin for motivating this piece. All remaining errors are mine and mine alone.
Author Chris Eliasmith uses ideas from Vector Symbolic Architectures (VSAs) and implements them with his Neural Engineering Framework (NEF) to make his “brain” somewhat biologically plausible. He uses this combination to solve a number of different tasks and replicate some core psychology and neuroscience experimental results.
The book provides useful context about other attempts in computational neuroscience to build a brain and highlights the promise of VSAs to bridge many of the apparent dichotomies between previous approaches. In focusing at the scale of building a brain, Eliasmith also drags the microscopic and myopic back to the terminal goal of building minds and developing solutions that work at scale.
Chris Eliasmith summarizes years of his research, culminating in “Spaun”, a model of the brain with many interconnected components that can perform diverse tasks including classifying handwritten digits and solving Raven’s Progressive Matrices (one of the core tests of human intelligence). This model, while certainly performing some impressive tasks, is less impressive than it first sounds, with many of the tricky details being hard-coded. Its success is thanks to an impressive engineering effort, both in connecting the right pipes in the right ways and in using Eliasmith’s Neural Engineering Framework (NEF) to implement basic VSAs and associative memory theories that had already been developed.
A big reason I was let down by the book is that I was aware of VSAs and excited to see them applied. I had hoped that some open questions I had around them would be addressed such as how a brain should decide to organize its cleanup memory, how it should represent symbols and vectors, and what variables should be bound to others. However, many of these issues were ignored just to flex the power of NEF on a number of toy problems with the really difficult parts simply hardcoded.
Additionally, I don’t think Eliasmith gave previous work in VSAs and associative memory enough credit, nor did he leverage it to its full potential. For example, he presents a way to store memories through “chaining” without citing Plate’s PhD thesis, which presents and outlines this idea in detail, calling it “chunking”.
In addition, for cleanup memory he simply takes a dot product between the query and every pattern that has ever been stored. It is biologically implausible that every pattern is independently stored in the brain in this way. Far more powerful and biologically plausible systems, such as Sparse Distributed Memory (SDM) and even Hopfield networks, exist and should have been used instead.
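To make the criticism concrete, here is a minimal NumPy sketch of this kind of exhaustive dot-product cleanup memory (my own illustration, not code from the book): every stored pattern is kept explicitly, and a query is compared against all of them. The dimensionality, pattern count, and noise level are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512   # vector dimensionality
N = 50    # number of stored patterns

# Codebook of unit-norm random patterns: the entire "cleanup memory".
patterns = rng.standard_normal((N, D))
patterns /= np.linalg.norm(patterns, axis=1, keepdims=True)

def cleanup(query):
    """Exhaustive dot-product cleanup: compare the query against every
    stored pattern and return the best match."""
    sims = patterns @ query
    return patterns[np.argmax(sims)]

# A noisy version of pattern 7 still cleans up to the exact stored copy.
noisy = patterns[7] + 0.2 * rng.standard_normal(D)
restored = cleanup(noisy)
```

The point is that `patterns` must hold every trace verbatim and every query touches all of them; an SDM or Hopfield network would instead store traces superimposed in shared weights and retrieve them by attractor dynamics.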
Maybe I am overdoing my criticism here; perhaps these fancier memory models could have been implemented but were simply not the main focus of this work. For any project, there is always finite scope. However, one thing I appreciated about the book was its prioritization of biological plausibility, and the emphasis on it in some domains alongside its neglect in others is frustrating. This is particularly true because the biggest source of biological plausibility is the spiking neural networks implemented via the NEF…
According to Greek mythology, anything Midas touched turned to gold. Here, anything the NEF touches is declared “biologically plausible”. Beyond its being used to approximate functions that we know have more biologically plausible alternatives, the extent of the NEF’s representational power makes me skeptical of its own underlying plausibility, and with it the whole edifice built on top.
The largest issues with the NEF are: (i) how it computes a loss function, and (ii) how it propagates the error signal from this loss function to the specific subset of neurons used. The NEF takes a population of spiking neurons with random tuning curves and learns to weight them via a linear decoder such that they can approximate an arbitrary function. It is never explained^{1} how arbitrary loss functions are both stored and computed by a single decoder neuron. The error signal from this loss function is said to come from a local or external signal, with dopamine mentioned as a possible training signal, but how it propagates to and targets the subset of activated neurons is highly unclear.
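The core representational trick can be sketched in a few lines (my simplification with rate-based rectified-linear neurons rather than the NEF's spiking LIF neurons, and with the decoders solved offline by least squares rather than learned from an error signal, which is exactly the step whose biological story is unclear):

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons = 200

# Random tuning curves: each neuron has a random preferred direction
# (+1 or -1), gain, and bias, in the spirit of the NEF.
enc = rng.choice([-1.0, 1.0], n_neurons)
gain = rng.uniform(0.5, 2.0, n_neurons)
bias = rng.uniform(-1.0, 1.0, n_neurons)

def activities(x):
    """Rectified-linear firing rates of the population for inputs x."""
    return np.maximum(0.0, gain * np.outer(x, enc) + bias)

# Solve for a linear decoder that reads out f(x) = x^2 from the rates.
x = np.linspace(-1, 1, 100)
A = activities(x)              # (100 inputs, 200 neurons)
target = x ** 2
decoders, *_ = np.linalg.lstsq(A, target, rcond=None)

approx = A @ decoders
rms_error = np.sqrt(np.mean((approx - target) ** 2))
```

The least-squares solve stands in for whatever learning rule the brain would need; nothing in this sketch explains how the target function or its error signal would be computed and routed biologically, which is the gap noted above.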
Again, the overall focus and ambition are refreshing, but much work remains to be done, and I believe a number of the approaches outlined in this book will need to be re-questioned and overwritten in the future.
I appreciate the focus on biological plausibility; sometimes it is very compelling, but at other times it seems much less justified, which brings even the compelling sections into doubt.
Pros of the book include:
Cons:
Other issues with the book:
Concluding Remarks
Given my tone and critiques throughout this review, it may come as a surprise that I am really glad this book exists and that Chris Eliasmith has done the research he has. It is a great introduction to the field of computational neuroscience and shows the exciting potential of VSAs. However, it is precisely because this route is so exciting that I have high expectations and want everything to be done in the most biologically plausible and sophisticated way possible, with the right attributions to the original researchers.
If you have read the book or have opinions, please comment, as I am curious to get outside perspectives. Hat tip to Adam Marblestone for bringing the book to my attention through the wonderful list of recommendations on his website.
Thanks to Joe Choo-Choy for influential discussions and reading drafts of this piece. All remaining errors are mine and mine alone.
I will now transition from commentary to summarizing each chapter of the book by sharing the notes I took for each. I’m not sure how useful this will be to readers, but perhaps treat it as a “Table of Contents” for the book.
I found this chapter to provide a useful high-level background to computational and cognitive neuroscience, at least as of 2013 when this book was published.
There are four main approaches to modelling cognition:
- Logic/Symbolic, Good Old-Fashioned AI (GOFAI) – our thinking is logical and needs to be highly flexible, as enabled by symbols.
- Connectionist / Parallel Distributed Processing (PDP) – the brain is highly parallel and distributed.
- Dynamicist – we need to account for time and the environment.
- Bayesian – induction and probabilistic models of the world; we need to account for uncertainty.
The two main approaches to explaining data can be considered top-down and bottom-up:
- Production systems like ACT-R can capture high-level results but have constraining time constants that are hand-tuned and biologically implausible.
- Lower-level dynamics models capture the low-level phenomena but not high-level cognition.
In response to these buckets, Eliasmith presents VSAs, which can be logical and have nested, semantically meaningful relationships, while also being dynamicist in their focus on action and implemented in a way that models temporality and is both distributed and probabilistic (Bayesian).
Eliasmith poses an interesting question apparently asked during a funding agency meeting: “What have we actually accomplished in the last 50 years of cognitive systems research?” Eliasmith answers, saying that we have a better idea of the landscape of cognition and of the criteria that any intelligent system must fulfil.
Core Cognitive Criteria (CCC) for theories of cognition – how to evaluate any model:
I also tried to answer this question myself and would be very interested in comments with answers from you, reader. Try it now! It’s a fun exercise.
“What have we actually accomplished in the last 50 years of cognitive systems research?”
Obviously we have a long way to go but a great deal has been accomplished.
Eliasmith states the key questions that are answered in the book; however, I believe this is overpromising, and I append my reasons to each question.
How are semantics captured in the system?
How is syntactic structure encoded and manipulated by the system?
How is the flow of information flexibly controlled in response to task demands?
How are memory and learning employed in the system?
This chapter introduces the NEF, which I will skip as I have already written about it above. It also provides a general neuroscience background that gave me some new fun facts!
The recurrent laryngeal nerve of the giraffe links the larynx to the brain, a distance of a few inches. Because of evolutionary developments, however, it instead drops from the brain all the way down the neck to the heart, and then travels back up to the larynx. In giraffes the nerve can be as much as 15 feet long, all to make a connection a few inches away.
The next four chapters outline: Semantics; Syntax; Control; Memory & Learning. These are then all combined into Spaun, the system that is meant to emulate a brain solving a number of tasks.
This chapter introduces VSAs. For a better introduction I defer to … .
VSAs are capable of implementing Hinton’s reduced representations (this is taken from Plate’s PhD thesis; Plate was supervised by Hinton).
A golf ball was used to describe the high-dimensional space in which the symbols, represented as vectors, exist. The little pockets on the ball’s surface can cluster similar representations.
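As a concrete taste of the VSA machinery from Plate's thesis that Eliasmith builds on, here is a short sketch of Holographic Reduced Representations: binding two vectors with circular convolution and approximately unbinding with circular correlation (my own illustration; the vector dimensionality is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(2)
D = 1024

def rand_vec():
    """Random HRR vector with elements ~ N(0, 1/D), so norms are ~1."""
    return rng.normal(0, 1 / np.sqrt(D), D)

def bind(a, b):
    """Circular convolution: Plate's binding operator (via FFT)."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    """Circular correlation: bind with the approximate inverse of a."""
    a_inv = np.concatenate(([a[0]], a[1:][::-1]))  # involution of a
    return bind(c, a_inv)

role, filler = rand_vec(), rand_vec()
trace = bind(role, filler)           # role (x) filler, same dimension D
noisy_filler = unbind(trace, role)   # noisy reconstruction of filler

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sim = cosine(noisy_filler, filler)        # high: recognizably the filler
sim_other = cosine(noisy_filler, rand_vec())  # near zero: unrelated vector
```

The reconstruction is only a noisy copy of the filler, which is exactly why a cleanup memory (as discussed above) is needed to snap it back to a stored symbol.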
Work is summarized on how a convolutional network can be used to learn compressed representations of MNIST digits. The quality of the compression was assessed by decoding: fixing the top layer and optimizing the input layer. This compression is suggested to be analogous to the VSA’s use of superpositions/convolutions. Using the NEF they also obtained the nice Gabor-like filters that we see in V1.
There is also this nice diagram for neural processing and control.
Some nice diagrams of basal ganglia action selection:
An updated routing diagram also incorporates the dopamine signalling:
I NEED TO FINISH ADDING NOTES HERE ON THE REMAINING CHAPTERS.
Disclaimer that I have not read the original Neural Engineering Framework book and would love to be corrected if any of my understanding of it is incorrect. ↩