<h1>The Cerebellum Beyond Motor Control</h1>
<p><em>2022-11-19</em></p>
<p><em>Searching beyond the streetlight finds a plethora of important cerebellar functions.</em></p>
<hr />
<blockquote>
<p>A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is”. – <a href="https://en.wikipedia.org/wiki/Streetlight_effect">Source</a></p>
</blockquote>
<p><strong>Beyond the Streetlight</strong> – I believe this same phenomenon has occurred for our understanding of the cerebellum. Deficits in fine motor function are very easy to spot and were the simplest function to attribute to cerebellar lesions. This is still what is taught in most textbooks.</p>
<p>However, it is less well known that the cerebellum contains <a href="https://academic.oup.com/book/25657">~70%</a> of all neurons in the brain and is ubiquitous across organisms as varied as <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-neuro-080317-0621333">humans, fruit flies, and electric fish</a>. Over the last ~20 years in particular, evidence has been accumulating that the cerebellum, first <a href="https://link.springer.com/article/10.1007/s12311-020-01133-7">named by Leonardo da Vinci</a> for “little brain”, is more important than its size may suggest.</p>
<div align="center">
<img width="200" src="../images/CerebellumIsAwesome/CerebellumIntro.png" />
<br />
<em>This is the cerebellum. It is important for lots of things.</em>
</div>
<p>A prominent neuroscientist once said to me:</p>
<blockquote>
<p>A dirty little secret in cognitive science is that the cerebellum lights up for almost every task.</p>
</blockquote>
<p>Taking an excerpt from a <a href="https://pubmed.ncbi.nlm.nih.gov/22047489/">recent paper</a>:</p>
<blockquote>
<p>A review of 275 positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies revealed that cerebellar activation was observed during a broad range of tasks, including orienting attention, olfaction, spoken and written language, verbal working memory, problem solving, spatial memory, episodic memory, skill learning, and associative learning (Cabeza & Nyberg, 2000).</p>
<p>A broad range of neuropsychological deficits has also been documented following localized cerebellar pathology, with deficits across tasks of attention, working memory, language and naming, counting, visuospatial processing, planning, and abstract reasoning reported (Kalashnikova, Zveva, Pugacheva, & Korsakova, 2005).</p>
</blockquote>
<div align="center">
<img width="500" src="../images/CerebellumIsAwesome/SamWang.png" />
<br />
<em>Neural tracers have recently found closed loop circuits between the cerebellum and almost every other brain region. Graphical abstract of <a href="https://pubmed.ncbi.nlm.nih.gov/34551311/">Pisano et al. (2021)</a>.</em>
</div>
<p><strong>Cerebellar ubiquity</strong> – The cerebellum-like structure found in many insects including <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-neuro-080317-0621333">fruit flies</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/11240291/">ants</a>, and <a href="https://royalsocietypublishing.org/doi/10.1098/rstb.1982.0086">bees</a> is the Mushroom Body (MB). The MB is responsible for associative learning and is the primary region of the Drosophila brain that is not genetically pre-wired, instead containing many <a href="https://elifesciences.org/articles/04577">random connections that undergo learning</a>. Cerebellar equivalents have also been discovered in families of <a href="https://royalsocietypublishing.org/doi/10.1098/rstb.2015.0055">crustaceans and flatworms</a> through likely shared ancestral inheritance and <a href="https://www.sciencedirect.com/science/article/pii/S096098221101013X">cephalopods</a> through potentially convergent evolution.<sup id="fnref:Cephalopods" role="doc-noteref"><a href="#fn:Cephalopods" class="footnote" rel="footnote">1</a></sup></p>
<div align="center">
<img width="700" src="../images/CerebellumIsAwesome/MBCerebellum.png" />
<br />
<em>The shared circuitry between the cerebellum and Mushroom Body. Image taken from Figure 1 of <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-neuro-080317-0621333">Modi et al. (2020)</a>.</em>
</div>
<p><strong>Cerebellar intelligence</strong> – Just as a <a href="https://pubmed.ncbi.nlm.nih.gov/28215558/">positive correlation</a> has been found between the size of the MB and the intellect of various insects, it has <a href="https://www.pnas.org/doi/pdf/10.1073/pnas.2002896117">recently been discovered</a> that the human cerebellum is larger than previously thought, with a surface area equal to 78% of that of the neocortex – a much larger proportion than in other primates. For example, the macaque cerebellum is only 33% of the surface area of its neocortex.</p>
<p><strong>Sans cerebellum</strong> – Another critique of cerebellar importance is that people born without one can live. This is not unique to the cerebellum: people can be born without entire cortical lobes and show no phenotypic deficits. For example, <a href="https://www.wired.com/story/she-was-missing-a-chunk-of-her-brain-it-didnt-matter/">patient EG</a>, a highly intelligent lawyer who taught herself Russian later in life and scored in the 98th percentile for vocabulary, was only found at the age of 25 to be missing her entire left temporal lobe – the canonical center of language processing! Her right temporal lobe was found to have entirely <a href="https://evlab.mit.edu/assets/papers/Tuckute%20et%20al%202022%20Nplogia.pdf">compensated</a>.<sup id="fnref:LostEinstein" role="doc-noteref"><a href="#fn:LostEinstein" class="footnote" rel="footnote">2</a></sup></p>
<div align="center">
<img width="500" src="../images/CerebellumIsAwesome/PatientEG.png" />
<br />
<em>Patient EG missing her left temporal lobe, which is the seat of language processing in normal brains (containing both Broca's and Wernicke's areas). <a href="https://evlab.mit.edu/assets/papers/Tuckute%20et%20al%202022%20Nplogia.pdf">Source</a></em>
</div>
<p>Being familiar with cases like patient EG, it is reasonable to assume the same compensation and lack of phenotypic deficiency would occur with a missing cerebellum. However, this turns out not to be the case: a missing cerebellum permanently harms much more than fine motor control, including language development and emotion. Here are excerpts from an <a href="https://www.npr.org/sections/health-shots/2015/03/16/392789753/a-man-s-incomplete-brain-reveals-cerebellum-s-role-in-thought-and-emotion">interview</a> with Jonathan, who was born without a cerebellum:</p>
<blockquote>
<p>“All his milestones were late: sitting up, walking, talking.” […] He also lacks the balance to ride a bicycle.</p>
<p>“Reaction time, not my strong suit,” Jonathan says, adding that he doesn’t drive anymore. Emotional complexity is another challenge for Jonathan, says his sister, Sarah Napoline. She says her brother is a great listener, but isn’t introspective.</p>
<p>“He doesn’t really get into this deeper level of conversation that builds strong relationships, things that would be the foundation for a romantic relationship or deep enduring friendships,” she says. Jonathan, who is sitting beside her, says he agrees. – <a href="https://www.npr.org/sections/health-shots/2015/03/16/392789753/a-man-s-incomplete-brain-reveals-cerebellum-s-role-in-thought-and-emotion">Source</a></p>
</blockquote>
<p><strong>Summary</strong> – The fact that the same fundamental cerebellar architecture appears across such diverse species, the growing evidence of its involvement in most cognitive functions, and the correlation between its size and intelligence all indicate that cerebellum-like neuronal architectures perform a crucial and differentiated cognitive operation that spans far beyond motor control.</p>
<p>(Shameless plug: Sparse Distributed Memory is a theory for cerebellar function that is also a close approximation to the state of the art Transformer deep learning architecture: <a href="https://www.trentonbricken.com/Attention-Approximates-Sparse-Distributed-Memory/">Attention Approximates Sparse Distributed Memory</a>.)</p>
<hr />
<p><em>Thanks to <a href="https://james-simon.github.io/">Jamie Simon</a> for spotting a typo. All remaining errors are mine and mine alone.</em></p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Cephalopods" role="doc-endnote">
<p>Note that the extent to which the cephalopod vertical lobe approximates the cerebellum is contested (<a href="https://royalsocietypublishing.org/doi/10.1098/rstb.2015.0055">pro convergence</a>; <a href="https://pubmed.ncbi.nlm.nih.gov/25644267/">contra convergence</a>). <a href="#fnref:Cephalopods" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:LostEinstein" role="doc-endnote">
<p>It is wild to me that there is <em>no</em> consequence of missing an entire lobe of the brain. Across animals, the number of neurons (relative to body size) matters a lot for intelligence. Patient EG is highly intelligent, but it seems plausible that she could have been even smarter if she were not missing a large number of neurons (computational units). <a href="#fnref:LostEinstein" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h3 id="citation">Citation</h3>
<p>If you found this post useful for your research please use this citation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{CerebellumBeyondMotorControl,
title={The Cerebellum Beyond Motor Control},
url={https://www.trentonbricken.com/Cerebellum-Beyond-Motor-Control/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={November}}
</code></pre></div></div>
<hr />
<h1>HRRs Can’t Recast Self Attention</h1>
<p><em>2022-11-19</em></p>
<p><em>Why Holographic Reduced Representations cannot be used to “Recast Self Attention”.</em></p>
<hr />
<p>A paper “<a href="https://kdd-milets.github.io/milets2022/papers/MILETS_2022_paper_5942.pdf">Recasting Self-Attention with Holographic Reduced Representations</a>” was recently posted that claims to use Holographic Reduced Representations (HRRs) to “recast” Transformer Self Attention. While the paper shows some interesting empirical results, I explain why I think the work is flawed in its theoretical underpinnings.</p>
<p>I’m taking the time to write this critique because I believe it is a critical period for Vector Symbolic Architectures (VSAs) to interface with Deep Learning and that this work represents VSAs poorly.</p>
<hr />
<p>For some background, the Transformer Self Attention equation for a single query vector is the following:</p>
\[V \text{softmax}(K^T \mathbf{q}_t)\]
<p>where our values and keys are vectors of dimension \(d\) stored columnwise in matrices \(K \in \mathbb{R}^{d\times T}\), \(V \in \mathbb{R}^{d\times T}\), and we are only considering a single query vector \(\mathbf{q}_t \in \mathbb{R}^{d}\). \(T\) denotes the number of tokens in the receptive field of the model and the subscript \(t\) denotes the current time point from which we are predicting the next token. This time point determines the current query and will become crucial later.</p>
<p>We will write:</p>
\[\mathbf{\hat{a}}_t = K^T \mathbf{q}_t = [ \mathbf{k}_1^T \mathbf{q}_t, \mathbf{k}_2^T \mathbf{q}_t , \dots, \mathbf{k}_T^T \mathbf{q}_t ]^T,\]
<p>to be the attention vector before the softmax operation. Creating this vector takes \(O(dT)\) compute (\(T\) dot products each of \(d\) dimensional vectors).</p>
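<p>As a concrete reference point, the single-query Self Attention update above can be sketched in a few lines of numpy. The dimensions \(d=16\), \(T=8\) are illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 8
K = rng.standard_normal((d, T))  # keys stored columnwise
V = rng.standard_normal((d, T))  # values stored columnwise
q_t = rng.standard_normal(d)     # the single current query

def softmax(x):
    e = np.exp(x - x.max())      # shift for numerical stability
    return e / e.sum()

a_hat = K.T @ q_t                # attention vector before the softmax: O(dT)
y = V @ softmax(a_hat)           # updated representation for time t
print(y.shape)                   # (16,)
```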
<hr />
<p>Now here is what the <a href="https://kdd-milets.github.io/milets2022/papers/MILETS_2022_paper_5942.pdf">paper</a> is doing:</p>
<p>#1. Bind keys and values across the sequence together using the VSA bind operator \(\otimes\) to create the superposition vector \(\mathbf{s}_{kv}\):</p>
\[\mathbf{s}_{kv} = \sum_i^T \mathbf{k}_i \otimes \mathbf{v}_i\]
<p>All you need to know about the bind operator is that it produces another \(d\)-dimensional vector and is invertible: \((\mathbf{a} \otimes \mathbf{b}) \otimes \mathbf{a}^{-1} = \mathbf{b}\).</p>
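<p>For intuition, here is a sketch of the standard HRR bind operator, circular convolution, together with its usual approximate inverse (the involution). The paper's exact implementation may differ, and unbinding with the involution is only approximate, recovering a noisy version of \(\mathbf{b}\):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
# HRR vectors are typically drawn i.i.d. N(0, 1/n)
a = rng.normal(0, 1 / np.sqrt(n), n)
b = rng.normal(0, 1 / np.sqrt(n), n)

def bind(x, y):
    # circular convolution, computed via the FFT
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def approx_inverse(x):
    # involution: x^{-1}[i] = x[-i mod n]
    return np.concatenate(([x[0]], x[:0:-1]))

# (a ⊗ b) ⊗ a^{-1} ≈ b, up to noise that shrinks with n
b_hat = bind(bind(a, b), approx_inverse(a))
cos = b_hat @ b / (np.linalg.norm(b_hat) * np.linalg.norm(b))
print(cos)  # well above the ~0 expected for unrelated random vectors
```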
<p>#2. Create a superposition of all queries across the sequence:</p>
\[\mathbf{s}_q = \sum_i^T \mathbf{q}_i\]
<p>#3. Unbind the query superposition from the key value superposition (this computes the query key dot products between all queries and keys but in superposition):</p>
\[\begin{align}
\mathbf{z} &= \mathbf{s}_{kv} \otimes \mathbf{s}_q^{-1} = \Big ( \sum_i^T ( \mathbf{k}_i \otimes \mathbf{v}_i ) \Big ) \otimes \Big (\sum_i^T \mathbf{q}_i \Big )^{-1} \\
&= \mathbf{q}_1^{-1} \otimes \Big ( \mathbf{k}_1 \otimes \mathbf{v}_1 + \dots + \mathbf{k}_T \otimes \mathbf{v}_T \Big ) + \dots + \mathbf{q}_T^{-1} \otimes \Big ( \mathbf{k}_1 \otimes \mathbf{v}_1 + \dots + \mathbf{k}_T \otimes \mathbf{v}_T \Big )
\end{align}\]
<p>#4. Extract the attention weights by doing a cosine similarity (CS) between each value vector and \(\mathbf{z}\) where \(\epsilon\) is a noise term for everything that doesn’t have the corresponding \(\mathbf{v}_i\) match.</p>
\[\begin{align}
\mathbf{\tilde{a}}_t &= [ \text{CS}(\mathbf{v}_1, \mathbf{z}), \dots, \text{CS}(\mathbf{v}_T, \mathbf{z}) ]^T \\
&= [ \text{CS}(\mathbf{v}_1, \mathbf{v}_1 \otimes \mathbf{k}_1 \otimes \sum_i^T \mathbf{q}_i^{-1} +\epsilon ), \dots, \text{CS}(\mathbf{v}_T, \mathbf{v}_T \otimes \mathbf{k}_T \otimes \sum_i^T \mathbf{q}_i^{-1} +\epsilon) ]^T \\
&\approx [ \sum_i^T \mathbf{k}_1^T \mathbf{q}_i +\epsilon, \dots, \sum_i^T \mathbf{k}_T^T \mathbf{q}_i+\epsilon ]^T
\end{align}\]
<p>Can you spot the difference between this \(\mathbf{\tilde{a}}_t\) and the original Self Attention \(\mathbf{\hat{a}}_t\)?</p>
<p>\(\mathbf{\tilde{a}}_t\) computes the dot product between each key vector and <strong>every</strong> query, not just the current query \(\mathbf{q}_t\) that should be the <em>only</em> query used to predict the next token. This means that every attention weight vector is the same across the entire sequence: \(\mathbf{\tilde{a}}_i = \mathbf{\tilde{a}}_j \; \forall \, i,j \in [1,T]\).</p>
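<p>Setting aside the binding noise \(\epsilon\), the core problem can be seen with plain dot products: the query superposition yields logits that mix every query, and since the superposition is the same regardless of \(t\), the result is identical at every time step. A small numpy sketch (sizes are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 512, 4
K = rng.standard_normal((d, T)) / np.sqrt(d)  # keys, columnwise
Q = rng.standard_normal((d, T)) / np.sqrt(d)  # queries, columnwise

true_at_t = K.T @ Q[:, 3]         # true pre-softmax logits at t = 3
superposed = K.T @ Q.sum(axis=1)  # what the query superposition yields

# The superposed logits mix in every query, so they generally differ
# from the logits at any single time step -- and, not depending on t,
# they are the same for every position in the sequence.
print(np.allclose(true_at_t, superposed))  # False in general
```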
<p>There are two ways to modify this approach so that it is a true recasting of Attention; however, both of them remove the speedup claimed by the paper, leaving only the added noise from \(\epsilon\)!</p>
<p>First, if a masked language setting is implemented correctly<sup id="fnref:Masking" role="doc-noteref"><a href="#fn:Masking" class="footnote" rel="footnote">1</a></sup>, at e.g. \(t=5\) we don’t have access to the keys, queries and values for \(t>5\). This means that as we move across the sequence, we incrementally add queries to our query superposition and keys/values to our key+value superposition, and compute all of the above equations (#1-#4) each time. This gives \(O(dT^2)\) complexity, where \(d\) is the dimensionality and \(T\) is the sequence length (\(dT\) operations for a single query, because we compute a cosine similarity with every value vector, repeated for each incremental query in the sequence).</p>
<p>Second, rather than adding more vectors to the superposition, making it noisier, we can keep each query separate when we perform the above operations. However, this is again \(O(dT^2)\) complexity and reveals how using VSAs here doesn’t make sense. We bind together every key and value vector to compute a noisy dot product with the query in superposition, only to then unbind all of them again? This is more expensive and noisier than merely doing a dot product between every key and query as in the original attention operation!</p>
<p>To conclude, while I share with the paper authors the desire to integrate VSAs into Deep Learning, the way it has been done here is ineffective and misleading. What is created is not a re-casting of the Attention operation. It is surprising that it does better than baselines on some idiosyncratic benchmarks and this may also be due to an implementation error.</p>
<p>Please reach out if I am missing something about this paper as I am happy to discuss it and revise this blog post.</p>
<hr />
<p><em>Thanks to <a href="https://redwood.berkeley.edu/people/denis-kleyko/">Denis Kleyko</a> for helpful comments and discussion. All remaining errors are mine and mine alone.</em></p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Masking" role="doc-endnote">
<p>I am concerned that the results in the paper that beat benchmarks are the result of incorrect masking. <a href="#fnref:Masking" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h3 id="citation">Citation</h3>
<p>If you found this post useful for your research please use this citation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{HHRsRecastingAttention,
title={HRRs Can't Recast Self Attention},
url={https://www.trentonbricken.com/Contra-Recasting-Attention/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={November}}
</code></pre></div></div>
<hr />
<h1>Sparse Distributed Memory and Hopfield Networks</h1>
<p><em>2022-10-18</em></p>
<p><em>How Hopfield Networks are a special case of the biologically plausible Sparse Distributed Memory.</em></p>
<hr />
<p>Going off citation count for their original, seminal papers, <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">Hopfield Networks</a> are ~24x more popular than <a href="https://mitpress.mit.edu/9780262514699/sparse-distributed-memory/">Sparse Distributed Memory</a> (SDM) (24,362 citations versus 1,337). I think this is a shame because SDM can not only be viewed as a generalization of both the original and more modern Hopfield Networks but also passes a higher bar for biological plausibility – having a one-to-one mapping to the circuitry of the cerebellum. Additionally, like <a href="https://arxiv.org/abs/2008.02217">Hopfield Networks</a>, SDM has been shown to <a href="https://arxiv.org/abs/2111.05498">closely relate</a> to the powerful Transformer deep learning architecture.</p>
<p>In this blog post, we first provide background on Hopfield Networks. We then review how Sparse Distributed Memory (SDM) is a more general form of the original Hopfield Network. Finally, we provide insight into how modern improvements to the Hopfield Network modify the weighting of patterns, making them even more convergent with SDM. (In other words, SDM can be seen as pre-empting modern Hopfield Net innovations).</p>
<div align="center">
<img width="1200" src="../images/HopfieldSDM/Frame 6.png" />
<br />
<em>Summary of how modifications to SDM and Hopfield Networks relate them to Transformers and the Brain. Question marks denote uncertain but potential links.</em>
</div>
<h2 id="background-on-sdm-and-hopfield-networks">Background on SDM and Hopfield Networks</h2>
<p>The fundamental difference between SDM and Hopfield Networks lies in the primitives they use. In SDM, the core primitive is neurons that patterns are written into and read from. Hopfield Networks do a figure-ground inversion, where the core primitive is patterns and it is from their storage/retrieval that neurons implicitly appear.</p>
<p>To make this more concrete, we first provide a quick background on how SDM works:</p>
<h3 id="sdm-background">SDM Background</h3>
<p><em>If you like videos then watch the first 10 mins of <a href="https://www.youtube.com/watch?v=THIIk7LR9_8">this talk</a> I gave on how SDM works and skip the rest of this section.</em></p>
<div align="center">
<img width="700" src="../images/HopfieldSDM/Frame 43.png" />
<br />
<em>Summary SDM write operations (top row) and read operations (bottom row). The bottom left sub-figure shows the neuron view. The bottom right sub-figure shows how the neurons can be abstracted away and the original pattern locations considered.</em>
</div>
<p>To keep things simple, we will introduce the continuous version of SDM, where all neurons and patterns exist on the \(L^2\) unit norm hypersphere and cosine similarity is our distance metric. The original version of SDM used binary vectors and the Hamming distance metric.</p>
<p>SDM randomly initializes the addresses of \(r\) neurons on the \(L^2\) unit hypersphere in an \(n\) dimensional space. These neurons have addresses that each occupy a column in our address matrix \(X_a \in (L^2)^{n\times r}\), where \(L^2\) is shorthand for all \(n\)-dimensional vectors existing on the \(L^2\) unit norm hypersphere. Each neuron also has a storage vector used to store patterns represented in the matrix \(X_v \in \mathbb{R}^{o\times r}\), where \(o\) is the output dimension.</p>
<p>Patterns also have addresses constrained on the \(n\)-dimensional \(L^2\) hypersphere that are determined by their encoding; pattern encodings can be as simple as flattening an image into a vector or as complex as preprocessing the image through a deep convolutional network.</p>
<p>Patterns are stored by activating all nearby neurons within a cosine similarity threshold \(c\), and performing an elementwise summation with the activated neurons’ storage vector. Depending on the task at hand, patterns write themselves into the storage vector (e.g., during a reconstruction task) or write another pattern, possibly of different dimension (e.g., writing in their one hot label for a classification task).</p>
<p>Because in most cases we have fewer neurons than patterns, the same neuron will be activated by multiple different patterns. This is handled by storing the pattern values in superposition via the aforementioned elementwise summation operation. The fidelity of each pattern stored in this superposition is a function of the vector orthogonality and dimensionality \(n\).</p>
<p>Using \(m\) to denote the number of patterns, matrix \(P_a \in (L^2)^{n\times m}\) for the pattern addresses, and matrix \(P_v \in \mathbb{R}^{o\times m}\) for the values the patterns want to write, the SDM write operation is:
\begin{align}
\label{eq:SDMWriteMatrix}
X_v = P_v b_c \big ( P_a^T X_a \big ), \qquad
b_c(e)=
\begin{cases}
1, & \text{if}\ e \geq c \\
0, & \text{otherwise}
\end{cases}
\end{align}
where \(b_c(e)\) performs an element-wise binarization of its input to determine which pattern and neuron addresses are within the cosine similarity threshold \(c\) of each other.</p>
<p>Having written patterns into our neurons, we read from the system by inputting a query \(\boldsymbol{\xi}\), that again activates nearby neurons. Each activated neuron outputs its storage vector and they are all summed elementwise to give a final output \(\mathbf{y}\). The output \(\mathbf{y}\) can be interpreted as an updated query and optionally \(L^2\) normalized again as a post processing step:</p>
<p>\begin{align}
\label{eq:SDMReadMatrix}
\mathbf{y} = X_v b_c \big ( X_a^T \boldsymbol{\xi} \big).
\end{align}</p>
<p>Intuitively, SDM’s query will update towards the values of the patterns with the closest addresses. This is because the closest patterns will have written their values into more of the neurons the query reads from than any competing patterns. For example, in the summary figure above, the blue pattern address is closest to the query, meaning it appears most often in the nearby neurons the query reads from.</p>
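<p>The write and read equations above can be sketched directly in numpy. The dimensions, threshold \(c\), and query noise level here are illustrative choices, not values from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, m = 256, 1000, 5   # dimensions, neurons, patterns
c = 0.1                  # cosine similarity activation threshold

def l2_normalize(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)

X_a = l2_normalize(rng.standard_normal((n, r)))  # neuron addresses
P_a = l2_normalize(rng.standard_normal((n, m)))  # pattern addresses
P_v = P_a.copy()          # autoassociative: values = addresses

def b(E):
    # Heaviside threshold b_c applied elementwise
    return (E >= c).astype(float)

# Write: X_v = P_v b_c(P_a^T X_a)
X_v = P_v @ b(P_a.T @ X_a)

# Read with a slightly noisy version of pattern 0 as the query
noise = rng.standard_normal((n, 1))
xi = l2_normalize(P_a[:, [0]] + 0.3 * noise / np.linalg.norm(noise))
y = X_v @ b(X_a.T @ xi)
y = y / np.linalg.norm(y)  # optional L2 post-processing

# The query updates towards the closest stored pattern (index 0)
print(int(np.argmax(P_a.T @ y)))
```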
<p>Another potentially useful perspective is to see SDM as a single hidden layer MLP with a few modifications:</p>
<ul>
<li>All neuron addresses \(X_a\) are randomly initialized and fixed.</li>
<li>All \(X_a\), \(P_a\) patterns and \(\boldsymbol{\xi}\) queries are \(L^2\) normalized.</li>
<li>The activation function is a Heaviside step: neurons above the activation threshold output 1 and all others output 0 (this is not differentiable).</li>
<li>The neurons have no bias term.</li>
</ul>
<p>In a <a href="https://openreview.net/forum?id=JknGeelZJpHP">recent paper</a>, we resolve a number of these modifications to make SDM more compatible with Deep Learning. This results in a model that is very good at continual learning!</p>
<div align="center">
<img width="700" src="../images/HopfieldSDM/Frame 76.png" />
<br />
<em>How our SDM notation can be mapped onto the structure of a single hidden layer MLP.</em>
</div>
<p>In <a href="https://arxiv.org/abs/2111.05498">another paper</a>, we show that the SDM update rule is closely approximated by the softmax update rule used by Transformer Attention. This will be relevant later with newer versions of Hopfield Networks that also show this relationship.</p>
<h3 id="hopfield-background">Hopfield Background</h3>
<p>Hopfield Networks before the modern continuous version all use bipolar vectors \(\bar{\mathbf{a}} \in \{-1,1\}^n\) where the bar denotes bipolarity. The Hopfield Network update rule in matrix form, in its typical auto-associative format where \(P_v=P_a\) is:<sup id="fnref:HopfieldAutoAssoc" role="doc-noteref"><a href="#fn:HopfieldAutoAssoc" class="footnote" rel="footnote">1</a></sup></p>
<p>\begin{equation}
\label{eq:HopfieldUpdateRule}
\bar{\mathbf{y}} = \bar{g}( \bar{P}_a \bar{P}_a^T \bar{\boldsymbol{\xi}} ) \qquad \bar{g}(e)=
\begin{cases}
1, & \text{if}\ e > 0 \\
-1, & \text{otherwise}
\end{cases}
\end{equation}</p>
<p>Interpreting this equation from the perspective of SDM, we first compute a re-scaled and approximate version of the cosine similarity between the query and pattern addresses, \(\bar{P}_a^T \bar{\boldsymbol{\xi}}\). This works because taking the dot product between bipolar vectors and dividing by \(n\) converts the interval \([-n, n]\) to the \([-1,1]\) of cosine similarity:</p>
<p>\begin{equation}
\label{eq:BipolarHammConversion}
\mathbf{x}^T \mathbf{y} \approx \frac{\bar{\mathbf{x}}^T\bar{\mathbf{y}}}{n}
\end{equation}</p>
<p>This relationship is exact between binary SDM and bipolar Hopfield Networks; we leave the details to Appendix B.6 of our <a href="https://arxiv.org/abs/2111.05498">paper</a>.</p>
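<p>The re-scaling can be checked in a couple of lines: for bipolar vectors, dividing the dot product by \(n\) gives exactly the cosine similarity of their \(L^2\)-normalized counterparts, since every bipolar vector has norm \(\sqrt{n}\):</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 64
x_bar = rng.choice([-1.0, 1.0], size=n)
y_bar = rng.choice([-1.0, 1.0], size=n)

# cosine similarity of the normalized vectors ...
cos = (x_bar / np.linalg.norm(x_bar)) @ (y_bar / np.linalg.norm(y_bar))
# ... equals the bipolar dot product divided by n
print(np.isclose(cos, (x_bar @ y_bar) / n))  # True
```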
<p>We use this distance metric to weight each pattern before mapping back into our bipolar space with \(\bar{g}(\cdot)\).</p>
<p>Instead of first multiplying the query with the pattern addresses (\(\bar{P}_a^T \bar{\boldsymbol{\xi}}\)) like in SDM, Hopfield Networks instead typically perform \(\bar{P}_a \bar{P}_a^T=M\) which gives us a symmetric, \(n \times n\) dimensional matrix. We can interpret this symmetric matrix \(M\) as containing \(n\) neurons where each neuron’s address is a row and its value vector is the corresponding column, which by symmetry is the row transpose. Therefore, the number of neurons is defined by the pattern dimensionality \(n\) and the neuron address and value vectors are derived from \(\bar{P}_a \bar{P}_a^T\). This is how neurons emerge from the patterns in Hopfield Networks.</p>
<p>What is most distinct about this operation in comparison with SDM is that there is no activation threshold between the patterns and query (\(\bar{P}_a^T \bar{\boldsymbol{\xi}}\)). As a result, every pattern has an effect on the update rule including positive attraction and negative repulsion forces. We will see how more modern versions of the Hopfield Network have in fact re-implemented activation thresholds that are increasingly reminiscent of that used by SDM.</p>
<h2 id="sdm-as-a-generalization-of-hopfield-networks">SDM as a Generalization of Hopfield Networks</h2>
<p>It was first established by <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">Keeler</a> that SDM is a generalization of Hopfield Networks. <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">Hopfield Networks</a> can be represented by SDM’s neuron primitives as a special case where there are no distributed reading or writing operations. In this special case, the read operation weights each pattern by the bipolar re-scaling of its cosine similarity to the query.</p>
<p>The Hopfield version of SDM has neurons centered at each pattern so \(r=m\) and \(X_a=P_a\). Distributed writing and reading are removed by setting the cosine similarity threshold for writing to \(d_\text{write}=1\) and for reading to \(d_\text{read}=-1\).</p>
<p>This means patterns are only written to the neurons centered at them and the query reads from every neuron. In addition, for reading, rather than binary neuron activations using the Heaviside function \(b(\cdot)\), neurons are weighted by the bipolar version of the cosine similarity given by \(\bar{\mathbf{x}}^T\bar{\mathbf{y}}\). Writing out the SDM read operation in full with bipolar vectors that make our normalizing constant unnecessary, we have:</p>
\[\bar{\mathbf{y}} = \bar{g} \Big ( \bar{X}_v b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{P}_a}{n} \big ) \Big ) = \bar{g} \Big ( \underbrace{ \bar{P}_v b_{d_\text{write}}( \frac{\bar{X}_a^T \bar{P}_a}{n})}_{\text{Write Patterns}} \underbrace{ b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{P}_a}{n} \big ) }_{\text{Read Patterns}} \Big )\]
<p>Looking first at the write operation where \(d_\text{write}=1\):</p>
\[X_v = P_v b_c \big ( P_a^T X_a \big ) = \bar{P}_v b_{d_\text{write}}(\frac{\bar{X}_a^T\bar{P}_a}{n})=\bar{P}_v I = \bar{P}_v=\bar{P}_a\]
<p>where \(I\) is the identity matrix and \(P_v=P_a\) in the typical autoassociative setting. For the read operation we remove the threshold and cosine similarity re-scaling:</p>
\[b_{d_\text{read}}(\frac{\bar{X}_a^T\bar{\boldsymbol{\xi}} }{n}) = \bar{X}_a^T \bar{\boldsymbol{\xi}} =\bar{P}_a^T \bar{\boldsymbol{\xi}}.\]
<p>Together, these modifications turn SDM using neuron representations into the original Hopfield Network:</p>
\[\bar{g} \Big ( \underbrace{ \bar{P}_v b_{d_\text{write}}( \frac{\bar{X}_a^T \bar{P}_a}{n})}_{\text{Write Patterns}} \underbrace{ b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{P}_a}{n} \big ) }_{\text{Read Patterns}} \Big ) = \bar{g}( \bar{P}_a \bar{P}_a^T \bar{\boldsymbol{\xi}} )\]
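<p>A quick numerical check of this reduction (sizes are illustrative): with \(\bar{X}_a=\bar{P}_a\), a write threshold of 1 turns \(b_{d_\text{write}}\) into the identity matrix, and reading without a threshold reproduces the original Hopfield update:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 7
P_a = rng.choice([-1.0, 1.0], size=(n, m))  # bipolar pattern addresses
xi = rng.choice([-1.0, 1.0], size=n)        # bipolar query

def g(e):
    # bipolar sign non-linearity
    return np.where(e > 0, 1.0, -1.0)

# Hopfield update: g(P_a P_a^T xi)
hopfield = g(P_a @ (P_a.T @ xi))

# SDM special case with X_a = P_a: write threshold 1 keeps only each
# pattern's own neuron (an identity matrix, since distinct random bipolar
# patterns never reach similarity 1) ...
b_write = ((P_a.T @ P_a) / n >= 1.0).astype(float)
X_v = P_a @ b_write  # = P_a
# ... and reading with no threshold weights every neuron by raw similarity
sdm = g(X_v @ (P_a.T @ xi))

print(np.array_equal(hopfield, sdm))  # True
```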
<h2 id="differences-between-sdm-and-hopfield-networks">Differences between SDM and Hopfield Networks</h2>
<p>While Hopfield Networks have traditionally been presented and used in an autoassociative fashion, by using a synchronous update rule they can also be heteroassociative, \(\bar{\mathbf{y}} = \bar{g}( \bar{P}_v \bar{P}_a^T \bar{\boldsymbol{\xi}} )\), though they do not work <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">as well</a>. A workaround is to operate autoassociatively but concatenate together the pattern address and pointer, as introduced <a href="https://proceedings.neurips.cc/paper/2016/file/eaae339c4d89fc102edd9dbdb6a28915-Paper.pdf">here</a>. However, this concatenation is less biologically plausible than SDM’s solution of separate input lines capable of carrying different key and value vectors.</p>
<p>A second and more important difference is that Hopfield Networks are biologically implausible because of the weight transport problem, whereby the afferent and efferent weights are symmetric. At the expense of biology, these symmetric weights allow for the derivation of an energy landscape that can be used for convergence proofs and to solve optimization problems like the <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">Traveling Salesman</a>. Meanwhile, SDM is not only free from weight symmetry but also has its mapping onto the cerebellum outlined <a href="https://redwood.berkeley.edu/wp-content/uploads/2020/08/KanervaP_SDMrelated_models1993.pdf">here</a>.</p>
<p>A third difference between SDM and Hopfield Networks lies in how they weight their patterns. We can interpret the Hopfield update as computing the similarity \(P_a^T \boldsymbol{\xi}\) which, because of the bipolar values, has a maximum of \(n\), a minimum of \(-n\), and moves in increments of 2 (flipping from a +1 to -1 or vice versa). This distance metric between each pattern and the query results in the query being attracted to similar patterns and repulsed from dissimilar ones. Thus all patterns, aside from those that are completely orthogonal, contribute to the query update. In SDM, by contrast, it has been <a href="https://arxiv.org/abs/2111.05498">shown</a> that nearby patterns receive approximately exponential weighting while those further than \(d(\boldsymbol{\xi},\mathbf{p_a})>2d\) receive no weighting at all. We will expand upon this weighting and how modern improvements to Hopfield Networks have made them a closer relation to SDM.</p>
<p>Finally, both Hopfield Networks and SDM, while using different parameters, have been <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">shown</a> to have the same information content as a function of their parameter count. Yet, these parameters are used in different ways because of SDM’s neuron primitive. Notably, SDM can increase its memory capacity up to a limit by increasing the number of neurons \(r\) rather than needing to increase \(n\), the dimensionality of the patterns being stored.</p>
<h2 id="modern-hopfield-networks-approximate-sdm">Modern Hopfield Networks Approximate SDM</h2>
<div align="center">
<img width="700" src="../images/HopfieldSDM/Frame 113.png" />
<br />
<br />
<em>The evolution of the Hopfield Network update rule over time. The transition from the <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">original Hopfield Network</a> to the <a href="https://proceedings.neurips.cc/paper/2016/file/eaae339c4d89fc102edd9dbdb6a28915-Paper.pdf">Polynomial (Modern) Hopfield Network</a> applied a non-linearity to the dot product between the query and each of the patterns. This is reminiscent of SDM's activation threshold. This was then improved upon by making the non-linearity into an <a href="https://mathematical-neuroscience.springeropen.com/articles/10.1186/s13408-017-0056-2">exponential</a> and then the <a href="https://arxiv.org/abs/2008.02217">softmax</a>! This is a very close approximation to the weighting that SDM applies to each pattern.</em>
</div>
<p>The binary modern Hopfield Network introduced by <a href="https://proceedings.neurips.cc/paper/2016/file/eaae339c4d89fc102edd9dbdb6a28915-Paper.pdf">Krotov and Hopfield</a> showed that using higher order polynomials to assign new pattern weightings in the read operation increases capacity. In particular, odd and rectified polynomials put more weight on memories that are closer to the query, better separating the signal of the target pattern from the noise of all others. The energy function to be minimized is:</p>
\[E=-\sum_{\bar{\mathbf{p}}\in \bar{P}} K_a \Big( \bar{\mathbf{p}}_a \bar{\boldsymbol{\xi}} \Big )=-\sum_{\bar{\mathbf{p}}\in \bar{P}} K_a \Big( \sum_i^n [\bar{\mathbf{p}}_a]_i \bar{\boldsymbol{\xi}}_i \Big )\]
<p>where:</p>
\[K_a(x)=
\begin{cases}
x^a, & x \geq 0 \\
0, & x < 0
\end{cases}\]
<p>with the rectification for \(x<0\) being optional and \(a\) being the order of the polynomial. It can be shown that the original Hopfield Network energy function used a second order polynomial, \(a=2\). The query updates its bit in position \(i\) by comparing the difference in energy if this bit were “on” or “off” (1 and -1 when bipolar, as here):</p>
\[\label{eq:ModernHopfieldEnergyEquation}
\bar{\mathbf{y}}_i = \text{Sign}\Bigg[ \sum_{\bar{\mathbf{p}}\in \bar{P}} \Big ( K \big( [\bar{\mathbf{p}}_a]_i + \sum_{j \neq i} [\bar{\mathbf{p}}_a]_j \bar{\boldsymbol{\xi}}_j \big ) - K \big( - [\bar{\mathbf{p}}_a]_i + \sum_{j \neq i} [\bar{\mathbf{p}}_a]_j \bar{\boldsymbol{\xi}}_j \big ) \Big ) \Bigg]\]
<p>Whichever configuration between “on” and “off” gives the highest output from \(K(x)\), corresponding to a lower energy state, will be updated towards.</p>
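<p>The update rule above can be sketched directly in NumPy. The choice of \(a=3\), the rectification, and the tiny pattern matrix are illustrative assumptions; \(a=2\) without rectification recovers the classical behavior.</p>

```python
import numpy as np

def K(x, a=3, rectified=True):
    """Rectified polynomial K_a(x): x**a for x >= 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x ** a, 0.0) if rectified else x ** a

def update_bit(P, xi, i, a=3):
    """Set query bit i to whichever value (+1 or -1) lowers the energy:
    the sign of the summed K differences over all stored patterns (rows of P).
    Returns 0 only in the rare tied case."""
    mask = np.arange(P.shape[1]) != i
    partial = P[:, mask] @ xi[mask]   # sum_{j != i} p_j * xi_j, per pattern
    diff = K(P[:, i] + partial, a) - K(-P[:, i] + partial, a)
    return int(np.sign(diff.sum()))

# A corrupted bit gets pulled back toward the stored pattern:
P = np.array([[1, 1, 1, 1]])          # one stored pattern (row)
xi = np.array([1, 1, 1, -1])          # last bit corrupted
```

<p>Here <code class="language-plaintext highlighter-rouge">update_bit(P, xi, 3)</code> returns <code class="language-plaintext highlighter-rouge">1</code>, restoring the corrupted bit: the “on” configuration scores \(K(1+3)=64\) against \(K(-1+3)=8\) for “off”.</p>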
<p>Using even polynomials in the energy difference means that attraction to similar patterns (closer than orthogonal) is rewarded as much as pushing opposite patterns (further than orthogonal) further away. Odd polynomials also reward attraction to similar patterns but instead reward reducing the distance to opposite patterns, trying to make them orthogonal. In other words, an even polynomial would rather have a vector be opposite than orthogonal, while it is the other way around for an odd polynomial. Meanwhile, rectification of the polynomial, which empirically resulted in a higher memory capacity, simply ignores all opposite patterns. Finally, the capacity of this network was further improved by replacing the polynomial with an exponential function \(K(x)=\exp(x)\) <a href="https://mathematical-neuroscience.springeropen.com/articles/10.1186/s13408-017-0056-2">here</a>.</p>
<p>Fundamentally, these odd, rectified, and exponential \(K(x)\) functions make the Hopfield Network more similar to SDM by introducing activation thresholds around the query, either ignoring (in the rectified and exponential cases) or penalizing (in the odd polynomial case) vectors that are further than orthogonal distance away. The weighting of vectors remains different between the polynomials and the exponential, such that they have different information storage capacities. However, by introducing these de facto cosine similarity thresholds, all of these Hopfield variants converge with the read operation of SDM. This is particularly the case for the exponential variant because SDM weights its patterns in an approximately <a href="https://arxiv.org/abs/2111.05498">exponential fashion</a>.</p>
<p>Beyond the convergence in pattern weightings, we note that the last step in making the exponential variant of the modern Hopfield Network into Transformer Attention is to make it continuous in the paper <a href="https://arxiv.org/abs/2008.02217">“Hopfield Networks is All You Need”</a>. Making SDM continuous is the same step taken in <a href="https://arxiv.org/abs/2111.05498">our work</a> that results in the Attention approximation to SDM. This is done by modifying the energy function so that it enforces a vector norm (<a href="https://arxiv.org/abs/2111.05498">SDM does this too</a>) and then using the Concave Convex Optimization Procedure (CCP) to find local minima (flipping bits is no longer an option).</p>
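<p>A minimal sketch of the resulting continuous read, \(\bar{\mathbf{y}} = \bar{P}_v \, \text{softmax}(\beta \bar{P}_a^T \bar{\boldsymbol{\xi}})\), which is Transformer Attention with the stored addresses as keys and the pointers as values. The \(\beta\) value and sizes below are illustrative assumptions.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

def attention_read(P_a, P_v, xi, beta=8.0):
    """Weight each stored value by a softmax over its address's similarity
    to the query: nearby patterns get ~exponentially more weight, and
    patterns beyond a soft threshold get effectively none."""
    return P_v @ softmax(beta * (P_a.T @ xi))

# With orthogonal unit addresses and a large beta, the read is nearly
# a one-pattern lookup:
P_a = np.eye(3)                           # addresses as columns
P_v = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])         # values as columns
y = attention_read(P_a, P_v, np.array([1.0, 0.0, 0.0]))
```

<p>Lowering \(\beta\) softens the threshold so that more distant patterns contribute, mirroring a larger SDM read radius.</p>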
<p>The recent paper <a href="https://arxiv.org/abs/2202.04557">“Universal Hopfield Networks”</a> (disclaimer: I am in the acknowledgements for this paper and the first author is a friend) makes a similar point relating different activation functions that have emerged in Hopfield Networks to Attention but does not dive into the relationship between SDM and Hopfield Networks and their chronological evolution.</p>
<p>Finally, there are weak indications of convergence between the range of optimal SDM cosine similarity thresholds \(d\) and optimal polynomial orders \(a\) in the polynomial Hopfield Network. When pattern representations were learnt, it was found that the polynomial maximizing accuracy on a classification task was neither too small nor too large. This is also the case for the optimal \(d\); the two are related via their effect on the pattern weightings. The authors offer a useful interpretation of their system where some of the vectors serve as prototypes that lie very close to the solution, while other vectors represent features shared by different components. The best solution interpolated between these two approaches.</p>
<p>One can view the prototypes as anchoring the solution and the features as providing generalization ability. Even from SDM’s neuron perspective, the same intuition may apply: it can be advantageous to read some prototypes from nearby neurons while also collecting features from related patterns stored in more distant neurons. <a href="https://redwood.berkeley.edu/wp-content/uploads/2020/08/KanervaP_SDMrelated_models1993.pdf">This overview of SDM</a> gives a related example in which noisy patterns are stored and their combination reads out as a noise-free version, echoing this idea of drawing on features from related patterns.</p>
<h2 id="citation">Citation</h2>
<p>If you found this post useful for your research please use this citation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{SDMHopfieldNetworks,
title={Sparse Distributed Memory and Hopfield Networks},
url={https://www.trentonbricken.com/SDM-And-Hopfield-Networks/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={October}}
</code></pre></div></div>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:HopfieldAutoAssoc" role="doc-endnote">
<p>The Hopfield Network is almost always autoassociative so \(P_v=P_a=X_a=X_v\) (<a href="https://arxiv.org/abs/2111.05498">citation</a>, <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">citation</a>). <a href="#fnref:HopfieldAutoAssoc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><em>Thanks to the NSF Foundation and the Kreiman Lab for making this research possible. Also to <a href="https://twitter.com/DimaKrotov">Dmitry Krotov</a> and <a href="https://twitter.com/BerenMillidge">Beren Millidge</a> for useful conversations. All remaining errors are mine and mine alone.</em></p>
<h1 id="improving-weight-regularization-continual-learning-baselines">Improving Weight Regularization Continual Learning Baselines</h1>
<p><em>2022-10-04 · <a href="http://trentonbricken.github.io/Continual-Learning-Baselines">trentonbricken.github.io/Continual-Learning-Baselines</a></em></p>
<p><em>A simple modification to improve Elastic Weight Consolidation and Synaptic Intelligence continual learning baselines.</em></p>
<hr />
<p>A number of influential continual learning algorithms like <a href="https://arxiv.org/abs/1612.00796">Elastic Weight Consolidation</a> (EWC) and <a href="https://arxiv.org/abs/1703.04200?context=cs">Synaptic Intelligence</a> (SI) protect neural network weights important for previous tasks from being updated by newer tasks. Otherwise, these network weights are overwritten by the next task, resulting in catastrophic forgetting.</p>
<p>It turns out that existing implementations of these algorithms have, at least for some tasks, been significantly <strong>underestimating</strong> their performance. This is because they need a small modification to be able to actually learn which network weights need protecting. In a Split MNIST class incremental benchmark, this modification leads to 43% higher validation accuracy.</p>
<p>These weight regularization methods use the magnitude of gradients during the backwards pass to infer which weights are important for a particular task. They then use a regularization term in the loss function to penalize the model for updating these weights during new tasks. However, in cases where the model performs very well on all training data within a task, there is almost no gradient for the model to be able to learn which weights are important!</p>
<p>In order to restore gradient information, we introduce a \(\beta\) coefficient into the cross-entropy loss function when learning weight importances. This is used to make the model less confident in its predictions. A hyperparameter sweep found that \(\beta = 0.005\) worked the best for both EWC and SI, leading to 43% and 15% performance gains as shown in the below figure.</p>
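<p>To see the mechanism, note that the cross-entropy gradient with respect to the logits is \(\text{softmax}(\mathbf{z}) - \text{onehot}\), which vanishes once the correct class dominates. The sketch below is an illustration, assuming \(\beta\) scales the logits before the softmax (consult the paper for the exact implementation); it shows a small \(\beta\) restoring a usable gradient signal.</p>

```python
import numpy as np

def ce_grad_wrt_logits(logits, target, beta=1.0):
    """Gradient of CE(beta * logits) w.r.t. the logits:
    beta * (softmax(beta * logits) - onehot(target))."""
    z = beta * logits
    p = np.exp(z - z.max())   # stable softmax
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return beta * (p - onehot)

# A very confident (correct) prediction: logit margin of 100.
logits = np.array([100.0, 0.0, 0.0])
g_plain = ce_grad_wrt_logits(logits, target=0)             # vanishingly small
g_beta = ce_grad_wrt_logits(logits, target=0, beta=0.005)  # usable signal
```

<p>With \(\beta=1\) the gradient magnitude is on the order of \(e^{-100}\); with \(\beta=0.005\) it is on the order of \(10^{-3}\), enough signal for the weight importances to be learned.</p>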
<p><img src="../images/ContinualLearningBeta/MainTable.png" alt="BetaTable" /></p>
<p>This result is presented in the paper <a href="https://openreview.net/forum?id=JknGeelZJpHP">Sparse Distributed Memory is a Continual Learner</a>, which relies upon a version of MNIST where the images use pixel values [0,255] that are not rescaled to [0, 1] or normalized to have a mean of 0. We believe this makes the MNIST classes more orthogonal and easier for continual learning.<sup id="fnref:IntegerMNIST" role="doc-noteref"><a href="#fn:IntegerMNIST" class="footnote" rel="footnote">1</a></sup> However, even when the pixels are rescaled, there is still a performance boost (going from 27% when \(\beta=1.0\) to 54% when \(\beta=0.005\)), and there are also performance gains for Split CIFAR10.</p>
<p>The \(\beta\) coefficient will not increase performance independently of other parameters. In both of the above cases, we also use gradient clipping, SGD with momentum, and a single hidden layer of neurons. However, with these other modifications in place, \(\beta\) remains worth experimenting with to potentially get significant performance gains. Our results are better than those of any other paper we have seen that uses EWC or SI as baselines, including <a href="https://arxiv.org/abs/1810.12488">Re-evaluating Continual Learning Scenarios: A Categorization and Case for <strong>Strong</strong> Baselines</a> (emphasis mine).</p>
<hr />
<h2 id="replication">Replication</h2>
<p>There are two ways to reproduce our results: 1. Clone <a href="https://github.com/TrentBrick/Continual-Learning-Benchmark">this</a> GitHub repo which is a forked version of the original continual learning baseline and run the following command (after installing the requirements):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-u</span> iBatchLearn.py <span class="nt">--gpuid</span> 0 <span class="se">\</span>
<span class="nt">--repeat</span> 1 <span class="nt">--incremental_class</span> <span class="nt">--optimizer</span> SGD <span class="se">\</span>
<span class="nt">--momentum</span> 0.9 <span class="nt">--weight_decay</span> 0.0 <span class="nt">--force_out_dim</span> 10 <span class="se">\</span>
<span class="nt">--no_class_remap</span> <span class="nt">--first_split_size</span> 2 <span class="nt">--other_split_size</span> 2 <span class="se">\</span>
<span class="nt">--schedule</span> 4 <span class="nt">--batch_size</span> 512 <span class="nt">--model_name</span> MLP1000 <span class="se">\</span>
<span class="nt">--agent_type</span> customization <span class="nt">--agent_name</span> EWC_mnist <span class="se">\</span>
<span class="nt">--lr</span> 0.03 <span class="nt">--reg_coef</span> 20000 <span class="nt">--use_beta_coef</span> True <span class="se">\</span>
<span class="nt">--beta_coef</span> 0.005
</code></pre></div></div>
<p>This uses the version of Split MNIST with [0,1] pixel values.</p>
<p>(if you are running on CPU set <code class="language-plaintext highlighter-rouge">--gpuid -1</code>).</p>
<ol start="2">
<li>Clone <a href="https://github.com/anon8371/AnonPaper1">this</a> repo and in <code class="language-plaintext highlighter-rouge">test_runner.py</code> set:</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_style</span> <span class="o">=</span> <span class="n">ModelStyles</span><span class="p">.</span><span class="n">CLASSIC_FFN</span>
</code></pre></div></div>
<p>and</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cl_baseline</span> <span class="o">=</span> <span class="s">"EWC-MEMORY"</span><span class="p">,</span>
</code></pre></div></div>
<p>Then call: <code class="language-plaintext highlighter-rouge">python test_runner.py</code>!</p>
<p>This uses the version of Split MNIST with [0,255] pixel values.</p>
<h2 id="citation">Citation</h2>
<p>If you found this post useful for your research please cite the paper: <a href="https://openreview.net/forum?id=JknGeelZJpHP">Sparse Distributed Memory is a Continual Learner</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{SDMContinualLearning,
title={Sparse Distributed Memory is a Continual Learner},
url={https://openreview.net/forum?id=JknGeelZJpHP},
journal={ICLR Submission},
author={Anonymous},
year={2022}, month={September}}
</code></pre></div></div>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:IntegerMNIST" role="doc-endnote">
<p>This may explain why careful tuning of our regularization coefficient for the algorithms <a href="https://arxiv.org/abs/1711.09601">Memory Aware Synapses</a> and <a href="https://arxiv.org/abs/1312.6211">L2 Regularization</a> also led to slightly higher performance than reported in the <a href="https://arxiv.org/abs/1810.12488">baselines paper</a>. <a href="#fnref:IntegerMNIST" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><em>Thanks to the NSF Foundation and the Kreiman Lab for making this research possible. All remaining errors are mine and mine alone.</em></p>
<h1 id="reflections-two-years-into-the-phd">Reflections two years into the PhD</h1>
<p><em>2022-09-12 · <a href="http://trentonbricken.github.io/PhD-Reflections">trentonbricken.github.io/PhD-Reflections</a></em></p>
<p><em>It is absurd that the second year of my PhD has come to an end. I am taking a minute to reflect on what I have learnt and want to improve upon.</em></p>
<hr />
<p>Overall I am super happy and feel very lucky to be in my PhD program. On an average week I have ~2 hours of commitments in total. The rest of the time is totally unstructured where I am free to learn and pursue whatever it is that I’m interested in. My interests are centered enough around my research that I manage to keep my supervisor and collaborators happy (at least to date!). But I still have lots of free time to explore and learn about so many other things. I doubt I’ll ever have this degree of freedom and flexibility again in my life.</p>
<p><strong>Lessons:</strong></p>
<ul>
<li>Make feedback loops as tight as possible. This is both in the quality of data produced from an experiment and the speed at which it can be run. Your first experiment is never going to work, everything in the real world is messy and requires fine tuning. Build a system so that it can scale and run very fast.
<ul>
<li>Getting more meta, your first system to run things quickly will certainly not be all that fast so also invest in maintaining and refactoring it over time. Expect growing pains.</li>
<li>20% of the things you do will likely have 80% of the impact so take a lot of shots on target.</li>
</ul>
</li>
<li>If there is a problem with your system, fix it now before it becomes a problem later. In other words, prepare for the worst.</li>
</ul>
<blockquote>
<p>His years at sea had taught him that if you don’t fix something when you first see it beginning to fail, it is very likely to finish failing just when it is the most dangerous and the hardest to deal with, such as in the midst of a storm. <a href="https://www.worksinprogress.co/issue/the-maintenance-race/">Source</a></p>
</blockquote>
<ul>
<li>Nothing is new under the sun - the more old papers you read the better.</li>
<li>Use Twitter but only sparingly. It is a delicate dance between knowing what the state of the art is but also avoiding flavour of the month topics that are overly crowded.</li>
<li>Be very careful when taking on mentees, especially time-poor and inexperienced undergrads. More hands do not always make lighter work. There are real benefits to doing everything yourself. The mentorship may be a net time sink for your research.</li>
<li>Rotate with your highest priority labs first. I assumed that everyone would rotate for the first year and there would be an official point where everyone bid for and committed to a lab. In reality, it is far more ad hoc and if you don’t rotate with the labs you are most excited about and likely to join first then they may not have space by the time you rotate.</li>
<li>If you can always work more than four hours in a day then you might be doing the wrong kinds of work. It seems like most people (<a href="https://www.quantamagazine.org/june-huh-high-school-dropout-wins-the-fields-medal-20220705/">including Fields medalists</a>) can only do really deep work for approximately four hours a day. For other kinds of work where you are more on autopilot, like implementing an idea or replying to emails, you may be able to work for much longer. However, if you are always doing this kind of work and never thinking super deeply then you might be doing something wrong.</li>
<li>Be explicit about taking time off. Flexible hours can be great until you feel like you can and should be working all the time. This drains the enjoyment from time not spent working unless you are explicit about actually taking it off.</li>
<li>Think for longer, act more hesitantly. In hindsight there are a number of results that I could have foreseen if I had just thought for a bit longer and more deeply about the problem first. There are no hard deadlines in research (unlike in school). Use this to your advantage.</li>
<li>But if something is cheap to test then don’t theorize about it, just actually go and test it, immediately. Right now.</li>
<li>Most insights come unconsciously but you need to support their appearance. Read widely, go on deep dives into topics, ramble on and on in your notebook about ideas and see where it goes.</li>
<li>Life is cyclical. Moods, motivation levels, the type of work needed to drive a project forwards. Accept this rather than pretending the cycles aren’t there.</li>
<li>If you do ignore them… you might work at really suboptimal times and try to take shortcuts that hurt in the long run (see the bullet point about solving problems now rather than when they actually become problems).</li>
<li>But don’t let perfect be the enemy of good. It is too easy to decide not to start something that seems hard because you don’t feel at your absolute peak right now. E.g. starting to write that paper or reading that publication. Everything is iterative and you will probably need to read it again or re-write it to get it perfect anyways, but you can still capture a lot of the variance the first time around.</li>
<li>The publication itself and the main text in particular hide an incredible number of details, this makes talking to the paper authors really overpowered.</li>
<li>Working alone can have its advantages. Collaboration is great but if you never work alone and never force yourself to come up with your own research ideas then those muscles will never grow. They certainly didn’t in school where there is always a right answer and a very small number of inductive leaps are required to solve any problem.</li>
<li>However well you think you understand something, you certainly don’t until you go through the process of re-creating it for yourself.</li>
</ul>
<blockquote>
<p>What I cannot create I do not understand - Richard Feynman</p>
</blockquote>
<div align="center">
<img width="400" src="https://qph.cf2.quoracdn.net/main-qimg-87833c78a604ff07a82ff7787574e197.webp" />
<br />
<em> Found written on Feynman's office blackboard after he died. </em>
</div>
<ul>
<li>Everyone is naked. This is a dramatization of the phrase “the emperor has no clothes”. The more I climb ivory towers the more I realize that the people who you assumed knew everything and had everything under control often don’t. They are humans too. This is both terrifying because nobody is in control but also highly motivating because you can make a difference.</li>
<li>Just because you <em>can</em> work on something doesn’t mean you should. Taking an idea all the way to a publication is incredibly time consuming. And there are fixed costs to the endeavor that don’t scale and aren’t particularly educational, e.g. making Figure 4 for the 30th time with a slightly different color scheme this time.</li>
<li>When deciding if a project is worth pursuing think “if this went perfectly according to plan, what comes of it?” If the outcome in this unrealistic scenario isn’t all that great then look further.</li>
</ul>
<p>My favourite science quote:</p>
<blockquote>
<p>The Most Exciting Phrase in Science Is Not ‘Eureka!’ But ‘That’s funny …’ - Isaac Asimov</p>
</blockquote>
<p>Oh and if you haven’t read <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.html">You and Your Research</a> then you should.</p>
<p><strong>Goals:</strong></p>
<ul>
<li>Read more textbooks. I need to gain new tools for thought and invest more in the long term instead of immediate projects.</li>
<li>Find more collaborators. I have been working alone for too long. Moving to the Bay Area and working with the Redwood Theoretical Neuro Institute should really help with this.</li>
<li>Do more theory and less engineering.</li>
<li>Spend more time on <a href="https://www.worksinprogress.co/issue/the-maintenance-race/">maintenance</a>. Unit tests, refactoring code, etc.</li>
<li>Have a good answer to the Tyler Cowen question: “What form of routine practice do you do that is analogous to the way a pianist practices the scales?”</li>
<li>Share my research more openly via blog posts and get things on ArXiv more frequently.</li>
<li>Get better at accepting how the world actually is. There is something really powerful about seeing the world not as it should be or could have been but how it is right now. You’re in the cockpit, everything that has happened has happened, now what is going to happen next? I feel this most acutely with investment decisions, “ok yes I should have sold before the market crash. But what should I be doing right now?” Yet I think this generalizes to accepting that I am two years into grad school and turning 25 soon. What projects are sunk costs and what do I want to change going forwards? Meditation and journaling are two ways that I think can really help with this “world acceptance” and future planning.</li>
</ul>
<hr />
<p><em>Thanks to <a href="https://twitter.com/lord_applebee">Max Farrens</a> for reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>
<h1 id="book-review-talent">Book Review - Talent</h1>
<p><em>2022-07-25 · <a href="http://trentonbricken.github.io/Talent">trentonbricken.github.io/Talent</a></em></p>
<p><em>Talent is a conversation starter on the under discussed topic of how to identify talent but not much more.</em></p>
<hr />
<p>Maybe it is because I think so highly of Tyler Cowen that I expected the long anticipated <a href="https://www.amazon.com/Talent-Identify-Energizers-Creatives-Winners/dp/1250275814">Talent</a> to be harder hitting and more insightful. A couple of the interview questions are interesting and a few other factoids that I’ve tried to aggregate. But otherwise the book is basically a Psych 101 introduction to the Big 5 Personality types that is covered more carefully and thoroughly by its <a href="https://en.wikipedia.org/wiki/Big_Five_personality_traits">Wikipedia entry</a>.</p>
<p>In classic Straussian fashion, the main takeaways of the book in the foreword are trite while the actual ones are:</p>
<ol>
<li>Assume that the interviewee is trying to conceal themselves and everything is a canned answer.</li>
<li>Anything goes when it comes to breaking the ice and getting to know the real person that you will actually be hiring and interacting with every day.</li>
</ol>
<p>The discussion of interview questions and strategies for doing this is the strongest part of the book, but it is also only one chapter.</p>
<hr />
<p><strong>Interview questions:</strong></p>
<ul>
<li>What is one mainstream or consensus view that you whole-heartedly agree with?
<ul>
<li>This is the inverse of the “Thiel question”: “What do you believe to be true that very few others in the world would agree with you on?” (which actually originated from Tyler Cowen!) and is now said to have been over-used.</li>
</ul>
</li>
<li>Who are our competitors?</li>
<li>How do you think this interview is going?</li>
<li>Ask a classic question like “what is your greatest weakness” but keep asking it again and again to break through the canned answers and see how deep and organic the later (real) answers are.</li>
<li>What are the open tabs on your browser right now?</li>
<li>How successful do you want to be? / How ambitious are you?</li>
<li>What blogs do you read?</li>
<li>What views do you hold religiously, almost irrationally?</li>
<li>What’s the farthest you’ve ever been from another human?</li>
<li>Which of your beliefs are you most likely wrong about?</li>
<li>Ask a question about a thing you talked about during the interview, both to assess engagement with the conversation and because it is something they could not have prepared for.
<ul>
<li>E.g. “During the middle of this discussion, we chatted about [a very particular project]. What questions do you have about that project?”</li>
</ul>
</li>
<li>What do you do as routine practice in the same way a musician practices their scales?</li>
</ul>
<p>Questions I want to add to the list are:</p>
<ul>
<li>What do you want me to ask you?</li>
<li>List every possible way in which you can use a __? (towel; coffee mug; ballpoint pen) – this is a common test for creativity.</li>
</ul>
<p>Behind the scenes during the interview you want to:</p>
<ul>
<li>Get into conversational mode so they are more likely to be honest and themselves.</li>
<li>Flush out all of their canned answers.</li>
<li>Work out who this person views as important to impress – what their value and status hierarchy is</li>
<li>Substance and quality over articulateness</li>
<li>Ask questions you are genuinely curious to know the answer to – this will increase your own engagement in ways that can’t be faked and the quality of the conversation.</li>
</ul>
<p>On breaking the ice and how anything goes:</p>
<ul>
<li>It is fine to hold the tension and make them feel a bit uncomfortable.</li>
<li>Switch up the environment by going to a coffee shop, for example. And ask questions that more closely resemble a conversation with a friend and that they couldn’t have possibly prepared for like: “what do you think of the service here?”</li>
<li>Also very out there questions like: “Why are person-to-person interactions often more informative than Zoom calls?” Or “Why do so many successful people drink diet Coke?”</li>
</ul>
<p>On the topic of references: they are very important, but referees often don’t have much time and don’t want to say bad things. Try to get things into conversation mode and ask for objective comparisons:</p>
<ul>
<li>Is this person so good that you would happily work for them?</li>
<li>Can this person get you where you need to be way faster than any reasonable person could?</li>
<li>When this person disagrees with you, do you think it will be as likely you are wrong as they are wrong?</li>
</ul>
<p>This first question, reformulated as “would you be ok with this person being your boss?”, is I think also a great question to ask yourself about any hiring decision.</p>
<hr />
<div align="center">
<img width="700" src="https://cdn.slidemodel.com/wp-content/uploads/big-five-personalities-traits-model-diagram.png" />
<br />
<em>Which colored circle describes you?</em>
</div>
<p>The meat of the book talks about IQ and the Big Five Personality types but covers these in a cursory and not particularly precise way. For example, they give very loose definitions of the Big Five traits: under extraversion they list “friendliness”, but should this not fall under agreeableness? For neuroticism they define it as:</p>
<blockquote>
<p>A general tendency to experience negative emotions and negative affect, including fear, sadness, embarrassment, anger, guilt, and disgust.</p>
</blockquote>
<p>But how does this relate to depression, or to simply being less emotionally stable? They go on to talk about neurotics as if they are just people who often complain and lead social movements, giving John Calvin and Gandhi as examples of “pests” or “prickly individuals.”</p>
<p>They define conscientiousness as:</p>
<blockquote>
<p>High conscientious individuals have high self-control, are very responsible, have a strong sense of duty, and usually are good at planning and organizing, due to their reliability.</p>
</blockquote>
<p>But this mixes both morals and being a hard worker.</p>
<p>Having never really defined the Big Five traits, things get particularly confusing when they summarize studies showing conscientiousness is not actually what you might otherwise think it is. This includes interesting results that South Koreans work long hours but are low in conscientiousness, and that conscientiousness is uncorrelated with COVID mask-wearing obedience in Italy. But we were never really told what we were supposed to think conscientiousness is to begin with!</p>
<p>There is a summary of psychology/psychometrics on how the Big Five impacts job performance and earnings. While they do a good job hedging by talking about the replication crisis and needing to take any of these results with a grain of salt, they simultaneously cherry pick a subset of studies that readers will likely over update on.</p>
<p>For example, a study found 20% of the variance in achievement for scientists was due to personality after adjusting for scientific potential and intelligence, but my guess is that this varies dramatically across subfields, depending in large part upon how collaborative the field is (think a large biology lab versus a mathematician with his chalkboard).</p>
<p>In the section on IQ they paint a picture where it is somewhat useful but not all that important. While I largely agree, I again wish they had been more precise and covered more ground. For example, they fail to acknowledge <em>g</em>, instead making sweeping statements like “Intelligence is context dependent.” They also fail to talk about the Flynn effect, don’t acknowledge that IQ tests are illegal for hiring in the US, and at times seem to blur IQ with creativity and the quality of one’s ideas.</p>
<p>A key citation they use looked at the IQ of CEOs in Sweden. This study found that “the small company CEO is above 66% of the Swedish population in cognitive ability, and the median large-company CEO is above 83% of the Swedish population in cognitive ability.” Yes, this may be good evidence that there is more to success than IQ, but more data on how much IQ matters across a broad swathe of outcomes would have been useful. For example, I have seen this table before in other places on the inter-webs:</p>
<div align="center">
<br />
<img width="700" src="https://www.researchgate.net/profile/Tarmo-Strenze/publication/328416329/figure/tbl1/AS:684252119175168@1540149826486/Correlations-between-intelligence-and-success-results-from-meta-analyses.png" />
<br />
<a href="https://www.researchgate.net/profile/Tarmo-Strenze/publication/328416329/figure/tbl1/AS:684252119175168@1540149826486/Correlations-between-intelligence-and-success-results-from-meta-analyses.png">Source</a>
<br />
</div>
<p>They also seem to get things wrong when they say there is “evidence that autistics have strong performance on Ravens IQ tests.” This is not true on average, both from my having read <a href="https://www.amazon.com/Neurotribes-Legacy-Autism-Future-Neurodiversity/dp/0399185615">NeuroTribes</a> and from a cursory Google search of top-cited papers, for example <a href="https://pubmed.ncbi.nlm.nih.gov/21272389/">here</a>. I also wish they had clarified other results I have seen around polygenic scoring and genetic predictors for personality type (see, for example, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4739500/">Top 10 Replicated Findings from Behavioral Genetics</a>), and that they had brought up other interesting phenomena such as <a href="https://astralcodexten.substack.com/p/birth-order-effects-nature-vs-nurture">birth order effects</a>.</p>
<hr />
<p>At times the book seems to lose track of its audience, jumping without any clear delineation between being a self-promoting biography of Tyler and Daniel, a management book, and a self-help book. The personal anecdotes are often lengthy and the third-person writing can be a bit much at times. The book also deviates into the occasional grandiose motivational self-help sales pitch, with phrases like:</p>
<blockquote>
<p>Do you wish to be part of such trends for mobilizing the talents of strongly unique people, or are you going to let others eclipse you in the search for talent?</p>
</blockquote>
<p>And management advice such as when discussing the move to remote work saying:</p>
<blockquote>
<p>Those methods will reward non-paranoid leaders who are okay with giving up some sense of control in the moment, and you will need to adjust your style more in that direction, if you have not already.</p>
</blockquote>
<p>Even on the back cover it says:</p>
<blockquote>
<p>Identifying underrated, brilliant individuals is one of the simplest ways to give yourself an organizational edge.</p>
</blockquote>
<p>But if it is so <em>simple</em> then why is an entire book trying to discuss the nuances and how it all comes down to context?</p>
<p>They also move between focusing on how to hire real outliers and how to hire for standard positions. For example, early in the interview section they say they will discuss unstructured interviews, typically used for more senior positions, instead of structured interviews where the same question is asked of every candidate. But later, when discussing the Big Five personality traits, they turn their attention towards hiring mediocre candidates for entry-level jobs.</p>
<p>Maybe the problem is me getting my hopes up too much for this book both in terms of what I expect from Tyler Cowen and also how hard it is to make any meaningful statements about something as context dependent and difficult as finding talent? And while the book tries to be prescriptive, they correctly acknowledge that at the end of the day it is all context.</p>
<hr />
<p>I’m now going to transition into notes from the book that I found interesting:</p>
<p><em>Thiel anecdotes:</em></p>
<p>On the talent spotting abilities of Thiel:</p>
<blockquote>
<p>Peter Thiel found and helped to mobilize the talents of Elon Musk, Reid Hoffman, Max Levchin, Mark Zuckerberg, and others, including Steve Chen, … . His approach is not well described by any kind of mechanical formula, and Peter’s own background is in the humanities – philosophy and law – rather than science or tech. Many of his current interests concern religion, as he studied the Bible under French anthropologist and philosopher Rene Girard, who was a professor of Peter’s at Stanford. We understand Peter as applying a very serious philosophical and indeed even moral test to people. … In our view, Peter actually asks whether you deserve to succeed, as he understands that concept, and he derives additional information from that interior and indeed deeply emotional line of inquiry.</p>
</blockquote>
<p>How Thiel is compelling:</p>
<blockquote>
<p>The first time each of us met Peter Thiel, for instance, we noticed how engrossed he was in his explanations and, furthermore, how quickly and effectively he pulled people into his worldview, introducing and applying concepts such as “technological stagnation,” “the inability to imagine a future very different from the present,” “Georgist economics,” and “the Girardian sacrificial victim.” Maybe you don’t know what all of those concepts refer to and maybe Peter’s audience doesn’t either, but that is not the point. There is a logic to his argument, and Peter communicates that logic with the utmost conviction; the audience correctly senses a coherent underlying worldview.</p>
</blockquote>
<p>Another Thiel anecdote I love that is not in the book but is just absurd: his Roth IRA is worth $5B (you can only deposit into these when your income is low, and only $6K per year; it is also not taxable!).</p>
<hr />
<p><em>Altman anecdotes:</em></p>
<p>Focus not on where someone currently is but their rate of growth.</p>
<p>I found it amusing that Altman on talent says:</p>
<blockquote>
<p>I look for founders who are scrappy and formidable at the same time (a rarer combination than it sounds); mission-oriented, obsessed with their companies, relentless, and determined; extremely smart (necessary but certainly not sufficient); decisive, fast-moving, and willful; courageous, high-conviction, and willing to be misunderstood; strong communicators and infectious evangelists; and capable of becoming tough and ambitious.</p>
</blockquote>
<p>Should we call in a superhero? Isn’t he just listing every possible desirable characteristic? Isn’t the value of VC in finding “alpha” by investing in people who don’t fit all of these straightforward criteria?</p>
<p>On the importance of speed and being proactive:</p>
<blockquote>
<p>Years ago I wrote a little program to look at this, like how quickly our best founders – the founders that run billion-dollar plus companies – answer my emails versus our bad founders. I don’t remember the exact data, but it was mind-blowingly different. It was a difference of minutes versus days on average response times.</p>
</blockquote>
<p>One point they did not state, but that stuck with me from Altman’s podcast with Tyler, was that the most successful founders all came from stable, middle-class families. This may have something to do with their risk tolerance?</p>
<hr />
<p>The later sections covered Disabilities, Minorities, and Gender. I thought the sections on minorities and hiring for diversity are important and broadly well stated. However, I have already highlighted my issue with the claim about autism being positively correlated with Raven’s Matrices. Here are some of the more striking points from the section on gender:</p>
<ul>
<li>A dataset of VC pitches found that VCs were more critical of all-women teams. If the team was mixed, the VCs only paid attention to the men.</li>
<li>Deep voices are considered more authoritative and women’s voices are getting deeper over time. After WWII they were one octave higher than men’s; now they are only 2/3 of an octave higher. Thatcher underwent voice training to lower her voice during speeches. Elizabeth Holmes did the same.</li>
<li>Women score higher in agreeableness, openness, extraversion, and neuroticism.</li>
<li>At YC (Y Combinator) there is always at least one woman on the three-person interviewing panel. This was traditionally the role of Jessica Livingston, regarded as having x-ray vision for a person’s personality.</li>
<li>Women’s personalities are judged more, and they also have more imposter syndrome: when answering the question “I performed well on the test” on a scale of 1 to 100, women gave 46 on average versus 61 for men.</li>
</ul>
<hr />
<p>Miscellaneous:</p>
<ul>
<li>~10% of the world’s population is dyslexic. This seems very high and I guess highlights how unnatural and recently evolved reading ability is?</li>
<li>Musk personally interviewed the first 3,000 employees at SpaceX.</li>
<li>Of the growth in US output since 1960, at least 20-40 percent has stemmed from the better allocation of talent. The authors argue the bar was low, though, because of sexism and racism. I think we are still misallocating talent by not giving citizenship away, something the UK has recently begun correcting by letting graduates of the top 50 universities easily move there.</li>
<li>There is only a 28% correlation between the interviewer’s and an individual’s own assessments of that individual’s personality traits, in particular for conscientiousness and emotional stability, which are two of the most important for the job!</li>
<li>Bringing talent to you: Tyler Cowen via Marginal Revolution, Thiel through his writings (pointed out in the Marc Andreessen podcast with Tyler), and likely Paul Graham with his essays; they attract talent to them rather than having to go out and find it on their own.</li>
</ul>
<hr />
<p><em>Thanks to <a href="https://twitter.com/davisbrownr">Davis Brown</a> for reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>
<hr />Talent is a conversation starter on the under discussed topic of how to identify talent but not much more.A Solution to the Repugnant Conclusion?2022-07-25T12:15:00+00:002022-07-25T12:15:00+00:00http://trentonbricken.github.io/Solution-To-Repugnant-Conclusion<p><em>Having a non-linear uptick in utility avoids the Repugnant Conclusion.</em></p>
<hr />
<p><em>Disclaimer: I am not a philosopher, and I very much welcome comments and debate on this piece. I am trying not to let perfect be the enemy of good and to share this piece somewhat unfinished rather than continuing to sit on it.</em></p>
<p>Derek Parfit in <em>Reasons and Persons</em> introduces the <a href="https://en.wikipedia.org/wiki/Mere_addition_paradox">“Repugnant Conclusion”</a>, which is an unsettling answer to the question: “should we have more people or happier people?” Parfit persuasively argues from a few simple axioms that the answer is always quantity over quality and that we should have as many people alive as possible, such that everyone is living right on the threshold of life not being worth living. In other words:</p>
<blockquote>
<p>“For any perfectly equal population with very high positive welfare, there is a population with very low positive welfare which is better, other things being equal.” - Derek Parfit</p>
</blockquote>
<p>Many, myself included, find the idea of this subsistence-level living repugnant, hence the name of Parfit’s conclusion. However, I think there is actually a simple solution to the Repugnant Conclusion, which I will outline after better formalizing the problem.</p>
<p>Parfit considers the utility function of individuals as either being linear or non-linear with diminishing returns:</p>
<div align="center">
<img width="700" src="../images/RepugConclusion/First.png" />
<br />
<em>Linear or diminishing returns between resources and life quality gains.</em>
</div>
<p>The diminishing returns case is the most realistic, and here any increase in resources on the x-axis gives equal or smaller gains in quality of life. We can use this utility function to consider the utility of a whole population, deciding the amount of resources each person gets and summing together their quality of life:</p>
<div align="center">
<img width="700" src="../images/RepugConclusion/Second.png" />
<br />
</div>
<div align="center">
<img width="700" src="../images/RepugConclusion/Third.png" />
<br />
</div>
<p>Comparing the areas of the rectangles on the right:</p>
<div align="center">
<img width="200" src="../images/RepugConclusion/Fourth.png" />
<br />
</div>
<p>We want the one with the largest area and due to our utility function having diminishing returns, the way to maximize area is by having everyone live just above subsistence:</p>
<div align="center">
<img width="700" src="../images/RepugConclusion/Fifth.png" />
<br />
</div>
<p>This always results in the repugnant conclusion where we choose quantity over quality for the sake of maximizing total utility.</p>
<p>However, I think there is a solution that isn’t repugnant and in fact leverages the very nature of us seeing this conclusion as repugnant – there is a point where a small increase in resources leads to an even larger increase in life quality. In other words, when life goes from glass half empty to glass half full; when you are sufficiently far above subsistence that life gets a lot more enjoyable. Exactly where this point occurs, and how large this non-linear increase in quality of life is, remains up for debate. Yet as long as the utility function is monotonically increasing and contains some non-linearity of this form – a region with a positive second derivative – the Repugnant Conclusion is dissolved.</p>
<div align="center">
<img width="400" src="../images/RepugConclusion/Sixth.png" />
<br />
<em> The non-repugnant utility function of life? </em>
</div>
<p>This is because at this non-linearity, for a decrease in resources, there is an even greater decrease in life quality. This means that in order to maximize the utility of a population, nobody’s resource allocation and quality of life should drop below this non-linearity.</p>
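To make this concrete, here is a toy numeric sketch. This is my own construction, not from the original argument: the budget, exponent, threshold, and bonus values below are all arbitrary assumptions, chosen only to illustrate the shape of the claim. It splits a fixed resource budget equally among N people and asks which N maximizes total utility, with and without the convex uptick:

```python
R = 100.0  # total resource budget, split equally among N people (arbitrary)

def u_concave(r):
    # Diminishing returns everywhere: total utility is 10 * sqrt(N),
    # which keeps growing with N, so adding people always wins.
    return r ** 0.5

def u_with_uptick(r):
    # Same curve plus a convex uptick: a flat bonus once per-person
    # resources clear a "life gets a lot more enjoyable" threshold.
    # The threshold (2.0) and bonus (6.0) are arbitrary assumptions.
    return r ** 0.5 + (6.0 if r >= 2.0 else 0.0)

def best_population(u, max_n=1000):
    # Population size maximizing total utility N * u(R / N).
    totals = {n: n * u(R / n) for n in range(1, max_n + 1)}
    return max(totals, key=totals.get)

print(best_population(u_concave))      # the largest N considered: repugnant
print(best_population(u_with_uptick))  # keeps everyone at or above the uptick
```

With pure diminishing returns the optimum is always the biggest population allowed; with the uptick, the optimum keeps every person’s share at or above the threshold, which is exactly the behavior argued for above (in this sketch the bonus must also be sized to matter over the range of populations considered, since it is bounded).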
<p>I am curious to know if this non-linearity already exists in the literature and what might be wrong with it. It is such a simple modification and I was frustrated that Parfit never addressed it in his magnum opus. <a href="https://twitter.com/tamaybes">Tamay</a> noted that one known occurrence of this sort of asymmetry in utility exists with <a href="https://en.wikipedia.org/wiki/Prospect_theory">Prospect Theory</a>, where losses hurt more than wins.</p>
<hr />
<h3 id="related-work">Related Work</h3>
<p><a href="https://www.jstor.org/stable/2382033">What Do We Learn from the Repugnant Conclusion?</a> by Tyler Cowen reviews many papers that came out after Reasons and Persons introduced the Repugnant Conclusion. Most related is his section titled “Asymmetric Treatment for Low-Utility Individuals”. However, this section discusses placing bounds on the utility function which is not done here.</p>
<p>In other places there is discussion of having asymptotically declining utilities that tend towards zero and may be non-linear. This violates axiom 4 of the repugnant conclusion:</p>
<blockquote>
<p>Axiom (4) - No value should become infinitely small in importance at the margin. A very large addition to that value, all other things being held equal, should never translate into an asymptotically insignificant contribution to the social welfare function. I call this the non-vanishing value axiom.</p>
</blockquote>
<p>However, the solution proposed here does not rely upon an asymptotic decline of the utility function to a value of 0. Instead, it could intersect the y-axis at a higher value; the only thing that matters is that a non-linearity with a positive second derivative exists somewhere.</p>
<p>Cowen also summarizes work modeling interaction effects, where a decrease in resources per person leads to a reduction in, say, dignity, and it is the effect on dignity itself that causes the decrease in utility, such that we should avoid the Repugnant Conclusion. He provides a number of attacks against these kinds of interaction effects. However, there is fundamentally an interaction effect between resources and life quality, and life quality itself is the sum of many components. Having an interaction between resources and these components therefore seems inevitable? Yet more work needs to be done to suggest where the non-linearity I propose actually comes from.</p>
<hr />
<h3 id="summary">Summary</h3>
<p>For the utility function that describes the average person’s life, as long as there exists a point where a one-unit increase in resources leads to a greater than one-unit increase in life quality, the Repugnant Conclusion is not reached. The source of this non-linearity is unclear but, at risk of being tautological, may exist due to the very fact that the Repugnant Conclusion feels so repugnant.</p>
<hr />
<p><em>Thanks to <a href="https://twitter.com/tamaybes">Tamay Besiroglu</a> and <a href="https://twitter.com/davisbrownr">Davis Brown</a> for reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>Having a non-linear uptick in utility avoids the Repugnant Conclusion.Book Review - Ted Chiang Short Stories2022-07-25T10:32:00+00:002022-07-25T10:32:00+00:00http://trentonbricken.github.io/Ted-Chiang<p><em>Ted Chiang creates emotionally resonant and novel perspectives on deep questions about life and technology.</em></p>
<hr />
<p>I have just finished both of Ted Chiang’s collections of short stories, “Stories of Your Life and Others” and “Exhalation”, which were both excellent. Chiang’s work is steeped in past and present scientific ideas spanning fields including physics, computer science, and biology. For example, Chiang considers worlds where we can turn off the part of the brain that recognizes beauty in faces; where children are created from preformed humans inside sperm; where we can do forms of time travel that don’t violate our current laws of physics; where the Everettian many-worlds interpretation of quantum physics is real and we can communicate across worlds; reflections on the heat death of the universe; and the effects of glasses that record and allow for immediate recall of your past experiences.</p>
<p>I really like how Chiang makes salient slippery topics like the progression of technology, Chesterton’s fence, free will, morality, and the meaning of life. He provides novel angles through which to view these topics and handles the ideas subtly. The stories leave many more questions than answers but are stimulating and beautiful.</p>
<p>—</p>
<p>Questions the stories prompted that I will keep thinking about:</p>
<p><em>When is technology beneficial?</em></p>
<p>Technology gives us more power and optionality, allowing us to do things that were never possible before. This forces us to reconsider what about our status quo that evolution gave us is desirable to keep and what is not. Paul Graham notes a growing divergence in <a href="http://www.paulgraham.com/addiction.html">The Acceleration of Addictiveness</a> between what is “normal” in the sense that cavemen also did it and “normal” in the sense that the majority of people do it now.</p>
<p>Evolution is simultaneously a <a href="https://www.lesswrong.com/posts/pLRogvJLPPg6Mrvg4/an-alien-god">“blind idiot God”</a>, responsible for vast amounts of unnecessary suffering, and a gargantuan <a href="https://www.trentonbricken.com/On-Chestertons-Fence/">Chesterton’s fence</a>, creating 12-stage <a href="https://www.trentonbricken.com/On-Chestertons-Fence/">cassava processing techniques</a> to remove cyanide. How can we fix evolution’s shortcomings while not poisoning ourselves with cyanide? Figuring out what is best for us is made all the more tricky because our very desires are programmed by evolution. Moreover, how can we use technology to restore the very things that we lost because of other technologies, and where does this end? For example, in the story “Liking What You See: A Documentary”, technology, including better cosmetic surgery, leads to superstimuli that hijack our natural bias to treat more beautiful people better. A counter response is a non-invasive brain modification that makes one “blind” to human beauty. The story provides a back and forth debate for and against this technological arms race.</p>
<p><em>The utility of memory?</em></p>
<p>By default I buy into wanting to remember everything and the importance of objective fact. However, Chiang in “The Truth of Fact, the Truth of Feeling” reveals how even a memory device as simple as writing can affect our social dynamics and outcomes. There is a distinction made between what is factually correct and what it is best to believe in order to make the right decision. This is closely related to <a href="https://www.amazon.com/Elephant-Brain-Hidden-Motives-Everyday/dp/0190495995">Elephant in the Brain</a>, which puts forward the hypothesis that our inner mind hides its intentions from our conscious mind so as to best reach our ends by being optimally deceptive – the best liar is the person who doesn’t even know they are lying! Chiang projects our external memory tools and their consequences into the future where we all wear video cameras and can effortlessly query any previous memory. This extension takes our external memory abilities beyond just the factual (e.g. Googling what is the capital of France?) and into the personal (e.g. What did I say to Alice four weeks ago at that party?).</p>
<p>In this story the protagonist learns that he was in fact misremembering previous interactions and, embarrassed, concludes the recording device can help him become a better person. However, to what extent are we as humans already too self-effacing? And given that we are terrible at holding both good and bad things in mind at the same time (the affect heuristic), is it a bad thing to be constantly overwhelmed by nuance about who the good guy is versus the bad guy? At what point do we hit <a href="https://slatestarcodex.com/2019/06/03/repost-epistemic-learned-helplessness/">epistemic learned helplessness</a>? If this all sounds interesting, <a href="https://www.amazon.com/Symbolic-Species-Co-evolution-Language-Brain/dp/0393317544">Symbolic Species</a> takes this argument about memory and fact even further with the costs and benefits of symbolic thought and language itself.</p>
<p><em>How can we establish the rights and consciousness of digital minds?</em></p>
<p>“The Lifecycle of Software Objects” short story is timely on a number of fronts. Digital beings (digients) are created that run on artificial neural networks and develop analogously to children over time. Their owners grapple with figuring out just how intelligent the digients are (we are currently doing this with our largest AI language models). The digients also desire to have autonomy and be incorporated as independent entities raising tricky legal issues and getting to the core of free will and agency. We will soon face something similar with driverless vehicles. Interestingly, the story assumes that digients are conscious and can suffer by, for example, being tortured. This assumption and the implications of being able to create vast amounts of suffering with mere computer code was troubling but again timely with the debates this month on whether or not Google’s <a href="https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917">LaMBDA language model is conscious</a>.</p>
<p><em>What’s the point of it all?</em></p>
<p>Exhalation poetically captures the ultimate heat death of the universe when all life and existence will inevitably come to a standstill. Even in light of this ultimate extinction, there is a very Buddhist perspective of enjoying the present and existence itself. A number of Chiang’s other stories also touch on these sorts of realizations and existential crises including: Omphalos, Division by Zero, and Tower of Babylon.</p>
<p>If you made it through the above, some of these points may sound trite, especially the last one. This is where I believe Chiang is at his best, weaving together deep ideas with imagery and emotion that resonate and feel more profound than I can hope to do justice to here. Go and read the originals :P</p>
<p>—</p>
<p>My favorite short stories, some of which I have already mentioned, in rough order of enjoyment were:</p>
<ul>
<li>Exhalation - Heat death of the universe and the beauty and purpose that can still be derived from life.</li>
<li>Story of Your Life - inspiration for the movie Arrival. The nature of time and causality. Beautiful depictions of parenting that make me want to have kids.</li>
<li>Liking What You See: A Documentary - on beauty and cognitive biases. This Paul Graham piece is very related: <a href="http://www.paulgraham.com/addiction.html">The Acceleration of Addictiveness</a>.</li>
<li>The Truth of Fact, the Truth of Feeling - “truth” as what is factually correct versus what is right. Forgetting has its benefits. What is a world like where we never forget anything that happened to us?</li>
<li>Hell Is the Absence of God - comic on religion, angels, justifications and morality.</li>
<li>The Lifecycle of Software Objects - creation of digital lifeforms, nature of consciousness, rights of digital minds, nature of intelligence.</li>
<li>Anxiety is the Dizziness of Freedom - Everettian multiple worlds and morality within them. Jealousy and what could have been.</li>
<li>What’s Expected of Us - Free will.</li>
</ul>
<p>Most of these came from the second collection of short stories: “Exhalation”.</p>
<p>As a final note, I love that there are story notes at the end of each book where Chiang shares his inspiration for each of the stories, providing a different and richer perspective on their origins.</p>
<hr />
<p><em>If you have read Ted Chiang’s work then reach out and let me know your thoughts!</em></p>
<hr />Ted Chiang creates emotionally resonant and novel perspectives on deep questions about life and technology.Transformer Memory Requirements2022-07-22T15:32:00+00:002022-07-22T15:32:00+00:00http://trentonbricken.github.io/TransformerMemoryRequirements<p><em>Working out how much memory it takes to train a Transformer GPT2 Model.</em></p>
<hr />
<p>There has been recent discussion on <a href="https://stats.stackexchange.com/questions/563919/formula-to-compute-approximate-memory-requirements-of-transformer-models">StackOverflow</a> and <a href="https://twitter.com/MishaLaskin/status/1546994229674647553?s=20&t=0gkdvE1j_363D3xvTT1d4A">Twitter</a> on the full memory requirements of training a Transformer.</p>
<p>Because I am in the process of training Transformers and scaling to multiple GPUs myself, I became interested in this question. Misha Laskin provides some <a href="https://twitter.com/MishaLaskin/status/1546994229674647553?s=20&t=0gkdvE1j_363D3xvTT1d4A">back of the envelope</a> calculations for why batch size and sequence length dominate over model size; these are interesting but off by approximately 4x for the model parameters and 2x for activations.</p>
<p>I have thrown together a more detailed calculator as a Colab notebook <a href="https://colab.research.google.com/drive/1G0OabelIWifPfYgoVmUFhr3hKODSHe6a?usp=sharing">here</a> and outline my reasoning below. I’ve tested this on the “small” 124M-parameter and “medium” 345M-parameter GPT2 models and get close to the real values.</p>
<p>Here is the GPT2 model architecture (image taken from my <a href="https://arxiv.org/abs/2111.05498">paper</a>):</p>
<div align="center">
<img width="700" src="../images/TransformerCalc/TransformerModel.png" />
</div>
<p>See the <a href="https://jalammar.github.io/illustrated-gpt2/">Illustrated GPT2</a> for a full explanation of how GPT2 works.</p>
<p>Here are the equations and notation in full:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L = 12 # number of blocks
N = 12 # number of heads
E = 768 # embedding dimension
B = 8 # batch size
T = 1024 # sequence length
TOKS = 50257 # number of tokens in the vocab
param_bytes = 4 # float32 uses 4 bytes
bytes_to_gigs = 1_000_000_000 # 1 billion bytes in a gigabyte
model_params = (TOKS*E)+ L*( 4*E**2 + 2*E*4*E + 4*E)
act_params = B*T*(2*TOKS+L*(14*E + N*T ))
backprop_model_params = 3*model_params
backprop_act_params = act_params
total_params = model_params + act_params + backprop_model_params + backprop_act_params # = 4*model_params + 2*act_params
gigabytes_used = total_params*param_bytes/bytes_to_gigs
</code></pre></div></div>
<p>For the “small” GPT2 model with 124M parameters (that uses the above values for each parameter) we get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model_params = 123,568,896
act_params = 3,088,334,848
gigabytes_used = 26.6 Gb
</code></pre></div></div>
<p>Running the Hugging Face GPT2 we actually get 27.5Gb.</p>
<p>If our batch size is 1 then we undershoot again: memory is predicted to be 5.1Gb but in reality it is 6.1Gb.</p>
<p>For the medium-sized 345M-parameter model and a batch size of 1, our equation predicts that it will use 12.5Gb while empirically it is 13.4Gb. The 1Gb gap remains. I learned that this 1Gb gap comes from loading the GPU kernels into memory! See <a href="https://huggingface.co/docs/transformers/perf_train_gpu_one">here</a>.</p>
<p>The model parameter equation comes from:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(TOKS*E) [embedding layer ]+ L [number of blocks]*( 4*E**2 [Q,K,V matrices and the linear projection after Attention] + 2*E*4*E [the MLP layer that projects up to 4*E hidden neurons and then back down again] + 4*E [Two layer norms and their scale and bias terms])
</code></pre></div></div>
<p>Where we ignore the bias terms and positional embedding.</p>
<p>The activation parameter equation comes from:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B[batch]*T[seq. length]*(2*TOKS [one hot vectors at input and output]+L[number of blocks]*(3E [K,Q,V projections] + N*T [Attention Heads softmax weightings] + E [value vector] + E [linear projection] + E [residual connection] + E [LayerNorm] +4E [MLP activation]+E [MLP projection down]+E[residual]+E[LayerNorm] ))
</code></pre></div></div>
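<p>As a quick check, the bracketed per-block activation components really do sum to the 14*E + N*T term used in the equation above:</p>

```python
E, N, T = 768, 12, 1024  # embedding dim, heads, sequence length

# Per-token activations stored in each block, labelled as above.
components = {
    "K,Q,V projections": 3 * E,
    "attention head softmax weightings": N * T,
    "value vector": E,
    "linear projection": E,
    "residual connection": E,
    "LayerNorm": E,
    "MLP activation": 4 * E,
    "MLP projection down": E,
    "MLP residual": E,
    "MLP LayerNorm": E,
}
per_block = sum(components.values())
assert per_block == 14 * E + N * T
print(per_block)  # -> 23040 for GPT-2 small
```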
<p>When I turn on float16, the memory for a batch of 1 only drops from 6.1Gb to 5.8Gb. Meanwhile, for a batch of 8, it goes from 27.5Gb to 21.8Gb. Why are the memory savings not larger? Is this because it is mixed precision and the model decides that it needs high precision for many of its components?</p>
<hr />
<p><em>Thanks to <a href="https://twitter.com/milesaturpin">Miles Turpin</a> and <a href="https://twitter.com/MishaLaskin">Misha Laskin</a> for motivating this piece. All remaining errors are mine and mine alone.</em></p>
<hr />Working out how much memory it takes to train a Transformer GPT2 Model.Book Review - How to Build a Brain2022-04-26T14:25:00+00:002022-04-26T14:25:00+00:00http://trentonbricken.github.io/How-To-Build-A-Brain<p><em>How to Build a Brain over promised and under delivered but I appreciate its ambitious goal to “build a brain” and the efforts made towards it.</em></p>
<hr />
<p>Author Chris Eliasmith uses ideas from Vector Symbolic Architectures (VSAs) and implements them with his Neural Engineering Framework (NEF) to make his “brain” <em>somewhat</em> biologically plausible. He uses this combination to solve a number of different tasks and replicate some core psychology and neuroscience experimental results.</p>
<p>The book provides useful context about other attempts in computational neuroscience to build a brain and highlights the promise of VSAs to bridge many of the apparent dichotomies between previous approaches. In focusing at the scale of building a brain, Eliasmith also drags the microscopic and myopic back to the terminal goal of building minds and developing solutions that work at scale.</p>
<p>Chris Eliasmith summarizes years of his research with the culmination being “<a href="https://www.science.org/doi/10.1126/science.1225266">Spaun</a>”, a model of the brain with many interconnected components that can perform diverse tasks including classifying handwritten digits and solving Raven’s Progressive Matrices (one of the core tests of human intelligence). This model, while certainly performing some impressive tasks, is less impressive than it first sounds, with many of the tricky details being hard coded. Its success is thanks to an impressive engineering effort, both in connecting the right pipes in the right ways and in using Eliasmith’s Neural <em>Engineering</em> Framework (NEF) to implement basic VSA and associative memory theories that had already been developed.</p>
<p>A big reason I was let down by the book is that I was aware of VSAs and excited to see them applied. I had hoped that some open questions I had around them would be addressed such as how a brain should decide to organize its cleanup memory, how it should represent symbols and vectors, and what variables should be bound to others. However, many of these issues were ignored just to flex the power of NEF on a number of toy problems with the really difficult parts simply hardcoded.</p>
<p>Additionally, I don’t think Eliasmith gave previous work in these domains of VSAs and associative memory enough credit, nor did he leverage it to its full potential. For example, he presents a way to store memories through “chaining” without any citation of Plate’s PhD thesis, which presents and outlines this idea in detail under the name “chunking”.</p>
<p>In addition, for clean-up memory, he simply takes a dot product between the query and every pattern that has ever been stored. It is biologically implausible to imagine that every pattern is independently stored in the brain in this way. Far more powerful and biologically plausible systems such as SDM and even Hopfield Networks exist and should have been used instead.</p>
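<p>To make the criticism concrete, the clean-up scheme being described is just a nearest-neighbour lookup: a dot product between the query and every stored pattern. A minimal sketch (my own illustration, not Eliasmith’s NEF implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024, 50  # vector dimension, number of stored patterns
# Random near-unit-norm patterns, components ~ N(0, 1/D).
patterns = rng.standard_normal((K, D)) / np.sqrt(D)

def cleanup(query):
    """Return the stored pattern most similar to the (noisy) query.

    One dot product per stored pattern: O(K*D) work, and every
    pattern must be kept around explicitly -- the implausibility
    being criticized above.
    """
    sims = patterns @ query
    return patterns[np.argmax(sims)]

# A noisy version of pattern 7 is cleaned up back to the exact original.
noisy = patterns[7] + 0.5 * rng.standard_normal(D) / np.sqrt(D)
assert np.allclose(cleanup(noisy), patterns[7])
```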
<p>Maybe I am overdoing it with my criticism here, maybe these fancier memory models could have been implemented but were not the main focus of this work. For any project, there is always finite scope. However, one thing I appreciated about the book was its prioritization of biological plausibility and this emphasis in some domains and neglect in others is frustrating. This is particularly because the biggest source of biological plausibility is the spiking neural networks implemented via NEF…</p>
<p>According to Greek mythology, anything Midas touched turned to gold. Here, anything that NEF touches turns to “biologically plausible”. Beyond being used ineffectively to approximate functions that we know have more biologically plausible alternatives, the extent of NEF’s representational power makes me skeptical of its own underlying plausibility, bringing the whole edifice it is built upon down with it.</p>
<p>The largest issues with the NEF are: (i) how it computes a loss function and (ii) how it propagates the error signal from this loss function to the specific subset of neurons used. The NEF takes a population of spiking neurons with random tuning curves and learns to weight them via a linear decoder such that they can approximate any arbitrary function. It is never explained<sup id="fnref:NeedToReadNEFBook" role="doc-noteref"><a href="#fn:NeedToReadNEFBook" class="footnote" rel="footnote">1</a></sup> how the arbitrary loss functions are both stored and calculated by a single decoder neuron. The error signal from this loss function is said to come from a local or external signal, with dopamine mentioned as possibly being this training signal, but how it propagates to and targets the specific subset of activated neurons is highly unclear.</p>
<p>Again, the overall focus and ambition is refreshing but there remains much work to be done and I believe that a number of the approaches outlined in this book will need to be re-questioned and over-written in the future.</p>
<p>I appreciate the focus on bio-plausibility and sometimes it is very compelling but other times it seems much less justified and this brings the compelling sections into doubt.</p>
<hr />
<p><strong>Pros of the book include:</strong></p>
<ul>
<li>Spike time dependent plasticity.</li>
<li>Use of population codes.</li>
<li>Reference interesting neuroscience and psychology data/benchmarks that any model should try to satisfy.</li>
<li>Reference to different brain regions that may be implementing each component.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>While reference to brain regions was a nice gesture, it felt very over simplified along the lines of “this component in my model should remember the start of the sequence and so we are going to call it the hippocampus.”</li>
<li>Does not implement or acknowledge sparse computations in the brain. There are cases where 100 neurons are needed to represent a single scalar. Surely this is metabolically inefficient/implausible?</li>
<li>NEF assumes there are lots of off neurons that fire from negative current. How common actually are these?</li>
</ul>
<p><strong>Other issues with the book:</strong></p>
<ul>
<li>Logic based pre-encoding of rules to solve problems.</li>
<li>Representations not being learnt, instead using random vectors - this is again hard coded and ignores the potential for symbolic relationships between concepts.</li>
<li>Plate’s HRR biologically plausible convolution operations taking advantage of random connectivity are a real missed opportunity.</li>
</ul>
<hr />
<p><strong>Concluding Remarks</strong></p>
<p>Given my tone and critiques throughout this review, it may come as a surprise that I am really glad this book exists and that Chris Eliasmith has done the research that he has. It is a great introduction to the field of computational neuroscience and shows the exciting potential of VSAs. However, it is because this route is so exciting that I have high expectations and want everything to be done in the most biologically plausible and sophisticated way possible, with the right attributions to the right original researchers.</p>
<p>If you have read the book or have opinions please comment as I am curious to get outside perspectives. Hat tip to Adam Marblestone for bringing my attention to the book through <a href="http://web.mit.edu/amarbles/www/talks.html">this</a> wonderful list of recommendations on his website.</p>
<hr />
<p><em>Thanks to <a href="https://twitter.com/joechoochoy">Joe Choo-Choy</a> for influential discussions and reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>
<hr />
<hr />
<hr />
<p>I will now transition from providing commentary to summarizing each chapter of the book by sharing the notes that I took for each. I’m not sure how useful this is to readers but maybe treat it as a “Table of Contents” for the book?</p>
<h3 id="chapter-1---science-of-cognition">Chapter 1 - Science of Cognition</h3>
<p>I found this chapter to provide a useful high level background to computational and cognitive neuroscience, at least as of 2013 when this book was published.</p>
<p>There are four main approaches for modelling cognition:</p>
<ul>
<li>Logic/Symbolic, Good Old Fashioned AI (GOFAI) - our thinking is logical and needs to be highly flexible, as enabled by symbols.</li>
<li>Connectionist / Parallel Distributed Processing (PDP) - the brain is highly parallel and distributed.</li>
<li>Dynamicist - we need to account for time and the environment.</li>
<li>Bayesian - induction and probabilistic models of the world. We need to account for uncertainty.</li>
</ul>
<p>The two main approaches for trying to explain data can be considered top down and bottom up:</p>
<ul>
<li>Production systems like ACT-R can capture high level results but have constraining time constants that are hand tuned and biologically implausible.</li>
<li>Lower level dynamics models capture the low level phenomena but not high level cognition.</li>
</ul>
<p>In response to these buckets, Eliasmith presents VSAs that can be logical and have nested, semantically meaningful relationships while also being dynamicist in their focus on action and implemented in a way that models both temporality, is distributed and probabilistic (Bayesian).</p>
<p>Eliasmith poses an interesting question apparently asked during a funding agency meeting: <em>“What have we actually accomplished in the last 50 years of cognitive systems research?”</em> Eliasmith answers saying that we have a better idea of the landscape of cognition and criteria that any intelligent system must fulfil.</p>
<p><strong>Core Cognitive Criteria (CCC) for theories of cognition – how to evaluate any model:</strong></p>
<ul>
<li>Representational Structure:
<ul>
<li>Systematicity</li>
<li>Compositionality</li>
<li>Productivity</li>
<li>Massive Binding Problem - too many variables and features</li>
</ul>
</li>
<li>Performance Concerns:
<ul>
<li>Syntactic generalization</li>
<li>Robustness</li>
</ul>
</li>
<li>Adaptability:
<ul>
<li>Memory</li>
<li>Scalability</li>
</ul>
</li>
<li>Scientific Merit:
<ul>
<li>Triangulation</li>
<li>Compactness – Occam’s razor</li>
</ul>
</li>
</ul>
<p>I also tried to answer this question myself and would be very interested in comments with answers from you, reader. Try it now! It’s a fun exercise.</p>
<p><em>“What have we actually accomplished in the last 50 years of cognitive systems research?”</em></p>
<ul>
<li>Parallel Distributed Processing models -> Deep Learning.
<ul>
<li>Incredible performance improvements including GPT3 and DALLE
<ul>
<li>these empirical results suggest we might be onto something important here</li>
</ul>
</li>
<li>new probabilistic models that may be crucial to cognition eg. Transformer Attention and Variational AutoEncoders</li>
</ul>
</li>
<li>Reinforcement learning
<ul>
<li>TD learning - seems to still explain dopamine reward prediction error</li>
<li>further empirical successes like winning in Go, StarCraft, Poker, driverless cars, and even with general purpose algorithms like MuZero</li>
</ul>
</li>
<li>Bayesian Brain/Predictive Coding/Free Energy Principle - compelling set of theories for the brain forming probabilistic models, making top down predictions (see <a href="https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/">this review</a> for a nice overview of many of these ideas)</li>
<li>Vector Symbolic Architectures - Sparse Distributed Memory and Plate’s HRR in particular as connectionist ways to represent symbols that may also be a biologically plausible solution</li>
<li>Novel tools:
<ul>
<li>fMRI (whatever utility this provides, see After Phrenology for some critiques)</li>
<li>connectomics, including of C. elegans and exciting new results in Drosophila</li>
<li>in situ RNA sequencing</li>
<li>expansion microscopy</li>
<li>many others I am unaware of…</li>
</ul>
</li>
<li>Novel data:
<ul>
<li>the amount of data we have on conditioned learning is incredible e.g. <a href="http://learnmem.cshlp.org/content/10/6/427">here</a></li>
</ul>
</li>
</ul>
<p>Obviously we have a long way to go but a great deal has been accomplished.</p>
<p>Eliasmith states what key questions are answered in the book, however I believe that this is overpromising and I append why to each question.</p>
<p><em>How are semantics captured in the system?</em></p>
<ul>
<li>Summarizes ideas from VSAs but makes no attempt to address how organisms encode hierarchical, semantic information either as sensory inputs or learned transformations. Symbols are assigned as random vectors that only have meaning to us human observers (who have already done the hard part of learning these symbols!). <a href="https://www.amazon.com/Symbolic-Species-Co-evolution-Language-Brain-ebook/dp/B005Q65DLY">Symbolic Species</a> has really influenced my thinking on this.</li>
</ul>
<p><em>How is syntactic structure encoded and manipulated by the system?</em></p>
<ul>
<li>Again, just presents ideas from VSAs on how to do variable binding, does nothing to address how a system might determine <em>what</em> should be stored and how (what should be chunked in meaningful ways).</li>
</ul>
<p><em>How is the flow of information flexibly controlled in response to task demands?</em></p>
<ul>
<li>Uses a simplified model of the basal ganglia for action selection without much consideration for how it receives error signals.</li>
</ul>
<p><em>How are memory and learning employed in the system?</em></p>
<ul>
<li>Rules for how to store memories and manipulate them for reasoning tasks are hardcoded, for example, solutions to Raven’s matrices and the Tower of Hanoi. This shows the usefulness of VSAs but not how representations and solutions can be learnt.</li>
</ul>
<hr />
<h3 id="ch-2---introduction-to-brain-building">Ch. 2 - Introduction to Brain Building</h3>
<p>This chapter introduces the NEF, which I will skip as I have already written about it above. It also provides a general neuroscience background that gave me some new fun facts!</p>
<ul>
<li>The brain uses 20 watts of power - the same as a lightbulb!</li>
<li>The brain takes up 2% of body weight but uses 25% of energy.</li>
<li>Visual cortex regions are all on the surface (cortex) not nested deeper in the brain as I had originally assumed. This is a case where there are in fact connections going between different cortical columns.</li>
<li>Neurons in the cortex have on average 10K inputs and outputs. This can range from 500 (retina) to 200,000 (Purkinje cells).</li>
<li>There are 72 kilometers (~45 miles) of fiber in the brain. This is equivalent to the height of 9 Mt. Everests…</li>
<li>There are hundreds of different neurotransmitters and neuronal types.</li>
<li>As an example of evolution being stuck in a local optimum, giraffes have the <a href="https://timpanogos.blog/2011/10/08/evidence-of-evolution-giraffes-laryngeal-nerve/">laryngeal nerve</a>:
<blockquote>
<p>The laryngeal nerve of the giraffe, linking larynx to brain, a few inches away — but because of evolutionary developments, instead dropping from the brain all the way down the neck to the heart, and then back up to the larynx. In giraffes the nerve can be as much as 15 feet long, to make a connection a few inches away.</p>
</blockquote>
</li>
</ul>
<p><img src="../images/HowToBuildABrain/LaryngealNerve.png" alt="" /></p>
<ul>
<li>Another example of a local optimum is the human retina, where all of our rods and cones are <a href="https://theconversation.com/look-your-eyes-are-wired-backwards-heres-why-38319">flipped the wrong way</a>. This is why we have a blind spot: it is where all the wires enter through the retina.</li>
<li>Human motor neurons are up to a meter long.</li>
</ul>
<p>The next 4 Chapters outline: Semantics; Syntax; Control; Memory & Learning. These are then all combined into Spaun, the system that is meant to emulate a brain solving a number of tasks.</p>
<hr />
<h3 id="ch-3---biological-cognition-semantics">Ch. 3 - Biological Cognition: Semantics</h3>
<p>This chapter introduces VSAs. For a better introduction I defer to … .</p>
<p>VSAs are capable of implementing Hinton’s reduced representations which are:
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FzorY81AXk8.png?alt=media&token=1ab890f6-41aa-41d1-885d-134d012490c2" alt="" />
(this is taken from Plate’s PhD thesis, he was supervised by Hinton).</p>
<p>A golf ball was used as a description of the high dim space that the symbols, represented as vectors, exist in. The little pockets on the ball’s surface can do clustering of similar representations.</p>
<p>Work is summarized on how a convolutional network can be used to learn compressed representations of MNIST digits. The quality of the compression was assessed by decoding: fixing the top layer and optimizing the input layer to reconstruct the digit. It is suggested that the VSA then operates on these compressed representations using superpositions/convolutions. Using the NEF they also got nice Gabor-like filters such as those we see in V1.</p>
<p>There is also this nice diagram for neural processing and control.
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FrKu63uAkCZ.jpg?alt=media&token=3ada1a7e-c2e9-4091-9fab-3324855e7ee7" alt="" /></p>
<h3 id="ch-4---syntax">Ch. 4 - Syntax</h3>
<ul>
<li>Notes and cites work on how one to one relationships for associative learning are insufficient to solve lots of problems.</li>
<li>Thinks that all transformations need to also be convolutions? Can surely still do other vector transformations that are useful?</li>
<li>Dual coding theory - VSA operations are noisy. To get the perfect original vector you need to put your representation through a clean up operation. But you can also get pretty good results just working with the noisy vector. This sort of tradeoff at a high level fits well with Daniel Kahneman-esque System 1 vs System 2 processing.</li>
<li>Hard to overstate how powerful having everything remain the same dimensionality is.</li>
<li>Capacity for error and noise in VSAs is what makes them cognitively plausible!!</li>
<li>NEF assumes that neurons work in a continuous space and with a rate code.</li>
<li>How VSAs are implemented in the brain is still very much an open problem - where and how.</li>
<li>Seem to use the Fourier transform HRR model here. Why? And how are the complex numbers represented?</li>
<li>pg. 133 back of envelope calculations on the number of neurons needed for a binding operation to be implemented. It is 140K neurons for two 500D vectors. This fits roughly with the number of neurons in a cortical column. But again depends also on dendritic processing capabilities and if NEF is actually being implemented.</li>
<li>Neurons connect with approximately 5% of others in the region they can extend to.</li>
<li>Note that this binding structure can be reused many times.</li>
<li>Later on it is noted that the spiking implementation does a form of soft regularization that is powerful!</li>
<li>Decide on HRR because continuous representation. Note there may be others and should explore and compare more</li>
<li>Able to learn a given transformation in an online setting. This looks very similar to the Perceptron update rule.</li>
<li>One step reading from cleanup memory. Emphasis on this but later talk about chunking without acknowledging the need for sequential decoding.
<ul>
<li>Totally ignores bio plausibility.</li>
<li>Also continuous SDM ~= Attention and Hopfield paper showed you then get convergence in one step so this approach is unnecessary?</li>
<li>Also makes sense that convergence times should change depending on the memory. Eg. Tip of the tongue phenomenon that Kanerva writes about!</li>
<li>Ravens Progressive Matrices - learn the pairwise transformations, average over them. can then use this transform to predict the next.</li>
<li>Reasoning by induction, able to generalize across objects and out of sample</li>
<li>Gets very similar results as humans which is very cool.</li>
<li>Do they get the same ones right and wrong?</li>
<li>This the same kind of reasoning by induction that [[Hyperdimensional Computing]] uses.</li>
<li>But did use random vectors and hard coded the reasoning rules…</li>
</ul>
</li>
<li>Chaining is a copy of Plate’s chunking with no citations for it.</li>
</ul>
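<p>Several of the notes above (binding preserves dimensionality, unbinding is noisy, clean-up recovers the exact vector) can be made concrete with a tiny HRR sketch using circular convolution via the FFT. This is my own illustration of Plate-style HRRs, not the book's NEF implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1024
# Random HRR vectors with components ~ N(0, 1/D), so norms are ~1.
codebook = rng.standard_normal((10, D)) / np.sqrt(D)
a, b = codebook[0], codebook[1]

def bind(x, y):
    # Circular convolution: the bound pair has the same dimensionality
    # as its constituents.
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

def unbind(c, x):
    # Circular correlation: an approximate (noisy) inverse of binding.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x).conj()).real

c = bind(a, b)          # c encodes the pair (a, b), still D-dimensional
b_noisy = unbind(c, a)  # noisy estimate of b

# Clean-up (argmax similarity over the codebook) recovers the exact
# stored vector despite the decoding noise.
sims = codebook @ b_noisy
assert np.argmax(sims) == 1
```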
<h3 id="ch-5-control">Ch. 5: Control</h3>
<ul>
<li>Basal Ganglia is the controller deciding how to route information and doing so through the Thalamus (probably does more things too but unclear what). In particular it takes an argmax of the possible actions and returns only one. Uses a double inhibition mechanism to release one of the possible actions from the thalamus</li>
<li>In this case treat it as an argmax giving a one hot encoding that the thalamus then applies to a matrix of possible vectors so it selects only one of them!</li>
<li>If statement in the form of similarity dot product between different options depending on what the current vector is.</li>
<li>Used to solve the Tower of Hanoi but only because all of the rules are worked out and used to solve it</li>
<li>Does fit lots of bio data in performance and fMRI esp. once added working memory model for what was currently being tracked.</li>
</ul>
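<p>The action selection scheme in the first two bullets - the basal ganglia computing an argmax, the thalamus using the resulting one-hot to release a single action vector - can be sketched as follows (a toy illustration, not the spiking double-inhibition circuit):</p>

```python
import numpy as np

def select_action(utilities, action_vectors):
    """Basal-ganglia-style selection: winner-take-all over utilities,
    then the thalamus releases only the winning row of the matrix of
    possible action vectors."""
    one_hot = np.zeros_like(utilities)
    one_hot[np.argmax(utilities)] = 1.0  # argmax as a one-hot encoding
    return one_hot @ action_vectors      # selects exactly one action

actions = np.eye(3) * [10.0, 20.0, 30.0]  # three candidate action vectors
chosen = select_action(np.array([0.1, 0.9, 0.3]), actions)
assert np.allclose(chosen, actions[1])    # highest utility wins
```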
<p>Some nice diagrams of basal ganglia action selection:
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FZfvF3fUrDU.jpg?alt=media&token=85d0cea2-293f-4db2-8b2d-2c2786c32359" alt="" />
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FSE2i7SYpU0.jpg?alt=media&token=6a765a2f-35f8-44fe-9895-94d679426341" alt="" /></p>
<ul>
<li>Chains of action need the BG but not single actions?</li>
<li>All cortical areas aside from primary visual and auditory project to the basal ganglia.</li>
<li>All BG output goes through the thalamus before going back to the cortex. Also the thalamus gets inputs directly from every sense except smell. Closed loops with all parts of the cortex.</li>
<li>Reticular nucleus of the thalamus forms a kind of shell around it, kind of like a meta thalamus.</li>
<li>Attention:
<ul>
<li>goes through a separate neural circuit that then connects to the PIT at the top of the visual processing pathway and then goes all the way down it back to the LGN.
<ul>
<li>it amplifies some of the signals that are coming up so they fill the whole receptive field to still utilize it
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2Fb6VqrlqOLG.jpg?alt=media&token=bed6ac95-bb6d-4560-af84-7f43592f65c0" alt="" /></li>
</ul>
</li>
<li>how attention is implemented in a cortical column. recall that all visual areas are part of the cortex.
<ul>
<li>TD and BU are top down and bottom up.
<ul>
<li><img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FVsJ0dczF8E.jpg?alt=media&token=a96bebd1-089e-404a-97ec-9491a38e5910" alt="" /></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>are able to replicate results with changes in attention for a moving grating?</li>
<li>can use basal ganglia to move sequentially through the alphabet and load into memory a question vs an answer and divert resources in the right way.</li>
<li>Able to replicate rats learning different utility functions. Here they are putting utilities directly into the BG. In the next chapter, with their learning rule, they can replicate rats learning which bandit levers to pull and their rates of switching.</li>
<li>Able to use this to then solve the Tower of Hanoi. Given the algorithm to solve it. The timing of each operation when given working memory does very closely model that of humans.</li>
<li>BOLD signal comes from dendritic processing driven by neurotransmitter usage, not neural activity (pg. 198). Able to use this to replicate which brain regions are active.</li>
</ul>
<h3 id="ch-6---memory-and-learning">Ch. 6 - Memory and Learning</h3>
<ul>
<li>Basic memory system is a leaky integrator - stores whatever was most recently put in until the next thing arrives.</li>
<li>Using two memory modules one to remember the first things put in (episodic, hippocampus) and last things (working memory, cortex) in order to reproduce the U shaped curve of remembering sequential facts.</li>
<li>Also able to replicate confusion between more similar objects and non sequential data that seems to use the same sequential learning system.</li>
<li>For learning they just have the NEF and plug in a Hebbian-like error signal that can come from a local or external signal; they refer to dopamine as maybe being this training signal, but how it propagates and is targeted precisely enough is highly unclear.
<ul>
<li>Use this to learn a convolution where the circular convolution is the output transformation that it then needs to optimize…</li>
<li>Is very nice that they use spike time dependent plasticity with actual spike trains and account for LTP vs LTD, and they are able to replicate this.</li>
</ul>
</li>
<li><strong>turns out their model is better to the human data than the dynamic equations because of the soft regularization that occurs on the HRR vectors!</strong>
<ul>
<li><strong>This is because neurons will saturate for very large vector values!</strong>
<ul>
<li>This is less optimal but emerges from the model (also don’t need to do explicit normalization) and replicates the human data. Cool.</li>
</ul>
</li>
</ul>
</li>
<li>Neurons’ synaptic weights do not increase indefinitely. Theorize there is some sort of normalization across the weights that happens?
<ul>
<li>SSC depression piece suggested that this happens during sleep and this is why for treatment resistant depression sleep deprivation works for 70% of them. Also feel better in the evenings, worse in the morning. and serotonin levels may affect this hence why it helps.
<ul>
<li>What are the bounds on the strengths of synaptic connections?</li>
</ul>
</li>
</ul>
</li>
<li>Ability to do vector binding only appears at age 4. “many of the more cognitive behaviors that require binding are not evident in people until about the age of 3 or 4 years eg. analogy in language.”</li>
<li>
<p>updated routing diagram that also incorporates in the dopamine signalling:
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FbHA40r7TTv.jpg?alt=media&token=6722afcd-6952-469e-9542-272a559b596b" alt="" /></p>
</li>
<li>able to replicate learning the rat lever task.</li>
<li>Used to solve the Wason logical cards task. Need to test that the rule holds and is unique by choosing two different cards; we succeed in the applied example but not the abstract one. Refers to social contract theory, but there seem to be plenty of other theories for why this result is the case. They train their optimizer on these examples and show it can reproduce them, and also that it can generalize to other examples in a social or abstract setting - but this is not surprising given the HRR semantics encoded in these different examples!</li>
<li>Robust to pruning up to 33% but does this also mean their model originally had too many neurons to represent each of these scalars?</li>
<li>Also an example of generalization benefitting from a few examples of Wason cards before then getting diminishing returns.</li>
</ul>
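<p>The leaky integrator memory from the first bullet is simple enough to sketch; with a small decay constant the most recent item dominates, giving the recency half of the U-shaped serial position curve described above (a toy illustration under assumed parameters, not Spaun's actual modules):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
D = 256
items = rng.standard_normal((8, D)) / np.sqrt(D)  # a sequence of 8 item vectors

def leaky_integrate(inputs, decay):
    """m <- decay * m + input: each new input overwrites the old trace,
    so recent items dominate when decay is small."""
    m = np.zeros(D)
    for x in inputs:
        m = decay * m + x
    return m

working = leaky_integrate(items, decay=0.3)  # working-memory-style trace
sims = items @ working
assert np.argmax(sims) == len(items) - 1     # the final item is best remembered
```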
<p>I NEED TO FINISH ADDING NOTES HERE ON THE REMAINING CHAPTERS.</p>
<hr />
<hr />
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:NeedToReadNEFBook" role="doc-endnote">
<p>Disclaimer that I have not read the original Neural Engineering Framework book and would love to be corrected if any of my understanding of it is incorrect. <a href="#fnref:NeedToReadNEFBook" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>How to Build a Brain over promised and under delivered but I appreciate its ambitious goal to “build a brain” and the efforts made towards it.