<h1>The Cerebellum Beyond Motor Control</h1>
<p><em>2022-11-19</em></p>
<p><em>Searching beyond the streetlight finds a plethora of important cerebellar functions.</em></p>
<hr />
<blockquote>
<p>A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is”. – <a href="https://en.wikipedia.org/wiki/Streetlight_effect">Source</a></p>
</blockquote>
<p><strong>Beyond the Streetlight</strong> – I believe this same phenomenon has occurred for our understanding of the cerebellum. Deficits in fine motor function are very easy to spot and were the simplest function to attribute to cerebellar lesions. This is still what is taught in most textbooks.</p>
<p>However, it is less well known that the cerebellum contains <a href="https://academic.oup.com/book/25657">~70%</a> of all neurons in the brain and is ubiquitous across organisms as varied as <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-neuro-080317-0621333">humans, fruit flies, and electric fish</a>. Over the last ~20 years in particular, evidence has been accumulating that the cerebellum, first <a href="https://link.springer.com/article/10.1007/s12311-020-01133-7">named by Leonardo da Vinci</a> for “little brain”, is more important than its size may suggest.</p>
<div align="center">
<img width="200" src="../images/CerebellumIsAwesome/CerebellumIntro.png" />
<br />
<em>This is the cerebellum. It is important for lots of things.</em>
</div>
<p>A prominent neuroscientist once said to me:</p>
<blockquote>
<p>A dirty little secret in cognitive science is that the cerebellum lights up for almost every task.</p>
</blockquote>
<p>Taking an excerpt from a <a href="https://pubmed.ncbi.nlm.nih.gov/22047489/">recent paper</a>:</p>
<blockquote>
<p>A review of 275 positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies revealed that cerebellar activation was observed during a broad range of tasks, including orienting attention, olfaction, spoken and written language, verbal working memory, problem solving, spatial memory, episodic memory, skill learning, and associative learning (Cabeza & Nyberg, 2000).</p>
<p>A broad range of neuropsychological deficits has also been documented following localized cerebellar pathology, with deficits across tasks of attention, working memory, language and naming, counting, visuospatial processing, planning, and abstract reasoning reported (Kalashnikova, Zveva, Pugacheva, & Korsakova, 2005).</p>
</blockquote>
<div align="center">
<img width="500" src="../images/CerebellumIsAwesome/SamWang.png" />
<br />
<em>Neural tracers have recently found closed loop circuits between the cerebellum and almost every other brain region. Graphical abstract of <a href="https://pubmed.ncbi.nlm.nih.gov/34551311/">Pisano et al. (2021)</a>.</em>
</div>
<p><strong>Cerebellar ubiquity</strong> – The cerebellum-like structure found in many insects including <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-neuro-080317-0621333">fruit flies</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/11240291/">ants</a>, and <a href="https://royalsocietypublishing.org/doi/10.1098/rstb.1982.0086">bees</a> is the Mushroom Body (MB). The MB is responsible for associative learning and is the primary region of the Drosophila brain that is not genetically pre-wired, instead containing many <a href="https://elifesciences.org/articles/04577">random connections that undergo learning</a>. Cerebellar equivalents have also been discovered in families of <a href="https://royalsocietypublishing.org/doi/10.1098/rstb.2015.0055">crustaceans and flatworms</a> through likely shared ancestral inheritance and <a href="https://www.sciencedirect.com/science/article/pii/S096098221101013X">cephalopods</a> through potentially convergent evolution.<sup id="fnref:Cephalopods" role="doc-noteref"><a href="#fn:Cephalopods" class="footnote" rel="footnote">1</a></sup></p>
<div align="center">
<img width="700" src="../images/CerebellumIsAwesome/MBCerebellum.png" />
<br />
<em>The shared circuitry between the cerebellum and Mushroom Body. Image taken from Figure 1 of <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-neuro-080317-0621333">Modi et al. (2020)</a>.</em>
</div>
<p><strong>Cerebellar intelligence</strong> – Just as a <a href="https://pubmed.ncbi.nlm.nih.gov/28215558/">positive correlation</a> has been found between the size of the MB and the intellect of various insects, it has <a href="https://www.pnas.org/doi/pdf/10.1073/pnas.2002896117">recently been discovered</a> that the human cerebellum is larger than previously thought, with a surface area equal to 78% of that of the neocortex – a much larger proportion than in other primates. For example, the macaque cerebellum is only 33% of the surface area of its neocortex.</p>
<p><strong>Sans cerebellum</strong> – Another critique of cerebellar importance is that people born without one can live. This is not unique to the cerebellum: people can be born without entire cortical lobes and show no phenotypic deficits. For example, <a href="https://www.wired.com/story/she-was-missing-a-chunk-of-her-brain-it-didnt-matter/">patient EG</a>, a highly intelligent lawyer who taught herself Russian later in life and scored in the 98th percentile for vocabulary, was only found at the age of 25 to be missing her entire left temporal lobe – the canonical center of language processing! Her right temporal lobe was found to have entirely <a href="https://evlab.mit.edu/assets/papers/Tuckute%20et%20al%202022%20Nplogia.pdf">compensated</a>.<sup id="fnref:LostEinstein" role="doc-noteref"><a href="#fn:LostEinstein" class="footnote" rel="footnote">2</a></sup></p>
<div align="center">
<img width="500" src="../images/CerebellumIsAwesome/PatientEG.png" />
<br />
<em>Patient EG missing her left temporal lobe, which is the seat of language processing in normal brains (containing both Broca's and Wernicke's areas). <a href="https://evlab.mit.edu/assets/papers/Tuckute%20et%20al%202022%20Nplogia.pdf">Source</a></em>
</div>
<p>Being familiar with cases like patient EG, it is reasonable to assume the same compensation and lack of phenotypic deficiency would occur with a missing cerebellum. However, this turns out not to be the case: a missing cerebellum permanently harms much more than fine motor control, including language development and emotion. Here are excerpts from an <a href="https://www.npr.org/sections/health-shots/2015/03/16/392789753/a-man-s-incomplete-brain-reveals-cerebellum-s-role-in-thought-and-emotion">interview</a> with Jonathan, who was born without a cerebellum:</p>
<blockquote>
<p>“All his milestones were late: sitting up, walking, talking.” […] He also lacks the balance to ride a bicycle.</p>
<p>“Reaction time, not my strong suit,” Jonathan says, adding that he doesn’t drive anymore. Emotional complexity is another challenge for Jonathan, says his sister, Sarah Napoline. She says her brother is a great listener, but isn’t introspective.</p>
<p>“He doesn’t really get into this deeper level of conversation that builds strong relationships, things that would be the foundation for a romantic relationship or deep enduring friendships,” she says. Jonathan, who is sitting beside her, says he agrees. – <a href="https://www.npr.org/sections/health-shots/2015/03/16/392789753/a-man-s-incomplete-brain-reveals-cerebellum-s-role-in-thought-and-emotion">Source</a></p>
</blockquote>
<p><strong>Summary</strong> – The fact that the same fundamental cerebellar architecture appears across such diverse species, the growing evidence of its involvement in most cognitive functions, and the correlation between its size and intelligence all indicate that cerebellum-like neuronal architectures perform a crucial and differentiated cognitive operation that spans far beyond motor control.</p>
<p>(Shameless plug: Sparse Distributed Memory is a theory for cerebellar function that is also a close approximation to the state of the art Transformer deep learning architecture: <a href="https://www.trentonbricken.com/Attention-Approximates-Sparse-Distributed-Memory/">Attention Approximates Sparse Distributed Memory</a>.)</p>
<hr />
<p><em>Thanks to <a href="https://james-simon.github.io/">Jamie Simon</a> for spotting a typo. All remaining errors are mine and mine alone.</em></p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Cephalopods" role="doc-endnote">
<p>Note that the extent to which the cephalopod vertical lobe approximates the cerebellum is contested (<a href="https://royalsocietypublishing.org/doi/10.1098/rstb.2015.0055">pro convergence</a>; <a href="https://pubmed.ncbi.nlm.nih.gov/25644267/">contra convergence</a>). <a href="#fnref:Cephalopods" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:LostEinstein" role="doc-endnote">
<p>It is wild to me that there is <em>no</em> consequence of missing an entire lobe of the brain. Across animals, the number of neurons (relative to body size) matters a lot for intelligence. Patient EG is highly intelligent, but it seems plausible that she could have been even smarter if she were not missing a large number of neurons (computational units). <a href="#fnref:LostEinstein" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h3 id="citation">Citation</h3>
<p>If you found this post useful for your research please use this citation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{CerebellumBeyondMotorControl,
title={The Cerebellum Beyond Motor Control},
url={https://www.trentonbricken.com/Cerebellum-Beyond-Motor-Control/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={November}}
</code></pre></div></div>
<hr />
<h1>HRRs Can’t Recast Self Attention</h1>
<p><em>2022-11-19</em></p>
<p><em>Why Holographic Reduced Representations cannot be used to “Recast Self Attention”.</em></p>
<hr />
<p>A paper “<a href="https://kdd-milets.github.io/milets2022/papers/MILETS_2022_paper_5942.pdf">Recasting Self-Attention with Holographic Reduced Representations</a>” was recently posted that claims to use Holographic Reduced Representations (HRRs) to “recast” Transformer Self Attention. While the paper shows some interesting empirical results, I explain why I think the work is flawed in its theoretical underpinnings.</p>
<p>I’m taking the time to write this critique because I believe it is a critical period for Vector Symbolic Architectures (VSAs) to interface with Deep Learning and that this work represents VSAs poorly.</p>
<hr />
<p>For some background, the Transformer Self Attention equation for a single query vector is the following:</p>
\[V \text{softmax}(K^T \mathbf{q}_t)\]
<p>where our values and keys are vectors of dimension \(d\) stored columnwise in matrices \(K \in \mathbb{R}^{d\times T}\), \(V \in \mathbb{R}^{d\times T}\), and we are only considering a single query vector \(\mathbf{q}_t \in \mathbb{R}^{d}\). \(T\) denotes the number of tokens in the receptive field of the model and the subscript \(t\) denotes the current time point from which we are predicting the next token. This time point determines the current query and will become crucial later.</p>
<p>We will write:</p>
\[\mathbf{\hat{a}}_t = K^T \mathbf{q}_t = [ \mathbf{k}_1^T \mathbf{q}_t, \mathbf{k}_2^T \mathbf{q}_t , \dots, \mathbf{k}_T^T \mathbf{q}_t ]^T,\]
<p>to be the attention vector before the softmax operation. Creating this vector takes \(O(dT)\) compute (\(T\) dot products each of \(d\) dimensional vectors).</p>
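<p>As a concrete reference point, the single-query Self Attention update above can be sketched in a few lines of numpy. The dimensions \(d=16\), \(T=8\) are illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 8
K = rng.standard_normal((d, T))  # keys stored columnwise
V = rng.standard_normal((d, T))  # values stored columnwise
q_t = rng.standard_normal(d)     # the single current query

def softmax(x):
    e = np.exp(x - x.max())      # shift for numerical stability
    return e / e.sum()

a_hat = K.T @ q_t                # attention vector before the softmax: O(dT)
y = V @ softmax(a_hat)           # updated representation for time t
print(y.shape)                   # (16,)
```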
<hr />
<p>Now here is what the <a href="https://kdd-milets.github.io/milets2022/papers/MILETS_2022_paper_5942.pdf">paper</a> is doing:</p>
<p>#1. Bind keys and values across the sequence together using the VSA bind operator \(\otimes\) to create the superposition vector \(\mathbf{s}_{kv}\):</p>
\[\mathbf{s}_{kv} = \sum_i^T \mathbf{k}_i \otimes \mathbf{v}_i\]
<p>All you need to know about the bind operator is that it produces another \(d\)-dimensional vector and is invertible: \((\mathbf{a} \otimes \mathbf{b}) \otimes \mathbf{a}^{-1} = \mathbf{b}\).</p>
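<p>For intuition, here is a sketch of the standard HRR bind operator, circular convolution, together with its usual approximate inverse (the involution). The paper's exact implementation may differ, and unbinding with the involution is only approximate, recovering a noisy version of \(\mathbf{b}\):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
# HRR vectors are typically drawn i.i.d. N(0, 1/n)
a = rng.normal(0, 1 / np.sqrt(n), n)
b = rng.normal(0, 1 / np.sqrt(n), n)

def bind(x, y):
    # circular convolution, computed via the FFT
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def approx_inverse(x):
    # involution: x^{-1}[i] = x[-i mod n]
    return np.concatenate(([x[0]], x[:0:-1]))

# (a ⊗ b) ⊗ a^{-1} ≈ b, up to noise that shrinks with n
b_hat = bind(bind(a, b), approx_inverse(a))
cos = b_hat @ b / (np.linalg.norm(b_hat) * np.linalg.norm(b))
print(cos)  # well above the ~0 expected for unrelated random vectors
```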
<p>#2. Create a superposition of all queries across the sequence:</p>
\[\mathbf{s}_q = \sum_i^T \mathbf{q}_i\]
<p>#3. Unbind the query superposition from the key value superposition (this computes the query key dot products between all queries and keys but in superposition):</p>
\[\begin{align}
\mathbf{z} &= \mathbf{s}_{kv} \otimes \mathbf{s}_q^{-1} = \Big ( \sum_i^T ( \mathbf{k}_i \otimes \mathbf{v}_i ) \Big ) \otimes \Big (\sum_i^T \mathbf{q}_i \Big )^{-1} \\
&= \mathbf{q}_1^{-1} \otimes \Big ( \mathbf{k}_1 \otimes \mathbf{v}_1 + \dots + \mathbf{k}_T \otimes \mathbf{v}_T \Big ) + \dots + \mathbf{q}_T^{-1} \otimes \Big ( \mathbf{k}_1 \otimes \mathbf{v}_1 + \dots + \mathbf{k}_T \otimes \mathbf{v}_T \Big )
\end{align}\]
<p>#4. Extract the attention weights by doing a cosine similarity (CS) between each value vector and \(\mathbf{z}\) where \(\epsilon\) is a noise term for everything that doesn’t have the corresponding \(\mathbf{v}_i\) match.</p>
\[\begin{align}
\mathbf{\tilde{a}}_t &= [ \text{CS}(\mathbf{v}_1, \mathbf{z}), \dots, \text{CS}(\mathbf{v}_T, \mathbf{z}) ]^T \\
&= [ \text{CS}(\mathbf{v}_1, \mathbf{v}_1 \otimes \mathbf{k}_1 \otimes \sum_i^T \mathbf{q}_i^{-1} +\epsilon ), \dots, \text{CS}(\mathbf{v}_T, \mathbf{v}_T \otimes \mathbf{k}_T \otimes \sum_i^T \mathbf{q}_i^{-1} +\epsilon) ]^T \\
&\approx [ \sum_i^T \mathbf{k}_1^T \mathbf{q}_i +\epsilon, \dots, \sum_i^T \mathbf{k}_T^T \mathbf{q}_i+\epsilon ]^T
\end{align}\]
<p>Can you spot the difference between this \(\mathbf{\tilde{a}}_t\) and the original Self Attention \(\mathbf{\hat{a}}_t\)?</p>
<p>\(\mathbf{\tilde{a}}_t\) computes the dot product between each key vector and <strong>every</strong> query, not just the current query \(\mathbf{q}_t\) that should be the <em>only</em> query used to predict the next token. This means that every attention weight vector is the same across the entire sequence: \(\mathbf{\tilde{a}}_i = \mathbf{\tilde{a}}_j \; \forall \, i,j \in [1,T]\).</p>
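<p>Setting aside the binding noise \(\epsilon\), the core problem can be seen with plain dot products: the query superposition yields logits that mix every query, and since the superposition is the same regardless of \(t\), the result is identical at every time step. A small numpy sketch (sizes are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 512, 4
K = rng.standard_normal((d, T)) / np.sqrt(d)  # keys, columnwise
Q = rng.standard_normal((d, T)) / np.sqrt(d)  # queries, columnwise

true_at_t = K.T @ Q[:, 3]         # true pre-softmax logits at t = 3
superposed = K.T @ Q.sum(axis=1)  # what the query superposition yields

# The superposed logits mix in every query, so they generally differ
# from the logits at any single time step -- and, not depending on t,
# they are the same for every position in the sequence.
print(np.allclose(true_at_t, superposed))  # False in general
```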
<p>There are two ways to modify this approach so that it is a true recasting of Attention; however, both of them remove the speedup claimed by the paper, leaving only the added noise from \(\epsilon\)!</p>
<p>First, if a masked language setting is implemented correctly<sup id="fnref:Masking" role="doc-noteref"><a href="#fn:Masking" class="footnote" rel="footnote">1</a></sup>, at e.g. \(t=5\) we don’t have access to the keys, queries and values for \(t>5\). This means that as we move across the sequence, we incrementally add queries to our query superposition and keys/values to our key+value superposition, and compute all of the above equations (#1-#4) each time. This gives \(O(dT^2)\) complexity, where \(d\) is the dimensionality and \(T\) is the sequence length (\(dT\) operations for a single query, because we compute a cosine similarity with every value vector, repeated for each incremental query in the sequence).</p>
<p>Second, rather than adding more vectors to the superposition, making it noisier, we can keep each query separate when we perform the above operations. However, this is again \(O(dT^2)\) complexity and reveals how using VSAs here doesn’t make sense. We bind together every key and value vector to compute a noisy dot product with the query in superposition, only to then unbind all of them again? This is more expensive and noisier than merely doing a dot product between every key and query as in the original attention operation!</p>
<p>To conclude, while I share with the paper authors the desire to integrate VSAs into Deep Learning, the way it has been done here is ineffective and misleading. What is created is not a re-casting of the Attention operation. It is surprising that it does better than baselines on some idiosyncratic benchmarks and this may also be due to an implementation error.</p>
<p>Please reach out if I am missing something about this paper as I am happy to discuss it and revise this blog post.</p>
<hr />
<p><em>Thanks to <a href="https://redwood.berkeley.edu/people/denis-kleyko/">Denis Kleyko</a> for helpful comments and discussion. All remaining errors are mine and mine alone.</em></p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Masking" role="doc-endnote">
<p>I am concerned that the results in the paper that beat benchmarks are the result of incorrect masking. <a href="#fnref:Masking" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h3 id="citation">Citation</h3>
<p>If you found this post useful for your research please use this citation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{HHRsRecastingAttention,
title={HRRs Can't Recast Self Attention},
url={https://www.trentonbricken.com/Contra-Recasting-Attention/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={November}}
</code></pre></div></div>
<hr />
<h1>Sparse Distributed Memory and Hopfield Networks</h1>
<p><em>2022-10-18</em></p>
<p><em>How Hopfield Networks are a special case of the biologically plausible Sparse Distributed Memory.</em></p>
<hr />
<p>Going off citation count for their original, seminal papers, <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">Hopfield Networks</a> are ~24x more popular than <a href="https://mitpress.mit.edu/9780262514699/sparse-distributed-memory/">Sparse Distributed Memory</a> (SDM) (24,362 citations versus 1,337). I think this is a shame because SDM can not only be viewed as a generalization of both the original and more modern Hopfield Networks but also passes a higher bar for biological plausibility – having a one-to-one mapping to the circuitry of the cerebellum. Additionally, like <a href="https://arxiv.org/abs/2008.02217">Hopfield Networks</a>, SDM has been shown to <a href="https://arxiv.org/abs/2111.05498">closely relate</a> to the powerful Transformer deep learning architecture.</p>
<p>In this blog post, we first provide background on Hopfield Networks. We then review how Sparse Distributed Memory (SDM) is a more general form of the original Hopfield Network. Finally, we provide insight into how modern improvements to the Hopfield Network modify the weighting of patterns, making them even more convergent with SDM. (In other words, SDM can be seen as pre-empting modern Hopfield Net innovations).</p>
<div align="center">
<img width="1200" src="../images/HopfieldSDM/Frame 6.png" />
<br />
<em>Summary of how modifications to SDM and Hopfield Networks relate them to Transformers and the Brain. Question marks denote uncertain but potential links.</em>
</div>
<h2 id="background-on-sdm-and-hopfield-networks">Background on SDM and Hopfield Networks</h2>
<p>The fundamental difference between SDM and Hopfield Networks lies in the primitives they use. In SDM, the core primitive is neurons that patterns are written into and read from. Hopfield Networks do a figure-ground inversion, where the core primitive is patterns and it is from their storage/retrieval that neurons implicitly appear.</p>
<p>To make this more concrete, we first provide a quick background on how SDM works:</p>
<h3 id="sdm-background">SDM Background</h3>
<p><em>If you like videos then watch the first 10 mins of <a href="https://www.youtube.com/watch?v=THIIk7LR9_8">this talk</a> I gave on how SDM works and skip the rest of this section.</em></p>
<div align="center">
<img width="700" src="../images/HopfieldSDM/Frame 43.png" />
<br />
<em>Summary SDM write operations (top row) and read operations (bottom row). The bottom left sub-figure shows the neuron view. The bottom right sub-figure shows how the neurons can be abstracted away and the original pattern locations considered.</em>
</div>
<p>To keep things simple, we will introduce the continuous version of SDM, where all neurons and patterns exist on the \(L^2\) unit norm hypersphere and cosine similarity is our distance metric. The original version of SDM used binary vectors and the Hamming distance metric.</p>
<p>SDM randomly initializes the addresses of \(r\) neurons on the \(L^2\) unit hypersphere in an \(n\) dimensional space. These neurons have addresses that each occupy a column in our address matrix \(X_a \in (L^2)^{n\times r}\), where \(L^2\) is shorthand for all \(n\)-dimensional vectors existing on the \(L^2\) unit norm hypersphere. Each neuron also has a storage vector used to store patterns represented in the matrix \(X_v \in \mathbb{R}^{o\times r}\), where \(o\) is the output dimension.</p>
<p>Patterns also have addresses constrained on the \(n\)-dimensional \(L^2\) hypersphere that are determined by their encoding; pattern encodings can be as simple as flattening an image into a vector or as complex as preprocessing the image through a deep convolutional network.</p>
<p>Patterns are stored by activating all nearby neurons within a cosine similarity threshold \(c\), and performing an elementwise summation with the activated neurons’ storage vector. Depending on the task at hand, patterns write themselves into the storage vector (e.g., during a reconstruction task) or write another pattern, possibly of different dimension (e.g., writing in their one hot label for a classification task).</p>
<p>Because in most cases we have fewer neurons than patterns, the same neuron will be activated by multiple different patterns. This is handled by storing the pattern values in superposition via the aforementioned elementwise summation operation. The fidelity of each pattern stored in this superposition is a function of the vector orthogonality and dimensionality \(n\).</p>
<p>Using \(m\) to denote the number of patterns, matrix \(P_a \in (L^2)^{n\times m}\) for the pattern addresses, and matrix \(P_v \in \mathbb{R}^{o\times m}\) for the values the patterns want to write, the SDM write operation is:
\begin{align}
\label{eq:SDMWriteMatrix}
X_v = P_v b_c \big ( P_a^T X_a \big ), \qquad
b_c(e)=
\begin{cases}
1, & \text{if}\ e \geq c \\
0, & \text{otherwise}
\end{cases}
\end{align}
where \(b_c(e)\) performs an element-wise binarization of its input to determine which pattern and neuron addresses are within the cosine similarity threshold \(c\) of each other.</p>
<p>Having written patterns into our neurons, we read from the system by inputting a query \(\boldsymbol{\xi}\), that again activates nearby neurons. Each activated neuron outputs its storage vector and they are all summed elementwise to give a final output \(\mathbf{y}\). The output \(\mathbf{y}\) can be interpreted as an updated query and optionally \(L^2\) normalized again as a post processing step:</p>
<p>\begin{align}
\label{eq:SDMReadMatrix}
\mathbf{y} = X_v b_c \big ( X_a^T \boldsymbol{\xi} \big).
\end{align}</p>
<p>Intuitively, SDM’s query will update towards the values of the patterns with the closest addresses. This is because the closest patterns will have written their values into more of the neurons the query reads from than any competing patterns. For example, in the summary figure above, the blue pattern address is closest to the query, meaning it appears most often in the nearby neurons the query reads from.</p>
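<p>The write and read equations above can be sketched directly in numpy. The dimensions, threshold \(c\), and query noise level here are illustrative choices, not values from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, m = 256, 1000, 5   # dimensions, neurons, patterns
c = 0.1                  # cosine similarity activation threshold

def l2_normalize(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)

X_a = l2_normalize(rng.standard_normal((n, r)))  # neuron addresses
P_a = l2_normalize(rng.standard_normal((n, m)))  # pattern addresses
P_v = P_a.copy()          # autoassociative: values = addresses

def b(E):
    # Heaviside threshold b_c applied elementwise
    return (E >= c).astype(float)

# Write: X_v = P_v b_c(P_a^T X_a)
X_v = P_v @ b(P_a.T @ X_a)

# Read with a slightly noisy version of pattern 0 as the query
noise = rng.standard_normal((n, 1))
xi = l2_normalize(P_a[:, [0]] + 0.3 * noise / np.linalg.norm(noise))
y = X_v @ b(X_a.T @ xi)
y = y / np.linalg.norm(y)  # optional L2 post-processing

# The query updates towards the closest stored pattern (index 0)
print(int(np.argmax(P_a.T @ y)))
```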
<p>Another potentially useful perspective is to see SDM as a single hidden layer MLP with a few modifications:</p>
<ul>
<li>All neuron addresses \(X_a\) are randomly initialized and fixed.</li>
<li>All \(X_a\), \(P_a\) patterns and \(\boldsymbol{\xi}\) queries are \(L^2\) normalized.</li>
<li>The activation function is a Heaviside step: neurons above the activation threshold output 1 and all others output 0 (this is not differentiable).</li>
<li>The neurons have no bias term.</li>
</ul>
<p>In a <a href="https://openreview.net/forum?id=JknGeelZJpHP">recent paper</a>, we resolve a number of these modifications to make SDM more compatible with Deep Learning. This results in a model that is very good at continual learning!</p>
<div align="center">
<img width="700" src="../images/HopfieldSDM/Frame 76.png" />
<br />
<em>How our SDM notation can be mapped onto the structure of a single hidden layer MLP.</em>
</div>
<p>In <a href="https://arxiv.org/abs/2111.05498">another paper</a>, we show that the SDM update rule is closely approximated by the softmax update rule used by Transformer Attention. This will be relevant later with newer versions of Hopfield Networks that also show this relationship.</p>
<h3 id="hopfield-background">Hopfield Background</h3>
<p>Hopfield Networks before the modern continuous version all use bipolar vectors \(\bar{\mathbf{a}} \in \{-1,1\}^n\) where the bar denotes bipolarity. The Hopfield Network update rule in matrix form, in its typical auto-associative format where \(P_v=P_a\) is:<sup id="fnref:HopfieldAutoAssoc" role="doc-noteref"><a href="#fn:HopfieldAutoAssoc" class="footnote" rel="footnote">1</a></sup></p>
<p>\begin{equation}
\label{eq:HopfieldUpdateRule}
\bar{\mathbf{y}} = \bar{g}( \bar{P}_a \bar{P}_a^T \bar{\boldsymbol{\xi}} ) \qquad \bar{g}(e)=
\begin{cases}
1, & \text{if}\ e > 0 \\
-1, & \text{otherwise}
\end{cases}
\end{equation}</p>
<p>Interpreting this equation from the perspective of SDM, we first compute a re-scaled and approximate version of the cosine similarity between the query and pattern addresses, \(\bar{P}_a^T \bar{\boldsymbol{\xi}}\). This works because taking the dot product between bipolar vectors and dividing by \(n\) converts the interval \([-n, n]\) to the \([-1,1]\) of cosine similarity:</p>
<p>\begin{equation}
\label{eq:BipolarHammConversion}
\mathbf{x}^T \mathbf{y} \approx \frac{\bar{\mathbf{x}}^T\bar{\mathbf{y}}}{n}
\end{equation}</p>
<p>This relationship is exact between binary SDM and bipolar Hopfield Networks; we leave the details to Appendix B.6 of our <a href="https://arxiv.org/abs/2111.05498">paper</a>.</p>
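<p>The re-scaling can be checked in a couple of lines: for bipolar vectors, dividing the dot product by \(n\) gives exactly the cosine similarity of their \(L^2\)-normalized counterparts, since every bipolar vector has norm \(\sqrt{n}\):</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 64
x_bar = rng.choice([-1.0, 1.0], size=n)
y_bar = rng.choice([-1.0, 1.0], size=n)

# cosine similarity of the normalized vectors ...
cos = (x_bar / np.linalg.norm(x_bar)) @ (y_bar / np.linalg.norm(y_bar))
# ... equals the bipolar dot product divided by n
print(np.isclose(cos, (x_bar @ y_bar) / n))  # True
```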
<p>We use this distance metric to weight each pattern before mapping back into our bipolar space with \(\bar{g}(\cdot)\).</p>
<p>Instead of first multiplying the query with the pattern addresses (\(\bar{P}_a^T \bar{\boldsymbol{\xi}}\)) like in SDM, Hopfield Networks instead typically perform \(\bar{P}_a \bar{P}_a^T=M\) which gives us a symmetric, \(n \times n\) dimensional matrix. We can interpret this symmetric matrix \(M\) as containing \(n\) neurons where each neuron’s address is a row and its value vector is the corresponding column, which by symmetry is the row transpose. Therefore, the number of neurons is defined by the pattern dimensionality \(n\) and the neuron address and value vectors are derived from \(\bar{P}_a \bar{P}_a^T\). This is how neurons emerge from the patterns in Hopfield Networks.</p>
<p>What is most distinct about this operation in comparison with SDM is that there is no activation threshold between the patterns and query (\(\bar{P}_a^T \bar{\boldsymbol{\xi}}\)). As a result, every pattern has an effect on the update rule including positive attraction and negative repulsion forces. We will see how more modern versions of the Hopfield Network have in fact re-implemented activation thresholds that are increasingly reminiscent of that used by SDM.</p>
<h2 id="sdm-as-a-generalization-of-hopfield-networks">SDM as a Generalization of Hopfield Networks</h2>
<p>It was first established by <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">Keeler</a> that SDM is a generalization of Hopfield Networks. <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">Hopfield Networks</a> can be represented by SDM’s neuron primitives as a special case where there are no distributed reading or writing operations. In this special case, the read operation weights each pattern by the bipolar re-scaling of its cosine similarity to the query.</p>
<p>The Hopfield version of SDM has neurons centered at each pattern so \(r=m\) and \(X_a=P_a\). Distributed writing and reading are removed by setting the cosine similarity threshold for writing to \(d_\text{write}=1\) and for reading to \(d_\text{read}=-1\).</p>
<p>This means patterns are only written to the neurons centered at them and the query reads from every neuron. In addition, for reading, rather than binary neuron activations using the Heaviside function \(b(\cdot)\), neurons are weighted by the bipolar version of the cosine similarity given by \(\bar{\mathbf{x}}^T\bar{\mathbf{y}}\). Writing out the SDM read operation in full with bipolar vectors that make our normalizing constant unnecessary, we have:</p>
\[\bar{\mathbf{y}} = \bar{g} \Big ( \bar{X}_v b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{P}_a}{n} \big ) \Big ) = \bar{g} \Big ( \underbrace{ \bar{P}_v b_{d_\text{write}}( \frac{\bar{X}_a^T \bar{P}_a}{n})}_{\text{Write Patterns}} \underbrace{ b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{P}_a}{n} \big ) }_{\text{Read Patterns}} \Big )\]
<p>Looking first at the write operation where \(d_\text{write}=1\):</p>
\[X_v = P_v b_c \big ( P_a^T X_a \big ) = \bar{P}_v b_{d_\text{write}}(\frac{\bar{X}_a^T\bar{P}_a}{n})=\bar{P}_v I = \bar{P}_v=\bar{P}_a\]
<p>where \(I\) is the identity matrix and \(P_v=P_a\) in the typical autoassociative setting. For the read operation we remove the threshold and cosine similarity re-scaling:</p>
\[b_{d_\text{read}}(\frac{\bar{X}_a^T\bar{\boldsymbol{\xi}} }{n}) = \bar{X}_a^T \bar{\boldsymbol{\xi}} =\bar{P}_a^T \bar{\boldsymbol{\xi}}.\]
<p>Together, these modifications turn SDM using neuron representations into the original Hopfield Network:</p>
\[\bar{g} \Big ( \underbrace{ \bar{P}_v b_{d_\text{write}}( \frac{\bar{X}_a^T \bar{P}_a}{n})}_{\text{Write Patterns}} \underbrace{ b_{d_\text{read}} \big ( \frac{\bar{X}_a^T \bar{P}_a}{n} \big ) }_{\text{Read Patterns}} \Big ) = \bar{g}( \bar{P}_a \bar{P}_a^T \bar{\boldsymbol{\xi}} )\]
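<p>A quick numerical check of this reduction (sizes are illustrative): with \(\bar{X}_a=\bar{P}_a\), a write threshold of 1 turns \(b_{d_\text{write}}\) into the identity matrix, and reading without a threshold reproduces the original Hopfield update:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 7
P_a = rng.choice([-1.0, 1.0], size=(n, m))  # bipolar pattern addresses
xi = rng.choice([-1.0, 1.0], size=n)        # bipolar query

def g(e):
    # bipolar sign non-linearity
    return np.where(e > 0, 1.0, -1.0)

# Hopfield update: g(P_a P_a^T xi)
hopfield = g(P_a @ (P_a.T @ xi))

# SDM special case with X_a = P_a: write threshold 1 keeps only each
# pattern's own neuron (an identity matrix, since distinct random bipolar
# patterns never reach similarity 1) ...
b_write = ((P_a.T @ P_a) / n >= 1.0).astype(float)
X_v = P_a @ b_write  # = P_a
# ... and reading with no threshold weights every neuron by raw similarity
sdm = g(X_v @ (P_a.T @ xi))

print(np.array_equal(hopfield, sdm))  # True
```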
<h2 id="differences-between-sdm-and-hopfield-networks">Differences between SDM and Hopfield Networks</h2>
<p>While Hopfield Networks have traditionally been presented and used in an autoassociative fashion, by using a synchronous update rule they can also be heteroassociative, \(\bar{\mathbf{y}} = \bar{g}( \bar{P}_v \bar{P}_a^T \bar{\boldsymbol{\xi}} )\), though they do not work <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">as well</a>. A workaround is to operate autoassociatively but concatenate together the pattern address and pointer, as introduced <a href="https://proceedings.neurips.cc/paper/2016/file/eaae339c4d89fc102edd9dbdb6a28915-Paper.pdf">here</a>. However, this concatenation is less biologically plausible than SDM’s solution of separate input lines capable of carrying different key and value vectors.</p>
<p>A second and more important difference is that Hopfield Networks are biologically implausible because of the weight transport problem, whereby the afferent and efferent weights are symmetric. At the expense of biology, these symmetric weights allow for the derivation of an energy landscape that can be used for convergence proofs and to solve optimization problems like the <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">Traveling Salesman</a>. Meanwhile, SDM is not only free from weight symmetry but also has its mapping onto the cerebellum outlined <a href="https://redwood.berkeley.edu/wp-content/uploads/2020/08/KanervaP_SDMrelated_models1993.pdf">here</a>.</p>
<p>A third difference between SDM and Hopfield Networks lies in how they weight their patterns. We can interpret the Hopfield update as computing the similarity \(P_a^T \boldsymbol{\xi}\) which, because of the bipolar values, has a maximum of \(n\), a minimum of \(-n\), and moves in increments of 2 (flipping from a +1 to -1 or vice versa). This distance metric between each pattern and the query results in the query being attracted to similar patterns and repulsed from dissimilar ones. Thus all patterns, aside from those that are completely orthogonal, contribute to the query update. In SDM, by contrast, it has been <a href="https://arxiv.org/abs/2111.05498">shown</a> that nearby patterns receive approximately exponential weighting while those further than \(d(\boldsymbol{\xi},\mathbf{p_a})>2d\) receive no weighting at all. We will expand upon this weighting and how modern improvements to Hopfield Networks have made them a closer relation to SDM.</p>
<p>Finally, both Hopfield Networks and SDM, while using different parameters, have been <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">shown</a> to have the same information content as a function of their parameter count. Yet, these parameters are used in different ways because of SDM’s neuron primitive. Notably, SDM can increase its memory capacity up to a limit by increasing the number of neurons \(r\) rather than needing to increase \(n\), the dimensionality of the patterns being stored.</p>
<h2 id="modern-hopfield-networks-approximate-sdm">Modern Hopfield Networks Approximate SDM</h2>
<div align="center">
<img width="700" src="../images/HopfieldSDM/Frame 113.png" />
<br />
<br />
<em>The evolution of the Hopfield Network update rule over time. The transition from the <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">original Hopfield Network</a> to the <a href="https://proceedings.neurips.cc/paper/2016/file/eaae339c4d89fc102edd9dbdb6a28915-Paper.pdf">Polynomial (Modern) Hopfield Network</a> applied a non-linearity to the dot product between the query and each of the patterns. This is reminiscent of SDM's activation threshold. This was then improved upon by making the non-linearity into an <a href="https://mathematical-neuroscience.springeropen.com/articles/10.1186/s13408-017-0056-2">exponential</a> and then the <a href="https://arxiv.org/abs/2008.02217">softmax</a>! This is a very close approximation to the weighting that SDM applies to each pattern.</em>
</div>
<p>The binary modern Hopfield Network introduced by <a href="https://proceedings.neurips.cc/paper/2016/file/eaae339c4d89fc102edd9dbdb6a28915-Paper.pdf">Krotov and Hopfield</a> showed that using higher order polynomials to assign new pattern weightings in the read operation increases capacity. In particular, odd and rectified polynomials put more weight on memories that are closer to the query, better separating the signal of the target pattern from the noise of all others. The energy function to be minimized is:</p>
\[E=-\sum_{\bar{\mathbf{p}}\in \bar{P}} K_a \Big( \bar{\mathbf{p}}_a \bar{\boldsymbol{\xi}} \Big )=-\sum_{\bar{\mathbf{p}}\in \bar{P}} K_a \Big( \sum_i^n [\bar{\mathbf{p}}_a]_i \bar{\boldsymbol{\xi}}_i \Big )\]
<p>where:</p>
\[K_a(x)=
\begin{cases}
x^a, & x \geq 0 \\
0, & x < 0
\end{cases}\]
<p>with the rectification for \(x<0\) being optional and \(a\) being the order of the polynomial. It can be shown that the original Hopfield Network energy function used a second order polynomial, \(a=2\). The query updates its bit in position \(i\) by comparing the difference in energy if this bit were “on” or “off” (1 and -1 when bipolar, as here):</p>
\[\label{eq:ModernHopfieldEnergyEquation}
\bar{\mathbf{y}}_i = \text{Sign}\Bigg[ \sum_{\bar{\mathbf{p}}\in \bar{P}} \Big ( K \big( [\bar{\mathbf{p}}_a]_i + \sum_{j \neq i} [\bar{\mathbf{p}}_a]_j \bar{\boldsymbol{\xi}}_j \big ) - K \big( - [\bar{\mathbf{p}}_a]_i + \sum_{j \neq i} [\bar{\mathbf{p}}_a]_j \bar{\boldsymbol{\xi}}_j \big ) \Big ) \Bigg]\]
<p>Whichever configuration between “on” and “off” gives the highest output from \(K(x)\), corresponding to a lower energy state, will be updated towards.</p>
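<p>The update rule above can be sketched directly in NumPy. The choice of \(a=3\), the rectification, and the tiny pattern matrix are illustrative assumptions; \(a=2\) without rectification recovers the classical behavior.</p>

```python
import numpy as np

def K(x, a=3, rectified=True):
    """Rectified polynomial K_a(x): x**a for x >= 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x ** a, 0.0) if rectified else x ** a

def update_bit(P, xi, i, a=3):
    """Set query bit i to whichever value (+1 or -1) lowers the energy:
    the sign of the summed K differences over all stored patterns (rows of P).
    Returns 0 only in the rare tied case."""
    mask = np.arange(P.shape[1]) != i
    partial = P[:, mask] @ xi[mask]   # sum_{j != i} p_j * xi_j, per pattern
    diff = K(P[:, i] + partial, a) - K(-P[:, i] + partial, a)
    return int(np.sign(diff.sum()))

# A corrupted bit gets pulled back toward the stored pattern:
P = np.array([[1, 1, 1, 1]])          # one stored pattern (row)
xi = np.array([1, 1, 1, -1])          # last bit corrupted
```

<p>Here <code class="language-plaintext highlighter-rouge">update_bit(P, xi, 3)</code> returns <code class="language-plaintext highlighter-rouge">1</code>, restoring the corrupted bit: the “on” configuration scores \(K(1+3)=64\) against \(K(-1+3)=8\) for “off”.</p>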
<p>Using even polynomials in the energy difference means that attraction to similar patterns (closer than orthogonal) is rewarded as much as pushing opposite patterns (further than orthogonal) further away. Odd polynomials also reward attraction to similar patterns but instead reward reducing the distance to opposite patterns, trying to make them orthogonal. In other words, an even polynomial would rather have a vector be opposite than orthogonal, while it is the other way around for an odd polynomial. Meanwhile, rectification of the polynomial, which empirically resulted in a higher memory capacity, simply ignores all opposite patterns. Finally, the capacity of this network was further improved by replacing the polynomial with an exponential function \(K(x)=\exp(x)\) <a href="https://mathematical-neuroscience.springeropen.com/articles/10.1186/s13408-017-0056-2">here</a>.</p>
<p>Fundamentally, these odd, rectified, and exponential \(K(x)\) functions make the Hopfield Network more similar to SDM by introducing activation thresholds around the query, either ignoring (in the rectified and exponential cases) or penalizing (in the odd polynomial case) vectors that are further than orthogonal distance away. The weighting of vectors remains different between the polynomials and the exponential, such that they have different information storage capacities. However, by introducing these de facto cosine similarity thresholds, all of these Hopfield variants converge with the read operation of SDM. This is particularly the case for the exponential variant because SDM weights its patterns in an approximately <a href="https://arxiv.org/abs/2111.05498">exponential fashion</a>.</p>
<p>Beyond the convergence in pattern weightings, we note that the last step in making the exponential variant of the modern Hopfield Network into Transformer Attention is to make it continuous in the paper <a href="https://arxiv.org/abs/2008.02217">“Hopfield Networks is All You Need”</a>. Making SDM continuous is the same step taken in <a href="https://arxiv.org/abs/2111.05498">our work</a> that results in the Attention approximation to SDM. This is done by modifying the energy function so that it enforces a vector norm (<a href="https://arxiv.org/abs/2111.05498">SDM does this too</a>) and then using the Concave Convex Optimization Procedure (CCP) to find local minima (flipping bits is no longer an option).</p>
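<p>A minimal sketch of the resulting continuous read, \(\bar{\mathbf{y}} = \bar{P}_v \, \text{softmax}(\beta \bar{P}_a^T \bar{\boldsymbol{\xi}})\), which is Transformer Attention with the stored addresses as keys and the pointers as values. The \(\beta\) value and sizes below are illustrative assumptions.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

def attention_read(P_a, P_v, xi, beta=8.0):
    """Weight each stored value by a softmax over its address's similarity
    to the query: nearby patterns get ~exponentially more weight, and
    patterns beyond a soft threshold get effectively none."""
    return P_v @ softmax(beta * (P_a.T @ xi))

# With orthogonal unit addresses and a large beta, the read is nearly
# a one-pattern lookup:
P_a = np.eye(3)                           # addresses as columns
P_v = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])         # values as columns
y = attention_read(P_a, P_v, np.array([1.0, 0.0, 0.0]))
```

<p>Lowering \(\beta\) softens the threshold so that more distant patterns contribute, mirroring a larger SDM read radius.</p>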
<p>The recent paper <a href="https://arxiv.org/abs/2202.04557">“Universal Hopfield Networks”</a> (disclaimer: I am in the acknowledgements for this paper and the first author is a friend) makes a similar point relating different activation functions that have emerged in Hopfield Networks to Attention but does not dive into the relationship between SDM and Hopfield Networks and their chronological evolution.</p>
<p>Finally, there are weak indications of convergence between the range of optimal SDM cosine similarity thresholds \(d\) and optimal polynomial orders \(a\) in the polynomial Hopfield Network. When pattern representations were learnt, it was found that the polynomial maximizing accuracy on a classification task was neither too small nor too large. This is also the case for the optimal \(d\); the two are related via their effect on the pattern weightings. The authors offer a useful interpretation of their system where some of the vectors serve as prototypes that lie very close to the solution, while other vectors represent features shared by different components. The best solution interpolated between these two approaches.</p>
<p>One can view the prototypes as anchoring the solution and the features as providing generalization ability. Even from SDM’s neuron perspective, the same intuition may apply: it can be advantageous to read some prototypes from nearby neurons while also collecting features from related patterns stored in more distant neurons. <a href="https://redwood.berkeley.edu/wp-content/uploads/2020/08/KanervaP_SDMrelated_models1993.pdf">This overview of SDM</a> gives a related example in which noisy patterns are stored and their combination reads out as a noise-free version, echoing this idea of drawing on features from related patterns.</p>
<h2 id="citation">Citation</h2>
<p>If you found this post useful for your research please use this citation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{SDMHopfieldNetworks,
title={Sparse Distributed Memory and Hopfield Networks},
url={https://www.trentonbricken.com/SDM-And-Hopfield-Networks/},
journal={Blog Post, trentonbricken.com},
author={Trenton Bricken},
year={2022}, month={October}}
</code></pre></div></div>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:HopfieldAutoAssoc" role="doc-endnote">
<p>The Hopfield Network is almost always autoassociative so \(P_v=P_a=X_a=X_v\) (<a href="https://arxiv.org/abs/2111.05498">citation</a>, <a href="https://www.sciencedirect.com/science/article/abs/pii/0364021388900262">citation</a>). <a href="#fnref:HopfieldAutoAssoc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><em>Thanks to the NSF Foundation and the Kreiman Lab for making this research possible. Also to <a href="https://twitter.com/DimaKrotov">Dmitry Krotov</a> and <a href="https://twitter.com/BerenMillidge">Beren Millidge</a> for useful conversations. All remaining errors are mine and mine alone.</em></p>
<h1 id="improving-weight-regularization-continual-learning-baselines">Improving Weight Regularization Continual Learning Baselines</h1>
<p><em>2022-10-04 · <a href="http://trentonbricken.github.io/Continual-Learning-Baselines">trentonbricken.github.io/Continual-Learning-Baselines</a></em></p>
<p><em>A simple modification to improve Elastic Weight Consolidation and Synaptic Intelligence continual learning baselines.</em></p>
<hr />
<p>A number of influential continual learning algorithms like <a href="https://arxiv.org/abs/1612.00796">Elastic Weight Consolidation</a> (EWC) and <a href="https://arxiv.org/abs/1703.04200?context=cs">Synaptic Intelligence</a> (SI) protect neural network weights important for previous tasks from being updated by newer tasks. Otherwise, these network weights are overwritten by the next task, resulting in catastrophic forgetting.</p>
<p>It turns out that existing implementations of these algorithms have, at least for some tasks, been significantly <strong>underestimating</strong> their performance. This is because they need a small modification to be able to actually learn which network weights need protecting. In a Split MNIST class incremental benchmark, this modification leads to 43% higher validation accuracy.</p>
<p>These weight regularization methods use the magnitude of gradients during the backwards pass to infer which weights are important for a particular task. They then use a regularization term in the loss function to penalize the model for updating these weights during new tasks. However, in cases where the model performs very well on all training data within a task, there is almost no gradient for the model to be able to learn which weights are important!</p>
<p>In order to restore gradient information, we introduce a \(\beta\) coefficient into the cross-entropy loss function when learning weight importances. This is used to make the model less confident in its predictions. A hyperparameter sweep found that \(\beta = 0.005\) worked the best for both EWC and SI, leading to 43% and 15% performance gains as shown in the below figure.</p>
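<p>To see the mechanism, note that the cross-entropy gradient with respect to the logits is \(\text{softmax}(\mathbf{z}) - \text{onehot}\), which vanishes once the correct class dominates. The sketch below is an illustration, assuming \(\beta\) scales the logits before the softmax (consult the paper for the exact implementation); it shows a small \(\beta\) restoring a usable gradient signal.</p>

```python
import numpy as np

def ce_grad_wrt_logits(logits, target, beta=1.0):
    """Gradient of CE(beta * logits) w.r.t. the logits:
    beta * (softmax(beta * logits) - onehot(target))."""
    z = beta * logits
    p = np.exp(z - z.max())   # stable softmax
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return beta * (p - onehot)

# A very confident (correct) prediction: logit margin of 100.
logits = np.array([100.0, 0.0, 0.0])
g_plain = ce_grad_wrt_logits(logits, target=0)             # vanishingly small
g_beta = ce_grad_wrt_logits(logits, target=0, beta=0.005)  # usable signal
```

<p>With \(\beta=1\) the gradient magnitude is on the order of \(e^{-100}\); with \(\beta=0.005\) it is on the order of \(10^{-3}\), enough signal for the weight importances to be learned.</p>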
<p><img src="../images/ContinualLearningBeta/MainTable.png" alt="BetaTable" /></p>
<p>This result is presented in the paper <a href="https://openreview.net/forum?id=JknGeelZJpHP">Sparse Distributed Memory is a Continual Learner</a>, which relies upon a version of MNIST where the images use pixel values [0,255] that are not rescaled to [0, 1] or normalized to have a mean of 0. We believe this makes the MNIST classes more orthogonal and easier for continual learning.<sup id="fnref:IntegerMNIST" role="doc-noteref"><a href="#fn:IntegerMNIST" class="footnote" rel="footnote">1</a></sup> However, even when the pixels are rescaled, there is still a performance boost (going from 27% when \(\beta=1.0\) to 54% when \(\beta=0.005\)), and there are also performance gains for Split CIFAR10.</p>
<p>The \(\beta\) coefficient will not increase performance independently of other parameters. In both of the above cases, we also use gradient clipping, SGD with momentum, and a single hidden layer of neurons. However, with these other modifications in place, \(\beta\) remains worth experimenting with to potentially get significant performance gains. Our results are better than those of any other paper we have seen that uses EWC or SI as baselines, including <a href="https://arxiv.org/abs/1810.12488">Re-evaluating Continual Learning Scenarios: A Categorization and Case for <strong>Strong</strong> Baselines</a> (emphasis mine).</p>
<hr />
<h2 id="replication">Replication</h2>
<p>There are two ways to reproduce our results: 1. Clone <a href="https://github.com/TrentBrick/Continual-Learning-Benchmark">this</a> GitHub repo which is a forked version of the original continual learning baseline and run the following command (after installing the requirements):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-u</span> iBatchLearn.py <span class="nt">--gpuid</span> 0 <span class="se">\</span>
<span class="nt">--repeat</span> 1 <span class="nt">--incremental_class</span> <span class="nt">--optimizer</span> SGD <span class="se">\</span>
<span class="nt">--momentum</span> 0.9 <span class="nt">--weight_decay</span> 0.0 <span class="nt">--force_out_dim</span> 10 <span class="se">\</span>
<span class="nt">--no_class_remap</span> <span class="nt">--first_split_size</span> 2 <span class="nt">--other_split_size</span> 2 <span class="se">\</span>
<span class="nt">--schedule</span> 4 <span class="nt">--batch_size</span> 512 <span class="nt">--model_name</span> MLP1000 <span class="se">\</span>
<span class="nt">--agent_type</span> customization <span class="nt">--agent_name</span> EWC_mnist <span class="se">\</span>
<span class="nt">--lr</span> 0.03 <span class="nt">--reg_coef</span> 20000 <span class="nt">--use_beta_coef</span> True <span class="se">\</span>
<span class="nt">--beta_coef</span> 0.005
</code></pre></div></div>
<p>This uses the version of Split MNIST with [0,1] pixel values.</p>
<p>(if you are running on CPU set <code class="language-plaintext highlighter-rouge">--gpuid -1</code>).</p>
<ol start="2">
<li>Clone <a href="https://github.com/anon8371/AnonPaper1">this</a> repo and in <code class="language-plaintext highlighter-rouge">test_runner.py</code> set:</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_style</span> <span class="o">=</span> <span class="n">ModelStyles</span><span class="p">.</span><span class="n">CLASSIC_FFN</span>
</code></pre></div></div>
<p>and</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cl_baseline</span> <span class="o">=</span> <span class="s">"EWC-MEMORY"</span><span class="p">,</span>
</code></pre></div></div>
<p>Then call: <code class="language-plaintext highlighter-rouge">python test_runner.py</code>!</p>
<p>This uses the version of Split MNIST with [0,255] pixel values.</p>
<h2 id="citation">Citation</h2>
<p>If you found this post useful for your research please cite the paper: <a href="https://openreview.net/forum?id=JknGeelZJpHP">Sparse Distributed Memory is a Continual Learner</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{SDMContinualLearning,
title={Sparse Distributed Memory is a Continual Learner},
url={https://openreview.net/forum?id=JknGeelZJpHP},
journal={ICLR Submission},
author={Anonymous},
year={2022}, month={September}}
</code></pre></div></div>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:IntegerMNIST" role="doc-endnote">
<p>This may explain why careful tuning of our regularization coefficient for the algorithms <a href="https://arxiv.org/abs/1711.09601">Memory Aware Synapses</a> and <a href="https://arxiv.org/abs/1312.6211">L2 Regularization</a> also led to slightly higher performance than reported in the <a href="https://arxiv.org/abs/1810.12488">baselines paper</a>. <a href="#fnref:IntegerMNIST" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><em>Thanks to the NSF Foundation and the Kreiman Lab for making this research possible. All remaining errors are mine and mine alone.</em></p>
<h1 id="reflections-two-years-into-the-phd">Reflections two years into the PhD</h1>
<p><em>2022-09-12 · <a href="http://trentonbricken.github.io/PhD-Reflections">trentonbricken.github.io/PhD-Reflections</a></em></p>
<p><em>It is absurd that the second year of my PhD has come to an end. I am taking a minute to reflect on what I have learnt and want to improve upon.</em></p>
<hr />
<p>Overall I am super happy and feel very lucky to be in my PhD program. On an average week I have ~2 hours of commitments in total. The rest of the time is totally unstructured where I am free to learn and pursue whatever it is that I’m interested in. My interests are centered enough around my research that I manage to keep my supervisor and collaborators happy (at least to date!). But I still have lots of free time to explore and learn about so many other things. I doubt I’ll ever have this degree of freedom and flexibility again in my life.</p>
<p><strong>Lessons:</strong></p>
<ul>
<li>Make feedback loops as tight as possible. This is both in the quality of data produced from an experiment and the speed at which it can be run. Your first experiment is never going to work, everything in the real world is messy and requires fine tuning. Build a system so that it can scale and run very fast.
<ul>
<li>Getting more meta, your first system to run things quickly will certainly not be all that fast so also invest in maintaining and refactoring it over time. Expect growing pains.</li>
<li>20% of the things you do will likely have 80% of the impact so take a lot of shots on target.</li>
</ul>
</li>
<li>If there is a problem with your system, fix it now before it becomes a problem later. In other words, prepare for the worst.</li>
</ul>
<blockquote>
<p>His years at sea had taught him that if you don’t fix something when you first see it beginning to fail, it is very likely to finish failing just when it is the most dangerous and the hardest to deal with, such as in the midst of a storm. <a href="https://www.worksinprogress.co/issue/the-maintenance-race/">Source</a></p>
</blockquote>
<ul>
<li>Nothing is new under the sun - the more old papers you read the better.</li>
<li>Use Twitter but only sparingly. It is a delicate dance between knowing what the state of the art is but also avoiding flavour of the month topics that are overly crowded.</li>
<li>Be very careful when taking on mentees, especially time-poor and inexperienced undergrads. More hands do not always make lighter work. There are real benefits to doing everything yourself. The mentorship may be a net time sink for your research.</li>
<li>Rotate with your highest priority labs first. I assumed that everyone would rotate for the first year and there would be an official point where everyone bid for and committed to a lab. In reality, it is far more ad hoc and if you don’t rotate with the labs you are most excited about and likely to join first then they may not have space by the time you rotate.</li>
<li>If you can always work more than four hours in a day then you might be doing the wrong kinds of work. It seems like most people (<a href="https://www.quantamagazine.org/june-huh-high-school-dropout-wins-the-fields-medal-20220705/">including Fields medalists</a>) can only do really deep work for approximately four hours a day. For other kinds of work where you are more on autopilot, like implementing an idea or replying to emails, you may be able to work for much longer. However, if you are always doing this kind of work and never thinking super deeply then you might be doing something wrong.</li>
<li>Be explicit about taking time off. Flexible hours can be great until you feel like you can and should be working all the time. This drains the enjoyment from time not spent working unless you are explicit about actually taking it off.</li>
<li>Think for longer, act more hesitantly. In hindsight there are a number of results that I could have foreseen if I had just thought for a bit longer and more deeply about the problem first. There are no hard deadlines in research (unlike in school). Use this to your advantage.</li>
<li>But if something is cheap to test then don’t theorize about it, just actually go and test it, immediately. Right now.</li>
<li>Most insights come unconsciously but you need to support their appearance. Read widely, go on deep dives into topics, ramble on and on in your notebook about ideas and see where it goes.</li>
<li>Life is cyclical. Moods, motivation levels, the type of work needed to drive a project forwards. Accept this rather than pretending the cycles aren’t there.</li>
<li>If you do ignore them… you might work at really suboptimal times and try to take shortcuts that hurt in the long run (see the bullet point about solving problems now rather than when they actually become problems).</li>
<li>But don’t let perfect be the enemy of good. It is too easy to decide not to start something that seems hard because you don’t feel at your absolute peak right now. E.g. starting to write that paper or reading that publication. Everything is iterative and you will probably need to read it again or re-write it to get it perfect anyways, but you can still capture a lot of the variance the first time around.</li>
<li>The publication itself and the main text in particular hide an incredible number of details, this makes talking to the paper authors really overpowered.</li>
<li>Working alone can have its advantages. Collaboration is great but if you never work alone and never force yourself to come up with your own research ideas then those muscles will never grow. They certainly didn’t in school where there is always a right answer and a very small number of inductive leaps are required to solve any problem.</li>
<li>However well you think you understand something, you certainly don’t until you go through the process of re-creating it for yourself.</li>
</ul>
<blockquote>
<p>What I cannot create I do not understand - Richard Feynman</p>
</blockquote>
<div align="center">
<img width="400" src="https://qph.cf2.quoracdn.net/main-qimg-87833c78a604ff07a82ff7787574e197.webp" />
<br />
<em> Found written on Feynman's office blackboard after he died. </em>
</div>
<ul>
<li>Everyone is naked. This is a dramatization of the phrase “the emperor has no clothes”. The more I climb ivory towers the more I realize that the people who you assumed knew everything and had everything under control often don’t. They are humans too. This is both terrifying because nobody is in control but also highly motivating because you can make a difference.</li>
<li>Just because you <em>can</em> work on something doesn’t mean you should. Taking an idea all the way to a publication is incredibly time consuming. And there are fixed costs to the endeavor that don’t scale and aren’t particularly educational, e.g. making Figure 4 for the 30th time with a slightly different color scheme this time.</li>
<li>When deciding if a project is worth pursuing think “if this went perfectly according to plan, what comes of it?” If the outcome in this unrealistic scenario isn’t all that great then look further.</li>
</ul>
<p>My favourite science quote:</p>
<blockquote>
<p>The Most Exciting Phrase in Science Is Not ‘Eureka!’ But ‘That’s funny …’ - Isaac Asimov</p>
</blockquote>
<p>Oh and if you haven’t read <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.html">You and Your Research</a> then you should.</p>
<p><strong>Goals:</strong></p>
<ul>
<li>Read more textbooks. I need to gain new tools for thought and invest more in the long term instead of immediate projects.</li>
<li>Find more collaborators. I have been working alone for too long. Moving to the Bay Area and working with the Redwood Theoretical Neuro Institute should really help with this.</li>
<li>Do more theory and less engineering.</li>
<li>Spend more time on <a href="https://www.worksinprogress.co/issue/the-maintenance-race/">maintenance</a>. Unit tests, refactoring code, etc.</li>
<li>Have a good answer to the Tyler Cowen question: “What form of routine practice do you do that is analogous to the way a pianist practices the scales?”</li>
<li>Share my research more openly via blog posts and get things on ArXiv more frequently.</li>
<li>Get better at accepting how the world actually is. There is something really powerful about seeing the world not as it should be or could have been but how it is right now. You’re in the cockpit, everything that has happened has happened, now what is going to happen next? I feel this most acutely with investment decisions, “ok yes I should have sold before the market crash. But what should I be doing right now?” Yet I think this generalizes to accepting that I am two years into grad school and turning 25 soon. What projects are sunk costs and what do I want to change going forwards? Meditation and journaling are two ways that I think can really help with this “world acceptance” and future planning.</li>
</ul>
<hr />
<p><em>Thanks to <a href="https://twitter.com/lord_applebee">Max Farrens</a> for reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>
<h1 id="book-review-talent">Book Review - Talent</h1>
<p><em>2022-07-25 · <a href="http://trentonbricken.github.io/Talent">trentonbricken.github.io/Talent</a></em></p>
<p><em>Talent is a conversation starter on the under discussed topic of how to identify talent but not much more.</em></p>
<hr />
<p>Maybe it is because I think so highly of Tyler Cowen that I expected the long anticipated <a href="https://www.amazon.com/Talent-Identify-Energizers-Creatives-Winners/dp/1250275814">Talent</a> to be harder hitting and more insightful. A couple of the interview questions are interesting and a few other factoids that I’ve tried to aggregate. But otherwise the book is basically a Psych 101 introduction to the Big 5 Personality types that is covered more carefully and thoroughly by its <a href="https://en.wikipedia.org/wiki/Big_Five_personality_traits">Wikipedia entry</a>.</p>
<p>In classic Straussian fashion, the main takeaways of the book in the foreword are trite while the actual ones are:</p>
<ol>
<li>Assume that the interviewee is trying to conceal themselves and everything is a canned answer.</li>
<li>Anything goes when it comes to breaking the ice and getting to know the real person that you will actually be hiring and interacting with every day.</li>
</ol>
<p>The discussion of interview questions and strategies for doing this is the strongest part of the book, but it is also only one chapter.</p>
<hr />
<p><strong>Interview questions:</strong></p>
<ul>
<li>What is one mainstream or consensus view that you whole-heartedly agree with?
<ul>
<li>This is the inverse of the “Thiel question”: “What do you believe to be true that very few others in the world would agree with you on?” (which actually originated from Tyler Cowen!) and is now said to have been over-used.</li>
</ul>
</li>
<li>Who are our competitors?</li>
<li>How do you think this interview is going?</li>
<li>Ask a classic question like “what is your greatest weakness” but keep asking it again and again to break through the canned answers and see how deep and organic the later (real) answers are.</li>
<li>What are the open tabs on your browser right now?</li>
<li>How successful do you want to be? / How ambitious are you?</li>
<li>What blogs do you read?</li>
<li>What views do you hold religiously, almost irrationally?</li>
<li>What’s the farthest you’ve ever been from another human?</li>
<li>Which of your beliefs are you most likely wrong about?</li>
<li>Ask a question about a thing you talked about during the interview, both to assess engagement with the conversation and because it is something they could not have prepared for.
<ul>
<li>E.g. “During the middle of this discussion, we chatted about [a very particular project]. What questions do you have about that project?”</li>
</ul>
</li>
<li>What do you do as routine practice in the same way a musician practices their scales?</li>
</ul>
<p>Questions I want to add to the list are:</p>
<ul>
<li>What do you want me to ask you?</li>
<li>List every possible way in which you can use a __? (towel; coffee mug; ballpoint pen) – this is a common test for creativity.</li>
</ul>
<p>Behind the scenes during the interview you want to:</p>
<ul>
<li>Get into conversational mode so they are more likely to be honest and themselves.</li>
<li>Flush out all of their canned answers.</li>
<li>Work out who this person views as important to impress – what their value and status hierarchy is</li>
<li>Substance and quality over articulateness</li>
<li>Ask questions you are genuinely curious to know the answer to – this will increase your own engagement in ways that can’t be faked and the quality of the conversation.</li>
</ul>
<p>On breaking the ice and how anything goes:</p>
<ul>
<li>It is fine to hold the tension and make them feel a bit uncomfortable.</li>
<li>Switch up the environment by going to a coffee shop, for example. And ask questions that more closely resemble a conversation with a friend and that they couldn’t have possibly prepared for like: “what do you think of the service here?”</li>
<li>Also very out there questions like: “Why are person-to-person interactions often more informative than Zoom calls?” Or “Why do so many successful people drink diet Coke?”</li>
</ul>
<p>On the topic of references: they are very important, but referees often don’t have much time and don’t want to say bad things. Try to get things into conversation mode and ask for objective comparisons:</p>
<ul>
<li>Is this person so good that you would happily work for them?</li>
<li>Can this person get you where you need to be way faster than any reasonable person could?</li>
<li>When this person disagrees with you, do you think it will be as likely you are wrong as they are wrong?</li>
</ul>
<p>This first question, reformulated as “would you be ok with this person being your boss?”, is I think also a great question to ask yourself about any hiring decision.</p>
<hr />
<div align="center">
<img width="700" src="https://cdn.slidemodel.com/wp-content/uploads/big-five-personalities-traits-model-diagram.png" />
<br />
<em>Which colored circle describes you?</em>
</div>
<p>The meat of the book talks about IQ and the Big Five Personality types but covers these in a cursory and not particularly precise way. For example, they give very loose definitions of the Big Five traits: under extraversion they list “friendliness”, but should this not fall under agreeableness? For neuroticism they define it as:</p>
<blockquote>
<p>A general tendency to experience negative emotions and negative affect, including fear, sadness, embarrassment, anger, guilt, and disgust.</p>
</blockquote>
<p>But how does this relate to depression, or to simply being less emotionally stable? They go on to talk about neurotics as if they are just people who often complain and lead social movements, giving John Calvin and Gandhi as examples of “pests” or “prickly individuals.”</p>
<p>They define conscientiousness as:</p>
<blockquote>
<p>High conscientious individuals have high self-control, are very responsible, have a strong sense of duty, and usually are good at planning and organizing, due to their reliability.</p>
</blockquote>
<p>But this mixes both morals and being a hard worker.</p>
<p>Having never really defined the Big Five traits, things get particularly confusing when they summarize studies showing conscientiousness is not actually what you might otherwise think it is. This includes interesting results that South Koreans work long hours but are low in conscientiousness, and that conscientiousness is uncorrelated with COVID mask-wearing obedience in Italy. But we were never really told what we were supposed to think conscientiousness is to begin with!</p>
<p>There is a summary of psychology/psychometrics on how the Big Five impacts job performance and earnings. While they do a good job hedging by talking about the replication crisis and needing to take any of these results with a grain of salt, they simultaneously cherry pick a subset of studies that readers will likely over update on.</p>
<p>For example, a study found 20% of the variance in achievement for scientists was due to personality after adjusting for scientific potential and intelligence, but my guess is that this varies dramatically across subfields, depending in large part upon how collaborative the field is (think a large biology lab versus a mathematician with his chalkboard).</p>
<p>In the section on IQ they paint a picture where it is somewhat useful but not all that important. While I largely agree, I again wish they had been more precise and covered more ground. For example, they fail to acknowledge <em>g</em>, instead making sweeping statements like “Intelligence is context dependent.” They also fail to talk about the Flynn effect, don’t acknowledge that IQ tests are illegal for hiring in the US, and at times seem to blur IQ with creativity and the quality of one’s ideas.</p>
<p>A key citation they use looked at the IQ of CEOs in Sweden. This study found that “the small company CEO is above 66% of the Swedish population in cognitive ability, and the median large-company CEO is above 83% of the Swedish population in cognitive ability.” Yes, this may be good evidence that there is more to success than IQ, but more data on how much IQ matters across a broad swathe of outcomes would have been useful. For example, I have seen this table before in other places on the inter-webs:</p>
<div align="center">
<br />
<img width="700" src="https://www.researchgate.net/profile/Tarmo-Strenze/publication/328416329/figure/tbl1/AS:684252119175168@1540149826486/Correlations-between-intelligence-and-success-results-from-meta-analyses.png" />
<br />
<a href="https://www.researchgate.net/profile/Tarmo-Strenze/publication/328416329/figure/tbl1/AS:684252119175168@1540149826486/Correlations-between-intelligence-and-success-results-from-meta-analyses.png">Source</a>
<br />
</div>
<p>They also seem to get things wrong when they say there is “evidence that autistics have strong performance on Ravens IQ tests.” This is not true on average, both from my having read <a href="https://www.amazon.com/Neurotribes-Legacy-Autism-Future-Neurodiversity/dp/0399185615">NeuroTribes</a> and from a cursory Google search of top-cited papers, for example <a href="https://pubmed.ncbi.nlm.nih.gov/21272389/">here</a>. I also wish they had clarified other results I have seen around polygenic scoring and genetic predictors for personality type (see, for example, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4739500/">Top 10 Replicated Findings from Behavioral Genetics</a>), and that they had brought up other interesting phenomena such as <a href="https://astralcodexten.substack.com/p/birth-order-effects-nature-vs-nurture">birth order effects</a>.</p>
<hr />
<p>At times the book seems to lose track of its audience, jumping without any clear delineation between being a self-promoting biography of Tyler and Daniel, a management book, and a self-help book. The personal anecdotes are often lengthy and the third-person writing can be a bit much at times. The book also deviates into the occasional grandiose motivational self-help sales pitch, with phrases like:</p>
<blockquote>
<p>Do you wish to be part of such trends for mobilizing the talents of strongly unique people, or are you going to let others eclipse you in the search for talent?</p>
</blockquote>
<p>And management advice such as when discussing the move to remote work saying:</p>
<blockquote>
<p>Those methods will reward non-paranoid leaders who are okay with giving up some sense of control in the moment, and you will need to adjust your style more in that direction, if you have not already.</p>
</blockquote>
<p>Even on the back cover it says:</p>
<blockquote>
<p>Identifying underrated, brilliant individuals is one of the simplest ways to give yourself an organizational edge.</p>
</blockquote>
<p>But if it is so <em>simple</em> then why is an entire book trying to discuss the nuances and how it all comes down to context?</p>
<p>They also move between focusing on how to hire real outliers and how to hire for standard positions. For example, early in the interview section they say they will discuss unstructured interviews, typically used for more senior positions, instead of structured interviews where the same question is asked of every candidate. But later, when discussing the Big Five personality traits, they turn their attention towards hiring mediocre candidates for entry-level jobs.</p>
<p>Maybe the problem is me getting my hopes up too much for this book both in terms of what I expect from Tyler Cowen and also how hard it is to make any meaningful statements about something as context dependent and difficult as finding talent? And while the book tries to be prescriptive, they correctly acknowledge that at the end of the day it is all context.</p>
<hr />
<p>I’m now going to transition into notes from the book that I found interesting:</p>
<p><em>Thiel anecdotes:</em></p>
<p>On the talent spotting abilities of Thiel:</p>
<blockquote>
<p>Peter Thiel found and helped to mobilize the talents of Elon Musk, Reid Hoffman, Max Levchin, Mark Zuckerberg, and others, including Steve Chen, … . His approach is not well described by any kind of mechanical formula, and Peter’s own background is in the humanities – philosophy and law – rather than science or tech. Many of his current interests concern religion, as he studied the Bible under French anthropologist and philosopher Rene Girard, who was a professor of Peter’s at Stanford. We understand Peter as applying a very serious philosophical and indeed even moral test to people. … In our view, Peter actually asks whether you deserve to succeed, as he understands that concept, and he derives additional information from that interior and indeed deeply emotional line of inquiry.</p>
</blockquote>
<p>How Thiel is compelling:</p>
<blockquote>
<p>The first time each of us met Peter Thiel, for instance, we noticed how engrossed he was in his explanations and, furthermore, how quickly and effectively he pulled people into his worldview, introducing and applying concepts such as “technological stagnation,” “the inability to imagine a future very different from the present,” “Georgist economics,” and “the Girardian sacrificial victim.” Maybe you don’t know what all of those concepts refer to and maybe Peter’s audience doesn’t either, but that is not the point. There is a logic to his argument, and Peter communicates that logic with the utmost conviction; the audience correctly senses a coherent underlying worldview.</p>
</blockquote>
<p>Another Thiel anecdote I love that is not in the book but is just absurd: his Roth IRA is worth $5B (you can only deposit into these when your income is low, and only $6K per year; it is also not taxable!).</p>
<hr />
<p><em>Altman anecdotes:</em></p>
<p>Focus not on where someone currently is but their rate of growth.</p>
<p>I found it amusing that Altman on talent says:</p>
<blockquote>
<p>I look for founders who are scrappy and formidable at the same time (a rarer combination than it sounds); mission-oriented, obsessed with their companies, relentless, and determined; extremely smart (necessary but certainly not sufficient); decisive, fast-moving, and willful; courageous, high-conviction, and willing to be misunderstood; strong communicators and infectious evangelists; and capable of becoming tough and ambitious.</p>
</blockquote>
<p>Should we call in a superhero? Isn’t he just listing every possible desirable characteristic? Isn’t the value of VC in finding “alpha” by investing in people who don’t fit all of these straightforward criteria?</p>
<p>On the importance of speed and being proactive:</p>
<blockquote>
<p>Years ago I wrote a little program to look at this, like how quickly our best founders – the founders that run billion-dollar plus companies – answer my emails versus our bad founders. I don’t remember the exact data, but it was mind-blowingly different. It was a difference of minutes versus days on average response times.</p>
</blockquote>
<p>One point they did not state, but that stuck with me from Altman’s podcast with Tyler, was that the most successful founders all came from stable, middle-class families. This may have something to do with their risk tolerance?</p>
<hr />
<p>The later sections covered Disabilities, Minorities, and Gender. I thought the sections on minorities and hiring for diversity are important and broadly well stated. However, I have already highlighted my issue with the claim about autism being positively correlated with Raven’s Matrices. Here are some of the more striking points from the section on gender:</p>
<ul>
<li>A dataset of VC pitches found that VCs were more critical of all-women teams. If the team was mixed, the VCs only paid attention to the men.</li>
<li>Deep voices are considered more authoritative and women’s voices are getting deeper over time. After WWII they were one octave higher than men’s; now they are only 2/3 of an octave higher. Thatcher underwent voice training to lower her voice during speeches. Elizabeth Holmes did the same.</li>
<li>Women score higher in agreeableness, openness, extraversion, and neuroticism.</li>
<li>At YC (Y Combinator) there is always at least one woman on the three-person interviewing panel. This was traditionally the role of Jessica Livingston, regarded as having x-ray vision for a person’s personality.</li>
<li>Women’s personalities are judged more, and they also have more imposter syndrome: when answering the question “I performed well on the test” on a scale of 1 to 100, women gave 46 on average versus 61 for men.</li>
</ul>
<hr />
<p>Miscellaneous:</p>
<ul>
<li>~10% of the world’s population is dyslexic. This seems very high and I guess highlights how unnatural and recently evolved reading ability is?</li>
<li>Musk personally interviewed the first 3,000 employees at SpaceX.</li>
<li>Of the growth in US output since 1960, at least 20-40 percent has stemmed from the better allocation of talent. The authors argue the bar was low, though, because of sexism and racism. I think we are still misallocating talent by not giving citizenship away, something the UK has recently begun correcting by letting graduates of the top 50 universities easily move there.</li>
<li>There is only a 28% correlation between the interviewer’s and an individual’s own assessments of that individual’s personality traits, in particular for conscientiousness and emotional stability, which are two of the most important for the job!</li>
<li>Bringing talent to you: Tyler Cowen via Marginal Revolution, Thiel through his writings (pointed out in the Marc Andreessen podcast with Tyler), and likely Paul Graham with his essays; they attract talent to them rather than having to go out and find it on their own.</li>
</ul>
<hr />
<p><em>Thanks to <a href="https://twitter.com/davisbrownr">Davis Brown</a> for reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>
<hr />Talent is a conversation starter on the under discussed topic of how to identify talent but not much more.A Solution to the Repugnant Conclusion?2022-07-25T12:15:00+00:002022-07-25T12:15:00+00:00http://trentonbricken.github.io/Solution-To-Repugnant-Conclusion<p><em>Having a non-linear uptick in utility avoids the Repugnant Conclusion.</em></p>
<hr />
<p><em>Disclaimer: I am not a philosopher, and I very much welcome comments and debate on this piece. I am trying not to let perfect be the enemy of good and to share this piece somewhat unfinished rather than continuing to sit on it.</em></p>
<p>Derek Parfit in <em>Reasons and Persons</em> introduces the <a href="https://en.wikipedia.org/wiki/Mere_addition_paradox">“Repugnant Conclusion”</a>, which is an unsettling answer to the question: “should we have more people or happier people?” Parfit persuasively argues from a few simple axioms that the answer is always quantity over quality and that we should have as many people alive as possible, such that everyone is living right on the threshold of life not being worth living. In other words:</p>
<blockquote>
<p>“For any perfectly equal population with very high positive welfare, there is a population with very low positive welfare which is better, other things being equal.” - Derek Parfit</p>
</blockquote>
<p>Many, myself included, find the idea of this subsistence-level living repugnant, hence the name of Parfit’s conclusion. However, I think there is actually a simple solution to the Repugnant Conclusion, which I will outline after better formalizing the problem.</p>
<p>Parfit considers the utility function of individuals as either being linear or non-linear with diminishing returns:</p>
<div align="center">
<img width="700" src="../images/RepugConclusion/First.png" />
<br />
<em>Linear or diminishing returns between resources and life quality gains.</em>
</div>
<p>The diminishing returns case is the most realistic, and here any increase in resources on the x-axis gives equal or smaller gains in quality of life. We can use this utility function to consider the utility of a whole population, deciding the amount of resources each person gets and summing together their quality of life:</p>
<div align="center">
<img width="700" src="../images/RepugConclusion/Second.png" />
<br />
</div>
<div align="center">
<img width="700" src="../images/RepugConclusion/Third.png" />
<br />
</div>
<p>Comparing the areas of the rectangles on the right:</p>
<div align="center">
<img width="200" src="../images/RepugConclusion/Fourth.png" />
<br />
</div>
<p>We want the one with the largest area and due to our utility function having diminishing returns, the way to maximize area is by having everyone live just above subsistence:</p>
<div align="center">
<img width="700" src="../images/RepugConclusion/Fifth.png" />
<br />
</div>
<p>This always results in the repugnant conclusion where we choose quantity over quality for the sake of maximizing total utility.</p>
<p>However, I think there is a solution that isn’t repugnant and in fact leverages the very nature of us seeing this conclusion as repugnant – there is a point where a small increase in resources leads to an even larger increase in life quality. In other words, when life goes from glass half empty to glass half full; when you are sufficiently far above subsistence that life gets a lot more enjoyable. Exactly where this point occurs, and how large this non-linear increase in quality of life is, remains up for debate. Yet as long as the utility function is monotonically increasing and contains some non-linearity of this form – a region with a positive second derivative – the Repugnant Conclusion is dissolved.</p>
<div align="center">
<img width="400" src="../images/RepugConclusion/Sixth.png" />
<br />
<em> The non-repugnant utility function of life? </em>
</div>
<p>This is because at this non-linearity, for a decrease in resources, there is an even greater decrease in life quality. This means that in order to maximize the utility of a population, nobody’s resource allocation and quality of life should drop below this non-linearity.</p>
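To make this concrete, here is a toy numeric sketch. This is my own construction, not from the original argument: the budget, exponent, threshold, and bonus values below are all arbitrary assumptions, chosen only to illustrate the shape of the claim. It splits a fixed resource budget equally among N people and asks which N maximizes total utility, with and without the convex uptick:

```python
R = 100.0  # total resource budget, split equally among N people (arbitrary)

def u_concave(r):
    # Diminishing returns everywhere: total utility is 10 * sqrt(N),
    # which keeps growing with N, so adding people always wins.
    return r ** 0.5

def u_with_uptick(r):
    # Same curve plus a convex uptick: a flat bonus once per-person
    # resources clear a "life gets a lot more enjoyable" threshold.
    # The threshold (2.0) and bonus (6.0) are arbitrary assumptions.
    return r ** 0.5 + (6.0 if r >= 2.0 else 0.0)

def best_population(u, max_n=1000):
    # Population size maximizing total utility N * u(R / N).
    totals = {n: n * u(R / n) for n in range(1, max_n + 1)}
    return max(totals, key=totals.get)

print(best_population(u_concave))      # the largest N considered: repugnant
print(best_population(u_with_uptick))  # keeps everyone at or above the uptick
```

With pure diminishing returns the optimum is always the biggest population allowed; with the uptick, the optimum keeps every person’s share at or above the threshold, which is exactly the behavior argued for above (in this sketch the bonus must also be sized to matter over the range of populations considered, since it is bounded).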
<p>I am curious to know if this non-linearity already exists in the literature and what might be wrong with it. It is such a simple modification and I was frustrated that Parfit never addressed it in his magnum opus. <a href="https://twitter.com/tamaybes">Tamay</a> noted that one known occurrence of this sort of asymmetry in utility exists with <a href="https://en.wikipedia.org/wiki/Prospect_theory">Prospect Theory</a>, where losses hurt more than wins.</p>
<hr />
<h3 id="related-work">Related Work</h3>
<p><a href="https://www.jstor.org/stable/2382033">What Do We Learn from the Repugnant Conclusion?</a> by Tyler Cowen reviews many papers that came out after Reasons and Persons introduced the Repugnant Conclusion. Most related is his section titled “Asymmetric Treatment for Low-Utility Individuals”. However, this section discusses placing bounds on the utility function which is not done here.</p>
<p>In other places there is discussion of having asymptotically declining utilities that tend towards zero and may be non-linear. This violates axiom 4 of the repugnant conclusion:</p>
<blockquote>
<p>Axiom (4) - No value should become infinitely small in importance at the margin. A very large addition to that value, all other things being held equal, should never translate into an asymptotically insignificant contribution to the social welfare function. I call this the non-vanishing value axiom.</p>
</blockquote>
<p>However, the solution proposed here does not rely upon an asymptotic decline of the utility function to a value of 0. Instead, it could intersect the y-axis at a higher value; the only thing that matters is that a non-linearity with a positive second derivative exists somewhere.</p>
<p>Cowen also summarizes work modeling interaction effects, where a decrease in resources per person leads to a reduction in, say, dignity, and it is the effect on dignity itself that causes the decrease in utility, such that we should avoid the Repugnant Conclusion. He provides a number of attacks against these kinds of interaction effects. However, there is fundamentally an interaction effect between resources and life quality, and life quality itself is the sum of many components. Having an interaction between resources and these components therefore seems inevitable? Yet more work needs to be done to suggest where the non-linearity I propose actually comes from.</p>
<hr />
<h3 id="summary">Summary</h3>
<p>For the utility function that describes the average person’s life, as long as there exists a point where a one-unit increase in resources leads to a greater than one-unit increase in life quality, the Repugnant Conclusion is not reached. The source of this non-linearity is unclear but, at risk of being tautological, may exist due to the very fact that the Repugnant Conclusion feels so repugnant.</p>
<hr />
<p><em>Thanks to <a href="https://twitter.com/tamaybes">Tamay Besiroglu</a> and <a href="https://twitter.com/davisbrownr">Davis Brown</a> for reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>Having a non-linear uptick in utility avoids the Repugnant Conclusion.Book Review - Ted Chiang Short Stories2022-07-25T10:32:00+00:002022-07-25T10:32:00+00:00http://trentonbricken.github.io/Ted-Chiang<p><em>Ted Chiang creates emotionally resonant and novel perspectives on deep questions about life and technology.</em></p>
<hr />
<p>I have just finished both of Ted Chiang’s collections of short stories, “Stories of Your Life and Others” and “Exhalation”, which were both excellent. Chiang’s work is steeped in past and present scientific ideas spanning fields including physics, computer science, and biology. For example, Chiang considers worlds where we can turn off the part of the brain that recognizes beauty in faces; where children are created from preformed humans inside sperm; where we can do forms of time travel that don’t violate our current laws of physics; where the Everettian many-worlds interpretation of quantum physics is real and we can communicate across worlds; reflections on the heat death of the universe; and the effects of glasses that record and allow for immediate recall of your past experiences.</p>
<p>I really like how Chiang makes salient slippery topics like the progression of technology, Chesterton’s fence, free will, morality, and the meaning of life. He provides novel angles through which to view these topics and handles the ideas subtly. The stories leave many more questions than answers but are stimulating and beautiful.</p>
<p>—</p>
<p>Questions the stories prompted that I will keep thinking about:</p>
<p><em>When is technology beneficial?</em></p>
<p>Technology gives us more power and optionality, allowing us to do things that were never possible before. This forces us to reconsider what about our status quo that evolution gave us is desirable to keep and what is not. Paul Graham notes a growing divergence in <a href="http://www.paulgraham.com/addiction.html">The Acceleration of Addictiveness</a> between what is “normal” in the sense that cavemen also did it and “normal” in the sense that the majority of people do it now.</p>
<p>Evolution is simultaneously a <a href="https://www.lesswrong.com/posts/pLRogvJLPPg6Mrvg4/an-alien-god">“blind idiot God”</a>, responsible for vast amounts of unnecessary suffering, and a gargantuan <a href="https://www.trentonbricken.com/On-Chestertons-Fence/">Chesterton’s fence</a>, creating 12-stage <a href="https://www.trentonbricken.com/On-Chestertons-Fence/">cassava processing techniques</a> to remove cyanide. How can we fix evolution’s shortcomings while not poisoning ourselves with cyanide? Figuring out what is best for us is made all the more tricky because our very desires are programmed by evolution. Moreover, how can we use technology to restore the very things that we lost because of other technologies, and where does this end? For example, in the story “Liking What You See: A Documentary”, technology, including better cosmetic surgery, leads to superstimuli that hijack our natural bias to treat more beautiful people better. A counter response is a non-invasive brain modification that makes one “blind” to human beauty. The story provides a back and forth debate for and against this technological arms race.</p>
<p><em>The utility of memory?</em></p>
<p>By default I buy into wanting to remember everything and the importance of objective fact. However, Chiang in “The Truth of Fact, the Truth of Feeling” reveals how even a memory device as simple as writing can affect our social dynamics and outcomes. There is a distinction made between what is factually correct and what it is best to believe in order to make the right decision. This is closely related to <a href="https://www.amazon.com/Elephant-Brain-Hidden-Motives-Everyday/dp/0190495995">Elephant in the Brain</a>, which puts forward the hypothesis that our inner mind hides its intentions from our conscious mind so as to best reach our ends by being optimally deceptive – the best liar is the person who doesn’t even know they are lying! Chiang projects our external memory tools and their consequences into the future where we all wear video cameras and can effortlessly query any previous memory. This extension takes our external memory abilities beyond just the factual (e.g. Googling what is the capital of France?) and into the personal (e.g. What did I say to Alice four weeks ago at that party?).</p>
<p>In this story the protagonist learns that he was in fact misremembering previous interactions and, embarrassed, concludes the recording device can help him become a better person. However, to what extent are we as humans already too self-effacing? And given that we are terrible at holding both good and bad things in mind at the same time (the affect heuristic), is it a bad thing to be constantly overwhelmed by nuance about who the good guy is versus the bad guy? At what point do we hit <a href="https://slatestarcodex.com/2019/06/03/repost-epistemic-learned-helplessness/">epistemic learned helplessness</a>? If this all sounds interesting, <a href="https://www.amazon.com/Symbolic-Species-Co-evolution-Language-Brain/dp/0393317544">Symbolic Species</a> takes this argument about memory and fact even further with the costs and benefits of symbolic thought and language itself.</p>
<p><em>How can we establish the rights and consciousness of digital minds?</em></p>
<p>“The Lifecycle of Software Objects” short story is timely on a number of fronts. Digital beings (digients) are created that run on artificial neural networks and develop analogously to children over time. Their owners grapple with figuring out just how intelligent the digients are (we are currently doing this with our largest AI language models). The digients also desire to have autonomy and be incorporated as independent entities raising tricky legal issues and getting to the core of free will and agency. We will soon face something similar with driverless vehicles. Interestingly, the story assumes that digients are conscious and can suffer by, for example, being tortured. This assumption and the implications of being able to create vast amounts of suffering with mere computer code was troubling but again timely with the debates this month on whether or not Google’s <a href="https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917">LaMBDA language model is conscious</a>.</p>
<p><em>What’s the point of it all?</em></p>
<p>Exhalation poetically captures the ultimate heat death of the universe when all life and existence will inevitably come to a standstill. Even in light of this ultimate extinction, there is a very Buddhist perspective of enjoying the present and existence itself. A number of Chiang’s other stories also touch on these sorts of realizations and existential crises including: Omphalos, Division by Zero, and Tower of Babylon.</p>
<p>If you made it through the above, some of these points may sound trite, especially the last one. This is where I believe Chiang is at his best, weaving together deep ideas with imagery and emotion that resonate and feel more profound than I can hope to do justice to here. Go and read the originals :P</p>
<p>—</p>
<p>My favorite short stories, some of which I have already mentioned, in rough order of enjoyment were:</p>
<ul>
<li>Exhalation - Heat death of the universe and the beauty and purpose that can still be derived from life.</li>
<li>Story of Your Life - inspiration for the movie Arrival. The nature of time and causality. Beautiful depictions of parenting that make me want to have kids.</li>
<li>Liking What You See: A Documentary - on beauty and cognitive biases. This Paul Graham piece is very related: <a href="http://www.paulgraham.com/addiction.html">The Acceleration of Addictiveness</a>.</li>
<li>The Truth of Fact, the Truth of Feeling - “truth” as what is factually correct versus what is right. Forgetting has its benefits. What is a world like where we never forget anything that happened to us?</li>
<li>Hell Is the Absence of God - comic on religion, angels, justifications and morality.</li>
<li>The Lifecycle of Software Objects - creation of digital lifeforms, nature of consciousness, rights of digital minds, nature of intelligence.</li>
<li>Anxiety is the Dizziness of Freedom - Everettian multiple worlds and morality within them. Jealousy and what could have been.</li>
<li>What’s Expected of Us - Free will.</li>
</ul>
<p>Most of these came from the second collection of short stories: “Exhalation”.</p>
<p>As a final note, I love that there are story notes at the end of each book where Chiang shares his inspiration for each of the stories, providing a different and richer perspective on their origins.</p>
<hr />
<p><em>If you have read Ted Chiang’s work then reach out and let me know your thoughts!</em></p>
<hr />Ted Chiang creates emotionally resonant and novel perspectives on deep questions about life and technology.Transformer Memory Requirements2022-07-22T15:32:00+00:002022-07-22T15:32:00+00:00http://trentonbricken.github.io/TransformerMemoryRequirements<p><em>Working out how much memory it takes to train a Transformer GPT2 Model.</em></p>
<hr />
<p>There has been recent discussion on <a href="https://stats.stackexchange.com/questions/563919/formula-to-compute-approximate-memory-requirements-of-transformer-models">StackOverflow</a> and <a href="https://twitter.com/MishaLaskin/status/1546994229674647553?s=20&t=0gkdvE1j_363D3xvTT1d4A">Twitter</a> on the full memory requirements of training a Transformer.</p>
<p>Because I am in the process of training Transformers and scaling to multiple GPUs myself, I became interested in this question. Misha Laskin provides some <a href="https://twitter.com/MishaLaskin/status/1546994229674647553?s=20&t=0gkdvE1j_363D3xvTT1d4A">back of the envelope</a> calculations for why batch size and sequence length dominate over model size; these are interesting but off by approximately 4x for the model parameters and 2x for activations.</p>
<p>I have thrown together a more detailed calculator as a Colab notebook <a href="https://colab.research.google.com/drive/1G0OabelIWifPfYgoVmUFhr3hKODSHe6a?usp=sharing">here</a> and outline my reasoning below. I’ve tested this on the “small” 124M-parameter and “medium” 345M-parameter GPT2 models and get close to the real values.</p>
<p>Here is the GPT2 model architecture (image taken from my <a href="https://arxiv.org/abs/2111.05498">paper</a>):</p>
<div align="center">
<img width="700" src="../images/TransformerCalc/TransformerModel.png" />
</div>
<p>See the <a href="https://jalammar.github.io/illustrated-gpt2/">Illustrated GPT2</a> for a full explanation of how GPT2 works.</p>
<p>Here are the equations and notation in full:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L = 12 # number of blocks
N = 12 # number of heads
E = 768 # embedding dimension
B = 8 # batch size
T = 1024 # sequence length
TOKS = 50257 # number of tokens in the vocab
param_bytes = 4 # float32 uses 4 bytes
bytes_to_gigs = 1_000_000_000 # 1 billion bytes in a gigabyte
model_params = (TOKS*E)+ L*( 4*E**2 + 2*E*4*E + 4*E)
act_params = B*T*(2*TOKS+L*(14*E + N*T ))
backprop_model_params = 3*model_params
backprop_act_params = act_params
total_params = model_params + act_params + backprop_model_params + backprop_act_params # = 4*model_params + 2*act_params
gigabytes_used = total_params*param_bytes/bytes_to_gigs
</code></pre></div></div>
<p>For the “small” GPT2 model with 124M parameters (that uses the above values for each parameter) we get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model_params = 123,568,896
act_params = 3,088,334,848
gigabytes_used = 26.6 Gb
</code></pre></div></div>
<p>Running the Hugging Face GPT2 we actually get 27.5Gb.</p>
<p>If our batch size is 1 then we undershoot again: memory is predicted to be 5.1Gb but in reality it is 6.1Gb.</p>
<p>For the medium-sized 345M-parameter model and a batch size of 1, our equation predicts that it will use 12.5Gb while empirically it is 13.4Gb. The 1Gb gap remains. I learned that this 1Gb gap comes from loading the GPU kernels into memory! See <a href="https://huggingface.co/docs/transformers/perf_train_gpu_one">here</a>.</p>
<p>The model parameter equation comes from:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(TOKS*E) [embedding layer ]+ L [number of blocks]*( 4*E**2 [Q,K,V matrices and the linear projection after Attention] + 2*E*4*E [the MLP layer that projects up to 4*E hidden neurons and then back down again] + 4*E [Two layer norms and their scale and bias terms])
</code></pre></div></div>
<p>Where we ignore the bias terms and positional embedding.</p>
<p>The activation parameter equation comes from:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B[batch]*T[seq. length]*(2*TOKS [one hot vectors at input and output]+L[number of blocks]*(3E [K,Q,V projections] + N*T [Attention Heads softmax weightings] + E [value vector] + E [linear projection] + E [residual connection] + E [LayerNorm] +4E [MLP activation]+E [MLP projection down]+E[residual]+E[LayerNorm] ))
</code></pre></div></div>
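<p>As a quick check, the bracketed per-block activation components really do sum to the 14*E + N*T term used in the equation above:</p>

```python
E, N, T = 768, 12, 1024  # embedding dim, heads, sequence length

# Per-token activations stored in each block, labelled as above.
components = {
    "K,Q,V projections": 3 * E,
    "attention head softmax weightings": N * T,
    "value vector": E,
    "linear projection": E,
    "residual connection": E,
    "LayerNorm": E,
    "MLP activation": 4 * E,
    "MLP projection down": E,
    "MLP residual": E,
    "MLP LayerNorm": E,
}
per_block = sum(components.values())
assert per_block == 14 * E + N * T
print(per_block)  # -> 23040 for GPT-2 small
```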
<p>When I turn on float16, the memory for a batch of 1 only drops from 6.1Gb to 5.8Gb. Meanwhile, for a batch of 8, it goes from 27.5Gb to 21.8Gb. Why are the memory savings not larger? Is this because it is mixed precision and the model decides that it needs high precision for many of its components?</p>
<hr />
<p><em>Thanks to <a href="https://twitter.com/milesaturpin">Miles Turpin</a> and <a href="https://twitter.com/MishaLaskin">Misha Laskin</a> for motivating this piece. All remaining errors are mine and mine alone.</em></p>
<hr />Working out how much memory it takes to train a Transformer GPT2 Model.Book Review - How to Build a Brain2022-04-26T14:25:00+00:002022-04-26T14:25:00+00:00http://trentonbricken.github.io/How-To-Build-A-Brain<p><em>How to Build a Brain over promised and under delivered but I appreciate its ambitious goal to “build a brain” and the efforts made towards it.</em></p>
<hr />
<p>Author Chris Eliasmith uses ideas from Vector Symbolic Architectures (VSAs) and implements them with his Neural Engineering Framework (NEF) to make his “brain” <em>somewhat</em> biologically plausible. He uses this combination to solve a number of different tasks and replicate some core psychology and neuroscience experimental results.</p>
<p>The book provides useful context about other attempts in computational neuroscience to build a brain and highlights the promise of VSAs to bridge many of the apparent dichotomies between previous approaches. In focusing at the scale of building a brain, Eliasmith also drags the microscopic and myopic back to the terminal goal of building minds and developing solutions that work at scale.</p>
<p>Chris Eliasmith summarizes years of his research with the culmination being “<a href="https://www.science.org/doi/10.1126/science.1225266">Spaun</a>”, a model of the brain with many interconnected components that can perform diverse tasks including classifying handwritten digits and solving Raven’s Progressive Matrices (one of the core tests of human intelligence). This model, while certainly performing some impressive tasks, is less impressive than it first sounds, with many of the tricky details being hard coded. Its success is thanks to an impressive engineering effort, both in connecting the right pipes in the right ways and in using Eliasmith’s Neural <em>Engineering</em> Framework (NEF) to implement basic VSA and associative memory theories that had already been developed.</p>
<p>A big reason I was let down by the book is that I was aware of VSAs and excited to see them applied. I had hoped that some open questions I had around them would be addressed such as how a brain should decide to organize its cleanup memory, how it should represent symbols and vectors, and what variables should be bound to others. However, many of these issues were ignored just to flex the power of NEF on a number of toy problems with the really difficult parts simply hardcoded.</p>
<p>Additionally, I don’t think Eliasmith gave previous work in these domains of VSAs and associative memory enough credit, nor did he leverage it to its full potential. For example, he presents a way to store memories through “chaining” without any citation of Plate’s PhD thesis, which presents and outlines this idea in detail under the name “chunking”.</p>
<p>In addition, for clean-up memory, he simply takes a dot product between the query and every pattern that has ever been stored. It is biologically implausible to imagine that every pattern is independently stored in the brain in this way. Far more powerful and biologically plausible systems such as SDM and even Hopfield Networks exist and should have been used instead.</p>
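<p>To make the criticism concrete, the clean-up scheme being described is just a nearest-neighbour lookup: a dot product between the query and every stored pattern. A minimal sketch (my own illustration, not Eliasmith’s NEF implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024, 50  # vector dimension, number of stored patterns
# Random near-unit-norm patterns, components ~ N(0, 1/D).
patterns = rng.standard_normal((K, D)) / np.sqrt(D)

def cleanup(query):
    """Return the stored pattern most similar to the (noisy) query.

    One dot product per stored pattern: O(K*D) work, and every
    pattern must be kept around explicitly -- the implausibility
    being criticized above.
    """
    sims = patterns @ query
    return patterns[np.argmax(sims)]

# A noisy version of pattern 7 is cleaned up back to the exact original.
noisy = patterns[7] + 0.5 * rng.standard_normal(D) / np.sqrt(D)
assert np.allclose(cleanup(noisy), patterns[7])
```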
<p>Maybe I am overdoing it with my criticism here, maybe these fancier memory models could have been implemented but were not the main focus of this work. For any project, there is always finite scope. However, one thing I appreciated about the book was its prioritization of biological plausibility and this emphasis in some domains and neglect in others is frustrating. This is particularly because the biggest source of biological plausibility is the spiking neural networks implemented via NEF…</p>
<p>According to Greek mythology, anything Midas touched turned to gold. Here, anything that NEF touches turns to “biologically plausible”. Beyond being used ineffectively to approximate functions that we know have more biologically plausible alternatives, the extent of NEF’s representational power makes me skeptical of its own underlying plausibility, bringing the whole edifice it is built upon down with it.</p>
<p>The largest issues with the NEF are: (i) how it computes a loss function and (ii) how it propagates the error signal from this loss function to the specific subset of neurons used. The NEF takes a population of spiking neurons with random tuning curves and learns to weight them via a linear decoder such that they can approximate any arbitrary function. It is never explained<sup id="fnref:NeedToReadNEFBook" role="doc-noteref"><a href="#fn:NeedToReadNEFBook" class="footnote" rel="footnote">1</a></sup> how the arbitrary loss functions are both stored and calculated by a single decoder neuron. The error signal from this loss function is said to come from a local or external signal, with dopamine mentioned as possibly being this training signal, but how it propagates to and targets the specific subset of activated neurons is highly unclear.</p>
<p>Again, the overall focus and ambition is refreshing but there remains much work to be done and I believe that a number of the approaches outlined in this book will need to be re-questioned and over-written in the future.</p>
<p>I appreciate the focus on bio-plausibility and sometimes it is very compelling but other times it seems much less justified and this brings the compelling sections into doubt.</p>
<hr />
<p><strong>Pros of the book include:</strong></p>
<ul>
<li>Spike time dependent plasticity.</li>
<li>Use of population codes.</li>
<li>Reference interesting neuroscience and psychology data/benchmarks that any model should try to satisfy.</li>
<li>Reference to different brain regions that may be implementing each component.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>While reference to brain regions was a nice gesture, it felt very over simplified along the lines of “this component in my model should remember the start of the sequence and so we are going to call it the hippocampus.”</li>
<li>Does not implement or acknowledge sparse computations in the brain. There are cases where 100 neurons are needed to represent a single scalar. Surely this is metabolically inefficient/implausible?</li>
<li>NEF assumes there are lots of off neurons that fire from negative current. How common actually are these?</li>
</ul>
<p><strong>Other issues with the book:</strong></p>
<ul>
<li>Logic based pre-encoding of rules to solve problems.</li>
<li>Representations not being learnt, instead using random vectors - this is again hard coded and ignores the potential for symbolic relationships between concepts.</li>
<li>Plate’s HRR biologically plausible convolution operations taking advantage of random connectivity are a real missed opportunity.</li>
</ul>
<hr />
<p><strong>Concluding Remarks</strong></p>
<p>Given my tone and critiques throughout this review, it may come as a surprise that I am really glad this book exists and that Chris Eliasmith has done the research that he has. It is a great introduction to the field of computational neuroscience and shows the exciting potential of VSAs. However, it is because this route is so exciting that I have high expectations and want everything to be done in the most biologically plausible and sophisticated way possible, with the right attributions to the right original researchers.</p>
<p>If you have read the book or have opinions please comment as I am curious to get outside perspectives. Hat tip to Adam Marblestone for bringing my attention to the book through <a href="http://web.mit.edu/amarbles/www/talks.html">this</a> wonderful list of recommendations on his website.</p>
<hr />
<p><em>Thanks to <a href="https://twitter.com/joechoochoy">Joe Choo-Choy</a> for influential discussions and reading drafts of this piece. All remaining errors are mine and mine alone.</em></p>
<hr />
<hr />
<hr />
<p>I will now transition from providing commentary to summarizing each chapter of the book by sharing the notes that I took for each. I’m not sure how useful this is to readers but maybe treat it as a “Table of Contents” for the book?</p>
<h3 id="chapter-1---science-of-cognition">Chapter 1 - Science of Cognition</h3>
<p>I found this chapter to provide a useful high level background to computational and cognitive neuroscience, at least as of 2013 when this book was published.</p>
<p>There are four main approaches for modelling cognition:</p>
<ul>
<li>Logic/Symbolic, Good Old Fashioned AI (GOFAI) - our thinking is logical and needs to be highly flexible, as enabled by symbols.</li>
<li>Connectionist / Parallel Distributed Processing (PDP) - the brain is highly parallel and distributed.</li>
<li>Dynamicist - we need to account for time and the environment.</li>
<li>Bayesian - induction and probabilistic models of the world. We need to account for uncertainty.</li>
</ul>
<p>The two main approaches for trying to explain data can be considered top down and bottom up:</p>
<ul>
<li>Production systems like ACT-R can capture high level results but have constraining time constants that are hand tuned and biologically implausible.</li>
<li>Lower level dynamics models capture the low level phenomena but not high level cognition.</li>
</ul>
<p>In response to these buckets, Eliasmith presents VSAs that can be logical and have nested, semantically meaningful relationships while also being dynamicist in their focus on action and implemented in a way that models both temporality, is distributed and probabilistic (Bayesian).</p>
<p>Eliasmith poses an interesting question apparently asked during a funding agency meeting: <em>“What have we actually accomplished in the last 50 years of cognitive systems research?”</em> Eliasmith answers saying that we have a better idea of the landscape of cognition and criteria that any intelligent system must fulfil.</p>
<p><strong>Core Cognitive Criteria (CCC) for theories of cognition – how to evaluate any model:</strong></p>
<ul>
<li>Representational Structure:
<ul>
<li>Systematicity</li>
<li>Compositionality</li>
<li>Productivity</li>
<li>Massive Binding Problem - too many variables and features</li>
</ul>
</li>
<li>Performance Concerns:
<ul>
<li>Syntactic generalization</li>
<li>Robustness</li>
</ul>
</li>
<li>Adaptability:
<ul>
<li>Memory</li>
<li>Scalability</li>
</ul>
</li>
<li>Scientific Merit:
<ul>
<li>Triangulation</li>
<li>Compactness – Occam’s razor</li>
</ul>
</li>
</ul>
<p>I also tried to answer this question myself and would be very interested in comments with answers from you, reader. Try it now! It’s a fun exercise.</p>
<p><em>“What have we actually accomplished in the last 50 years of cognitive systems research?”</em></p>
<ul>
<li>Parallel Distributed Processing models -> Deep Learning.
<ul>
<li>Incredible performance improvements including GPT3 and DALLE
<ul>
<li>these empirical results suggest we might be onto something important here</li>
</ul>
</li>
<li>new probabilistic models that may be crucial to cognition eg. Transformer Attention and Variational AutoEncoders</li>
</ul>
</li>
<li>Reinforcement learning
<ul>
<li>TD learning - seems to still explain dopamine reward prediction error</li>
<li>further empirical successes like winning in Go, StarCraft, Poker, driverless cars, and even with general purpose algorithms like MuZero</li>
</ul>
</li>
<li>Bayesian Brain/Predictive Coding/Free Energy Principle - compelling set of theories for the brain forming probabilistic models, making top down predictions (see <a href="https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/">this review</a> for a nice overview of many of these ideas)</li>
<li>Vector Symbolic Architectures - Sparse Distributed Memory and Plate’s HRR in particular as connectionist ways to represent symbols that may also be a biologically plausible solution</li>
<li>Novel tools:
<ul>
<li>fMRI (whatever utility this provides, see After Phrenology for some critiques)</li>
<li>connectomics, including of C. elegans and exciting new results in Drosophila</li>
<li>in situ RNA sequencing</li>
<li>expansion microscopy</li>
<li>many others I am unaware of…</li>
</ul>
</li>
<li>Novel data:
<ul>
<li>the amount of data we have on conditioned learning is incredible e.g. <a href="http://learnmem.cshlp.org/content/10/6/427">here</a></li>
</ul>
</li>
</ul>
<p>Obviously we have a long way to go but a great deal has been accomplished.</p>
<p>Eliasmith states what key questions are answered in the book, however I believe that this is overpromising and I append why to each question.</p>
<p><em>How are semantics captured in the system?</em></p>
<ul>
<li>Summarizes ideas from VSAs but makes no attempt to address how organisms encode hierarchical, semantic information either as sensory inputs or learned transformations. Symbols are assigned as random vectors that only have meaning to us human observers (who have already done the hard part of learning these symbols!). <a href="https://www.amazon.com/Symbolic-Species-Co-evolution-Language-Brain-ebook/dp/B005Q65DLY">Symbolic Species</a> has really influenced my thinking on this.</li>
</ul>
<p><em>How is syntactic structure encoded and manipulated by the system?</em></p>
<ul>
<li>Again, just presents ideas from VSAs on how to do variable binding, does nothing to address how a system might determine <em>what</em> should be stored and how (what should be chunked in meaningful ways).</li>
</ul>
<p><em>How is the flow of information flexibly controlled in response to task demands?</em></p>
<ul>
<li>Uses a simplified model of the basal ganglia for action selection without much consideration for how it receives error signals.</li>
</ul>
<p><em>How are memory and learning employed in the system?</em></p>
<ul>
<li>Rules for how to store memories and manipulate them for reasoning tasks are hardcoded, for example, solutions to Raven’s matrices and the Tower of Hanoi. This shows the usefulness of VSAs but not how representations and solutions can be learnt.</li>
</ul>
<hr />
<h3 id="ch-2---introduction-to-brain-building">Ch. 2 - Introduction to Brain Building</h3>
<p>This chapter introduces the NEF, which I will skip as I have already written about it above. It also provides a general neuroscience background that gave me some new fun facts!</p>
<ul>
<li>The brain uses 20 watts of power - the same as a lightbulb!</li>
<li>The brain takes up 2% of body weight but uses 25% of energy.</li>
<li>Visual cortex regions are all on the surface (cortex) not nested deeper in the brain as I had originally assumed. This is a case where there are in fact connections going between different cortical columns.</li>
<li>Neurons in the cortex have on average 10K inputs and outputs. This can range from 500 (retina) to 200,000 (Purkinje cells).</li>
<li>There are 72 kilometers (~45 miles) of fiber in the brain. This is equivalent to the height of 9 Mt. Everests…</li>
<li>There are hundreds of different neurotransmitters and neuronal types.</li>
<li>As an example of evolution being stuck in a local optimum, giraffes have the <a href="https://timpanogos.blog/2011/10/08/evidence-of-evolution-giraffes-laryngeal-nerve/">laryngeal nerve</a>:
<blockquote>
<p>The laryngeal nerve of the giraffe, linking larynx to brain, a few inches away — but because of evolutionary developments, instead dropping from the brain all the way down the neck to the heart, and then back up to the larynx. In giraffes the nerve can be as much as 15 feet long, to make a connection a few inches away.</p>
</blockquote>
</li>
</ul>
<p><img src="../images/HowToBuildABrain/LaryngealNerve.png" alt="" /></p>
<ul>
<li>Another example of a local optimum is the human retina, where all of our rods and cones are <a href="https://theconversation.com/look-your-eyes-are-wired-backwards-heres-why-38319">flipped the wrong way</a>. This is why we have a blind spot: it is where all the wires enter through the retina.</li>
<li>Human motor neurons are up to a meter long.</li>
</ul>
<p>The next 4 Chapters outline: Semantics; Syntax; Control; Memory & Learning. These are then all combined into Spaun, the system that is meant to emulate a brain solving a number of tasks.</p>
<hr />
<h3 id="ch-3---biological-cognition-semantics">Ch. 3 - Biological Cognition: Semantics</h3>
<p>This chapter introduces VSAs. For a better introduction I defer to … .</p>
<p>VSAs are capable of implementing Hinton’s reduced representations which are:
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FzorY81AXk8.png?alt=media&token=1ab890f6-41aa-41d1-885d-134d012490c2" alt="" />
(this is taken from Plate’s PhD thesis, he was supervised by Hinton).</p>
<p>A golf ball was used as a description of the high dim space that the symbols, represented as vectors, exist in. The little pockets on the ball’s surface can do clustering of similar representations.</p>
<p>Work is summarized on how a convolutional network can be used to learn compressed representations of MNIST digits. The quality of the compression was assessed by decoding: fixing the top layer and optimizing the input layer to reconstruct the digit. It is suggested that the VSA then operates on these compressed representations using superpositions/convolutions. Using the NEF they also got nice Gabor-like filters such as those we see in V1.</p>
<p>There is also this nice diagram for neural processing and control.
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FrKu63uAkCZ.jpg?alt=media&token=3ada1a7e-c2e9-4091-9fab-3324855e7ee7" alt="" /></p>
<h3 id="ch-4---syntax">Ch. 4 - Syntax</h3>
<ul>
<li>Notes and cites work on how one to one relationships for associative learning are insufficient to solve lots of problems.</li>
<li>Thinks that all transformations need to also be convolutions? Can surely still do other vector transformations that are useful?</li>
<li>Dual coding theory - VSA operations are noisy. To get the perfect original vector you need to put your representation through a clean up operation. But you can also get pretty good results just working with the noisy vector. This sort of tradeoff at a high level fits well with Daniel Kahneman-esque System 1 vs System 2 processing.</li>
<li>Hard to overstate how powerful having everything remain the same dimensionality is.</li>
<li>Capacity for error and noise in VSAs is what makes them cognitively plausible!!</li>
<li>NEF assumes that neurons work in a continuous space and with a rate code.</li>
<li>How VSAs are implemented in the brain is still very much an open problem - where and how.</li>
<li>Seem to use the Fourier transform HRR model here. Why? And how are the complex numbers represented?</li>
<li>pg. 133 back of envelope calculations on the number of neurons needed for a binding operation to be implemented. It is 140K neurons for two 500D vectors. This fits roughly with the number of neurons in a cortical column. But again depends also on dendritic processing capabilities and if NEF is actually being implemented.</li>
<li>Neurons connect with approximately 5% of others in the region they can extend to.</li>
<li>Note that this binding structure can be reused many times.</li>
<li>Later on it is noted that the spiking implementation does a form of soft regularization that is powerful!</li>
<li>Decide on HRR because continuous representation. Note there may be others and should explore and compare more</li>
<li>Able to learn a given transformation in an online setting. This looks very similar to the Perceptron update rule.</li>
<li>One step reading from cleanup memory. Emphasis on this but later talk about chunking without acknowledging the need for sequential decoding.
<ul>
<li>Totally ignores bio plausibility.</li>
<li>Also continuous SDM ~= Attention and Hopfield paper showed you then get convergence in one step so this approach is unnecessary?</li>
<li>Also makes sense that convergence times should change depending on the memory. Eg. Tip of the tongue phenomenon that Kanerva writes about!</li>
<li>Ravens Progressive Matrices - learn the pairwise transformations, average over them. can then use this transform to predict the next.</li>
<li>Reasoning by induction, able to generalize across objects and out of sample</li>
<li>Gets very similar results as humans which is very cool.</li>
<li>Do they get the same ones right and wrong?</li>
<li>This the same kind of reasoning by induction that [[Hyperdimensional Computing]] uses.</li>
<li>But did use random vectors and hard coded the reasoning rules…</li>
</ul>
</li>
<li>Chaining is a copy of Plate’s chunking with no citations for it.</li>
</ul>
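<p>Several of the notes above (binding preserves dimensionality, unbinding is noisy, clean-up recovers the exact vector) can be made concrete with a tiny HRR sketch using circular convolution via the FFT. This is my own illustration of Plate-style HRRs, not the book's NEF implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1024
# Random HRR vectors with components ~ N(0, 1/D), so norms are ~1.
codebook = rng.standard_normal((10, D)) / np.sqrt(D)
a, b = codebook[0], codebook[1]

def bind(x, y):
    # Circular convolution: the bound pair has the same dimensionality
    # as its constituents.
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

def unbind(c, x):
    # Circular correlation: an approximate (noisy) inverse of binding.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x).conj()).real

c = bind(a, b)          # c encodes the pair (a, b), still D-dimensional
b_noisy = unbind(c, a)  # noisy estimate of b

# Clean-up (argmax similarity over the codebook) recovers the exact
# stored vector despite the decoding noise.
sims = codebook @ b_noisy
assert np.argmax(sims) == 1
```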
<h3 id="ch-5-control">Ch. 5: Control</h3>
<ul>
<li>Basal Ganglia is the controller deciding how to route information and doing so through the Thalamus (probably does more things too but unclear what). In particular it takes an argmax of the possible actions and returns only one. Uses a double inhibition mechanism to release one of the possible actions from the thalamus</li>
<li>In this case treat it as an argmax giving a one hot encoding that the thalamus then applies to a matrix of possible vectors so it selects only one of them!</li>
<li>If statement in the form of similarity dot product between different options depending on what the current vector is.</li>
<li>Used to solve the Tower of Hanoi but only because all of the rules are worked out and used to solve it</li>
<li>Does fit lots of bio data in performance and fMRI esp. once added working memory model for what was currently being tracked.</li>
</ul>
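<p>The action selection scheme in the first two bullets - the basal ganglia computing an argmax, the thalamus using the resulting one-hot to release a single action vector - can be sketched as follows (a toy illustration, not the spiking double-inhibition circuit):</p>

```python
import numpy as np

def select_action(utilities, action_vectors):
    """Basal-ganglia-style selection: winner-take-all over utilities,
    then the thalamus releases only the winning row of the matrix of
    possible action vectors."""
    one_hot = np.zeros_like(utilities)
    one_hot[np.argmax(utilities)] = 1.0  # argmax as a one-hot encoding
    return one_hot @ action_vectors      # selects exactly one action

actions = np.eye(3) * [10.0, 20.0, 30.0]  # three candidate action vectors
chosen = select_action(np.array([0.1, 0.9, 0.3]), actions)
assert np.allclose(chosen, actions[1])    # highest utility wins
```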
<p>Some nice diagrams of basal ganglia action selection:
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FZfvF3fUrDU.jpg?alt=media&token=85d0cea2-293f-4db2-8b2d-2c2786c32359" alt="" />
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FSE2i7SYpU0.jpg?alt=media&token=6a765a2f-35f8-44fe-9895-94d679426341" alt="" /></p>
<ul>
<li>Chains of action need the BG but not single actions?</li>
<li>All cortical areas aside from primary visual and auditory project to the basal ganglia.</li>
<li>All BG output goes through the thalamus before going back to the cortex. Also the thalamus gets inputs directly from every sense except smell. Closed loops with all parts of the cortex.</li>
<li>Reticular nucleus of the thalamus forms a kind of shell around it, kind of like a meta thalamus.</li>
<li>Attention:
<ul>
<li>goes through a separate neural circuit that then connects to the PIT at the top of the visual processing pathway and then goes all the way down it back to the LGN.
<ul>
<li>it amplifies some of the signals that are coming up so they fill the whole receptive field to still utilize it
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2Fb6VqrlqOLG.jpg?alt=media&token=bed6ac95-bb6d-4560-af84-7f43592f65c0" alt="" /></li>
</ul>
</li>
<li>how attention is implemented in a cortical column. recall that all visual areas are part of the cortex.
<ul>
<li>TD and BU are top down and bottom up.
<ul>
<li><img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FVsJ0dczF8E.jpg?alt=media&token=a96bebd1-089e-404a-97ec-9491a38e5910" alt="" /></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>are able to replicate results with changes in attention for a moving grating?</li>
<li>can use basal ganglia to move sequentially through the alphabet and load into memory a question vs an answer and divert resources in the right way.</li>
<li>Able to replicate rats learning different utility functions. Here they are putting utilities directly into the BG. In the next chapter, with their learning rule, they can replicate rats learning which bandit levers to pull and their rates of switching.</li>
<li>Able to use this to then solve the Tower of Hanoi. Given the algorithm to solve it. The timing of each operation when given working memory does very closely model that of humans.</li>
<li>BOLD signal comes from dendritic processing driven by neurotransmitter usage, not neural activity (pg. 198). Able to use this to replicate which brain regions are active.</li>
</ul>
<h3 id="ch-6---memory-and-learning">Ch. 6 - Memory and Learning</h3>
<ul>
<li>Basic memory system is a leaky integrator - stores whatever was most recently put in until the next thing arrives.</li>
<li>Using two memory modules one to remember the first things put in (episodic, hippocampus) and last things (working memory, cortex) in order to reproduce the U shaped curve of remembering sequential facts.</li>
<li>Also able to replicate confusion between more similar objects and non sequential data that seems to use the same sequential learning system.</li>
<li>For learning they just have the NEF and plug in a Hebbian-like error signal that can come from a local or external signal; they refer to dopamine as maybe being this training signal, but how it propagates and is targeted precisely enough is highly unclear.
<ul>
<li>Use this to learn a convolution where the circular convolution is the output transformation that it then needs to optimize…</li>
<li>Is very nice that they use spike time dependent plasticity with actual spike trains and account for LTP vs LTD, and they are able to replicate this.</li>
</ul>
</li>
<li><strong>turns out their model is better to the human data than the dynamic equations because of the soft regularization that occurs on the HRR vectors!</strong>
<ul>
<li><strong>This is because neurons will saturate for very large vector values!</strong>
<ul>
<li>This is less optimal but emerges from the model (also don’t need to do explicit normalization) and replicates the human data. Cool.</li>
</ul>
</li>
</ul>
</li>
<li>Neurons’ synaptic weights do not increase indefinitely. Theorize there is some sort of normalization across the weights that happens?
<ul>
<li>SSC depression piece suggested that this happens during sleep and this is why for treatment resistant depression sleep deprivation works for 70% of them. Also feel better in the evenings, worse in the morning. and serotonin levels may affect this hence why it helps.
<ul>
<li>What are the bounds on the strengths of synaptic connections?</li>
</ul>
</li>
</ul>
</li>
<li>Ability to do vector binding only appears at age 4. “many of the more cognitive behaviors that require binding are not evident in people until about the age of 3 or 4 years eg. analogy in language.”</li>
<li>
<p>updated routing diagram that also incorporates in the dopamine signalling:
<img src="https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2FTrentonIdeas%2FbHA40r7TTv.jpg?alt=media&token=6722afcd-6952-469e-9542-272a559b596b" alt="" /></p>
</li>
<li>able to replicate learning the rat lever task.</li>
<li>Used to solve the Wason logical cards task. Need to test that the rule holds and is unique by choosing two different cards; we succeed in the applied example but not the abstract one. Refers to social contract theory, but there seem to be plenty of other theories for why this result is the case. They train their optimizer on these examples and show it can reproduce them, and also that it can generalize to other examples in a social or abstract setting - but this is not surprising given the HRR semantics encoded in these different examples!</li>
<li>Robust to pruning up to 33% but does this also mean their model originally had too many neurons to represent each of these scalars?</li>
<li>Also an example of generalization benefitting from a few examples of Wason cards before then getting diminishing returns.</li>
</ul>
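<p>The leaky integrator memory from the first bullet is simple enough to sketch; with a small decay constant the most recent item dominates, giving the recency half of the U-shaped serial position curve described above (a toy illustration under assumed parameters, not Spaun's actual modules):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
D = 256
items = rng.standard_normal((8, D)) / np.sqrt(D)  # a sequence of 8 item vectors

def leaky_integrate(inputs, decay):
    """m <- decay * m + input: each new input overwrites the old trace,
    so recent items dominate when decay is small."""
    m = np.zeros(D)
    for x in inputs:
        m = decay * m + x
    return m

working = leaky_integrate(items, decay=0.3)  # working-memory-style trace
sims = items @ working
assert np.argmax(sims) == len(items) - 1     # the final item is best remembered
```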
<p>I NEED TO FINISH ADDING NOTES HERE ON THE REMAINING CHAPTERS.</p>
<hr />
<hr />
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:NeedToReadNEFBook" role="doc-endnote">
<p>Disclaimer that I have not read the original Neural Engineering Framework book and would love to be corrected if any of my understanding of it is incorrect. <a href="#fnref:NeedToReadNEFBook" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>How to Build a Brain over promised and under delivered but I appreciate its ambitious goal to “build a brain” and the efforts made towards it.