Life Is Computation - neuroscience, computation, biology, statistics, and philosophy

Are Transformers Turing-complete? A Good Disguise Is All You Need.

Posted on October 11, 2024 (updated October 14, 2024)

Four transformer characters holding up masks, drawn in comic style

I will avoid an introduction and get straight to it. I’ll assume that you know what Turing-complete means, that you are familiar with transformers, and that you care about the question: Are transformers Turing-complete? (I may later write about what that question means and why it is important.)

Several papers claim that transformers are Turing-complete and therefore capable of computing any computable function. But I believe these claims are all misleading. Some make conceptual errors and are simply incorrect in their claims of Turing-completeness for their models. Others have modified the essential properties of transformers to achieve universal computation, presenting models that only resemble transformers superficially. As of October 2024, the deep learning model widely known as a ‘transformer’ (despite its success in commercial large language models) has not been shown to be Turing-complete (barring add-ons and modifications to its essential properties).

Wallace’s Thought Experiment on Understanding How Life Works

Posted on May 9, 2024

I recently finished The World of Life, by Alfred Russel Wallace. Wallace was a naturalist and contemporary of Darwin. He doesn’t get enough credit for the fact that he independently came up with the theory of natural selection, perhaps because he was much more chill about attribution than Darwin was. It was Wallace’s article that prompted Darwin to quickly publish his Origin of Species. (Read more about the history here and here.)

In The World of Life, Wallace argues for the existence of an organizing and directing entity that is missing from our understanding of how life works. Below are excerpts from the book, including an illuminating thought experiment which I think is pertinent to biology’s current state of affairs.

It’s Not Intelligent If It Always Halts: A Critical Perspective on Current Approaches to AGI

Posted on April 6, 2023 (updated September 5, 2023)

A photo of a robot with an hourglass as a body, sitting in a barren landscape

Imagine a conversation with one of these newly released AI chatbots. You ask it to solve a tricky math problem. It responds with “That seems kind of hard. Give me some time to think.” After a few minutes it comes back with “I haven’t solved it yet. And I am not sure I can. Would you like me to continue working on it?” Another few minutes pass and then it comes back with “Aha! I figured it out!” and proceeds to explain a neat and creative solution.

This scenario can never occur with PaLM, BARD, GPT-4, or any of the other transformer-based large language models that are thought to be on the path to general intelligence. In all of these models, each word in the machine’s response is produced in a fixed amount of time. The model cannot go away and “think” for a while. This is one of the reasons why I believe a solely transformer-based model can never be “intelligent”. (If you disagree with my characterization of transformers here, see section 4 and also this post.)

Summary: I argue here that intelligence requires the ability to explore “trains of thought” that are potentially never-ending. One cannot know a priori whether a certain train of thought will lead to a solution or whether it is futile. The only way to find out is to actually explore. And this type of exploration comes with the risk of never knowing if you are on the path to a solution or if your current path will go on forever. Intelligence involves problem-solving, and problem-solving requires arbitrary amounts of time. If a computer program is bound to finish quickly by virtue of its architecture, it cannot possibly be capable of general problem-solving.

In the summary paragraph above, I appealed to a number of intuitive notions (e.g. “train of thought”, or “exploration”, or “problem-solving”). In order to make my argument rigorous, I have to first introduce a few concepts rooted in classical theory of computation. In section 1, I will introduce three types of computer programs. In section 2, I describe what an unintelligent problem-solver can look like. In section 3, I describe what is needed to make the unintelligent problem-solver intelligent. In section 4, I explain why transformers can never be general problem-solvers. In section 5, I briefly discuss what I think needs to be done to address this problem.
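The chatbot conversation from the opening can be sketched as a resumable, open-ended search. This is only an illustration of the idea, not the formalism developed in the sections: the names `search` and `work_for`, and the toy problem, are hypothetical. The point is that `search` has no built-in bound on its running time, while the caller decides how long to keep it going.

```python
from typing import Callable, Iterator, Optional


def search(is_solution: Callable[[int], bool]) -> Iterator[Optional[int]]:
    """Explore candidates 0, 1, 2, ... and pause after every step.

    Yields None while no solution has been found, then yields the solution.
    Nothing in the loop guarantees that it will ever end.
    """
    candidate = 0
    while True:
        if is_solution(candidate):
            yield candidate
            return
        yield None
        candidate += 1


def work_for(steps: int, explorer: Iterator[Optional[int]]) -> Optional[int]:
    """Advance the exploration by at most `steps` steps, then report back."""
    for _ in range(steps):
        result = next(explorer, None)
        if result is not None:
            return result
    return None  # "I haven't solved it yet."


# Hypothetical "tricky problem": the smallest n with n * n > 2000.
explorer = search(lambda n: n * n > 2000)

first_try = work_for(10, explorer)    # budget too small: no answer yet
second_try = work_for(100, explorer)  # resumes where the first attempt stopped
print(first_try, second_try)
```

A program that must answer within a fixed number of steps corresponds to a single call to `work_for` with a constant budget; it can never say “give me more time”.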

The Researcher’s Guide for Being Mind Blown by a Neural Network

Posted on March 30, 2022 (updated June 17, 2022)

Every so often a new neural network makes headlines for solving a computational problem. It is sometimes hard for me to judge how impressive these achievements are without diving into the details of the models. But my criteria are always the same, and it should be easy for those who are familiar with their models to evaluate them based on these criteria. For this purpose I have made a flowchart for how impressed I would be by a neural network. If you know of a new neural net that reaches “wow” please let me know about it, and if it reaches “mind-blown” you have permission to wake me up in the middle of the night – since I know no examples.

The Truth About the [Not So] Universal Approximation Theorem

Posted on January 19, 2022 (updated June 10, 2022)

Computation is usually conceptualized in terms of input/output functions. For instance, to compute the square root of a number is to somehow evaluate the function that takes a number as an input and outputs its square root. It is commonly stated that feed-forward neural networks are “universal approximators”, meaning that, in theory, any function can be approximated by a feed-forward neural network. Here are some examples of this idea being articulated:

“One of the most striking facts about [feed-forward] neural networks is that they can compute any function at all.”

– Neural Networks and Deep Learning, Ch. 4, Michael Nielsen

“In summary, a feed-forward network with a single layer is sufficient to represent any function [to an arbitrary degree of accuracy],…”

– Deep Learning (Section 6.4.1 Universal Approximation Properties and Depth), Ian Goodfellow

“…but can we solve anything? Can we stave off another neural winter coming from there being certain functions that we cannot approximate? Actually, yes. The problem of not being able to approximate some function is not going to come back.”

– Course on Deep Learning, Universal Approximation Theorem, Konrad Kording

A 7 Minute Timer Has Been Discovered in Neurons

Posted on February 18, 2021 (updated May 19, 2022)

How does the brain keep track of time? This question has been intriguing neuroscientists for decades. Circadian clocks, which oscillate every 24 hours, are known to be implemented at the level of molecules and genes. But it is widely believed that keeping track of time for shorter durations (e.g. seconds and minutes) arises from electrical/synaptic activity patterns, not from molecular activity. The idea is that cells can be connected in ways that result in oscillations or sequential activity (e.g. one neuron fires at the 1s mark, the next fires at the 2s mark, etc.). As with most of our theories of short-term memory, if all the cells in a network go silent for a moment the timer falls apart. The spiking activity is what keeps the clock going. This theory has had its opponents, but I think it is fair to say that it has been a commonly held view in neuroscience.

A recent study, however, has made a serious crack in this paradigm. In a series of two papers from the Crickmore lab at Harvard University (one published last year and another last month), Thornquist and colleagues show that a single neuron can keep track of time in a completely silent manner. The time interval they studied was a 7 minute period in mating fruit flies. I believe this is a landmark study that every neuroscientist should know about. So here is my attempt at explaining it in simple terms.

Breaking Free from Neural Networks and Dynamical Systems

Posted on January 13, 2021 (updated February 19, 2023)

This blog post is written as a dialogue between two imaginary characters, one of them representing myself (H) and the other a stubborn straw man (S). It is broken into four parts: the dogma, the insight, the decoy, and the clues. If you do not feel like reading the whole thing, you can skip to part 4; it contains a summary of the other parts.

Quantifying Evidence (2): Evidence Is Limited By How Much a Study Can Be Trusted

Posted on November 17, 2020 (updated April 20, 2023)

In part 1, we defined evidence and showed that evidence across independent studies can be aggregated by addition; if Alice’s results provide 2 units of evidence and Bob’s results provide 3 units of evidence then we have a total of 5 units of evidence. The problem with this is that it doesn’t account for our intuition that single experiments cannot be trusted too much until they are replicated. 10 congruent studies each reporting 2 units of evidence should prevail over one conflicting study showing -20 units of evidence.
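The arithmetic behind this problem is short enough to spell out. The sketch below uses only the numbers from the paragraph above; the helper name `aggregate_evidence` is made up for illustration.

```python
from typing import Iterable


def aggregate_evidence(units: Iterable[float]) -> float:
    """Combine evidence from independent studies by plain addition."""
    return sum(units)


congruent_studies = [2.0] * 10   # ten studies, 2 units of evidence each
conflicting_study = [-20.0]      # one study pointing the other way

total = aggregate_evidence(congruent_studies + conflicting_study)
print(total)  # 0.0
```

Under plain addition the single conflicting study exactly cancels the ten congruent ones, whereas intuition says the replicated result should win.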

Let’s try to model this by assuming that every experiment has a chance of being flawed due to some mistake or systematic error. Each study can have its own probability of failure, in which case the results of that experiment should not be used at all. This is our first assumption: that any result is either completely valid or completely invalid. It is a simplification but a useful one.

We define trust (T) in a particular study as the logarithm of the odds ratio for the study being valid versus being invalid. In formal terms:

Quantifying Evidence (1): What Are Units of Evidence?

Posted on November 16, 2020 (updated May 19, 2022)

I am going to introduce a statistical framework for quantifying evidence as a series of blog posts. My hope is that by doing it through this format, people will understand it, build on these ideas, and actually use it as a practical replacement for p-value testing. If you haven’t already seen my post on why standard statistical methods that use p-values are flawed, you can check it out through this link.

My proposal builds on Bayesian hypothesis testing. Bayesian hypothesis testing makes use of the Bayes factor, which is the likelihood ratio of observing some data D for two competing hypotheses H₁ and H₂. A Bayes factor larger than 1 counts as evidence in favor of hypothesis H₁; a Bayes factor smaller than 1 counts as evidence in favor of H₂.

In classical hypothesis testing, we typically set a threshold for the p-value (say, p<0.01) below which a hypothesis can be rejected. But in the Bayesian framework, no such threshold can be defined as hypothesis rejection/confirmation will depend on the prior probabilities. Prior probabilities (i.e., the probabilities assigned prior to seeing data) are subjective. One person may assign equal probabilities for H₁ and H₂. Another may think H₁ is ten times more likely than H₂. And neither can be said to be objectively correct. But the Bayesian method leaves this subjective part out of the equation, allowing anyone to multiply the Bayes factor into their own prior probability ratio to obtain a posterior probability ratio. Depending on how likely you think the hypotheses are, you may require more or less evidence in order to reject one in favor of the other.

$\dfrac{Pr(H_1 | \text{data})}{Pr(H_2 | \text{data})} = \dfrac{Pr(\text{data} | H_1)}{Pr(\text{data} | H_2)} \times \dfrac{Pr(H_1)}{Pr(H_2)}$

$\text{posterior odds ratio} = \text{Bayes factor} \times \text{prior odds ratio}$
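To make the role of the prior concrete, here is a small sketch of the update rule above. The Bayes factor of 5 is a made-up number, and the function names are hypothetical; the two priors are the ones mentioned in the text (equal odds, and H₁ ten times more likely than H₂).

```python
def posterior_odds(bayes_factor: float, prior_odds: float) -> float:
    """Posterior odds ratio = Bayes factor x prior odds ratio."""
    return bayes_factor * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds for H1 against H2 into Pr(H1), assuming one of the two is true."""
    return odds / (1.0 + odds)


bayes_factor = 5.0  # hypothetical: the data are 5 times more likely under H1

equal_prior = posterior_odds(bayes_factor, prior_odds=1.0)
favors_h1_prior = posterior_odds(bayes_factor, prior_odds=10.0)

print(equal_prior, favors_h1_prior)  # 5.0 50.0
```

The same data move both people in the same direction by the same factor, but they end up at different posterior odds because they started from different priors.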

Let us define ‘evidence’ as the logarithm of the Bayes factor. The logarithmic scale is much more convenient to work with, as we will quickly see.

$E = \text{(evidence for $H_1$ against $H_2$ given $D$)} = \log_{10}(\text{Bayes factor}) = \log_{10}(\dfrac{Pr(D | H_1)}{Pr(D | H_2)})$

Evidence is a quantity that depends on a particular observation or outcome and relates two hypotheses to one another. It can be positive or negative. For example, one can say Alice’s experimental results provide 3 units of evidence in favor of hypothesis H₁ against hypothesis H₂, or equivalently, -3 units of evidence in favor of hypothesis H₂ against hypothesis H₁.
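The definition and its sign convention can be checked with a few lines of code. The likelihood values below are invented for illustration (they are chosen so that the Bayes factor is about 1000), and `evidence` is a hypothetical helper name.

```python
import math


def evidence(likelihood_h1: float, likelihood_h2: float) -> float:
    """Evidence for H1 against H2: log10 of the Bayes factor Pr(D|H1) / Pr(D|H2)."""
    return math.log10(likelihood_h1 / likelihood_h2)


# Made-up likelihoods of Alice's data under each hypothesis.
pr_data_given_h1 = 0.05
pr_data_given_h2 = 0.00005

e_h1_vs_h2 = evidence(pr_data_given_h1, pr_data_given_h2)  # about +3 units
e_h2_vs_h1 = evidence(pr_data_given_h2, pr_data_given_h1)  # about -3 units

print(round(e_h1_vs_h2, 6), round(e_h2_vs_h1, 6))
```

Swapping the roles of the two hypotheses inverts the Bayes factor, which on the logarithmic scale simply flips the sign of the evidence.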

Figure 1: In this example, two hypotheses are being tested against one another. H₁ is the hypothesis that the opossum is genetically closer to humpback whales than salmon is. H₂ is the hypothesis that salmon is closer to humpback whales than the opossum is. The data D that is being used to compare these hypotheses can be some DNA sequencing data, for instance. (The data here isn’t real.)

When Biology Isn’t Messy

Posted on October 28, 2020 (updated June 17, 2022)

There is a belief in biology that goes like this: Biology is messy. Nature has no interest in making things easy to understand. So for many scientific questions, there will not be a straightforward answer.

Q: Where and how is a particular memory stored in the brain? A: Biology is messy; memories are distributed all over the brain and stored in many different forms: as molecules and as structure, inside neurons and in the connections between them.
Q: How do genes determine the number of fingers per hand? A: Biology is messy; there isn’t a single factor for it. It emerges from the interactions of numerous different genes.

Now it is not always the case that our answers are so unsatisfying. Ask a biologist how the eye works and, well, there are quite a lot of similarities to how cameras work. And living organisms aren’t messy globs of formless flesh. They have an organized body plan, with separate organs, each responsible for specific functions that we can talk about; the heart pumps blood and the lungs pump air.