bookmark_borderA 7 Minute Timer Has Been Discovered in Neurons

How does the brain keep track of time? This question has been intriguing neuroscientists for decades. Circadian clocks, which oscillate every 24 hours, are known to be implemented at the level of molecules and genes. But it is widely believed that keeping track of time for shorter durations (e.g. seconds and minutes) arise from electrical/synaptic activity patterns, not from molecular activity. The idea is that cells can be connected in ways that result in oscillations or sequential activity (e.g. one neuron fires at the 1s mark, the next fires at the 2s mark, etc.). As with most of our theories of short-term memory, if all the cells in a network go silent for a moment the timer falls apart. The spiking activity is what keeps the clock going. This theory has had its opponents, but I think it is fair to say that it has been a commonly held view in neuroscience.

A recent study, however, has made a serious crack in this paradigm. In a series of two papers from the Crickmore lab at Harvard University (one published last year and another last month), Thornquist and colleagues show that a single neuron can keep track of time in a completely silent manner. The time interval they studied was a 7 minute period in mating fruit flies. I believe this is a landmark study that every neuroscientist should know about. So here is my attempt at explaining it in simple terms.

Fruit flies mate for a very stereotyped 20 minute duration and the male ejaculates at the 7 minute marker. Before ejaculation, the male fly is so motivated to continue that it will even risk death to continue mating. After ejaculation, it is easier to get a male fly to let go of the female using heat or noxious stimuli. So the 7 minute timer not only initiates a specific behavior (i.e., ejaculation) but also controls a behavioral state (i.e., the male’s motivation to continue mating).

Now, it has been known for quite some time that there are exactly four neurons (called the “Corazonin” neurons) located in what would be the equivalent of the spinal cord for fruit flies (in the Drosophila ventral nerve cord or ventral nervous system). These neurons are silent for the most part, but suddenly erupt about 7 minutes after the onset of copulation. At least two of them have to fire at the same time to drive ejaculation and switch the fly’s motivational state. In practice, though, all four fire together at the same time.

The first thought that comes to a neuroscientist’s mind is that “these four neurons are downstream of the central nervous system and they’re silent for the most part. So they’re unlikely to be involved in any complicated computation. They are probably just gatekeepers of the ejaculation event. Some upstream neurons must keep track of time and send excitatory inputs to the Corazonin neurons for the fly to ejaculate“. Right? Wrong!

It turns out that each of these four neurons have their own cell-intrinsic timer! Once a cell’s 7 minute timer is up, it fires. The four neurons are recurrently connected so a cell will also fire if it sees that the other neurons fire.

Imagine four agents sitting around a large table, each with their own small 7-minute hourglass. Their task is to shout together when 7 minutes has elapsed. But no one wants to rely too much on the accuracy of their own hourglass. For the first six minutes, everyone is staring at the sand pouring down in their own hourglass. As an agent’s hourglass approaches the end, she lifts her gaze to look around the table to see what the other three are doing. Making eye-contact gives her confidence that the end is near. But if no one else is looking up, she doubts her own timer and delays shouting. On the other hand, if an agent hears her peers shouting, she will ignore her own timer and shout along with them. The authors call this process “reaching consensus“.

But how does the cell-intrinsic timer work? And how can an internal molecular process lead to a burst of electric activity in the absence of any excitatory synaptic inputs? For anyone that knows how photo-transduction works, it should not be hard to imagine ways that a neuron’s membrane potential can be controlled by molecular events. But the beauty of this study is that it really gets to the bottom of things. They were able to trace an outline of the molecular events that implement the timer.

The timer actually consists of two shorter timers, where the conclusion of the first sets off the second. The first timer is a fully cell-intrinsic 5-6 minute timer implemented by CAMKII molecules. That is the subject of their first paper (Thornquist et al 2020). CaMKII is an ancient molecular structure that can store a memory in its phosphorylation state. CAMKII molecules are phosphorylated at the onset of copulation. With the passage of time, they gradually get dephosphorylated and that allows the second timer to begin running.

Figure from Thornquist et al. 2020 showing a schematic of CaMKII phosphorylation over time

The second timer implements “reaching-consensus”. Although it doesn’t strictly need external inputs to work, it is sped up by inputs from other Corazonin neurons. The main component of this timer is the gradual conversion of ATP to cyclic AMP (cAMP). Through a molecular cascade that follows cAMP production, the cell’s membrane potential rises and this lead to spiking. Normally it takes 60-75 seconds for this timer to lead to the eruption event. But this process can be artificially induced by using a light-gated protein that hastens the production of cAMP in the cell.

Figure from Thornquist et al. 2021 showing the molecular cascades of the second timer that result in the eruption in neuronal activity

There are still a few missing pieces to the puzzle. For example it is not entirely clear how the CaMKII timer is initiated or how the first timer sets off the second timer. But the general picture has already emerged with serious implications for the entire field. Most of our experimental methods of studying neuronal states (from calcium imaging to fMRI) involve direct or indirect measurements of electric signaling. But here, we have a short-term memory of elapsed time that is undetectable by these methods, because it is stored in the cell’s CaMKII phosphorylation rate and cAMP levels.

Is this a weird hack in flies? Or is it a general time-keeping mechanism for all animals? As someone who has switched from mammalian neuroscience to insect neuroscience, I know there is a tendency in the field to disregard biological discoveries in fruit flies as specializations that cannot be generalized. After all, the fly has fewer neurons at its disposal and cannot afford large networks that produce oscillatory or sequential activity for time-keeping. Why would larger animals need to implement timers with molecules when they have so many neurons?

But I like to turn that question around. If it is possible to implement molecular timers at the scale of minutes or seconds, why would larger organisms ever evolve neural networks for the job? There are a number of serious challenges in implementing timers with electric signalling and computing with molecules is much more efficient in terms of energy, stability, and space. As Sterling & Laughlin put it, “wherever possible, neurons follow the principle: compute with chemistry“. I think it will only be a matter of time before people discover that molecular timers are the norm, rather than the exception.

Dislaimer: Stephen Thornquist is a friend a colleague currently working in the same lab as me. Although to be fair I was tweeting excitedly about his paper before I knew him personally.

bookmark_borderBreaking Free from Neural Networks and Dynamical Systems

This blog post is written as a dialogue between two imaginary characters, one of them being myself (H) and the other a stubborn straw man (S). It is broken into four parts: the dogma, the insight, the decoy, and the clues. If you do not feel like reading the whole thing, you can skip to part 4; it contains a summary of the other parts.

[image adapted from Song et al. 2016]

H: I need a straw man to make this a bit entertaining.

S: I can be your straw man what are we discussing today?

H: We are going to challenge a belief that pervades biological sciences, a very harmful one in my view.

S: How bold of you! People always summon straw men in these situations.

H: Okay. I get it. But please try to be stubborn. And ask good questions. I want whoever is reading this to deeply understand my arguments.

S: I can try.

Part I: The Dogma

S: So what is this harmful belief that you talk of?

H: It is the belief that networks models and, more generally, dynamical systems can explain computation in biology. The assumption in neuroscience is that cognition and behavior is implemented by networks of neurons that inhibit or excite one another. And in developmental biology, the process of cell division and differentiation that can build an entire animal starting from a single cell, is thought to be implemented by networks of genes that regulate one another. Cell function is typically explained in a framework where networks of molecular events interact with one another. Network models dominate the way we think about biological computation.

S: I feel like I’ve heard this argument before. You are talking about connectionism. Right?

H: Well, not exactly. Connectionism is an extreme approach that applies to cognitive science. But my criticism is toward all of biology, including approaches that aren’t thought of as connectionist approaches. Virtually, the entire field of biology adheres to a paradigm of network models or dynamical system.

S: It sounds like you are criticizing a simplified caricature of biology. Networks are just one framework that people use to study things. Neuroscience, for instance, is not merely about studying networks of inhibitory/excitatory neurons. Sometimes single neurons can do complicated things and people study neurons at that level.

H: Like what?

S: A lot goes on inside a single neuron. Take dendritic computation for instance. Dendrites form a complex tree like structure that can transform a cell’s inputs in sophisticated ways. It has even been shown that single neurons are capable of implementing things like XOR gates on their inputs. Neurons are not always simple integrate-and-fire cells with monotonic input/output relationships.

H: Dendritic computation can be modeled as a network. I don’t see this as transcending the network paradigm because it is merely looking at a single neuron as a network of dendritic compartments. In the end you have a network of networks which is just a larger network.

S: Okay, but not everything is a network. For example, single neurons can modulate their intrinsic properties, like excitability. In fact, there’s a recent study that suggests the precise ion channel composition of neurons might be an indispensable part of understanding how songs are generated in songbirds.

And to make matters even more complicated, neuronal activity can be rapidly modulated by molecular factors. There’s another recent study in flies that shows that phosphorylation rates can be used to control a cell’s activity in a way that implements a cell instrinsic timer.

H: These happen to be some of my favorite papers! But they still fall within the paradigm of networks and dynamical systems. The paradigm I am talking about runs deep in all of biology, not just neuroscience. And it is not a caricature of the field. I don’t know of a single exception! Let me try to make it more clear.

Take any biological model, including any of the ones you brought up. Can you represent the state of the system as a finite set of variables, which change in time depending on the system’s state? Those variables can be expression levels of certain ion channels, phosphorylation rates of certain molecules, synaptic weights, or whatever you like. The rates at which each variable changes can depend on any number of other variables, including itself. Usually the dependencies are sparse, in which case it is convenient to draw it out as a sparse network. But the sparsity doesn’t really matter to me.

S: You’re describing dynamical systems.

H: Yes. The question is whether you can describe your model as a dynamical system. Can you represent your model with a predefined number of quantities and write down a number of equations that specify how those quantities change in time?

If your answer is yes, you are operating withing the paradigm that I am trying to challenge. You can see how this paradigm is not limited to neuroscience. It even guides our developmental models like Waddington’s conception of a genetic landscape and Kauffman’s boolean networks for gene regulation.

In the 1950s, Conrad Waddington introduced the analogy of a marbles rolling down a hill with grooves in order to explain how cells with the same genome differentiate during embryonic development.

S: That’s what you are criticizing? Models that can be represented with a number of variables and differential equations?

H: Yes. You can always formulate our biological models as sets of differential equations with a finite number of variables. In a way, biology has been ignorant of computer science by sticking to a framework of finite dimensional dynamical systems.

S: Hold on a minute. Any physical model can be written as a set of differential equations with a finite number of variables. How is that being ignorant of computer science? Physics, at the fundamental level, is all differential equations.

H: That is precisely what everyone gets wrong about this. Some physically realistic systems cannot be understood as finite dimensional dynamical systems.

S: Like what?

H: Like Turing machines.

S: Isn’t that because Turing machines are mathematical abstractions that require infinite memory? In biology we’re concerned with things that can be physically instantiated, not mathematical abstractions.

H: Turing machines do not need infinite memory. In his original 1936 paper, Alan Turing did not endow his machine with an infinitely long memory tape. In fact he was explicitly concerned with things that are – in his own words – “calculable by finite means”. It is just that the description of a Turing machine does not include the length of the memory tape. In other words, you can describe how a Turing machine operates without specifying how long the memory tape is or specifying whether it can grow if it runs out of space.

S: I am lost. Every physical device contains a limited number of particles. And the laws of physics are differential equations. So does it not follow that every physical device is ultimately a finite dimensional dynamical system?

H: Excellent question. The answer is, surprisingly, “no”. Any closed device can be modeled as a finite dimensional dynamical system. But a device that can exchange particles with the environment might not be describable as a finite dimensional dynamical system. That doesn’t mean it’s an infinite dimensional system. (The nomenclature is confusing). At any given moment the device’s state is determined by a finite number of variables. But that number can increase as a result of the device recruiting more memory space by adding new components to itself. Adding new components to the system is like adding new variables and new differential equations. Think of a dynamical system that has a rule for adding new equations/dimensions.

So a system that grows in its number of parts might not be describable as a finite dimensional dynamical system.

S: Are you saying that we fail to consider that systems can grow in biology? What about cell division or neurogenesis? What about protein synthesis? People model growth all the time.

H: No that’s not what I said. What I am saying is that the models we use to study computation in biology always fall within the framework of finite dimensional dynamical systems. This is true even when the subject is cell division or dendritic spine growth. People capture growth with single variables, e.g. x = total population of cells or y = number of spines. You can always describe our biological models with predetermined number of variables that have predefined rules for influencing one another. The variables that are needed to simulate the model are known ahead of time.

That is a weak framework for studying computation. Not all physical systems are finite dimensional dynamical systems. Closed systems are. But systems that can grow are not.

S: But what is the alternative? Computers? If anything, it seems like computers are extremely closed systems. They are finite collections of wires, transistors, and circuit elements that obey the laws of physics. They don’t grow. So they can be described as dynamical systems. Right?

H: You might be able to describe a specific computer, like the one on my desk, as a physical object governed by a set of differential equations. But that misses the point. Computer science is mostly the study of computer programs, not of the physical devices which execute them.

Computers don’t grow but computer programs grow. A program can recruit more memory and vary in the amount of space it occupies as it progresses in time. When I say “this code runs with polynomial memory space”, I am being agnostic about how much hard disk is available on the computer that it is intended to run on. So it’s really a statement about the computer program not the computing device.

S: That’s an interesting way of looking at it. So what you’re saying is that computer programs can’t be modeled as dynamical systems because they might grow in the space they occupy in the computer’s memory, which means the number of variables needed to describe their state may increase through time.

H: Yes. And you may not know how many variables are needed to run a specific program ahead of time, because it can depend on the input to the program.

S: Okay, I see. Dynamical systems might not be able to describe computer programs. But are you suggesting they are also the wrong framework for describing biological systems too?

H: Yes, perhaps.

S: Why? Because biological systems grow?

H: Oh, no. You’re making it sound like the issue is growth. The only reason I brought up growth is to argue that dynamical systems are not general enough to describe all kinds of physical systems. I am afraid this is distracting us from the main point.

The main reason dynamical systems are so unsatisfactory is that they are computationally weak systems, realistically speaking that is. It’s about computation power, not growth.

S: In what sense are dynamical systems weak?

H: They are weak in terms of the scope of problems they can solve. There are some problems that cannot be solved using dynamical systems.

S: What is an example of a problem they cannot solve?

H: Here’s one easy example. No realistic dynamical system can be used to simulate Turing machines.

S: Why should biologists care about simulating Turing machines?

H: Okay, forget Turing machines for now. I’ll give you another example. Consider this problem. Given a piece of code in your favorite programming language (e.g. Python or C++), determine the output of program. Realistically, dynamical systems cannot solve this problem.

S: You’re saying that dynamical systems are weak because they cannot simulate computer programs. But computer programs are digital and deterministic. Biology is analog and probabilistic. Why even compare the two?

H: To answer this I have to teach you something which I consider to be one of the most profound and mysterious discoveries of the 20th century.

Part II: The Insight

H: In the late 1920s and early 1930s, mathematicians and logicians were getting deeply philosophical. They were asking questions like “can we come up with an algorithm to decide if a given statement can be proved true?”. Or “Are there some numbers that can be defined but not calculated?”. Or “can we come up with a system for constructing functions that is general enough to include all calculable functions?”. There was a concerted attempt to formalize the notion of “computation”.

Three approaches came out of that era, that were developed independently of one another: 1) combinatory logic and lambda calculus, 2) general recursive functions, and 3) Turing machines and Post’s model. In the end, these three competing attempts at formalizing the notion of computation turned out to be equivalent. It wasn’t obvious at first. In fact, it took them years to realize this. All three systems are equally powerful. Each one can be used to simulate another.

This led people to the idea that they had found the limit of computability. It is common to state this idea in terms of Turing machines: “Turing machines are capable of computing any computable function” which is essentially what the Church-Turing thesis says. But you can replace “Turing machine” with any of the other systems. For instance, lambda calculus is capable of computing any computable function.

S: That’s why you keep bringing up Turing machines.

H: Yes! And you can see now that Turing machines aren’t really that special. They’re just one of many universal computation systems. When I bring it up you can replace it with any other equivalent system, even your favorite programming language. Because anything that can simulate a Turing machine is also capable of computing any computable function.

S: So if I understand correctly, there’s something called the Church-Turing thesis that says Turing machines can compute anything that can be possibly computed. And I can replace “Turing machine” with “Python” or “javascript” and it still holds true.

H: Yes!

S: Is this a proven fact?

H: Well this is where it gets a bit mysterious. The problem with proving the Church-Turing thesis is that it’s not entirely well-defined. You have to prove that Turing machines can compute anything that is computable. But what does it mean for something to be “computable”? The whole point of coming up with Turing machines was to formalize what it means to compute.

S: So it’s a definition! The Church-Turing thesis simply defines what it means to be computable. Right?

H: Some people seem to think it’s just a definition. I don’t. I think it’s claiming something substantive. A definition isn’t something that is true or false. But this is a statement that can in principle be refuted.

S: I still don’t get it. So is the Church-Turing thesis an empirical statement? Or is it a mathematical statement?

H: It’s unclear. From the very beginning there have been different interpretations. Gödel suggested that it can be proved from a set of axioms that define “effectively calculable”. In fact, Nachum Dershowitz and Yuri Gurevich, proved it in 2008.

S: So there is a proof for the Church-Turing thesis?

H: Yes, but you can always disagree with the axioms that they chose for that proof. Someone can say that their definition of computation was not general enough to encompass all kinds of computation. For example, their axioms allowed analog computations and real numbers but everything still has to happen in discrete steps. Computation has to be algorithmic according to their axioms.

S: That’s brings us to something I wanted to to say, actually. In the classical view, computation is something that is algorithmic – happening in discrete time steps. And there may be other assumptions about what it means to compute that don’t necessarily apply to the natural world.

H: Then let’s only consider the empirical interpretation of the Church-Turing thesis. If you show me a biological or physical system that can compute things that Turing machines cannot, then we shall say we have disproved the Church-Turing thesis. Non-algorithmic non-discrete computations are permitted. So far, no one has done this.

S: I guess I’ll have to take your word for it. But I’m still bothered by this whole approach. You are imposing a framework onto an area where it might not be applicable. Nature does things that cannot be well defined. It’s not mathematics or computer science where you have a clear separation of inputs and outputs. I mean, what are the inputs and outputs of my cat? What exactly is it computing? Does a redwood tree compute? What about a rock that absorbs sunlight and emits rays and changes shape with years of weathering?

H: Excellent questions. This is what I am claiming: You get to pick the system and you get to pick the inputs and outputs. As long as your system is realistic, anything that your system can be used to compute, Turing machines can also compute. I don’t care if your system is analog, probabilistic, or continuous in time. You cannot transcend the power of Turing machines unless the Church-Turing thesis is wrong (at least the physical/empirical interpretation of the thesis).

S: But my point is that sometimes we might not be able to define the inputs and outputs and specify how they’re related. For example, integer factorization is a well-defined computation problem. The input is a single number and the output is a series of numbers, uniquely determinable from the input. But if we’re talking about biology, well… what is the input to output mapping that a cat solves?

Perhaps you can say the inputs come from its sensory organs and its outputs are motor commands. But what time interval are we talking about? Are we talking about mapping instantaneous sensory stimuli to immediate motor commands? Surely, cats cannot be modeled that way; a cat can do different things in the same exact sensory situation. So the instantaneous input cannot be mapped to a unique output because a cat can use memory of past experiences to influence decision making. On the other extreme you might say that the cat maps a lifetime of input stimuli to a lifetime of motor commands to maximize survival or reproduction. But this is not the predefined input to output framework that is typical in computation theory. Most of a cat’s sensory inputs are the result of its previous decisions. A cat’s actions shapes its sensory inputs. And the sensory inputs along with a memory of past experiences is used to guide its next actions. There’s a constant feedback loop between inputs and outputs. Whereas, in integer factorization its just a straightforward predefined mapping.

H: You raise very good points. Here is how I see it. It is not that the cat as a whole organism is solving a specific input/output mapping. Rather, it is that the cat contains one or more computational systems that can be used to solve various problems when they come up.

An analogy that you may dislike but I think will be helpful anyway is a smart phone. What input/output relationship does your phone solve? The same issues you raised about past experiences and input/output feedback loops also applies to computing devices like smart phones. It is really difficult to describe smart phones as solving a unique input to output mapping. But your smart phone contains many applications, each containing many modules and functions. One module is responsible for drawing images on the screen; another may be responsible for converting information into packets to send over the network. At that level, the input to output relationships can be well defined.

Similarly, a cat’s behavior can be decomposed into simple tasks and computation problems. People characterize and study these smaller components in the field of animal behavior. My current favorite example is path integration: knowing where you are in space by integrating small movements. It is a well-defined computation problem, almost as well-defined as integer factorization! The inputs to the computation (orientation and speed) have to come from other computation modules and the output (position in space) can be used for all sorts of different things.

S: Okay, but even if we succeed in decomposing animal behavior into smaller components and modules, the functions are not necessarily as cleanly defined as it is for, say, path integration. In path integration there is a unique correct answer. But when an animal is searching for a mate or running away from a predator, there isn’t a unique correct answer. There can be many correct answers and the animal can just make a decision that is good enough. The classical conception of computation, with fixed input to output mappings, is not suited for these types of problems.

H: I disagree. In practice, many algorithms are heuristics or approximations that find an answer that is good enough. And very often computer programs use random generators to produce outputs that are variable and not always unique.

I think you can always formulate computation problems as classical input to output mappings. A problem that has more than one correct output can be reformulated as a decision problem that checks whether a candidate output is correct or “good enough”.

S: Your claim is that that Turing machines are this ultimate extremely powerful computing system that can solve anything that can be solved, including all problems that come up in biology. Where are you going with this?

H: Good. Being able to solve all computable problems is called “universal computation”. But let me reemphasize that its not just Turing machines that are universal. There are numerous systems that are also universal. Very little is required to achieve universal computation. Almost all programming language achieve it, even ones with ridiculously minimal components. Some systems have accidentally stumble upon it. For instance, Wofram’s rule 110 or Conway’s game of life were not intended to be powerful computation systems. But with very simple rules, they are capable of computing any computable function (i.e. Turing-equivalent). So it’s not that Turing machines are extraordinarily powerful computers. It’s that dynamical systems and neural networks are pathetically weak.

S: Pathetically weak? That sounds unfair. What about all the recent success we’ve had with deep neural networks? Neural nets can beat humans in games like Go and Starcraft!

H: And what do these deep networks run on? Are they not computer programs running on digital computers?

S: Yes, but they can be implemented with networks of neurons too.

H: Maybe they can. But we are currently using digital computers and programming languages to train and run them, not analog circuits or networks of neurons. I think that attests to the generality of the Turing-equivalent systems. Notice that Go and Starcraft are both examples where there is no uniquely correct move. You just criticized the classical framework of computation theory for not being able to deal with these kinds of problems. Remember? Well here we have programs that are playing games! Isn’t it fascinating that we are still using the same CPU operations and architecture for this? Turing-equivalence is all you need. Dynamical systems, on the other hand, can’t even do what the simplest programming languages can do.

S: You need to be a little more concrete about the weakness of dynamical systems. I asked you for an example of something they can’t compute, and you said they can’t simulate Turing machines or run computer programs. But that’s not a very convincing example.

H: Well, to be honest it’s hard for me to give you more examples that I’m 100% sure about. But I can give you examples of things I suspect dynamical systems are realistically incapable of. Here’s one: given a string of open and close parentheses, e.g. “(()(()))()”, return its maximum depth.

Or here’s a yet simper one! Given a string of Xs and Os, determine if the number of Xs equals the number of Os.

S: Really? I have a hard time believing dynamical systems are incapable of solving this problem. It’s basically a counting problem.

H: Yes. I suspect that dynamical systems cannot be made to count up to arbitrarily large numbers. You can certainly make a dynamical system that will work for input sizes up to, say, 50 or 100 characters of Xs and Os. But there will always be a limit to how high your dynamical system can count.

S: Any physical system will have that limitation. Your computer has limited memory so it also has a limit to how large of a number can be stored on it. It can’t count up to infinity!

H: You’re confused again about the distinction between computing devices and computer programs. It’s the computer program (and also the computer architecture) that is Turing-equivalent. I can write a program that counts Xs and Os up to arbitrarily large numbers. If you run it on some huge input and it runs out of memory or batteries, that’s a limitation that came from your computing device – not from my program. You can then take the same program and run it on another computer with large enough memory and it will work. The crucial point is that it will work with more resources without changing the code.

On the other hand there is no description of a dynamical system that can count arbitrarily high. You can describe a dynamical system that counts up to a very large number. But if I give an input that exceeds that limit, you will have to redesign your dynamical system. The description of your dynamical system will have to change to deal with larger inputs.

S: Okay. But I can argue that biological systems don’t need to count up to arbitrarily large numbers.

H: Neither do computer programs for all of our practical purposes! But we still use universal computation systems in technology.

S: But what’s something that biology absolutely needs universal computation for? It sounds like universal computation is not necessary if the input size is restricted.

H: If the maximum input size of a problem is predetermined, you can solve that problem with even a look-up table – which is a far weaker than a dynamical system. But there are plenty of computation problems in biology where the input size is not predetermined.

S: Like what?

H: Oh, I can think of so many examples, from human speech recognition to insect path integration.

S: But there is a limit to how long a sentence practically gets or how far an insect will ever walk.

H: Yes, I suppose you are right. But it’s much simpler to describe a solution to those problems that doesn’t take that limit into account.

It’s like saying that we don’t need an algorithm for division, because you can just memorize a very large division table. By your logic, there is a limit to the set of numbers humans will ever encounter, so we can replace all of math with look-up tables. The problem is that the descriptions become excessively long and complicated if you want to solve problems this way.

You have to stop thinking of universal computation as this sophisticated level of computation power that would not evolve unless there is a need for it. It is rather a very accessible level of computation that can be reached – often accidentally – by implementing unsophisticated rules.

Let me ask you a question. Do you think cars need a universal computer to function?

S: Probably not. Cars were invented way before computers.

H: Indeed. In principle, cars don’t require a universal computation system to work. But then why do modern cars have dozens of microprocessors in them? (Microprocessors are based on the von Neumann architecture and Turing-equivalent). I’m talking about things like controlling the engine and breaks. That’s all done with microprocessors these days. In fact we have microprocessors in almost every modern device, including coffee machines, thermostats, toys, washing machines. Microprocessors are replacing the older analog circuit solutions. Not because these devices strictly need universal computers to work, but because universal computers are sufficient. In most situations it’s the simpler and more efficient way to compute.

S: I see your point. Although I am wary about generalizing what works for human technology to what works for biology. My intuition is that dynamical systems are a messier but more natural computation system. As long as they do the job, nature would not evolve a universal computer.

H: You are not alone. I believe this is how most biologists think. My intuition, however, is that it is extremely unlikely for nature to not have stumbled upon a universal computation system. And once such a system is found, it will spread and be used for almost everything that involves computation.

Part III: The Decoy

S: I am sorry to say, I have bad news for you. The entire premise of your argument is that dynamical systems and network models are computationally weak and that – unlike Turing machines and other universal computers – they cannot be used to compute any computatable function.

Well, I did a bit of research and it turns out you are wrong! In 1990 Cristopher Moore showed that you can simulate Turing machines using finite-dimensional dynamical systems. In the following years, several other authors independently proved that dynamical systems can be Turing-equivalent. There’s even a neural network version! In 1991, Hava Siegelmann and Eduardo Sontag showed that you can simulate Turing machines using a finite sized neural network.

According to what you said, anything that can simulate Turing machines can also be used to compute any computable function. So dynamical systems and neural networks are universal.

H: You are good at this. I am glad you brought up these studies. But they do not contradict my position. I have tried to phrase my arguments carefully. What I said is that realistic dynamical systems cannot be used to simulate Turing machines.

S: Oh boy. What do you mean by realistic? This better be convincing.

H: It will be. I will convince you on two fronts. First, a practical consideration regarding the resolution of physical quantities. Second, a more serious consideration regarding something called structural stability.

Let’s take a closer look at how these dynamical systems are able to simulate Turing machines. Turing machines are composed of a “memory tape” and and “machine head” that can manipulate the symbols on the tape. The machine head only needs to be as intelligent as a finite state automaton. So implementing the head using a dynamical system is easy. The tough part is implementing the memory tape. Because with finite-dimensional dynamical systems you only have a predefined set of variables at your disposal. But there is no limit to the number of symbols that can be on a Turing machine’s memory tape.

Do you know how these dynamical systems implement the memory tape? They do it by using numerical digits after the decimal point as a string of symbols! So the number 0.2381 is interpreted as a memory tape consisting of the symbols 2, 3, 8, and 1. Now if you want to add the symbol 6, all you have to do is add 6 and divide by 10 to obtain 0.62381. If you then want to retrieve a symbol, you multiply by 10 and round down. That’s how you can store unlimited information in a single number.

Although, to be precise, a single number is more like a stack than a bidirectional tape. You need two numbers (two stacks) to implement a bidirectional tape. And the example I used was in base 10. But you can represent your numbers in any base. Binary representations will do. You can even use three unary stacks to simulate Turing machines. But these details are not important. My point is that all of these dynamical systems use numerical digits as strings of symbols.

S: I agree that this is a strange model. I don’t think biological systems use numerical digits as strings of symbols. But these specific models are unrealistic because they are accomplishing the unrealistic goal of simulating Turing machines. Biological systems don’t need to simulate Turing machines.

My understanding is that simulating Turing machines is like a benchmark. If a system can be shown to do that, we know it can compute anything.

H: Your understanding is correct. But using numerical digits as symbols is critical for these systems. A necessary condition for universal computation is that the number of possible states not be restricted by the descriptor. The number of states for a computer program cannot always be determined from the code. It may depend on the input.

The number of variables in a finite-dimensional dynamical system is predetermined. If the amount of information that can be stored in each variable is bounded, then the number of states of that dynamical system is bounded by its descriptor. So it cannot be universal.

The only way around this is to come up with a way to store unlimited information in at least one those variable. That is why you have to use unlimited precision in these systems.

S: I am not convinced. Unlimited precision is unrealistic, yes. But so is unlimited memory. Don’t Turing machines also require unlimited memory?

H: Right. We assume no limits for memory or energy when describing Turing machines. And we’re fine with that. The problem is not that these systems assume unlimited precision. The problem is that physical quantities are, in practice, extremely limited in the amount of information they can store. What is the practical resolution of a neuron’s membrane potential, or the concentration of a specific chemical? Let’s be generous and say 10 orders of magnitude. That’s just over 30 bits of information. This places severe practical limitations on how much memory can expand in dynamical system. Compare it with computer programs that can practically recruit trillions of bits of information.

S: But the human brain contains billions of neurons and trillions of synapses. Even if each of these neurons/synapses contain a few bits of information, that seems plenty to me!

H: We are talking about how much memory can be added to a computation system, not how much information can be stored from the onset.

S: I don’t understand. How is a brain that contains billions of neurons and trillions of synapses different from a computer that contains billions or trillions of bits? Isn’t their memory capacity comparable?

H: The difference is that the amount of memory that is present in a computer is not part of a computer program’s descriptor. We spoke about this. You can give me a code that solves a specific problem without knowing how much memory is available on the computer that it will run on. Computers have a reservoir of general purpose memory that programs can recruit during computation. Any program, regardless of its purpose, can draw upon unused memory to complete a computation.

On the other hand, finite-dimensional dynamical systems and neural networks are not designed to work that way. There aren’t network architectures or dormant dimensions/variables that can be used as general purpose memory units. In the papers you cited from the 90s, memory expansion is achieved by expanding the precision of one, two, or three special variables. It is as if the dynamical system is drawing from a reservoir of decimal places of a specific physical quantity, not from a reservoir of neurons or network motifs.

S: But if I come up with a model that does work that way, a model that uses general purpose memory units, that model will be universal?

H: Yes, potentially. As long as the number of memory units is not part of your network descriptor your model might be universal. (Remember, the descriptor is what specifies the computation problem that it solves). I don’t know of any biologically plausible network models that work like this. The closest thing I’ve seen are what are called neural Turing machines.

S: Let’s take a step back. It seems that you agree that dynamical systems are capable of simulating Turing machines, and thus capable of computing any computable function. But your critique is that these systems are limited in how much their memory usage can increase once they are set into motion.

H: Yes. That is my first criticism. The limits on the precision or resolution of a physical quantity are much more severe than limits on how many components or particles can be added to a system.

The dynamical systems you cited, even if they can be implemented, can only increase their memory usage by something on the order of a hundred bits. While modern computer programs can increase memory usage by trillions or quadrillions of bits and not even be near any physical limitations.

But there is an even more serious problem with the dynamical systems you cited.

S: What?

H: All of the papers you’ve cited are dynamical systems that lack structural stability. Structural instability renders a system physically irrelevant. It is practically impossible to build a structurally unstable dynamical system or to find one to occur in nature.

In fact, there is a conjecture about this by Cristopher Moore, the very person you cited as first solving the Turing machine simulation problem. He conjectured in 1998 that “no finite-dimensional [dynamical] system capable of universal computation is stable with respect to perturbations, or generic according to any reasonable definition”.

This is why I say they are unrealistic.

S: And structural stability is widely accepted as a condition for being realistic?

H: Well, sort of. Moore argues that it is. A system that lacks structural stability needs to be implemented with infinite precision for it to resemble to intended dynamics. An error of one part per billion is enough to ruin it.

S: But it’s a conjecture, not a proof.

H: Right. There may be some ingenious method for getting structurally stable dynamical systems to be capable of universal computation. And maybe nature uses that method. But as far as we know, dynamical systems are realistically incapable of universal computation.

S: And in that case, your argument falls apart.

H: Yes. If Moore’s conjecture is wrong then my argument falls apart. Otherwise dynamical systems are realistically weak computation systems.

Part IV: The Clues

S: There is much I have learned from our discussion. Let me try to summarize your arguments so far. You have convinced me that not all physical systems can be modeled as finite dimensional dynamical systems. That is because some systems can grow but dynamical systems don’t have a rule for adding more dimensions going forward in time.

H: Yes!

S: You also claim that there is this level of computation power that is easy to reach. And once you have a system that reaches that level, it can compute anything.

H: Well, not anything. A universal computer can compute anything that is computable. There are some things that are non-computable, by any machine. But that’s a discussion for another time.

You’re doing a great job summarizing my points. Please, continue.

S: Okay. You also believe that dynamical systems do not reach that level of computation power.

H: Yes. Although, to be precise, we’re talking about dynamical systems that are physically/biologically relevant.

S: But at the same time you claim that it is really easy to reach universal computation. You keep saying that people sometimes accidentally stumble upon it.

H: Yes. Universal computation only requires a number of simple rules of operation. My intuition for why realistic dynamical systems fail to reach it is that they cannot grow in their dimensionality, which is something computer programs easily do.

S: So then your point is that evolution must have also stumbled upon universal computation?

H: It is hard to believe that evolution would not have stumbled upon universal computation. And once it finds a system that does compute like standard computers do, the selective advantages would be enormous.

S: And how exactly does biology implement universal computation?

H: I don’t know. We need to discover it.

S: I know. But I want to get a better sense of what it is that you think we should be looking for. A common theme that keeps coming up is growth or adding new variables. You say that a computer does this by providing a reservoir of memory space that programs can recruit when needed. A computer program can grow in the memory space provided in the computer’s hardware. So should we be looking for neural network architectures that implement general purpose memory units? Or some sort of dynamical system that has dormant dimensions that can be recruited according to specific rules?

H: That might actually work. Network motifs that serve as general purpose memory might be a solution to universal computation. But I think they’re unlikely to exist, especially given the fact that we haven’t found anything yet.

I think the most promising place to search for a universal computation system is at the molecular level. RNA and DNA molecules are bizarrely well-suited to be used in a computation system. They are extremely stable, can be modified with very little energy, and most importantly are structured as a string of symbols from a limited alphabet.

S: I’m getting the sense that you don’t know what you’re talking about. You think RNA or DNA are used for general purpose computation? Isn’t there a consensus in biology that DNA and RNA are for encoding and building proteins?

H: I am aware that this is a highly controversial claim. But I am not the first to suggest these ideas. In fact, some biologists are converging on RNA and DNA as substrates for memory and computation from two very different angles.

From the side of cognitive science, there is a [radical] idea that has been gaining momentum over the past decade: that memories are stored not in the synaptic weights between neurons, but rather in the structure of specific molecules inside neurons. The most well-known proponent of this idea is Randy Gallistel. I highly recommend you check out some of his writings. His 2008 book was a great source of inspiration for me, but you can also check out this paper summarizing the debate. These theories actually have their roots in the 1960s but the more recent articulation is much stronger and more convincing in my opinion.

S: Is there any experimental evidence for it though?

H: There’s some evidence, although not indisputable. There’s some evidence from studies of how neurons learn and remember time intervals, and some more evidence from studies of how sea slugs sensitize to shocks. I can point to a few others but the experimental evidence is scattered and controversial at this point.

But let me tell you about the other angle that is also converging upon RNA and DNA as having a computational role. You may know that less than 1% of our DNA is actually encoding for proteins. Every year, more functional roles are being discovered for the non-protein-coding portion of DNA and RNA. There are some that believe that most of our genome might be functional. Resolving the extent of functionality in non-protein-coding RNA has been described as “the most important issue in genetics today”. You can check out this paper to see some of their arguments in the context of the ongoing debate in the literature.

S: These are indeed some very radical claims. I don’t see why we should take them seriously without strong empirical evidence.

H: Perhaps. But I believe the mainstream paradigms for cognition and development also need to be substantiated empirically. The failure of our models to clearly explain how genes turn into organisms or how networks of neurons turn into behavior, is itself a sign that our models are lacking something.

S: It’s still not clear to me how a biological model for universal computation might look like. How exactly can RNA or DNA be used to compute things? The neural network model with general purpose memory units is a lot more tangible to me.

H: That is a question I have been thinking about for years. Can we come up with a biologically plausible system based on RNA or DNA, that is universal in its computation power? Is it in principle achievable at the molecular level? The answer, I argue, is yes. I wrote a paper on how a set of plausible RNA editing rules can make it possible to compute any computable function, or even simulate Turing machines at the molecular level. This is something that realistic dynamical systems are incapable of. I did not have to assume any extraordinarily complex molecular machinery in the model. It actually consists of very simple molecular processes, not very different from what we know to exist in cells.

S: So this is how you think biology computes things? This makes things a lot more concrete. You should have brought it up in the first place.

H: Well, to be honest, I still don’t know what I believe in. It’s very likely that nature’s universal computer will look different than what we can predict from pure theory.

The point of my model was to demonstrate that universal computation is in principle attainable at the molecular level. It is easily within the reach of evolution.

S: I see. And then you argue that it must have evolved.

H: Or that life began with one in the first place. The rules that make universal computation possible can be so simple that it is conceivable that it occurred naturally, by chance. RNA predated DNA and proteins. Maybe life began when matter suddenly stumbled upon universal computation by accident.

This may surprise you but my intuition is that nature needs Darwinian evolution to construct an eye, but it doesn’t need Darwinian evolution to construct a universal computer.

S: And you think this RNA-based computation system is being used in cognition, development, and cell regulation.

H: I would like to believe so. But, again, we don’t know yet. We have to discover life’s computer. It probably works in ways that are hard to predict with what we know now.

S: Good luck trying to discover it. These ideas are very interesting but I am afraid you may be placing too much faith in theory. At the end of the day, it is empirical evidence that will sway people.

H: And I am worried that you are placing too much faith in what we think we know about biology. Between that and computation theory, I choose to place my faith in computation theory. Either biology magically circumvents the insights we have learned from the theory of computation and from the 1930s, or we just haven’t discovered how biology computes yet. I think the latter is more likely.

This isn’t something I can go off and discover by myself. It requires resources and dedicated people willing to work on it. That’s the whole reason I am trying to sway people. I have come to realize over the years that an important component of scientific progress is collectively deciding what problems are worthwhile. It would be quite sad if humans never discover how they think and what brings matter to life. As a species, we have limited resources and limited time on this planet. There is too much left to discover for me to be fine with people working in directions that I believe to be dead-ends.

S: I agree with your last point. Although what you are proposing may be a dead-end too. Deciding what problems are worthwhile is a subjective matter. But if it makes you feel better, this discussion is something I will be thinking about. for some time

H: Great. That’s all I was hoping for.

bookmark_borderQuantifying Evidence (2): Evidence Is Limited By How Much a Study Can Be Trusted

In part 1, we defined evidence and showed that evidence across independent studies can be aggregated by addition; if Alice’s results provide 2 units of evidence and Bob’s results provide 3 units of evidence then we have a total of 5 units of evidence. The problem with this is that it doesn’t account for our intuition that single experiments cannot be trusted too much until they are replicated. 10 congruent studies each reporting 2 units of evidence should prevail over one conflicting study showing -20 units of evidence.

Let’s try to model this by assuming that every experiment has a chance of being flawed due to some mistake or systematic error. Each study can have its own probability of failure, in which case the results of that experiment should not be used at all. This is our first assumption: that any result is either completely valid or completely invalid. It is a simplification but a useful one.

We define trust (T) in a particular study as the logarithm of the odds ratio for the being valid versus being invalid. In formal terms:

    \[ T = (\text{subjective trust in a particular result}) = \log_{10}\left(\dfrac{Pr(\text{valid})}{Pr(\text{invalid})}\right)  = \log_{10}\left(\dfrac{Pr(V)}{Pr(\overline{V})}\right) \]

A trust T=2 corresponds to a belief that the odds the outcome being flawed is 1 to 100. T=3 corresponds to an odds of 1 to 1000. In my view, 1<T<3 is reasonable for the typical study. But trust is something subjective that cannot be objectively calculated. Much like priors, it depends on the person sitting outside of the study interpreting its results.

Take a study with data D that reports evidence E. That study can either be valid or invalid (represented by V and \overline{V}). The reported evidence was calculated under the assumption that the study was valid. So:

    \[ E = (\text{reported evidence}) = \log_{10}\left(\dfrac{Pr(D | H_1 \& V)}{Pr(D | H_2 \& V)}\right) \]

    \begin{align*} \\ \Rightarrow \begin{cases} P(D|H_1\&V) = (P(D|H_1\&V)+P(D|H_2\&V)) / (1 + 10^{-E}) \\ \\ P(D|H_2\&V) = (P(D|H_1\&V)+P(D|H_2\&V)) / (1 + 10^{E}) \end{cases} \end{align*}

From the perspective of an observer interpreting a study with trust T, we can calculate the effective evidence, \hat{E}.

    \begin{align*} \Hat{E} &= (\text{effective evidence}) = \log\left( \dfrac{Pr(D | H_1)}{Pr(D | H_2)} \right) \\ &= \log\left( \frac{P(D|H_1\&V)P(V) + P(D|H_1\&\overline{V})P(\overline{V})}{P(D|H_2\&V)P(V) + P(D|H_2\&\overline{V})P(\overline{V})} \right) \\ &= \log\left( \frac{P(D|H_1\&V)\times 10^T + P(D|H_1\&\overline{V})}{P(D|H_2\&V)\times 10^T + P(D|H_2\&\overline{V})} \right) \\ &= \log\left( \frac{(P(D|H_1\&V)+P(D|H_2\&V))(1 + 10^{-E})^{-1}  10^T + P(D|H_1\&\overline{V})}{(P(D|H_1\&V)+P(D|H_2\&V))(1 + 10^{E})^{-1} 10^T + P(D|H_2\&\overline{V})} \right) \end{align*}

We define G as the resulting evidence in case the study is invalid.

    \[ G = log\left(\frac{P(D|H_1\&\overline{V})}{P(D|H_2\&\overline{V})}\right) \]

    \begin{align*} \\ \Rightarrow \begin{cases} P(D|H_1\&\overline{V}) = (P(D|H_1\&\overline{V})+P(D|H_2\&\overline{V})) / (1 + 10^{-G}) \\ \\ P(D|H_2\&\overline{V}) = (P(D|H_1\&\overline{V})+P(D|H_2\&\overline{V})) / (1 + 10^{G}) \end{cases} \end{align*}

    \[ \Rightarrow \Hat{E} &= \log\left( \dfrac{(1 + 10^{-E})^{-1}  10^T + \dfrac{P(D|H_1\&\overline{V})+P(D|H_2\&\overline{V})}{P(D|H_1\&V)+P(D|H_2\&V)}(1 + 10^{-G})^{-1}}{(1 + 10^{E})^{-1} 10^T + \dfrac{P(D|H_1\&\overline{V})+P(D|H_2\&\overline{V})}{P(D|H_1\&V)+P(D|H_2\&V)}(1 + 10^{G})^{-1}} \right) \]

Here we are going to make another simplification: G = 0. The second assumption is that if a study is invalid, it provides no evidence for or against H1 versus H2. This means P(D|H_1\&\overline{V}) = P(D|H_2\&\overline{V}) = P(D|H_{1 \lor 2}\&\overline{V}). Substituting for these values we get:

    \begin{align*} \Hat{E} &= \log\left( \dfrac{(1 + 10^{-E})^{-1}  10^T + \dfrac{P(D|H_1\&\overline{V})+P(D|H_2\&\overline{V})}{P(D|H_1\&V)+P(D|H_2\&V)} \times \dfrac{1}{2}}{(1 + 10^{E})^{-1} 10^T + \dfrac{P(D|H_1\&\overline{V})+P(D|H_2\&\overline{V})}{P(D|H_1\&V)+P(D|H_2\&V)} \times \dfrac{1}{2}} \right) \\ &= \log\left( \dfrac{(1 + 10^{-E})^{-1}  10^T + \dfrac{P(D|H_{1,2}\&\overline{V})}{P(D|H_1\&V)+P(D|H_2\&V)}}{(1 + 10^{E})^{-1} 10^T + \dfrac{P(D|H_{1,2}\&\overline{V})}{P(D|H_1\&V)+P(D|H_2\&V)}} \right) \end{align*}

And to simplify the above formula we define another term: believability. Believability (B) is defined below.

    \[ B = log \left( \dfrac{P(D|H_1\&V)+P(D|H_2\&V)}{P(D|H_{1 \lor 2} \&\overline{V})} \right) \]

Substituting B we get the following:

    \begin{align*} \Hat{E} &= \log\left( \dfrac{(1 + 10^{-E})^{-1}  10^T +10^{-B}}{(1 + 10^{E})^{-1} 10^T + 10^{-B}}  \right) = \log\left( \dfrac{(1 + 10^{-E})^{-1}  10^{(T+B)} +1}{(1 + 10^{E})^{-1} 10^{(T+B)} + 1} \right) \\ &= \log\left( \dfrac{ \left(\dfrac{10^E + 1}{10^{E}}\right)^{-1}  10^{(T+B)} +1}{(1 + 10^{E})^{-1} 10^{(T+B)} + 1} \right) = \log\left( \dfrac{ \left(\dfrac{10^{E}}{1 + 10^E}\right) 10^{(T+B)} +1}{ \dfrac{1}{1 + 10^{E}} 10^{(T+B)} + 1} \right) \\ &= \log\left( \dfrac{ 10^{E} \times 10^{(T+B)} + (1 + 10^{E})}{ 10^{(T+B)} + (1 + 10^{E})} \right) = \log\left( 10^{E} \times \dfrac{ 10^{(T+B)} + 10^{-E} + 1}{ 10^{(T+B)} + 10^{E} + 1} \right) \\ &= E + \log\left( \dfrac{ 10^{(T+B)} + 10^{-E} + 1}{ 10^{(T+B)} + 10^{E} + 1} \right) \end{align*}

It’s alright if you didn’t closely follow the math up to here. What is important is that we now have a formula for calculating effective evidence (\hat{E}) based on reported evidence (E), trust (T), and believability (B).

    \[ \Hat{E}  = E + \log\left( \dfrac{ 10^{(T+B)} + 10^{-E} + 1}{ 10^{(T+B)} + 10^{E} + 1} \right) \]

The reported evidence (E) is an objective number we get from the study. Trust (T) is a subjective quantity that the subjects interpreting the study must determine for themselves, independent of the outcome of the study. Believability is a bit more complex. Believability is a number ascribed to a particular outcome or observation, much like evidence is. But in contrast to evidence, believability cannot be determined objectively. This is because of the term P(D|H_{1,2}\&\overline{V}) which has to be determined by the interpreter; it is subjective and can vary for different people. I will write more about believability in the next part of this series. (Suffice it to say that a study can be designed to guarantee a believability of B≥0).

meaningsubjective/objectivedependence on studyrange
Evidence (E)amount of evidence provided by the study’s outcomeobjectively calculateddepends on outcomepositive (in case the data favors H1) or negative (in case it favors H2)
Trust (T)amount of trust placed in a study prior to seeing the outcomedetermined subjectivelyindependent of outcometypically a positive number between 1 and 3
Believability (B)amount of believability ascribed to the outcome of an experimentdetermined subjectively, but a lower bound can sometimes be objectively calculateddepends on outcomenegative if the outcome is an indication that the study is likely flawed. The ideal study guarantees that B≥0.

To gain a better understanding about how the above formula works, I made the following plot.

Effective evidence begins to grow linearly with respect to reported evidence. But it plateaus at (T+B). In other words, evidence is effectively limited by how much a study can be trusted plus the believability of the study’s outcome. To first approximation, the magnitude of effective evidence is roughly equal to min(|E|, T+B). This approximation is least accurate when |E| T+B or when T+B < 1.

This formalizes our intuition that no single study can be used to decisively confirm or deny a hypothesis, no matter how strong the evidence turns out to be in that study. The amount of trust one places in a study limits the amount of evidence that can be acquired from it. For example, if you place a trust of T=1.5 in the typical paper, no single study can convince you by more than 1.5 units of evidence (assuming B=0; more on believability later). You would need to add the effective evidence (\hat{E}) from multiple independent studies to establish that there is higher than 1.5 units of evidence for something. This aspect of our framework is nice, because astronomically large or small values are commonplace when working with likelihood ratios. But by accounting for trust, extremely large amounts of reported evidence are not extremely informative.

A meta analysis of multiple studies can be done by calculating the effective evidence for each study and then summing the values. 10 studies that each report 2 units of evidence will almost certainly prevail over one [conflicting] study that reports -20 units of evidence, given that no study can reasonably be trusted with T≥20. If T = 3 and B = 0, then the overall evidence in this case is is 10×2-3 = 9. (Each of the first 10 studies will have an effective evidence of 2 and the single conflicting study will have an effective evidence of -3).

Now, here is a problem that will lead us to the next part. How do we deal with believability? From the perspective of a researcher, we would like to minimize it since it effectively limits the evidence that can be deduced from a study.

If the outcome of an experiment is a continuous value, then all the probabilities in the above formulas become marginal probabilities, meaning that the denominator can get infinitesimally small. The numerator depends on the person evaluating our study and can be infinitely large for some inconveniently skeptical interpreter. So there is no limit to how negative believability can get! If believability is not dealt with in a study, there is no guarantee that an interpreter will take away any information from that study. What can be done to guarantee something like this will not happen? I will discuss this in part 3.

bookmark_borderQuantifying Evidence (1): What Are Units of Evidence?

I am going to introduce a statistical framework for quantifying evidence as a series of blog posts. My hope is that by doing it through this format, people will understand it, build on these ideas, and actually use it as a practical replacement for p-value testing. If you haven’t already seen my post on why standard statistical methods that use p-values are flawed, you can check it out through this link.

My proposal builds on Bayesian hypothesis testing. Bayesian hypothesis testing makes use of the Bayes factor, which is the likelihood ratio of observing some data D for two competing hypotheses H1 and H2. A Bayes factor larger than 1 counts as evidence in favor of hypothesis H1; a smaller than one Bayes factor counts as evidence in favor of H2.

In classical hypothesis testing, we typically set a threshold for the p-value (say, p<0.01) below which a hypothesis can be rejected. But in the Bayesian framework, no such threshold can be defined as hypothesis rejection/confirmation will depend on the prior probabilities. Prior probabilities (i.e., the probabilities assigned prior to seeing data) are subjective. One person may assign equal probabilities for H1 and H2. Another may think H1 is ten times more likely than H2. And neither can be said to be objectively correct. But the Bayesian method leaves this subjective part out of the equation, allowing anyone to multiply the Bayes factor into their own prior probability ratio to obtain a posterior probability ratio. Depending on how likely you think the hypotheses are, you may require more or less evidence in order to reject one in favor of the other.

    \[ \dfrac{Pr(H_1 | \text{data})}{Pr(H_2 | \text{data})} = \dfrac{Pr(\text{data} | H_1)}{Pr(\text{data} | H_2)} \times \dfrac{Pr(H_1)}{Pr(H_2)} \]

    \[ \text{posterior odds ratio} = \text{Bayes factor} \times \text{prior odds ratio} \]

Let us define ‘evidence‘ as the logarithm of the Bayes factor. The logarithmic scale is much more convenient to work with, as we will quickly see.

    \[ E = \text{(evidence for $H_1$ against $H_2$ given $D$)} = \log_{10}(\text{Bayes factor}) =  \log_{10}(\dfrac{Pr(D | H_1)}{Pr(D | H_2)}) \]

Evidence is a quantity that depends on a particular observation or outcome and relates two hypothesis to one another. It can be positive or negative. For example, one can say Alice’s experimental results provide 3 units of evidence in favor of hypothesis H1 against hypothesis H2, or equivalently, -3 units of evidence in favor of hypothesis H2 against hypothesis H1.

Figure 1: In this example, two hypotheses are being tested against one another. H1 is the hypothesis that the opossum is genetically closer to humpback whales than salmon is. H2 is the hypothesis that salmon is closer to humpback whales than the opossum. The data D that is being used to compare these hypotheses can be some DNA sequencing data, for instance. (The data here isn’t real).

But what does, for instance, 3 units of evidence mean? How do we interpret this number? 3 units of evidence means that it was 103=1000 times more likely to observe that particular outcome under hypothesis H1 compared to H2. And this number can be multiplied into one’s prior odds ratio to get a posterior odds ratio. If prior to seeing Alice’s data, you believed the probabiliy for H1 was half that of H2 (Pr(H1)/Pr(H2) = 0.5) then after seeing Alice’s data with 3 units of evidence, you update your probability odds ratio to Pr(H1)/Pr(H2) = 0.5×103 = 500. After seeing Alice’s data you attribute a probability to H1 that is 500 times larger than the probability you attribute to H2.

What’s nice about this definition is that evidence from independent observations can be added. This definition aligns with our colloquial usage of the term when we say “adding up” or “accumulating” evidence. So if Alice reports 3 units of evidence and Bob independently reports 2 units of evidence, it is as if we have a total of 5 units of evidence in favor H1 against H2. And if Carol then comes along with new experimental data providing 1.5 units of evidence in favor of H2 against H1 (conflicting with the other studies), the total resulting evidence is 3+2-1.5 = 3.5.

None of what I have written up to here is new. I am not even sure if my definition of evidence is entirely original. I’ve seen people use log likelihood ratios and call it evidence. But from here on is where we begin constructing something new.

It is commonly accepted that a scientific result needs to be replicated before it can be trusted. If two independent labs obtain congruent evidence for something (say Alice found 3 units of evidence and Bob found 2 units of evidence) it should count as stronger evidence than if just one of them found very strong evidence for it, (say Alice had instead found 5 units of evidence). But Bayes factors does not seem to reflect this very well since both cases are said to result in 5 units of evidence. To take this to an extreme, 10 independent studies all reporting 2 units of evidence in favor of H1 should prevail over one study reporting 20 units of evidence in favor of H2. But the way we currently set it up, they cancel each other out. How can we improve this framework to incorporate our intuition about the need for replication? I will discuss this in part 2.

bookmark_borderWhen Biology Isn’t Messy

There is a belief in biology that goes like this: Biology is messy. Nature has no interest in making things easy to understand. So for many scientific questions, there will not be a straight-forward answer.

  • Q: Where and how is a particular memory stored in the brain? A: Biology is messy; memories are distributed all over the brain and stored in many different forms: as molecules and as structure, inside neurons and in the connections between them.
  • Q: How do genes determine the number of fingers per hand? A: Biology is messy; there isn’t a single factor for it. It emerges from the interactions of numerous different genes.

Now it is not always the case that our answers are so unsatisfying. Ask a biologist how the eye works and, well, there are quite a lot of similarities to how cameras work. And living organisms aren’t messy globs of formless flesh. They have an organized body plan, with separate organs, each responsible for specific functions that we can talk about; the heart pumps blood and the lungs pump air.

What we mean when we say “biology is messy” is that biology is as messy as it can be. Sometimes, biological systems need to abide by certain principles to accomplish things. The principles of optics drove the eye to include an adjustable lens, an adjustable aperture, and a two dimensional sensor upon which images are projected.

When we humans design something we try to make them as organized and intelligible as possible. When nature designs things they usually end up being messier than they need to be. In other words: biology is messy, unless it cannot afford to be. Natural selection does not care about elegance for the sake of elegance. Nature selects for traits and features that help organisms survive and reproduce. And if most of the solutions to a problem are messy, it will most probably pick a messy strategy. But if there is a cost associated with messiness that entails a selective disadvantage, then nature will tend to select a non-messy strategy.

Here is why I usually dislike it when people say “biology is messy”. When the messiness of biology is invoked, it is usually to rationalize why we don’t have an easy to understand explanation for some biological process (as is the case for the two example questions above). But we don’t know if this is because a simple explanation does not exist, or because we simply haven’t found it yet. Everything looks messy until you really understand how it works. This is even true for man-made devices like radios or microprocessors.

A historical example of how appealing to the messiness of biology can be misleading, is the old idea that genes are proteins. This was a widespread belief among biologists until the discovery of DNA. People used to think that inheritable traits emerged from the protein composition of a cell. The collection of proteins led one cell to become a turtle and one cell to turn into a butterfly. That idea was a far messier – less elegant – solution than having genes be written into strings of nucleotide (more on this later).

There is one particular usage of the “biology is messy” statement, that can actually be productive in my view. If we believe that “biology is messy unless it cannot afford to be” then we ought to believe in its corollary. In situations where biology is not messy, it is because it cannot afford to be messy due to some fundamental principles or constraints. We often run into things in biology that look like things humans design. When this happens, it suggests that there are some principles of design that are inherent to the problem at hand. We can then search for those principles that led nature to select a conceptually clean and simple solution, similar to how the evolution of the vertebrate eye was guided by the principles of optics to function an awful lot like a camera.

Take the the discovery of mechanical gears in insects, for instance. Gears are quite common in machines designed by humans. It is a conceptually simple solution for transferring torque between unaligned axes. Gears were not known to exist in nature until the discovery of functional mechanical gears in tiny insects called planthoppers. The planthopper’s gears synchronize the two hindlegs during a jumps.

Why would nature come up with such a neat solution? Leg synchronization is possible without gears. So why does it only appear in the planthopper, but nowhere else in the animal kingdom? There must be something about its purpose that could not have been easily achieved without gears.

The answer probably lies in the fact that, due to their small size, planthoppers only have less than a millisecond to accelerate. Within a span of 0.8ms they need to accelerate at over 700g to reach speeds of nearly 5 meters per second (Burrow, 2007). This is an exceptional regime of acceleration and requires sub-millisecond coordination of leg movement. If one legs starts moving half a millisecond earlier, the other leg would barely have any time left to push against the surface.

According to Burrow & Suttons, 2013, if leg movements are slightly uncoordinated during a jump the planthopper would spin, uncontrollably, in the horizontal plane. This is because the hindlegs rotate on a nearly horizontal plane underneath the body in planthoppers (whereas hindlegs of fleas and crickets rotate on separate vertical plane alongside the body). The gears provide a method of mechanical synchronization, enabling the legs to move within roughly 30 μs of one another.

But this is not the full story and there are still many unsolved questions regarding why gears evolved. Planthoppers lose their gears after they molt into adults and yet are nearly just as good at jumping. Adult planthoppers use a friction-based method of leg synchronization. Why was that not an adequate solution for younger nymphs? Leafhoppers and froghoppers, which also rotate their hindlegs in a nearly horizontal plane, jump without using mechanical gears. Froghoppers achieve very precise movement synchronization between their hindlegs (32±22 μs, according to Burrow & Sutton 2010). (I learned from my personal correspondence with Prof Burrow that the synchronization mechanism in froghoppers is unknown but likely to be mechanical).

The point here is that the presence of gears is an indication of some principles of design that is common across natural and artificial systems. (Can you identify those principles? What can gears do that other strategies cannot fulfill?) While those design principles can be ignored in most animals, the behavioral requirements in planthopper are so stringent that they have forced those principles onto the biological scene. (I am not convinced that we have identified all those requirements, since froghoppers and adult planthoppers seem to be just fine without gears).

But if we’re dedicated to the biology is messy assumption, we should be motivated to further investigate planthopper behavior/physiology until we find something that make those gears seem indispensable. (We are already very close to finding it and the idea of it having to do with the size of the insect has already been formulated). Now, it would be quite impressive if someone had gone the opposite route: predicted the existence of gears simply from studying planthopper behavior and the physics of jumping in small creatures. Would that have been possible?

This brings us to another domain of nature that is the main reason I am bringing up all of this. DNA and RNA are polymers of nucleotides or strings made up of an alphabet of four symbols. The cell appears to treat these nucleotides as abstract symbols that can be used to represent arbitrary things. The cell represents each of the 20 amino acid using a sequences of three nucleotides called codons. Some codons represent, not an amino acid, but an instruction to “stop” at the end of a sequence. What an elegant intelligible system! How reminiscent of the way computers use strings of binary symbols to encode data and instructions!

I would like to leave you with this question: Why do all living organisms use strings of abstract symbols? Or, under the biology is messy assumption, what functions are being fulfilled by DNA and RNA that would be difficult to fulfill through messier methods? What requirements forced digital coding through strings composed of a small alphabet onto the biological scene? Hopefully, you don’t think the answer is trivial. I will let you think about it and return to this question in a future post.

bookmark_borderThere Is a Fundamental Flaw in How We Do Statistics in Science

Suppose I tell you that only 1% of people with COVID have a body temperature less than 97°. If you take someone’s temperature and measure less than 97°, what is the probability that they have COVID? If your answer is 1% you have committed the conditional probability fallacy and you have essentially done what researchers do whenever they use p-values. In reality, these inverse probabilities (i.e., probability of having COVID if you have low temperature and probability of low temperature if you have COVID) are not the same.

To put it plain and simple: in practically every situation that people use statistical significance, they commit the conditional probability fallacy.

When I first realized this it hit me like a ton of bricks. P-value testing is everywhere in research; it’s hard to find a paper without it. I knew of many criticisms of using p-values, but this problem was far more serious than anything I had heard of. The issue is not that people misuse or misinterpret p-values. It’s something deeper that strikes at the core of p-value hypothesis testing.

This flaw has been raised in the literature over and over again. But most researchers just don’t seem to know about it. I find it astonishing that a rationally flawed method continues to dominate science, medicine, and other disciplines that pride themselves in championing reason and rationality.

What Is Statistical Significance?

This is how hypothesis testing is done in today’s research. When researchers decide to test a hypothesis, they collect data — either from experiments they themselves design or from observational studies — and then test to see if their data statistically confirms the hypothesis. Now, instead of directly testing the hypothesis, they attempt to rule out what is called the null hypothesis. Think of the null hypothesis as the default.

For example if I want to show that “turmeric slows cancer”, I rule out the null hypothesis that “turmeric has no affect on cancer”. The null hypothesis can be ruled out by showing that the data is unlikely to have occurred by chance if the null hypothesis were true. In our example, it would be something like saying “it is unlikely that this many people would have recovered if turmeric has no affect on cancer“. In other words, the data is statistically significant.

To quantify statistical significance, we use a measure called the p-value. The p-value for observing some data, represents the probability of getting results as extreme as what was observed under the null hypothesis. The lower the p-value, the less likely it is for the observations to have occurred by chance, hence the more significant the observations. So, in the turmeric treatment example, I may obtain p=0.008 which means the probability of having at least that many patients recover by chance would have been 0.8% (assuming the null hypothesis: that turmeric actually has no effect). Since it is so unlikely for us to have obtained these results by chance, we say these results are significant and it is therefore reasonable to conclude that the drug does indeed affect cancer.

The standard method for p-value testing is to choose a significance threshold before looking at the data. It is typical to set the threshold at p<0.05, or sometimes p<0.01. If the p-value is less than the threshold, the data is considered to be statistically significant and the null hypothesis is rejected. Otherwise, the study is considered to be inconclusive and nothing can be said about the null hypothesis.

The Fundamental Flaw

A rational observer rules something out if its probability is low. So perhaps we can all agree that it is rational to reject the null hypothesis (H0) if its probability falls below some threshold, say 1%. Now if we gather some new data (D), what needs to be examined is the probability of the null hypothesis given that we observed this data, not the inverse! That is, Pr(H0|D) should be compared with a 1% threshold, not Pr(D|H0). In our current methods of statistical testing, we use the latter as a proxy for the former.

The conditional probability fallacy is when one confuses the probability of A given B with the probability of B given A. For example, the probability of that you are sick if you have a fever is not equal to the probability that you have a fever if you are sick; if you have a fever you are very likely sick, but if you are sick the chances of you having a fever are not equally high.

By using p-values we effectively act as though we commit the conditional probability fallacy. The two values that are conflated are Pr(H0|p<α) and Pr(p<α|H0). We conflate the chances of observing a particular outcome under a hypothesis with the chances of that hypothesis being true given that we observed that particular outcome.

Now how exactly can this fallacy lead to incorrect inference? If we overestimate the probability of the null hypothesis, well, that is not such a serious problem; the study will be declared “inconclusive”. It is a more serious problem if we underestimate the probability of the null hypothesis and reject it when we shouldn’t have.

There are two sources of irrational hypothesis rejection: 1) high prior probability for the null hypothesis and 2) low statistical power.

The Priors (or Base Rate)

One factor that can lead to irrational hypothesis rejection is the the base rate (or prior probabilities). There is an illustrating examples of this on the most recent Wikipedia entry on “Base Rate Fallacy”.

The more we test hypotheses that are unlikely to be true begin with, the higher the rate of error. For example, if the number of benign drugs that we test outnumbers the number of effective drugs, we will end up declaring too many drugs as effective.

A common argument that is raised in defense of p-values is that priors are inaccessible, subjective, and cannot be agreed upon. How does one measure the probability of a hypothesis independent of data? For example, how does one objectively find the probability of drugs effectiveness prior to any data analysis? This is a fair point. But priors are not the only factor that lead to irrational hypothesis rejection.

It’s Not Just About Priors

Statistical power is the probability of correctly rejecting the null hypothesis. Typically, the larger the sample size in a study, the higher its statistical power (i.e. the higher the chance of being able to declare statistical significance in the data). Rarely do scientists ever calculate statistical power or impose any constraints on it. Researchers worry less about statistical power because we [incorrectly] think that the only harm in low powered tests is obtaining inconclusive results. (Think of the case we test a new drug and don’t find evidence that it cures cancer while it actually does). That seems like a problem that can be fixed by collecting more data and increasing the population size in future studies.

What we really want to avoid is what is called type I errors, i.e. incorrectly rejecting the null hypothesis, not type II errors, i.e. failure to correctly reject the null hypothesis. Concluding that a drug cures cancer when it actually has no effect (which can have awful consequences) versus failure to find evidence that a drug cures cancer when if actually does (which is bad but can be corrected through more research). Since we are primarily concerned with type I errors, we impose constraints on the significance levels of our tests. So we set a 0.05 threshold for p-values if a 5% type I error rate is deemed acceptable. However, this line of reasoning is problematic. As I show below, low statistical power can lead to more frequent type I errors.

Calculating the Error Rate

Let us calculate the probability of unjustifiably rejecting the null hypothesis. Remember, a rational person sets a threshold for Pr(H0|p<0.01), not Pr(p<0.01|H0), given they consider 1% probability to be an appropriate threshold for rejecting a hypothesis.

Assume that we have designed a test with statistical significance threshold α and statistical power 1-β to reject the null hypothesis H0. What is the probability that a conclusive study commits a type I error? In other words what is the probability that H0 is true given that the data passes our significance test (p<α)? Using Bayes’ rule, we have:

    \begin{align*} % the "starred" equation environments produce no equation numbers Pr(H_0 | p<\alpha) &= \frac{Pr(p<\alpha | H_0)Pr(H_0)}{Pr(p<\alpha)} = \frac{Pr(p<\alpha | H_0)Pr(H_0)}{Pr(p<\alpha | H_0)Pr(H_0) + Pr(p<\alpha | \neg H_0)Pr(\neg H_0)} \\  & \\ &= \dfrac{1}{1 + \dfrac{Pr(p<\alpha | \neg H_0)Pr(\neg H_0)}{Pr(p<\alpha | H_0)Pr(H_0)} } = \dfrac{1}{1 + \dfrac{1-\beta}{\alpha} \dfrac{Pr(\neg H_0)}{ Pr(H_0)} } \end{align*}

Let us set the probability of committing a type I error to be less than e and rearrange the equation.

    \begin{align*} % the "starred" equation environments produce no equation numbers & Pr(H_0 | p<\alpha) = \dfrac{1}{1 + \dfrac{1-\beta}{\alpha} \dfrac{Pr(\neg H_0)}{ Pr(H_0)} }  < e \\ \\ & \implies \dfrac{\alpha}{1-\beta} \dfrac{Pr(H_0)}{ Pr(\neg H_0)} < \dfrac{e}{1-e} \\ \\ & \implies \dfrac{\text{significance threshold}}{\text{statistical power}} \times \text{odds ratio prior} < \dfrac{e}{1-e} \end{align*}

This formula makes it clear how the significance level, statistical power, and the prior odds ratio can affect the rate of irrational hypothesis rejection. Assuming a prior odds ratio of 1 and α=0.01 (meaning that p<0.01 implies statistical significance), our test must have a statistical power of at least β>99% to achieve an error rate of e<1%. That is quite a high standard for a statistical power.

The diagram below is a graphical demonstration of the formula above. The math should be sufficiently convincing, but some feel that the Bayesian philosophy towards probability may give different results than the frequentist philosophy towards probability, and p-value tests were founded on the latter. So I made the figure below to visualize it. Each scientific study is represented by a paper.

In this figure, the significance level was set at p < 0.05. But the error rate among conclusive studies is not 5%. It is 41.5%. Close to half of the conclusive studies in the diagram commit type I errors. This is despite the low significance level α=0.05 and the modest statistical power (1-β = 35%). Notice how decreasing statistical power (1-β) is like moving the dashed line between the green and blue area to the right. This will lead to a smaller green area and therefore a larger error rate. Likewise notice how increasing the base rate (i.e. testing more hypotheses that are unlikely to begin with) is like moving the horizontal dashed line down, resulting in more red, less green, and a higher rate of type I errors.

Note, the standard terminology can be very misleading. The “type I error rate” is defined as the rate of error among the cases where the null hypothesis is incorrect, rather than the rate of error among the cases where the null hypothesis was rejected. What we should be looking at is the error rate among the conclusive studies, i.e. studies that reject the null hypothesis.

Dodgy Rationalizations

It is often said that if you use p-values appropriately and do not misinterpret them, then all is fine. “Just understand what they mean and don’t draw inappropriate conclusions“. So what is a p-value supposed to mean then? How is it intended to be used?

What we call null-hypothesis testing is a hybrid between two [incompatible] approaches that are each fallacious by themselves: Fisher’s approach and the Neyman-Pearson approach.

Ronald Fisher, who popularized p-values, stated that p-values can be interpreted as “a rational and well-defined measure of reluctance to accept the hypotheses they test“. This statement is demonstrably false. Reluctance to accept a hypothesis should be measure by the probability of that hypothesis. And it is irrational to confuse inverse probabilities. Fisher interpreted p-values as a continuous measure of evidence against the null hypothesis, rather than something to be used in a test with a binary outcome.

The Neyman-Pearson approach, on the other hand, avoids assigning any interpretation to p-values. It, rather, proposes a “rule of behavior” for using p-values. One must reject the null hypothesis when its p-value fall below a predetermined threshold. They dodge the issue of inference and claim that this decision process leads to sufficiently low error rates in the long run.

But to rationalize p-values using a decision-making framework is absurd. Why would it matter what someone believes in if they behave indistinguishably from someone committing the conditional probability fallacy?

Stop Trying to Salvage P-values

The concept of p-values is nearly a hundred years old. Despite the fact that its fundamental problem has been known, measuring significance remains the dominant method for testing statistical hypothesis. It is difficult to do statistics without using p-values and even more difficult to get published without statistical analysis of data.

However, things may be changing with the rise of the replication crisis in science [1, 2]. The extent to which the replication crisis can be attributed to the use of p-values is under debate. But the awareness about it has unleashed a wave of critisism and reconsideration of p-value testing.In 2016 the American Statistical Association published a statement saying: “Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis or about the probability that random chance produced the observed data. The p-value is neither.” And in 2015 the Journal of Basic and Applied Social Psychology completely banned the use of p-values declaring that “The Null Hypothesis Significance Testing Procedure (NHSTP) is invalid” [3, 4].

Some have suggested abandoning statistical significance in favor of using continuous unthresholded p-values as a measure of evidence (Fisher’s approach) [5, 6, 7]. Others have suggested abandoning p-values as a continuous measure in favor of a dichotomous statistical significance (Neyman-Pearson’s approach) [8]. And others have suggested using more stringent thresholds for statistical significance [9]. But neither Fisher’s nor Neyman-Pearson’s approach are mathematically sound.

The mathematically sound approach is to abandon p-values, statistical significance, and null hypothesis testing all-together and to “to proceed immediately with other measures“, which is a “more radical and more difficult [proposal] but also more principled and more permanent“. (McShane et al 2018).

What alternatives do we have to p-values? Some suggest using confidence intervals to estimate effect sizes. Confidence intervals may have some advantages but they still suffer from the same fallacies (as nicely explained in Morey et al. 2016). Another alternative is to use Bayes factors as a measure for evidence. Bayesian model comparison has been around for nearly two decades but has not gained much traction, for a number of practical reasons.

The bottom line is that there is practically no correct way to use p-values. It does not matter if you understand what it means or if you frame it as a decision procedure rather than a method for inference . If you use p-values you are effectively behaving like someone that confuses conditional probabilities. Science needs a mathematically sound framework for doing statistics.

In future posts I will suggest a new simple framework for quantifying evidence. This framework is based on Bayes factors but makes a basic assumption: that every experiment has a probability of error that cannot be objectively determined. From this basic assumption a method of evidence quantification emerges that is highly reminescent of p-value testing but is 1) mathematically sound and 2) practical. (In contrast to Bayes factor, it produces numbers that are not extremely large or small).

bookmark_borderCan a finite physical device be Turing-equivalent?

If you believe in the following, I am going to try to change your mind:

“Turing machines aren’t realistic. They need infinite memory so they can’t be implemented. Any real computing device is limited in its memory capacity and, therefore, equivalent to a finite state machine.”

Cartoon of a Turing Machine by Tom Dunne 2002
Cartoon of a Turing Machine by Tom Dunne, 2002

This is a fairly commonly held view. I used to believe in it myself, but had always found it deeply unsatisfying. After all, modern-day computers have limited memory capacity but closely resemble Turing machines. And Turing’s abstract formulation arguably led to the digital revolution of the 20th century. How, then, can the Turing machine be a physically irrelevant mathematical abstraction? If all of our computers and devices are of the weaker class of computers, namely, finite state machines, why do they have to look so much like Turing machines?

This is important to clarify, especially for neuroscience and computational biology. If we think of Turing-equivalence as this abstract level of computation that is impossible to physically achieve, then we block out classical insights from the theory of computation and cannot even begin to ask the right questions. (I recently wrote a manuscript asking the question “where is life’s Turing-equivalent computer?” and showed that a set of plausible molecular operations on RNA molecules is sufficient to achieve Turing-equivalent computation in biology).

It took a good amount of reading and thinking to finally understand the meaning of Turing-equivalence. I will explain what I believe to be the only consistent way of looking at this issue. Here is the short version:

When we say Turing machines require “unbounded memory”, what we mean is that memory cannot be bounded by the systems descriptor, not that it cannot be bounded by other things such as the laws of physics or resource constraints. Turing-equivalence only requires a system in which memory usage grows, not one in which memory is infinite.

Below I explain precisely what all that means. I will try to convince you that this is the only consistent way of looking at this and that Turing, himself, shared this perspective.

What are Turing Machines and Finite State Machines?

Let me first introduce three computation systems, from weakest to strongest: look-up tables, finite state machines, and Turing machines.

A look-up table is the simplest form of computation. It consists of a table with two columns: an input column and an output column. When given an input, you find the row with the matching string in the first column and output the string in the second column of that same row. Look-up tables have no memory and are only capable of solving problems with a finite sized input set; the set of possible inputs is limited and predetermined in the table’s descriptor. For example, there is a look-up table with several thousand rows that given a tic-tac-toe board returns the optimal winning move. However, there is no description of a look-up table that can compute the remainder of an arbitrarily large number when divided by 3. Try it! Give me a table that solves the remainder problem and I can tell you a number which it is missing from its input column.

Finite state machines are a computation system one notch more powerful than look-up tables. They can deal with problems that have an infinitely large input set. The input is broken down into a string of symbols (or digits). Each symbol is fed sequentially into the machine. The machine has a number of predefined states and at every given moment it is in one of those states. The machine’s state determines how it will transition into a new state when fed a new symbol. For example, here is a graph representing a finite state machine with three states that computes a number’s remainder when divided by 3 (something a look-up tables couldn’t solve). A crucial feature of finite state machines is that (in contrast to look-up tables) the size of the input is not bounded by the descriptor.

A finite state machine that solves the "remainder by 3" problem
Descriptor of a finite state machine with three states A, B, and C that computes the remainder of a number when divided by 3. The machine starts at state A. It receives digits sequentially. When finished, A corresponds to remainder 0, B corresponds to remainder 1, and C corresponds to remainder 2.

However, there are some problems that a finite state machine cannot solve. For example, there is no descriptor of a finite state machine that can solve the following counting problem: given a string consisting of two letters X and Y (e.g. “YXXYX”) determine whether there were more Xs, or more Ys, or both letter occurred an equal number of times. Try it! Try constructing a finite state machine that counts the number of occurrences of each letter. Since you are forced to predetermine all the states in the machine’s description, it cannot count up to arbitrarily large numbers. So there will always be a string “XXYXYXXY…Y” that it will fail on.

Finally, Turing machines are the most powerful computation system. A Turing machine is like a finite state machine except that it has access to an infinitely long bidirectional memory tape divided into squares where it can read from and write symbols on to. The tape head contains a finite state machine. At every moment in time it only has access to a single square of the tape and must make a decision about what to write and which direction to move (left or write) using only what it can immediately see. Remarkably, Turing machines can solve any computable problem and simulate any other computation system. For this reason, Turing machines are called a universal computation system.

A crucial feature of Turing machines is that (in contrast to finite state machines) the number of states is not bounded by the descriptor. This is because part of what determines a Turing machine’s state is the symbols written onto the tape. There is no way of knowing how many tape squares will be used before seeing the input. Another way to say this is that the system’s memory usage can grow as a function of input size. This is not the case for finite state machines, where all the possible states are predetermined in the machine’s descriptions. (This feature is not unique to Turing-equivalent computers; computationally weaker machines like push-down automata can also expand memory usage, allowing them to solve problems like the counting problem defined above).

Turing-equivalence is as real as it gets

Firstly, it is unbounded memory, not infinite memory, that is necessary for Turing-equivalence. The infinite memory tape is only defined for convenience. At any given moment, a Turing machine is only using a finite number of squares. We can redefine Turing machines to have a finite memory tape that grows whenever the head reaches the end of the tape. (A Turing machine is also equivalent to a finite state machine with two stacks that can each grow arbitrarily large. That is another way of reformulating it to avoid an infinitely large memory tape).

Alright“, one may say, “but Turing machines still require unbounded memory. There are physical limits on how much memory of a real physical device can use. So any real physical implementation of a Turing machine will be limited in its number of states, and therefore is equivalent to a finite state machine“.

The problem with this argument is that by the same logic, no real physical device can be a finite state machine either! Because finite state machines assume unbounded input size. (Remember, unbounded input size is what distinguishes them from look-up tables). Any real physical implementation of a finite state machine will inevitably have limits on its input size. (Because, the machine will eventually run out of energy and time). Why would unbounded memory capacity be any different from unbounded energy or time? So is every physically realistic computer equivalent to a look-up table then?

Finite state machines are no less of a mathematical abstraction than Turing machines are. And both are relevant to studying the real world. You can implement a finite state machine by creating something that recruits more energy when needed. Similarly, you can implement a Turing-equivalent machine by creating something that recruits more memory space [and energy] when needed. The fact that both of these will eventually hit some limit does not make them physically unrealizable.

A physical implementation of Turing machines (with practical constraints on its memory, time, and energy) is more powerful than a physical implementation of finite state machines (with practical constraints on its time and energy). This is because they both still have the same descriptor/input/output relationships as their abstract models. There exists a Turing machine descriptor that can solve the counting problem irrespective of what the physical constraints on memory, energy and time are.

On the other hand, there is no single descriptor of a finite state machine that can solve the counting problem. You may be able to construct one that can count up to some very large number, but the descriptor of your finite state machine will have to change if the maximum input size changes.

It is important to understand that constraints on memory space are not conceptually different from constraints on time or energy. To be consistent, Turing machines and finite state machines are either both realistic models, or neither one is.

Turing machines use finite means

Part of the confusion about Turing-equivalence may have stemmed from the title of Minsky’s book “Computation: Finite and Infinite Machines“. In it he categorizes machines into those that are finite (such as finite state machines) and those that are infinite (such as Turing machines). This categorization has been widely popularized while less attention is paid toward the book’s section on “why study infinite machines?”. Minsky explains that “instead of an infinite tape, we need only an inexhaustible tape factory” and that “this picture gives a reassuringly finite picture [of Turing machines]“. Minsky recognized that the study of Turing machines is the study of “growing automata“, rather than “genuinely infinite automata“. Unfortunately, the book’s title does not convey this nuanced perspective. (Personally, I don’t find the finite/infinite categorization to be helpful at all. Finite state machines, although finite in their memory capacity, are unbounded in their number of steps and energy consumption. So they are infinite in the same sense that Turing machines are infinite).

The notion of “effective calculability” or “computability” that the early mathematicians were concerned with was intimately tied to what is practically achievable through finite means. In fact, Turing explicitly mentions in his 1936 paper that “computable numbers are [defined to be] those whose decimals are calculable by finite means“. He precluded the existence of an infinitely large alphabet, only allowed a finite number of “states of mind” for the machine, and did not allow an infinite number of steps. Only as a matter of convenience did he endow his abstract machine with an infinite memory tape. (Although, to be precise, he didn’t call it an “infinite memory tape”; rather he simply did not impose any bounds on its size and recognized that some machines are programmed to write infinitely many symbols). It was clear to everyone at the time that an infinite memory tape does not violate the condition of “finite means” as it can easily be reformulated into a growing tape.

Modern day computers are Turing-equivalent because for every computable problem, there is a valid assembly code that solves that problem. You can run code that solves prime factorization or code that solves the travelling salesman problem. The fact that your computer may run out of memory or run out of batteries for a large input is not important, because you can take the same code and run it on another computer with more resources and it will produce the correct output. No such thing would be possible if your computer operated as a finite state machine. Because, for finite state machines, the limits on memory capacity are imposed by the descriptor, not by the physical resources.

Turing-equivalence is precisely what you get when you ask the question: what is the most powerful scope of computation that finite physical devices can achieve? To consider them as infinite or physically unrealistic is deeply mistaken and obscures the classical insights from the theory of computation.