Having been in bio for 4+ years now, I continue to marvel at how powerful evolution is. The same simple algorithm bootstrapped a bunch of carbon, nitrogen, hydrogen, phosphorus (?), and oxygen floating in hot protoplasmic soup into:
Self-replicating solar arrays that spread through tiny spores
Billions of diverse nanomachines that bind nitrogen, act like programmable molecular scissors, emit precise wavelengths of light, hack other organisms for self-replication, and translate light pulses into movement
Programmable agential materials governed by electronic gradients
AGI aka bipedal generalist meat robots who can live for nearly 100 cycles of the Earth around the sun and who have traveled to the moon in giant metallic cylinders
Why is evolution so powerful? One answer is that it's not - but human intuition fails to understand the advantages of billions of years and enormous parallelism. A less naive answer is that evolution found ways to evolve evolvability and then developed layers of modular, flexible systems to build on.
Evolution started as a purely natural phenomenon, but ever since some innovative risk taker befriended the first wolf, we have harnessed evolution to our own aims. Selective breeding was a big step towards doing so systematically and with it we produced both marvels and monstrosities. More recently, progress in molecular biology has provided us with new tools to perform directed evolution, which we’ve applied towards engineering simple organisms and proteins.
Directed evolution applies the principles of natural evolution but comes with the benefit of more precise control. Rather than crossing over indiscriminately, we can select chimeric subunits for crossover based on prior knowledge. Rather than relying purely on random hypermutation, we can use high-throughput synthesis to choose individual mutants or sets of mutants following a pre-specified probability distribution.
On the computational side, ML helps us make directed evolution more efficient. In machine learning-guided directed evolution, we replace some cycles in the lab with in silico cycles, selecting variants for the next generation’s population using scores from ML models rather than results from experiments. This can’t replace the lab step altogether (at least not yet) because models can only generalize so far out of distribution, but it makes the overall process much more efficient. Although recent, this idea has already been applied with success to enzymes, gene-delivery vectors, antibodies, and antibiotics. The antibody example is especially interesting because it takes advantage of information about survival (evolutionary plausibility) encoded in a protein sequence language model and uses that to make selection more efficient. It’s evolutionary data being compressed into learned latent optima that speed up further evolution. With continued development of better assays, the potential scaling up of direct protein sequencing, and better models of biological phenomena, machine learning-guided directed evolution will continue to improve in efficiency and become more widely applicable.
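To make the loop concrete, here is a minimal Python sketch of the idea, not any published method. Everything in it is an assumption for illustration: `lab_assay` is a toy stand-in for a wet-lab measurement (it counts matches to a hidden target sequence), `fit_surrogate` is a deliberately crude additive model standing in for a real ML model, and names like `run_mlde`, `lab_budget`, and `proposals` are hypothetical. The point is only the shape of the loop: propose many variants in silico, rank them with the model, and spend the scarce lab budget on the top few.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def lab_assay(seq, target):
    """Toy stand-in for a lab measurement: positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target))


def mutate(seq, rng):
    """Return a copy of seq with one random point substitution."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


def fit_surrogate(measured):
    """Crude additive model: mean measured fitness per (position, residue)."""
    sums, counts = {}, {}
    for seq, y in measured.items():
        for i, a in enumerate(seq):
            sums[(i, a)] = sums.get((i, a), 0.0) + y
            counts[(i, a)] = counts.get((i, a), 0) + 1
    overall = sum(measured.values()) / len(measured)

    def predict(seq):
        # Unseen (position, residue) pairs fall back to the overall mean.
        return sum(
            sums[(i, a)] / counts[(i, a)] if (i, a) in counts else overall
            for i, a in enumerate(seq)
        )

    return predict


def run_mlde(target, rounds=8, lab_budget=20, proposals=500, seed=0):
    """Run the toy loop and return the best measured fitness after each round."""
    rng = random.Random(seed)
    start = "".join(rng.choice(ALPHABET) for _ in target)
    measured = {start: lab_assay(start, target)}
    best_per_round = [measured[start]]
    for _ in range(rounds):
        predict = fit_surrogate(measured)  # retrain on all lab data so far
        parents = sorted(measured, key=measured.get, reverse=True)[:5]
        # In silico step: generate many cheap candidates from the best parents.
        candidates = {mutate(rng.choice(parents), rng) for _ in range(proposals)}
        candidates = {c for c in candidates if c not in measured}
        # Sort alphabetically first so ties in model score break reproducibly.
        ranked = sorted(sorted(candidates), key=predict, reverse=True)
        # Lab step: only the model's top picks get an actual measurement.
        for seq in ranked[:lab_budget]:
            measured[seq] = lab_assay(seq, target)
        best_per_round.append(max(measured.values()))
    return best_per_round


if __name__ == "__main__":
    print(run_mlde("MKTAYIAKQR"))
```

In a real campaign the surrogate would be something far richer (for example a protein language model), and the lab step would be the expensive part that the ranking is trying to economize.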
Even with all this progress, we continue to lag far behind evolution’s reach and power. In silico, some specific impressive capabilities have been achieved that match or exceed what natural selection produced (e.g., the breadth of knowledge of GPT-4), and on timelines and compute budgets that are tiny compared to plausible evolutionary budgets. However, in terms of biological products, our capabilities lag far behind. As a qualitative comparison, consider that the most impressive biotech inventions merely piggyback on or repurpose existing biological infrastructure:
Discovering and then modifying existing antibiotics from other microorganisms’ products (eg, penicillin, rapamycin)
Mass production of specific existing proteins through genetic editing of a handful of organisms
Triggering the human immune system with weakened or dead virus, or tricking a human body into producing fragments of a virus (mRNA vaccines)
Repurposing a bacterial immune system for genetic editing (CRISPR)
Making (admittedly impressive) synthetic analogues of luciferase
Yet none of these, nor the examples from the prior paragraph, are as impressive as a single organism. Organisms custom-designed for a task from the ground up are still impossible. As another benchmark, while many of us hope to see human life extended by tens or even hundreds of years, evolution developed us from organisms that lived a fraction of a single year. We have a long way to go.
Where are we lagging most? We’ve improved per-iteration throughput substantially through multiplexing, but gains here translate better to some areas (sequence design amenable to high-throughput synthesis and sequencing) than others (organism design). We’re unfortunately still way behind on the number of iteration cycles we perform. It took humans approximately 6 million years to evolve from our ape ancestors, which, assuming roughly 20 years per generation, means on the order of a few hundred thousand serial generations, presumably parallelized across large populations at every step. On the flip side, our bleeding edge (and definitely impressive) automated directed evolution systems pride themselves on their ability to accomplish dozens of cycles per day, amounting to approximately 60 for a week-long experiment. ML methods can definitely help here, but improving the number of iteration cycles will continue to matter. As covered by Gwern, real-world physical tasks act as a “backstop” to fast but possibly misguided in silico directions. Progress will come from finding ways to keep speeding up iteration cycles while maintaining or even boosting throughput.
Another related concept is that of external validity: making sure a given surrogate matches up with what we actually care about. For laboratory directed evolution, we’re often optimizing a proxy for the trait that matters “in the wild”. If the two diverge as we optimize the proxy, then more evolutionary optimization will lead to no better or even worse performance on the thing we actually care about. (Jack Scannell has written about this “predictive validity” problem extensively in the context of therapeutics.) This means that, in some sense, the magic is finding metrics that are easy to measure, easy to parallelize, and fast, and that also match what you care about in reality, or aggregating a number of such measures together.
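A small simulation can illustrate the gap between a proxy and the trait it stands in for. This is a toy sketch under loud assumptions: the true trait is drawn from a standard normal distribution, the proxy is the true trait plus independent Gaussian noise, and selection is a single round of keeping the top-scoring variants. The function name `simulate_selection` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_selection(n=10000, top=100, proxy_noise=1.0, seed=0):
    """Select the top variants by a noisy proxy and report how they really do."""
    rng = random.Random(seed)
    # Assumed model: the trait we care about, and a proxy that tracks it imperfectly.
    true_trait = [rng.gauss(0, 1) for _ in range(n)]
    proxy = [t + rng.gauss(0, proxy_noise) for t in true_trait]

    by_proxy = sorted(range(n), key=lambda i: proxy[i], reverse=True)[:top]
    by_true = sorted(range(n), key=lambda i: true_trait[i], reverse=True)[:top]

    return {
        # What the assay tells us about the variants we picked.
        "proxy_of_selected": statistics.mean(proxy[i] for i in by_proxy),
        # How those same variants actually perform on the real trait.
        "true_of_selected": statistics.mean(true_trait[i] for i in by_proxy),
        # The best we could have done with a perfect assay.
        "true_of_best_possible": statistics.mean(true_trait[i] for i in by_true),
    }


if __name__ == "__main__":
    for noise in (0.0, 0.5, 1.0, 2.0):
        print(noise, simulate_selection(proxy_noise=noise))
```

Comparing `true_of_selected` against `true_of_best_possible` across noise levels gives a rough feel for how much selection pressure is wasted when the proxy is a poor match. It does not capture the worse case where a proxy can be actively gamed over many rounds.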
Fortunately, compared with other bio problems, improvement here feels tractable. Much of the work towards improving throughput and iteration speed involves engineering and iterative optimization. As hard as these things are, we’re far better at them than we are at decoding complex messy systems from year+ long clinical trials and other noisy data (with their own predictive validity challenges). As far as I can tell, work here is quite under-funded relative to traditional discovery work. Surprisingly few labs work on better bio tooling in general, and even fewer focus on sculpting evolution.
While most people understand evolution’s power intellectually, we’re still not even close to harnessing it to its full potential for biological design and optimization. Much of this essay has focused on how we can continue to grow this capability, but I want to end with a different call to action.
In addition to improving our directed evolution capabilities, we need to be thinking much more creatively about how to frame problems as amenable to an evolutionary approach. As one out-there example, evolution found a way to extend life from days to a century. Can we apply that insight to our own life extension efforts? (No, I’m not proposing the Howard Foundation.) Ex vivo somatic evolution of cells or even organs sounds crazy on its face, but CAR-T also sounded crazy when it was first proposed. Maybe it’s as crazy as (rather than crazier than) trying to fix lifespan by understanding aging from first principles. Zooming out, if we assume our directed evolution capabilities will continue to improve in terms of generality, iteration speed/efficiency, and throughput, then we should be prepared with as many wild but possible application ideas as we can come up with.
If you have some, send them my way or leave a comment!
Acknowledgements: Especially big thanks to Willy C. for thorough and helpful feedback both on the original idea and execution of the post! Thanks to Eryney Marrogi for feedback as well!