Machine learning: Chapter 1

By Julia Borden

November 25, 2018

Zak Costello believes that one day, many common chemicals will be made simply from sugars. Instead of using petroleum-based, non-renewable sources, engineered bacteria and fungi will consume sugars and convert them to biochemical products such as fuel, materials, and even flavors and fragrances. And computers will help us get there.

In fact, a version of this type of chemical production exists today. However, it is very inefficient—most engineered bacteria and fungi cannot produce a desired chemical in large enough amounts to be useful. By making subtle changes to the bacteria or fungi with genetic engineering, researchers can increase the amount of a useful biochemical product.

"The problem is it's really hard to know which genetic changes are going to result in an organism that produces a good amount of what you're trying to make," explains Costello, a postdoctoral fellow at the Joint BioEnergy Institute (JBEI) in Berkeley, CA.

Call him a genetic pessimist, but his optimism lies in the field of machine learning, which has recently been making inroads into biology.

Costello and his advisor, Dr. Hector Garcia Martin, are bioengineers interested in producing large quantities of commodity chemicals from bacteria. This is typically done by adjusting expression of genes that do the chemistry inside the bacterium to convert the sugars into a desired chemical. Some genes may be expressed more, and some less, with the goal of finding the perfect ratio to produce the optimal amount of final product.

However, much of this genetic adjustment is guesswork. “One of the main problems that we have in biology is that we can’t really predict the behaviour of biological systems” says Garcia Martin, who is the Lead for Quantitative Metabolic Modeling at JBEI.

Originally, the group tried to make their engineering more targeted by solving sets of equations traditionally used to predict chemical outputs, called kinetic models. However, it proved extremely difficult, inaccurate, and unscalable.

So they pivoted to a machine learning approach. The machine “essentially figures out those rules so we can learn the dynamic behaviour of the cell just from data,” explains Costello. The goal was to design a program that would be able to accurately predict which genetic changes to make to the bacteria to produce a desired amount of chemical product.

The computer is given samples from a high producing and a low producing strain of bacteria. As a result, it can predict which mutations will lead to a medium producing bacterial strain.

Initially, applying the technique faced some difficulties. “We don’t have the large data sets that we need,” says Garcia Martin. For the team, that meant collecting highly precise data about the total chemical makeup of the bacteria. The team manually collected samples every few hours for 72 hours. The manual nature of the work additionally limited them to only two datasets with which to train the machine: one of a high biofuel-producing strain and one of a low-producing strain. Despite the limited datasets, their goal was to use these two strains to predict which genetic mutations would result in a bacteria that produced a medium amount of biofuel.

The team was thrilled to find that the model was able to reasonably predict final product levels of a medium producing strain. The results were published in a paper in April 2018.

Costello admits, “In this paper we used the best data that’s available for this kind of work. It was produced at JBEI by a couple of grad students, basically sleeping over at work.” In the study, they demonstrate with simulated data how the machine learning algorithms make much more accurate predictions when given more training data. To collect more data, they are currently working towards incorporating robotic systems and automated data collection.

Though challenges remain in the field of engineering organisms for chemical production, machine learning has already been making big inroads into other biological fields, particularly those that generate large volumes of data. It has been used to predict DNA and RNA binding sequences, identify mutations, map cell structures, and diagnose cancers.

Garcia Martin and Costello are already looking towards the future. “I’m really interested in this idea of being able to understand the genome computationally. So being able to feed large genetic sequences to a machine learning algorithm and have it basically be able to understand what’s going on, what the performance of that particular sequence will be,” Costello says. “We’re going to have machines that understand the genetic code much better than people.”


Julia Borden is a graduate student in molecular and cell biology.

This article is part of the Fall 2018 issue.

Notice something wrong?

Please report it here.