Tracing a cancer’s family tree to its roots reveals how tumors grow

Family trees of lung cancer cells reveal how cancer evolves from its earliest stages to an aggressive form capable of spreading throughout the body.

Greta Friar | Whitehead Institute
May 5, 2022

Over time, cancer cells can evolve to become resistant to treatment, more aggressive, and metastatic — capable of spreading to additional sites in the body and forming new tumors. The more of these traits that a cancer evolves, the more deadly it becomes. Researchers want to understand how cancers evolve these traits in order to prevent and treat deadly cancers, but by the time cancer is discovered in a patient, it has typically existed for years or even decades. The key evolutionary moments have come and gone unobserved.

MIT Professor Jonathan Weissman and collaborators have developed an approach to track cancer cells through the generations, allowing researchers to follow their evolutionary history. This lineage-tracing approach uses CRISPR technology to embed each cell with an inheritable and evolvable DNA barcode. Each time a cell divides, its barcode gets slightly modified. When the researchers eventually harvest the descendants of the original cells, they can compare the cells’ barcodes to reconstruct a family tree of every individual cell, just like an evolutionary tree of related species. Then researchers can use the cells’ relationships to reconstruct how and when the cells evolved important traits. Researchers have used similar approaches to follow the evolution of the virus that causes Covid-19, in order to track the origins of variants of concern.
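The idea of comparing barcodes to work out which cells are close relatives can be illustrated with a toy example. The sketch below is not the researchers' reconstruction software; the barcodes, cell names, and the simple "count shared edits" score are all hypothetical, chosen only to show why cells that share more inherited edits are likely to share a more recent ancestor.

```python
from itertools import combinations

# Hypothetical barcodes: one tuple per harvested cell, one entry per editable site.
# 0 means "unedited"; any other integer stands for a specific, heritable edit.
barcodes = {
    "cell_A": (3, 0, 7, 0),
    "cell_B": (3, 0, 7, 5),
    "cell_C": (3, 2, 0, 0),
    "cell_D": (0, 0, 0, 9),
}

def shared_edits(a, b):
    """Count sites where both cells carry the same edit (unedited sites do not count)."""
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)

def closest_relative(name):
    """Return the other cell that shares the most edits with `name`."""
    others = [n for n in barcodes if n != name]
    return max(others, key=lambda n: shared_edits(barcodes[name], barcodes[n]))

# Pairwise similarity table: more shared edits suggests a more recent common ancestor.
pair_scores = {
    (a, b): shared_edits(barcodes[a], barcodes[b])
    for a, b in combinations(sorted(barcodes), 2)
}

for pair, score in pair_scores.items():
    print(pair, score)
print("closest relative of cell_A:", closest_relative("cell_A"))
```

In this toy data, cell_A and cell_B share two edits and would sit on neighboring branches, while cell_D shares none with the others and would branch off near the root. Real reconstructions have to cope with missing data and repeated edits, which this sketch ignores.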

Weissman and collaborators have used their lineage-tracing approach before to study how metastatic cancer spreads throughout the body. In their latest work, Weissman; Tyler Jacks, the Daniel K. Ludwig Scholar and David H. Koch Professor of Biology at MIT; and computer scientist Nir Yosef, associate professor at the University of California at Berkeley and the Weizmann Institute of Science, record their most comprehensive cancer cell history to date. The research, published today in Cell, tracks lung cancer cells from the very first activation of cancer-causing mutations. This detailed tumor history reveals new insights into how lung cancer progresses and metastasizes, demonstrating the wealth of understanding that lineage tracing can provide.

“This is a new way of looking at cancer evolution with much higher resolution,” says Weissman, who is a professor of biology at MIT, a member of the Whitehead Institute for Biomedical Research, and an investigator with the Howard Hughes Medical Institute. “Previously, the critical events that cause a tumor to become life-threatening have been opaque because they are lost in a tumor’s distant past, but this gives us a window into that history.”

In order to track cancer from its very beginning, the researchers developed an approach to simultaneously trigger cancer-causing mutations in cells and start recording the cells’ history. They engineered mice such that when their lung cells were exposed to a tailor-made virus, that exposure activated a cancer-causing mutation in the Kras gene, deactivated the tumor-suppressing gene Trp53, and switched on the lineage-tracing technology. The mouse model, developed in Jacks’ lab, was also engineered so that lung cancer would develop in it very similarly to how it would in humans.

“In this model, cancer cells develop from normal cells and tumor progression occurs over an extended time in its native environment. This closely replicates what occurs in patients,” Jacks says. Indeed, the researchers’ findings closely align with data about disease progression in lung cancer patients.

The researchers let the cancer cells evolve for several months before harvesting them. They then used a computational approach developed in their previous work to reconstruct the cells’ family trees from their modified DNA barcodes. They also measured gene expression in the cells using RNA sequencing to characterize each individual cell’s state. With this information, they began to piece together how this type of lung cancer becomes aggressive and metastatic.

“Revealing the relationships between cells in a tumor is key to making sense of their gene expression profiles and gaining insight into the emergence of aggressive states,” says Yosef, who is a co-corresponding author on both the current work and the previous lineage tracing paper.

The results showed significant diversity between subpopulations of cells within the same tumor. In this model, cancer cells evolved primarily through inheritable changes to their gene expression, rather than through genetic mutations. Certain subpopulations had evolved to become more fit — better at growth and survival — and more aggressive, and over time they dominated the tumor. Genes that the researchers identified as commonly expressed in the fittest cells could be good candidates for possible therapeutic targets in future research. The researchers also discovered that metastases originated only from these groups of dominant cells, and only late in their evolution. This is different from what has been proposed for some other cancers, in which cells may gain the ability to metastasize early in their evolution. This insight could be important for cancer treatment; metastasis is often when cancers become deadly, and if researchers know which types of cancer develop the ability to metastasize in this stepwise manner, they can design interventions to stop the progression.

“In order to develop better therapies, it’s important to understand the fundamental principles that tumors adopt to develop,” says co-first author Dian Yang, a Damon Runyon Postdoctoral Fellow in Weissman’s lab. “In the future, we want to be able to look at the state of the cancer cells when a patient comes in, and be able to predict how that cancer’s going to evolve, what the risks are, and what is the best treatment to stop that evolution.”

The researchers also figured out important details of the evolutionary paths that cancer subpopulations take to become fit and aggressive. Cells evolve through different states, defined by key characteristics that the cell has at that point in time. In this cancer model the researchers found that early on, cells in a tumor quickly diversified, switching between many different states. However, once a subpopulation landed in a particularly fit and aggressive state, it stayed there, dominating the tumor from that stable state. Furthermore, the ultimately dominant cells seemed to follow one of two distinct paths through different cell states. Either of those paths could then lead to further progression that enabled cancers to enter aggressive “mesenchymal” cell states, which are linked to metastasis.

After the researchers thoroughly mapped the cancer cells’ evolutionary paths, they wondered how those paths would be affected if the cells experienced additional cancer-linked mutations, so they deactivated one of two additional tumor suppressors. One of these affected which state cells stabilized in, while the other led cells to follow a completely new evolutionary pathway to fitness.

The researchers hope that others will use their approach to study all kinds of questions about cancer evolution, and they already have a number of questions in mind for themselves. One goal is to study the evolution of therapeutic resistance, by seeing how cancers evolve in response to different treatments. Another is to study how cancer cells’ local environments shape their evolution.

“The strength of this approach is that it lets us study the evolution of cancers with fine-grained detail,” says co-first author Matthew Jones, a graduate student in the Weissman and Yosef labs. “Every time there is a shift from bulk to single-cell analysis in a technology or approach, it dramatically widens the scope of the biological insights we can attain, and I think we are seeing something like that here.”

An ‘oracle’ for predicting the evolution of gene regulation

Researchers created a mathematical framework to examine the genome and detect signatures of natural selection, deciphering the evolutionary past and future of non-coding DNA.

Raleigh McElvery
March 9, 2022

Despite the sheer number of genes that each human cell contains, these so-called “coding” DNA sequences comprise just 1% of our entire genome. The remaining 99% is made up of “non-coding” DNA — which, unlike coding DNA, does not carry the instructions to build proteins.

One vital function of this non-coding DNA, also called “regulatory” DNA, is to help turn genes on and off, controlling how much (if any) of a protein is made. Over time, as cells replicate their DNA to grow and divide, mutations often crop up in these non-coding regions — sometimes tweaking their function and changing the way they control gene expression. Many of these mutations are trivial, and some are even beneficial. Occasionally, though, they can be associated with increased risk of common diseases, such as type 2 diabetes, or more life-threatening ones, including cancer.

To better understand the repercussions of such mutations, researchers have been hard at work on mathematical maps that allow them to look at an organism’s genome, predict which genes will be expressed, and determine how that expression will affect the organism’s observable traits. These maps, called fitness landscapes, were conceptualized roughly a century ago to understand how genetic makeup influences one common measure of organismal fitness in particular: reproductive success. Early fitness landscapes were very simple, often focusing on a limited number of mutations. Much richer data sets are now available, but researchers still require additional tools to characterize and visualize such complex data. This ability would not only facilitate a better understanding of how individual genes have evolved over time, but would also help to predict what sequence and expression changes might occur in the future.

In a new study published on March 9 in Nature, a team of scientists has developed a framework for studying the fitness landscapes of regulatory DNA. They created a neural network model that, when trained on hundreds of millions of experimental measurements, was capable of predicting how changes to these non-coding sequences in yeast affected gene expression. They also devised a unique way of representing the landscapes in two dimensions, making it easy to understand the past and forecast the future evolution of non-coding sequences in organisms beyond yeast — and even design custom gene expression patterns for gene therapies and industrial applications.

“We now have an ‘oracle’ that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?” says Aviv Regev, a professor of biology at MIT (on leave), core member of the Broad Institute of Harvard and MIT (on leave), head of Genentech Research and Early Development, and the study’s senior author. “Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways. I am also excited about the possibilities for machine learning researchers interested in interpretability; they can ask their questions in reverse, to better understand the underlying biology.”

Prior to this study, many researchers had simply trained their models on known mutations (or slight variations thereof) that exist in nature. However, Regev’s team wanted to go a step further by creating their own unbiased models capable of predicting an organism’s fitness and gene expression based on any possible DNA sequence — even sequences they’d never seen before. This would also enable researchers to use such models to engineer cells for pharmaceutical purposes, including new treatments for cancer and autoimmune disorders.

To accomplish this goal, Eeshit Dhaval Vaishnav, a graduate student at MIT and co-first author, Carl de Boer, now an assistant professor at the University of British Columbia, and their colleagues created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of totally random non-coding DNA sequences into yeast, and observing how each random sequence affected gene expression. They focused on a particular subset of non-coding DNA sequences called promoters, which serve as binding sites for proteins that can switch nearby genes on or off.
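Asking a model about "all possible mutations" of a sequence amounts to a simple loop once a trained predictor exists. The sketch below is an illustration under loud assumptions: `toy_expression_model` is a hypothetical stand-in that merely counts an arbitrary motif, not the team's neural network, and the eight-letter `promoter` is invented for the example.

```python
BASES = "ACGT"

def toy_expression_model(seq):
    """Hypothetical stand-in for a trained predictor: counts occurrences of an
    arbitrary motif. A real model would be learned from experimental data."""
    motif = "TATA"
    return sum(
        1 for i in range(len(seq) - len(motif) + 1) if seq[i:i + len(motif)] == motif
    )

def single_mutants(seq):
    """Yield (position, new_base, mutant_sequence) for every one-letter change."""
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield i, new, seq[:i] + new + seq[i + 1:]

def rank_mutations(seq, model):
    """Score every single mutant and sort by predicted change from the original."""
    base = model(seq)
    scored = [(model(m) - base, pos, new) for pos, new, m in single_mutants(seq)]
    return sorted(scored, reverse=True)

promoter = "GGTATCGG"  # invented sequence, for illustration only
ranking = rank_mutations(promoter, toy_expression_model)

best_change, best_pos, best_base = ranking[0]
print(f"{len(ranking)} mutants scored; best: position {best_pos} -> {best_base} "
      f"(predicted change {best_change:+d})")
```

Any function that maps a sequence to a predicted expression level could be passed in place of the toy model, which is what makes this kind of exhaustive query cheap compared with testing every mutant at the bench.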

“This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models,” Regev says. “In the broader sense, I believe these kinds of approaches will be important for many problems — like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations, or designing new molecules.”

Regev, Vaishnav, de Boer, and their coauthors went on to test their model’s predictive abilities in a variety of ways, in order to show how it could help demystify the evolutionary past — and possible future — of certain promoters. “Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point,” Vaishnav explains.

First, to determine whether their model could help with synthetic biology applications like producing antibiotics, enzymes, and food, the researchers practiced using it to design promoters that could generate desired expression levels for any gene of interest. They then scoured other scientific papers to identify fundamental evolutionary questions, in order to see if their model could help answer them. The team even went so far as to feed their model a real-world population data set from one existing study, which contained genetic information from yeast strains around the world. In doing so, they were able to delineate thousands of years of past selection pressures that sculpted the genomes of today’s yeast.

But, in order to create a powerful tool that could probe any genome, the researchers knew they’d need to find a way to forecast the evolution of non-coding sequences even without such a comprehensive population data set. To address this goal, Vaishnav and his colleagues devised a computational technique that allowed them to plot the predictions from their framework onto a two-dimensional graph. This helped them show, in a remarkably simple manner, how any non-coding DNA sequence would affect gene expression and fitness, without needing to conduct any time-consuming experiments at the lab bench.

“One of the unsolved problems in fitness landscapes was that we didn’t have an approach for visualizing them in a way that meaningfully captured the evolutionary properties of sequences,” Vaishnav explains. “I really wanted to find a way to fill that gap, and contribute to the longstanding vision of creating a complete fitness landscape.”

Martin Taylor, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, says the study shows that artificial intelligence can not only predict the effect of regulatory DNA changes, but also reveal the underlying principles that govern millions of years of evolution.

Despite the fact that the model was trained on just a fraction of yeast regulatory DNA in a few growth conditions, he’s impressed that it’s capable of making such useful predictions about the evolution of gene regulation in mammals.

“There are obvious near-term applications, such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology,” he explains. “But extensions of this work could also help identify disease mutations in human regulatory DNA that are currently difficult to find and largely overlooked in the clinic. This work suggests there is a bright future for AI models of gene regulation trained on richer, more complex, and more diverse data sets.”

Even before the study was formally published, Vaishnav began receiving queries from other researchers hoping to use the model to devise non-coding DNA sequences for use in gene therapies.

“People have been studying regulatory evolution and fitness landscapes for decades now,” Vaishnav says. “I think our framework will go a long way in answering fundamental, open questions about the evolution and evolvability of gene regulatory DNA — and even help us design biological sequences for exciting new applications.”

New high-throughput method greatly expands view of how mutations impact cells

Broad scientists have developed a new approach for studying the functional effects of the millions of mutations associated with cancer and other diseases

Tom Ulrich | Broad Institute
January 27, 2022

There are millions of mutations and other genetic variations in cancer. Understanding which of these mutations is an impactful tumor “driver” compared to an innocuous “passenger”, and what each of the drivers does to the cancer cell, however, has been a challenging undertaking. Many studies rely on bespoke, time-consuming, gene-specific approaches that provide one-dimensional views into a given mutation’s broader functional impacts. Alternatively, computational predictions can provide functional insights, but those findings must then be confirmed through experiments.

Now, in a report published in Nature Biotechnology, a research team at the Broad Institute of MIT and Harvard has unveiled a massive-scale, high resolution method for functionally assessing large numbers of protein-coding mutations simultaneously, one that returns rich phenotypic information and which could potentially be used to study any mutation in any gene in cancer and perhaps other diseases. Their results, gained through proof-of-concept experiments with cancer cell lines, also show that individual mutations can have a spectrum of effects not only on their impacted genes but also on molecular pathways and cell state as a whole, and add nuance to the long-accepted practice of dividing cancer mutations into so-called “drivers” and “passengers.”

“When you look at the genetic data from patients’ tumors, you see that the majority of cancer-associated mutations are actually quite rare, which means we have few insights into what these mutations do,” said Jesse Boehm of the Broad’s Cancer Program, who was co-senior author of the study with Aviv Regev, a Broad core institute member now at Genentech, a member of the Roche Group. “For cancer precision medicine to become a reality, we need a firm understanding of the function of each mutation, but a major challenge has been defining an experimental approach that could be implemented in the lab at the scale required. This new method may be the tool we need.”

The new method, called single-cell expression-based variant impact phenotyping (sc-eVIP), builds on two earlier techniques: Perturb-seq, an approach developed in 2016 by Regev and colleagues for manipulating genes and exploring the consequences of those manipulations using high-throughput single-cell RNA sequencing, and eVIP, a method also developed in 2016 by Boehm and colleagues for profiling cancer variants at low scale using RNA measurements. While Perturb-seq assays originally relied on CRISPR to introduce mutations into cells, the sc-eVIP team adopted an overexpression-based approach, engineering DNA-barcoded gene constructs for each mutation of interest and introducing them into pools of cells in such a way that the cells expressed the mutated genes at higher-than-normal levels.

By then recording each perturbed cell’s expression profile using single cell RNA sequencing, the team could both identify which mutation a given cell carried (based on the constructs’ unique barcodes) and examine the mutation’s broader impact on the cell’s overall expression state. This approach provides a highly detailed view of a mutation’s impact on a variety of molecular pathways and circuits, and does not need to be adapted for each new gene studied.

“In a sense, we’re using the cell as a biosensor,” said Oana Ursu, a postdoctoral fellow in the Regev lab, formerly within the Broad’s Klarman Cell Observatory and now at Genentech, and co-first author of the study with JT Neal, a senior group leader in the Broad’s Cancer Program. “By looking at the expression changes that take place when we overexpress a mutated gene, we can learn whether it has a meaningful impact. But also, we can compare and categorize variants based on the changes they trigger, and look for patterns in the biology they affect.”

“Most of the technologies developed for interpreting coding variants up to now have been very scalable, but have had relatively simple readouts like cell viability or maybe looked at a single trait. Their information content has been low, and it takes a lot of work to optimize them,” said Neal. “With sc-eVIP, we’ve engineered a comprehensive approach that’s high throughput and information-rich, which could be a real boon for large-scale variant-to-function studies.”

To test sc-eVIP’s potential, the team chose to study TP53, the most commonly mutated gene in cancer, and KRAS, a key oncogene responsible for abnormal growth in many cancers. Neal, Ursu, and their collaborators generated constructs containing 200 known TP53 and KRAS mutations (including cancer-associated mutations and control mutations known to leave gene function unaffected), introduced them into 300,000 lung cancer cells, and captured each cell’s individual expression profile. Based on those profiles, the team categorized each mutation as either “wildtype-like” (that is, effectively functionally indistinguishable from the unmutated gene) or “putatively impactful,” from there further defining mutations based on whether they reduced or enhanced the gene’s function.
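The logic of sorting variants by how far their expression profiles sit from the unmutated gene can be sketched in a few lines. This is a deliberately simplified illustration, not the published sc-eVIP analysis: the cells, the three-gene profiles, the `threshold` value, and the use of a plain Euclidean distance between group means are all assumptions made for the example.

```python
from collections import defaultdict
from math import dist
from statistics import fmean

# Hypothetical toy data: (construct barcode, expression of three genes) per cell.
cells = [
    ("WT",    (1.0, 2.0, 0.5)),
    ("WT",    (1.2, 1.8, 0.5)),
    ("var_1", (1.1, 2.1, 0.4)),
    ("var_1", (0.9, 1.9, 0.6)),
    ("var_2", (3.0, 0.2, 2.5)),
    ("var_2", (3.2, 0.4, 2.3)),
]

def mean_profiles(cells):
    """Group cells by barcode and average each gene's expression within a group."""
    grouped = defaultdict(list)
    for barcode, profile in cells:
        grouped[barcode].append(profile)
    return {
        barcode: tuple(fmean(column) for column in zip(*profiles))
        for barcode, profiles in grouped.items()
    }

def classify(cells, threshold=1.0):
    """Label each variant by its distance from the wild-type mean profile.
    The threshold is arbitrary; a real analysis would use a statistical test."""
    means = mean_profiles(cells)
    wild_type = means["WT"]
    return {
        barcode: "wildtype-like" if dist(mean, wild_type) < threshold
        else "putatively impactful"
        for barcode, mean in means.items()
        if barcode != "WT"
    }

calls = classify(cells)
print(calls)
```

Here the barcode does the bookkeeping (which construct each cell received), and the expression profile does the measuring, which mirrors the division of labor described above.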

The profiles also revealed each mutation’s broader impact on cell state, based on how the activity of a variety of pathways changed across single cells. For instance, the sc-eVIP data revealed KRAS mutations that fall along a continuum in how they impact cell state at the population level, from having no impact to influencing subtle shifts in cellular abundances to causing outright activation or repression of key pathways in a majority of cells. These findings suggest that different mutations within the same gene can influence cell state along a spectrum of impact.

“The cancer community has long embraced a binary conceptual framework of ‘driver’ mutations, ones that promote cancer development and progression, versus ‘passenger’ mutations, which are completely inert and just happened to arise along the way,” Boehm noted. “These initial findings suggest that biologically those categories are likely overly simplistic, that there’s actually a continuum of functional impact from inert to completely tumorigenic.”

While the team focused on cancer-associated genes and mutations for this study, they noted that sc-eVIP is gene-agnostic and highly scalable, and that using single-cell RNA sequencing as a readout offers an efficient and generalizable approach to producing rich phenotypic data. They also calculated that it should be possible to thoroughly characterize most mutations with only 20 to a few hundred cells each. Based on those numbers, it may be possible with sc-eVIP to generate a first-draft functional map of more than 2 million variants in approximately 200 known cancer genes with 71 million cells.
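A quick back-of-the-envelope check shows how those figures fit together. The calculation below uses only the round numbers quoted above and assumes, purely for illustration, that cells and variants are spread evenly; because the article says "more than" 2 million variants, the per-variant average is best read as an upper bound.

```python
variants = 2_000_000       # "more than 2 million variants" (lower bound)
genes = 200                # "approximately 200 known cancer genes"
cells = 71_000_000         # "71 million cells"

cells_per_variant = cells / variants   # average, if cells were spread evenly
variants_per_gene = variants / genes   # average, if variants were spread evenly

print(f"about {cells_per_variant:.1f} cells per variant")
print(f"about {variants_per_gene:,.0f} variants per gene")
```

An average of roughly 35 cells per variant sits inside the "20 to a few hundred" range the team estimated, which is why the 71 million figure counts as a first draft and not an exhaustive characterization.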

“If we can map where every cancer-associated variant fits on the continuum of impact in a variety of cancers and cell types,” Boehm said, “we’ll have a much better grasp of how the interplay of variants affects cell state, which in turn affects cancer development, growth, and response. Such knowledge would represent a true advance toward cancer precision medicine.”

Support for this study came from the National Cancer Institute, the National Human Genome Research Institute, the Mark Foundation for Cancer Research, the Howard Hughes Medical Institute, the Broadnext10 and Variant to Function programs and the Klarman Cell Observatory at the Broad Institute, and other sources.

Paper(s) cited:

Ursu O, Neal JT, et al. Massively parallel phenotyping of coding variants in cancer with Perturb-seq. Nature Biotechnology. Online January 20, 2022. DOI:10.1038/s41587-021-01160-7.

Blending machine learning and biology to predict cell fates and other changes

Greta Friar | Whitehead Institute
February 1, 2022

Imagine a ball thrown in the air: it curves up, then down, tracing an arc to a point on the ground some distance away. The path of the ball can be described with a simple mathematical equation, and if you know the equation, you can figure out where the ball is going to land. Biological systems tend to be harder to forecast, but Whitehead Institute Member Jonathan Weissman, postdoc in his lab Xiaojie Qiu, and collaborators at the University of Pittsburgh School of Medicine are working on making the path taken by cells as predictable as the arc of a ball. Rather than looking at how cells move through space, they are considering how cells change with time.

Weissman, Qiu, and collaborators Jianhua Xing, professor of computational and systems biology at the University of Pittsburgh School of Medicine, and Xing lab graduate student Yan Zhang have built a machine learning framework that can define the mathematical equations describing a cell’s trajectory from one state to another, such as its development from a stem cell into one of several different types of mature cell. The framework, called dynamo, can also be used to figure out the underlying mechanisms—the specific cocktail of gene activity—driving changes in the cell. Researchers could potentially use these insights to manipulate cells into taking one path instead of another, a common goal in biomedical research and regenerative medicine.  

The researchers describe dynamo in a paper published in the journal Cell on February 1. They explain the framework’s many analytical capabilities and use it to help understand mechanisms of human blood cell production, such as why one type of blood cell forms first (appears more rapidly than others).

“Our goal is to move towards a more quantitative version of single cell biology,” Qiu says. “We want to be able to map how a cell changes in relation to the interplay of regulatory genes as accurately as an astronomer can chart a planet’s movement in relation to gravity, and then we want to understand and be able to control those changes.”

How to map a cell’s future journey

Dynamo uses data from many individual cells to come up with its equations. The main information that it requires is how the expression of different genes in a cell changes from moment to moment. The researchers estimate this by looking at changes in the amount of RNA over time, because RNA is a measurable product of gene expression. In the same way that knowing the starting position and velocity of a ball is necessary to understand the arc it will follow, researchers use the starting levels of RNAs and how those RNA levels are changing to predict the path of the cell. However, calculating changes in the amount of RNA from single cell sequencing data is challenging, because sequencing only measures RNA once. Researchers must then use clues, such as the RNA that was being made at the time of sequencing, along with equations for RNA turnover, to estimate how RNA levels were changing. Qiu and colleagues had to improve on previous methods in several ways in order to get clean enough measurements for dynamo to work. In particular, they used a recently developed experimental method that tags new RNA to distinguish it from old RNA, and combined this with sophisticated mathematical modeling, to overcome limitations of older estimation approaches.

The researchers’ next challenge was to move from observing cells at discrete points in time to a continuous picture of how cells change. The difference is like switching from a map showing only landmarks to a map that shows the uninterrupted landscape, making it possible to trace the paths between landmarks. Led by Qiu and Zhang, the group used machine learning to reveal continuous functions that define these spaces. 

“There have been tremendous advances in methods for broadly profiling transcriptomes and other ‘omic’ information with single-cell resolution. The analytical tools for exploring these data, however, to date have been descriptive instead of predictive. With a continuous function, you can start to do things that weren’t possible with just accurately sampled cells at different states. For example, you can ask: if I changed one transcription factor, how is it going to change the expression of the other genes?” says Weissman, who is also a professor of biology at the Massachusetts Institute of Technology (MIT), a member of the Koch Institute for Integrative Cancer Research at MIT, and an investigator of the Howard Hughes Medical Institute.

Dynamo can visualize these functions by turning them into math-based maps. The terrain of each map is determined by factors like the relative expression of key genes. A cell’s starting place on the map is determined by its current gene expression dynamics. Once you know where the cell starts, you can trace the path from that spot to find out where the cell will end up.
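Tracing a path across such a map can be pictured as repeatedly asking "which way is the cell heading right now?" and taking a small step in that direction. The sketch below is not dynamo's code. It assumes a hypothetical two-gene system in which each gene represses the other (the `velocity` function and its parameters are invented), and it uses simple fixed-step Euler integration to follow a cell from its starting state to its eventual fate.

```python
def velocity(state, a=4.0, n=2):
    """Hypothetical vector field: two genes that repress each other.
    Returns the rate of change of each gene's expression at `state`."""
    x, y = state
    return (a / (1 + y**n) - x, a / (1 + x**n) - y)

def trace_path(start, dt=0.01, steps=5000):
    """Follow the vector field from `start` with fixed-step Euler integration."""
    path = [start]
    x, y = start
    for _ in range(steps):
        vx, vy = velocity((x, y))
        x, y = x + dt * vx, y + dt * vy
        path.append((x, y))
    return path

def fate(start):
    """Name the end state by which gene ends up dominant."""
    x, y = trace_path(start)[-1]
    return "fate_X" if x > y else "fate_Y"

for start in [(1.5, 1.0), (1.0, 1.5)]:
    end = trace_path(start)[-1]
    print(start, "->", fate(start), f"(ends near {end[0]:.2f}, {end[1]:.2f})")
```

Two cells that start close together but on opposite sides of the dividing line end up in different states, which is the kind of prediction a continuous map makes possible and a set of isolated snapshots does not.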

The researchers confirmed dynamo’s cell fate predictions by testing it against cloned cells, which share the same genetics and ancestry. One of two nearly identical clones would be sequenced while the other clone went on to differentiate. Dynamo’s predictions for what would have happened to each sequenced cell matched what happened to its clone.

Moving from math to biological insight and non-trivial predictions

With a continuous function for a cell’s path over time determined, dynamo can then gain insights into the underlying biological mechanisms. Calculating derivatives of the function provides a wealth of information, for example by allowing researchers to determine the functional relationships between genes—whether and how they regulate each other. Calculating acceleration can show that a gene’s expression is growing or shrinking quickly even when its current level is low, and can be used to reveal which genes play key roles in determining a cell’s fate very early in the cell’s trajectory. The researchers tested their tools on blood cells, which have a large and branching differentiation tree. Together with blood cell expert Vijay Sankaran of Boston Children’s Hospital, the Dana-Farber Cancer Institute, Harvard Medical School, and Broad Institute of MIT and Harvard, and Eric Lander of Broad Institute, they found that dynamo accurately mapped blood cell differentiation and confirmed a recent finding that one type of blood cell, megakaryocytes, forms earlier than others. Dynamo also discovered the mechanism behind this early differentiation: the gene that drives megakaryocyte differentiation, FLI1, can self-activate, and because of this is present at relatively high levels early on in progenitor cells. This predisposes the progenitors to differentiate into megakaryocytes first.
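One way to see how derivatives expose regulatory relationships is to build the table of partial derivatives (the Jacobian) of a vector field: entry (i, j) says how the rate of change of gene i responds to a small increase in gene j. The sketch below is an illustration only. The two-gene `velocity` function is hypothetical, with gene 1 written to reinforce its own production in loose analogy to the self-activation described for FLI1, and the derivatives are estimated numerically by central differences instead of with any dynamo function.

```python
def velocity(state):
    """Hypothetical two-gene vector field (not fitted to any data)."""
    g1, g2 = state
    dg1 = g1**2 / (1 + g1**2) - 0.3 * g1   # gene 1 promotes its own production
    dg2 = 0.8 * g1 - g2                    # gene 2 is switched on by gene 1
    return [dg1, dg2]

def jacobian(f, state, eps=1e-5):
    """Estimate J[i][j] = d f_i / d state_j by central differences."""
    n = len(state)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up, down = list(state), list(state)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)
        for i in range(n):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return J

J = jacobian(velocity, [1.0, 0.5])

labels = ["gene1", "gene2"]
for i, row in enumerate(J):
    for j, value in enumerate(row):
        print(f"effect of {labels[j]} on rate of {labels[i]}: {value:+.3f}")
```

In this toy system the positive gene1-on-gene1 entry flags self-reinforcement, the positive gene1-on-gene2 entry flags activation, and the zero gene2-on-gene1 entry indicates no regulation in that direction. The signs and sizes depend on the state at which the derivatives are taken, so the same pair of genes can look different at different points along a trajectory.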

The researchers hope that dynamo can not only help them understand how cells transition from one state to another, but also guide them in controlling those transitions. To this end, dynamo includes tools to simulate how cells will change based on different manipulations, and a method to find the most efficient path from one cell state to another. These tools provide a powerful framework for researchers to predict how to optimally reprogram any cell type to another, a fundamental challenge in stem cell biology and regenerative medicine, as well as to generate hypotheses of how other genetic changes will alter cells’ fate. There are a variety of possible applications.

“If we devise a set of equations that can describe how genes within a cell regulate each other, we can computationally describe how to transform terminally differentiated cells into stem cells, or predict how a cancer cell may respond to various combinations of drugs that would be impractical to test experimentally,” Xing says.

Dynamo’s computational modeling can be used to predict the most likely path that a cell will follow when reprogramming one cell type to another, as well as the path that a cell will take after specific genetic manipulations. 

Dynamo moves beyond merely descriptive and statistical analyses of single cell sequencing data to derive a predictive theory of cell fate transitions. The dynamo toolset can provide deep insights into how cells change over time, hopefully making cells’ trajectories as predictable for researchers as the arc of a ball, and therefore also as easy to change as switching up a pitch.

The power of two

Graduate student Ellen Zhong helped biologists and mathematicians reach across departmental lines to address a longstanding problem in electron microscopy.

Saima Sidik | Department of Biology
July 1, 2021

MIT’s Hockfield Court is bordered on the west by the ultramodern Stata Center, with its reflective, silver alcoves that jut off at odd angles, and on the east by Building 68, which is a simple, window-lined, cement rectangle. At first glance, Bonnie Berger’s mathematics lab in the Stata Center and Joey Davis’s biology lab in Building 68 are as different as the buildings that house them. And yet, a recent collaboration between these two labs shows how their disciplines complement each other. The partnership started when Ellen Zhong, a graduate student from the Computational and Systems Biology (CSB) Program, decided to use a computational pattern-recognition tool called a neural network to study the shapes of molecular machines. Three years later, Zhong’s project is letting scientists see patterns that run beneath the surface of their data, and deepening their understanding of the molecules that shape life.

Zhong’s work builds on a technique from the 1970s called cryo-electron microscopy (cryo-EM), which lets researchers take high-resolution images of frozen protein complexes. Over the past decade, better microscopes and cameras have led to a “resolution revolution” in cryo-EM that’s allowed scientists to see individual atoms within proteins. But, as good as these images are, they’re still only static snapshots. In reality, many of these molecular machines are constantly changing shape and composition as cells carry out their normal functions and adjust to new situations.

Along with former Berger lab member Tristan Bepler, Zhong devised software called cryoDRGN. The tool uses neural nets to combine hundreds of thousands of cryo-EM images, and shows scientists the full range of three-dimensional conformations that protein complexes can take, letting them reconstruct the proteins’ motion as they carry out cellular functions. Understanding the range of shapes that protein complexes can take helps scientists develop drugs that block viruses from entering cells, study how pests kill crops, and even design custom proteins that can cure disease. Covid-19 vaccines, for example, work partly because they include a mutated version of the virus’s spike protein that’s stuck in its active conformation, so vaccinated people produce antibodies that block the virus from entering human cells. Scientists needed to understand the variety of shapes that spike proteins can take in order to figure out how to force spike into its active conformation.

Getting off the computer and into the lab

Zhong’s interest in computational biology goes back to 2011 when, as a chemical engineering undergrad at the University of Virginia, she worked with Professor Michael Shirts to simulate how proteins fold and unfold. After college, Zhong took her skills to a company called D. E. Shaw Research, where, as a scientific programmer, she took a computational approach to studying how proteins interact with small-molecule drugs.

“The research was very exciting,” Zhong says, “but all based on computer simulations. To really understand biological systems, you need to do experiments.”

This goal of combining computation with experimentation motivated Zhong to join MIT’s CSB PhD program, where students often work with multiple supervisors to blend computational work with bench work. Zhong “rotated” in both the Davis and Berger labs, then decided to combine the Davis lab’s goal of understanding how protein complexes form with the Berger lab’s expertise in machine learning and algorithms. Davis was interested in building up the computational side of his lab, so he welcomed the opportunity to co-supervise a student with Berger, who has a long history of collaborating with biologists.

Davis himself holds a dual bachelor’s degree in computer science and biological engineering, so he’s long believed in the power of combining complementary disciplines. “There are a lot of things you can learn about biology by looking in a microscope,” he says. “But as we start to ask more complicated questions about entire systems, we’re going to require computation to manage the high-dimensional data that come back.”

Before rotating in the Davis lab, Zhong had never performed bench work or even touched a pipette. She was fascinated to find how streamlined some very powerful molecular biology techniques can be. Still, Zhong realized that physical limitations mean that biology is much slower when it’s done at the bench instead of on a computer. “With computational research, you can automate experiments and run them super quickly, whereas in the wet lab, you only have two hands, so you can only do one experiment at a time,” she says.

Zhong says that synergizing the two different cultures of the Davis and Berger labs is helping her become a well-rounded, adaptable scientist. Working around experimentalists in the Davis lab has shown her how much labor goes into experimental results, and also helped her to understand the hurdles that scientists face at the bench. In the Berger lab, she enjoys having coworkers who understand the challenges of computer programming.

“The key challenge in collaborating across disciplines is understanding each other’s ‘languages,’” Berger says. “Students like Ellen are fortunate to be learning both biology and computing dialects simultaneously.”

Bringing in the community

Last spring revealed another reason for biologists to learn computational skills: these tools can be used anywhere there’s a computer and an internet connection. When the Covid-19 pandemic hit, Zhong’s colleagues in the Davis lab had to wind down their bench work for a few months, and many of them filled their time at home by using cryo-EM data that’s freely available online to help Zhong test her cryoDRGN software. The difficulty of understanding another discipline’s language quickly became apparent, and Zhong spent a lot of time teaching her colleagues to be programmers. Seeing the problems that nonprogrammers ran into when they used cryoDRGN was very informative, Zhong says, and helped her create a more user-friendly interface.

Although the paper announcing cryoDRGN was just published in February, the tool created a stir as soon as Zhong posted her code online, many months prior. The cryoDRGN team thinks this is because leveraging knowledge from two disciplines let them visualize the full range of structures that protein complexes can have, and that’s something researchers have wanted to do for a long time. For example, the cryoDRGN team recently collaborated with researchers from Harvard and Washington universities to study locomotion of the single-celled organism Chlamydomonas reinhardtii. The mechanisms they uncovered could shed light on human health conditions, like male infertility, that arise when cells lose the ability to move. The team is also using cryoDRGN to study the structure of the SARS-CoV-2 spike protein, which could help scientists design treatments and vaccines to fight coronaviruses.

Zhong, Berger, and Davis say they’re excited to continue using neural nets to improve cryo-EM analysis, and to extend their computational work to other aspects of biology. Davis cited mass spectrometry as “a ripe area to apply computation.” This technique can complement cryo-EM by showing researchers the identities of proteins, how many of them are bound together, and how cells have modified them.

“Collaborations between disciplines are the future,” Berger says. “Researchers focused on a single discipline can take it only so far with existing techniques. Shining a different lens on the problem is how advances can be made.”

Zhong says it’s not a bad way to spend a PhD, either. Asked what she’d say to incoming graduate students considering interdisciplinary projects, she says: “Definitely do it.”

Harikesh S. Wong

Education

  • PhD, 2016, University of Toronto
  • BSc, 2010, Biochemistry, McMaster University

Research Summary

The immune system mounts destructive responses to protect the host from threats, including pathogens and tumors. However, a trade-off emerges: if immune responses cause too much damage, they can compromise host tissue function. Conversely, if they fail to generate sufficient damage, the host may succumb to a given threat. How is the optimal balance achieved? The Wong lab investigates how cells communicate with one another and their surrounding tissue environment to accurately control the magnitude of immune responses, both in time and space. To this end, we combine the tools of immunology with interdisciplinary methods—including high-resolution fluorescence microscopy, computational approaches, and gene manipulations—to resolve, model, and perturb the control of immune responses in intact tissues. Ultimately, we aim to understand how subtle shifts in control can lead to widely divergent host outcomes, including the successful elimination of threats, tolerance, autoimmunity, chronic infection, and cancer.

Francisco J. Sánchez-Rivera

Education

  • PhD, 2016, Biology, MIT
  • BS, 2008, Microbiology, University of Puerto Rico at Mayagüez

Research Summary

The overarching goal of the Sánchez-Rivera laboratory is to elucidate the cellular and molecular mechanisms by which genetic variation shapes normal physiology and disease, particularly in the context of cancer. To do so, we develop and apply genome engineering technologies, genetically-engineered mouse models (GEMMs), and single cell lineage tracing and omics approaches to obtain comprehensive biological pictures of disease evolution at single cell resolution. By doing so, we hope to produce actionable discoveries that could pave the way for better therapeutic strategies to treat cancer and other diseases.

Awards

  • V Foundation Award, 2022
  • Hanna H. Gray Fellowship, Howard Hughes Medical Institute, 2018-2026
  • GMTEC Postdoctoral Researcher Innovation Grant, Memorial Sloan Kettering Cancer Center, 2020-2022
  • 100 inspiring Hispanic/Latinx scientists in America, Cell Mentor/Cell Press, 2020

Olivia Corradin

Education

  • PhD, 2015, Case Western Reserve University
  • BS, 2010, Biochemistry, Marquette University

Research Summary

Our lab studies genetic and epigenetic variation that contributes to human disease by disrupting gene expression programs. We utilize biological insights into the mechanisms of gene regulation in order to determine the impact of disease-associated variants on cellular function. We aim to identify actionable insights into disease pathogenesis by studying the confluence of genetic and epigenetic risk factors of human diseases, including multiple sclerosis and opioid use disorder.

Awards

  • NIH Director’s Pioneer Award Program Avenir Award, 2017

The Davis and Berger labs combined cryo-electron microscopy and machine learning to visualize molecules in 3D.

February 4, 2021