Blending machine learning and biology to predict cell fates and other changes
Greta Friar | Whitehead Institute
February 1, 2022

Imagine a ball thrown in the air: it curves up, then down, tracing an arc to a point on the ground some distance away. The path of the ball can be described with a simple mathematical equation, and if you know the equation, you can figure out where the ball is going to land. Biological systems tend to be harder to forecast, but Whitehead Institute Member Jonathan Weissman, postdoc in his lab Xiaojie Qiu, and collaborators at the University of Pittsburgh School of Medicine are working on making the path taken by cells as predictable as the arc of a ball. Rather than looking at how cells move through space, they are considering how cells change with time.

Weissman, Qiu, and collaborators Jianhua Xing, professor of computational and systems biology at the University of Pittsburgh School of Medicine, and Xing lab graduate student Yan Zhang have built a machine learning framework that can define the mathematical equations describing a cell’s trajectory from one state to another, such as its development from a stem cell into one of several different types of mature cell. The framework, called dynamo, can also be used to figure out the underlying mechanisms—the specific cocktail of gene activity—driving changes in the cell. Researchers could potentially use these insights to manipulate cells into taking one path instead of another, a common goal in biomedical research and regenerative medicine.  

The researchers describe dynamo in a paper published in the journal Cell on February 1. They explain the framework’s many analytical capabilities and use it to help understand mechanisms of human blood cell production, such as why one type of blood cell forms first (appears more rapidly than others).

“Our goal is to move towards a more quantitative version of single cell biology,” Qiu says. “We want to be able to map how a cell changes in relation to the interplay of regulatory genes as accurately as an astronomer can chart a planet’s movement in relation to gravity, and then we want to understand and be able to control those changes.”

How to map a cell’s future journey

 Dynamo uses data from many individual cells to come up with its equations. The main information that it requires is how the expression of different genes in a cell changes from moment to moment. The researchers estimate this by looking at changes in the amount of RNA over time, because RNA is a measurable product of gene expression. In the same way that knowing the starting position and velocity of a ball is necessary to understand the arc it will follow, researchers use the starting levels of RNAs and how those RNA levels are changing to predict the path of the cell. However, calculating changes in the amount of RNA from single cell sequencing data is challenging, because sequencing only measures RNA once. Researchers must then use clues like RNA-being-made at the time of sequencing and equations for RNA turnover to estimate how RNA levels were changing. Qiu and colleagues had to improve on previous methods in several ways in order to get clean enough measurements for dynamo to work. In particular, they used a recently developed experimental method that tags new RNA to distinguish it from old RNA, and combined this with sophisticated mathematical modeling, to overcome limitations of older estimation approaches.

The researchers’ next challenge was to move from observing cells at discrete points in time to a continuous picture of how cells change. The difference is like switching from a map showing only landmarks to a map that shows the uninterrupted landscape, making it possible to trace the paths between landmarks. Led by Qiu and Zhang, the group used machine learning to reveal continuous functions that define these spaces. 

“There have been tremendous advances in methods for broadly profiling transcriptomes and other ‘omic’ information with single-cell resolution. The analytical tools for exploring these data, however, to date have been descriptive instead of predictive. With a continuous function, you can start to do things that weren’t possible with just accurately sampled cells at different states. For example, you can ask: if I changed one transcription factor, how is it going to change the expression of the other genes?” says Weissman, who is also a professor of biology at the Massachusetts Institute of Technology (MIT), a member of the Koch Institute for Integrative Biology Research at MIT, and an investigator of the Howard Hughes Medical Institute.

Dynamo can visualize these functions by turning them into math-based maps. The terrain of each map is determined by factors like the relative expression of key genes. A cell’s starting place on the map is determined by its current gene expression dynamics. Once you know where the cell starts, you can trace the path from that spot to find out where the cell will end up.

The researchers confirmed dynamo’s cell fate predictions by testing it against cloned cells–cells that share the same genetics and ancestry. One of two nearly-identical clones would be sequenced while the other clone went on to differentiate. Dynamo’s predictions for what would have happened to each sequenced cell matched what happened to its clone.

Moving from math to biological insight and non-trivial predictions

With a continuous function for a cell’s path over time determined, dynamo can then gain insights into the underlying biological mechanisms. Calculating derivatives of the function provides a wealth of information, for example by allowing researchers to determine the functional relationships between genes—whether and how they regulate each other. Calculating acceleration can show that a gene’s expression is growing or shrinking quickly even when its current level is low, and can be used to reveal which genes play key roles in determining a cell’s fate very early in the cell’s trajectory. The researchers tested their tools on blood cells, which have a large and branching differentiation tree. Together with blood cell expert Vijay Sankaran of Boston Children’s Hospital, the Dana-Farber Cancer Institute, Harvard Medical School, and Broad Institute of MIT and Harvard, and Eric Lander of Broad Institute, they found that dynamo accurately mapped blood cell differentiation and confirmed a recent finding that one type of blood cell, megakaryocytes, forms earlier than others. Dynamo also discovered the mechanism behind this early differentiation: the gene that drives megakaryocyte differentiation, FLI1, can self-activate, and because of this is present at relatively high levels early on in progenitor cells. This predisposes the progenitors to differentiate into megakaryocytes first.

The researchers hope that dynamo could not only help them understand how cells transition from one state to another, but also guide researchers in controlling this. To this end, dynamo includes tools to simulate how cells will change based on different manipulations, and a method to find the most efficient path from one cell state to another. These tools provide a powerful framework for researchers to predict how to optimally reprogram any cell type to another, a fundamental challenge in stem cell biology and regenerative medicine, as well as to generate hypotheses of how other genetic changes will alter cells’ fate. There are a variety of possible applications.

“If we devise a set of equations that can describe how genes within a cell regulate each other, we can computationally describe how to transform terminally differentiated cells into stem cells, or predict how a cancer cell may respond to various combinations of drugs that would be impractical to test experimentally,” Xing says.

Dynamo’s computational modeling can be used to predict the most likely path that a cell will follow when reprogramming one cell type to another, as well as the path that a cell will take after specific genetic manipulations. 

Dynamo moves beyond merely descriptive and statistical analyses of single cell sequencing data to derive a predictive theory of cell fate transitions. The dynamo toolset can provide deep insights into how cells change over time, hopefully making cells’ trajectories as predictable for researchers as the arc of a ball, and therefore also as easy to change as switching up a pitch.

The power of two

Graduate student Ellen Zhong helped biologists and mathematicians reach across departmental lines to address a longstanding problem in electron microscopy.

Saima Sidik | Department of Biology
July 1, 2021

MIT’s Hockfield Court is bordered on the west by the ultramodern Stata Center, with its reflective, silver alcoves that jut off at odd angles, and on the east by Building 68, which is a simple, window-lined, cement rectangle. At first glance, Bonnie Berger’s mathematics lab in the Stata Center and Joey Davis’s biology lab in Building 68 are as different as the buildings that house them. And yet, a recent collaboration between these two labs shows how their disciplines complement each other. The partnership started when Ellen Zhong, a graduate student from the Computational and Systems Biology (CSB) Program, decided to use a computational pattern-recognition tool called a neural network to study the shapes of molecular machines. Three years later, Zhong’s project is letting scientists see patterns that run beneath the surface of their data, and deepening their understanding of the molecules that shape life.

Zhong’s work builds on a technique from the 1970s called cryo-electron microscopy (cryo-EM), which lets researchers take high-resolution images of frozen protein complexes. Over the past decade, better microscopes and cameras have led to a “resolution revolution” in cryo-EM that’s allowed scientists to see individual atoms within proteins. But, as good as these images are, they’re still only static snapshots. In reality, many of these molecular machines are constantly changing shape and composition as cells carry out their normal functions and adjust to new situations.

Along with former Berger lab member Tristan Belper, Zhong devised software called cryoDRGN. The tool uses neural nets to combine hundreds of thousands of cryo-EM images, and shows scientists the full range of three-dimensional conformations that protein complexes can take, letting them reconstruct the proteins’ motion as they carry out cellular functions. Understanding the range of shapes that protein complexes can take helps scientists develop drugs that block viruses from entering cells, study how pests kill crops, and even design custom proteins that can cure disease. Covid-19 vaccines, for example, work partly because they include a mutated version of the virus’s spike protein that’s stuck in its active conformation, so vaccinated people produce antibodies that block the virus from entering human cells. Scientists needed to understand the variety of shapes that spike proteins can take in order to figure out how to force spike into its active conformation.

Getting off the computer and into the lab

Zhong’s interest in computational biology goes back to 2011 when, as a chemical engineering undergrad at the University of Virginia, she worked with Professor Michael Shirts to simulate how proteins fold and unfold. After college, Zhong took her skills to a company called D. E. Shaw Research, where, as a scientific programmer, she took a computational approach to studying how proteins interact with small-molecule drugs.

“The research was very exciting,” Zhong says, “but all based on computer simulations. To really understand biological systems, you need to do experiments.”

This goal of combining computation with experimentation motivated Zhong to join MIT’s CSB PhD program, where students often work with multiple supervisors to blend computational work with bench work. Zhong “rotated” in both the Davis and Berger labs, then decided to combine the Davis lab’s goal of understanding how protein complexes form with the Berger lab’s expertise in machine learning and algorithms. Davis was interested in building up the computational side of his lab, so he welcomed the opportunity to co-supervise a student with Berger, who has a long history of collaborating with biologists.

Davis himself holds a dual bachelor’s degree in computer science and biological engineering, so he’s long believed in the power of combining complementary disciplines. “There are a lot of things you can learn about biology by looking in a microscope,” he says. “But as we start to ask more complicated questions about entire systems, we’re going to require computation to manage the high-dimensional data that come back.”

Before rotating in the Davis lab, Zhong had never performed bench work before — or even touched a pipette. She was fascinated to find how streamlined some very powerful molecular biology techniques can be. Still, Zhong realized that physical limitations mean that biology is much slower when it’s done at the bench instead of on a computer. “With computational research, you can automate experiments and run them super quickly, whereas in the wet lab, you only have two hands, so you can only do one experiment at a time,” she says.

Zhong says that synergizing the two different cultures of the Davis and Berger labs is helping her become a well-rounded, adaptable scientist. Working around experimentalists in the Davis lab has shown her how much labor goes into experimental results, and also helped her to understand the hurdles that scientists face at the bench. In the Berger lab, she enjoys having coworkers who understand the challenges of computer programming.

“The key challenge in collaborating across disciplines is understanding each other’s ‘languages,’” Berger says. “Students like Ellen are fortunate to be learning both biology and computing dialects simultaneously.”

Bringing in the community

Last spring revealed another reason for biologists to learn computational skills: these tools can be used anywhere there’s a computer and an internet connection. When the Covid-19 pandemic hit, Zhong’s colleagues in the Davis lab had to wind down their bench work for a few months, and many of them filled their time at home by using cryo-EM data that’s freely available online to help Zhong test her cryoDRGN software. The difficulty of understanding another discipline’s language quickly became apparent, and Zhong spent a lot of time teaching her colleagues to be programmers. Seeing the problems that nonprogrammers ran into when they used cryoDRGN was very informative, Zhong says, and helped her create a more user-friendly interface.

Although the paper announcing cryoDRGN was just published in February, the tool created a stir as soon as Zhong posted her code online, many months prior. The cryoDRGN team thinks this is because leveraging knowledge from two disciplines let them visualize the full range of structures that protein complexes can have, and that’s something researchers have wanted to do for a long time. For example, the cryoDRGN team recently collaborated with researchers from Harvard and Washington universities to study locomotion of the single-celled organism Chlamydomonas reinhardtii. The mechanisms they uncovered could shed light on human health conditions, like male infertility, that arise when cells lose the ability to move. The team is also using cryoDRGN to study the structure of the SARS-CoV-2 spike protein, which could help scientists design treatments and vaccines to fight coronaviruses.

Zhong, Berger, and Davis say they’re excited to continue using neural nets to improve cryo-EM analysis, and to extend their computational work to other aspects of biology. Davis cited mass spectrometry as “a ripe area to apply computation.” This technique can complement cryo-EM by showing researchers the identities of proteins, how many of them are bound together, and how cells have modified them.

“Collaborations between disciplines are the future,” Berger says. “Researchers focused on a single discipline can take it only so far with existing techniques. Shining a different lens on the problem is how advances can be made.”

Zhong says it’s not a bad way to spend a PhD, either. Asked what she’d say to incoming graduate students considering interdisciplinary projects, she says: “Definitely do it.”

Harikesh S. Wong

Education

  • PhD, 2016, University of Toronto
  • BSc, 2010, Biochemistry, McMaster University

Research Summary

The immune system mounts destructive responses to protect the host from threats, including pathogens and tumors. However, a trade-off emerges: if immune responses cause too much damage, they can compromise host tissue function. Conversely, if they fail to generate sufficient damage, the host may succumb to a given threat. How is the optimal balance achieved? The Wong lab investigates how cells communicate with one another and their surrounding tissue environment to accurately control the magnitude of immune responses, both in time and space. To this end, we combine the tools of immunology with interdisciplinary methods—including high-resolution fluorescence microscopy, computational approaches, and gene manipulations—to resolve, model, and perturb the control of immune responses in intact tissues. Ultimately, we aim to understand how subtle shifts in control can lead to widely divergent host outcomes, including the successful elimination of threats, tolerance, autoimmunity, chronic infection, and cancer.

Francisco J. Sánchez-Rivera

Education

  • PhD, 2016, Biology, MIT
  • BS, 2008, Microbiology, University of Puerto Rico at Mayagüez

Research Summary

The overarching goal of the Sánchez-Rivera laboratory is to elucidate the cellular and molecular mechanisms by which genetic variation shapes normal physiology and disease, particularly in the context of cancer. To do so, we develop and apply genome engineering technologies, genetically-engineered mouse models (GEMMs), and single cell lineage tracing and omics approaches to obtain comprehensive biological pictures of disease evolution at single cell resolution. By doing so, we hope to produce actionable discoveries that could pave the way for better therapeutic strategies to treat cancer and other diseases.

Awards

  • V Foundation Award, 2022
  • Hanna H. Gray Fellowship, Howard Hughes Medical Institute, 2018-2026
  • GMTEC Postdoctoral Researcher Innovation Grant, Memorial Sloan Kettering Cancer Center, 2020-2022
  • 100 inspiring Hispanic/Latinx scientists in America, Cell Mentor/Cell Press, 2020
Olivia Corradin

Education

  • PhD, 2015, Case Western Reserve University
  • BS, 2010, Biochemistry, Marquette University

Research Summary

Our lab studies genetic and epigenetic variation that contributes to human disease by disrupting gene expression programs. We utilize biological insights into the mechanisms of gene regulation in order to determine the impact of disease-associated variants on cellular function. We aim to identify actionable insights into disease pathogenesis by studying the confluence of genetic and epigenetic risk factors of human diseases, including multiple sclerosis and opioid use disorder.

Awards

  • NIH Director’s Pioneer Award Program Avenir Award, 2017

The Davis and Berger labs combined cryo-electron microscopy and machine learning to visualize molecules in 3D.

February 4, 2021
Machine-learning model helps determine protein structures

New technique reveals many possible conformations that a protein may take.

Anne Trafton | MIT News Office
February 4, 2021

Cryo-electron microscopy (cryo-EM) allows scientists to produce high-resolution, three-dimensional images of tiny molecules such as proteins. This technique works best for imaging proteins that exist in only one conformation, but MIT researchers have now developed a machine-learning algorithm that helps them identify multiple possible structures that a protein can take.

Unlike AI techniques that aim to predict protein structure from sequence data alone, protein structure can also be experimentally determined using cryo-EM, which produces hundreds of thousands, or even millions, of two-dimensional images of protein samples frozen in a thin layer of ice. Computer algorithms then piece together these images, taken from different angles, into a three-dimensional representation of the protein in a process termed reconstruction.

In a Nature Methods paper, the MIT researchers report a new AI-based software for reconstructing multiple structures and motions of the imaged protein — a major goal in the protein science community. Instead of using the traditional representation of protein structure as electron-scattering intensities on a 3D lattice, which is impractical for modeling multiple structures, the researchers introduced a new neural network architecture that can efficiently generate the full ensemble of structures in a single model.

“With the broad representation power of neural networks, we can extract structural information from noisy images and visualize detailed movements of macromolecular machines,” says Ellen Zhong, an MIT graduate student and the lead author of the paper.

With their software, they discovered protein motions from imaging datasets where only a single static 3D structure was originally identified. They also visualized large-scale flexible motions of the spliceosome — a protein complex that coordinates the splicing of the protein coding sequences of transcribed RNA.

“Our idea was to try to use machine-learning techniques to better capture the underlying structural heterogeneity, and to allow us to inspect the variety of structural states that are present in a sample,” says Joseph Davis, the Whitehead Career Development Assistant Professor in MIT’s Department of Biology.

Davis and Bonnie Berger, the Simons Professor of Mathematics at MIT and head of the Computation and Biology group at the Computer Science and Artificial Intelligence Laboratory, are the senior authors of the study, which appears today in Nature Methods. MIT postdoc Tristan Bepler is also an author of the paper.

Visualizing a multistep process

The researchers demonstrated the utility of their new approach by analyzing structures that form during the process of assembling ribosomes — the cell organelles responsible for reading messenger RNA and translating it into proteins. Davis began studying the structure of ribosomes while a postdoc at the Scripps Research Institute. Ribosomes have two major subunits, each of which contains many individual proteins that are assembled in a multistep process.

To study the steps of ribosome assembly in detail, Davis stalled the process at different points and then took electron microscope images of the resulting structures. At some points, blocking assembly resulted in accumulation of just a single structure, suggesting that there is only one way for that step to occur. However, blocking other points resulted in many different structures, suggesting that the assembly could occur in a variety of ways.

Because some of these experiments generated so many different protein structures, traditional cryo-EM reconstruction tools did not work well to determine what those structures were.

“In general, it’s an extremely challenging problem to try to figure out how many states you have when you have a mixture of particles,” Davis says.

After starting his lab at MIT in 2017, he teamed up with Berger to use machine learning to develop a model that can use the two-dimensional images produced by cryo-EM to generate all of the three-dimensional structures found in the original sample.

In the new Nature Methods study, the researchers demonstrated the power of the technique by using it to identify a new ribosomal state that hadn’t been seen before. Previous studies had suggested that as a ribosome is assembled, large structural elements, which are akin to the foundation for a building, form first. Only after this foundation is formed are the “active sites” of the ribosome, which read messenger RNA and synthesize proteins, added to the structure.

In the new study, however, the researchers found that in a very small subset of ribosomes, about 1 percent, a structure that is normally added at the end actually appears before assembly of the foundation. To account for that, Davis hypothesizes that it might be too energetically expensive for cells to ensure that every single ribosome is assembled in the correct order.

“The cells are likely evolved to find a balance between what they can tolerate, which is maybe a small percentage of these types of potentially deleterious structures, and what it would cost to completely remove them from the assembly pathway,” he says.

Viral proteins

The researchers are now using this technique to study the coronavirus spike protein, which is the viral protein that binds to receptors on human cells and allows them to enter cells. The receptor binding domain (RBD) of the spike protein has three subunits, each of which can point either up or down.

“For me, watching the pandemic unfold over the past year has emphasized how important front-line antiviral drugs will be in battling similar viruses, which are likely to emerge in the future. As we start to think about how one might develop small molecule compounds to force all of the RBDs into the ‘down’ state so that they can’t interact with human cells, understanding exactly what the ‘up’ state looks like and how much conformational flexibility there is will be informative for drug design. We hope our new technique can reveal these sorts of structural details,” Davis says.

The research was funded by the National Science Foundation Graduate Research Fellowship Program, the National Institutes of Health, and the MIT Jameel Clinic for Machine Learning and Health. This work was supported by MIT Satori computation cluster hosted at the MGHPCC.

New gene regulation model provides insight into brain development

A well-known protein family binds to many more RNA sequences than previously thought to help neurons grow.

Raleigh McElvery
August 17, 2020

In every cell, RNA-binding proteins (RBPs) help tune gene expression and control biological processes by binding to RNA sequences. Researchers often assume that individual RBPs latch tightly to just one RNA sequence. For instance, an essential family of RBPs, the Rbfox family, was thought to bind one particular RNA sequence alone. However, it’s becoming increasingly clear that this idea greatly oversimplifies Rbfox’s vital role in development.

Members of the Rbfox family are among the best-studied RBPs and have been implicated in mammalian brain, heart, and muscle development since their discovery 25 years ago. They influence how RNA transcripts are “spliced” together to form a final RNA product, and have been associated with disorders like autism and epilepsy. But this family of RBPs is compelling for another reason as well: until recently, it was considered a classic example of predictable binding.

More often than not, it seemed, Rbfox proteins bound to a very specific sequence, or motif, of nucleotide bases, “GCAUG.” Occasionally, binding analyses hinted that Rbfox proteins might attach to other RNA sequences as well, but these findings were usually discarded. Now, a team of biologists from MIT has found that Rbfox proteins actually bind less tightly — but no less frequently — to a handful of other RNA nucleotide sequences besides GCAUG. These so-called “secondary motifs” could be key to normal brain development, and help neurons grow and assume specific roles.

“Previously, possible binding of Rbfox proteins to atypical sites had been largely ignored,” says Christopher Burge, professor of biology and the study’s senior author. “But we’ve helped demonstrate that these secondary motifs form their own separate class of binding sites with important physiological functions.”

Graduate student Bridget Begg is the first author of the study, published on August 17 in Nature Structural & Molecular Biology.

“Two-wave” regulation

After the discovery that GCAUG was the primary RNA binding site for mammalian Rbfox proteins, researchers characterized its binding in living cells using a technique called CLIP (crosslinking-immunoprecipitation). However, CLIP has several limitations. For example, it can indicate where a protein is bound, but not how much protein is bound there. It’s also hampered by some technical biases, including substantial false-negative and false-positive results.

To address these shortcomings, the Burge lab developed two complementary techniques to better quantify protein binding, this time in a test tube: RBNS (RNA Bind-n-Seq), and later, nsRBNS (RNA Bind-n-Seq with natural sequences), both of which incubate an RBP of interest with a synthetic RNA library. First author Begg performed nsRBNS with naturally-occurring mammalian RNA sequences, and identified a variety of intermediate-affinity secondary motifs that were bound in the absence of GCAUG. She then compared her own data with publicly-available CLIP results to examine the “aberrant” binding that had often been discarded, demonstrating that signals for these motifs existed across many CLIP datasets.

To probe the biological role of these motifs, Begg performed reporter assays to show that the motifs could regulate Rbfox’s RNA splicing behavior. Subsequently, computational analyses by Begg and co-author Marvin Jens using mouse neuronal data established a handful of secondary motifs that appeared to be involved in neuronal differentiation and cellular diversification.

Based on analyses of these key secondary motifs, Begg and colleagues devised a “two-wave” model. Early in development, they believe, Rbfox proteins bind predominantly to high-affinity RNA sequences like GCAUG, in order to tune gene expression. Later on, as the Rbfox concentration increases, those primary motifs become fully occupied and Rbfox additionally binds to the secondary motifs. This results in a second wave of Rbfox-regulated RNA splicing with a different set of genes.

Begg theorizes that the first wave of Rbfox proteins binds GCAUG sequences early in development, and she showed that they regulate genes involved in nerve growth, like cytoskeleton and membrane organization. The second wave appears to help neurons establish electrical and chemical signaling. In other cases, secondary motifs might help neurons specialize into different subtypes with different jobs.

John Conboy, a molecular biologist at Lawrence Berkeley National Lab and an expert in Rbfox binding, says the Burge lab’s two-wave model clearly shows how a single RBP can bind different RNA sequences — regulating splicing of distinct gene sets and influencing key processes during brain development. “This quantitative analysis of RNA-protein interactions, in a field that is often semi-quantitative at best, contributes fascinating new insights into the role of RNA splicing in cell type specification,” he says.

A binding spectrum

The researchers suspect that this two-wave model is not unique to Rbfox. “This is probably happening with many different RBPs that regulate development and other dynamic processes,” Burge says. “In the future, considering secondary motifs will help us to better understand developmental disorders and diseases, which can occur when RBPs are over- or under-expressed.”

Begg adds that secondary motifs should be incorporated into computer models that predict gene expression, in order to probe cellular behavior. “I think it’s very exciting that these more finely-tuned developmental processes, like neuronal differentiation, could be regulated by secondary motifs,” she says.

Both Begg and Burge agree it’s time to consider the entire spectrum of Rbfox binding, which are highly influenced by factors like protein concentration, binding strength, and timing. According to Begg, “Rbfox regulation is actually more complex than we sometimes give it credit for.”

Citation:
“Concentration-dependent splicing is enabled by Rbfox motifs of intermediate affinity”
Nature Structural & Molecular Biology, online August 17, 2020, DOI: 10.1038/s41594-020-0475-8
Bridget E. Begg, Marvin Jens, Peter Y. Wang, Christine M. Minor, and Christopher B. Burge

Top illustration: Some RNA-binding proteins like Rbfox (gold ellipses) help tune gene expression and control biological processes by latching onto more RNA sequences (black and gold lines) as their concentration increases (teal shading). Credit: Bridget Begg
Posted: 8.17.20
Bringing RNA into genomics

ENCODE consortium identifies RNA sequences that are involved in regulating gene expression.

Anne Trafton | MIT News Office
July 29, 2020

The human genome contains about 20,000 protein-coding genes, but the coding parts of our genes account for only about 2 percent of the entire genome. For the past two decades, scientists have been trying to find out what the other 98 percent is doing.

A research consortium known as ENCODE (Encyclopedia of DNA Elements) has made significant progress toward that goal, identifying many genome locations that bind to regulatory proteins, helping to control which genes get turned on or off. In a new study that is also part of ENCODE, researchers have now identified many additional sites that code for RNA molecules that are likely to influence gene expression.

These RNA sequences do not get translated into proteins, but act in a variety of ways to control how much protein is made from protein-coding genes. The research team, which includes scientists from MIT and several other institutions, made use of RNA-binding proteins to help them locate and assign possible functions to tens of thousands of sequences of the genome.

“This is the first large-scale functional genomic analysis of RNA-binding proteins with multiple different techniques,” says Christopher Burge, an MIT professor of biology. “With the technologies for studying RNA-binding proteins now approaching the level of those that have been available for studying DNA-binding proteins, we hope to bring RNA function more fully into the genomic world.”

Burge is one of the senior authors of the study, along with Xiang-Dong Fu and Gene Yeo of the University of California at San Diego, Eric Lecuyer of the University of Montreal, and Brenton Graveley of UConn Health.

The lead authors of the study, which appears today in Nature, are Peter Freese, a recent MIT PhD recipient in Computational and Systems Biology; Eric Van Nostrand, Gabriel Pratt, and Rui Xiao of UCSD; Xiaofeng Wang of the University of Montreal; and Xintao Wei of UConn Health.

RNA regulation

Much of the ENCODE project has thus far relied on detecting regulatory sequences of DNA using a technique called ChIP-seq. This technique allows researchers to identify DNA sites that are bound to DNA-binding proteins such as transcription factors, helping to determine the functions of those DNA sequences.

However, Burge points out, this technique won’t detect genomic elements that must be copied into RNA before getting involved in gene regulation. Instead, the RNA team relied on a technique known as eCLIP, which uses ultraviolet light to cross-link RNA molecules with RNA-binding proteins (RBPs) inside cells. Researchers then isolate specific RBPs using antibodies and sequence the RNAs they were bound to.

RBPs have many different functions — some are splicing factors, which help to cut out sections of protein-coding messenger RNA, while others terminate transcription, enhance protein translation, break down RNA after translation, or guide RNA to a specific location in the cell. Determining the RNA sequences that are bound to RBPs can help to reveal information about the function of those RNA molecules.

“RBP binding sites are candidate functional elements in the transcriptome,” Burge says. “However, not all sites of binding have a function, so then you need to complement that with other types of assays to assess function.”

The researchers performed eCLIP on about 150 RBPs and integrated those results with data from another set of experiments in which they knocked down the expression of about 260 RBPs, one at a time, in human cells. They then measured the effects of this knockdown on the RNA molecules that interact with the protein.

Using a technique developed by Burge’s lab, the researchers were also able to narrow down more precisely where the RBPs bind to RNA. This technique, known as RNA Bind-N-Seq, reveals very short sequences, sometimes containing structural motifs such as bulges or hairpins, that RBPs bind to.

Overall, the researchers were able to study about 350 of the 1,500 known human RBPs, using one or more of these techniques per protein. RNA splicing factors often have different activity depending on where they bind in a transcript, for example activating splicing when they bind at one end of an intron and repressing it when they bind the other end. Combining the data from these techniques allowed the researchers to produce an “atlas” of maps describing how each RBP’s activity depends on its binding location.

“Why they activate in one location and repress when they bind to another location is a longstanding puzzle,” Burge says. “But having this set of maps may help researchers to figure out what protein features are associated with each pattern of activity.”

Additionally, Lecuyer’s group at the University of Montreal used green fluorescent protein to tag more than 300 RBPs and pinpoint their locations within cells, such as the nucleus, the cytoplasm, or the mitochondria. This location information can also help scientists to learn more about the functions of each RBP and the RNA it binds to.

“The strength of this manuscript is in the generation of a comprehensive and multilayered dataset that can be used by the biomedical community to develop therapies targeted to specific sites on the genome using genome-editing strategies, or on the transcriptome using antisense oligonucleotides or agents that mediate RNA interference,” says Gil Ast, a professor of human molecular genetics and biochemistry at Tel Aviv University, who was not involved in the research.

Linking RNA and disease

Many research labs around the world are now using these data in an effort to uncover links between some of the RNA sequences identified and human diseases. For many diseases, researchers have identified genetic variants called single nucleotide polymorphisms (SNPs) that are more common in people with a particular disease.

“If those occur in a protein-coding region, you can predict the effects on protein structure and function, which is done all the time. But if they occur in a noncoding region, it’s harder to figure out what they may be doing,” Burge says. “If they hit a noncoding region that we identified as binding to an RBP, and disrupt the RBP’s motif, then we could predict that the SNP may alter the splicing or stability of the gene.”

Burge and his colleagues now plan to use their RNA-based techniques to generate data on additional RNA-binding proteins.

“This work provides a resource that the human genetics community can use to help identify genetic variants that function at the RNA level,” he says.

The research was funded by the National Human Genome Research Institute ENCODE Project, as well as a grant from the Fonds de Recherche de Québec-Santé.