Jiliang Tang receives NSF award
Abstract:
Life as we know it would be impossible without plants. They are a source of food, oxygen, timber, fiber, and medicine. Therefore, improving plant traits, such as yield, nutritional quality, and resilience, is crucial for sustainable production of plant products. Key to our ability to improve plants is a thorough understanding of how plant DNA controls traits. For example, corn DNA contains ~2 billion letters, and different sets of these letters affect different plant traits. But we have limited knowledge about which letters matter and how they control traits. Where we do have a good understanding of the connection between DNA and traits, that understanding is limited to a handful of model plants chosen for their relative ease of study. Thus, to gain a more complete knowledge of how plants work, we will connect DNA sequences with the traits they control using machine learning, an artificial intelligence-based approach in which computers uncover hidden patterns in a wide range of biological data. In addition, we will apply transfer learning to translate knowledge from one plant species to another so we can later transfer what we know about model plants to other species. The outcome of the project will be computer programs that can predict the connections between DNA sequence and traits and transfer information across species. Using these programs, scientists can better understand how plants work, and this knowledge can ultimately be used to create more productive and resilient plants.
The rapid growth in omics data has led to discoveries transforming plant science. However, as more genomes become available, connecting sequences to their functions globally remains challenging. Thus, our first goal is to build and validate computational models that can predict sequence functions. The second project goal is to develop and apply transfer learning to address sequence-to-function problems across species and environments. To achieve the first goal, existing multi-omics and phenotype data from four model species (Arabidopsis, maize, rice, and tomato) will be integrated with machine learning to address two sequence-to-function problems: prediction of biological process functions, such as enzyme or signaling pathway membership, and prediction of physiological and morphological phenotypes. These prediction models will be dissected using model interpretation methods to provide mechanistic insights through understanding why and how the models work. To achieve our second goal, using the same data from the target model species and addressing the same focal problems, transfer learning strategies will be developed and optimized to assess how knowledge can be best transferred across species and environments. Relatively abundant experimental data are available for the four model species we will focus on, and by holding out different amounts and types of data, a wide range of “data-poor” scenarios can be recreated and evaluated. For both project goals, the predictions will be validated with holdout experimental data independent of the data used for modeling and with new data from genetic experiments conducted for this project.
(Date Posted: 2021-06-04)