Part of a gene is better than none when it comes to identifying a microbial species. But for Rice University’s computer scientists, one part was far from enough to develop a program to identify every species in a microbiome.
Emu, their microbial community profiling software, effectively identifies bacterial species using long DNA sequences that span the full length of the gene under study.
The Emu project, led by computer scientist Todd Treangen and graduate student Kristen Curry of Rice’s George R. Brown School of Engineering, facilitates the analysis of a key gene that microbiome researchers use to sort out which types of bacteria are harmful, or helpful, to humans and the environment.
Their target, 16S, is a subunit of the rRNA (ribosomal ribonucleic acid) gene, whose use for classifying organisms was pioneered by Carl Woese in 1977. The gene is highly conserved in bacteria and archaea but also contains variable regions that are crucial for separating different genera and species.
“It’s commonly used for microbiome analysis because it’s present in all bacteria and most archaea,” said Curry, in her third year in the Treangen group. “Because of this, there are regions that have been conserved over the years that make it easier to target. With DNA sequencing, parts of it have to be the same in all bacteria so we know what to look for, and then parts have to be different so we can distinguish bacteria.”
The study, by the Rice team with collaborators in Germany and at the Houston Methodist Research Institute, Baylor College of Medicine and Texas Children’s Hospital, appears in the journal Nature Methods.
“Years ago, we tended to focus on bad bacteria, or what we thought was bad, and we didn’t really care about the others,” Curry said. “But in the last 20 years there’s been a shift where we think maybe some of these other bacteria hanging around mean something. This is what we call the microbiome, all the microscopic organisms in an environment. Commonly studied environments include water, soil and the intestinal tract, and microbes have been shown to affect plants, carbon sequestration and human health.”
Emu, whose name derives from its “expectation-maximization” approach, analyzes full-length 16S sequences from bacteria processed by a portable Oxford Nanopore MinION sequencer. It uses sophisticated error correction to identify species based on nine distinct “hypervariable regions.”
“With previous technology, we could only read part of the 16S gene,” Curry explained. “It’s about 1,500 base pairs, and with short-read sequencing you can only sequence up to 25% to 30% of that gene. However, you really need the full-length gene to achieve species-level precision.”
But even the latest technology isn’t perfect, so mistakes can creep into sequences.
“Although error rates have fallen in recent years, a single DNA sequence can still contain up to 10% errors, while species can be separated by only a handful of differences in their 16S gene,” said Treangen, an assistant professor of computer science who specializes in tracking infectious diseases. “Distinguishing sequencing errors from true differences was the main computational challenge of this research project.
“One problem is that a lot of the errors are non-random, which means they can appear repeatedly at certain positions and then start looking like real differences rather than sequencing errors,” he said.
“Another problem is that a given sample can contain thousands of bacterial species, creating a complex mix of microbes, some of which can be present at abundances well below the sequencing error rate,” Treangen said. “This means we cannot simply rely on ad hoc cutoffs to distinguish signal from error.”
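To put the quoted figures side by side, here is a small back-of-the-envelope calculation. The 1,500-base-pair gene length and the 10% error ceiling come from the researchers' remarks. The count of five true differences is only an assumed stand-in for "a handful," and the function name `error_vs_signal` is made up for this illustration.

```python
def error_vs_signal(gene_length_bp, error_rate, true_differences):
    """Compare expected sequencing errors with true differences in one read."""
    # Expected number of erroneous positions in a single full-length read.
    expected_errors = gene_length_bp * error_rate
    # How many erroneous positions there are for each genuine difference.
    return expected_errors, expected_errors / true_differences


# 1,500 bp and 10% are from the article; 5 is a hypothetical "handful".
errors, ratio = error_vs_signal(1500, 0.10, 5)
print(f"expected errors per read: {errors:.0f}")
print(f"errors per true difference: {ratio:.0f}")
```

Under these assumptions a single read could carry about 150 erroneous positions, roughly 30 for every real difference between two species, which is why a fixed cutoff cannot cleanly separate the two.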
Instead, Emu learns to distinguish between signal and error by comparing a variety of long sequences first to a template and then to each other, iteratively refining its error correction while profiling microbial communities. In the experiments conducted, Emu produced significantly fewer false positives than other approaches when analyzing the same datasets.
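As a rough illustration of the expectation-maximization idea behind the name, the sketch below estimates species abundances from a table of per-read likelihoods. It is not Emu's code. The function name `em_abundance`, the likelihood table and the stopping rule are all assumptions made for this example, and it leaves out the alignment and error-modeling steps entirely.

```python
def em_abundance(likelihoods, n_iter=200, tol=1e-9):
    """Toy EM: likelihoods[r][t] is an assumed P(read r | species t)."""
    n_taxa = len(likelihoods[0])
    # Start from a uniform guess at the community composition.
    abundance = [1.0 / n_taxa] * n_taxa
    for _ in range(n_iter):
        totals = [0.0] * n_taxa
        for row in likelihoods:
            # E-step: share each read among species in proportion to
            # (current abundance) x (likelihood of the read).
            weighted = [a * p for a, p in zip(abundance, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is compatible with no species; skip it
            for t, w in enumerate(weighted):
                totals[t] += w / norm
        assigned = sum(totals)
        if assigned == 0.0:
            raise ValueError("no read is compatible with any species")
        # M-step: new abundances are the normalized fractional read counts.
        updated = [x / assigned for x in totals]
        shift = max(abs(u - a) for u, a in zip(updated, abundance))
        abundance = updated
        if shift < tol:
            break
    return abundance


# Hypothetical likelihoods for four reads against two species.
reads = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
print(em_abundance(reads))
```

Each pass reassigns ambiguous reads using the current abundance estimate and then updates that estimate from the reassignment. This lets rare species be weighed against the whole sample rather than judged by a fixed threshold.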
“Long reads represent a disruptive technology for microbiome research,” said Treangen. “The goal of Emu was to use all the information contained in the entire 16S gene without masking anything to see if we could achieve more accurate designations at the genus or species level. And that’s exactly what we’ve achieved with Emu, thanks to a fruitful, multidisciplinary collaboration.”
Alexander Dilthey, Professor of Genomic Microbiology and Immunity at Heinrich Heine University Düsseldorf, Germany, is a co-author of the publication.
Co-authors are Rice alumnus Qi Wang, postdoctoral fellow Michael Nute and alumna Elizabeth Reeves; Alona Tyshaieva, Enid Graeber and Patrick Finzer of Heinrich Heine University; Sirena Soriano and Sonia Villapol of the Houston Methodist Research Institute’s Center for Neuroregeneration; Qinglong Wu and Tor Savidge of Baylor College of Medicine and Texas Children’s Hospital Microbiome Center; and Werner Mendling of Helios University, Wuppertal, Germany.
Curry, K. D., et al. (2022) Emu: Species-level microbial community profiling for full-length Nanopore 16S reads. Nature Methods. doi.org/10.1038/s41592-022-01520-4.