Analyzing metagenomic (DNA) and metatranscriptomic (RNA) sequence data is technically challenging and time-consuming. The process often requires assembly of sequencing reads and mapping of reads to existing databases of target genomic information; however, these databases are memory-intensive and frequently limited in scope. The mapping and assembly steps are also computationally intensive, particularly for long-read data (5,000-20,000 base pairs). Traditional mapping approaches, developed for short-read data (75-300 base pairs), perform poorly on long-read data, and assembly methods require multiple runs with different software to accommodate reads of all lengths.
microCafe, a software package developed by GW researchers, tackles these challenges by leveraging Large Language Models (LLMs) with guided tokenization and fine-tuning specifically for DNA and RNA sequences. This Genomic Language Model (GLM)-based approach enables high-accuracy inference of microbial composition, microbial feature profiling, and variant detection, all without the need for large, memory-intensive databases. By learning the intrinsic structure of DNA and RNA sequences, microCafe is capable of analyzing unseen genomes and facilitating novel biomarker discovery. Moreover, it eliminates the need for computationally expensive sequence mapping, significantly enhancing scalability and efficiency.
Figure 1: Guided Tokenization and Fine-Tuning, one of many unique elements of the Genomic Language Model (GLM). a. Sequencing data typically originates from various genomes or user-labeled classes, and reference datasets for genomes or other biological classes are known. b. During fine-tuning, unique reads can be leveraged to refine the model. c. During prediction, when a new read is processed, tokens unique to a genome can be prioritized, allowing the read to be segmented based on known unique tokens before undergoing the standard tokenization process.
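To make the segmentation step in Figure 1c concrete, here is a minimal Python sketch. It is illustrative only and is not microCafe's implementation. It assumes that genome-unique tokens are fixed-length k-mers found in exactly one reference, and it uses non-overlapping k-mer chunking as a stand-in for the "standard tokenization process", which the description above does not specify. The function names (`unique_kmers`, `kmer_tokenize`, `guided_tokenize`) and the toy sequences are hypothetical.

```python
import re
from typing import Dict, List, Set


def unique_kmers(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Return, per reference, the k-mers that occur in that reference only.

    Assumption: "tokens unique to a genome" are modeled as k-mers.
    """
    owners: Dict[str, Set[str]] = {}
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(name)

    unique: Dict[str, Set[str]] = {name: set() for name in references}
    for kmer, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(kmer)
    return unique


def kmer_tokenize(seq: str, k: int = 3) -> List[str]:
    """Placeholder for the standard tokenizer: non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def guided_tokenize(read: str, unique_tokens: Set[str], k: int = 3) -> List[str]:
    """Segment a read at known unique tokens, then tokenize what is left.

    Unique tokens are emitted whole; the stretches between them fall back
    to the placeholder tokenizer.
    """
    if not unique_tokens:
        return kmer_tokenize(read, k)

    # Longest tokens first, so a long unique token wins over its own prefix.
    ordered = sorted(unique_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))

    tokens: List[str] = []
    pos = 0
    for match in pattern.finditer(read):
        if match.start() > pos:
            tokens.extend(kmer_tokenize(read[pos:match.start()], k))
        tokens.append(match.group())
        pos = match.end()
    if pos < len(read):
        tokens.extend(kmer_tokenize(read[pos:], k))
    return tokens


if __name__ == "__main__":
    # Toy read containing one hypothetical genome-unique token, "TTGACC".
    print(guided_tokenize("ACGTTTGACCAGT", {"TTGACC"}))
```

In this sketch the unique token survives as a single unit instead of being split across k-mer boundaries, which is the property the caption describes. A real tokenizer would presumably use a learned vocabulary rather than fixed k-mers.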
Advantages
Applications