Evo 2: Open‑Source AI Trained on Trillions of DNA Bases Across All Life Domains

Background and Motivation

Earlier coverage highlighted an AI system called Evo that was trained on an enormous number of bacterial genomes. The system could, when given sequences from a cluster of related genes, correctly identify the next gene or suggest a completely novel protein. This success relied on the relatively simple organization of bacterial genomes, where related genes are often clustered together and regulatory elements are compact.

Challenges with Complex Genomes

The original reporting noted uncertainty about whether the same approach would work with more complex genomes, such as those of eukaryotes. Eukaryotic DNA contains introns, non‑coding segments that interrupt coding regions, and regulatory sequences that can be scattered across vast stretches of DNA. Both kinds of feature are weakly defined: only a few bases are strictly required, and the rest show only probabilistic tendencies. Eukaryotic genomes also include large amounts of DNA that has been labeled “junk,” including inactive viruses and damaged genes.
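To make the "weakly defined" point concrete, here is a minimal Python sketch. It assumes the common rule that most eukaryotic introns begin with GT and end with AG in the DNA sequence. The function names, length bounds and toy sequence are illustrative assumptions, not part of Evo 2 or the original reporting.

```python
def find_all(seq, motif):
    """Return every start index at which motif occurs in seq (overlaps allowed)."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def candidate_introns(seq, min_len=4, max_len=50):
    """List (start, end) spans that start with GT and end with AG.

    The length bounds are arbitrary illustrative defaults, not biological
    constants. Spans are half-open: seq[start:end] is the candidate.
    """
    seq = seq.upper()
    donors = find_all(seq, "GT")      # possible intron starts
    acceptors = find_all(seq, "AG")   # possible intron ends
    spans = []
    for d in donors:
        for a in acceptors:
            end = a + 2               # include the AG itself
            if min_len <= end - d <= max_len:
                spans.append((d, end))
    return spans


if __name__ == "__main__":
    toy = "AAGTCCCCAGTT"
    for start, end in candidate_introns(toy):
        print(start, end, toy[start:end])
```

Because GT and AG each occur by chance roughly once every 16 bases in random sequence, a rule like this flags far more candidates than there are real introns in a genome. The remaining signal lies in the probabilistic tendencies of the surrounding bases, which is the kind of pattern a trained model has to pick up.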

Evo 2: Extending the Model

Undeterred by these challenges, the team behind Evo set out to create Evo 2, an open‑source AI trained on genomes from all three domains of life: bacteria, archaea and eukaryotes. By ingesting trillions of base pairs of DNA, Evo 2 developed internal representations of key genomic features that are difficult for humans to spot, including regulatory DNA motifs and splice‑site boundaries.

Key Capabilities

Evo 2’s training enables it to recognize patterns across the full spectrum of genomic complexity. In bacterial genomes, it continues to leverage the straightforward organization of contiguous genes and compact regulatory systems. In eukaryotic genomes, it can parse intron‑containing genes, locate weakly defined regulatory sites, and differentiate functional sequences from the extensive non‑functional DNA that surrounds them.

Implications for Research

The emergence of Evo 2 suggests that large‑scale AI models can bridge the gap between simple and complex genomic architectures. By learning from vast, diverse datasets, such models may assist scientists in identifying regulatory elements, predicting gene structures and uncovering novel proteins across a wide range of organisms. The open‑source nature of Evo 2 also invites collaboration and further development within the bioinformatics community.

Source: Ars Technica