Learning the Language of Viral Evolution and Escape

By: Brian Hie, Ellen D. Zhong, Bonnie Berger, Bryan Bryson

Originally published in AAAS, Jan 15, 2021.

Natural language predicts viral escape

Viral mutations that evade neutralizing antibodies, an occurrence known as viral escape, may impede the development of vaccines. To predict which mutations may lead to viral escape, Hie et al. used a machine learning technique for natural language processing with two components: grammar (or syntax) and meaning (or semantics) (see the Perspective by Kim and Przytycka). Three different unsupervised language models were constructed for influenza A hemagglutinin, HIV-1 envelope glycoprotein, and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike glycoprotein. Semantic landscapes for these viruses predicted viral escape mutations that produce sequences that are grammatically (syntactically) correct but effectively different in semantics and thus able to evade the immune system.

Science, this issue p. 284; see also p. 233

Abstract

The ability of viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.

Viral mutations that allow an infection to escape from recognition by neutralizing antibodies have prevented the development of a universal antibody-based vaccine for influenza (1, 2) or HIV (3) and are a concern in the development of therapies for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection (4, 5). Escape has motivated high-throughput experimental techniques that perform causal escape profiling of all single-residue mutations to a viral protein (1–4). Such techniques, however, require substantial effort to profile even a single viral strain, and testing the escape potential of many (combinatorial) mutations in many viral strains remains infeasible.

Instead, we sought to train an algorithm that learns to model escape from viral sequence data alone. This approach resembles learning properties of natural language from large text corpora (6, 7), because languages such as English and Japanese use sequences of words to encode complex meanings and have complex rules (for example, grammar). To escape, a mutant virus must preserve infectivity and evolutionary fitness (it must obey a “grammar” of biological rules), and the mutant must no longer be recognized by the immune system, which is analogous to a change in the “meaning” or the “semantics” of the virus.

Currently, computational models of protein evolution focus either on fitness (8) or on functional or semantic similarity (9–11), but we want to understand both (Fig. 1A). Rather than developing two separate models of fitness and function, we developed a single model that simultaneously achieves these tasks. We leveraged state-of-the-art machine learning algorithms called language models (6, 7), which learn the probability of a token (such as an English word) given its sequence context (such as a sentence) (Fig. 1B). Internally, the language model constructs a semantic representation, or an “embedding,” for a given sequence (6), and the output of a language model encodes how well a particular token fits within the rules of the language, which we call “grammaticality” and can also be thought of as “syntactic fitness” (supplementary text, note S2). The same principles used to train a language model on a sequence of English words can train a language model on a sequence of amino acids. Although immune selection occurs on phenotypes (such as protein structures), evolution dictates that selection is reflected within genotypes (such as protein sequences), which language models can leverage to learn functional properties from sequence variation.
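The two quantities described above, grammaticality (the probability of a residue given its sequence context) and semantic change (how far a mutation moves the sequence embedding), can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy corpus is made up, the neighbor-count model stands in for a real neural language model, the "embedding" is simply the model's per-position predicted distributions, and the rank-sum combination with weight beta is one plausible way to favor mutations that score high on both quantities; this excerpt does not specify how the authors combine them.

```python
from collections import Counter, defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Made-up toy sequences, NOT real viral proteins.
CORPUS = ["MKTAY", "MKSAY", "MRTAY", "MKTAF", "MKTVY"]


def train_context_counts(corpus):
    """Count which residue appears between each (left, right) neighbor pair."""
    counts = defaultdict(Counter)
    for seq in corpus:
        padded = "^" + seq + "$"  # start/end markers give edge residues a context
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i + 1])][padded[i]] += 1
    return counts


def grammaticality(counts, seq, pos, residue, pseudocount=0.1):
    """Toy p(residue | context): smoothed frequency given the two neighbors."""
    padded = "^" + seq + "$"
    context = (padded[pos], padded[pos + 2])  # neighbors of seq[pos]
    seen = counts.get(context, Counter())
    total = sum(seen.values()) + pseudocount * len(ALPHABET)
    return (seen[residue] + pseudocount) / total


def embed(counts, seq):
    """Toy 'embedding': the predicted residue distribution at every position."""
    return [grammaticality(counts, seq, pos, r)
            for pos in range(len(seq)) for r in ALPHABET]


def semantic_change(counts, seq, mutant):
    """L1 distance between the toy embeddings of two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(embed(counts, seq), embed(counts, mutant)))


def rank_mutations(counts, seq, beta=1.0):
    """Score every single-residue mutation by rank(semantic change) + beta * rank(grammaticality)."""
    candidates = []
    for pos, wild_type in enumerate(seq):
        for residue in ALPHABET:
            if residue == wild_type:
                continue
            mutant = seq[:pos] + residue + seq[pos + 1:]
            candidates.append({
                "mutation": f"{wild_type}{pos + 1}{residue}",
                "grammaticality": grammaticality(counts, seq, pos, residue),
                "semantic_change": semantic_change(counts, seq, mutant),
            })

    def ranks(key):
        # Rank 1 = smallest value, so larger values earn larger ranks.
        order = sorted(range(len(candidates)), key=lambda i: candidates[i][key])
        result = [0] * len(candidates)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    gram_ranks = ranks("grammaticality")
    sem_ranks = ranks("semantic_change")
    for cand, g, s in zip(candidates, gram_ranks, sem_ranks):
        cand["score"] = s + beta * g
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


counts = train_context_counts(CORPUS)
ranked = rank_mutations(counts, "MKTAY")
for cand in ranked[:5]:
    print(cand["mutation"], round(cand["grammaticality"], 3),
          round(cand["semantic_change"], 3), cand["score"])
```

Ranks are used rather than raw values so that the two quantities, which live on different scales, contribute comparably; a larger beta weights grammaticality more heavily. With a real model, the grammaticality and embedding functions would be replaced by the language model's output probabilities and its internal sequence representation.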
