10/5/2023

N in a protein sequence meaning

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data more accurately than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.

Editor's evaluation

Designing new proteins with specific structure and function is a highly important goal of bioengineering. Indeed, it allows tuning their stability or their biochemical properties, including their enzymatic activities, enabling important medical applications. The search for novel proteins is difficult due to the huge size of protein sequence space: for instance, there are 20^100 different possible sequences for a short protein domain with 100 amino acids. Furthermore, only a small fraction of this space comprises sequences that do fold, as demonstrated by experiments studying random sequences (Socolich et al., 2005) and by theoretical arguments based on the physics of disordered systems (Bialek, 2012). De novo or rational protein design, which starts from target three-dimensional structures and physico-chemical potentials, can generate proteins that are not in a known protein family (Dahiyat and Mayo, 1997; Kuhlman et al., 2003; Liang et al., 2009), but is generally restricted to small proteins (Rocklin et al., 2017). Conversely, directed evolution allows a local search of sequence space, but generally remains limited to the vicinity of a natural sequence (Arnold, 2018).

Generative computational models that build on the breadth of available natural protein sequence data, and capture a representation of protein families, now offer great alternatives that can allow sampling novel sequences belonging to protein families. In particular, Potts models, or DCA models (Weigt et al., 2009; Morcos et al., 2011; Marks et al., 2011; Ekeberg et al., 2013), which are pairwise maximum entropy models trained to reproduce the one- and two-body statistics of the sequences of a family, allow direct sampling from a probability distribution modeling this family (Figliuzzi et al., 2018), and have been used successfully for protein design (Russ et al., 2020). Variational autoencoders are deep learning models which also allow sampling, and they have been shown to successfully produce functional proteins (Hawkins-Hooker et al., 2021a), although their statistical properties appear to have lower quality than those of Potts models (McGee et al., 2021). Protein language models are deep learning models based on natural language processing methods, especially attention (Bahdanau et al., 2015) and transformers (Vaswani et al., 2017). They are trained on large ensembles of protein sequences, and capture long-range dependencies within a protein sequence (Alley et al., 2019; Elnaggar et al., 2021; Rives et al., 2021; Rao et al., 2021b; Vig et al., 2021; Madani et al., 2020; Madani et al., 2021; Rao et al., 2021a). These pre-trained models are able to predict structure in an unsupervised way (Rao et al., 2021b), taking as input either a single sequence (Rives et al., 2021) or a multiple sequence alignment (MSA) (Rao et al., 2021a), potentially by transferring knowledge from their large training set (Bhattacharya et al., 2020; Bhattacharya et al., 2022).
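The one- and two-body statistics that Potts/DCA models are trained to reproduce can be computed directly from an MSA. The following is a minimal sketch of those empirical frequencies; the function names are illustrative, not taken from any published implementation:

```python
from collections import Counter
from itertools import combinations

def one_body_freqs(msa):
    """f_i(a): frequency of amino acid a at column i of the MSA."""
    length, depth = len(msa[0]), len(msa)
    return [{aa: count / depth
             for aa, count in Counter(seq[i] for seq in msa).items()}
            for i in range(length)]

def two_body_freqs(msa):
    """f_ij(a, b): joint frequency of the pair (a, b) at columns (i, j)."""
    length, depth = len(msa[0]), len(msa)
    return {(i, j): {pair: count / depth
                     for pair, count in Counter((seq[i], seq[j])
                                                for seq in msa).items()}
            for i, j in combinations(range(length), 2)}
```

A pairwise maximum entropy (Potts) model is then fitted so that its marginals match these f_i and f_ij; the connected correlations f_ij(a, b) - f_i(a) f_j(b) quantify the coevolutionary signal the model captures.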
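The iterative masked language modeling generation described above can be sketched as follows. This is a toy illustration only: it substitutes simple per-column amino-acid frequencies for the masked-token probabilities a real MSA Transformer would predict, and the function names and the masking fraction `p_mask` are our assumptions, not the paper's actual parameters:

```python
import random
from collections import Counter

AAS = "ACDEFGHIKLMNPQRSTVWY"

def column_probs(msa):
    """Per-column amino-acid frequencies: a stand-in for the masked-token
    probabilities that an MSA Transformer forward pass would output."""
    length = len(msa[0])
    probs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa)
        total = sum(counts.values())
        probs.append({aa: counts.get(aa, 0) / total for aa in AAS})
    return probs

def iterative_masked_sampling(msa, p_mask=0.1, n_iter=20, rng=None):
    """Repeatedly mask a random fraction of positions in each sequence of
    the MSA and resample them from the model's predicted distribution."""
    rng = rng or random.Random(0)
    seqs = [list(s) for s in msa]
    length = len(seqs[0])
    for _ in range(n_iter):
        # In the real method, these probabilities come from MSA Transformer
        # conditioned on the entire (partially masked) alignment.
        probs = column_probs(["".join(s) for s in seqs])
        for s in seqs:
            for i in (i for i in range(length) if rng.random() < p_mask):
                aas, weights = zip(*probs[i].items())
                s[i] = rng.choices(aas, weights=weights, k=1)[0]
    return ["".join(s) for s in seqs]
```

Because each resampling step conditions on the current state of the whole alignment, repeated iterations let the generated sequences drift away from the natural ones while remaining consistent with the family's statistics.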