Similar literature
1.
We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong and often inappropriate assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy.
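To make "covariation statistics" concrete, the following minimal sketch computes the mutual information between two alignment columns. It is not GREMLIN, which learns a full graphical model by regularized convex optimization; the function names and the toy alignment are assumptions made purely for illustration.

```python
import math
from collections import Counter

def column_frequencies(msa, i):
    """Relative frequency of each symbol in column i of the alignment."""
    counts = Counter(seq[i] for seq in msa)
    total = len(msa)
    return {aa: n / total for aa, n in counts.items()}

def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j."""
    n = len(msa)
    fi = column_frequencies(msa, i)
    fj = column_frequencies(msa, j)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / (fi[a] * fj[b]))
    return mi

# Hypothetical toy alignment: columns 0 and 1 always change together,
# while column 2 varies independently of column 0.
toy_msa = ["ACD", "ACE", "GTD", "GTE"]
print(mutual_information(toy_msa, 0, 1))
print(mutual_information(toy_msa, 0, 2))
```

A pairwise statistic like this cannot separate direct from indirect correlations, which is one reason model-based approaches are used instead.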

2.
Generative molecular design for drug discovery and development has seen a recent resurgence, promising to improve the efficiency of the design-make-test-analyse cycle by computationally exploring much larger chemical spaces than traditional virtual screening techniques. However, most generative models thus far have only utilized small-molecule information to train and condition de novo molecule generators. Here, we instead focus on recent approaches that incorporate protein structure into de novo molecule optimization in an attempt to maximize the predicted on-target binding affinity of generated molecules. We categorize these structure integration principles as either distribution learning or goal-directed optimization and, in each case, by whether the approach is protein structure-explicit or implicit with respect to the generative model. We discuss recent approaches in the context of this categorization and provide our perspective on the future direction of the field.

3.
Deep learning approaches have produced substantial breakthroughs in fields such as image classification and natural language processing and are making rapid inroads in the area of protein design. Many generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins. These generative models can learn protein representations that are often more informative of protein structure and function than hand-engineered features. Furthermore, they can be used to quickly propose millions of novel proteins that resemble the native counterparts in terms of expression level, stability, or other attributes. The protein design process can further be guided by discriminative oracles to select candidates with the highest probability of having the desired properties. In this review, we discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model-guided protein design.

4.
Bacterial bioluminescence can display a wide range of intensities among strains, from very bright to undetectable, and it has been shown previously that there are nonluminous vibrios that possess lux genes. In this paper, we report the isolation and characterization of completely dark natural mutants in the genus Vibrio. Screening of over 600 Vibrio isolates with a luxA gene probe revealed that approximately 5% carried the luxA gene. Bioluminescence assays of the luxA-positive isolates, followed by repetitive extragenic palindromic-PCR fingerprinting, showed three unique genotypes that are completely dark. The dark mutants show a variety of lesions, including an insertion sequence, point mutations, and deletions. Strain BCB451 has an IS10 insertion sequence in luxA, a mutated luxE stop codon, and a truncated luxH. Strain BCB494 has a 396-bp deletion in luxC, and strain BCB440 has a frameshift in luxC. This paper represents the first molecular characterization of natural dark mutants and the first demonstration of incomplete lux operons in natural isolates.

5.

Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.


6.
A variety of methods to generate diverse proteins, including random mutagenesis and recombination, are currently available, and most of them accumulate mutations on the target gene of a protein whose sequence space remains unchanged. On the other hand, a pool of diverse genes generated by random insertions, deletions and exchange of homologous domains with different lengths in the target gene would give rise to protein lineages with new fitness landscapes. Here we report a method to generate a pool of protein variants with different sequence spaces, employing green fluorescent protein (GFP) as a model protein. This process, designated functional salvage screen (FSS), comprises the following procedures: a defective GFP template expressing no fluorescence is first constructed by genetically disrupting a predetermined region(s) of the protein; a library of GFP variants is then generated from the defective template by incorporating randomly fragmented genomic DNA from Escherichia coli into the defined region(s) of the target gene; and the functionally salvaged, fluorescence-emitting GFPs are finally identified by screening. Two approaches, sequence-directed and PCR-coupled methods, were used to generate the library of GFP variants with new sequences derived from the genomic segments of E. coli. The functionally salvaged GFPs were selected and analyzed in terms of sequence space and functional properties. The results demonstrate that the functional salvage process not only can be a simple and effective method to create protein lineages with new sequence spaces, but also can be useful in elucidating the involvement of a specific region(s) or domain(s) in the structure and function of a protein.

7.
Computational protein design can generate proteins not found in nature that adopt desired structures and perform novel functions. Although proteins could, in theory, be designed with ab initio methods, practical success has come from using large amounts of data that describe the sequences, structures, and functions of existing proteins and their variants. We present recent creative uses of multiple-sequence alignments, protein structures, and high-throughput functional assays in computational protein design. Approaches range from enhancing structure-based design with experimental data to building regression models to training deep neural nets that generate novel sequences. Looking ahead, deep learning will be increasingly important for maximizing the value of data for protein design.

8.
A variational autoencoder (VAE) is a machine learning algorithm useful for generating a compressed and interpretable latent space. These representations have been generated from various biomedical data types and can be used to produce realistic-looking simulated data. However, vanilla VAEs suffer from entangled and uninformative latent spaces, which can be mitigated using other types of VAEs such as β-VAE and MMD-VAE. In this project, we evaluated the ability of VAEs to learn cell morphology characteristics derived from cell images. We trained and evaluated three VAE variants (Vanilla VAE, β-VAE, and MMD-VAE) on cell morphology readouts and explored the generative capacity of each model to predict compound polypharmacology (the interactions of a drug with more than one target) using an approach called latent space arithmetic (LSA). To test the generalizability of the strategy, we also trained these VAEs using gene expression data of the same compound perturbations and found that gene expression provides complementary information. We found that the β-VAE and MMD-VAE disentangle morphology signals and reveal a more interpretable latent space. We reliably simulated morphology and gene expression readouts from certain compounds, thereby predicting cell states perturbed with compounds of known polypharmacology. Inferring cell state for specific drug mechanisms could aid researchers in developing and identifying targeted therapeutics and categorizing off-target effects in the future.
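The abstract does not spell out the exact operation behind latent space arithmetic, so the sketch below shows one common form of the idea under stated assumptions: the effect of each single-target compound is taken as its mean latent shift from control, and the shifts are summed to estimate a two-target state. The function names and the toy latent vectors are hypothetical, and a trained decoder (not shown) would be needed to map the result back to morphology or expression readouts.

```python
import numpy as np

def latent_shift(z_control, z_treated):
    """Mean displacement in latent space caused by a perturbation."""
    return z_treated.mean(axis=0) - z_control.mean(axis=0)

def predict_combined(z_control, z_a, z_b):
    """Estimate the latent state for a compound hitting both targets A and B."""
    base = z_control.mean(axis=0)
    return base + latent_shift(z_control, z_a) + latent_shift(z_control, z_b)

# Hypothetical 2-D latent encodings (rows are samples).
z_control = np.array([[0.0, 0.0], [0.0, 0.0]])
z_a = np.array([[1.0, 0.0], [1.0, 0.0]])   # compound with mechanism A
z_b = np.array([[0.0, 2.0], [0.0, 2.0]])   # compound with mechanism B
print(predict_combined(z_control, z_a, z_b))
```

This kind of additive estimate is only meaningful when the latent dimensions are reasonably disentangled, which is consistent with the abstract's emphasis on β-VAE and MMD-VAE.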

9.
Coevolution between protein residues is normally interpreted as direct contact. However, the evolutionary record of a protein sequence contains rich information that may include long-range functional couplings, couplings that report on homo-oligomeric states or even conformational changes. Due to the complexity of the sequence space and the lack of structural information on various members of a protein family, it has been difficult to effectively mine the additional information encoded in a multiple sequence alignment (MSA). Here, taking advantage of the recent release of the AlphaFold (AF) database, we attempt to identify coevolutionary couplings that cannot be explained simply by spatial proximity. We propose a simple computational method that performs direct coupling analysis on an MSA and searches for couplings that are not satisfied in any of the AF models of members of the identified protein family. Application of this method to 2012 protein families suggests that ~12% of the total identified coevolving residue pairs are spatially distant and more likely to be disordered than their contacting counterparts. We expect that this analysis will help improve the quality of coevolutionary distance restraints used for structure determination and will be useful in identifying potentially functional/allosteric cross-talk between distant residues.
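The filtering step described above can be sketched as follows: given residue pairs already predicted to be coupled, keep only those that are not in contact in any structural model of the family. The 8 Å cutoff, the single-point-per-residue coordinates, and the function names are assumptions for illustration; the abstract does not state the contact definition used.

```python
import numpy as np

def min_distance_over_models(models, i, j):
    """Smallest distance between residues i and j across all models."""
    return min(float(np.linalg.norm(coords[i] - coords[j])) for coords in models)

def unsatisfied_couplings(pairs, models, cutoff=8.0):
    """Coupled pairs whose residues are farther apart than cutoff in every model."""
    return [(i, j) for i, j in pairs
            if min_distance_over_models(models, i, j) > cutoff]

# Hypothetical family with two models of three residues each
# (one representative coordinate per residue, in angstroms).
models = [
    np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]]),
    np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [25.0, 0.0, 0.0]]),
]
pairs = [(0, 1), (0, 2), (1, 2)]
print(unsatisfied_couplings(pairs, models))
```

Taking the minimum over models means a pair is flagged only if no family member places the two residues in contact.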

10.
11.
How to explore protein sequence space efficiently and how to generate high-quality mutant libraries that allow improved variants to be identified with current screening technologies are key questions for any directed protein evolution experiment. High-quality mutant libraries can be generated through improved random mutagenesis methodologies and by using computational methods to restrict diversity generation to residues with high success probabilities. Advances in mutant library design and computational tools to focus diversity generation are summarized in this minireview and discussed from an experimentalist's point of view in the context of directed protein evolution.

12.
While deep learning models have seen increasing applications in protein science, few have been implemented for protein backbone generation, an important task in structure-based problems such as active site and interface design. We present a new approach to building class-specific backbones, using a variational auto-encoder to directly generate the 3D coordinates of immunoglobulins. Our model is torsion- and distance-aware, learns a high-resolution embedding of the dataset, and generates novel, high-quality structures compatible with existing design tools. We show that the Ig-VAE can be used with Rosetta to create a computational model of a SARS-CoV2-RBD binder via latent space sampling. We further demonstrate that the model's generative prior is a powerful tool for guiding computational protein design, motivating a new paradigm under which backbone design is solved as a constrained optimization problem in the latent space of a generative model.

13.
Deep generative models have recently gained popularity for chemical design. Many of these models have historically operated in 2D space; more recently, however, explicit 3D molecular generative models have attracted interest, and they are the topic of this article. Dozens of models have been published in the last few years that generate molecules directly in 3D, outputting both the atom types and coordinates, either in one shot or by adding atoms or fragments step by step. These 3D generative models can also be guided by structural information such as a binding pocket representation to successfully generate molecules with docking score ranges similar to known actives, but they still show lower computational efficiency and generation throughput than 1D/2D generative models and sometimes produce unrealistic conformations. We advocate for a unified benchmark of metrics to evaluate generation and propose perspectives to be addressed in future implementations.

14.

Background  

Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone", we can observe that only some segments of sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM).

15.
Several species of the genus Vibrio, including Vibrio cholerae, are bioluminescent or contain bioluminescent strains. Previous studies have reported that only 10% of V. cholerae strains are luminescent. Analysis of 224 isolates of non-O1/non-O139 V. cholerae collected from Chesapeake Bay, MD, revealed that 52% (116/224) were luminescent when an improved assay method was employed and 58% (130/224) of isolates harbored the luxA gene. In contrast, 334 non-O1/non-O139 V. cholerae strains isolated from two rural provinces in Bangladesh yielded only 21 (6.3%) luminescent and 35 (10.5%) luxA+ isolates. An additional 270 clinical and environmental isolates of V. cholerae serogroups O1 and O139 were tested, and none were luminescent or harbored luxA. These results indicate that bioluminescence may be a trait specific for non-O1/non-O139 V. cholerae strains that frequently occur in certain environments. Luminescence expression patterns of V. cholerae were also investigated, and isolates could be grouped based on expression level. Several strains with defective expression of the lux operon, including natural K variants, were identified.

16.
Protein chemical shifts encode detailed structural information that is difficult and computationally costly to describe at a fundamental level. Statistical and machine learning approaches have been used to infer correlations between chemical shifts and secondary structure from experimental chemical shifts. These methods range from simple statistics such as the chemical shift index to complex methods using neural networks. Notwithstanding their higher accuracy, more complex approaches tend to obscure the relationship between secondary structure and chemical shift and often involve many parameters that need to be trained. We present hidden Markov models (HMMs) with Gaussian emission probabilities to model the dependence between protein chemical shifts and secondary structure. The continuous emission probabilities are modeled as conditional probabilities for a given amino acid and secondary structure type. Using these distributions as outputs of first- and second-order HMMs, we achieve a prediction accuracy of 82.3%, which is competitive with existing methods for predicting secondary structure from protein chemical shifts. Incorporation of sequence-based secondary structure prediction into our HMM improves the prediction accuracy to 84.0%. Our findings suggest that an HMM with correlated Gaussian distributions conditioned on the secondary structure provides an adequate generative model of chemical shifts. Proteins 2013; © 2012 Wiley Periodicals, Inc.
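As a rough illustration of decoding with Gaussian emissions, the sketch below runs the Viterbi algorithm on a first-order HMM in which each hidden state emits a single scalar value from its own normal distribution. This is a simplification made for illustration: the model in the abstract conditions emissions on amino acid type, uses correlated multivariate Gaussians, and also considers second-order chains. All parameter values and names here are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(obs, log_start, log_trans, means, stds):
    """Most likely state path for a first-order HMM with Gaussian emissions."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    n_obs, n_states = len(obs), len(means)
    # log_emit[t, s] = log p(obs[t] | state s)
    log_emit = gaussian_logpdf(obs[:, None], means[None, :], stds[None, :])
    score = log_start + log_emit[0]
    back = np.zeros((n_obs, n_states), dtype=int)
    for t in range(1, n_obs):
        cand = score[:, None] + log_trans      # cand[prev, cur]
        back[t] = cand.argmax(axis=0)          # best predecessor of each state
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(n_obs - 1, 0, -1):          # trace back from the last step
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Hypothetical two-state example: state 0 emits around -2, state 1 around +2.
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
print(viterbi([-2.1, -1.9, 2.0, 2.2], log_start, log_trans, [-2.0, 2.0], [1.0, 1.0]))
```

Working in log space avoids numerical underflow on long sequences.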

17.
The accuracy of a homology model based on the structure of a distant relative or other topologically equivalent protein is primarily limited by the quality of the alignment. Here we describe a systematic approach for sequence-to-structure alignment, called ‘K*Sync’, in which alignments are generated by dynamic programming using a scoring function that combines information on many protein features, including a novel measure of how obligate a sequence region is to the protein fold. By systematically varying the weights on the different features that contribute to the alignment score, we generate very large ensembles of diverse alignments, each optimal under a particular constellation of weights. We investigate a variety of approaches to select the best models from the ensemble, including consensus of the alignments, a hydrophobic burial measure, low- and high-resolution energy functions, and combinations of these evaluation methods. The effect on model quality and selection resulting from loop modeling and backbone optimization is also studied. The performance of the method on a benchmark set is reported and shows the approach to be effective at both generating and selecting accurate alignments. The method serves as the foundation of the homology modeling module in the Robetta server.
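The idea of a dynamic-programming alignment whose score is a weighted combination of features can be sketched as below. This is a generic Needleman-Wunsch global alignment with a linear gap penalty, not the K*Sync scoring function; the feature function, weights, and gap value are hypothetical and chosen only to show how changing the weights changes the optimal score.

```python
def align_score(seq_a, seq_b, features, weights, gap=-1.0):
    """Optimal global alignment score under a weighted sum of feature scores."""
    def pair_score(x, y):
        return sum(w * f(x, y) for w, f in zip(weights, features))

    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,   # gap in seq_b
                dp[i][j - 1] + gap,   # gap in seq_a
            )
    return dp[n][m]

def identity(x, y):
    """Hypothetical feature: +1 for a match, -1 for a mismatch."""
    return 1.0 if x == y else -1.0

print(align_score("ACGT", "AGT", [identity], [1.0]))
print(align_score("ACGT", "ACGT", [identity], [2.0]))
```

Re-running the same recursion over a grid of weight vectors would yield an ensemble of alignments, each optimal under its own weights, which is the strategy the abstract describes.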

18.
19.
Glycan masking is an emerging vaccine design strategy to focus antibody responses to specific epitopes, but it has mostly been evaluated on the already heavily glycosylated HIV gp120 envelope glycoprotein. Here this approach was used to investigate the binding interaction of Plasmodium vivax Duffy Binding Protein (PvDBP) and the Duffy Antigen Receptor for Chemokines (DARC) and to evaluate whether glycan-masked PvDBPII immunogens would focus the antibody response on key interaction surfaces. Four variants of PvDBPII were generated and probed for function and immunogenicity. Two PvDBPII glycosylation variants with increased glycan surface coverage distant from predicted interaction sites had binding activity equivalent to that of wild-type protein, and one of them elicited slightly better DARC-binding-inhibitory activity than wild-type immunogen. Conversely, the addition of an N-glycosylation site adjacent to a predicted PvDBP interaction site both abolished its interaction with DARC and resulted in weaker inhibitory antibody responses. PvDBP is composed of three subdomains and is thought to function as a dimer; a meta-analysis of published PvDBP mutants and the new DBPII glycosylation variants indicates that critical DARC binding residues are concentrated at the dimer interface and along a relatively flat surface spanning portions of two subdomains. Our findings suggest that DARC-binding-inhibitory antibody epitope(s) lie close to the predicted DARC interaction site, and that addition of N-glycan sites distant from this site may augment inhibitory antibodies. Thus, glycan resurfacing is an attractive and feasible tool to investigate protein structure-function, and glycan-masked PvDBPII immunogens might contribute to P. vivax vaccine development.

20.
Luminescence 2003, 18(3): 140-144
It was demonstrated recently that luminescence of a free-living marine bacterium, Vibrio harveyi, stimulates DNA repair, most probably by activation of the photoreactivation process. Here, we ask whether the stimulation of DNA repair could be an evolutionary drive that ensured maintenance and development of early bacterial luminescent systems. To test this hypothesis, we cultivated V. harveyi lux+ bacteria and luxA mutants in mixed cultures. Initial cultures were mixed to obtain a culture consisting of roughly 50% lux+ cells and 50% luxA mutants. The bacteria were then cultivated for several days and the ratio of luminescent to dark bacteria was measured. Under these conditions, luxA mutants became highly predominant within a few days of cultivation. This indicates that, without a selective pressure, luminescence is a disadvantage for bacteria, perhaps due to consumption of a significant portion of cell energy. However, when the same experiments were repeated but cultures were irradiated with low UV doses, luminescent bacteria started to predominate shortly after the irradiation. Therefore, we conclude that stimulation of photoreactivation may be an evolutionary drive for bacterial bioluminescence. Copyright © 2003 John Wiley & Sons, Ltd.
