20 similar articles found (search time: 15 ms)
1.
The study of protein structure has been driven largely by the careful inspection of experimental data by human experts. However, the rapid determination of protein structures from structural-genomics projects will make it increasingly difficult to analyse (and determine the principles responsible for) the distribution of proteins in fold space by inspection alone. Here, we demonstrate a machine-learning strategy that automatically determines the structural principles describing 45 folds. The rules learnt were shown to be both statistically significant and meaningful to protein experts. With the increasing emphasis on high-throughput experimental initiatives, machine-learning and other automated methods of analysis will become increasingly important for many biological problems.
2.
Automated discovery of 3D motifs for protein function annotation (total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Function inference from structure is facilitated by the use of patterns of residues (3D motifs), normally identified by expert knowledge, that correlate with function. As an alternative to often limited expert knowledge, we use machine-learning techniques to identify patterns of 3-10 residues that maximize function prediction. This approach allows us to test the assumption that residues that provide function are the most informative for predicting function. RESULTS: We apply our method, GASPS, to the haloacid dehalogenase, enolase, amidohydrolase and crotonase superfamilies and to the serine proteases. The motifs found by GASPS are as good at function prediction as 3D motifs based on expert knowledge. The GASPS motifs with the greatest ability to predict protein function consist mainly of known functional residues. However, several residues with no known functional role are equally predictive. For four groups, we show that the predictive power of our 3D motifs is comparable with or better than approaches that use the entire fold (Combinatorial-Extension) or sequence profiles (PSI-BLAST). AVAILABILITY: Source code is freely available for academic use by contacting the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
3.
4.
ABSTRACT: BACKGROUND: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. RESULTS: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related--a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results are obtained in about a day. CONCLUSIONS: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. 
At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.
5.
Automated protein function prediction--the genomic challenge (total citations: 2; self-citations: 0; citations by others: 2)
Friedberg I 《Briefings in bioinformatics》2006,7(3):225-242
Overwhelmed with genomic data, biologists are facing the first big post-genomic question--what do all genes do? First, not only is the volume of pure sequence and structure data growing, but its diversity is growing as well, leading to a disproportionate growth in the number of uncharacterized gene products. Consequently, established methods of gene and protein annotation, such as homology-based transfer, are annotating less data and in many cases are amplifying existing erroneous annotation. Second, there is a need for a functional annotation which is standardized and machine readable so that function prediction programs could be incorporated into larger workflows. This is problematic due to the subjective and contextual definition of protein function. Third, there is a need to assess the quality of function predictors. Again, the subjectivity of the term 'function' and the various aspects of biological function make this a challenging effort. This article briefly outlines the history of automated protein function prediction and surveys the latest innovations in all three topics.
6.
Background
Molecular signatures are sets of genes, proteins, genetic variants or other variables that can be used as markers for a particular phenotype. Reliable signature discovery methods could yield valuable insight into cell biology and mechanisms of human disease. However, it is currently not clear how to control error rates such as the false discovery rate (FDR) in signature discovery. Moreover, signatures for cancer gene expression have been shown to be unstable, that is, difficult to replicate in independent studies, casting doubts on their reliability.
7.
Automated generation and refinement of protein signatures: case study with G-protein coupled receptors (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Previous work had established that it was possible to derive sparse signatures (essentially sequence-length motifs) by examining points of contact between residues in proteins of known three-dimensional (3D) structure. Many interesting protein families have very little tertiary structural information. Methods for deriving signatures using only primary and secondary-structural information were therefore developed. RESULTS: Two methods for deriving protein signatures using protein sequence information and predicted secondary structures are described. One method is based on a scoring approach, the other on the Genetic Algorithm (GA). The effectiveness of the method was tested on the superfamily of GPCRs and compared with the established hidden Markov model (HMM) method. The signature method is shown to perform well, detecting 68% of superfamily members before the first false positive sequence and detecting several distant relationships. The GA population was used to provide information on alignment regions of particular importance for selection of key residues.
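The evaluation measure quoted in this abstract, the share of family members detected before the first false positive, can be illustrated with a short Python sketch. This is only a generic reading of that measure, not the authors' code: the function name, the `(sequence_id, is_member)` input format and the toy hit list are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def coverage_before_first_false_positive(
    ranked_hits: Sequence[Tuple[str, bool]], total_members: int
) -> float:
    """Fraction of true family members ranked above the first false positive.

    ranked_hits: hypothetical (sequence_id, is_member) pairs, best score first.
    total_members: number of true family members in the searched database.
    """
    if total_members <= 0:
        raise ValueError("total_members must be positive")
    found = 0
    for _seq_id, is_member in ranked_hits:
        if not is_member:
            break  # stop counting at the first false positive
        found += 1
    return found / total_members


# Toy ranking (invented data): two members are found before the first
# non-member, out of three members in total.
toy_hits = [("s1", True), ("s2", True), ("s3", False), ("s4", True)]
print(coverage_before_first_false_positive(toy_hits, total_members=3))
```

The measure is strict by design: members ranked below the first false positive, such as `s4` in the toy list, do not count.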
8.
This paper presents a web service named MAGIIC-PRO, which aims to discover functional signatures of a query protein by sequential pattern mining. Automatic discovery of patterns from unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO differs from several previously established methods performing similar tasks in two major ways. The first remarkable feature of MAGIIC-PRO is its efficiency in delivering long patterns. By incorporating a new type of gap constraint and some state-of-the-art data mining techniques, MAGIIC-PRO usually identifies patterns that satisfy the constraints within an acceptable response time. This efficiency enables users to quickly discover functional signatures whose residues do not come from a single region of the protein sequences or are conserved in only a few members of a protein family. The second remarkable feature of MAGIIC-PRO is its effort in refining the mining results. Considering large flexible gaps improves the completeness of the derived functional signatures. Users can be guided directly to the patterns with as many simultaneously conserved blocks as possible. In this paper, we show by experiments that MAGIIC-PRO is efficient and effective in identifying ligand-binding sites and hot regions in protein-protein interactions directly from sequences. The web service is available at http://biominer.bime.ntu.edu.tw/magiicpro and a mirror site at http://biominer.cse.yzu.edu.tw/magiicpro.
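To make the idea of a pattern built from conserved blocks separated by flexible gaps concrete, here is a minimal Python sketch. It is not MAGIIC-PRO's mining algorithm, which the abstract does not describe in detail; it only checks how many sequences contain a given gapped pattern, using an ordinary regular expression. The function names, block strings, gap bounds and sequences are all hypothetical.

```python
import re
from typing import List, Sequence, Tuple


def build_gapped_regex(blocks: List[str], gaps: List[Tuple[int, int]]) -> re.Pattern:
    """Compile conserved blocks joined by bounded gaps into one regex.

    blocks: conserved residue blocks, in order (hypothetical example input).
    gaps: (min_len, max_len) of the gap between each pair of adjacent blocks.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between adjacent blocks")
    parts = [re.escape(blocks[0])]
    for (lo, hi), block in zip(gaps, blocks[1:]):
        parts.append(f".{{{lo},{hi}}}")  # any residues, between lo and hi of them
        parts.append(re.escape(block))
    return re.compile("".join(parts))


def support(pattern: re.Pattern, sequences: Sequence[str]) -> int:
    """Count the sequences that contain the pattern at least once."""
    return sum(1 for seq in sequences if pattern.search(seq))


# Invented toy data: block "HE", then a gap of 2 to 5 residues, then block "GK".
toy_pattern = build_gapped_regex(["HE", "GK"], [(2, 5)])
toy_sequences = ["AAHEAAAGKTT", "HEGK", "MHELLLLLLLGKA"]
print(support(toy_pattern, toy_sequences))
```

Only the first toy sequence matches: the second has no gap between the blocks and the third has a gap of seven residues, both outside the assumed 2 to 5 range.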
9.
Pannucci J Cai H Pardington PE Williams E Okinaka RT Kuske CR Cary RB 《Biosensors & bioelectronics》2004,20(4):706-718
Rapid, accurate, and sensitive detection of biothreat agents requires a broad-spectrum assay capable of discriminating between closely related microbial or viral pathogens. Moreover, in cases where a biological agent release has been identified, forensic analysis demands detailed genetic signature data for accurate strain identification and attribution. To date, nucleic acid sequences have provided the most robust and phylogenetically illuminating signature information. Nucleic acid signature sequences are not often linked to genomic or extrachromosomal determinants of virulence, a link that would further facilitate discrimination between pathogens and closely related species. Inextricably coupling genetic determinants of virulence with highly informative nucleic acid signatures would provide a robust means of identifying human, livestock, and agricultural pathogens. By means of example, we present here an overview of two general applications of microarray-based methods for: (1) the identification of candidate virulence factors; and (2) the analysis of genetic polymorphisms that are coupled to Bacillus anthracis virulence factors using an accurate, low cost solid-phase mini-sequencing assay. We show that microarray-based analysis of gene expression can identify potential virulence associated genes for use as candidate signature targets, and, further, that microarray-based single nucleotide polymorphism assays provide a robust platform for the detection and identification of signature sequences in a manner independent of the genetic background in which the signature is embedded. We discuss the strategy as a general approach or pipeline for the discovery of virulence-linked nucleic acid signatures for biothreat agents.
10.
Among the multitude of methods available for the study of origin and evolution of various life forms on Earth, the phylogenetic approach, i.e. the delineation of natural genetic relatedness amongst different groups of organisms, has been of particular interest to evolutionary biologists. An approach towards analysing phylogeny is the comparison of genome sequences of extant organisms by a variety of computational techniques. These studies rely mostly on the similarity or dissimilarity in global character of the genome in terms of sequence, without any consideration to its structure. In this work, we report a potentially new methodology towards elucidation of molecular phylogeny. The approach considers a structural parameter of the genome, namely its flexibility, and uses it to compare the small subunit ribosomal ribonucleic acid (SSU rRNA) gene from a cross-section of species. We find that the flexibility pattern of the genome is strikingly similar in organisms that are closer in evolutionary distance than the ones that are separated. This method of comparison thus might be utilised in constructing phylogenetic trees from flexibility patterns derived from nucleotide sequence.
11.
The sequence infrastructure that has arisen through large-scale genomic projects dedicated to protein analysis has provided a wealth of information and brought together scientists and institutions from all over the world. As a consequence, the development of novel technologies and methodologies in proteomics research is helping to unravel the biochemical and physiological mechanisms of complex multivariate diseases at both a functional and molecular level. In the late sixties, when X-ray crystallography had just been established, the idea of determining protein structure on an almost universal basis was akin to an impossible dream or a miracle. Yet only forty years later, automated protein structure determination platforms have been established. The widespread use of robotics in protein crystallography has had a huge impact at every stage of the pipeline, from protein cloning, over-expression, purification, crystallization, data collection, structure solution, refinement and validation to data management, all of which have become more or less automated with minimal human intervention necessary. Here, recent advances in protein crystal structure analysis in the context of structural genomics will be discussed. In addition, this review aims to give an overview of recent developments in high-throughput instrumentation, and technologies and strategies to accelerate protein structure/function analysis.
12.
A number of recent advances have been made in deriving function information from protein structure. A fold relationship to an already characterized protein will often allow general information about function to be deduced. More detailed information can be obtained using sequence relationships to already studied proteins. Methods of deducing function directly from structure, without the use of evolutionary relationships, are developing rapidly. All such methods may be used with models of protein structure, rather than with experimentally determined ones, but model accuracy imposes limitations. The rapid expansion of the structural genomics field has created a new urgency for improved methods of structure-based annotation of function.
13.
14.
Background
Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed.
15.
Comparisons of human genomes show that more base pairs are altered as a result of structural variation - including copy number variation - than as a result of point mutations. Here we review advances and challenges in the discovery and genotyping of structural variation. The recent application of massively parallel sequencing methods has complemented microarray-based methods and has led to an exponential increase in the discovery of smaller structural-variation events. Some global discovery biases remain, but the integration of experimental and computational approaches is proving fruitful for accurate characterization of the copy, content and structure of variable regions. We argue that the long-term goal should be routine, cost-effective and high quality de novo assembly of human genomes to comprehensively assess all classes of structural variation.
16.
The regulation of cellular traction forces on the extracellular matrix is critical to cell adhesion, migration, proliferation, and differentiation. Diverse lamellar actin organizations ranging from contractile lamellar networks to stress fibers are observed in adherent cells. Although lamellar organization is thought to reflect the extent of cellular force generation, understanding of the physical behaviors of the lamellar actin cytoskeleton is lacking. To elucidate these properties, we visualized the actomyosin dynamics and organization in U2OS cells over a broad range of forces. At low forces, contractile lamellar networks predominate and force generation is strongly correlated to actomyosin retrograde flow dynamics with nominal change in organization. Lamellar networks build ~60% of cellular tension over rapid time scales. At high forces, reorganization of the lamellar network into stress fibers results in moderate changes in cellular tension over slower time scales. As stress fibers build and tension increases, myosin band spacing decreases and α-actinin bands form. On soft matrices, force generation by lamellar networks is unaffected, whereas tension-dependent stress fiber assembly is abrogated. These data elucidate the dynamic and structural signatures of the actomyosin cytoskeleton at different levels of tension and set a foundation for quantitative models of cell and tissue mechanics.
17.
18.
Background
An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes.
Results
Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at http://genome.mshri.on.ca.
Conclusion
Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.
19.
Background
Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, on a real-time basis, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of specific interest for national and international public health organizations and international travelers. A particular task that makes such a surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts. This task is sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human.
20.
Sara M.Ø. Solbak Alok Sharma Karsten Bruns René Röder David Mitzner Friedrich Hahn Rebekka Niebert Anni Vedeler Petra Henklein Peter Henklein Ulrich Schubert Victor Wray Torgils Fossen 《Biochimica et Biophysica Acta - Proteins and Proteomics》2013,1834(2):568-582
The proapoptotic influenza A virus PB1-F2 protein contributes to viral pathogenicity and is present in most human and avian influenza isolates. The structures of full-length PB1-F2 of the influenza strains Pandemic flu 2009 H1N1, 1918 Spanish flu H1N1, Bird flu H5N1 and H1N1 PR8, have been characterized by NMR and CD spectroscopy. The study was conducted using chemically synthesized full-length PB1-F2 protein and fragments thereof. The amino acid residues 30–70 of PR8 PB1-F2 were found to be responsible for amyloid formation of the protein, which could be assigned to formation of β-sheet structures, although α-helices were the only structural features detected under conditions that mimic a membranous environment. At membranous conditions, in which the proteins are found in their most structured state, significant differences become apparent between the PB1-F2 variants investigated. In contrast to Pandemic flu 2009 H1N1 and PR8 PB1-F2, which exhibit a continuous extensive C-terminal α-helix, both Spanish flu H1N1 and Bird flu H5N1 PB1-F2 contain a loop region with residues 66–71 that divides the C-terminus into two shorter helices. The observed structural differences are located to the C-terminal ends of the proteins to which most of the known functions of these proteins have been assigned. A C-terminal helix–loop–helix motif might be a structural signature for PB1-F2 of the highly pathogenic influenza viruses as observed for 1918 Spanish flu H1N1 and Bird flu H5N1 PB1-F2. This signature could indicate the pathological nature of viruses emerging in the future and thus aid in the recognition of these viruses.