Similar literature
20 similar articles found (search time: 15 ms)
1.
In the wake of the numerous now-fruitful genome projects, we have witnessed a 'tsunami' of sequence data and with it the birth of the field of bioinformatics. Bioinformatics involves the application of information technology to the management and analysis of biological data. For many of us, this means that databases and their search tools have become an essential part of the research environment. However, the rate of sequence generation and the haphazard proliferation of databases have made it difficult to keep pace with developments, even for the cognoscenti. Moreover, increasing amounts of sequence information do not necessarily equate with an increase in knowledge, and in the panic to automate the route from raw data to biological insight, we may be generating and propagating innumerable errors in our precious databases. In the genome era upon us, researchers want rapid, easy-to-use, reliable tools for functional characterisation of newly determined sequences. For the pharmaceutical industry in particular, the Pandora's box of bioinformatics harbours an information-rich nugget, ripe with potential drug targets and possible new avenues for the development of therapeutic agents. This review outlines the current status of the major pattern databases now used routinely in the analysis of protein sequences. The review is divided into three main sections. In the first, commonly used terms are defined and the methods behind the databases are briefly described; in the second, the structure and content of the principal pattern databases are discussed; and in the final part, several alignment databases, which are frequently confused with pattern databases, are mentioned. For the new-comer, the array of resources, the range of methods behind them and the different tools required to search them can be confusing. 
The review therefore also briefly mentions a current international endeavour to integrate the diverse databases, an effort that should facilitate sequence analysis in the future. This is particularly important for target-discovery programmes, where the challenge is to rationalise the enormous numbers of potential targets generated by sequence database searches. This problem may be addressed, at least in part, by reducing search outputs to the more focused and manageable subsets suggested by searches of integrated groups of family-specific pattern databases.

2.
Databases and computational tools are increasingly important in the study of allergies, particularly in the assessment of allergenicity and allergic cross-reactivity. The ALLERDB database contains sequences of allergens and information on reported cross-reactivity between allergens. It focuses on analysis of allergenicity and allergic cross-reactivity of clinically relevant protein allergens. The official IUIS allergen data were extracted from the IUIS Allergen Nomenclature Sub-Committee website, and their sequence information from public databases and reference publications. The analysis tools assist allergen data analysis and retrieval, and include keyword searching, BLAST, prediction of allergenicity, a modification of BLAST that displays cross-reactive allergens, and graphical representation of cross-reactivity data. ALLERDB is a new kind of allergen database with a rich set of tools for sequence comparison, pattern identification, and visualization of results. It is accessible at http://research.i2r.a-star.edu.sg/Templar/DB/Allergen.

3.
Current research on gene regulatory mechanisms is increasingly dependent on the availability of high-quality information from manually curated databases. Biocurators undertake the task of extracting knowledge claims from scholarly publications, organizing these claims in a meaningful format and making them computable. In doing so, they enhance the value of existing scientific knowledge by making it accessible to the users of their databases. In this capacity, biocurators are well positioned to identify and weed out information that is of insufficient quality. The criteria that define information quality are typically outlined in curation guidelines developed by biocurators. These guidelines have been prudently developed to reflect the needs of the user community the database caters to. The guidelines depict the standard evidence that this community recognizes as sufficient justification for trustworthy data. Additionally, these guidelines determine the process by which data should be organized and maintained to be valuable to users. Following these guidelines, biocurators assess the quality, reliability, and validity of the information they encounter. In this article we explore to what extent different use cases agree with the inclusion criteria, implemented by the database, that define positive and negative data. What are the drawbacks to users who have queries that would be well served by results that fall just short of the criteria used by a database? Finally, how can databases (and biocurators) accommodate the needs of such more explorative use cases?

4.
Linking similar proteins structurally is a challenging task that may help in finding novel members of a protein family. In this respect, identification of conserved sequences can facilitate understanding and classifying the exact role of proteins. However, the exact role of these conserved elements cannot be elucidated without structural and physicochemical information. In this work, we present a novel desktop application, MotViz, designed for searching and analyzing the conserved sequence segments within protein structures. With MotViz, the user can extract a complete list of sequence motifs from loaded 3D structures, annotate the motifs structurally and analyze their physicochemical properties. The conservation value calculated for an individual motif can be visualized graphically. To check the efficiency, predicted motifs from the data sets of 9 protein families were analyzed, and the MotViz algorithm was more efficient than other online motif prediction tools. Furthermore, a database was also integrated for storing, retrieving and performing detailed functional annotation studies. In summary, MotViz effectively predicts motifs with high sensitivity and simultaneously visualizes them in 3D structures. Moreover, MotViz is user-friendly, with optimized graphical parameters and better processing speed due to the inclusion of a database at the back end. MotViz is available at http://www.fi-pk.com/motviz.html.

5.
Pathways database system: an integrated system for biological pathways (cited 1 time in total; self-citations: 0; citations by others: 1)
MOTIVATION: During the next phase of the Human Genome Project, research will focus on functional studies that attribute functions to genes, their regulatory elements, and other DNA sequences. To facilitate the use of genomic information in such studies, a new modeling perspective is needed to examine and study genome sequences in the context of many kinds of biological information. Pathways are the logical format for modeling and presenting such information in a manner that is familiar to biological researchers. RESULTS: In this paper we present an integrated system, called Pathways Database System, with a set of software tools for modeling, storing, analyzing, visualizing, and querying biological pathways data at different levels of genetic, molecular, biochemical and organismal detail. The novel features of the system include: (a) genomic information integrated with other biological data and presented from a pathway, rather than from the DNA sequence, perspective; (b) design for biologists who are possibly unfamiliar with genomics, but whose research is essential for annotating gene and genome sequences with biological functions; (c) database design, implementation and graphical tools which enable users to visualize pathways data at multiple abstraction levels, and to pose predetermined queries; and (d) an implementation that allows for web (XML)-based dissemination of query outputs (i.e. pathways data) to researchers in the community, giving them control over the use of pathways data. AVAILABILITY: Available on request from the authors.

6.
Vector NTI, a balanced all-in-one sequence analysis suite (cited 6 times in total; self-citations: 0; citations by others: 6)
Vector NTI is a well-balanced desktop application integrated for molecular sequence analysis and biological data management. It has a centralised database and five application modules: Vector NTI, AlignX, BioAnnotator, ContigExpress and GenomBench. In this review, the features and functions available in this software are examined. These include database management, primer design, virtual cloning, alignments, sequence assembly, 3D molecular viewer and internet tools. Some problems encountered when using this software are also discussed. It is hoped that this review will introduce this software to more molecular biologists so they can make better-informed decisions when choosing computational tools to facilitate their everyday laboratory work. This tool can save time and enhance analysis but it requires some learning on the user's part and there are some issues that need to be addressed by the developer.

7.
The protein kinase superfamily is an important group of enzymes controlling cellular signaling cascades. The increasing amount of available experimental data provides a foundation for deeper understanding of details of signaling systems and the underlying cellular processes. Here, we describe the Protein Kinase Resource (PKR), an integrated online service that provides access to information relevant to cell signaling and enables kinase researchers to visualize and analyze the data directly in an online environment. The data set is synchronized with the UniProt and Protein Data Bank (PDB) databases and is regularly updated and verified. Additional annotation includes interactive display of domain composition, cross-references between orthologs and functional mapping to OMIM records. The Protein Kinase Resource provides an integrated view of the protein kinase superfamily by linking data with their visual representation. Thus, human kinases can be mapped onto the human kinome tree via an interactive display. Sequence and structure data can be easily displayed using applications developed for the PKR and integrated with the website and the underlying database. Advanced search mechanisms, such as multiparameter lookup, sequence pattern, and BLAST search, enable fast access to the desired information, while statistics tools provide the ability to analyze the relationships among the kinases under study. The integration of data presentation and visualization implemented in the Protein Kinase Resource can be adapted by other online providers of scientific data and should become an effective way to access available experimental information.

8.
9.
MOTIVATION: Protein sequence and family data is accumulating at such a rapid rate that state-of-the-art databases and interface tools are required to aid curators with their classifications. We have designed such a system, MetaFam, to facilitate the comparison and integration of public protein sequence and family data. This paper presents the global schema, integration issues, and query capabilities of MetaFam. RESULTS: MetaFam is an integrated data warehouse of information about protein families and their sequences. This data has been collected into a consistent global schema, and stored in an Oracle relational database. The warehouse implementation allows for quick removal of outdated data sets. In addition to the relational implementation of the primary schema, we have developed several derived tables that enable efficient access from data visualization and exploration tools. Through a series of straightforward SQL queries, we demonstrate the usefulness of this data warehouse for comparing protein family classifications and for functional assignment of new sequences.
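
As an illustration of the kind of straightforward SQL query the abstract refers to, the following is a minimal sketch, not MetaFam's actual schema or queries. It assumes a single hypothetical table, `family_member`, with invented source names (`DB_A`, `DB_B`) and identifiers, and uses Python's built-in `sqlite3` module in place of Oracle. The query counts how many sequences each pair of families from two classifications has in common.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (source database, family, sequence).
# None of these names come from MetaFam itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE family_member (source TEXT, family_id TEXT, sequence_id TEXT)"
)
rows = [
    ("DB_A", "A_fam1", "seq1"),
    ("DB_A", "A_fam1", "seq2"),
    ("DB_A", "A_fam2", "seq3"),
    ("DB_B", "B_fam9", "seq1"),
    ("DB_B", "B_fam9", "seq2"),
    ("DB_B", "B_fam9", "seq3"),
]
conn.executemany("INSERT INTO family_member VALUES (?, ?, ?)", rows)

# Self-join on sequence_id: for every pair of families drawn from the two
# sources, count the sequences that both families contain.
query = """
SELECT a.family_id, b.family_id, COUNT(*) AS shared
FROM family_member AS a
JOIN family_member AS b ON a.sequence_id = b.sequence_id
WHERE a.source = 'DB_A' AND b.source = 'DB_B'
GROUP BY a.family_id, b.family_id
ORDER BY a.family_id, b.family_id
"""
overlaps = conn.execute(query).fetchall()
for family_a, family_b, shared in overlaps:
    print(family_a, family_b, shared)
```

In this toy data, one classification splits into two families what the other treats as a single family, which is the sort of disagreement such a comparison is meant to surface.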

10.
Babnigg G, Giometti CS. Proteomics, 2006, 6(16): 4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
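
The abstract states only that a SEGUID is derived from the primary sequence. The sketch below assumes the construction as it is commonly described elsewhere: a SHA-1 digest of the upper-cased sequence, Base64-encoded, with the trailing padding removed. Treat that recipe, and the function name `seguid`, as assumptions rather than something confirmed by the text above.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a sequence-derived identifier (assumed recipe: SHA-1 + Base64)."""
    # Upper-casing makes the identifier independent of letter case.
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # A 20-byte digest encodes to 28 Base64 characters, the last of which is
    # '=' padding; stripping it leaves a 27-character identifier.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# The same sequence always maps to the same identifier, whichever database
# it came from, so the identifier can act as a join key between databases.
print(seguid("MKTAYIAKQR"))
```

Because the identifier depends only on the residues, annotation changes in any source database leave it untouched, which is the stability property the abstract relies on.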

11.
MOTIVATION: The recent rapid rise in the availability of whole genome DNA sequence data has led to bottlenecks in their complete analysis. Specifically, there is a need for software tools that will allow mining of gene and putative gene data at a whole genome level. These new tools will complement the current set already in use for studying specific aspects of individual genes and putative genes in detail. A key software challenge is to make them user-friendly, without losing their flexibility and capability for use in research. RESULTS: GeneOrder, a web-based, interactive computational tool, allows researchers to compare the order of genes in two genomes. It has been tested on full genome sequence data for viruses, mitochondria and chloroplasts that were obtained from the NCBI GenBank database. It is accessible at http://www.bif.atcc.org/GENEOrder/index.html. GeneOrder prepares the comparison in table form, listing the order of similar genes. Hyperlinks are provided from this output; these lead to the 'Protein Coding Regions' in the NCBI database.

12.
Multiple alignments among genomes are becoming increasingly prevalent. This trend motivates the development of tools for efficient homology search between a query sequence and a database of multiple alignments. In this paper, we present an algorithm that uses the information implicit in a multiple alignment to dynamically build an index that is weighted most heavily towards the promising regions of the multiple alignment. We have implemented Typhon, a local alignment tool that incorporates our indexing algorithm, which our test results show to be more sensitive than algorithms that index only a sequence. This suggests that when applied on a whole-genome scale, Typhon should provide improved homology searches in time comparable to existing algorithms.

13.
Numerous computer-based statistical packages have been developed in recent years, and it has become easier to analyze nucleotide sequence data and gather subsequent information that would not normally be available. Multilocus sequence typing (MLST) is used for characterizing isolates of bacterial and fungal species and uses nucleotide sequences of internal fragments of housekeeping genes. This method is finding a place in clinical microbiology and public health by providing data for epidemiological surveillance and development of vaccine policy. It adds greatly to our knowledge of the genetic variation that can occur within a species and has therefore been used for studies of population biology. Analysis requires the detailed interpretation of nucleotide sequence data obtained from housekeeping and nonhousekeeping genes. This is due to the amount of data generated by nucleotide sequencing; the information generated from an array of analytical tools improves our understanding of bacterial pathogens. This can benefit public health interventions and the development of enhanced therapies and vaccines. This review concentrates on the analytical tools used in MLST and their use in the clinical microbiology and public health fields.

14.
Profile search methods based on protein domain alignments have proven to be useful tools in comparative sequence analysis. Domain alignments used by currently available search methods have been computed by sequence comparison. With the growth of the protein structure database, however, alignments of many domain pairs have also been computed by structure comparison. Here, we examine the extent to which information from these two sources agrees. We measure agreement with respect to identification of homologous regions in each protein, that is, with respect to the location of domain boundaries. We also measure agreement with respect to identification of homologous residue sites by comparing alignments and assessing the accuracy of the molecular models they predict. We find that domain alignments in publicly available collections based on sequence and structure comparison are largely consistent. However, the homologous regions identified by sequence comparison are often shorter than those identified by 3D structure comparison. In addition, when overall sequence similarity is low, alignments from sequence comparison produce less accurate molecular models, suggesting that they less accurately identify homologous sites. These observations suggest that structure comparison results might be used to improve the overall accuracy of domain alignment collections and the performance of profile search methods based on them.

15.
Many protein pairs that share the same fold do not have any detectable sequence similarity, providing a valuable source of information for studying sequence-structure relationship. In this study, we use a stringent data set of structurally similar, sequence-dissimilar protein pairs to characterize residues that may play a role in the determination of protein structure and/or function. For each protein in the database, we identify amino-acid positions that show residue conservation within both close and distant family members. These positions are termed "persistently conserved". We then proceed to determine the "mutually" persistently conserved (MPC) positions: those structurally aligned positions in a protein pair that are persistently conserved in both pair mates. Because of their intra- and interfamily conservation, these positions are good candidates for determining protein fold and function. We find that 45% of the persistently conserved positions are mutually conserved. A significant fraction of them are located in critical positions for secondary structure determination, they are mostly buried, and many of them form spatial clusters within their protein structures. A substitution matrix based on the subset of MPC positions shows two distinct characteristics: (i) it is different from other available matrices, even those that are derived from structural alignments; (ii) its relative entropy is high, emphasizing the special residue restrictions imposed on these positions. Such a substitution matrix should be valuable for protein design experiments.

16.
The PSI-BLAST algorithm has been acknowledged as one of the most powerful tools for detecting remote evolutionary relationships by sequence considerations only. This has been demonstrated by its ability to recognize remote structural homologues and by the greatest coverage it enables in annotation of a complete genome. Although recognizing the correct fold of a sequence is of major importance, the accuracy of the alignment is crucial for the success of modeling one sequence by the structure of its remote homologue. Here we assess the accuracy of PSI-BLAST alignments on a stringent database of 123 structurally similar, sequence-dissimilar pairs of proteins, by comparing them to the alignments defined on a structural basis. Each protein sequence is compared to a nonredundant database of the protein sequences by PSI-BLAST. Whenever a pair member detects its pair-mate, the positions that are aligned both in the sequential and structural alignments are determined, and the alignment sensitivity is expressed as the percentage of these positions out of the structural alignment. Fifty-two sequences detected their pair-mates (for 16 pairs the success was bi-directional when either pair member was used as a query). The average percentage of correctly aligned residues per structural alignment was 43.5 ± 2.2%. Other properties of the alignments were also examined, such as the sensitivity vs. specificity and the change in these parameters over consecutive iterations. Notably, there is an improvement in alignment sensitivity over consecutive iterations, reaching an average of 50.9 ± 2.5% within the five iterations tested in the current study.
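
The sensitivity measure defined in this abstract can be written down in a few lines. The sketch below assumes each alignment is represented as a collection of `(i, j)` residue-index pairs, meaning residue `i` of the first protein is aligned with residue `j` of the second; that representation, the function name and the toy data are illustrative choices, not taken from the paper.

```python
def alignment_sensitivity(sequence_alignment, structural_alignment):
    """Percentage of structurally aligned pairs also present in the sequence alignment."""
    structural = set(structural_alignment)
    if not structural:
        raise ValueError("structural alignment must contain at least one pair")
    # Positions aligned identically in both alignments.
    shared = structural & set(sequence_alignment)
    return 100.0 * len(shared) / len(structural)


# Toy example: the structural alignment is the reference.
structural = [(1, 1), (2, 2), (3, 3), (4, 5)]
psi_blast = [(1, 1), (2, 2), (3, 4), (4, 5), (5, 6)]
sensitivity = alignment_sensitivity(psi_blast, structural)
print(sensitivity)  # 3 of the 4 structural pairs are reproduced
```

Dividing by the size of the sequence alignment instead of the structural one would give a specificity-style measure, which is one way to read the sensitivity versus specificity comparison the abstract mentions.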

17.
Protein sequence alignment has become an essential task in modern molecular biology research. A number of alignment techniques have been documented in the literature, and their corresponding tools are available as freeware and commercial software. Choosing and using these tools for sequence alignment, through to the complete interpretation of alignment results, is often considered non-trivial by end-users with limited skill in bioinformatics algorithm development. Here, for educational purposes, we compare sequence alignment techniques based on dynamic programming (N-W, S-W) and heuristics (LFASTA, BL2SEQ) on four sets of sequence data. The analysis suggests that heuristics-based methods are faster than dynamic programming methods in alignment speed.
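
To make the dynamic-programming side of this comparison concrete, here is a minimal sketch of Needleman-Wunsch (N-W) global alignment scoring. The scoring values (match +1, mismatch -1, linear gap penalty -1) are illustrative assumptions; real protein alignments normally use a substitution matrix and affine gap penalties. The function returns only the optimal score and keeps two rows of the table at a time.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # Row 0: aligning the empty prefix of a against each prefix of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(
                max(
                    prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                    prev[j] + gap,               # gap in b
                    curr[j - 1] + gap,           # gap in a
                )
            )
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
```

Every one of the (len(a) + 1) x (len(b) + 1) cells is filled, so the running time grows with the product of the sequence lengths. Heuristic tools avoid examining most of those cells, which is consistent with the speed difference the abstract reports.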

18.
Many biological databases that provide comparative genomics information and tools are now available on the internet. While certainly quite useful, to our knowledge none of the existing databases combine results from multiple comparative genomics methods with manually curated information from the literature. Here we describe the Princeton Protein Orthology Database (P-POD, http://ortholog.princeton.edu), a user-friendly database system that allows users to find and visualize the phylogenetic relationships among predicted orthologs (based on the OrthoMCL method) to a query gene from any of eight eukaryotic organisms, and to see the orthologs in a wider evolutionary context (based on the Jaccard clustering method). In addition to the phylogenetic information, the database contains experimental results manually collected from the literature that can be compared to the computational analyses, as well as links to relevant human disease and gene information via the OMIM, model organism, and sequence databases. Our aim is for the P-POD resource to be extremely useful to typical experimental biologists wanting to learn more about the evolutionary context of their favorite genes. P-POD is based on the commonly used Generic Model Organism Database (GMOD) schema and can be downloaded in its entirety for installation on one's own system. Thus, bioinformaticians and software developers may also find P-POD useful because they can use the P-POD database infrastructure when developing their own comparative genomics resources and database tools.

19.
MOTIVATION: Tandem mass spectrometry combined with sequence database searching is one of the most powerful tools for protein identification. As thousands of spectra are generated by a mass spectrometer in one hour, the speed of database searching is critical, especially when searching against a large sequence database, when the peptide is generated by some unknown or non-specific enzyme, or even when the target peptides have post-translational modifications (PTM). In practice, about 70-90% of the spectra have no match in the database. Many believe that a significant portion of them are due to peptides of non-specific digestions by unknown enzymes or amino acid modifications. In another case, scientists may choose to use some non-specific enzymes such as pepsin or thermolysin for proteolysis in proteomic studies, because not all proteins are amenable to digestion by site-specific enzymes, and furthermore many digested peptides may not fall within the range of molecular weight suitable for mass spectrometry analysis. Interpreting mass spectra of these kinds costs database search engines a lot of computational time. OVERVIEW: The present study was designed to speed up the database searching process for both cases. More specifically, we employed an approach combining a suffix tree data structure and a spectrum graph. The suffix tree is used to preprocess the protein sequence database, while the spectrum graph is used to preprocess the tandem mass spectrum. We then search the suffix tree against the spectrum graph for candidate peptides. We design an efficient algorithm to compute a matching threshold with some statistical significance level, e.g. p = 0.01, for each spectrum, and use it to select candidate peptides. Then we rank these peptides using a SEQUEST-like scoring function. The algorithms were implemented and tested on experimental data. For post-translational modifications, we allow an arbitrary number of any modification to a protein.
AVAILABILITY: The executable program and other supplementary materials are available online at: http://hto-c.usc.edu:8000/msms/suffix/.

20.
Choi Y, Deane CM. Molecular BioSystems, 2011, 7(12): 3327-3334
Antibodies are used extensively in medical and biological research. Their complementarity determining regions (CDRs) define the majority of their antigen binding functionality. CDR structures have been intensively studied and classified (canonical structures). Here we show that CDR structure prediction is no different from the standard loop structure prediction problem and predict them without classification. FREAD, a successful database loop prediction technique, is able to produce accurate predictions for all CDR loops (0.81, 0.42, 0.96, 0.98, 0.88 and 2.25 Å RMSD for CDR-L1 to CDR-H3). In order to overcome the relatively poor predictions of CDR-H3, we developed two variants of FREAD, one focused on sequence similarity (FREAD-S) and another which includes contact information (ConFREAD). Both of the methods improve accuracy for CDR-H3, to 1.34 Å and 1.23 Å respectively. The FREAD variants are also tested on homology models and compared to RosettaAntibody (CDR-H3 prediction on models: 1.98 and 2.62 Å for ConFREAD and RosettaAntibody respectively). CDRs are known to change their structural conformations upon binding the antigen. Traditional CDR classifications are based on sequence similarity and do not account for such environment changes. Using a set of antigen-free and antigen-bound structures, we compared our FREAD variants. ConFREAD, which includes contact information, successfully discriminates the bound and unbound CDR structures and achieves an accuracy of 1.35 Å for bound structures of CDR-H3.
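
The accuracies quoted in this abstract are root-mean-square deviations (RMSD) between predicted and experimental loop coordinates. The following sketch shows the basic calculation for two equal-length lists of 3D points, assuming the structures have already been superimposed and the atoms are listed in the same order. Which atoms the authors compared and how they superimposed the structures is not stated in the abstract, so those details are left out.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared distances between corresponding atoms.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy example with made-up coordinates in angstroms.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(predicted, observed))
```

Because deviations are squared before averaging, a few badly placed atoms raise the RMSD sharply, which helps explain why a long, variable loop such as CDR-H3 scores worse than the shorter CDRs.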
