共查询到20条相似文献,搜索用时 0 毫秒
1.
UniqueProt is a practical and easy to use web service designed to create representative, unbiased data sets of protein sequences. The largest possible representative sets are found through a simple greedy algorithm using the HSSP-value to establish sequence similarity. UniqueProt is not a real clustering program in the sense that the 'representatives' are not at the centres of well-defined clusters since the definition of such clusters is problem-specific. Overall, UniqueProt is a reasonable fast solution for bias in data sets. The service is accessible at http://cubic.bioc.columbia.edu/services/uniqueprot; a command-line version for Linux is downloadable from this web site. 相似文献
2.
Nucleic acid and protein sequences contain a wealth of informationof interest to molecular biologists. The advent of molecularsequence databases provides a unique opportunity for the computeranalysis of all available sequences. Sequence databases servetwo main functions: (i) to facilitate comparisons with newlydetermined sequences, and (ii) to act as a source of data forthe generation and testing of hypotheses concerning molecularsequence organisation and evolution. The large amounts of sequencedata now becoming available require that algorithms for databasesearching be fast and efficient and considerable progress isbeing made in this area. 相似文献
3.
4.
5.
Background
Recognition of relevant sequence deviations can be valuable for elucidating functional differences between protein subfamilies. Interesting residues at highly conserved positions can then be mutated and experimentally analyzed. However, identification of such sites is tedious because automated approaches are scarce. 相似文献6.
A simple procedure is described for finding similarities between proteins using nucleotide sequence databases. The approach is illustrated by several examples of previously unknown correspondences with important biological implications: Drosophila elongation factor Tu is shown to be encoded by two genes that are differently expressed during development; a cluster of three Drosophila genes likely encode maltases; a flesh-fly fat body protein resembles the hypothesized Drosophila alcohol dehydrogenase ancestral protein; an unknown protein encoded at the multifunctional E. coli hisT locus resembles aspartate beta-semialdehyde dehydrogenase; and the E. coli tyrR protein is related to nitrogen regulatory proteins. These and other matches were discovered using a personal computer of the type available in most laboratories collecting DNA sequence data. As relatively few sequences were sampled to find these matches, it is likely that much of the existing data has not been adequately examined. 相似文献
7.
A strategy has been developed for the construction of a validated, comprehensive composite protein sequence database. Entries are amalgamated from primary source data bases by a largely automated set of processes in which redundant and trivially different entries are eliminated. A modular approach has been adopted to allow scientific judgement to be used at each stage of database processing and amalgamation. Source databases are assigned a priority depending on the quality of sequence validation and commenting. Rejection of entries from the lower priority database, in each pairwise comparison of databases, is carried out according to optionally defined redundancy criteria based on sequence segment mismatches. Efficient algorithms for this methodology are embodied in the COMPO software system. COMPO has been applied for over 2 years in construction and regular updating of the OWL composite protein sequence database from the source databases NBRF-PIR, SWISS-PROT, a GenBank translation retrieved from the feature tables, NBRF-NEW, NEWAT86, PSD-KYOTO and the sequences contained in the Brookhaven protein structure databank. OWL is part of the ISIS integrated data resource of protein sequence and structure [Akrigg et al. (1988) Nature, 335, 745-746]. The modular nature of the integration process greatly facilitates the frequent updating of OWL following releases of the source databases. The extent of redundancy in these sources is revealed by the comparison process. The advantages of a robust composite database for sequence similarity searching and information retrieval are discussed. 相似文献
8.
9.
10.
Protein sequence databases 总被引:2,自引:0,他引:2
A variety of protein sequence databases exist, ranging from simple sequence repositories, which store data with little or no manual intervention in the creation of the records, to expertly curated universal databases that cover all species and in which the original sequence data are enhanced by the manual addition of further information in each sequence record. As the focus of researchers moves from the genome to the proteins encoded by it, these databases will play an even more important role as central comprehensive resources of protein information. Several the leading protein sequence databases are discussed here, with special emphasis on the databases now provided by the Universal Protein Knowledgebase (UniProt) consortium. 相似文献
11.
12.
Randomized sequence databases for tandem mass spectrometry peptide and protein identification 总被引:4,自引:0,他引:4
Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications, using at best generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with changing organisms under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine the use of separate searches of a forward then a randomized database and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards. These methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend the use of use of combined searches of a reshuffled database appended to a forward sequence database as a means providing quantitative estimates of false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy as opposed to vague assessments such as "high confidence." 相似文献
13.
14.
The ACNUC biological sequence database system provides powerful and fast query and extraction capabilities to a variety of nucleotide and protein sequence databases. The collection of ACNUC databases served by the Pôle Bio-Informatique Lyonnais includes the EMBL, GenBank, RefSeq and UniProt nucleotide and protein sequence databases and a series of other sequence databases that support comparative genomics analyses: HOVERGEN and HOGENOM containing families of homologous protein-coding genes from vertebrate and prokaryotic genomes, respectively; Ensembl and Genome Reviews for analyses of prokaryotic and of selected eukaryotic genomes. This report describes the main features of the ACNUC system and the access to ACNUC databases from any internet-connected computer. Such access was made possible by the definition of a remote ACNUC access protocol and the implementation of Application Programming Interfaces between the C, Python and R languages and this communication protocol. Two retrieval programs for ACNUC databases, Query_win, with a graphical user interface and raa_query, with a command line interface, are also described. Altogether, these bioinformatics tools provide users with either ready-to-use means of querying remote sequence databases through a variety of selection criteria, or a simple way to endow application programs with an extensive access to these databases. Remote access to ACNUC databases is open to all and fully documented (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html). 相似文献
15.
Searching databases of conserved sequence regions by aligning protein multiple-alignments. 总被引:14,自引:2,他引:14
下载免费PDF全文

S Pietrokovski 《Nucleic acids research》1996,24(19):3836-3845
A general searching method for comparing multiple sequence alignments was developed to detect sequence relationships between conserved protein regions. Multiple alignments are treated as sequences of amino acid distributions and aligned by comparing pairs of such distributions. Four different comparison measures were tested and the Pearson correlation coefficient chosen. The method is sensitive, detecting weak sequence relationships between protein families. Relationships are detected beyond the range of conventional sequence database searches, illustrating the potential usefulness of the method. The previously undetected relation between flavoprotein subunits of two oxidoreductase families points to the potential active site in one of the families. The similarity between the bacterial RecA, DnaA and Rad51 protein families reveals a region in DnaA and Rad51 proteins likely to bind and unstack single-stranded DNA. Helix--turn--helix DNA binding domains from diverse proteins are readily detected and shown to be similar to each other. Glycosylasparaginase and gamma-glutamyltransferase enzymes are found to be similar in their proteolytic cleavage sites. The method has been fully implemented on the World Wide Web at URL: http://blocks.fhcrc.org/blocks-bin/LAMAvsearch. 相似文献
16.
Jin-Ho Ahn Kyung-Ho Lee Jae-Won Shim Eun Yeol Lee Dong-Myung Kim 《Biotechnology and Bioprocess Engineering》2013,18(6):1101-1108
In this study, we describe a cell-free protein synthesis consolidated with polymerase chain reaction (PCR)-based synthetic gene assembly that allows for streamlined translation of genetic information. In silico-designed fragments of target genes were PCR-assembled and directly expressed in a cell-free synthesis system to generate functional proteins. This method bypasses the procedures required in conventional cell-based gene expression methods, integrates gene synthesis and cell-free protein synthesis, shortens the time to protein production, and allows for facile regulation of gene expression by manipulating the oligomer sequences used for gene synthesis. The strategy proposed herein expands the flexibility and throughput of the protein synthesis process, a fundamental component in the construction of synthetic biological systems. 相似文献
17.
Rackovsky S Scheraga HA 《Journal of biomolecular structure & dynamics》2011,28(4):593-4; discussion 669-674
18.
19.
We provide a detailed overview of the choices inherent in performing a sequence database search, including the choice of algorithm, substitution matrix and gap model. Each of these choices has implications that can be described as restrictions on the underlying model of sequence evolution, the expected degree of divergence between the query sequence and the database sequences (if one uses an evolutionary based matrix), as well as the sensitivity and selectivity of the search. We conclude with a series of recommendations for researchers performing these searches based on our experience and literature studies. 相似文献
20.