1.

Uncovering conserved patterns in bioactive peptides in Metazoa

Liu F Baggerman G Schoofs L Wets G 《Peptides》2006,27(12):3137-3153

Bioactive (neuro)peptides play critical roles in regulating most biological processes in animals. Peptides belonging to the same family are characterized by a typical sequence pattern that is conserved among the family's peptide members. Such a conserved pattern or motif usually corresponds to the functionally important part of the biologically active peptide. In this paper, all known bioactive (neuro)peptides annotated in the Swiss-Prot and TrEMBL protein databases are collected, and the pattern searching program Pratt is used to search these unaligned peptide sequences for conserved patterns. The obtained patterns are then refined by combining the information on amino acids at important functional sites collected from the literature. All the identified patterns are further tested by scanning them against the Swiss-Prot and TrEMBL protein databases. The diagnostic power of each pattern is validated by the fact that any annotated protein from Swiss-Prot and TrEMBL that contains one of the established patterns is indeed a known (neuro)peptide precursor. We discovered 155 novel peptide patterns in addition to the 56 established ones in the PROSITE database. Together, the patterns cover 110 peptide families. Fifty-five of these families are not characterized by PROSITE signatures, and 12 are also not identified by other existing motif databases, such as Pfam and SMART. Using the newly identified peptide signatures as a search tool, we predicted 95 hypothetical proteins as putative peptide precursors.
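The scanning step described in this abstract can be illustrated with a minimal sketch, assuming the patterns are written in PROSITE-style syntax (the notation Pratt reports). The pattern and the sequence below are invented for illustration and are not taken from the paper; the function names `prosite_to_regex` and `scan` are likewise hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles x (any residue), [ABC] (any of), {ABC} (none of),
    repetitions such as x(2) or x(1,3), and the < and > terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern: str, sequences: dict) -> dict:
    """Return {sequence name: [0-based start positions]} for sequences that match."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, seq in sequences.items():
        positions = [m.start() for m in regex.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


# Hypothetical pattern and toy sequences, for illustration only.
toy_pattern = "[FY]-x(2)-R-F-G"
toy_db = {"toy_precursor": "MKAFLDRFGKR", "toy_other": "MKAAAAAAAKR"}
print(scan(toy_pattern, toy_db))
```

Note that `finditer` reports non-overlapping matches only; a production scanner such as ScanProsite treats overlaps and ambiguity codes more carefully.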

2.

Improved Detection of Remote Homologues Using Cascade PSI-BLAST: Influence of Neighbouring Protein Families on Sequence Coverage

Swati Kaushik Eshita Mutt Ajithavalli Chellappan Sandhya Sankaran Narayanaswamy Srinivasan Ramanathan Sowdhamini 《PloS one》2013,8(2)

Background

Developing sensitive sequence search procedures that detect distant relationships between proteins at the superfamily/fold level remains a major challenge. The intermediate sequence search approach is the most frequently employed way of identifying remote homologues effectively. In this study, serine proteases of the prolyl oligopeptidase, rhomboid and subtilisin protein families were examined using plant serine proteases from two genomes, A. thaliana and O. sativa, as queries, together with 13 other families of unrelated folds, to identify distant homologues that could not be obtained using PSI-BLAST.

Methodology/Principal Findings

We propose starting with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach such as Cascade PSI-BLAST. We found that classical sequence-based approaches, such as PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied to an enriched sequence database of homologous domains, and we obtained an overall average coverage of 88% at the family level and 77% at the superfamily or fold level, along with a specificity of ∼100% and a Matthews correlation coefficient of 0.91. A similar approach was also implemented on 13 other protein families representing every structural class in the SCOP database. Further investigation with statistical tests, such as jackknifing, helped us to better understand the influence of neighbouring protein families.
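The three figures of merit quoted above (coverage, specificity and the Matthews correlation coefficient) follow from the four cells of a confusion matrix. The sketch below uses the standard definitions; the counts in the usage line are invented and are not the paper's data.

```python
import math


def search_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (coverage, specificity, MCC) from confusion-matrix counts.

    Assumes at least one true member (tp + fn > 0) and one non-member (tn + fp > 0).
    """
    coverage = tp / (tp + fn)        # fraction of true family members recovered
    specificity = tn / (tn + fp)     # fraction of non-members correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return coverage, specificity, mcc


# Made-up counts, for illustration only.
cov, spec, mcc = search_metrics(tp=90, fp=0, tn=100, fn=10)
print(f"coverage={cov:.2f} specificity={spec:.2f} MCC={mcc:.3f}")
```

Because the MCC uses all four cells, it stays informative when members and non-members are present in very different numbers, which is the usual situation in homology searches.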

Conclusions/Significance

Our study suggests that employing multiple queries from a family in Cascade PSI-BLAST searches is useful for predicting distant relationships effectively, even at the superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of query sequences and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.

3.

X-Ray crystal structure and molecular dynamics simulations of silver hake parvalbumin (Isoform B)

Richardson RC King NM Harrington DJ Sun H Royer WE Nelson DJ 《Protein science : a publication of the Protein Society》2000,9(1):73-82

Parvalbumins constitute a class of calcium-binding proteins characterized by the presence of several helix-loop-helix (EF-hand) motifs. In a previous study (Revett SP, King G, Shabanowitz J, Hunt DF, Hartman KL, Laue TM, Nelson DJ, 1997, Protein Sci 7:2397-2408), we presented the sequence of the major parvalbumin isoform from the silver hake (Merluccius bilinearis) and presented spectroscopic and structural information on the excised "EF-hand" portion of the protein. In this study, the X-ray crystal structure of the silver hake major parvalbumin has been determined to high resolution, in the frozen state, using the molecular replacement method with the carp parvalbumin structure as a starting model. The crystals are orthorhombic, space group C222₁, with a = 75.7 Å, b = 80.7 Å, and c = 42.1 Å. Data were collected from a single crystal grown in 15% glycerol, which served as a cryoprotectant for flash freezing at -188 °C. The structure refined to a conventional R-value of 21% (free R 25%) for observed reflections in the range 8 to 1.65 Å [I > 2σ(I)]. The refined model includes an acetylated amino terminus, 108 residues (characteristic of a beta parvalbumin lineage), 2 calcium ions, and 114 water molecules per protein molecule. The resulting structure was used in molecular dynamics (MD) simulations focused primarily on the dynamics of the ligands coordinating the Ca2+ ions in the CD and EF sites. MD simulations were performed on both the fully Ca2+-loaded protein and a Ca2+-deficient variant, with Ca2+ only in the CD site. There was substantial agreement between the MD and X-ray results in addressing the issue of mobility of key residues in the calcium-binding sites, especially with regard to the side chain of Ser55 in the CD site and Asp92 in the EF site.

4.

Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks (total citations: 11; self-citations: 0; citations by others: 11)

de Brevern AG Etchebest C Hazout S 《Proteins》2000,41(3):271-287

By using an unsupervised cluster analyzer, we have identified a local structural alphabet composed of 16 folding patterns of five consecutive Cα ("protein blocks"). The dependence that exists between successive blocks is explicitly taken into account. A Bayesian approach based on the relation protein block-amino acid propensity is used for prediction and leads to a success rate close to 35%. Sharing sequence windows associated with certain blocks into "sequence families" improves the prediction accuracy by 6%. This prediction accuracy exceeds 75% when keeping the first four predicted protein blocks at each site of the protein. In addition, two different strategies are proposed: the first one defines the number of protein blocks in each site needed for respecting a user-fixed prediction accuracy, and alternatively, the second one defines the different protein sites to be predicted with a user-fixed number of blocks and a chosen accuracy. This last strategy applied to the ubiquitin conjugating enzyme (alpha/beta protein) shows that 91% of the sites may be predicted with a prediction accuracy larger than 77% considering only three blocks per site. The prediction strategies proposed improve our knowledge about sequence-structure dependence and should be very useful in ab initio protein modelling.
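The idea of ranking candidate blocks by posterior probability and keeping the first few can be sketched with Bayes' rule. This is a toy illustration, not the authors' implementation: it assumes the residues in a sequence window are conditionally independent given the block, and it uses two invented block labels with made-up propensities instead of the 16-block alphabet.

```python
import math


def rank_blocks(window, priors, likelihoods, top_k=1):
    """Rank candidate blocks by posterior P(block | window).

    priors[block]                 -> P(block)
    likelihoods[block][pos][aa]   -> P(residue aa at window position pos | block)
    Returns (top_k block labels, dict of normalized posteriors).
    """
    log_post = {}
    for block, prior in priors.items():
        score = math.log(prior)
        for pos, aa in enumerate(window):
            score += math.log(likelihoods[block][pos][aa])
        log_post[block] = score
    # Normalize in log space to avoid underflow on long windows.
    shift = max(log_post.values())
    total = sum(math.exp(v - shift) for v in log_post.values())
    post = {b: math.exp(v - shift) / total for b, v in log_post.items()}
    ranked = sorted(post, key=post.get, reverse=True)
    return ranked[:top_k], post


# Invented two-block, two-residue toy model (hypothetical values).
priors = {"helix_like": 0.5, "strand_like": 0.5}
likelihoods = {
    "helix_like": [{"A": 0.8, "G": 0.2}, {"A": 0.7, "G": 0.3}],
    "strand_like": [{"A": 0.3, "G": 0.7}, {"A": 0.4, "G": 0.6}],
}
best, posterior = rank_blocks("AA", priors, likelihoods, top_k=1)
print(best, posterior)
```

Raising `top_k` corresponds to the strategy in the abstract of keeping several predicted blocks per site: the prediction counts as correct if the true block is anywhere in the retained list, which is why accuracy rises as more blocks are kept.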

5.

Exploratory studies of ab initio protein structure prediction: multiple copy simulated annealing, AMBER energy functions, and a generalized born/solvent accessibility solvation model.

Yongxing Liu D L Beveridge 《Proteins》2002,46(1):128-146

A theoretical and computational approach to ab initio structure prediction for polypeptides in water is described and applied to selected amino acid sequences for testing and preliminary validation. The method builds systematically on the extensive efforts applied to parameterization of molecular dynamics (MD) force fields, employs an empirically well-validated continuum dielectric model for solvation, and an eminently parallelizable approach to conformational search. The effective free energy of polypeptide chains is estimated from AMBER united atom potential functions, with internal degrees of freedom for both backbone and amino acid side chains explicitly treated. The hydration free energy of each structure is determined using the Generalized Born/Solvent Accessibility (GBSA) method, modified and reparameterized to include atom types consistent with the AMBER force field. The conformational search procedure employs a multiple copy, Monte Carlo simulated annealing (MCSA) protocol in full torsion angle space, applied iteratively on sets of structures of progressively lower free energy until a prediction of a structure with lowest effective free energy is obtained. Calibration tests for the effective energy function and search algorithm are performed on the alanine dipeptide, selected protein crystal structures, and united atom decoys on barnase, crambin, and six examples from the Rosetta set. Specific demonstration cases of the method are provided for the 8-mer sequence of Ala residues, a 12-residue peptide with longer side chains QLLKKLLQQLKQ, a de novo designed 16 residue peptide of sequence (AAQAA)3Y, a 15-residue sequence with a beta sheet motif, GEWTWDATKTFTVTE, and a 36 residue small protein, Villin headpiece. The Ala 8-mer readily formed an alpha-helix. An alpha-helix structure was predicted for the 16-mer, consistent with observed results from IR and CD spectroscopy and with the pattern in ψ/φ angles of known protein structures. The predicted structure for the 12-mer, composed of a mix of helix and less regular elements of secondary structure, lies 2.65 Å RMS from the observed crystal structure. Structure prediction for the 8-mer beta-motif resulted in a form 4.50 Å RMS from the crystal geometry. For Villin, the predicted native form is very close to the crystal structure, with RMS values of 3.5 Å (including sidechains) and 1.01 Å (main chain only). The methodology permits a detailed analysis of the molecular forces which dominate various segments of the predicted folding trajectory. Analysis of the results in terms of internal torsional, electrostatic and van der Waals terms and the electrostatic and non-electrostatic contributions to hydration, including the hydrophobic effect, is presented.
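The search protocol named in this abstract, Monte Carlo simulated annealing in torsion angle space run as multiple copies, can be sketched as follows. This is a toy under stated assumptions: `toy_energy` is an invented stand-in for the AMBER plus GBSA effective energy, the geometric cooling schedule and step size are arbitrary choices, and "multiple copies" is reduced to independent runs with different random seeds.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy with its minimum at -60 degrees for every torsion."""
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def anneal(energy, angles, t_start=2.0, t_end=0.01, steps=5000,
           max_step=30.0, seed=0):
    """Metropolis Monte Carlo with geometric cooling; returns (best angles, best energy)."""
    rng = random.Random(seed)
    current = list(angles)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        trial = list(current)
        i = rng.randrange(len(trial))
        # Perturb one torsion and wrap it back into [-180, 180).
        trial[i] = (trial[i] + rng.uniform(-max_step, max_step) + 180.0) % 360.0 - 180.0
        e_trial = energy(trial)
        delta = e_trial - e_cur
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best


# "Multiple copy" stand-in: independent runs from the same start, keep the lowest energy.
start = [120.0, -150.0, 30.0]
copies = [anneal(toy_energy, start, seed=s) for s in range(4)]
best_angles, best_energy = min(copies, key=lambda result: result[1])
print(best_angles, best_energy)
```

In the protocol described above, the copies are applied iteratively to sets of progressively lower free energy structures; the independent runs here only illustrate why the approach parallelizes easily.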

6.

FORESST: fold recognition from secondary structure predictions of proteins

Di Francesco V Munson PJ Garnier J 《Bioinformatics (Oxford, England)》1999,15(2):131-140

MOTIVATION: A method for recognizing the three-dimensional fold from the protein amino acid sequence based on a combination of hidden Markov models (HMMs) and secondary structure prediction was recently developed for proteins in the Mainly-Alpha structural class. Here, this methodology is extended to Mainly-Beta and Alpha-Beta class proteins. Compared to other fold recognition methods based on HMMs, this approach is novel in that only secondary structure information is used. Each HMM is trained from known secondary structure sequences of proteins having a similar fold. Secondary structure prediction is performed for the amino acid sequence of a query protein. The predicted fold of a query protein is the fold described by the model fitting the predicted sequence the best. RESULTS: After model cross-validation, the success rate on 44 test proteins covering the three structural classes was found to be 59%. On seven fold predictions performed prior to the publication of experimental structure, the success rate was 71%. In conclusion, this approach manages to capture important information about the fold of a protein embedded in the length and arrangement of the predicted helices, strands and coils along the polypeptide chain. When a more extensive library of HMMs representing the universe of known structural families is available (work in progress), the program will allow rapid screening of genomic databases and sequence annotation when fold similarity is not detectable from the amino acid sequence. AVAILABILITY: FORESST web server at http://absalpha.dcrt.nih.gov:8008/ for the library of HMMs of structural families used in this paper. FORESST web server at http://www.tigr.org/ for a more extensive library of HMMs (work in progress). CONTACT: valedf@tigr.org; munson@helix.nih.gov; garnier@helix.nih.gov

7.

Cloning and molecular analysis of the bifunctional dihydrofolate reductase-thymidylate synthase gene in the ciliated protozoan Paramecium tetraurelia

I. Martha Schlichtherle David S. Roos Judith L. Van Houten 《Molecular genetics and genomics : MGG》1996,250(6):665-673

We have cloned the first bifunctional gene dihydrofolate reductase-thymidylate synthase (DHFR-TS) from a free-living, ciliated protozoan, Paramecium tetraurelia, and determined its macronuclear sequence using a modified ligation-mediated polymerase chain reaction (PCR) that can be of general use in cloning strategies, especially where cDNA libraries are limiting. While bifunctional enzyme sequences are known from parasitic protozoa, none had previously been found in free-living protozoa. The AT-rich (68%) coding region spanning 1386 bp appears to lack introns. DHFR-TS localizes to a ≈500 kb macronuclear chromosome and is transcribed as an mRNA of ≈1.66 kb, predicted to encode a 53 kDa protein of 462 residues. The N-terminal one-third of the protein is encoded by DHFR, which is joined by a short junctional peptide of ≈12 amino acids to the highly conserved C-terminal TS domain. Among known DHFR-TS sequences, the P. tetraurelia gene is most similar to that from Toxoplasma gondii, based on primary sequence and parsimony analyses. The predicted secondary protein structure is similar to those of previously crystallized monofunctional sequences.

8.

Importance of context in protein folding: secondary structural propensities versus tertiary contact-assisted secondary structure formation

Scott KA Alonso DO Pan Y Daggett V 《Biochemistry》2006,45(13):4153-4163

Molecular dynamics simulations can be used to reveal the detailed conformational behaviors of peptides and proteins. By comparing fragment and full-length protein simulations, we can investigate the role of each peptide segment in the folding process. Here, we take advantage of information regarding the helix formation process from our previous simulations of barnase and protein A as well as new simulations of four helical fragments from these proteins at three different temperatures, starting with both helical and extended structures. Segments with high helical propensity began the folding process by tethering the chain through side chain interactions involving either polar interactions, such as salt bridges, or hydrophobic staples. These tethers were frequently nonnative (i.e., not i → i + 4 spacing) and provided a scaffold for other residues, thereby limiting the conformational search. The helical structure then propagated on both sides of the tether. Segments with low stability and propensity formed later in the folding process and utilized contacts with other portions of the protein when folding. These helices formed via a tertiary contact-assisted mechanism, primarily via hydrophobic contacts between residues distant in sequence. Thus, segments with different helical propensities appear to play different roles during protein folding. Furthermore, the active role of nonlocal side chains in helix formation highlights why we must move beyond simple hierarchical models of protein folding.

9.

New Assembly, Reannotation and Analysis of the Entamoeba histolytica Genome Reveal New Genomic Features and Protein Content Information

Hernan A. Lorenzi Daniela Puiu Jason R. Miller Lauren M. Brinkac Paolo Amedeo Neil Hall Elisabet V. Caler 《PLoS neglected tropical diseases》2010,4(6)

Background

To keep genome information accurate and relevant, original genome annotations need to be updated and evaluated regularly. Manual reannotation of genomes is important because it can significantly reduce the propagation of errors and consequently diminish the time spent on mistaken research. For this reason, five years after the initial publication of the Entamoeba histolytica draft genome, we have re-examined the original 23 Mb assembly and the annotation of the predicted genes.

Principal Findings

The evaluation of the genomic sequence led to the identification of more than one hundred artifactual tandem duplications that were eliminated by re-assembling the genome. The reannotation was done using a combination of manual and automated genome analysis. The new 20 Mb assembly contains 1,496 scaffolds and 8,201 predicted genes, of which 60% are identical to the initial annotation and the remaining 40% underwent structural changes. Functional classification of 60% of the genes was modified based on recent sequence comparisons and new experimental data. We have assigned putative function to 3,788 proteins (46% of the predicted proteome) based on the annotation of predicted gene families, and have identified 58 protein families of five or more members that share no homology with known proteins and thus could be entamoeba specific. Genome analysis also revealed new features such as the presence of segmental duplications of up to 16 kb flanked by inverted repeats, and the tight association of some gene families with transposable elements.

Significance

This new genome annotation and analysis represents a more refined and accurate blueprint of the pathogen genome, and provides an upgraded tool as reference for the study of many important aspects of E. histolytica biology, such as genome evolution and pathogenesis.

10.

An Expanded Conformation of an Antibody Fab Region by X-Ray Scattering, Molecular Dynamics, and smFRET Identifies an Aggregation Mechanism

《Journal of molecular biology》2019,431(7):1409-1425

Protein aggregation is the underlying cause of many diseases, and also limits the usefulness of many natural and engineered proteins in biotechnology. Better mechanistic understanding and characterization of aggregation-prone states is needed to guide protein engineering, formulation, and drug-targeting strategies that prevent aggregation. While several final aggregated states, notably amyloids, have been characterized structurally, very little is known about the native structural conformers that initiate aggregation. We used a novel combination of small-angle X-ray scattering (SAXS), atomistic molecular dynamics (MD) simulations, single-molecule Förster resonance energy transfer (smFRET), and aggregation-prone region (APR) predictions to characterize structural changes in a native humanized Fab A33 antibody fragment that correlated with the experimental aggregation kinetics. SAXS revealed increases in the native state radius of gyration, R_g, of 2.2% to 4.1% at pH 5.5 and below, concomitant with accelerated aggregation. In a cutting-edge approach, we fitted the SAXS data to full MD simulations from the same conditions and located the conformational changes in the native state to the constant domain of the light chain (C_L). This C_L displacement was independently confirmed using smFRET measurements with two dual-labeled Fabs. These conformational changes were also found to increase the solvent exposure of a predicted APR, suggesting a likely mechanism through which they promote aggregation. Our findings provide a means by which aggregation-prone conformational states can be readily determined experimentally, and thus potentially used to guide protein engineering, or ligand binding strategies, with the aim of stabilizing the protein against aggregation.

11.

Discovery of Fur binding site clusters in Escherichia coli by information theory models

Chen Z Lewis KA Shultzaberger RK Lyakhov IG Zheng M Doan B Storz G Schneider TD 《Nucleic acids research》2007,35(20):6762-6777

Fur is a DNA binding protein that represses bacterial iron uptake systems. Eleven footprinted Escherichia coli Fur binding sites were used to create an initial information theory model of Fur binding, which was then refined by adding 13 experimentally confirmed sites. When the refined model was scanned across all available footprinted sequences, sequence walkers, which are visual depictions of predicted binding sites, frequently appeared in clusters that fit the footprints (~83% coverage). This indicated that the model can accurately predict Fur binding. Within the clusters, individual walkers were separated from their neighbors by exactly 3 or 6 bases, consistent with models in which Fur dimers bind on different faces of the DNA helix. When the E. coli genome was scanned, we found 363 unique clusters, which include all known Fur-repressed genes that are involved in iron metabolism. In contrast, only a few of the known Fur-activated genes have predicted Fur binding sites at their promoters. These observations suggest that Fur is either a direct repressor or an indirect activator. The Pseudomonas aeruginosa and Bacillus subtilis Fur models are highly similar to the E. coli Fur model, suggesting that the Fur–DNA recognition mechanism may be conserved for even distantly related bacteria.
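An information theory model of this kind can be approximated by an individual-information weight matrix, in which each base at each position contributes 2 + log2 f(b, l) bits and a candidate site is scored by summing the contributions. The sketch below is a simplification under stated assumptions: it omits the small-sample correction used in the full method, it adds a pseudocount (my choice, so unseen bases do not give minus infinity), and the aligned sites are invented rather than real Fur sites.

```python
import math

BASES = "ACGT"


def weight_matrix(sites, pseudocount=0.5):
    """Build a per-position weight matrix in bits from aligned, equal-length sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # 2 bits is the maximum information per DNA position.
        matrix.append({b: 2.0 + math.log2(counts[b] / total) for b in BASES})
    return matrix


def score(matrix, seq):
    """Individual information of one candidate site, in bits."""
    return sum(column[base] for column, base in zip(matrix, seq))


def scan(matrix, sequence, threshold=0.0):
    """Return (0-based position, bits) for every window scoring above the threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        bits = score(matrix, sequence[i:i + width])
        if bits > threshold:
            hits.append((i, bits))
    return hits


# Invented aligned sites, for illustration only.
toy_sites = ["GATAAT", "GATAAT", "GATTAT", "AATAAT"]
model = weight_matrix(toy_sites)
print(scan(model, "CCCGATAATCCC"))
```

Under this scoring, windows with positive information are candidate sites, and clusters such as those described above would appear as several hits spaced a few bases apart.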

12.

Molecular cloning and expression analysis of a new WD40 repeat protein gene in upland cotton

Quan Sun Yingfan Cai Xiaoyan Zhu Xiaohong He Huaizhong Jiang Guanghua He 《Biologia》2012,67(6):1112-1118

A new member of the WD repeat protein family, named GhWD40, was cloned from a near-isogenic line for glands in cotton. Its cDNA is 2629 bp long and has a complete open reading frame (ORF) of 1239 bp, containing the start codon (ATG) and stop codon (TAG); there is a 1061 bp non-coding sequence at the 5′-end and a 329 bp non-coding sequence at the 3′-end, including the poly(A) sequence (accession number: JN714279). The protein predicted from the complete ORF comprises 412 amino acids with a calculated molecular mass of 47.1 kDa and an isoelectric point of 8.88. Protein domain scanning showed that the novel protein has five WD40 motifs and belongs to the WD40 family. Database searches with the GhWD40 cDNA and amino acid sequences showed 77% sequence identity and 90% sequence positivity with the WD-40 repeat protein from Trifolium pratense (accession number BAE71307.1), and 80% sequence identity and 89% sequence positivity with the ribosome biogenesis protein bop1 from Ricinus communis (accession number XP_002529002.1). We propose that GhWD40 may play the same role as bop1. In addition, expression of GhWD40 in near-isogenic lines 11 and 3 (with and without glands, respectively) was studied by quantitative RT-polymerase chain reaction; the level in near-isogenic line 11 was higher than that in near-isogenic line 3, suggesting that GhWD40 may be related to gland formation.
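The lengths reported in this abstract are mutually consistent, as a short arithmetic check shows. The only assumption is that the 1239 bp ORF length counts the stop codon, which the abstract's wording suggests.

```python
# Lengths as stated in the abstract (bp).
five_prime_utr = 1061
orf = 1239
three_prime_utr = 329

cdna_length = five_prime_utr + orf + three_prime_utr
codons = orf // 3
residues = codons - 1  # the stop codon is not translated (assumed included in the ORF)

# Average residue mass implied by the 47.1 kDa calculated mass.
average_residue_mass = 47.1 * 1000 / residues

print(cdna_length)                      # total cDNA length in bp
print(orf % 3, residues)                # reading frame remainder, residue count
print(round(average_residue_mass, 1))   # Da per residue
```

The three segments sum to the stated 2629 bp, the ORF is a whole number of codons, and the implied residue count matches the 412 amino acids given for the predicted protein.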

13.

Bioinformatic identification and validation of conservative microRNAs in Ictalurus punctatus

Zhiqiang Xu Qin Qin Jiachun Ge Jianlin Pan Xiaofeng Xu 《Molecular biology reports》2012,39(12):10395-10405

14.

Evolution of protein sequences and structures. (total citations: 9; self-citations: 0; citations by others: 9)

T C Wood W R Pearson 《Journal of molecular biology》1999,291(4):977-995

The relationship between sequence similarity and structural similarity has been examined in 36 protein families with five or more diverse members whose structures are known. The structural similarity within a family (as determined with the DALI structure comparison program) is linearly related to sequence similarity (as determined by a Smith-Waterman search of the protein sequences in the structure database). The correlation between structural similarity and sequence similarity is very high; 18 of the 36 families had linear correlation coefficients r ≥ 0.878, and only nine had correlation coefficients r …

15.

Sequence clustering strategies improve remote homology recognitions while reducing search times (total citations: 8; self-citations: 0; citations by others: 8)

Li W Jaroszewski L Godzik A 《Protein engineering》2002,15(8):643-649

Sequence databases are rapidly growing, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems, and at the same time speed up the searches for homologous proteins and improve the ability of profile methods to recognize distant homologies. One of our strategies combines database clustering, which removes highly redundant sequences, and a two-step PSI-BLAST (PDB-BLAST), which separates sequence spaces of profile composition and space of homology searching. The combination of these strategies improves distant homology recognitions by more than 100%, while using only 10% of the CPU time of the standard PSI-BLAST search. Another method, intermediate profile searches, allows for the exploration of additional search directions that are normally dominated by large protein sub-families within very diverse families. All methods are evaluated with a large fold-recognition benchmark.

16.

NLDB: a database for 3D protein–ligand interactions in enzymatic reactions

Yoichi Murakami Satoshi Omori Kengo Kinoshita 《Journal of structural and functional genomics》2016,17(4):101-110

NLDB (Natural Ligand DataBase; URL: http://nldb.hgc.jp) is a database of automatically collected and predicted 3D protein–ligand interactions for the enzymatic reactions of metabolic pathways registered in KEGG. Structural information about these reactions is important for studying the molecular functions of enzymes; however, a large number of the 3D interactions are still unknown. Therefore, to complement this missing information, we predicted protein–ligand complex structures and constructed a database of the 3D interactions in reactions. NLDB provides three different types of data resources: the natural complexes are experimentally determined protein–ligand complex structures in PDB, the analog complexes are predicted based on known protein structures in a complex with a similar ligand, and the ab initio complexes are predicted by docking simulations. In addition, NLDB shows the known polymorphisms found in the human genome on protein structures. The database has a flexible search function based on various types of keywords, and an enrichment analysis function based on a set of KEGG compound IDs. NLDB will be a valuable resource for experimental biologists studying protein–ligand interactions in specific reactions, and for theoretical researchers wishing to undertake more precise simulations of interactions.

17.

Computational Prediction of Protein-Protein Interactions in Leishmania Predicted Proteomes

Antonio M. Rezende Edson L. Folador Daniela de M. Resende Jeronimo C. Ruiz 《PloS one》2012,7(12)

The trypanosomatid parasites Leishmania braziliensis, Leishmania major and Leishmania infantum are important human pathogens. Despite years of study and the availability of their genomes, no effective vaccine has yet been developed, and chemotherapy is highly toxic. It is therefore clear that only integrated interdisciplinary studies will succeed in finding new targets for vaccine and drug development. An essential part of this rationale is the study of protein-protein interaction (PPI) networks, which can provide a better understanding of complex protein interactions in biological systems. Thus, we modeled PPIs for trypanosomatids through computational methods, using sequence comparison against public databases of protein or domain interactions for interaction prediction (Interolog Mapping), and developed a dedicated combined system score to address the robustness of the predictions. The confidence of the network prediction approach was evaluated using gold-standard positive and negative datasets, and the AUC value obtained was 0.94. As a result, 39,420, 43,531 and 45,235 interactions were predicted for L. braziliensis, L. major and L. infantum, respectively. For each predicted network, the top 20 proteins were ranked by the MCC topological index. In addition, information on immunological potential, the degree of protein sequence conservation among orthologs and the degree of identity to proteins of potential parasite hosts was integrated. This integration improves the understanding and usefulness of the predicted networks, which can be valuable for selecting new potential biological targets for drug and vaccine development. Network modularity analysis, which is key when one is interested in destabilizing the PPIs for drug or vaccine purposes, and multiple alignments of the predicted PPIs were performed, revealing patterns associated with protein turnover. In addition, around 50% of the hypothetical proteins present in the networks received some degree of functional annotation, an important contribution since approximately 60% of the Leishmania predicted proteomes have no predicted function.

18.

Probing metagenomics by rapid cluster analysis of very large datasets

Li W Wooley JC Godzik A 《PloS one》2008,3(10):e3375

Background

The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods.

Methodology/Principal Findings

In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations.
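The clustering idea behind this analysis can be sketched as a greedy incremental procedure: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative it resembles closely enough or founds a new cluster. This is a toy illustration only. The similarity measure here is Python's `difflib.SequenceMatcher` ratio, a stand-in chosen for brevity; CD-HIT itself relies on short-word filtering and sequence alignment. The sequences are invented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the identity measure used by CD-HIT."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if similarity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Invented toy sequences, for illustration only.
toy = [
    "MKTAYIAKQRQISFVKSHFSR",
    "GSSHHHHHHSSGLVPRGSH",
    "MKTAYIAKQRQISFVKSHFSRQ",
]
for representative, members in greedy_cluster(toy, threshold=0.9):
    print(representative, len(members))
```

A hierarchical analysis can be approximated by running the same function again on the representatives alone with a lower threshold, so that each level works on a much smaller set than the one before.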

Conclusion/Significance

Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.

19.

A kinase sequence database: sequence alignments and family assignment

Buzko O Shokat KM 《Bioinformatics (Oxford, England)》2002,18(9):1274-1275

SUMMARY: The Kinase Sequence Database (KSD) located at http://kinase.ucsf.edu/ksd contains information on 290 protein kinase families derived by profile-based clustering of the non-redundant list of sequences obtained from a GenBank-wide search. Included in the database are a total of 5,041 protein kinases from over 100 organisms. Clustering into families is based on the extent of homology within the kinase catalytic domain (250-300 residues in length). Alignments of the families are viewed by interactive Excel-based sequence spreadsheets. In addition, KSD features evolutionary trees derived for each family and detailed information on each sequence as well as links to the corresponding GenBank entries. Sequence manipulation tools, such as evolutionary tree generation, novel sequence assignment, and statistical analysis, are also provided. AVAILABILITY: The kinase sequence database is a web-based service accessible at http://kinase.ucsf.edu/ksd CONTACT: buzko@cmp.ucsf.edu; shokat@cmp.ucsf.edu/ksd

20.

Yeast chromosome III: new gene functions. (total citations: 19; self-citations: 1; citations by others: 18)

E V Koonin P Bork C Sander 《The EMBO journal》1994,13(3):493-503