Similar literature
1.
MOTIVATION: The completion of the Arabidopsis genome offers the first opportunity to analyze all of the membrane protein sequences of a plant. The majority of integral membrane proteins including transporters, channels, and pumps contain hydrophobic alpha-helices and can be selected based on TransMembrane Spanning (TMS) domain prediction. By clustering the predicted membrane proteins based on sequence, it is possible to sort the membrane proteins into families of known function, based on experimental evidence or homology, or unknown function. This provides a way to identify target sequences for future functional analysis. RESULTS: An automated approach was used to select potential membrane protein sequences from the set of all predicted proteins and cluster the sequences into related families. The recently completed sequence of Arabidopsis thaliana, a model plant, was analyzed. Of the 25,470 predicted protein sequences 4589 (18%) were identified as containing two or more membrane spanning domains. The membrane protein sequences clustered into 628 distinct families containing 3208 sequences. Of these, 211 families (1764 sequences) either contained proteins of known function or showed homology to proteins of known function in other species. However, 417 families (1444 sequences) contained only sequences with no known function and no homology to proteins of known function. In addition, 1381 sequences did not cluster with any family and no function could be assigned to 1337 of these.
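The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers given in the text; the function name `summarize` and the dictionary keys are illustrative and do not come from the paper.

```python
def summarize(total, membrane, known, unknown, singletons, singletons_unassigned):
    """Re-derive the totals quoted in the abstract.

    known / unknown are (families, sequences) pairs for families with and
    without an assigned function. All names here are illustrative.
    """
    families = known[0] + unknown[0]
    clustered = known[1] + unknown[1]
    return {
        "membrane_percent": round(100 * membrane / total),
        "families": families,
        "clustered_sequences": clustered,
        # clustered sequences plus unclustered ones should equal the membrane set
        "accounted_for": clustered + singletons,
        "singletons_with_function": singletons - singletons_unassigned,
    }


summary = summarize(25470, 4589, (211, 1764), (417, 1444), 1381, 1337)
print(summary)
```

The figures are internally consistent: the two family groups add up to the 628 families and 3208 clustered sequences, and the clustered plus unclustered sequences add up to the 4589 predicted membrane proteins.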

2.
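A new EST clustering method   Total citations: 11, self-citations: 0, citations by others: 11
This study developed an EST (expressed sequence tag) clustering method, ESTClustering, for analyzing the large volumes of data produced by large-scale EST sequencing in order to obtain high-quality, non-redundant expressed sequences. During clustering, the method uses the MEGABLAST tool to compare consensus sequences for homology and the phrap program to check the assembly of each EST cluster. This clustering strategy reduces the impact of sequencing errors, effectively identifies gene family members, and avoids interference from alternative splicing. Compared with the UniGene clustering method of NCBI (National Center for Biotechnology Information), the results of ESTClustering better reflect the diversity of expressed sequences. In a test clustering of 112,256 Arabidopsis ESTs, ESTClustering produced 23,581 EST clusters, of which 13,597 had corresponding coding sequences in the Arabidopsis genome, close to the number of predicted genes in that genome that are supported by EST evidence. Applying the method to a collected set of 147,191 rice EST sequences yielded 33,896 EST clusters.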

3.
Reversible protein phosphorylation is critically important in the modulation of a wide variety of cellular functions. Several families of protein phosphatases remove phosphate groups placed on key cellular proteins by protein kinases. The complete genomic sequence of the model plant Arabidopsis permits a comprehensive survey of the phosphatases encoded by this organism. Several errors in the sequencing project gene models were found via analysis of predicted phosphatase coding sequences. Structural sequence probes from aligned and unaligned sequence models, and all-against-all BLAST searches, were used to identify 112 phosphatase catalytic subunit sequences, distributed among the serine (Ser)/threonine (Thr) phosphatases (STs) of the protein phosphatase P (PPP) family, STs of the protein phosphatase M (PPM) family (protein phosphatases 2C [PP2Cs] subfamily), protein tyrosine (Tyr) phosphatases (PTPs), low-M(r) protein Tyr phosphatases, and dual-specificity (Tyr and Ser/Thr) phosphatases (DSPs). The Arabidopsis genome contains an abundance of PP2Cs (69) and a dearth of PTPs (one). Eight sequences were identified as new protein phosphatase candidates: five dual-specificity phosphatases and three PP2Cs. We used phylogenetic analyses to infer clustering patterns reflecting sequence similarity and evolutionary ancestry. These clusters, particularly for the largely unexplored PP2C set, will be a rich source of material for plant biologists, allowing the systematic sampling of protein function by genetic and biochemical means.

4.
Wang D  Harper JF  Gribskov M 《Plant physiology》2003,132(4):2152-2165
The genome of the budding yeast (Saccharomyces cerevisiae) provides an important paradigm for transgenomic comparisons with other eukaryotic species. Here, we report a systematic comparison of the protein kinases of yeast (119 kinases) and a reference plant Arabidopsis (1,019 kinases). Using a whole-protein-based, hierarchical clustering approach, the complete set of protein kinases from both species was clustered. We validated our clustering by three observations: (a) clustering pattern of functional orthologs proven in genetic complementation experiments, (b) consistency with reported classifications of yeast kinases, and (c) consistency with the biochemical properties of those Arabidopsis kinases already experimentally characterized. The clustering pattern identified no overlap between yeast kinases and the receptor-like kinases (RLKs) of Arabidopsis. Ten more kinase families were found to be specific for one of the two species. Among them, the calcium-dependent protein kinase and phosphoenolpyruvate carboxylase kinase families are specific for plants, whereas the Ca(2+)/calmodulin-dependent protein kinase and provirus insertion in mouse-like kinase families were found only in yeast and animals. Three yeast kinase families, nitrogen permease reactivator/halotolerance-5, polyamine transport kinase, and negative regulator of sexual conjugation and meiosis, are absent in both plants and animals. The majority of yeast kinase families (21 of 26) display Arabidopsis counterparts, and all are mapped into Arabidopsis families of intracellular kinases that are not related to RLKs. Representatives from 11 of the common families (54 kinases from Arabidopsis and 17 from yeast) share an extremely high degree of similarity (BLAST E value < 10^-80), suggesting the likelihood of orthologous functions. Selective expansion of yeast kinase families was observed in Arabidopsis. This is most evident for yeast genes CBK1, HRR25, and SNF1 and the kinase family S6K. Reduction of kinase families was also observed, as in the case of the NEK-like family. The distinguishing features between the two sets of kinases are the selective expansion of yeast families and the generation of a limited number of new kinase families for new functionality in Arabidopsis, most notably, the Arabidopsis RLKs that constitute important components of plant intercellular communication apparatus.

5.
DiffTool is a resource to build and visualize protein clusters computed from a sequence database. The package provides a clustering tool to construct protein families according to sequence similarities and a web interface to query the corresponding clusters. A subtractive genome analysis tool selects protein families specific for a genome or a group of genomes. For each protein cluster, DiffTool includes access to sequences, coloured multiple alignments and phylogenetic trees. AVAILABILITY: A cluster database built from yeast and complete prokaryotic genomes is queryable at http://bioweb.pasteur.fr/seqanal/difftool. All the Perl sources are freely available to non-profit organizations upon request.

6.
Li W  Liu Z  Lai L 《Biopolymers》1999,49(6):481-495
A general problem in comparative modeling and protein design is the conformational evaluation of loops with a certain sequence in specific environmental protein frameworks. Loops of different sequences and structures on similar scaffolds are common in the Protein Data Bank (PDB). In order to explore both their structural and sequence diversity, a database of loops connecting similar secondary structure fragments is constructed by searching the database of families of structurally similar proteins and the PDB. A total of 84 loop families having 2-13 residues are found among the well-determined structures of resolution better than 2.5 Å. Eight alpha-alpha, 20 alpha-beta, 19 beta-alpha, and 37 beta-beta families are identified. Every family contains more than 5 loop motifs. In each family, no two loops share the same sequence and all the frameworks are well superimposed. Forty-three new loop classes are distinguished in the database. The structural variability of loops in homologous proteins is examined and shown in 44 families. Motif families are characterized with geometric parameters and sequence patterns. The conformations of loops in each family are clustered into subfamilies using the average linkage cluster analysis method. Information such as geometric properties, sequence profile, sequential and structural variability in loop, structural alignment parameters, sequence similarities, and clustering results are provided. Correlations between the conformation of loops and loop sequence, motif sequence, and global sequence of PDB chain are examined in order to find how loop structures depend on their sequences and how they are affected by the local and global environment. Strong correlations (R > 0.75) are only found in 24 families. The best R value is 0.98. The database is available through the Internet.
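The abstract names average linkage cluster analysis as the way loop conformations are grouped into subfamilies. As a rough illustration of that general technique only, here is a minimal pure-Python sketch. The one-dimensional toy values, the distance function, and the stopping threshold are assumptions made for the example and are not the descriptors or cut-offs used by the authors.

```python
from itertools import combinations


def average_linkage(points, dist, threshold):
    """Agglomerative clustering with average linkage.

    Repeatedly merges the two clusters whose mean pairwise distance is
    smallest, and stops once that distance exceeds `threshold`.
    Returns clusters as sorted lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def avg(a, b):
        # mean distance over all cross-cluster pairs
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        if avg(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy stand-ins for a single geometric descriptor of five loops (hypothetical values).
toy_values = [0.0, 0.2, 0.3, 5.0, 5.1]
groups = average_linkage(toy_values, lambda a, b: abs(a - b), threshold=1.0)
print(sorted(groups))
```

With real loop data the distance would more plausibly be a structural measure such as an RMSD after superposition, and a library routine for hierarchical clustering would replace this quadratic loop.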

7.

Background  

Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.

8.
9.
Yau SS  Yu C  He R 《DNA and cell biology》2008,27(5):241-250
Graphical representation of gene sequences provides a simple way of viewing, sorting, and comparing various gene structures. Here we first report a two-dimensional graphical representation for protein sequences. With this method, we constructed the moment vectors for protein sequences, and mathematically proved that the correspondence between moment vectors and protein sequences is one-to-one. Therefore, each protein sequence can be represented as a point in a map, which we call protein map, and cluster analysis can be used for comparison between the points. Sixty-six proteins from five protein families were analyzed using this method. Our data showed that for proteins in the same family, their corresponding points in the map are close to each other. We also illustrate the efficiency of this approach by performing an extensive cluster analysis of the protein kinase C family. These results indicate that this protein map could be used to mathematically specify the similarity of two proteins and predict properties of an unknown protein based on its amino acid sequence.

10.
An efficient algorithm for large-scale detection of protein families   Total citations: 6, self-citations: 0, citations by others: 6
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
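The Markov cluster (MCL) algorithm mentioned above alternates two steps on a column-stochastic matrix: expansion (squaring the matrix, which spreads flow along longer paths) and inflation (raising entries to a power and renormalizing, which strengthens strong flows and starves weak ones). The sketch below is a small dense-matrix illustration of that idea, not the TRIBE-MCL implementation. The similarity values, the inflation value of 2.0, the fixed iteration count, and the cut-off `eps` are all assumptions chosen for the example.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(sim, inflation=2.0, iterations=50, eps=1e-6):
    """Toy Markov clustering on a symmetric similarity matrix.

    Self-loops are added, then expansion and inflation are alternated for a
    fixed number of rounds. Rows that still carry flow act as attractors, and
    the columns they attract form one cluster.
    """
    n = len(sim)
    m = [[sim[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = normalize_columns(
            [[v ** inflation for v in row] for row in m])  # inflation
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarities: two tight triangles joined by one weak edge (2-3).
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = mcl(similarity)
print(families)
```

In a protein-family setting the matrix entries would be derived from precomputed pairwise similarity scores, and a sparse implementation with pruning would be needed at database scale. Larger inflation values generally give finer-grained clusters.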

11.
12.
Li W  Wooley JC  Godzik A 《PloS one》2008,3(10):e3375

Background

The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods.

Methodology/Principal Findings

In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations.

Conclusion/Significance

Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
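CD-HIT is generally described as a greedy incremental clustering method: sequences are processed from longest to shortest, and each one either joins the first existing cluster whose representative it resembles closely enough or starts a new cluster. The sketch below illustrates only that greedy loop. The `identity` function is a deliberately crude, ungapped stand-in (an assumption for this example); the real tool uses short-word filtering and alignment, and the sequences and the 0.9 threshold are hypothetical.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions in an ungapped,
    left-anchored comparison, relative to the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering.

    Sequences are visited longest first. Each sequence joins the first
    cluster whose representative it matches at or above `threshold`;
    otherwise it becomes the representative of a new cluster.
    Returns a list of (representative, members) pairs.
    """
    clusters = []
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


# Hypothetical peptide fragments, for illustration only.
toy_seqs = ["MKTAYIAKQR", "MKTAYIAKQ", "MKTAYLAKQR", "GGSWLPAVNE", "GGSWLPA"]
result = greedy_cluster(toy_seqs, threshold=0.9)
for rep, members in result:
    print(rep, len(members))
```

A hierarchical analysis of the kind the abstract describes could be approximated by rerunning such a pass on the representatives with a progressively lower threshold, although the exact scheme used in the study is not given here.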

13.
Accurate tRNA 3' end maturation is essential for aminoacylation and thus for protein synthesis in all organisms. Here we report the first identification of protein and DNA sequences for tRNA 3'-processing endonucleases (RNase Z). Purification of RNase Z from wheat identified a 43 kDa protein correlated with the activity. Peptide sequences obtained from the purified protein were used to identify the corresponding gene. In vitro expression of the homologous proteins from Arabidopsis thaliana and Methanococcus jannaschii confirmed their tRNA 3'-processing activities. These RNase Z proteins belong to the ELAC1/2 family of proteins and to the cluster of orthologous proteins COG 1234. The RNase Z enzymes from A. thaliana and M. jannaschii are the first members of these families to which a function can now be assigned. Proteins with high sequence similarity to the RNase Z enzymes from A. thaliana and M. jannaschii are present in all three kingdoms.

14.
15.
Lee D  Grant A  Marsden RL  Orengo C 《Proteins》2005,59(3):603-615
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.

16.
Sequence similarity and profile searching tools were used to analyze the genome sequences of Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans and Drosophila melanogaster for genes encoding three families of histone deacetylase (HDAC) proteins and three families of histone acetyltransferase (HAT) proteins. Plants, animals and fungi were found to have a single member of each of three subfamilies of the GNAT family of HATs, suggesting conservation of these functions. However, major differences were found with respect to sizes of gene families and multi-domain protein structures within other families of HATs and HDACs, indicating substantial evolutionary diversification. Phylogenetic analysis identified a new class of HDACs within the RPD3/HDA1 family that is represented only in plants and animals. A similar analysis of the plant-specific HD2 family of HDACs suggests a duplication event early in dicot evolution, followed by further diversification in the lineage leading to Arabidopsis. Of three major classes of SIR2-type HDACs that are found in animals, fungi have representatives only in one class, whereas plants have representatives only in the other two. Plants possess five CREB-binding protein (CBP)-type HATs compared with one to two in animals and none in fungi. Domain and phylogenetic analyses of the CBP family proteins showed that this family has evolved three distinct types of CBPs in plants. The domain architectures of the CBP and TAF(II)250 families of HATs show significant differences between plants and animals, most notably with respect to bromodomain occurrence and number. Bromodomain-containing proteins in Arabidopsis differ strikingly from animal bromodomain proteins with respect to the numbers of bromodomains and the other types of domains that are present. The substantial diversification of HATs and HDACs that has occurred since the divergence of plants, animals and fungi suggests a surprising degree of evolutionary plasticity and functional diversification in these core chromatin components.

17.
The Fabaceae, the third largest family of plants and the source of many crops, has been the target of many genomic studies. Currently, only the grasses surpass the legumes for the number of publicly available expressed sequence tags (ESTs). The quantity of sequences from diverse plants enables the use of computational approaches to identify novel genes in specific taxa. We used BLAST algorithms to compare unigene sets from Medicago truncatula, Lotus japonicus, and soybean (Glycine max and Glycine soja) to nonlegume unigene sets, to GenBank's nonredundant and EST databases, and to the genomic sequences of rice (Oryza sativa) and Arabidopsis. As a working definition, putatively legume-specific genes had no sequence homology, below a specified threshold, to publicly available sequences of nonlegumes. Using this approach, 2,525 legume-specific EST contigs were identified, of which less than three percent had clear homology to previously characterized legume genes. As a first step toward predicting function, related sequences were clustered to build motifs that could be searched against protein databases. Three families of interest were more deeply characterized: F-box related proteins, Pro-rich proteins, and Cys cluster proteins (CCPs). Of particular interest were the >300 CCPs, primarily from nodules or seeds, with predicted similarity to defensins. Motif searching also identified several previously unknown CCP-like open reading frames in Arabidopsis. Evolutionary analyses of the genomic sequences of several CCPs in M. truncatula suggest that this family has evolved by local duplications and divergent selection.

18.
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.

19.
The SYSTERS (short for SYSTEmatic Re-Searching) protein sequence cluster set consists of the classification of all sequences from SWISS-PROT and PIR into disjoint protein family clusters and hierarchically into superfamily and subfamily clusters. The cluster set can be searched with a sequence using the SSMAL search tool or a traditional database search tool like BLAST or FASTA. Additionally a multiple alignment is generated for each cluster and annotated with domain information from the Pfam database of protein domain families. A taxonomic overview of the organisms covered by a cluster is given based on the NCBI taxonomy. The cluster set is available for querying and browsing at http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform

20.
The complete genomic sequence for Arabidopsis provides the opportunity to combine phylogenetic and genomic approaches to study the evolution of gene families in plants. The Aux/IAA and ARF gene families, consisting of 29 and 23 loci in Arabidopsis, respectively, encode proteins that interact to mediate auxin responses and regulate various aspects of plant morphological development. We developed scenarios for the genomic proliferation of the Aux/IAA and ARF families by combining phylogenetic analysis with information on the relationship between each locus and the previously identified duplicated genomic segments in Arabidopsis. This analysis shows that both gene families date back at least to the origin of land plants and that the major Aux/IAA and ARF lineages originated before the monocot-eudicot divergence. We found that the extant Aux/IAA loci arose primarily through segmental duplication events, in sharp contrast to the ARF family and to the general pattern of gene family proliferation in Arabidopsis. Possible explanations for the unusual mode of Aux/IAA duplication include evolutionary constraints imposed by complex interactions among proteins and pathways, or the presence of long-distance cis-regulatory sequences. The antiquity of the two gene families and the unusual mode of Aux/IAA diversification have a number of potential implications for understanding both the functional and evolutionary roles of these genes.
