Similar Literature
Found 20 similar documents (search time: 31 ms)
1.
Mishra P  Pandey PN 《Bioinformation》2011,6(10):372-374
The number of amino acid sequences in protein databases such as Swiss-Prot, UniProt, PIR and others is increasing very rapidly, but the structures of only some of these sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph-theoretic techniques for clustering amino acid sequences. A similarity graph is defined, and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks to group amino acid sequences into subsets based on the distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity, meaning sequences in the same cluster are highly similar to each other; and separation, meaning sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that, for a given set of proteins, the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
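The idea of reading clusters off a similarity graph as connected subgraphs can be illustrated with a minimal sketch. The abstract does not specify the similarity score or the threshold, so the `similarity` function (fraction of matching positions) and the cutoff of 0.7 below are purely illustrative assumptions, as are all names.

```python
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Toy score (an assumption): fraction of identical positions,
    normalised by the longer sequence. A real pipeline would use an
    alignment-based score instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_by_components(seqs: list[str], threshold: float) -> list[list[int]]:
    """Connect sequences whose similarity reaches the threshold and
    return the connected components as lists of sequence indices."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy sequences, not taken from SCOP.
sequences = ["MKTAYIAK", "MKTAYIAR", "MKSAYIAR", "GGHLLPWE", "GGHLLPWD", "QQQQQQQQ"]
clusters = cluster_by_components(sequences, threshold=0.7)
print(clusters)
```

In this toy input the last sequence has no neighbour above the threshold and ends up as a singleton, which is the kind of cluster the abstract reports seeing less often with its method.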

2.
Multiple sequence alignment with hierarchical clustering.   Total citations: 155, self-citations: 8, citations by others: 147
F Corpet 《Nucleic acids research》1988,16(22):10881-10890
An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned, creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a new hierarchical clustering. If it differs from the first one, the process can be iterated. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.
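The first step described above, building a hierarchical clustering from a matrix of pairwise scores, can be sketched as follows. The abstract does not state the linkage rule, so average linkage on a distance matrix is used here as an assumption; the function name and the toy matrix are hypothetical.

```python
def average_linkage_merges(dist: list[list[float]]) -> list[tuple[tuple[int, ...], tuple[int, ...]]]:
    """Agglomerative clustering with average linkage (an assumed rule).

    Returns the merge order, i.e. the order in which groups of
    sequences would be aligned with each other."""
    clusters: list[tuple[int, ...]] = [(i,) for i in range(len(dist))]
    merges = []

    def avg(a: tuple[int, ...], b: tuple[int, ...]) -> float:
        # Mean pairwise distance between members of the two groups.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the closest pair of groups; ties are broken by position.
        _, x, y = min(
            (avg(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        a, b = clusters[x], clusters[y]
        merges.append((a, b))
        rest = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters = rest + [tuple(sorted(a + b))]
    return merges


# Hypothetical pairwise distances for four sequences.
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 8],
    [10, 10, 8, 0],
]
merge_order = average_linkage_merges(distances)
print(merge_order)
```

The merge order plays the role of a guide tree: the closest sequences are joined first, and groups are joined until a single group remains.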

3.
With various ‘omics’ data becoming available recently, new challenges and opportunities have arisen for research on the assembly of next-generation sequences. As an attempt to utilize these opportunities, we developed a next-generation sequence clustering method focusing on the interdependency between genomics and proteomics data. Under the assumption that we can obtain next-generation read sequences and proteomics data of a target species, we mapped the read sequences against protein sequences and found physically adjacent reads based on a machine learning-based read assignment method. We measured the performance of our method by using simulated read sequences and collected protein sequences of Escherichia coli (E. coli). Here, we concentrated on the actual adjacency of the clustered reads in the E. coli genome and found that (i) the proposed method improves the performance of read clustering and (ii) the use of proteomics data does have potential for enhancing the performance of genome assemblers. These results demonstrate that the integrative approach is effective for the accurate grouping of adjacent reads in a genome, which will result in a better genome assembly.

4.
MOTIVATION: Clustering sequences of a full-length cDNA library into alternative splice form candidates is a very important problem. RESULTS: We developed a new efficient algorithm to cluster sequences of a full-length cDNA library into alternative splice form candidates. Current clustering algorithms for cDNAs tend to produce too many clusters containing incorrect splice form candidates. Our algorithm is based on a spliced sequence alignment algorithm that considers splice sites. The spliced sequence alignment algorithm is a variant of an ordinary dynamic programming algorithm, which requires O(nm) time for checking a pair of sequences, where n and m are the lengths of the two sequences. Since this time bound is too large to perform all-pair comparison for a large set of sequences, we developed new techniques to reduce the computation time without affecting the accuracy of the output clusters. Our algorithm was applied to 21 076 mouse cDNA sequences of the FANTOM 1.10 database to examine its performance and accuracy. In these experiments, we achieved about a 2-12-fold speedup over a method using only a traditional hash-based technique. Moreover, without using any information from the mouse genome sequence data or any gene data in public databases, we succeeded in listing 87-89% of all the clusters that biologists have annotated manually. AVAILABILITY: We provide a web service for cDNA clustering located at https://access.obigrid.org/ibm/cluspa/, for which registration for the OBIGrid (http://www.obigrid.org) is required.
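To make the O(nm) cost concrete, here is a minimal sketch of an ordinary dynamic-programming alignment score (Needleman-Wunsch style global alignment). This is not the spliced alignment variant from the abstract, which additionally considers splice sites; the scoring values are arbitrary assumptions.

```python
def global_alignment_score(s: str, t: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best global alignment of s and t.

    The two nested loops visit every (i, j) cell once, which is where
    the O(nm) running time comes from. Only two rows are kept in memory."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Running this for every pair in a library of N sequences costs on the order of N² such tables, which is why the abstract stresses the need to avoid all-pair comparison.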

5.
In this paper we report a novel mathematical method to transform DNA sequences into distribution vectors, which correspond to points in a sixty-dimensional space. Each component of the distribution vector represents the distribution of one kind of nucleotide in k segments of the DNA sequence. The mathematical and statistical properties of the distribution vectors are demonstrated and examined with huge datasets of human DNA sequences and random sequences. The determined expectation and standard deviation make the mapping stable and practicable. Moreover, we apply the distribution vectors to the clustering of the Haemagglutinin (HA) gene of 60 H1N1 viruses from human, swine and avian hosts, the complete mitochondrial genomes of 80 placental mammals and the complete genomes of 50 bacteria. The 60 H1N1 viruses, 80 placental mammals and 50 bacteria are classified accurately and rapidly compared to multiple sequence alignment methods. The results indicate that the distribution vectors can reveal the similarity and evolutionary relationships among homologous DNA sequences based on the distances between any two of these distribution vectors. The advantage of fast computation allows the distribution vectors to handle huge amounts of DNA sequence data efficiently.
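The abstract does not give the formula for the distribution vectors, so the following is only one possible reading, stated as an assumption: split the sequence into k equal segments and, for each nucleotide, record the fraction of its occurrences that fall in each segment. With k = 15 this happens to give 4 × 15 = 60 components; whether that matches the authors' construction is not confirmed by the text.

```python
import math


def distribution_vector(seq: str, k: int = 15) -> list[float]:
    """Assumed construction: for each base, the share of its
    occurrences located in each of k consecutive segments."""
    seq = seq.upper()
    n = len(seq)
    vec: list[float] = []
    for base in "ACGT":
        total = seq.count(base)
        for s in range(k):
            start, end = s * n // k, (s + 1) * n // k
            count = seq[start:end].count(base)
            vec.append(count / total if total else 0.0)
    return vec


def euclidean(u: list[float], v: list[float]) -> float:
    """Distance between two vectors, usable as a clustering input."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


toy = distribution_vector("AACCGGTT", k=2)
print(toy)
print(len(distribution_vector("ACGT" * 30)))
```

Because each vector is computed in a single pass over the sequence, pairwise distances can be obtained without any alignment, which is consistent with the speed advantage the abstract claims.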

6.
This article introduces an alignment-free clustering method to cluster all 66 sequentially diverse DOR protein sequences. Two different methods are discussed: one uses the twenty standard amino acids (without grouping) and the other uses a chemical grouping of amino acids (with grouping). Two grayscale images (representing two protein sequences by order pair frequency matrices) are compared to find the similarity index using a morphology technique. We achieved correlation coefficients of 0.9734 and 0.9403 for the without-grouping and with-grouping methods, respectively, against the ClustalW result on the ND5 dataset, which are much better than those of some existing alignment-free methods. Based on the similarity index, the 66 DORs are clustered into three classes - Highest, Moderate and Lowest - which are seen to be the best fit for the 66 DOR protein sequences. OR83b is the distinguished olfactory receptor expressed in divergent insect populations, which is substantiated by our investigation.

7.
A clustering method for repeat analysis in DNA sequences   Total citations: 1, self-citations: 0, citations by others: 1
Volfovsky N  Haas BJ  Salzberg SL 《Genome biology》2001,2(8):research0027.1-research002711

Background

A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats.

Results

The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences.

Conclusions

We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.

8.
Spectral clustering of protein sequences   Total citations: 1, self-citations: 0, citations by others: 1
An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].

9.
10.
Han Si  Lee SG  Kim KH  Choi CJ  Kim YH  Hwang KS 《Bio Systems》2006,84(3):175-182
Most multiple gene sequence alignment methods rely on conventions regarding the score of a multiple alignment in pairwise fashion. Therefore, as the number of sequences increases, the runtime of sequencing expands exponentially. In order to solve the problem, this paper presents a multiple sequence alignment method using a linear-time suffix tree algorithm to cluster similar sequences at one time without pairwise alignment. After searching for common subsequences, cross-matching common subsequences were generated, and sometimes inexact matching was found. So, a procedure aimed at masking the inexact cross-matching pairs was suggested here. In addition, BLAST was combined with a clustering tool in order to annotate the clusters generated by suffix tree clustering. The proposed method for clustering and annotating genes consists of the following steps: (1) construction of a suffix tree; (2) searching and overlapping common subsequences; (3) grouping subsequence pairs; (4) masking cross-matching pairs; (5) clustering gene sequences; (6) annotating gene clusters by the BLAST search. The performance of the proposed system, CLAGen, was successfully evaluated with 42 gene sequences in a TCA cycle (a citrate cycle) of bacteria. The system generated 11 clusters and found the longest subsequences of each cluster, which are biologically significant.

11.
INCLUSive allows automatic multistep analysis of microarray data (clustering and motif finding). The clustering algorithm (adaptive quality-based clustering) groups together genes with highly similar expression profiles. The upstream sequences of the genes belonging to a cluster are automatically retrieved from GenBank and can be fed directly into Motif Sampler, a Gibbs sampling algorithm that retrieves statistically over-represented motifs in sets of sequences, in this case upstream regions of co-expressed genes.

12.
The third codon positions are generally thought to be largely neutral, allowing for synonymous mutations. To see how much information the third positions carry in general, we analyzed the sequences of the nucleotides in the third positions. Simple word count analysis revealed excessive clustering of pyrimidines in the third position sequences of prokaryotic mRNA. The clusters have a clear tendency to follow one after another at a characteristic distance of 25-30 triplets. Thus, the third codon positions do carry a rather strong message. A possible connection with the loop-fold structure of proteins and (cotranslational) protein folding is discussed.
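As a rough illustration of the kind of analysis described, the sketch below extracts the third-position nucleotides from a coding sequence and locates runs of pyrimidines in that derived sequence. The abstract does not describe the exact word-count procedure, so the run-based detection, the minimum run length of 3, and the function names are all assumptions.

```python
import re


def third_positions(cds: str) -> str:
    """Nucleotides at the third position of each complete codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return cds[2:usable:3]


def pyrimidine_runs(third: str, min_len: int = 3) -> list[tuple[int, int]]:
    """(start, length) of each run of pyrimidines (C, T or U) of at least
    min_len in the third-position sequence; start is a triplet index."""
    pattern = r"[CTU]{%d,}" % min_len
    return [(m.start(), len(m.group())) for m in re.finditer(pattern, third)]


# Hypothetical coding sequence: GCT GCC GCT GCA GCT GCC
cds = "GCTGCCGCTGCAGCTGCC"
third = third_positions(cds)
runs = pyrimidine_runs(third)
print(third, runs)
```

Differences between consecutive run start indices are measured in triplets, so on real mRNA data they could be compared with the 25-30 triplet spacing mentioned in the abstract.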

13.
Harte R  Ouzounis CA 《FEBS letters》2002,514(2-3):129-134
Ion channels represent an important class of molecules that can be classified into 13 distinct groups. We present a strategy using a "learning set" of well-annotated ion channel sequences to detect homologues in 32 entire genome sequences from Archaea, Bacteria and Eukarya. A total of 299 putative ion channel protein sequences were detected, with significant variations across species. The clustering of these sequences reveals complex relationships between the different ion channel families.

14.
Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system, Glimmer-MG, that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences for model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.

15.
In previous work, we have shown that a set of characteristics, defined as (code frequency) pairs, can be derived from a protein family by the use of a signal-processing method. This method enables the location and extraction of sequence patterns by taking into account each (code frequency) pair individually. In the present paper, we propose to extend this method in order to detect and visualize patterns by taking into account several pairs simultaneously. Two ‘multifrequency’ methods are described. The first one is based on a rewriting of the sequences with new symbols which summarize the frequency information. The second method is based on a clustering of the patterns associated with each pair. Both methods lead to the definition of significant consensus sequences. Some results obtained with calcium-binding proteins and serine proteases are also discussed. Received on March 6, 1990; accepted on September 24, 1990

16.
17.
Galat A 《Proteins》2004,56(4):808-820
The 18 kDa archetypal cyclosporin-A binding protein, cyclophilin-A, has multiple paralogues in the human genome. Only 18 of those paralogues have been detected as mRNAs or proteins whose masses vary from 18 to 354 kDa, whereas the functional significance of the open reading frames (ORFs) encoding other paralogues of cyclophilin-A remains unknown. The genomes of Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Schizosaccharomyces pombe, and Saccharomyces cerevisiae encode different numbers of the cyclophilin paralogues, some of which are orthologous to the human cyclophilins. A library of novel algorithms was developed and used for computation of the conservation levels for hydrophobicity and bulkiness profiles, and amino acid compositions (AACs) of 303 aligned sequences of cyclophilins. The majority of the paralogues and orthologues encoded in these 6 genomes differ considerably from each other. Some of the orthologues and paralogues have high correlation coefficients (CCFs) for pairwise compared hydrophobicity and bulkiness profiles, and their AACs differ only slightly. Convergence of these three properties of the polypeptide chain and apparent conservation of the typical sequence hallmarks and parameters allowed for the clustering of the functionally related orthologues and paralogues of the cyclophilins. The clustering method allowed the cyclophilins to be sorted into several distinct classes. Analyses of the overlapping clusters of sequences permitted delineation of some hypothetical pathways that might have led to the creation of certain paralogues of cyclophilins in the eukaryotic genomes.

18.
We present a new computational method for predicting ligand binding residues and functional sites in protein sequences. These residues and sites tend not only to be conserved, but also to exhibit strong correlation due to the selection pressure during evolution to maintain the required structure and/or function. To explore the effect of correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based canonical correlation analysis (kCCA) to identify binding and functional sites in protein sequences as the residues that exhibit strong correlation between the residues’ evolutionary characterization at the sites and the structure-based functional classification of the proteins in the context of a functional family. The results of testing the method on two well-curated data sets show that the prediction accuracy as measured by Receiver Operating Characteristic (ROC) scores improves significantly when multipositional correlations are accounted for.

19.
Banerjee AK  M S  M N  Murty US 《Bioinformation》2010,4(10):456-462
Biological systems are highly organized and closely coordinated while maintaining great complexity. The growth of secondary data generation and the progress of modern mining techniques have given us an opportunity to discover hidden intra- and inter-relations within these nonlinear datasets. This will help in understanding complex biological phenomena more efficiently. In this paper we report a comparative classification of Pyruvate Dehydrogenase protein sequences from bacterial sources based on 28 different physicochemical parameters (such as bulkiness, hydrophobicity, total positively and negatively charged residues, α helices, β strands etc.) and the compositions of the 20 amino acid types. Logistic, MLP (Multi Layer Perceptron), SMO (Sequential Minimal Optimization), RBFN (Radial Basis Function Network) and SL (simple logistic) methods were compared in this study. MLP was found to be the best method, with a maximum average accuracy of 88.20%. The same dataset was subjected to clustering using a 2*2 grid of a two-dimensional SOM (Self Organizing Map). Clustering analysis revealed the proximity of the unannotated sequences to the Mycobacterium and Synechococcus genera.

20.
Masking repeats while clustering ESTs   Total citations: 2, self-citations: 0, citations by others: 2
A problem in EST clustering is the presence of repeat sequences. To avoid false matches, repeats have to be masked. This can be a time-consuming process, and it depends on available repeat libraries. We present a fast and effective method that aims to eliminate the problems repeats cause in the process of clustering. Unlike traditional methods, repeats are inferred directly from the EST data; we do not rely on any external library of known repeats. This makes the method especially suitable for analysing the ESTs from organisms without good repeat libraries. We demonstrate that the result is very similar to performing standard repeat masking before clustering.
