Similar articles
 Found 20 similar articles (search time: 109 ms)
1.
In this paper, we introduce a probabilistic measure for computing the similarity between two biological sequences without alignment. The similarity measure is computed from the Kullback-Leibler divergence between two Markov models, one constructed from each sequence. We first validate the method by clustering nine chromosomes from three species. Second, we report the results of a similarity search based on the new method. Finally, we apply the measure to the construction of a phylogenetic tree of 48 HEV genome sequences. Our results indicate that the weighted relative entropy is an efficient and powerful alignment-free measure for the analysis of sequences at the genomic scale.
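The idea of comparing two sequences through the Kullback-Leibler divergence of their Markov models can be sketched as follows. This is a minimal illustration, not the authors' exact weighted relative entropy: the function names, the first-order default, the additive pseudocount, the weighting by context frequency, and the symmetrization are all assumptions made for the example, and the input is assumed to contain only the letters A, C, G and T.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four letters


def transition_probs(seq, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = Counter()
    for i in range(len(seq) - order):
        counts[(seq[i:i + order], seq[i + order])] += 1
    probs = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[(ctx, b)] for b in ALPHABET) + pseudocount * len(ALPHABET)
        for b in ALPHABET:
            probs[(ctx, b)] = (counts[(ctx, b)] + pseudocount) / total
    return probs


def context_weights(seq, order=1):
    """Empirical frequency of each context, used to weight per-context divergences."""
    n = len(seq) - order
    counts = Counter(seq[i:i + order] for i in range(n))
    return {ctx: counts[ctx] / n
            for ctx in map("".join, product(ALPHABET, repeat=order))}


def weighted_kl(p_seq, q_seq, order=1):
    """Context-weighted KL divergence from the model of p_seq to the model of q_seq."""
    p = transition_probs(p_seq, order)
    q = transition_probs(q_seq, order)
    w = context_weights(p_seq, order)
    return sum(w[ctx] * p[(ctx, b)] * math.log(p[(ctx, b)] / q[(ctx, b)])
               for ctx in w for b in ALPHABET)


def symmetric_divergence(a, b, order=1):
    """Average of the two directed divergences, so the result is order-independent."""
    return 0.5 * (weighted_kl(a, b, order) + weighted_kl(b, a, order))
```

A pairwise matrix of `symmetric_divergence` values could then be passed to any standard clustering or tree-building routine; smoothing keeps every transition probability positive so the logarithm is always defined.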

2.
This article introduces an alignment-free method for clustering all 66 sequentially diverse DOR protein sequences. Two variants are discussed: one uses the twenty standard amino acids (without grouping), and the other uses a chemical grouping of amino acids (with grouping). Two grayscale images, each representing a protein sequence by its ordered-pair frequency matrix, are compared using a morphology technique to obtain a similarity index. On the ND5 dataset, the without-grouping and with-grouping variants achieved correlation coefficients of 0.9734 and 0.9403, respectively, with the ClustalW result, which is much better than some existing alignment-free methods. Based on the similarity index, the 66 DORs are clustered into three classes (Highest, Moderate and Lowest), which appear to fit the 66 DOR protein sequences best. Our investigation substantiates that OR83b is the distinguished olfactory receptor expressed in divergent insect populations.

3.
Remote homology detection among proteins using only unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles, together with the Markov clustering algorithms, are currently the most popular methods for protein family recognition. Their deviation from random walks with inflation, or their dependence on a hard threshold in the similarity measure, means that these methods need to be enhanced for homology detection among multi-domain proteins. We propose combining spectral clustering with neighborhood kernels in Markov similarity to enhance sensitivity in detecting homology independent of “recent” paralogs. The spectral clustering approach with new combined local alignment kernels exploits the unsupervised protein sequences more effectively, globally reducing inter-cluster walks. When combined with corrections based on a modified symmetry-based proximity norm that de-emphasizes outliers, the technique proposed in this article outperforms the other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also shows superior performance scores for the proposed kernels. A similar performance improvement is also found on an existing large dataset. Therefore, the proposed spectral clustering framework over combined local alignment kernels with modified symmetry-based correction achieves superior performance for unsupervised remote homolog detection, even in multi-domain and promiscuous-domain proteins from Genolevures database families, with better biological relevance. Source code available upon request. Contact: rf.irbal@rakras.

4.
5.

Background

DNA clustering is an important technique for automatically finding the inherent relationships among large numbers of DNA sequences, but its quality can still be improved greatly. The DNA sequence similarity metric is one of the key factors in clustering. The alignment-free methodology is a very popular way to calculate DNA sequence similarity. It normally converts a sequence into a feature space based on the probability distribution of words rather than directly matching strings. Existing alignment-free models, e.g. k-tuple, employ only word frequency information and ignore many types of useful information contained in the DNA sequence, such as the classification of nucleotide bases, their positions and the like. It is believed that better data mining results can be achieved with compounded information. Therefore, we present a new alignment-free model that employs compounded information to improve DNA clustering quality.

Results

This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases from DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases, and then yields a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare the CPF model with several mainstream alignment-free models, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings.

Conclusions

The following conclusions can be drawn from the experiments. (1) The hybrid information model is better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which gives a substantial performance advantage. (3) The CPF model achieves efficient, stable performance and generalizes broadly.
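For orientation, the k-tuple baseline mentioned above, and the idea of recoding a DNA sequence by base category, can be sketched as below. This is not the CPF model itself, whose entropy-based feature computation is not specified in the abstract. The three groupings in `CATEGORY_MAPS` (purine/pyrimidine, amino/keto, weak/strong) are a common convention and are assumed here rather than taken from the paper; all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def ktuple_frequencies(seq, k=2, alphabet="ACGT"):
    """Relative frequency of every length-k word over `alphabet` (overlapping windows)."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Assumed base classifications; the abstract does not say which three are used.
CATEGORY_MAPS = {
    "purine_pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino_keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak_strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def recode(seq, mapping):
    """Rewrite a DNA sequence over a two-letter category alphabet."""
    return "".join(mapping[base] for base in seq)
```

For example, `ktuple_frequencies(recode(seq, CATEGORY_MAPS["weak_strong"]), k=2, alphabet="WS")` gives a four-dimensional frequency vector for one category sequence; a position-aware model would replace the plain frequencies with richer features.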

6.
Bacteriocins are proteinaceous toxins produced and exported by both gram-negative and gram-positive bacteria as a defense mechanism. The bacteriocin protein family is highly diverse, which complicates the identification of bacteriocin-like sequences using alignment approaches. The use of topological indices (TIs) irrespective of sequence similarity can be a promising alternative for predicting proteinaceous bacteriocins. Thus, we present Topological Indices to BioPolymers (TI2BioP) as an alignment-free approach inspired by both the Topological Substructural Molecular Design (TOPS-MODE) and the Markov Chain Invariants for Network Selection and Design (MARCH-INSIDE) methodologies. TI2BioP allows the calculation of spectral moments as simple TIs to seek quantitative sequence-function relationship (QSFR) models. Since hydrophobicity and basicity are major criteria for the bactericidal activity of bacteriocins, the spectral moments (HPμ_k) were derived for the first time from artificial protein secondary structures based on amino acid clustering into a Cartesian system of hydrophobicity and polarity. Several orders of HPμ_k were used to characterize numerically 196 bacteriocin-like sequences and a control group of 200 representative CATH domains. They were then used to develop an alignment-free QSFR model that discriminates bacteriocin proteins from other domains with 76.92% accuracy, a relevant result considering the high sequence diversity among the members of both groups. The model showed an overall prediction performance of 72.16%, specifically detecting 66.7% of proteinaceous bacteriocins, whereas InterProScan retrieved just 60.2%. As a practical validation, the model also successfully predicted the cryptic bactericidal function of the Cry 1Ab C-terminal domain from Bacillus thuringiensis’s endotoxin, which has not been detected by classical alignment methods.

7.
8.

Background  

Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.

9.
Following sequence alignment, clustering algorithms are among the most utilized techniques in gene expression data analysis. Clustering gene expression patterns allows researchers to determine which patterns are alike and most likely to participate in the same biological process under investigation. Gene expression data also allow the clustering of whole samples, which makes it possible to find which samples are similar and, consequently, which sampled biological conditions are alike. Here, a novel similarity measure and the resulting rank-based clustering algorithm are presented. The clustering was applied to 418 gene expression samples from 13 data series spanning three model organisms: Homo sapiens, Mus musculus, and Arabidopsis thaliana. The initial results are striking: more than 91% of the samples were clustered as expected. The MESs (most expressed sequences) approach outperformed some of the most widely used clustering algorithms applied to this kind of data, such as hierarchical clustering and K-means. The clustering performance suggests that the new similarity measure is an alternative to the traditional correlation/distance measures typically used in clustering algorithms.

10.
A probabilistic measure for alignment-free sequence comparison   Cited by: 3 (self-citations: 0; citations by others: 3)
MOTIVATION: Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULTS: The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri, and one random sequence having the same base composition as thrA from E. coli. These results were compared with those obtained from the CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). AVAILABILITY: All datasets and computer codes written in MATLAB are available upon request from the first author.

11.

Background  

Several studies have demonstrated that protein fold space is structured hierarchically and that power-law statistics are satisfied in the relation between the numbers of protein families and protein folds (or superfamilies). We examined the internal structure and statistics in the fold space of 50 amino-acid residue segments taken from various protein folds. We used inter-residue contact patterns to measure the tertiary structural similarity among segments. Using this similarity measure, the segments were classified into a number (K_c) of clusters. We examined various K_c values for the clustering. The spatial resolution for differentiating the segment tertiary structures increases with increasing K_c. Furthermore, we constructed networks by linking structurally similar clusters.

12.

Background  

Vaccine development in the post-genomic era often begins with the in silico screening of genome information, with the most probable protective antigens being predicted rather than requiring causative microorganisms to be grown. Despite the obvious advantages of this approach, such as speed and cost efficiency, its success remains dependent on the accuracy of antigen prediction. Most approaches use sequence alignment to identify antigens. This is problematic for several reasons. Some proteins lack obvious sequence similarity, although they may share similar structures and biological properties. The antigenicity of a sequence may be encoded in a subtle and recondite manner not amenable to direct identification by sequence alignment. The discovery of truly novel antigens will be frustrated by their lack of similarity to antigens of known provenance. To overcome the limitations of alignment-dependent methods, we propose a new alignment-free approach for antigen prediction, which is based on auto cross covariance (ACC) transformation of protein sequences into uniform vectors of principal amino acid properties.
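The auto cross covariance transformation mentioned above turns a variable-length sequence into a fixed-length vector. A minimal sketch follows, assuming each residue has already been replaced by a vector of numeric property descriptors (the abstract does not list the descriptor values, so they are an input here). The function name, the `max_lag` parameter and the normalization by `n - lag` are assumptions; published variants differ in normalization and usually mean-center the descriptors first.

```python
def acc_transform(props, max_lag):
    """Auto/cross covariance features for one sequence.

    props   : list of per-residue descriptor vectors (n residues, d values each)
    max_lag : largest residue separation considered; must be smaller than n
    Returns a list of d * d * max_lag values, independent of sequence length.
    """
    n = len(props)
    d = len(props[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):          # descriptor at position i
            for k in range(d):      # descriptor at position i + lag
                total = sum(props[i][j] * props[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features
```

Terms with `j == k` are the auto covariances and the remaining terms are the cross covariances. Because the output length depends only on `d` and `max_lag`, sequences of different lengths map to vectors of the same size, which is what allows them to be fed to a standard classifier.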

13.
MOTIVATION: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation. RESULTS: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro. Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.

14.
The data deluge in the post-genomic era demands the development of novel data mining tools. Existing molecular phylogeny analyses (MPAs) developed for individual gene/protein sequences are alignment-based. However, the size of genomic data and the uncertainties associated with alignments necessitate the development of alignment-free methods for MPA. Derivation of distances between sequences is an important step in both alignment-dependent and alignment-free methods. Various alignment-free distance measures based on oligonucleotide frequencies, information content, compression techniques, etc. have been proposed. However, these distance measures do not account for the relative order of components, viz. nucleotides or amino acids. A new distance measure based on the concept of the 'return time distribution' (RTD) of k-mers is proposed, which accounts for both the sequence composition and the relative order of k-mers. Statistical parameters of the RTDs are used to derive a distance function. The resultant distance matrix is used for clustering and phylogeny using Neighbor-joining. Its performance for MPA and subtyping was evaluated using simulated data generated by block-bootstrap, receiver operating characteristics and leave-one-out cross validation methods. The proposed method was successfully applied to MPA of the family Flaviviridae and to subtyping of Dengue viruses. The method is observed to retain resolution for classification and subtyping of viruses at varying levels of sequence similarity and taxonomic hierarchy.
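A return time distribution can be illustrated with a short sketch. Here the return time of a k-mer is taken to be the difference between the start positions of two successive occurrences, and each k-mer is summarized by the mean and population standard deviation of its return times. Both the counting convention and the choice of summary statistics are assumptions made for the example (the abstract only says "statistical parameters"), and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from statistics import mean, pstdev


def return_times(seq, k):
    """Map each k-mer to the list of gaps between its successive start positions."""
    last_seen = {}
    rtd = defaultdict(list)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            rtd[word].append(i - last_seen[word])
        last_seen[word] = i
    return rtd


def rtd_vector(seq, k=1, alphabet="ACGT"):
    """Concatenate (mean, standard deviation) of return times for every possible k-mer.

    K-mers that occur fewer than twice have no return time and contribute (0.0, 0.0).
    """
    rtd = return_times(seq, k)
    vec = []
    for word in map("".join, product(alphabet, repeat=k)):
        times = rtd.get(word, [])
        vec.extend([mean(times), pstdev(times)] if times else [0.0, 0.0])
    return vec
```

Two sequences can then be compared by, for example, the Euclidean distance between their `rtd_vector` outputs, and the resulting distance matrix passed to a tree-building method such as Neighbor-joining. Unlike plain k-mer frequencies, the vector changes when the same k-mers are rearranged along the sequence.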

15.

Background  

The development of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNPs), has the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases the chance of interrogating functional SNPs that are themselves causative or in linkage disequilibrium with causal SNPs, commonly used single SNP-association approaches suffer from serious multiple hypothesis testing problems and provide limited insights into combinations of loci that may contribute to complex diseases. Drawing inspiration from Gene Set Enrichment Analysis developed for gene expression data, we have developed a method, named GLOSSI (Gene-loci Set Analysis), that integrates prior biological knowledge into the statistical analysis of genotyping data to test the association of a group of SNPs (loci-set) with complex disease phenotypes. The most significant loci-sets can be used to formulate hypotheses from a functional viewpoint that can be validated experimentally.

16.

Background

The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a serious bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data sets based on the similarities between data pairs, a similarity matrix must be constructed before affinity propagation can run, and this construction itself takes a long time.

Methods

Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes.

Results

A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also performs well when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.

17.
To provide a framework for studies to understand the contribution of SALT OVERLY SENSITIVE1 (SOS1) to salt tolerance in Thellungiella halophila, we sequenced and annotated a 193-kb T. halophila BAC containing a putative SOS1 locus (ThSOS1) and compared the sequence to the orthologous 146-kb region of the genome of its salt-sensitive relative, Arabidopsis thaliana. Overall, the two sequences were colinear, but three major expansion/contraction regions in T. halophila were found to contain five Long Terminal Repeat retrotransposons, MuDR DNA transposons and intergenic sequences that contribute to the 47.8-kb size variation in this region of the genome. Twenty-seven genes were annotated in the T. halophila BAC including the putative ThSOS1 locus. ThSOS1 shares gene structure and sequence with A. thaliana SOS1 including 11 predicted transmembrane domains and a cyclic nucleotide-binding domain; however, different patterns of Simple Sequence Repeats were found within a 540-bp region upstream of SOS1 in the two species.

18.
Locally optimal subalignments using nonlinear similarity functions   Cited by: 2 (self-citations: 0; citations by others: 2)
Nonlinear similarity functions are often better than linear functions at distinguishing interesting subalignments from those due to chance. Nonlinear similarity functions useful for comparing biological sequences are developed. Several new algorithms are presented for finding locally optimal subalignments of two sequences. Unlike previous algorithms, they may use any reasonable similarity function as a selection criterion. Among these algorithms are VV-1, which finds all and only the locally optimal subalignments of two sequences, and CC-1, which finds all and only the weakly locally optimal subalignments of two sequences. The VV-1 algorithm is slow and interesting only for theoretical reasons. In contrast, the CC-1 algorithm has average time complexity O(MN) when used to find only very good subalignments.

19.
Given a set S of n locally aligned sequences, it is often necessary to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new clustering method which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n^3 + out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n^2 log n + out). Furthermore, we present another algorithm which is much easier to implement and, though theoretically slower than the first one, still runs in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets.
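The overlap criterion and the reduction to maximal cliques can be illustrated as follows. Each locally aligned sequence is modeled as an interval (start, end) on a common coordinate axis, and two intervals are connected when their overlap is at least c times the length of the longer one. The cliques are enumerated with the generic Bron-Kerbosch procedure, which is swapped in here for brevity and is not the sweep-line algorithm described in the abstract; the interval representation and all function names are assumptions.

```python
def overlap(a, b):
    """Length of the intersection of two intervals given as (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def c_max_tolerance_graph(intervals, c):
    """Adjacency sets: i and j are linked if they overlap by >= c * the longer length."""
    n = len(intervals)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(intervals[i][1] - intervals[i][0],
                          intervals[j][1] - intervals[j][0])
            if overlap(intervals[i], intervals[j]) >= c * longest:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)
```

With `intervals = [(0, 100), (10, 110), (25, 125), (200, 300)]` and `c = 0.8`, the first and third intervals overlap by only 75 positions and are therefore not linked, so they end up in different maximal cliques even though both are linked to the second interval. Turning overlapping cliques into a partition requires a further heuristic, which the abstract mentions but does not detail.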

20.

Background

Protein phosphorylation is a generic way to regulate signal transduction pathways in all kingdoms of life. In many organisms, it is achieved by the large family of Ser/Thr/Tyr protein kinases which are traditionally classified into groups and subfamilies on the basis of the amino acid sequence of their catalytic domains. Many protein kinases are multi-domain in nature but the diversity of the accessory domains and their organization are usually not taken into account while classifying kinases into groups or subfamilies.

Methodology

Here, we present an approach which considers amino acid sequences of complete gene products, in order to suggest refinements in sets of pre-classified sequences. The strategy is based on alignment-free similarity scores and iterative Area Under the Curve (AUC) computation. Similarity scores are computed by detecting common patterns between two sequences and scoring them using a substitution matrix, with a consistent normalization scheme. This allows us to handle full-length sequences, and implicitly takes into account domain diversity and domain shuffling. We quantitatively validate our approach on a subset of 212 human protein kinases. We then employ it on the complete repertoire of human protein kinases and suggest a few qualitative refinements in the subfamily assignment stored in the KinG database, which is based on catalytic domains only. Based on our new measure, we delineate 37 cases of potential hybrid kinases: sequences for which the classical classification based entirely on catalytic domains is inconsistent with the full-length similarity scores computed here, which implicitly consider the multi-domain nature and regions outside the catalytic kinase domain. We also provide some examples of hybrid kinases of the protozoan parasite Entamoeba histolytica.

Conclusions

The implicit consideration of multi-domain architectures is a valuable inclusion to complement other classification schemes. The proposed algorithm may also be employed to classify other families of enzymes with multi-domain architecture.
