Similar articles
 20 similar articles found
1.
2.
In the past, a large number of methods have been developed for predicting various characteristics of a protein from its composition. To exploit the full potential of protein composition, we developed the web server COPid to assist researchers in annotating the function of a protein from its composition, using the whole protein or part of it. COPid has three modules: search, composition and analysis. The search module allows searching of protein sequences in six different databases. Search results list database proteins in ascending order of Euclidean distance or descending order of compositional similarity to the query sequence. The composition module calculates the composition of a sequence and the average composition of a group of sequences. It also computes the composition of various types of amino acids (e.g. charged, polar, hydrophobic residues). The analysis module provides the following options: (i) comparing the composition of two classes of proteins, (ii) creating a phylogenetic tree based on composition and (iii) generating input patterns for machine learning techniques. We evaluated the performance of composition-based (alignment-free) similarity search in predicting the subcellular localization of proteins. The alignment-free method was found to perform reasonably well in predicting certain classes of proteins. The COPid web server is available at http://www.imtech.res.in/raghava/copid/.
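
The composition-based search described in this abstract can be illustrated with a minimal Python sketch. It is not COPid's implementation: the function names, the percent scaling and the dictionary used as a database are assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Percent composition of the 20 standard amino acids in seq."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[a] / total for a in AMINO_ACIDS]


def euclidean_distance(u, v):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_composition(query, database):
    """Rank database entries (hypothetical dict: name -> sequence)
    in ascending order of Euclidean distance to the query."""
    q = composition(query)
    hits = [(name, euclidean_distance(q, composition(s)))
            for name, s in database.items()]
    return sorted(hits, key=lambda hit: hit[1])
```

Calling `rank_by_composition(query, {"p1": seq1, "p2": seq2})` returns the entries closest in composition first, which mirrors the ascending-distance ordering mentioned in the abstract.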

3.
Subword composition plays an important role in many analyses of sequences. Here we define and study the "local decoding of order N of sequences," an alternative that avoids some drawbacks of "subwords of length N" approaches while keeping information about environments of length N in the sequences ("decoding" is taken here in the sense of hidden Markov modeling, i.e., associating a state with every position of the sequence). We present an algorithm for computing the local decoding of order N of a given set of sequences. Its complexity is linear in the total length of the set (whatever the order N), both in time and in memory space. To show a use of local decoding, we propose a very basic dissimilarity measure between sequences which can be computed both from local decoding of order N and from composition in subwords of length N. The accuracies of these two dissimilarities are evaluated, over several datasets, by computing their linear correlations with a reference alignment-based distance. These accuracies are also compared to the one obtained from another recent alignment-free comparison.

4.
Zhang H, Liu X. Bio Systems, 2011, 105(1): 73-82
DNA computing has been applied in broad fields such as graph theory, finite state problems, and combinatorial problems. DNA computing approaches are well suited to solving many combinatorial problems because of their vast parallelism and high-density storage. The CLIQUE algorithm is one of the grid-based clustering techniques for spatial data. It is a combinatorial problem over the dense cells. We therefore use DNA computing with closed-circle DNA sequences to execute the CLIQUE algorithm for two-dimensional data. In our study, the process of clustering becomes a parallel biochemical reaction, and the DNA sequences representing the marked cells can be combined to form closed-circle DNA sequences. This strategy is a new application of DNA computing. Although the strategy applies only to two-dimensional data, it provides a new idea: treating the grid cells as vertices in a graph and transforming the search problem into a combinatorial problem.

5.
SUMMARY: JaDis is a Java application for computing evolutionary distances between nucleic acid sequences and G+C base frequencies. It allows specific comparison of coding sequences, of non-coding sequences or of a non-coding sequence with coding sequences. AVAILABILITY: http://pbil.univ-lyon1.fr/software/jadis.html

6.
An efficient method for matching nucleic acid sequences.
A method of computing the fraction of matches between two nucleic acid sequences at all possible alignments is described. It makes use of the Fast Fourier Transform. It should be particularly efficient for very long sequences, achieving its result in a number of operations proportional to n ln n, where n is the length of the longer of the two sequences. Though the objective achieved is of limited interest, this method will complement algorithms for efficiently finding the longest matching parts of two sequences, and is faster than existing algorithms for finding matches allowing deletions and insertions. A variety of economies can be achieved by this Fast Fourier Transform technique in matching multiple sequences, looking for complementarity rather than identity, and matching the same sequences both in forward and reversed orientations.
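
A minimal Python sketch of the general idea follows. It is an assumption-laden illustration, not the paper's procedure: each base is encoded as a 0/1 indicator sequence, and the cross-correlation of the indicators (computed with a textbook radix-2 FFT) gives the number of matches at every relative shift. The names `fft` and `match_counts` are hypothetical, and the result is a match count; dividing by the overlap length would give the fraction.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    The inverse transform is left unscaled (caller divides by n)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def match_counts(x, y, alphabet="ACGT"):
    """Return {d: number of positions j with x[j + d] == y[j]}
    for every shift d at which the two sequences overlap."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    total = [0.0] * size
    for b in alphabet:
        # Indicator of base b in x, and in reversed y, zero-padded.
        xa = [1.0 if c == b else 0.0 for c in x] + [0.0] * (size - len(x))
        yr = [1.0 if c == b else 0.0 for c in reversed(y)]
        yr += [0.0] * (size - len(y))
        prod = [p * q for p, q in zip(fft(xa), fft(yr))]
        conv = fft(prod, invert=True)
        for k in range(size):
            total[k] += conv[k].real / size
    offset = len(y) - 1  # index k of the convolution is shift k - offset
    return {k - offset: round(total[k])
            for k in range(len(x) + len(y) - 1)}
```

The work is dominated by a fixed number of transforms per base, so the cost grows as n log n in the padded length, in line with the operation count quoted in the abstract.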

7.
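Taking as its starting point the fact that proteins fold into stable structural types of minimum free energy, and aiming to reveal the dynamic nature of protein spatial folding, this study uses the amino acid frequencies and the autocovariance function of protein sequences from a non-homologous protein database as feature vectors, and computes the eigenvalues of the covariance matrix that characterizes the coupling and cooperative interactions among the components of the feature vector. Compared with Chou's method, this approach reflects more comprehensively the degeneracy, global nature and ambiguity of the protein folding code, and provides a dynamic-parameter analysis method for quantitatively characterizing proteins that fold into different structural classes.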

8.
In a sample of DNA sequences where recombination can occur to the ancestors of the sample, distinct parts of the sequences may have different most recent common ancestors. This paper presents a Markov chain Monte Carlo algorithm for computing the expected time to the most recent common ancestor along the sequences, conditional on where the mutations occur on the sequences.

9.
A graphical formula is presented for determining the base ratio of melted DNA. By use of this formula, the composition of sequences which melt in different portions of the melting curves of Clostridium DNA, Escherichia coli DNA, and mouse DNA were determined. As the DNA melts, the per cent of adenine and thymine (AT) in the melted sequences decreases linearly with temperature. The average composition of sequences which melt in a given part of the melting curve is proportional to the base ratio of the DNA. The concentration and average composition of sequences were determined for three parts of the melting curves of the DNA samples, and a frequency distribution curve was constructed. The curve is symmetrical and has a maximum at about 56% AT. The distribution of GC-rich sequences on the E. coli chromosome was estimated by shearing, partially melting, and fractionating the DNA on hydroxylapatite. GC-rich sequences appear to occur every thousand base pairs, and have a maximum length of about 180 base pairs. The graphical formula was applied to the determination of the composition of sequences which melt in different parts of the melting curve of chromatin. Throughout the melting curve, the composition of the melting sequences is about 60% AT, which appears to suggest that relatively long sequences are melting simultaneously. Their melting temperature may be a function of the composition of the protein on different parts of the DNA. The problem of light scattering in DNA-protein and DNA was also investigated. A formula is presented which corrects for light scattering by relating the intensity of the scattered light to the rate of change of absorbance of DNA with wavelength.

10.
Many proteins and genes are members of families that share a common evolutionary history. To outline evolutionary relationships and recognize conserved patterns, sequence comparison is becoming an essential process. This work critically investigates the role of k-mers in the composition vector method for comparing genome sequences. Composition vector methods based on k-mers are generally applied with different choices of k to compare genome sequences. For some values of k the results are satisfactory, but for others they are not. In the proposed work, the standard composition vector method is carried out using a string length of 3 (3-mers). In addition, a special type of information-based similarity index is used as the distance measure. The study establishes that the combination of 3-mers and the information-based similarity index gives satisfactory results in all cases, especially for the comparison of whole genome sequences. These choices provide a kind of unified approach to the comparison of genome sequences.
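
To make the notion of a 3-mer composition vector concrete, here is a minimal Python sketch. It builds a plain relative-frequency vector, without the background correction that some composition vector variants apply. The abstract does not define its information-based similarity index, so a cosine distance is used here purely as a stand-in; both function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Relative frequencies of all len(alphabet)**k k-mers,
    listed in lexicographic order of the k-mers."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)  # ignores k-mers with other symbols
    if total == 0:
        raise ValueError("no valid k-mers found")
    return [counts[w] / total for w in words]


def cosine_distance(u, v):
    """Stand-in distance (assumption): 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm
```

With k = 3 and a four-letter alphabet each genome is reduced to a 64-component vector, so the comparison cost no longer depends on genome length once the vectors are built.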

11.
Subirana JA, Anokian E. Gene, 2011, 473(2): 76-81
A very simple new program (G-SQUARES) is presented. It is useful for visualizing the composition and basic structural features of whole genomes and selected chromosome regions. The frequency of all dimer and tetramer sequences is reported. Overall structural features, such as the tendency for alternation, are calculated. A direct visual comparison among different sequences is easily available. Furthermore, the features that are visualized indicate further studies that should be carried out. Examples are presented for Alu sequences, CpG islands, and whole eukaryotic and bacterial genomes.

12.
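This paper describes a computer program system for nucleic acid sequence analysis implemented on a microcomputer (IBM PC). The system consists of three levels and 18 functional modules; menus and interactive dialogue allow users to learn and use it quickly. The implementation uses data-structure techniques such as tree structures, first-in-last-out stacks and sparse matrices, applies statistical methods such as the Bayes method, and also uses a series of graph-theoretic methods such as Kruskal's algorithm and Floyd's algorithm. The release of this software system should be of some benefit to molecular biology research.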

13.
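Phylogenetic analysis of bacterial diversity in marine sediments from the Pacific sector of the Arctic
Three marine sediment samples from different depths in the Pacific sector of the Arctic were analyzed phylogenetically using PCR combined with denaturing gradient gel electrophoresis (DGGE) of the V3 region of the bacterial 16S rRNA gene. The results show that the DGGE profiles of different layers of the same sediment sample are not entirely identical. A total of 50 sequences were obtained from the three sediment samples. Most showed high similarity (88% to 100%) to bacterial 16S rDNA sequences obtained from marine environments, especially marine sediments, and belonged to phylogenetic groups including the gamma, alpha, beta, epsilon and delta subgroups of the Proteobacteria, the Cytophaga-Flavobacterium-Bacteroides (CFB) group, and the high G+C Gram-positive bacteria. The gamma subgroup of the Proteobacteria was the dominant bacterial group in the sediments.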

14.
In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431-1439) characterized a family of word-based dissimilarity measures that defined distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on Kullback-Leibler discrepancy between frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback-Leibler discrepancy gives a better performance than Euclidean distance. Moreover, under a Markov chain model of order kQ for base composition, where kQ is the estimated order based on the query sequence, standardized Euclidean distance performs very well. Under such a model, it performs as well as Mahalanobis distance and better than Kullback-Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order kQ of base composition is generally recommended. However, if the user is very concerned with computational efficiency, then the use of Kullback-Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology in comparing large datasets of DNA sequences.
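
The two cheapest measures mentioned in this abstract can be sketched in Python as follows. This is a hedged illustration only: the abstract does not give the exact definition of its Kullback-Leibler discrepancy, so the plain (asymmetric) KL divergence in bits is assumed, and a pseudocount is added so that no n-word has zero frequency. The function names and the pseudocount are not taken from the article.

```python
import math
from itertools import product


def word_frequencies(seq, n, alphabet="ACGT", pseudocount=1.0):
    """Frequencies of overlapping n-words; the pseudocount (assumption)
    keeps every frequency strictly positive."""
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = dict.fromkeys(words, pseudocount)
    for i in range(len(seq) - n + 1):
        w = seq[i:i + n]
        if w in counts:  # skip words containing other symbols
            counts[w] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_discrepancy(fx, fy):
    """Kullback-Leibler divergence of fx from fy, in bits."""
    return sum(p * math.log2(p / fy[w]) for w, p in fx.items())


def euclidean(fx, fy):
    """Euclidean distance between two n-word frequency profiles."""
    return math.sqrt(sum((fx[w] - fy[w]) ** 2 for w in fx))
```

Both measures need only a single pass over the 4^n word frequencies, which is consistent with the abstract's remark that the Kullback-Leibler discrepancy can be computed as fast as Euclidean distance.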

15.
Results of Fourier analysis of nucleotide sequences are discussed. The most striking feature of the spectra of both phage and viral nucleic acid sequences is the existence of peaks corresponding to a regular positioning of nucleotides (mainly T and G) with a 3-base period. The amplitude and phase of the similar peaks in the dinucleotide spectra, obtained by digital computation of the Fourier transform, give specific information on amino acid composition, codon bias, and amino acid relations. The width of the frequency band characterizes the tendency of nucleotides to cluster or to occur separately. The blurring of peaks shows the disturbance of long-range order in the regular nucleotide "lattice". The two-dimensional spectral analysis supports the existence of long-range correlation in nucleotide positions.
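
A 3-base periodicity of the kind described here can be probed with a very small Python sketch. It assumes the common indicator-sequence encoding (1 where the chosen base occurs, 0 elsewhere) and evaluates the discrete Fourier sum at frequency 1/3; the abstract does not specify its encoding, and the function name is hypothetical.

```python
import cmath


def period3_power(seq, base):
    """Squared magnitude of the Fourier sum of the indicator sequence
    of `base`, evaluated at frequency 1/3 (period of 3 bases)."""
    s = sum(cmath.exp(-2j * cmath.pi * j / 3)
            for j, c in enumerate(seq.upper()) if c == base)
    return abs(s) ** 2
```

A base that recurs in the same codon position adds in phase and gives a large value, while a base spread evenly over the three positions largely cancels out.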

16.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.
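
The selection step described in this abstract can be sketched in Python under simplifying assumptions. The sketch is not SimDist: it uses a Smith-Waterman score with a linear gap penalty (the abstract does not say whether its gap penalties are linear or affine), and the substitution "matrices" are passed as hypothetical scoring functions keyed by a distance label. It also assumes the candidate matrices are expressed on a comparable scale, otherwise their raw scores cannot be compared directly.

```python
def local_alignment_score(a, b, score, gap):
    """Smith-Waterman optimal local alignment score.
    score(x, y) is the substitution score; gap > 0 is a linear penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(0,
                          prev[j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                          prev[j] - gap,                            # gap in b
                          curr[j - 1] - gap)                        # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def select_matrix(a, b, matrices, gap):
    """Return (label, score) of the matrix giving the highest local
    alignment score; `matrices` maps a distance label to a scoring function."""
    scored = [(label, local_alignment_score(a, b, s, gap))
              for label, s in matrices.items()]
    return max(scored, key=lambda item: item[1])
```

For example, `select_matrix(seq1, seq2, {"near": near_score, "far": far_score}, gap=2)` returns the label of the better-scoring matrix, which in the abstract's scheme would then be taken as the distance of the alignment.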

17.
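A simple and practical nucleic acid sequence analysis program was written in the visual programming language Visual Basic. For a known nucleic acid sequence it automatically determines the molecular mass, Tm value and base composition, converts between various forms of nucleic acid sequence, and deduces amino acid sequences. The Visual Basic programming process used to implement each of these automatic sequence-analysis functions is also described in detail.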
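
A few of the listed functions can be sketched in Python (rather than the Visual Basic of the original program, whose code is not reproduced here). The abstract does not state which Tm formula the program uses; the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is intended for short oligonucleotides, is assumed purely for illustration, and all function names are hypothetical.

```python
def base_composition(seq):
    """Counts of A, C, G and T in a DNA sequence."""
    seq = seq.upper()
    return {b: seq.count(b) for b in "ACGT"}


def gc_percent(seq):
    """G+C content as a percentage of the counted bases."""
    comp = base_composition(seq)
    total = sum(comp.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (comp["G"] + comp["C"]) / total


def wallace_tm(seq):
    """Melting temperature by the Wallace rule (assumed formula)."""
    comp = base_composition(seq)
    return 2 * (comp["A"] + comp["T"]) + 4 * (comp["G"] + comp["C"])


def reverse_complement(seq):
    """One example of a sequence conversion: the reverse complement."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```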

18.
Human DNA has been fractionated according to base composition by sedimentation equilibrium in an HgCl2/Cs2SO4 density gradient, followed by sedimentation equilibrium in an actinomycin/cesium formate density gradient. The fractions of different base composition resulting from this procedure were subsequently analyzed by sedimentation equilibrium in CsCl, DNA renaturation kinetics, and electron microscopy. All fractions contain similar kinetic classes of repeated DNA sequences as judged by renaturation studies. Short (300 nucleotides) interspersed repeated sequences are found in all fractions with no noticeable enrichment for these sequences in any fraction. Repeated sequences from fractions of different base composition are partially able to cross-hybridize, demonstrating that nearly identical repeated sequences occur in molecules of different base composition. These findings are critically compared to reports of successful density gradient fractionations of different human DNA sequence classes.

19.
Storage of sequence data is a major concern because the amount of data generated at many sites is growing exponentially. There is therefore a need for techniques that store data using compression algorithms. Here we describe an optimal storage algorithm (OPTSDNA) for storing large numbers of DNA sequences of varying length, and analyze its performance in a distributed bioinformatics computing system for the analysis of DNA sequences. OPTSDNA was used to store DNA sequences ranging in size from very small to very large in a database. The algorithm calculates the storage size, and the response time is also measured in this work. The efficiency and performance of the algorithm (reported as a percentage in the size calculation) are high compared with other known sequential approaches.

20.
Uno R, Nakayama Y, Tomita M. Gene, 2006, 380(1): 30-37
Chi sequences (5'-GCTGGTGG-3') are cis-acting 8 bp sequence elements that enhance homologous recombination promoted by the RecBCD pathway in Escherichia coli. The genome of E. coli K-12 MG1655 contains 1009 Chi sequences and this frequency far exceeds the expected value for occurrence of an 8 bp sequence in a genome of this size. It is generally thought that the over-representation of Chi sequences indicates that they have been selected for during evolution because of their function in recombination. The genes from three E. coli strains (K-12, O157 and CFT) were classified into three categories (island, match to other E. coli, and backbone). Island genes have a different base composition and codon usage in comparison with those in the backbone genes, therefore they were relatively new and not yet adapted to the base composition patterns and codon usage typical of the recipient genome. The over-representation of Chi sequences was examined by comparing Chi frequencies and codon frequencies between island and backbone genes. The difference in the CTGGTG di-codon frequency between the backbone and island genes was correlated with the frequency of Chi sequences which were translated in the Leu-Val (-G/CTG/GTG/G-) reading frame in the K-12 strain. These results suggest that the main reading frame of Chi sequences increased as a result of the di-codon CTG-GTG increasing under a genome-wide pressure for adapting to the codon usage and base composition of the E. coli K-12 strain, and that the RecBCD recombinase might adjust its recognition sequence to a frequently occurring oligomer such as G-CTG-GTG-G.
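
The comparison between observed and expected Chi frequencies can be illustrated with a minimal Python sketch. The expected value here assumes independent bases drawn with the sequence's own single-base frequencies, which is only one possible null model and not necessarily the one used in the article; strand handling is also left to the caller. The function names are hypothetical.

```python
CHI = "GCTGGTGG"  # Chi sequence as given in the abstract (5' to 3')


def count_motif(seq, motif):
    """Count occurrences of motif in seq, overlapping ones included."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def expected_count(seq, motif):
    """Expected number of occurrences on one strand if bases were
    independent with the single-base frequencies observed in seq."""
    n = len(seq)
    freqs = {b: seq.count(b) / n for b in set(motif)}
    p = 1.0
    for b in motif:
        p *= freqs[b]
    return (n - len(motif) + 1) * p
```

The ratio `count_motif(genome, CHI) / expected_count(genome, CHI)` then gives a simple measure of over-representation for whichever genome sequence is supplied.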
