Similar literature
Found 20 similar documents (search time: 15 ms)
1.
MOTIVATION: Traditional sequence distances require an alignment and therefore are not directly applicable to the problem of whole genome phylogeny, where events such as rearrangements make full-length alignments impossible. We present a sequence distance that works on unaligned sequences using the information-theoretic concept of Kolmogorov complexity, and a program to estimate this distance. RESULTS: We establish the mathematical foundations of our distance and illustrate its use by constructing a phylogeny of the Eutherian orders using complete unaligned mitochondrial genomes. This phylogeny is consistent with the commonly accepted one for the Eutherians. A second, larger mammalian dataset is also analyzed, yielding a phylogeny generally consistent with the commonly accepted one for the mammals. AVAILABILITY: The program to estimate our sequence distance is available at http://www.cs.cityu.edu.hk/~cssamk/gencomp/GenCompress1.htm. The distance matrices used to generate our phylogenies are available at http://www.math.uwaterloo.ca/~mli/distance.html.
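To illustrate the general idea of an alignment-free, compression-based distance, the following minimal sketch computes the normalized compression distance (NCD) with Python's zlib as a stand-in compressor. This is an assumption made for illustration: it is neither the GenCompress program nor necessarily the exact distance defined by the authors, and the function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude proxy for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences (no alignment needed).

    Values near 0 suggest the sequences share most of their information;
    values near 1 suggest they share very little.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A general-purpose compressor such as zlib only sees repeats within a 32 KB window, so this sketch is unlikely to work well on whole genomes; a DNA-specific compressor would be the natural replacement.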

2.
In "The ends of a large RNA molecule are necessarily close", Yoffe et al. (Nucleic Acids Res 39(1):292-299, 2011) used the programs RNAfold [resp. RNAsubopt] from Vienna RNA Package to calculate the distance between 5' and 3' ends of the minimum free energy secondary structure [resp. thermal equilibrium structures] of viral and random RNA sequences. Here, the 5'-3' distance is defined to be the length of the shortest path from the 5' node to the 3' node in the undirected graph whose edge set consists of edges {i, i + 1} corresponding to covalent backbone bonds and of edges {i, j} corresponding to canonical base pairs. From repeated simulations and using a heuristic theoretical argument, Yoffe et al. conclude that the 5'-3' distance is less than a fixed constant, independent of RNA sequence length. In this paper, we provide a rigorous mathematical framework to study the expected distance from 5' to 3' ends of an RNA sequence. We present recurrence relations that precisely define the expected distance from 5' to 3' ends of an RNA sequence, both for the Turner nearest neighbor energy model and for a simple homopolymer model first defined by Stein and Waterman. We implement dynamic programming algorithms to compute (rather than approximate by repeated application of Vienna RNA Package) the expected distance between 5' and 3' ends of a given RNA sequence, with respect to the Turner energy model. Using methods of analytical combinatorics that depend on complex analysis, we prove that the asymptotic expected 5'-3' distance of length n homopolymers is approximately equal to the constant 5.47211, while the asymptotic distance is 6.771096 if hairpins have a minimum of 3 unpaired bases and the probability that any two positions can form a base pair is 1/4. Finally, we analyze the 5'-3' distance for secondary structures from the STRAND database, and conclude that the 5'-3' distance is correlated with RNA sequence length.
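The graph definition of the 5'-3' distance given above can be sketched for a single secondary structure as a breadth-first search. This is only an illustration of the definition, not the recurrence relations or dynamic programming of the paper; it assumes the structure is supplied in dot-bracket notation without pseudoknots, and the function names are hypothetical.

```python
from collections import deque


def pair_table(structure: str) -> dict:
    """Map each paired position to its partner, reading '(' and ')' in dot-bracket notation."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def end_to_end_distance(structure: str) -> int:
    """Shortest path from the 5' node to the 3' node.

    Edges are backbone bonds {i, i + 1} and base pairs {i, j}; every edge has length 1.
    """
    n = len(structure)
    if n == 0:
        raise ValueError("structure must not be empty")
    pairs = pair_table(structure)
    dist = {0: 0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n - 1:
            break
        neighbors = [i - 1, i + 1]
        if i in pairs:
            neighbors.append(pairs[i])
        for j in neighbors:
            if 0 <= j < n and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist[n - 1]
```

For example, in `.((((...)))).` the path runs one backbone step, one base pair, and one more backbone step, so the distance is 3 even though the sequence has 13 nucleotides.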

3.
4.
Summary: An overview of recent molecular analyses regarding origins of plastids in algal lineages is presented. Since different phylogenetic analyses can yield contradictory views of algal plastid origins, we have examined the effect of two distance measurement methods and two distance matrix tree-building methods upon topologies for the ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit nucleotide sequence data set. These results are contrasted with those from bootstrap parsimony analysis of nucleotide sequence data subsets. It is shown that the phylogenetic information contained within nucleotide sequences for the chloroplast-encoded gene for the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase, integral to photosynthesis, indicates an independent origin for this plastid gene in different plant taxa. This finding is contrasted with contrary results derived from 16S rRNA sequences. Possible explanations for discrepancies observed for these two different molecules are put forth. Other molecular sequence data which address questions of early plant evolution and the eubacterial origins of algal organelles are discussed. Offprint requests to: W. Martin

5.
Summary: We have previously made a set of DNA constructs by fusing the mature part of Bacillus licheniformis α-amylase with the signal sequence of B. amyloliquefaciens α-amylase at various distances from the signal sequence cleavage site. We observed that the level of α-amylase production in B. subtilis depended strongly on the distance of the junction from the signal sequence cleavage site, with quite a sharp optimum distance. To test whether the effect is limited to the pair of α-amylase signal sequence and mature protein, we analysed the protein production in a set of constructs in which an Escherichia coli β-lactamase was similarly joined at different distances from the α-amylase signal sequence. In this case, too, the distance seemed to be an important factor affecting the level of production in B. subtilis. The observed effect might depend on the modulation of pre-protein folding, which in turn could affect the secretion level. Offprint requests to: M. Sibakov

6.
The wcd system is an open source tool for clustering expressed sequence tags (EST) and other DNA and RNA sequences. wcd allows efficient all-versus-all comparison of ESTs using either the d(2) distance function or edit distance, improving existing implementations of d(2). It supports merging, refinement and reclustering of clusters. It is 'drop in' compatible with the StackPack clustering package. wcd supports parallelization under both shared memory and cluster architectures. It is distributed with an EMBOSS wrapper allowing wcd to be installed as part of an EMBOSS installation (and so provided by a web server). AVAILABILITY: wcd is distributed under a GPL licence and is available from http://code.google.com/p/wcdest. SUPPLEMENTARY INFORMATION: Additional experimental results. The wcd manual, a companion paper describing underlying algorithms, and all datasets used for experimentation can also be found at www.bioinf.wits.ac.za/~scott/wcdsupp.html.
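As a rough illustration of a d(2)-style comparison, the sketch below computes the sum of squared differences between the word (k-mer) counts of two whole sequences. It is a simplification under stated assumptions: it does not reproduce wcd's implementation, which the abstract does not describe, and the word length `k=6` and the function names are hypothetical choices.

```python
from collections import Counter


def word_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int = 6) -> int:
    """Sum over all words of the squared difference in their counts in x and y."""
    cx, cy = word_counts(x, k), word_counts(y, k)
    # Counter returns 0 for words that are absent from one of the sequences.
    return sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys())
```

Because only word counts are compared, no alignment is computed, which is what makes all-versus-all comparison of many ESTs tractable.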

7.
To investigate the functional sites on a protein and to predict binding sites (residues) in proteins, it is often necessary to identify the binding-site residues at different distance thresholds from protein three-dimensional (3D) structures. For the study of a particular protein chain and its interaction with a ligand in complex form, researchers have to parse the output of different available tools or databases to find binding-site residues. Here we have developed a tool, ContPro, for calculating amino acid contact distances in proteins at different distance thresholds from the 3D structure of the protein. For an input protein 3D structure, ContPro can quickly find all binding-site residues in the protein by calculating distances, and it allows researchers to select the distance threshold, protein chain and ligand of interest. Additionally, it can parse the protein model (in the case of a multi-model protein coordinate file) and the sequence of the selected protein chain in FASTA format from the input 3D structure. The tool will be useful for the identification and analysis of binding sites of proteins from 3D structures at different distance thresholds. AVAILABILITY: It can be accessed at http://procarb.org/contpro/
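The distance-threshold idea can be sketched as follows: a residue counts as a binding-site residue if any of its atoms lies within the chosen threshold of any ligand atom. This is a minimal sketch, not ContPro's code; the input layout (tuples of residue identifier and coordinates) and the function name are assumptions made for illustration.

```python
import math


def binding_site_residues(protein_atoms, ligand_atoms, threshold):
    """Return the sorted residue identifiers that have an atom within `threshold` of the ligand.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom.
    ligand_atoms:  iterable of (x, y, z) tuples, one per ligand atom.
    threshold:     cutoff distance in the same unit as the coordinates.
    """
    ligand = list(ligand_atoms)  # allow the ligand atoms to be scanned repeatedly
    hits = set()
    for residue_id, coord in protein_atoms:
        if any(math.dist(coord, lig) <= threshold for lig in ligand):
            hits.add(residue_id)
    return sorted(hits)
```

Calling the function with several thresholds on the same structure shows how the predicted binding site grows as the cutoff is relaxed.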

8.
MOTIVATION: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. RESULTS: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. AVAILABILITY: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

9.
We proposed a fast and unsupervised clustering method, minimum span clustering (MSC), for analyzing the sequence–structure–function relationship of biological networks, and demonstrated its validity in clustering the sequence/structure similarity networks (SSN) of 682 membrane protein (MP) chains. The MSC clustering of MPs based on their sequence information was found to be consistent with their tertiary structures and functions. For the largest seven clusters predicted by MSC, the consistency in chain function within the same cluster is found to be 100%. From analyzing the edge distribution of SSN for MPs, we found a characteristic threshold distance for the boundary between clusters, over which SSN of MPs could be properly clustered by an unsupervised sparsification of the network distance matrix. The clustering results of MPs from both MSC and the unsupervised sparsification methods are consistent with each other, and have high intracluster similarity and low intercluster similarity in sequence, structure, and function. Our study showed a strong sequence–structure–function relationship of MPs. We discussed evidence of convergent evolution of MPs and suggested applications in finding structural similarities and predicting biological functions of MP chains based on their sequence information. Proteins 2015; 83:1450–1461. © 2015 Wiley Periodicals, Inc.

10.
Most molecular analyses, including phylogenetic inference, are based on sequence alignments. We present an algorithm that estimates relatedness between biomolecules without the requirement of sequence alignment by using a protein frequency matrix that is reduced by singular value decomposition (SVD), in a latent semantic index information retrieval system. Two databases were used: one with 832 proteins from 13 mitochondrial gene families and another composed of 1000 sequences from nine types of proteins retrieved from GenBank. Firstly, 208 sequences from the first database and 200 from the second were randomly selected and compared using edit distance between each pair of sequences and respective cosines and Euclidean distances from SVD. Correlation between cosine and edit distance was -0.32 (P < 0.01) and between Euclidean distance and edit distance was +0.70 (P < 0.01). In order to check the ability of SVD in classifying sequences according to their categories, we used a sample of 202 sequences from the 13 gene families as queries (test set), and the other proteins (630) were used to generate the frequency matrix (training set). The classification algorithm applies a voting scheme based on the five most similar sequences with each query. With a 3-peptide frequency matrix, all 202 queries were correctly classified (accuracy = 100%). This algorithm is very attractive, because sequence alignments are neither generated nor required. In order to achieve results similar to those obtained with edit distance analysis, we recommend that Euclidean distance be used as a similarity measure for protein sequences in latent semantic indexing methods.

11.
12.
Bio3D is a family of R packages for the analysis of biomolecular sequence, structure, and dynamics. Major functionality includes biomolecular database searching and retrieval, sequence and structure conservation analysis, ensemble normal mode analysis, protein structure and correlation network analysis, principal component, and related multivariate analysis methods. Here, we review recent package developments, including a new underlying segregation into separate packages for distinct analysis, and introduce a new method for structure analysis named ensemble difference distance matrix analysis (eDDM). The eDDM approach calculates and compares atomic distance matrices across large sets of homologous atomic structures to help identify the residue wise determinants underlying specific functional processes. An eDDM workflow is detailed along with an example application to a large protein family. As a new member of the Bio3D family, the Bio3D‐eddm package supports both experimental and theoretical simulation‐generated structures, is integrated with other methods for dissecting sequence‐structure–function relationships, and can be used in a highly automated and reproducible manner. Bio3D is distributed as an integrated set of platform independent open source R packages available from: http://thegrantlab.org/bio3d/ .

13.
A multitude of motif-finding tools have been published, which can generally be assigned to one of three classes: expectation-maximization, Gibbs-sampling or enumeration. Irrespective of this grouping, most motif detection tools only take into account similarities across ungapped sequence regions, possibly causing short motifs located peripherally and in varying distance to a 'core' motif to be missed. We present a new method, adding to the set of expectation-maximization approaches, that permits the use of gapped alignments for motif elucidation. Availability: The program is available for download from: http://bioinfoserver.rsbs.anu.edu.au/downloads/mclip.jar. Supplementary information: http://bioinfoserver.rsbs.anu.edu.au/utils/mclip/info.php.

14.
Evolutionary distance matrices of the extant hominoids are computed from DNA sequence data, and hominoid DNA phylogenies are reconstructed by applying the neighbor-joining method to these distance matrices. The chimpanzee is clustered with the human in most of the phylogenetic trees thus obtained. The proportion of the distance between human and chimpanzee to that between human/chimpanzee and orangutan is estimated. Both mitochondrial DNA and nuclear DNA show a similar value (0.44), which is close to values derived from DNA-DNA hybridization data.

15.
SUMMARY: MatrixPlot is a program for making high-quality matrix plots, such as mutual information plots of sequence alignments and distance matrices of sequences with known three-dimensional coordinates. The user can add information about the sequences (e.g. a sequence logo profile) along the edges of the plot, as well as zoom in on any region in the plot. AVAILABILITY: MatrixPlot can be obtained on request, and can also be accessed online at http://www.cbs.dtu.dk/services/MatrixPlot. CONTACT: gorodkin@cbs.dtu.dk

16.
GRIL is a tool to automatically identify collinear regions in a set of bacterial-size genome sequences. GRIL uses three basic steps. First, regions of high sequence identity are located. Second, some of these regions are filtered based on user-specified criteria. Finally, the remaining regions of sequence identity are used to define significant collinear regions among the sequences. By locating collinear regions of sequence, GRIL provides a basis for multiple genome alignment using current alignment systems. GRIL also provides a basis for using current inversion distance tools to infer phylogeny. AVAILABILITY: GRIL is implemented in C++ and runs on any x86-based Linux or Windows platform. It is available from http://asap.ahabs.wisc.edu/gril

17.
A new sequence distance measure for phylogenetic tree construction   (total citations: 5; self-citations: 0; citations by others: 5)
MOTIVATION: Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of an evolutionary model. The multiple alignment strategy does not work for all types of data, e.g. whole genome phylogeny, and the evolutionary models may not always be correct. We propose a new sequence distance measure based on the relative information between the sequences using Lempel-Ziv complexity. The distance matrix thus obtained can be used to construct phylogenetic trees. RESULTS: The proposed approach does not require sequence alignment and is totally automatic. The algorithm has successfully constructed consistent phylogenies for real and simulated data sets. AVAILABILITY: Available on request from the authors.
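A distance based on Lempel-Ziv complexity can be sketched as below. The abstract does not state the formula, so the normalized form used in `lz_distance` is an assumption chosen for illustration, as are the function names; `lz_complexity` counts the components of the exhaustive history of a sequence (Lempel and Ziv, 1976).

```python
def lz_complexity(s: str) -> int:
    """Number of components in the exhaustive Lempel-Ziv (1976) history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the candidate while it can still be copied from earlier text.
        # The copy source may overlap the candidate, hence the bound i + length - 1.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s: str, q: str) -> float:
    """Assumed normalized distance: how many new components each sequence adds to the other."""
    if not s or not q:
        raise ValueError("both sequences must be non-empty")
    cs, cq = lz_complexity(s), lz_complexity(q)
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)
```

For instance, `AACGTACCATTG` parses into the seven components A, AC, G, T, ACC, AT, TG. Appending a sequence to a closely related one adds few new components, so the distance stays small. This straightforward parser is slow on long inputs and is meant only to show the idea.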

18.
MOTIVATION: No general theory guides the selection of gap penalties for local sequence alignment. We empirically determined the most effective gap penalties for protein sequence similarity searches with substitution matrices over a range of target evolutionary distances from 20 to 200 Point Accepted Mutations (PAMs). RESULTS: We embedded real and simulated homologs of protein sequences into a database and searched the database to determine the gap penalties that produced the best statistical significance for the distant homologs. The most effective penalty for the first residue in a gap (q+r) changes as a function of evolutionary distance, while the gap extension penalty for additional residues (r) does not. For these data, the optimal gap penalties for a given matrix scaled in 1/3 bit units (e.g. BLOSUM50, PAM200) are q = 25 - 0.1 * (target PAM distance), r = 5. Our results provide an empirical basis for selection of gap penalties and demonstrate how optimal gap penalties behave as a function of the target evolutionary distance of the substitution matrix. These gap penalties can improve expectation values by at least one order of magnitude when searching with short sequences, and improve the alignment of proteins containing short sequences repeated in tandem.
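The empirical rule quoted above, q = 25 - 0.1 * (target PAM distance) and r = 5 for matrices scaled in 1/3 bit units, can be written as a small helper. The function names are hypothetical, and rejecting distances outside the studied range of 20 to 200 PAMs is an assumption of this sketch rather than something the abstract prescribes.

```python
def gap_penalties(target_pam: float):
    """Return (q, r) for a substitution matrix scaled in 1/3 bit units.

    q is the gap-open component and r the per-residue extension penalty,
    so the first residue in a gap costs q + r.
    """
    if not 20 <= target_pam <= 200:
        raise ValueError("rule was determined for target distances of 20 to 200 PAMs")
    q = 25 - target_pam / 10  # same as 25 - 0.1 * target_pam
    r = 5
    return q, r


def gap_cost(length: int, target_pam: float) -> float:
    """Total penalty for a gap of `length` residues: q + r for the first, r for each additional one."""
    q, r = gap_penalties(target_pam)
    return q + r * length
```

At a target distance of 200 PAMs this gives q = 5 and r = 5, so a single-residue gap costs 10; at 50 PAMs, q rises to 20 while r is unchanged, which matches the statement that only the first-residue penalty depends on evolutionary distance.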

19.
We used sequence variation within 297 bp of control region mitochondrial DNA (mtDNA) amplified from 53 lesser long-nosed bats, Leptonycteris curasoae (Phyllostomidae: Glossophaginae), captured at 13 locations in the south-western United States and Mexico and one site in Venezuela, to infer population structure and possible migration routes of this endangered nectar- and fruit-eating species. Phylogenetic analysis using maximum parsimony and UPGMA confirmed species and subspecies distinctions within Leptonycteris and revealed two clades exhibiting 3% sequence divergence within the Mexican subspecies, L. c. yerbabuenae. Even though many roosts contained L. c. yerbabuenae from both clades, weak population structure was detected both by a correlation between genetic differentiation, F_ST, and geographical distance and by a cladistic estimate of the number of migration events required to align bat sequences with geographical location on maximum parsimony, as compared to random, trees. Three results suggest that L. c. yerbabuenae are more likely to migrate between sites along the Pacific coast of Mexico or along the foothills of the Sierra Madre Occidental than between these regions. (1) Seventeen of 20 bats which shared an identical sequence were captured up to 1800 km apart but within the same putative migration corridor. (2) Residuals from a regression of F_ST on distance were greater between than within these regions. (3) Fewer migration events were required to align bats with these two groups than expected from random assignment. We recommend analysing independent genetic data and monitoring bat visitation to roost sites during migration to confirm these postulated movements.

20.
Distance based algorithms are a common technique in the construction of phylogenetic trees from taxonomic sequence data. The first step in the implementation of these algorithms is the calculation of a pairwise distance matrix to give a measure of the evolutionary change between any pair of the extant taxa. A standard technique is to use the log det formula to construct pairwise distances from aligned sequence data. We review a distance measure valid for the most general models, and show how the log det formula can be used as an estimator thereof. We then show that the foundation upon which the log det formula is constructed can be generalized to produce a previously unknown estimator which improves the consistency of the distance matrices constructed from the log det formula. This distance estimator provides a consistent technique for constructing quartets from phylogenetic sequence data under the assumption of the most general Markov model of sequence evolution.
