首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.

Background  

The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N 2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.  相似文献   

2.
模式发现是生物信息学的一个重要研究方向,但目前的大部分算法还不能保证获得最优的模式.文章推导了针对三个序列片段相似性关系的判据,将其作为剪枝规则,提出并实现了一种深度优先的穷举搜索算法——判据搜索算法(criterion search algorithm,CRISA),理论分析表明,对绝大多数模式发现问题,CRISA具有多项式的计算时间复杂度和线性的空间复杂度。对仿真的和实际的生物序列数据的测试也表明,CRISA能够快速而完全地识别出序列中所有的模式,具有优于其它算法的总体评价,能够应用于实际的模式发现问题。  相似文献   

3.
ABSTRACT: BACKGROUND: Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads. RESULTS: Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only. CONCLUSIONS: Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/readjoiner.  相似文献   

4.
We developed an efficient Monte Carlo Simulated Annealing (MCSA) program for modeling protein loops with high speed. The total conformational energy in each step of MCSA simulation consists of two parts: the nonbonded atomic interaction represented by a simple soft-sphere potential and the harmonic distance constraint to ensure the smooth connection of the loop segment to the rest of the protein structure. The soft-sphere potential was a simplified potential that has been successfully used by the authors in modeling the carbohydrate part of glycoprotein systems [H. Zhang, Y. Yang, L. Lai, and Y. Tang (1996), Carbohydrate Research, Vol. 284, pp. 25–34]. It only considers the purely repulsive steric interactions to avoid artificial attractive forces between atoms in the absence of solvent molecules. The N-terminal of the loop segment was connected to the bulk protein part, and two dummy main-chain atoms N and Cα immediately following the C-terminal of the loop segment were constrained to their real positions in the protein structure, which not only assures the correct geometry of loop-protein connection but also is more rigorous than the previous work. To improve the speed, two strategies, the local region method and grid-mapping method, were devised to accelerate the computation of environmental interaction that is responsible for the major part of the computing consumption. The grid-mapping method can reduce computational time dramatically. Conformations with rational steric packing and smooth connection to the rest of the protein structure were generated by the MCSA program, and then were refined by the empirical force field CHARMm [B. R. Brook, R. E. Braccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus (1983), Journal of Computational Chemistry, Vol. 4, pp. 187–217]. Bovine pancreatic trypsin inhibitor (BPTI) was used as an example to test the ability of loop modeling of the method, and five loops in BPTI were calculated. Conformations close to the crystal structure were generated for all of them. With the criteria of CHARMm energy, near-native conformations can be selected, for example, the backbone rms deviation 0.93 A from the crystal structure was gotten for the longest 9-residue loop. © 1997 John Wiley & Sons, Inc.  相似文献   

5.
6.
A new string searching algorithm is presented aimed at searchingfor the occurrence of character patterns in longer charactertexts. The algorithm, specifically designed for nucleic acidsequence data, is essentially derived from the Boyer –Moore method (Comm. ACM, 20, 762 – 772, 1977). Both patternand text data are compressed so that the natural 4-letter alphabetof nucleic acid sequences is considerably enlarged. The stringsearch starts from the last character of the pattern and proceedsin large jumps through the text to be searched. The data compressionand searching algorithm allows one to avoid searching for patternsnot present in the text as well as to inspect, for each pattern,all text characters until the exact match with the text is found.These considerations are supported by empirical evidence andcomparisons with other methods.  相似文献   

7.
We describe an alternative method for scoring of the pairwise alignment of two biological sequences. Designed to overcome the bias due to the composition of the alignment, it measures the distance (in standard deviations) between the given alignment and the mean value of all other alignments that can be obtained by a permutation of either sequence. We demonstrate that the standard deviation can be calculated efficiently. By concentrating upon the ungapped case, the mean and standard deviation can be calculated exactly and in two steps, the first being O(N) time, where N is the length of the sequence, the second in a fixed number of calculations, i.e., in O(1) time. We argue that this statistic is a more consistent measure than a similarity score based upon a standard scoring matrix. Even in the ungapped case, the statistic proves in many cases to be more accurate than the commonly used (FASTA) (Pearson and Lipman, 1988) gapped Z-score in which the sequence is matched against a random sample of the database. We demonstrate the use of the POZ-score as a secondary filter which screens out several well-known types of false positive, reducing the amount of manual screening to be done by the biologist.  相似文献   

8.
A fast and sensitive multiple sequence alignment algorithm   总被引:4,自引:0,他引:4  
A two-step multiple alignment strategy is presented that allowsrapid alignment of a set of homologous sequences and comparisonof pre-aligned groups of sequences. Examples are given demonstratingthe improvement in the quality of alignments when comparingentire groups instead of single sequences. The modular designof computer programs based on this algorithm allows for storageof aligned sequences and successive alignment of any numberof sequences. Received on August 23, 1988; accepted on December 6, 1988  相似文献   

9.
MOTIVATION: Graph drawing algorithms are often used for visualizing relational information, but a naive implementation of a graph drawing algorithm encounters real difficulties when drawing large-scale graphs such as protein interaction networks. RESULTS: We have developed a new, extremely fast layout algorithm for visualizing large-scale protein interaction networks in the three-dimensional space. The algorithm (1) first finds a layout of connected components of an entire network, (2) finds a global layout of nodes with respect to pivot nodes within a connected component and (3) refines the local layout of each connected component by first relocating midnodes with respect to their cutvertices and direct neighbors of the cutvertices and then by relocating all nodes with respect to their neighbors within distance 2. Advantages of this algorithm over classical graph drawing methods include: (1) it is an order of magnitude faster, (2) it can directly visualize data from protein interaction databases and (3) it provides several abstraction and comparison operations for effectively analyzing large-scale protein interaction networks. AVAILABILITY: http://wilab.inha.ac.kr/interviewer/  相似文献   

10.
艾亮  冯杰 《生物信息学》2023,21(3):179-186
本文提出了一种新的快速非比对的蛋白质序列相似性与进化分析方法。在刻画蛋白质序列特征时,首先将氨基酸的10种理化性质通过主成分分析浓缩为6个主成分,并且将每条蛋白质序列里的氨基酸数目作为权重对主成分得分值进行加权平均,然后再融合氨基酸的位置信息构成一个26维的蛋白质序列特征向量,最后利用欧式距离度量蛋白质序列间的相似性及进化关系。通过对3个蛋白质序列数据集的测试表明,本文提出的方法能将每条蛋白质序列准确聚类,并且简便快捷,说明了该方法的有效性。  相似文献   

11.
An efficient algorithm for large-scale detection of protein families   总被引:6,自引:0,他引:6  
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.  相似文献   

12.
MOTIVATION: Existing algorithms for automated protein structure alignment generate contradictory results and are difficult to interpret. An algorithm which can provide a context for interpreting the alignment and uses a simple method to characterize protein structure similarity is needed. RESULTS: We describe a heuristic for limiting the search space for structure alignment comparisons between two proteins, and an algorithm for finding minimal root-mean-squared-distance (RMSD) alignments as a function of the number of matching residue pairs within this limited search space. Our alignment algorithm uses coordinates of alpha-carbon atoms to represent each amino acid residue and requires a total computation time of O(m(3) n(2)), where m and n denote the lengths of the protein sequences. This makes our method fast enough for comparisons of moderate-size proteins (fewer than approximately 800 residues) on current workstation-class computers and therefore addresses the need for a systematic analysis of multiple plausible shape similarities between two proteins using a widely accepted comparison metric.  相似文献   

13.
The basic linear treatment of sequence comparisons limits the ability of contemporary sequence alignment algorithms to detect non-order-conserving recombinations. Here, we introduce the algorithm combAlign which addresses the assessment of pairwise sequence similarity on non-order-conserving recombinations on a large scale. Emphasizing a two-level approach, combAlign first detects locally well conserved subsequences in a target and a source sequence. Subsequently, the relative placement of alignments is mapped to a graph. Concatenating local alignments to reassemble the target sequence to the fullest extent, the maximum scoring path through the graph denotes the best attainable combAlignment. Parameters influencing this process can be set to meet the user's specific demands. combAlign is applied to examples demonstrating the possibility to reflect evolutionary kinship of proteins even if their domains and motifs are strongly rearranged.  相似文献   

14.
15.
A new and efficient Monte Carlo algorithm for sampling protein configurations in the continuous space is presented; the efficiency of this algorithm, named Local Moves for Proteins (LMProt), was compared to other alternative algorithms. For this purpose, we used an intrachain interaction energy function that is proportional to the root mean square deviation (rmsd) with respect to alpha-carbons from native structures of real proteins. For phantom chains, the LMProt method is approximately 10(4) and 20 times faster than the algorithms Thrashing (no local moves) and Sevenfold Way (local moves), respectively. Additionally, the LMProt was tested for real chains (excluded-volume all-atoms model); proteins 5NLL (138 residues) and 1BFF (129 residues) were used to determine the folding success xi as a function of the number eta of residues involved in the chain movements, and as a function of the maximum amplitude of atomic displacement delta r(max). Our results indicate that multiple local moves associated with relative chain flexibility, controlled by appropriate adjustments for eta and delta r(max), are essential for configurational search efficiency.  相似文献   

16.
We have parallelized the FASTA algorithm for biological sequencecomparison using Linda, a machine-independent parallel programminglanguage. The resulting parallel program runs on a variety ofdifferent parallel machines. A straightforward parallelizationstrategy works well if the amount of computation to be doneis relatively large. When the amount of computation is reduced,however, disk I/O becomes a bottleneck which may prevent additionalspeed-up as the number of processors is increased. The paperdescribes the parallelization of FASTA, and uses FASTA to illustratethe I/O bottleneck problem that may arise when performing paralleldatabase search with a fast sequence comparison algorithm. Thepaper also describes several program design strategies thatcan help with this problem. The paper discusses how this bottleneckis an example of a general problem that may occur when parallelizing,or otherwise speeding up, a time-consuming computation. Received on July 25, 1990; accepted on October 15, 1990  相似文献   

17.
Modifications of the amino acid sequence generally affect protein stability. Here, we use knowledge-based potentials to estimate the stability of protein structures under sequence variation. Calculations on a variety of protein scaffolds result in a clear distinction of known mutable regions from arbitrarily chosen control patches. For example, randomly changing the sequence of an antibody paratope yields a significantly lower number of destabilized mutants as compared to the randomization of comparable regions on the protein surface. The technique is computationally efficient and can be used to screen protein structures for regions that are amenable to molecular tinkering by preserving the stability of the mutated proteins.  相似文献   

18.
We present an efficient and sensitive hybrid algorithm for local structure alignment of a pair of 3D protein structures. The hybrid algorithm employs both the URMS (unit-vector root mean squared) metric and the RMS metric. Our algorithm searches efficiently the transformation space using a fast screening protocol; initial transformations (rotations) are identified using the URMS algorithm. These rotations are then clustered and an RMS-based dynamic programming algorithm is invoked to find the maximal local similarities for representative rotations of the clusters. Statistical significance of the alignments is estimated using a model that accounts for both the score of the match and the RMS. We tested our algorithm over the SCOP classification of protein domains. Our algorithm performs very well; its main advantages are that (1) it combines the advantages of the RMS and the URMS metrics, (2) it searches extensively the transformation space, (3) it detects complex similarities and structural repeats, and (4) its results are symmetric. The software is available for download at biozon.org/ftp/software/urms/.  相似文献   

19.
We present a fast algorithm to produce a graphic matrix representationof sequence homology. The algorithm is based on lexicographicalordering of fragments. It preserves most of the options of asimple naive algorithm with a significant increase in speed.This algorithm was the basis for a program, called DNAMAT, thathas been extensively tested during the last three years at theWeizmann Institute of Science and has proven to be very useful.In addition we suggest a way to extend our approach to analysea series of related DNA or RNA sequences, in order to determinecertain common structural features. The analysis is done by‘summing’ a set of dot-matrices to produce an overallmatrix that displays structural elements common to most of thesequences. We give an example of this procedure by analysingtRNA sequences. Received on June 26, 1986; accepted on September 28, 1986  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号