首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
An Eulerian path approach to global multiple alignment for DNA sequences.   总被引:3,自引:0,他引:3  
With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.  相似文献   

2.
The performance of the computer program for phyloge netic analysis, POY, and its two implemented methods, "optimization alignment" and "fixed-states optimization," are explored for four data sets. Four gap costs are analyzed for every partition; some of the partitions (the 18S rRNA) are treated as a single fragment or in increasing fragments of 3, 10, and 30. Comparisons within and among methods are undertaken according to gap cost, number of fragments in which the sequences are divided, tree length, character congruence, topological congruence, primary homology statements, and computation time.  相似文献   

3.
The level of conservation between two homologous sequences often varies among sequence regions; functionally important domains are more conserved than the remaining regions. Thus, multiple parameter sets should be used in alignment of homologous sequences with a stringent parameter set for highly conserved regions and a moderate parameter set for weakly conserved regions. We describe an alignment algorithm to allow dynamic use of multiple parameter sets with different levels of stringency in computation of an optimal alignment of two sequences. The algorithm dynamically considers various candidate alignments, partitions each candidate alignment into sections, and determines the most appropriate set of parameter values for each section of the alignment. The algorithm and its local alignment version are implemented in a computer program named GAP4. The local alignment algorithm in GAP4, that in its predecessor GAP3, and an ordinary local alignment program SIM were evaluated on 257716 pairs of homologous sequences from 100 protein families. On 168475 of the 257716 pairs (a rate of 65.4%), alignments from GAP4 were more statistically significant than alignments from GAP3 and SIM.  相似文献   

4.
A greedy algorithm for aligning DNA sequences.   总被引:39,自引:0,他引:39  
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.  相似文献   

5.
Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these “bent” alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of α-helices and β-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.  相似文献   

6.
A computer program is designed to facilitate the identification of coding gene's fragments using a set of peptides. The program is written on Basic programming language for personal computer "Iskra-226" (USSR). To accelerate some operations, computer code commands are used. Treatment of 50 DNA fragments by means of 10 peptides takes ca. 1 h of computer time. The program outputs list coding gene's fragments and corresponding peptides. The suggested algorithm is based on our finding that the number of false identifications of a coding gene fragments may be predicted by Poisson distribution and minimized using correct criteria. The suggested method enables one to evaluate the reliability of the true identification of DNA fragments in case of mistakes in primary structure of the gene fragments or peptides.  相似文献   

7.
Octamer sequencing technology (OST) is a primer-directed sequencing strategy in which an individual octamer primer is selected from a pre-synthesized octamer primer library and used to sequence a DNA fragment. However, selecting candidate primers from such a library is time consuming and can be a bottleneck in the sequencing process. To accelerate the sequencing process and to obtain high quality sequencing data, a computer program, electronic OST or eOST, was developed to automatically identify candidate primers from an octamer primer library. eOST integrates the base calling software PHRED to provide a quality assessment for target sequences and identifies potential primer binding sites located within a high quality target region. To increase the sequencing success rate, eOST includes a simple dynamic folding algorithm to automatically calculate the free energy and predict the secondary structure within the template in the vicinity of the octamer-binding site. Several parameters were found to be important, including base quality threshold, the window size of the template sequence segment, and the threshold ΔG value. OST, coupled with the eOST software, can be used to sequence short DNA fragments or in the finishing assembly stage of large-scale sequencing of genomic DNA.  相似文献   

8.
A new approach to sequence comparison: normalized sequence alignment   总被引:3,自引:0,他引:3  
The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment ) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n2log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.  相似文献   

9.
A computer program is described, which constructs maps of restrictionendonuclease cleavage sites in linear or circular DNA molecules,given the fragment lengths in single and double digestions withtwo enzymes. The algorithm is based upon a partition methodand a very simple rule to chain fragments. The program is writtenin Prolog II. Received on July 28, 1987; accepted on December 31, 1987  相似文献   

10.
Rapid DNA sequencing based upon single molecule detection   总被引:1,自引:0,他引:1  
We are developing a laser-based technique for the rapid sequencing of 40-kb or larger fragments of DNA at a rate of 100 to 1000 bases per second. The approach relies on fluorescent labeling of the bases in a single fragment of DNA, attachment of this labeled DNA fragment to a support, movement of the supported DNA fragment into a flowing sample stream, and detection of individual fluorescently labeled bases as they are cleaved from the DNA fragment by an exonuclease. The ability to sequence large fragments of DNA will significantly reduce the amount of subcloning and the number of overlapping sequences required to assemble megabase segments of sequence information.  相似文献   

11.
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.  相似文献   

12.
MOTIVATION: Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. RESULTS: We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. AVAILABILITY: The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.  相似文献   

13.
A computer algorithm is described that utilizes both Edman and mass spectrometric data for simultaneous determination of the amino acid sequences of several peptides in a mixture. Gas phase sequencing of a peptide mixture results in a list of observed amino acids for each cycle of Edman degradation, which by itself may not be informative and typically requires reanalysis following additional chromatographic steps. Tandem mass spectrometry, on the other hand, has a proven ability to analyze sequences of peptides present in mixtures. However, mass spectrometric data may lack a complete set of sequence-defining fragment ions, so that more than one possible sequence may account for the observed fragment ions. A combination of the two types of data reduces the ambiguity inherent in each. The algorithm first utilizes the Edman data to determine all hypothetical sequences with a calculated mass equal to the observed mass of one of the peptides present in the mixture. These sequences are then assigned figures of merit according to how well each of them accounts for the fragment ions in the tandem mass spectrum of that peptide. The program was tested on tryptic and chymotryptic peptides from hen lysozyme, and the results are compared with those of another computer program that uses only mass spectral data for peptide sequencing. In order to assess the utility of this method the program is tested using simulated mixtures of varying complexity and tandem mass spectra of varying quality.  相似文献   

14.
The complexity of the overlap method for sequencing biopolymers   总被引:1,自引:0,他引:1  
The problem of trying to reconstruct the sequence of a biopolymer by using overlapping fragments obtained from cleaving agents is shown to be computationally intractable. This strongly suggests that any computer program for overlap sequencing, even though it may work well for a limited number of inputs, will not work sufficiently for all inputs. However, if the problem is restricted so that certain crucial fragments are known, called prime strings, a sequence can be found efficiently in all cases. Graph theory techniques for doing so can also be used to count the number of sequences consistent with the fragment data to determine whether a unique sequence has been obtained.  相似文献   

15.
KS Lee  RN Kim  BH Yoon  DS Kim  SH Choi  DW Kim  SH Nam  A Kim  A Kang  KH Park  JE Jung  SH Chae  HS Park 《Bioinformation》2012,8(11):532-534
Recently, next generation sequencing (NGS) technologies have led to a revolutionary increase in sequencing speed and costefficacy. Consequently, a vast number of contigs from many recently sequenced bacterial genomes remain to be accurately mapped and annotated, requiring the development of more convenient bioinformatics programs. In this paper, we present a newly developed web-based bioinformatics program, Bacterial Genome Mapper, which is suitable for mapping and annotating contigs that have been assembled from bacterial genome sequence raw data. By constructing a multiple alignment map between target contig sequences and two reference bacterial genome sequences, this program also provides very useful comparative genomics analysis of draft bacterial genomes. AVAILABILITY: The database is available for free at http://mbgm.kribb.re.kr.  相似文献   

16.
We have synthesized two 40 base pair DNA fragments; one fragment contains the consensus DNA site for CAP (fragment 'ICAP'); the other fragment contains the E. coli lac promoter DNA site for CAP (fragment 'LCAP'). We have investigated the binding of CAP to the two DNA fragments using the nitrocellulose filter binding assay. Under standard conditions [( NaCl] = 200 mM, pH = 7.3), CAP exhibits a 450-fold higher affinity for ICAP than for LCAP. The salt dependence of the binding equilibrium indicates that CAP makes eight ion pairs with ICAP, but only six ion pairs with LCAP. Approximately half of the difference in binding free energy for interaction of CAP with ICAP vs. LCAP is attributable to this difference in ion-pair formation. The pH dependence of the binding equilibrium indicates that the eight CAP-ICAP ion pairs and the six CAP-LCAP ion pairs do not involve His residues of CAP.  相似文献   

17.
The new computer algorithm FOUND, which is implemented as an integrated module of the DYANA structure calculation program, is capable of performing systematic local conformation analyses by exhaustive grid searches for arbitrary contiguous fragments of proteins and nucleic acids. It uses torsion angles as the only degrees of freedom to identify all conformations that fulfill the steric and NMR-derived conformational restraints within a contiguous molecular fragment, as defined either by limits on the maximal restraint violations or by the fragment-based DYANA target function value. Sets of mutually dependent torsion angles, for example in ribose rings, are treated as a single degree of freedom. The results of the local conformation analysis include allowed torsion angle ranges and stereospecific assignments for diastereotopic substituents, which are then included in the input of a subsequent structure calculation. FOUND can be used for grid searches comprising up to 13 torsion angles, such as the backbone of a complete -helical turn or dinucleotide fragments in nucleic acids, and yields a significantly higher number of stereospecific assignments than the precursor grid search algorithm HABAS.  相似文献   

18.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.  相似文献   

19.
Sequencing of megabase plus DNA by hybridization: theory of the method   总被引:42,自引:0,他引:42  
A mismatch-free hybridization of oligonucleotides containing from 11 to 20 monomers to unknown DNA represents, in essence, a sequencing of a complementary target. Realizing this, we have used probability calculations and, in part, computer simulations to estimate the types and numbers of oligonucleotides that would have to be synthesized in order to sequence a megabase plus segment of DNA. We estimate that 95,000 specific mixes of 11-mers, mainly of the 5'(A,T,C,G)(A,T,C,G)N8(A,T,C,G)3' type, hybridized consecutively to dot blots of cloned genomic DNA fragments would provide primary data for the sequence assembly. An optimal mixture of representative libraries in M13 vector, having inserts of (i) 7 kb, (ii) 0.5 kb genomic fragments randomly ligated in up to 10-kb inserts, and (iii) tandem "jumping" fragments 100 kb apart in the genome, will be needed. To sequence each million base pairs of DNA, one would need hybridization data from about 2100 separate hybridization sample dots. Inevitable gaps and uncertainties in alignment of sequenced fragments arising from nonrandom or repetitive sequence organization of complex genomes and difficulties in cloning "poisonous" sequences in Escherichia coli, inherent to large sequencing by any method, have been considered and minimized by choice of libraries and number of subclones used for hybridization. Because it is based on simpler biochemical procedures, our method is inherently easier to automate than existing sequencing methods. The sequence can be derived from simple primary data only by extensive computing. Phased experimental tests and computer simulations increasing in complexity are needed before accurate estimates can be made in terms of cost and speed of sequencing by the new approach. Nevertheless, sequencing by hybridization should show advantages over existing methods because of the inherent redundancy and parallelism in its data gathering.  相似文献   

20.
An approach for DNA sequencing is described that circumvents the need for synthetic oligonucleotide primers, which seriously restrict the progress of DNA sequencing in the commonly used protocol. The method is based on the use of short restriction fragments as primers randomly distributed along single-stranded templates. Premapping of target DNA is eliminated and subcloning manipulation is minimized. This method has been used successfully for sequencing genes in the range of 2 kb, for which about 10 restriction fragment primers per kilobase were sufficient to generate a continuous overlapping sequence in alignment. The approach has also been readily applied for an automated sequencing system with the fluorescent chain-terminating dideoxynucleotides, thus implying its potential for sequencing large genomic DNAs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号