Similar literature
 20 similar articles found
1.
This paper presents a method for the multiple alignment of a sequence set. The MASH algorithm uses a non-redundant database of common motifs and an ‘alignment priority’ criterion that depends on the length and the occurrence frequency of the patterns in the set of sequences. This user-defined criterion allows the determination of the series of patterns to be aligned. The program is applied to a fragment of the envelope gene env gp120 for 20 isolates of the immunodeficiency virus. The multiplicity of alignments obtained by modifying the criterion parameters reveals different aspects of similarity between the sequences. Received on June 4, 1990; accepted on December 14, 1990

2.
We propose a new method, called the ‘size leap’ algorithm, to search for motifs of maximum size that are common to at least two fragments. It allows the creation of a reduced database of motifs from a set of sequences whose size obeys the series of Fibonacci numbers. The convenience lies in the efficiency of the motif extraction. It can be applied in the establishment of overlap regions for DNA sequence reconstruction and multiple alignment of biological sequences. The method of complete DNA sequence reconstruction by extraction of the longest motifs (‘anchor motifs’) is presented as an application of the size leap algorithm. The details of a reconstruction from three sequenced fragments are given as an example. Received on February 12, 1991; accepted on February 15, 1991

3.
We present a fast algorithm to produce a graphic matrix representation of sequence homology. The algorithm is based on lexicographical ordering of fragments. It preserves most of the options of a simple naive algorithm with a significant increase in speed. This algorithm was the basis for a program, called DNAMAT, that has been extensively tested during the last three years at the Weizmann Institute of Science and has proven to be very useful. In addition we suggest a way to extend our approach to analyse a series of related DNA or RNA sequences, in order to determine certain common structural features. The analysis is done by ‘summing’ a set of dot-matrices to produce an overall matrix that displays structural elements common to most of the sequences. We give an example of this procedure by analysing tRNA sequences. Received on June 26, 1986; accepted on September 28, 1986
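The idea of finding dot-matrix hits through lexicographically ordered fragments can be sketched as follows. This is a minimal illustration, not the DNAMAT implementation: the function name `dot_matrix_hits`, the fixed fragment length `k`, and the exact-match criterion are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def dot_matrix_hits(a, b, k):
    """Return sorted (i, j) pairs such that a[i:i+k] == b[j:j+k].

    Hypothetical helper: the k-length fragments of b are sorted
    lexicographically once, so each fragment of a is located by
    binary search instead of being compared against every position of b.
    """
    # Sort the fragments of b together with their start positions.
    frags = sorted((b[j:j + k], j) for j in range(len(b) - k + 1))
    keys = [frag for frag, _ in frags]

    hits = []
    for i in range(len(a) - k + 1):
        word = a[i:i + k]
        lo = bisect_left(keys, word)   # first fragment equal to word
        hi = bisect_right(keys, word)  # one past the last equal fragment
        hits.extend((i, frags[p][1]) for p in range(lo, hi))
    return sorted(hits)


if __name__ == "__main__":
    # Each pair is one dot of the matrix: row i in a, column j in b.
    print(dot_matrix_hits("ACGTAC", "TACGA", 3))
```

Each returned pair corresponds to one dot of the matrix. Under the same assumptions, ‘summing’ several matrices would amount to counting how often each (i, j) cell is hit across the pairwise comparisons.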

4.
Analyzing protein-DNA recognition mechanisms   Cited by: 1 (self-citations: 0, citations by others: 1)
We present a computational algorithm that can be used to analyze the generic mechanisms involved in protein-DNA recognition. Our approach is based on energy calculations for the full set of base sequences that can be threaded onto the DNA within a protein-DNA complex. It is able to reproduce experimental consensus binding sequences for a variety of DNA binding proteins and also correlates well with the order of measured binding free energies. These results suggest that the crystal structure of a protein-DNA complex can be used to identify all potential binding sequences. By analyzing the energy contributions that lead to base sequence selectivity, it is possible to quantify the concept of direct versus indirect recognition and to identify a new concept describing whether the protein-DNA interaction and DNA deformation terms select optimal binding sites by acting in accord or in disaccord.

5.
6.
Discovering simple DNA sequences by the algorithmic significance method   Cited by: 6 (self-citations: 1, citations by others: 5)
A new method, ‘algorithmic significance’, is proposed as a tool for discovery of patterns in DNA sequences. The main idea is that patterns can be discovered by finding ways to encode the observed data concisely. In this sense, the method can be viewed as a formal version of the Occam's Razor principle. In this paper the method is applied to discover significantly simple DNA sequences. We define DNA sequences to be simple if they contain repeated occurrences of certain ‘words’ and thus can be encoded in a small number of bits. This definition includes minisatellites and microsatellites. A standard dynamic programming algorithm for data compression is applied to compute the minimal encoding lengths of sequences in linear time. An electronic mail server for identification of simple sequences based on the proposed method has been installed at the Internet address pythia@anl.gov.

7.
An algorithm has been developed for the identification of unknown patterns which are distinctive for a set of short DNA sequences believed to be functionally equivalent. A pattern is defined as being a string, containing fully or partially specified nucleotides at each position of the string. The advantage of this ‘vague’ definition of the pattern is that it imposes minimum constraints on the characterization of patterns. A new feature of the approach developed here is that it allows a ‘fair’ simultaneous testing of patterns of all degrees of degeneracy. This analysis is based on an evaluation of inhomogeneity in the empirical occurrence distribution of any such pattern within a set of sequences. The use of the nonparametric kernel density estimation of Parzen allows one to assess small disturbances among the sequence alignments. The method also makes it possible to identify sequence subsets with different characteristic patterns. This algorithm was implemented in the analysis of patterns characteristic of sets of promoters, terminators and splice junction sequences. The results are compared with those obtained by other methods. Received on November 17, 1986; accepted on June 15, 1987

8.
The explosive growth in biological data in recent years has led to the development of new methods to identify DNA sequences. Many algorithms have recently been developed that search DNA sequences looking for unique DNA sequences. This paper considers the application of the Burrows-Wheeler transform (BWT) to the problem of unique DNA sequence identification. The BWT transforms a block of data into a format that is extremely well suited for compression. This paper presents a time-efficient algorithm to search for unique DNA sequences in a set of genes. This algorithm is applicable to the identification of yeast species and other DNA sequence sets.
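For readers unfamiliar with the Burrows-Wheeler transform itself, a minimal sketch follows. It shows only the transform and its inverse using the textbook sorted-rotations construction, not the search algorithm of the paper, which the abstract does not describe. The function names and the `$` sentinel are assumptions chosen for the example, and the construction is quadratic in memory, so it suits short strings only.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustrative only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Sort every rotation of s; the transform is the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Invert the transform by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the transformed column to every row, then sort again.
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original text is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("GATTACA")
    print(encoded, inverse_bwt(encoded))
```

The transform tends to group identical characters that share a following context, which is what makes the output compress well and supports fast substring queries in practical index structures.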

9.
Repetitive extragenic palindromic (REP) sequences are highly conserved inverted repeat sequences originally discovered in Escherichia coli and Salmonella typhimurium. We have physically mapped these sequences in the E. coli genome by using Southern hybridization of an ordered phage bank of E. coli (Y. Kohara, K. Akiyama, and K. Isono, Cell 50:495-508, 1987) with generic REP probes derived from the REP consensus sequence. The set of REP probe-hybridizing clones was correlated with a set of clones expected to contain REP sequences on the basis of computer searches. We also show that a generic REP probe can be used in Southern hybridization to analyze genomic DNA digested with restriction enzymes to determine genetic relatedness among natural isolates of E. coli. A search for these sequences in other members of the family Enterobacteriaceae shows a consistent correlation between both the number of occurrences and the hybridization strength and genealogical relationship.

10.
An algorithm for searching restriction maps   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper presents an algorithm that searches a DNA restriction enzyme map for regions that approximately match a shorter 'probe' map. Both the map and the probe consist of a sequence of address-enzyme pairs denoting restriction sites, and the algorithm penalizes a potential match for undetected or missing sites and for discrepancies in the distance between adjacent sites. The algorithm was designed specifically for comparing relatively short DNA sequences with a long restriction map, a problem that will become increasingly common as large physical maps are generated. The algorithm has been used to extract information from a restriction map of the entire Escherichia coli genome. Received on October 28, 1989; accepted on February 2, 1990

11.
Here we present a performance test of a Kohonen feature map applied to the fast extraction of uncommon sequences from the coding region of the human insulin receptor gene. We used a network with 30 neurons and with a variable input window. The program was aimed at detecting unique or uncommon DNA regions present in crude sequence data and was able to automatically detect the signal peptide coding regions of a set of human insulin receptor gene data. The testing of this program with HSIRPR cDNA release (EMBL data bank) indicated the presence of unique features in the signal peptide coding region. On the basis of our results this program can automatically detect ‘singularity’ from crude sequencing data and it does not require knowledge of the features to be found. Received on August 27, 1990; accepted on March 14, 1991

12.
13.
Correlation Finder is free software for exhaustively searching for correlations between nucleotides in genomic sequences. It can analyze generic DNA sequences as well as genic sequences where the codon phase needs to be taken into account. Its graphical interface makes it easy to set the parameters that characterize the motifs being sought. The tool handles large data sets and runs on the Windows operating system.

14.
gm: a practical tool for automating DNA sequence analysis   Cited by: 1 (self-citations: 0, citations by others: 1)
The gm (gene modeler) program automates the identification of candidate genes in anonymous, genomic DNA sequence data. gm accepts sequence data, organism-specific consensus matrices and codon asymmetry tables, and a set of parameters as input; it returns a set of models describing the structures of candidate genes in the sequence and a corresponding set of predicted amino acid sequences as output. gm is implemented in C, and has been tested on Sun, VAX, Sequent, MIPS and Cray computers. It is capable of analyzing sequences of several kilobases containing multi-exon genes in >1 min execution time on a Sun 4/60. Received on December 4, 1989; accepted on February 28, 1990

15.
Reconsideration of the term “gene” should take into account (a) the potential clash between hierarchical levels of information discussed in the 1970s by Gregory Bateson, (b) the contrast between conventional and genome phenotypes discussed in the 1980s by Richard Grantham, and (c) the emergence in the 1990s of a new science, Evolutionary Bioinformatics, that views genomes as channels conveying multiple forms of information through the generations. From this perspective, there is conceptual continuity between the functional “gene” of Mendel and today’s GenBank sequences. If the function attributed to a gene can change specifically as the result of a DNA mutation, then the mutated part of DNA can be considered as part of the gene. Conversely, even if appearing to locate within a gene, a mutation that does not change the specific function is not part of the gene, although it may change some other function to which the DNA sequence contributes. This strict definition is impractical, but serves as a guide to more workable, context-dependent definitions. The gene is either (1) the DNA sequence that is transcribed, (2) the latter plus the immediate 5′ and 3′ sequences that, when mutated, specifically affect the function, or (3) the latter two, plus any remote sequences that, when mutated, specifically affect the function. Attempts, such as that of Scherrer and Jost, to redefine Mendel’s “gene” may be too narrowly focused on regulation to the exclusion of other important themes.

16.
An algorithm, ‘phylogenetic scanning’, is described for mapping gene conversion events where comparative DNA sequence data are available from different species. In this algorithm, sets of hypothetical phylogenetic trees are constructed that describe possible sequence relationships due to gene conversions in different species lineages; these trees are then evaluated by the principle of parsimony at intervals in the sequence alignment. When used to map gene conversion events that occurred between the pair of -globin genes of higher primates, the algorithm gives results nearly identical to those obtained using a tedious manual approach. Suggestions are also provided for adaptation of this procedure to the analysis of other recombination events. Received on July 3, 1990; accepted on November 8, 1990

17.
In previous work, we have shown that a set of characteristics, defined as (code frequency) pairs, can be derived from a protein family by the use of a signal-processing method. This method enables the location and extraction of sequence patterns by taking into account each (code frequency) pair individually. In the present paper, we propose to extend this method in order to detect and visualize patterns by taking into account several pairs simultaneously. Two ‘multifrequency’ methods are described. The first one is based on a rewriting of the sequences with new symbols which summarize the frequency information. The second method is based on a clustering of the patterns associated with each pair. Both methods lead to the definition of significant consensus sequences. Some results obtained with calcium-binding proteins and serine proteases are also discussed. Received on March 6, 1990; accepted on September 24, 1990

18.
19.
We describe a fast computer algorithm for identifying consensus patterns in DNA sequences. The method requires no prior assumptions about the consensus pattern other than its length. In particular no previous knowledge of the frequency or spacing of consensus patterns is required. However, a priori information about the shape of the consensus pattern, or invariability of individual positions, or the overall conservation level, can be utilized to enhance the selectivity and sensitivity of search. As the number of all possible consensus words increases very rapidly with length, comprehensive searches have usually been restricted to a maximum of 10–12 nucleotides, even when large mainframes are used. Our algorithm enables searching for consensus patterns of this order on current mid-range and powerful microcomputers. Searches may be conducted on single, long sequences or a set of possibly aligned shorter sequences. We give examples of identified consensus patterns in both prokaryotic and eukaryotic DNA sequences, along with some typical program timings. Received on January 14, 1991; accepted on March 5, 1991

20.
A space-efficient algorithm for local similarities   Cited by: 3 (self-citations: 0, citations by others: 3)
Existing dynamic-programming algorithms for identifying similar regions of two sequences require time and space proportional to the product of the sequence lengths. Often this space requirement is more limiting than the time requirement. We describe a dynamic-programming local-similarity algorithm that needs only space proportional to the sum of the sequence lengths. The method can also find repeats within a single long sequence. To illustrate the algorithm's potential, we discuss comparison of a 73 360 nucleotide sequence containing the human β-like globin gene cluster and a corresponding 44 594 nucleotide sequence for rabbit, a problem well beyond the capabilities of other dynamic-programming software. Received on January 29, 1990; accepted on May 30, 1990
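The space saving rests on the observation that each row of the dynamic-programming matrix depends only on the row before it. The sketch below illustrates just that observation: it computes the best local-similarity score and where it ends while keeping two rows in memory. It is not the algorithm of the paper, which also recovers the alignments themselves in linear space. The function name `local_score` and the scoring values (match 1, mismatch -1, linear gap -2) are assumptions for the example.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local-similarity score and its end coordinates (1-based).

    Only the previous and current rows are stored, so memory grows
    with len(b) rather than with len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best, best_end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_end = curr[j], (i, j)
        prev = curr
    return best, best_end


if __name__ == "__main__":
    print(local_score("TTACGTT", "GGACGGG"))
```

Running time is still proportional to the product of the lengths; only the memory footprint shrinks.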
