Similar articles
20 similar articles found (search time: 22 ms)
1.
A new approach to searching for common patterns in many sequences is presented. The idea is that one sequence from the set of sequences to be compared is treated as a ‘basic’ one, and all its similarities with the other sequences are found. Multiple similarities are then reconstructed from these data. This approach allows one to search for similar segments that can differ in both substitutions and deletions/insertions. These segments can be situated at different positions in the various sequences. No regions of complete or strong similarity within the segments are required. The other parts of the sequences can have no similarity at all. The only requirement is that the similar segments can be found in all the sequences (or in the majority of them, given that the common segments are present in the basic sequence). The running time of the algorithm presented is proportional to n·L² when n sequences of length L are analyzed. The algorithm is implemented as programs for the IBM-PC and IBM/370. Its applications to the analysis of biopolymer primary structures, as well as the dependence of the results on the choice of basic sequence, are discussed.

2.
A new measure of subalignment similarity is introduced. Specifically, similarity s(l, c) is defined as the logarithm to the base p of the probability of finding c or fewer mismatches in a subalignment of length l, where p is the probability of a match. Previous algorithms cannot use this measure to find locally optimal subalignments because, unlike Needleman-Wunsch and Sellers similarities, this measure is nonlinear. A new pattern recognition algorithm is described for finding all locally optimal subalignments of two nucleotide sequences. The DD algorithm can use s(l, c) or any other reasonable similarity function to assess the relative interest of subalignments. The DD algorithm searches only the diagonal graph, which lacks insertions and deletions. This search strategy greatly decreases the computation time and does not require an arbitrary choice of gap cost. The paths of the resulting DD graph usually draw attention to likely locations for insertions and deletions. A heuristic formula is derived for estimating significance levels for s(l, c) in the context of the lengths of the two aligned sequences. The DD algorithm has been used to find interesting subalignments between the nucleotide sequences for human and murine interleukin 2.
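The definition of s(l, c) above can be illustrated with a minimal sketch. It assumes that aligned positions match independently with probability p, so that the mismatch count is binomial; the abstract does not state this model, and the function name is hypothetical.

```python
from math import comb, log


def subalignment_similarity(l: int, c: int, p: float) -> float:
    """s(l, c): log base p of P(c or fewer mismatches in l aligned positions).

    Assumption (not stated in the abstract): positions are independent,
    each matching with probability p, so mismatches are binomial.
    """
    q = 1.0 - p  # per-position mismatch probability
    prob = sum(comb(l, k) * q**k * p**(l - k) for k in range(c + 1))
    return log(prob) / log(p)  # change of base: log_p(prob)
```

Under this assumption a perfect run of length l scores exactly l (the probability is p^l), and the score falls toward 0 as more mismatches are allowed, which is one way to see why the measure is nonlinear in l and c.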

3.
MOTIVATION: As a first approximation, similarity between two long orthologous regions of genomes can be represented by a chain of local similarities. Within such a chain, pairs of successive similarities are collinear (non-conflicting), i.e. segments involved in the nth similarity precede in both sequences segments involved in the (n+1)th similarity. However, when all similarities between two long sequences are considered, usually there are many conflicts between them. Although some conflicts can be avoided by masking transposons or low-complexity sequences, selecting only those similarities that reflect orthology and, thus, belong to the evolutionarily true chain is not trivial. RESULTS: We propose a simple, hierarchical algorithm for finding the true chain of local similarities. Starting from similarities with low P-values, we resolve each pairwise conflict by deleting the similarity with the higher P-value. This greedy approach constructs a chain of similarities faster than when a chain optimal with respect to some global criterion is sought, and makes more sense biologically.
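The greedy conflict resolution described above can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: the `Similarity` record, the half-open coordinate convention, and the function names are all assumptions made for illustration.

```python
from typing import NamedTuple


class Similarity(NamedTuple):
    # Hypothetical record: half-open intervals [start, end) in sequences A and B.
    start_a: int
    end_a: int
    start_b: int
    end_b: int
    p_value: float


def collinear(x: Similarity, y: Similarity) -> bool:
    """True if one similarity precedes the other in both sequences."""
    first, second = sorted((x, y), key=lambda s: s.start_a)
    return first.end_a <= second.start_a and first.end_b <= second.start_b


def greedy_chain(similarities: list[Similarity]) -> list[Similarity]:
    """Visit similarities from lowest to highest P-value; keep one only if it
    conflicts with nothing already kept, so each conflict loses its weaker member."""
    chain: list[Similarity] = []
    for s in sorted(similarities, key=lambda s: s.p_value):
        if all(collinear(s, kept) for kept in chain):
            chain.append(s)
    return sorted(chain, key=lambda s: s.start_a)
```

Processing in P-value order means a weak similarity can never displace a stronger one, which is the sense in which the approach is greedy rather than globally optimal.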

4.
A finite conflict with a given payoff matrix may have many ESSs (evolutionarily stable strategies). For a given set of pure strategies {1, 2, ..., n}, a set of subsets of these is called a pattern, and if there exists an n × n matrix which has ESSs whose supports (i.e. the playable strategies) precisely match the elements of the pattern, then the pattern is said to be attainable. In [5] and [10] some methods were developed to specify when a pattern was, or was not, attainable. The object here is to present a somewhat different method which is essentially recursive. We derive certain results which allow one to deduce from the attainability of a pattern for given n the attainability of other patterns for n+1, and by induction for any n+r.

5.
SimShift: identifying structural similarities from NMR chemical shifts   (cited 3 times: 0 self-citations, 3 by others)
MOTIVATION: An important quantity that arises in NMR spectroscopy experiments is the chemical shift. The interpretation of these data is mostly done by human experts; to our knowledge there are no algorithms that predict protein structure from chemical shift sequences alone. One approach to facilitate this process could be to compare two such sequences, where the structure of one protein has already been resolved. Our claim is that similarity of chemical shifts thereby found implies structural similarity of the respective proteins. RESULTS: We present an algorithm to identify structural similarities of proteins by aligning their associated chemical shift sequences. To evaluate the correctness of our predictions, we propose a benchmark set of protein pairs that have high structural similarity, but low sequence similarity (because with high sequence similarity the structural similarities could easily be detected by a sequence alignment algorithm). We compare our results with those of HHsearch and SSEA and show that our method outperforms both in >50% of all cases.

6.
We developed a new method which searches for sequence segments responsible for the recognition of a given chemical structure. These segments are detected as those locally conserved among a sequence to be analyzed (target sequence) and a set of sequences (reference sequences). Reference sequences are the sequences of functionally related proteins whose ligands contain a common chemical substructure in their molecular structures. 'Similarity graphing' cuts target sequences into segments, aligns them pairwise with the reference sequences, calculates the degree of similarity for each alignment, and graphically shows cumulative similarity values along the target sequence. Any locally conserved regions, short or long in length and weak or strong in similarity, are detected at their optimal conditions by adjusting three parameters. The 'enzyme-reaction database' contains chemical structures and their related enzymes. When a chemical substructure is input into the database, sequences of the enzymes related to the input substructure are systematically searched from the NBRF sequence database and output as reference sequences. Examples of analysis using similarity graphing in combination with the enzyme-reaction database showed great potential for the systematic analysis of the relationships between sequences and molecular recognition in protein engineering.

7.
Given a set S of n locally aligned sequences, it is a necessary prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n³ + out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n² log n + out). Furthermore, we present another algorithm which is much easier to implement and, though theoretically slower than the first one, still runs in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and we show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets.

8.
Using chaos game representation (CGR) we introduce a novel and straightforward method for identifying similarities/dissimilarities between DNA sequences of the same type, from different organisms. A matrix is associated with each CGR pattern, and the similarities result from the comparison between the matrices of the sequences of interest. Three different methods of analysis of the resulting difference matrix are considered: a 3-dimensional representation giving both local and global information, a numerical characterization by defining an n-letter word similarity measure, and a statistical evaluation. The method is illustrated by application to the study of albumin nucleotide sequences from eight mammal species, taking human albumin as the reference.
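As a rough illustration of associating a matrix with a CGR pattern, the sketch below plots each base as the midpoint between the previous point and that base's corner of the unit square, then counts points in a 2^k × 2^k grid. The corner assignment, the grid resolution k, and the function names are assumptions; the abstract does not specify how its matrices are built.

```python
# Assumed corner layout (a common convention, not taken from the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_matrix(seq: str, k: int = 3) -> list[list[int]]:
    """Count CGR points in a 2^k x 2^k grid; each cell corresponds to one k-letter word."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5  # start at the centre of the square
    for i, base in enumerate(seq.upper()):
        cx, cy = CORNERS[base]  # raises KeyError on non-ACGT symbols
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= k - 1:  # the cell is determined only once k bases have been read
            row = min(int(y * size), size - 1)
            col = min(int(x * size), size - 1)
            grid[row][col] += 1
    return grid


def difference_matrix(m1: list[list[int]], m2: list[list[int]]) -> list[list[int]]:
    """Element-wise difference between two CGR count matrices of equal size."""
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
```

For sequences of different lengths the counts would normally be converted to frequencies before subtraction; that normalization is left out here for brevity.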

9.
Germline repertoire of the immunoglobulin VH3 family in rhesus monkeys   (cited 2 times: 2 self-citations, 0 by others)
To facilitate molecular studies of antibody responses in rhesus monkeys (Macaca mulatta), we cloned and sequenced germline segments from its largest and most diverse immunoglobulin heavy-chain gene family, VH3. Using a PCR-based approach, we characterized 29 sequences, 20 with open reading frames (ORFs) and 9 pseudogenes. The leader sequences, introns, exons, and recombination signal sequences of M. mulatta VH3 gene segments are not strictly identical to those of humans, but the mature coding regions demonstrate, on average, greater than 90% sequence similarity. Although the framework regions are more highly conserved, the complementarity-determining regions (CDRs) also show strong similarities, and their predicted three-dimensional structures resemble those of their human homologues. In one instance, homologous macaque and human CDR1 sequences were 100% identical at the nucleotide level, and some CDR2s shared nucleotide identity as high as 96.5%. However, some rhesus VH3 ORFs have unusual structural features, including atypical CDR lengths and uncommon amino acids at structurally crucial positions. The similarity of rhesus and human VH3 homologues reinforces the notion that humoral immunity in this nonhuman primate species is an appropriate system for modeling human antibody responses. Received: 10 August 1999 / Revised: 30 December 1999

10.
The publication of the crystallographic structure of calmodulin protein has offered an example leading us to believe that it is possible for many protein sequence segments to exhibit multiple 3D structures; such segments are referred to as multi-structural segments. To this end, this paper presents a statistical analysis of the uniqueness of the 3D structure of all possible protein sequence segments stored in the Protein Data Bank (PDB, January 2003, release 103) that occur at least twice and whose lengths are greater than 10 amino acids (AAs). We refined the set of segments by choosing only those that are not parts of longer segments, which resulted in 9297 segments called a sponge set. By adding 8197 signature segments, which occur uniquely in the PDB, to the sponge set we have generated a benchmark set. Statistical analysis of the sponge set demonstrates that the rotating, missing and disarranging operations described in the text result in the segments becoming multi-structural. It turns out that missing segments do not exhibit a change of shape in the 3D structure of a multi-structural segment. We use the root mean square distance for unit vector sequence (URMSD) as an improved measure to describe the characteristics of hinge rotations, missing, and disarranging segments. We estimated the rate of occurrence of rotating and disarranging segments in the sponge set and divided it by the number of sequences in the benchmark set; the result is less than 0.85%. Since two of the structure-changing operations concern a negligible number of segments and the third one is found not to have an impact on the structure, we conclude that the 3D structure of proteins is conserved statistically for more than 98% of the segments. At the same time, the remaining 2% of the sequences may pose problems for sequence-alignment-based structure prediction methods.
*Jishou Ruan's research was supported by the Liuhui Center for Applied Mathematics, the China-Canada exchange program administered by MITACS, and NSFC (10271061). #Ke Chen and Lukasz A. Kurgan's research was partially supported by NSERC Canada. Jack A. Tuszynski's research has been supported by MITACS, NSERC Canada and the Allard Foundation.

11.
Horse (Equus caballus) immunoglobulin mu chain-encoding (IgM) variable, joining, and constant gene segments were cloned and characterized. Nucleotide sequence analyses of 15 cDNA clones from a mesenteric lymph node library identified 7 unique variable gene segments, 5 separate joining segments, and a single constant region. Based on comparison with human sequences, horse variable segments could be grouped into either family 1 of immunoglobulin (Ig) clan I or family 4 of Ig clan II subclan IV. All horse sequences had a relatively conserved 16 base pair (bp) segment in framework 3 which was recognized with high specificity in polymerase chain reaction by a degenerate oligonucleotide primer. Horse complementarity determining regions (CDR) had considerable variability in predicted amino acid content and length but also included the presence of relatively conserved residues and several canonical sequences that may be necessary in formation of the β chain main structure and conformation of antigen-binding sites through interaction with light chain CDR. Sequence analysis of joining regions revealed the presence of nearly invariant 3′ regions similar to those found in human and mouse genes. A single horse IgM constant region comprising 1472 bp and encoding 451 residues was also identified. Direct comparison of the horse constant region predicted amino acid sequence with those from eleven other species revealed the presence of 53 invariant residues with particularly conserved sequences within the third and fourth exons. Phylogenetic analysis using a neighbor-joining algorithm showed closest similarity of the horse mu chain-encoding constant region gene to human and dog sequences. Together, these findings provide insights into the comparative biology of IgM as well as data for additional detailed studies of the horse immune system and investigation of immune-related diseases. Received: 14 October 1996 / Revised: 10 December 1996

12.
Adiposity is more prevalent among individuals with a predominance of small, dense low-density lipoprotein (LDL) (pattern B) particles than among those with larger LDL (pattern A). We tested for differences in resting energy expenditure (REE) and respiratory quotient (RQ) in overweight men with pattern A (n = 36) or pattern B (n = 60). Men consumed a standardized isoenergetic diet for 3 weeks after which a ~9 kg weight loss was induced by caloric deficit for 9 weeks, followed by 4 weeks of weight stabilization. REE and RQ were measured by indirect calorimetry before and after weight loss. Results were analyzed separately in pattern B men who converted to pattern A (B→A; n = 35) and those who did not (B→B; n = 25). At baseline, B→B men had higher trunk fat, triacylglycerol (TG) and insulin concentrations, homeostasis model assessment of insulin resistance (HOMA-IR), and smaller LDL particles compared to B→A men and baseline pattern A men who remained pattern A (A→A; n = 35). REE normalized to fat-free mass did not change after weight loss. RQ decreased in A→A men, increased in B→A men, and did not change significantly in B→B men after weight loss. Calculated fat oxidation rates paralleled the RQ results. Baseline plasma TG concentrations were positively correlated with RQ and inversely correlated with the magnitude of weight loss achieved for a given prescribed energy reduction in the entire study population. Pattern B men who converted to pattern A with weight loss may have an underlying impairment in fat oxidation that predisposes to both dyslipidemia and an impaired ability to achieve weight loss by energy restriction.

13.
Homologies based on structural motifs characterize conserved structures and mechanisms of maintaining function. An algorithm was developed to quantitate homology among segments of two proteins based upon structural characteristics of an amphipathic α-helix. This helical mimicry algorithm scored homology among sequences of two proteins in terms of: (i) presence of Leu, Ile, Val, Phe, or Met in a longitudinal, hydrophobic strip-of-helix at positions n, n + 4, n + 7, n + 11, etc. in the primary sequence, (ii) identity or chemical similarity of amino acids at intervening positions and (iii) exchanges of amino acids from positions n to n − 1, n + 3, n + 4, n + 1, n − 3, n − 4 around n (on the surface of a putative helix). While such exchanges of amino acids on the surfaces of homologous helices may conserve function, they did not maintain specific interactions of those residues with apposing groups.
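Criterion (i) above, the hydrophobic strip at positions n, n + 4, n + 7, n + 11, etc., can be sketched as a simple count. The alternating +4/+3 step is inferred from the listed positions, and the function names and scoring by plain counting are assumptions; criteria (ii) and (iii) are not covered.

```python
from itertools import cycle

# One-letter codes for Leu, Ile, Val, Phe, Met, as listed in the abstract.
HYDROPHOBIC = set("LIVFM")


def strip_positions(start: int, length: int):
    """Yield start, start+4, start+7, start+11, ... (alternating +4, +3) below length."""
    pos, steps = start, cycle((4, 3))
    while pos < length:
        yield pos
        pos += next(steps)


def strip_score(seq: str, start: int) -> int:
    """Count hydrophobic residues along the strip beginning at index start."""
    return sum(seq[i] in HYDROPHOBIC for i in strip_positions(start, len(seq)))
```

Scanning `strip_score` over every start index in a window would give a crude profile of where a putative amphipathic helix might begin.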

14.
A comparative analysis between human, mouse, and rabbit immunoglobulin (Ig) kappa-gene DNA sequences is presented. New formulas for determining the expected length and variance of the longest block identity (a succession of matching nucleotides) between multiple random sequences are given and are used to establish statistical criteria for ascertaining the significance of block identities shared in r out of s sequences. The statistically significant block identities within and between the Ig-kappa-gene sequences are ascertained, and alignment maps based on these similarities are constructed. The human and rabbit sequences (especially in the noncoding regions) and the human and mouse sequences (on the coding regions) show a similarity much stronger than that between the mouse and rabbit sequences. The existence of several highly significant shared oligonucleotides occurring in alignment with each other or with respect to the J- and C-gene segments suggests a configuration of multiple control sites. Discussion and interpretations of the form and distribution of the block identities are given.

15.

Background  

A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank ordered by similarity with respect to the query.

16.
The set of "expansion segments" of any eukaryotic 26S/28S ribosomal RNA (rRNA) gene is responsible for the bulk of the difference in length between the prokaryotic 23S rRNA gene and the eukaryotic 26S/28S rRNA gene. The expansion segments are also responsible for interspecific fluctuations in length during eukaryotic evolution. They show a consistent bias in base composition in any species; for example, they are AT rich in Drosophila melanogaster and GC rich in vertebrate species. Dot-matrix comparisons of sets of expansion segments reveal high similarities between members of a set within any 28S rRNA gene of a species, in contrast to the little or spurious similarity that exists between sets of expansion segments from distantly related species. Similarities among members of a set of expansion segments within any 28S rRNA gene cannot be accounted for by their base-compositional bias alone. In contrast, no significant similarity exists within a set of "core" segments (regions between expansion segments) of any 28S rRNA gene, although core segments are conserved between species. The set of expansion segments of a 26S/28S gene is coevolving as a unit in each species, at the same time as the family of 28S rRNA genes, as a whole, is undergoing continual homogenization, making all sets of expansion segments from all ribosomal DNA (rDNA) arrays in a species similar in sequence. Analysis of DNA simplicity of 26S/28S rRNA genes shows a direct correlation between significantly high relative simplicity factors (RSFs) and sequence similarity among a set of expansion segments. A similar correlation exists between RSF values, overall rDNA lengths, and the lengths of individual expansion segments. Such correlations suggest that most length fluctuations reflect the gain and loss of simple sequence motifs by slippage-like mechanisms. We discuss the molecular coevolution of expansion segments, which takes place against a background of slippage-like and unequal crossing-over mechanisms of turnover that are responsible for the accumulation of interspecific differences in rDNA sequences.

17.
Calculation of dot-matrices is a widespread tool in the search for sequence similarities. When sequences are distant, even this approach may fail to point out common regions. If several plots calculated for all members of a sequence set consistently displayed a similarity between them, this would increase its credibility. We present an algorithm to delineate dot-plot agreement. A novel procedure based on matrix multiplication is developed to identify common patterns and reliably aligned regions in a set of distantly related sequences. The algorithm finds motifs independent of input sequence lengths and reduces the dependence on gap penalties. When sequences share greater similarity, the same approach converts to a multiple sequence alignment procedure.

18.
Multiple sequence alignment   (cited 13 times: 0 self-citations, 13 by others)
A method has been developed for aligning segments of several sequences at once. The number of search steps depends only polynomially on the number of sequences, instead of exponentially, because most alignments are rejected without being evaluated explicitly. A data structure herein called the "heap" facilitates this process. For a set of n sequence segments, the overall similarity is taken to be the sum of all the constituent segment pair similarities, which are in turn sums of corresponding residue similarity scores from a table. The statistical models that test alignments for significance make it possible to group sequences objectively, even when most or all of the interrelationships are weak. These tests are very sensitive, while remaining quite conservative, and discourage the addition of "misfit" sequences to an existing set. The new techniques are applied to a set of five DNA-binding proteins, to a group of three enzymes that employ the coenzyme FAD, and to a control set. The alignment previously proposed for the DNA-binding proteins on the basis of structural comparisons and inspection of sequences is supported quite dramatically, and a highly significant alignment is found for the FAD-binding proteins.
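The overall similarity described above, a sum over all segment pairs of per-residue table scores, can be written down directly. The sketch assumes equal-length, ungapped segments and a score table held as a dictionary keyed by residue pairs; both are illustrative choices, and the heap-based search itself is not shown.

```python
from itertools import combinations


def pair_similarity(a: str, b: str, table: dict[tuple[str, str], float]) -> float:
    """Sum of table scores for corresponding residues of two equal-length segments."""
    return sum(table[x, y] for x, y in zip(a, b))


def overall_similarity(segments: list[str], table: dict[tuple[str, str], float]) -> float:
    """Sum of pair similarities over all unordered pairs of segments."""
    return sum(pair_similarity(a, b, table) for a, b in combinations(segments, 2))
```

With n segments there are n(n - 1)/2 pair terms, so evaluating one candidate alignment is cheap; the cost the abstract addresses lies in the number of candidate alignments.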

19.

Background  

Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories, based on the data structures they employ. The first class uses an overlap/string graph and the second uses a de Bruijn graph. With the recent advances in short-read sequencing technology, however, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm was given for this problem, where n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet).
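For orientation, a minimal single-threaded sketch of de Bruijn graph construction is given below. It is the textbook single-strand form, with (k-1)-mers as nodes and one edge per observed k-mer; it does not model the bi-directed edges, reverse complements, or parallel message passing discussed in the abstract, and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: defaultdict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)
```

Because every k-mer is inspected once, this construction is linear in the input size; the difficulty for short-read projects is the memory and communication needed when the graph is distributed across processors.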

20.
We discuss the statistical significance of local similarities found between DNA sequences, and illustrate the procedure with reference to the Queen and Korn algorithm. If the longest similarity found for two sequences has length L, this length is said to be significant at the 5% level if there is a probability of no more than 0.05 of finding a length of L or greater between a pair of sequences consisting of randomly chosen bases with the same overall base frequencies. The distribution of longest lengths is related to that of lengths from any particular pair of starting positions on the two sequences. For our implementation of the Queen and Korn algorithm, this latter distribution is constructed by combining the five different blocks of bases that may be added to extend a similarity. A table is given to assess the significance of longest similarities in sequences of length up to 1000 bases. Quite long similarities are expected to occur by chance alone. The critical values we calculate for assessing significance are preferable to expected numbers of similarities used by some commercial computer packages.
