Similar articles
Found 20 similar articles (search time: 15 ms)
1.
The information capacity of nucleotide sequences is defined through the specific entropy of their frequency dictionary. The specific entropy of the frequency dictionary is calculated against the reconstructed dictionary, which bears the most probable continuations of the shorter strings. This measure distinguishes sequences both from random ones and from those with a high level of (rather simple) order. Some implications of the methodology for genetics, bioinformatics, and molecular biology are discussed.
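
The abstract does not state the formula, so the following is only a minimal sketch under stated assumptions: the frequency dictionary is taken over a circularly closed sequence, the reconstructed dictionary estimates the frequency of each word of length q from its two overlapping subwords of length q-1 (divided by the frequency of their common part of length q-2), and the specific entropy is the relative entropy of the real dictionary against the reconstructed one. The function names are hypothetical.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Relative frequencies of all overlapping words of length q.

    Assumption: the sequence is closed into a ring, so every position
    starts exactly one word.
    """
    n = len(seq)
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def specific_entropy(seq, q):
    """Relative entropy of the real q-word dictionary against the one
    reconstructed from shorter words (assumed form, q >= 2)."""
    real = frequency_dictionary(seq, q)
    shorter = frequency_dictionary(seq, q - 1)
    overlap = frequency_dictionary(seq, q - 2) if q > 2 else None
    total = 0.0
    for word, f in real.items():
        common = overlap[word[1:-1]] if overlap else 1.0
        # Most probable continuation estimate built from the shorter words.
        expected = shorter[word[:-1]] * shorter[word[1:]] / common
        total += f * log(f / expected)
    return total
```

Under these assumptions a strictly periodic sequence such as "ACGT" repeated many times has zero specific entropy at q = 3, because every triplet is fully predicted by its pairs, while at q = 2 the value is positive.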

2.
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another dictionary containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. Order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the methodology to genetics, bioinformatics, and molecular biology are discussed.

3.
The classification of 16S RNA sequences by their frequency dictionaries, both real and transformed, was studied. Two entities were considered structurally close to each other if their frequency dictionaries were close in the Euclidean metric. A transformation procedure for a frequency dictionary was implemented that reveals the peculiarities of the information structure of a nucleotide sequence. A comparative study was carried out of two classifications, one developed over the real frequency dictionaries and the other over the transformed frequency dictionaries. A strong correlation was revealed between the classification and the taxonomy of the 16S RNA bearers. For the classes isolated, the information-valuable words were identified. These words are the main factors that differentiate the classes. The frequency dictionaries containing words of length 3 exhibit the best correlation between a class and a genus. A genus, as a rule, falls within a single class, and the exceptions are sporadic. A hierarchical classification developed over the transformed frequency dictionaries separated one or two taxonomic groups at each stage of classification. The unexpectedly frequent or, conversely, unexpectedly rare occurrence of words (of length 3) in the entities under consideration accounts for the structural difference between the classes of nucleotide sequences.
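
The closeness criterion described above can be sketched as follows. This is an illustration only: it covers the real (untransformed) dictionaries of words of length 3, treats a word missing from a dictionary as having frequency zero, and uses hypothetical function names; the transformation procedure is not specified in the abstract and is not reproduced here.

```python
from collections import Counter
from math import sqrt


def triplet_frequencies(seq):
    """Frequency dictionary of overlapping words of length 3."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean_distance(dict_a, dict_b):
    """Euclidean distance between two frequency dictionaries.

    Words absent from one dictionary count as frequency 0 there.
    """
    words = set(dict_a) | set(dict_b)
    return sqrt(sum((dict_a.get(w, 0.0) - dict_b.get(w, 0.0)) ** 2
                    for w in words))
```

Two sequences would then be assigned to the same class when this distance is small, for example by feeding the pairwise distances into any standard clustering routine.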

4.
A survey of multiple sequence comparison methods   Cited by: 7 (self-citations: 0; citations by others: 7)
Multiple sequence comparison refers to the search for similarity in three or more sequences. This article presents a survey of the exhaustive (optimal) and heuristic (possibly sub-optimal) methods developed for the comparison of multiple macromolecular sequences. Emphasis is given to the different approaches of the heuristic methods. Four distance measures derived from information engineering and genetic studies are introduced for the comparison between two alignments of sequences. The use of entropy, which plays a central role in information theory as a measure of information, choice and uncertainty, is proposed as a simple measure for evaluating the optimality of an alignment in the absence of any a priori knowledge about the structures of the sequences being compared. This article also gives two examples of comparison between alternative alignments of the same set of 5S RNAs as obtained by several different heuristic methods.
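
One plausible reading of the entropy criterion is a sum of per-column Shannon entropies, where a lower total indicates a more consistent alignment. The sketch below assumes that reading; the survey's exact definition is not given in the abstract, the gap character is simply treated as one more symbol, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def alignment_entropy(rows):
    """Sum of column entropies over an alignment given as equal-length rows."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(column_entropy(col) for col in zip(*rows))
```

Given two alternative alignments of the same sequences, the one with the smaller `alignment_entropy` would be preferred under this criterion.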

5.
MR fingerprinting (MRF) is an innovative approach to quantitative MRI. A typical disadvantage of dictionary-based MRF is the explosive growth of the dictionary as a function of the number of reconstructed parameters, an instance of the curse of dimensionality, which causes an explosion of resource requirements. In this work, we describe a deep learning approach for MRF parameter map reconstruction using a fully connected architecture. Employing simulations, we have investigated how the performance of the neural network (NN) approach scales with the number of parameters to be retrieved, compared to the standard dictionary approach. We have also studied optimal training procedures by comparing different strategies for noise addition and parameter space sampling, to achieve better accuracy and robustness to noise. Four MRF sequences were considered: IR-FISP, bSSFP, IR-FISP-B1, and IR-bSSFP-B1. A comparison between the NN and the dictionary approaches in reconstructing parameter maps as a function of the number of parameters to be retrieved was performed using a numerical brain phantom. Results demonstrated that training with random sampling and different levels of noise variance yielded the best performance. NN performance was at least as good as the dictionary-based approach in reconstructing parameter maps using Gaussian noise as a source of artifacts: the difference in performance increased with the number of estimated parameters because the dictionary method suffers from the coarse resolution of the parameter space sampling. The NN proved to be more efficient in memory usage and computational burden, and has great potential for solving large-scale MRF problems.

6.
SUMMARY: The program tuple_plot identifies and visualizes local similarities between two genomic sequences, typically 100 kb or longer, by applying the well-known dotplot principle. A dictionary of sequence words built from the input sequences serves to construct a task-specific expectancy model that is used to attribute significance values to pairwise word hits. The dictionary-based approach allows fast computation, with computation time scaling as O(N log N) in the size N of the input sequences. The proposed scoring scheme appreciably increases the signal-to-noise ratio and may help to improve other word-based sequence comparison approaches. AVAILABILITY: tuple_plot is available at http://genome.fli-leibniz.de/software.html and may be used under GNU public license.
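
The dotplot principle with a word dictionary can be sketched as below. This is not tuple_plot's code: it only shows how a dictionary of words from one sequence yields the pairwise word hits that a dotplot displays, the hash-based index is a choice made here, and the expectancy model and significance scoring described in the abstract are not reproduced. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def word_hits(seq_a, seq_b, k):
    """Return (i, j) pairs where the k-word at seq_a[i] equals the k-word at seq_b[j]."""
    # Dictionary of words: each k-word of seq_a maps to its start positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)

    hits = []
    for j in range(len(seq_b) - k + 1):
        # .get avoids creating empty entries for words absent from seq_a.
        for i in index.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    return hits
```

Each returned pair is one dot in the plot; runs of dots along a diagonal correspond to locally similar regions.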

7.
A novel method for predicting the secondary structures of proteins from amino acid sequence is presented. Protein secondary structure seqlets, which are analogous to the words in natural language, are extracted. These seqlets capture the relationship between amino acid sequence and the secondary structures of proteins and together form the protein secondary structure dictionary. More specifically, the dictionary is organism-specific. Protein secondary structure prediction is formulated as an integrated word segmentation and part-of-speech tagging problem. A word lattice is used to represent the results of the word segmentation, and a maximum entropy model is used to calculate the probability of a seqlet being tagged as a certain secondary structure type. The method is Markovian in the seqlets, permitting efficient exact calculation of the posterior probability distribution over all possible word segmentations and their tags by the Viterbi algorithm. The optimal segmentations and their tags are computed as the results of protein secondary structure prediction. The method is applied to predict the secondary structures of proteins of four organisms and compared with the PHD method. The results show that the performance of this method is higher than that of PHD by about 3.9% in Q3 accuracy and 4.6% in SOV accuracy. Combining it with locally similar protein sequences obtained by BLAST can give better predictions. The method is also tested on the 50 CASP5 target proteins, with Q3 accuracy of 78.9% and SOV accuracy of 77.1%. A web server for protein secondary structure prediction has been constructed, which is available at http://www.insun.hit.edu.cn:81/demos/biology/index.html.

8.
The ratios between frequency components of evoked otoacoustic emissions (OAE) were investigated for 100 ears. The signals were decomposed by means of an adaptive approximation method into basic waveforms coming from a very large and redundant dictionary of Gabor functions. The high time-frequency resolution of the method and the parametric representation of the waveforms allowed for an estimation of the frequency ratios of the basic components. A repetitive occurrence of the “fifths”, “fourths” and octaves connected with the Pythagorean temperament was found. The octaves containing “fifths” were identified. Sequences of this kind in OAE tend to appear in the same form for tonal stimulations of different frequencies and for broadband stimuli. The significance of the results was confirmed by comparison to Monte Carlo simulations of the null hypothesis of random distribution of frequency modes. These findings support the resonance theory of hearing, which links musical ratios with the geometrical spacing of outer hair cells in the cochlea.

9.
Douglas L. Vizard, Biopolymers, 1978, 17(9): 2057-2082
The method of DNA partial denaturation and intramolecular renaturation (in the absence of bimolecular reassociation) is developed analytically and presented as a means by which the supraorganization of the DNA sequences within large complex genomes may be studied. This analysis provides for the comparison of the actual organization of DNA sequences with a random arrangement of the same sequences. The sequence organization of the E. coli genome does not appear to be very different from DNA sequences arranged along the genome without preference to sequence stabilities, whereas an orderly physical arrangement of DNA sequences is implicated for the human genome.

10.
Most gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a large number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are its non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful for further analyses of the genomic sequences themselves. The proposed approach has been applied to some complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform only an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibilities of the suggested approach.

11.
FORRepeats: detects repeats on entire chromosomes and between genomes   Cited by: 1 (self-citations: 0; citations by others: 1)
MOTIVATION: As more and more whole genomes become available, there is a need for new methods to compare large sequences and transfer biological knowledge from annotated genomes to related new ones. BLAST is not suitable for comparing multimegabase DNA sequences. MegaBLAST is designed to compare closely related large sequences. Some tools to detect repeats in large sequences, such as MUMmer and REPuter, have already been developed, but they also have time or space restrictions. Moreover, in terms of applications, REPuter only computes repeats and MUMmer works better with related genomes. RESULTS: We present a heuristic method, named FORRepeats, which is based on a novel data structure called the factor oracle. In the first step it detects exact repeats in large sequences. Then, in the second step, it computes approximate repeats and performs pairwise comparison. We compared its computational characteristics with BLAST and REPuter. Results demonstrate that it is fast and space economical. We show FORRepeats' ability to perform intra-genomic comparison and to detect repeated DNA sequences in the complete genome of the model plant Arabidopsis thaliana.

12.

Background  

The BLAST algorithm compares biological sequences to one another in order to determine shared motifs and common ancestry. However, the comparison of all non-redundant (NR) sequences against all other NR sequences is a computationally intensive task. We developed NBLAST as a cluster computer implementation of the BLAST family of sequence comparison programs for the purpose of generating pre-computed BLAST alignments and neighbour lists of NR sequences.

13.
Annotation of genes and their products is largely guided by inferred homology. Sequence similarity is the primary measure used for annotation; domain content and order have been given less importance, despite the fact that domain insertions, deletions, and positional changes can bring about functional variety. Several recently developed methods quantify domain architecture similarity, but they depend on alignments of the sequences and focus only on homologous proteins. We present an alignment-free domain architecture similarity search (ADASS) algorithm that identifies proteins that share very poor sequence similarity yet have similar domain architectures. We introduce a “singlet matching-triplet comparison” method in ADASS, wherein each triplet of domains is compared with other triplets in a pairwise comparison of two domain architectures. Different events in the triplet comparison are scored according to a scoring scheme, and an average pairwise distance score (Domain Architecture Distance score, DAD score) is calculated between protein domain architectures. We take the domain architectures containing a selected domain, termed the centric domain, and cluster them based on the DAD score. The algorithm has a high positive predictive value (PPV) with respect to the clustering of the sequences of selected domain architectures. A comparison of domain-architecture-based dendrograms produced by ADASS and by an existing method revealed that ADASS can classify proteins according to the extent of domain-architecture-level similarity. ADASS is most relevant for proteins with tiny domains that contribute little to the overall sequence similarity but significantly to the overall function.

14.
Protein structure alignment   Cited by: 22 (self-citations: 0; citations by others: 22)
A new method of comparing protein structures is described, based on distance plot analysis. It is relatively insensitive to insertions and deletions in sequence and is tolerant of the displacement of equivalent substructures between the two molecules being compared. When presented with the co-ordinate sets of two structures, the method will automatically produce an alignment of their sequences based on structural criteria. The method uses the dynamic programming optimization technique, which is widely used in the comparison of protein sequences, and thus unifies the techniques of protein structure and sequence comparison. Typical structure comparison problems were examined and the results of the new method were compared with the published results obtained using conventional methods. In most examples, the new method produced a result that was equivalent, and in some cases superior, to those reported in the literature.

15.
The flexibility of the polypeptide fold of proteins is essentially due to the rotational freedom about the main chain bonds involving C alpha atoms. The polypeptide fold can therefore be represented by virtual bonds joining consecutive C alpha atoms. The ordered sequence of virtual torsion and bond angles involving these bonds can be used to specify the fold. Such representations can then be compared to reveal structural similarities using the Needleman-Wunsch algorithm, which was developed for the comparison of amino acid sequences. Such an approach is presented and illustrated with examples. The method is suitable for detecting structural similarities that extend over 7 or more residues.
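
The idea of running a Needleman-Wunsch global alignment over sequences of angles rather than residues can be sketched as follows. The similarity function, the tolerance `tol`, and the gap penalty are assumptions chosen for illustration; the abstract does not specify the scoring actually used, and only the torsion angles (in degrees) are considered here.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nw_score(xs, ys, gap=-1.0, tol=30.0):
    """Needleman-Wunsch global alignment score of two angle sequences.

    Assumed similarity: 1 for identical angles, decreasing linearly and
    turning negative once the angles differ by more than `tol` degrees.
    """
    prev = [j * gap for j in range(len(ys) + 1)]
    for i, x in enumerate(xs, 1):
        curr = [i * gap]
        for j, y in enumerate(ys, 1):
            match = prev[j - 1] + 1.0 - angle_diff(x, y) / tol
            curr.append(max(match, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Because angles wrap around, -170 and 190 degrees are treated as identical, so two folds described with different angle conventions still align perfectly.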

16.
We have developed a pattern comparative method for identifying functionally important motifs in protein sequences. The essence of most standard pattern comparative methods is a comparison of patterns occurring in different sequences using an optimized weight matrix. In contrast, our approach is based on a measure of similarity among all the candidate motifs within the same sequence. This method may prove to be particularly efficient for proteins encoding the same biochemical function, but with different primary sequences, and when tertiary structure information from one or more sequences is available. We have applied this method to a special class of zinc-binding enzymes known as endopeptidases.

17.
1 Introduction The prediction of protein structure and function from amino acid sequences is one of the most important problems in molecular biology. This problem is becoming more pressing as the number of known protein sequences explodes as a result of genome and other sequencing projects, and the protein sequence-structure gap is widening rapidly[1]. Therefore, computational tools to predict protein structures are needed to narrow the widening gap. Although the prediction of three dim…

18.
This article presents a new method for the comparison of multiple macromolecular sequences. It is based on a hierarchical sequence synthesis procedure that does not require any a priori knowledge of the molecular structure of the sequences or the phylogenetic relations among the sequences. It differs from existing methods in that it is capable of: (i) generating a statistical-structural model of the sequences through a synthesis process that detects homologous groups of the sequences, and (ii) aligning the sequences while the taxonomic tree of the sequences is being constructed, in one single phase. It produces superior results when compared with some existing methods.

19.
Digital signal processing methods for biosequence comparison.   Cited by: 1 (self-citations: 1; citations by others: 0)
A method is discussed for DNA or protein sequence comparison using a finite field fast Fourier transform, a digital signal processing technique, and statistical methods are discussed for analyzing the output of this algorithm. This method compares two sequences of length N in computing time proportional to N log N, compared to N² for methods currently used. This method makes it feasible to compare very long sequences. An example is given to show that the method correctly identifies sites of known homology.
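
The N log N comparison can be illustrated with a number-theoretic transform, which is a fast Fourier transform over a finite field. The sketch below counts, for every relative shift of two sequences, how many positions carry the same letter. It is an assumption-laden illustration rather than the article's algorithm: the prime 998244353 with primitive root 3 is a choice made here, the per-letter indicator encoding is assumed, and the statistical analysis of the output is not reproduced.

```python
MOD = 998244353  # prime equal to 119 * 2**23 + 1, so power-of-two transforms exist
ROOT = 3         # primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative number-theoretic transform; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    j = 0
    for i in range(1, n):  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u, v = a[k], a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def match_counts(s, t, alphabet="ACGT"):
    """Map each shift d to the number of positions i with s[i] == t[i + d]."""
    size = 1
    while size < len(s) + len(t) - 1:
        size <<= 1
    total = [0] * size
    for letter in alphabet:
        # Reversing s turns the cross-correlation into a convolution.
        a = [int(ch == letter) for ch in reversed(s)] + [0] * (size - len(s))
        b = [int(ch == letter) for ch in t] + [0] * (size - len(t))
        product = [x * y % MOD for x, y in zip(ntt(a), ntt(b))]
        total = [x + y for x, y in zip(total, ntt(product, invert=True))]
    return {m - (len(s) - 1): total[m] for m in range(len(s) + len(t) - 1)}
```

For example, `match_counts("GATTACA", "TTAC")[-2]` counts the matches when "TTAC" is placed under positions 2 to 5 of "GATTACA". All arithmetic is exact because the counts are far smaller than the modulus.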

20.
In this paper, we present a novel fractal coding method with a block classification scheme based on a shared domain block pool. In our method, the domain block pool is called the dictionary and is constructed from fractal Julia sets. The image is encoded by searching the dictionary for the best-matching domain block with the same BTC (Block Truncation Coding) value. The experimental results show that the scheme is competitive in both encoding speed and reconstruction quality. Particularly for large images, the proposed method can avoid the excessive growth of computational complexity seen with the traditional fractal coding algorithm.
