1.
2.
3.
Background
Simple sequence repeats (SSRs), microsatellites or polymeric sequences are common in DNA and are important biologically. From mononucleotide to trinucleotide repeats and beyond, they can be found in long (>6 repeating units) tracts and may be characterized by quantifying the frequencies in which they are found and their tract lengths. However, most of the existing computer programs that find SSR tracts do not include these methods.
4.
ABSTRACT: BACKGROUND: The molecular recognition based on the complementary base pairing of deoxyribonucleic acid (DNA) is the fundamental principle in the fields of genetics, DNA nanotechnology and DNA computing. We present an exhaustive DNA sequence design algorithm that generates sets containing a maximum number of sequences with defined properties. EGNAS (Exhaustive Generation of Nucleic Acid Sequences) offers the possibility of controlling both interstrand and intrastrand properties. The guanine-cytosine content can be adjusted. Sequences can be forced to start and end with guanine or cytosine. This option reduces the risk of "fraying" of DNA strands. It is possible to limit cross hybridizations of a defined length, and to adjust the uniqueness of sequences. Self-complementarity and hairpin structures of certain length can be avoided. Sequences and subsequences can optionally be forbidden. Furthermore, sequences can be designed to have minimum interactions with predefined strands and neighboring sequences. RESULTS: The algorithm is realized in a C++ program. TAG sequences can be generated and combined with primers for single-base extension reactions, which were described for multiplexed genotyping of single nucleotide polymorphisms. Thereby, possible foldback through intrastrand interaction of TAG-primer pairs can be limited. The design of sequences for specific attachment of molecular constructs to DNA origami is presented. CONCLUSIONS: We developed a new software tool called EGNAS for the design of unique nucleic acid sequences. The presented exhaustive algorithm generates larger sets of sequences than previous software under equal constraints. EGNAS is freely available for noncommercial use at http://www.chm.tu-dresden.de/pc6/EGNAS.
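The kinds of constraints listed in this abstract (GC content, G/C at both ends, limited self-complementarity, forbidden subsequences) can be illustrated with a small filter. This is only a sketch under assumptions: the function names, thresholds and the window-based self-complementarity test are hypothetical and are not taken from EGNAS itself, which is a C++ program whose internals the abstract does not describe.

```python
# Hypothetical helper names and default thresholds; not the EGNAS implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def has_self_complementarity(seq, length):
    """True if some window of `length` bases also occurs in the reverse complement."""
    rc = reverse_complement(seq)
    windows = {seq[i:i + length] for i in range(len(seq) - length + 1)}
    return any(rc[i:i + length] in windows for i in range(len(rc) - length + 1))


def passes_filters(seq, gc_min=0.4, gc_max=0.6, self_comp_len=4, forbidden=()):
    """Apply the illustrative constraints to one candidate sequence."""
    if seq[0] not in "GC" or seq[-1] not in "GC":
        return False  # must start and end with G or C to reduce fraying
    if not gc_min <= gc_content(seq) <= gc_max:
        return False
    if has_self_complementarity(seq, self_comp_len):
        return False
    return not any(sub in seq for sub in forbidden)
```

For example, the palindromic site `GAATTC` is rejected because it equals its own reverse complement, while `GACTAC` passes the default thresholds assumed here.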
5.
6.
7.
We present a fast algorithm to produce a graphic matrix representation of sequence homology. The algorithm is based on lexicographical ordering of fragments. It preserves most of the options of a simple naive algorithm with a significant increase in speed. This algorithm was the basis for a program, called DNAMAT, that has been extensively tested during the last three years at the Weizmann Institute of Science and has proven to be very useful. In addition we suggest a way to extend our approach to analyse a series of related DNA or RNA sequences, in order to determine certain common structural features. The analysis is done by summing a set of dot-matrices to produce an overall matrix that displays structural elements common to most of the sequences. We give an example of this procedure by analysing tRNA sequences. Received on June 26, 1986; accepted on September 28, 1986
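The idea of finding dot-matrix points through lexicographically ordered fragments, rather than comparing every pair of positions, can be sketched as follows. The function name and the fixed fragment length `k` are assumptions made for illustration; the abstract does not give DNAMAT's actual data structures or options.

```python
from bisect import bisect_left, bisect_right


def dot_matrix_points(a, b, k):
    """Return sorted (i, j) pairs such that a[i:i+k] == b[j:j+k].

    Fragments of b are sorted lexicographically once; each fragment of a is
    then located by binary search instead of scanning all positions of b.
    """
    frags_b = sorted((b[j:j + k], j) for j in range(len(b) - k + 1))
    keys_b = [frag for frag, _ in frags_b]
    points = []
    for i in range(len(a) - k + 1):
        frag = a[i:i + k]
        lo = bisect_left(keys_b, frag)
        hi = bisect_right(keys_b, frag)
        points.extend((i, frags_b[idx][1]) for idx in range(lo, hi))
    return sorted(points)
```

With `dot_matrix_points("ACGT", "CGTA", 2)` the shared fragments `CG` and `GT` give the points `(1, 0)` and `(2, 1)`, which lie on one diagonal of the matrix.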
8.
Troyanskaya OG, Arbell O, Koren Y, Landau GM, Bolshoy A 《Bioinformatics (Oxford, England)》2002,18(5):679-688
MOTIVATION: One of the major features of genomic DNA sequences, distinguishing them from texts in most spoken or artificial languages, is their high repetitiveness. Variation in the repetitiveness of genomic texts reflects the presence and density of different biologically important messages. Thus, deviation from an expected number of repeats in both directions indicates a possible presence of a biological signal. Linguistic complexity corresponds to repetitiveness of a genomic text, and potential regulatory sites may be discovered through construction of typical patterns of complexity distribution. RESULTS: We developed software for fast calculation of linguistic sequence complexity of DNA sequences. Our program utilizes suffix trees to compute the number of subwords present in genomic sequences, thereby allowing calculation of linguistic complexity in time linear in genome size. The measure of linguistic complexity was applied to the complete genome of Haemophilus influenzae. Maps of complexity along the entire genome were obtained using sliding windows of 40, 100, and 2000 nucleotides. This approach provided an efficient way to detect simple sequence repeats in this genome. In addition, local profiles of complexity distribution around the starts of translation were constructed for 21 complete prokaryotic genomes. We hypothesize that complexity profiles correspond to evolutionary relationships between organisms. We found principal differences in profiles of the GC-rich and other (non-GC-rich) genomes. We also found characteristic differences in profiles of AT genomes, which probably reflect individual species variations in translational regulation. AVAILABILITY: The program is available upon request from Alexander Bolshoy or at http://csweb.haifa.ac.il/library/#complex.
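As a rough illustration of the measure, linguistic complexity can be taken as the number of distinct subwords in a window divided by the maximum number possible for a window of that length. The sketch below assumes that formulation, which is one common definition and may differ in detail from the one used by the authors. It also enumerates subwords naively with sets, which is quadratic in window length, whereas the program described in the abstract uses suffix trees to reach linear time.

```python
def linguistic_complexity(seq, alphabet_size=4):
    """Distinct subwords of seq divided by the maximum possible count.

    Assumes a non-empty sequence. Naive enumeration; not the suffix-tree
    method described in the abstract.
    """
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, n + 1):
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # At most alphabet_size**k distinct words, and at most n-k+1 positions.
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible


def sliding_complexity(seq, window, step=1):
    """Complexity of each window of the given length along seq."""
    return [
        linguistic_complexity(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A homopolymer window such as `AAAA` scores 0.4 under this definition and `ACGT` scores 1.0, so low values along a sliding-window map point to repetitive regions such as simple sequence repeats.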
9.
Genomic prediction using an iterative conditional expectation algorithm for a fast BayesC-like model
Genomic prediction is feasible for estimating genomic breeding values because of dense genome-wide markers and credible statistical methods, such as Genomic Best Linear Unbiased Prediction (GBLUP) and various Bayesian methods. Compared with GBLUP, Bayesian methods propose more flexible assumptions for the distributions of SNP effects. However, most Bayesian methods are performed based on Markov chain Monte Carlo (MCMC) algorithms, leading to computational efficiency challenges. Hence, some fast Bayesian approaches, such as fast BayesB (fBayesB), were proposed to speed up the calculation. This study proposed another fast Bayesian method termed fast BayesC (fBayesC). The prior distribution of fBayesC assumes that a SNP with probability γ has a non-zero effect which comes from a normal density with a common variance. The simulated data from QTLMAS XII workshop and actual data on large yellow croaker were used to compare the predictive results of fBayesB, fBayesC and (MCMC-based) BayesC. The results showed that when γ was set as a small value, such as 0.01 in the simulated data or 0.001 in the actual data, fBayesB and fBayesC yielded lower prediction accuracies (abilities) than BayesC. In the actual data, fBayesC could yield very similar predictive abilities as BayesC when γ ≥ 0.01. When γ = 0.01, fBayesB could also yield similar results as fBayesC and BayesC. However, fBayesB could not yield an explicit result when γ ≥ 0.1, but a similar situation was not observed for fBayesC. Moreover, the computational speed of fBayesC was significantly faster than that of BayesC, making fBayesC a promising method for genomic prediction.
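The prior stated in this abstract, in which each SNP has a non-zero effect with probability γ and non-zero effects share one normal variance, can be made concrete by drawing from it. The sketch below only samples from that prior; it is not the iterative conditional expectation algorithm of fBayesC, and the function and parameter names are hypothetical.

```python
import random


def sample_snp_effects(n_snps, gamma, sigma, rng):
    """Draw SNP effects from a mixture of a point mass at zero and N(0, sigma^2).

    gamma is the prior probability that a SNP has a non-zero effect;
    sigma is the common standard deviation of the non-zero effects.
    """
    return [
        rng.gauss(0.0, sigma) if rng.random() < gamma else 0.0
        for _ in range(n_snps)
    ]
```

With a small γ such as 0.01, almost all sampled effects are exactly zero, which is the sparse setting in which the abstract reports the fast methods doing worse than MCMC-based BayesC.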
10.
11.
Alphabet size and informational entropy, two formal measures of sequence complexity, are herein applied to two prior studies on the folding of minimal proteins. These measures show a designed four-helix bundle to be unlike its natural counterparts but rather more like a coiled-coil dimer. Segments from a simplified sarc homology 3 domain and more than 2000000 segments from globular proteins both have lower bounds for alphabet size of 10 and for entropy near 2.9. These values are therefore suggested to be necessary and sufficient for folding into globular proteins having both rigid side chain packing and biological function.
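Both measures named in this abstract are simple to compute for a sequence segment. The sketch below assumes that alphabet size means the number of distinct residue types in the segment and that entropy is Shannon entropy in bits (base 2); the abstract states neither the logarithm base nor the segment length, so both are assumptions.

```python
from collections import Counter
from math import log2


def alphabet_size(segment):
    """Number of distinct residue types present in the segment."""
    return len(set(segment))


def shannon_entropy(segment):
    """Shannon entropy of the residue composition, in bits (assumed base 2)."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

A segment made of one residue type has entropy 0, and a segment using four residue types equally often has entropy 2 bits.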
12.
The computer program PROFILEGRAPH, a graphical interactive tool for the analysis of amino acid sequences, is described. The main task of the program is to integrate a variety of sliding-window methods into a single user-friendly shell. The program allows the user to combine any amino acid specific parameter with a selection of several possible types of analysis and to plot the resulting graph in one of several windows on the screen. It is also possible to calculate the moment of the amino acid specific parameter for a given secondary structure and to display both the absolute moment value and the moment angle relative to a reference residue. Also included are several utilities that facilitate visual analysis of protein primary structures like, for example, helical-wheel diagrams. It is possible to adapt the majority of published sliding-window analysis procedures for use with PROFILEGRAPH.
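A sliding-window profile and a moment calculation of the kind described here can be sketched in a few lines. Everything named below is an assumption for illustration: `TOY_SCALE` is a made-up two-residue parameter table rather than a published scale, the window average is only one of many possible analysis types, and the 100 degrees per residue used for the moment is the usual idealized alpha-helix value, not something stated in the abstract.

```python
from math import atan2, cos, degrees, hypot, radians, sin

# Hypothetical per-residue parameter values, for illustration only.
TOY_SCALE = {"L": 1.0, "K": -1.0}


def sliding_window_profile(seq, scale, window):
    """Mean parameter value in each window of the given length along seq."""
    values = [scale[res] for res in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def moment(seq, scale, angle_deg=100.0):
    """Magnitude and angle (degrees) of the parameter moment.

    Residue i is placed at angle i * angle_deg around the structure axis,
    so the angle is reported relative to the first residue.
    """
    x = sum(scale[r] * cos(radians(i * angle_deg)) for i, r in enumerate(seq))
    y = sum(scale[r] * sin(radians(i * angle_deg)) for i, r in enumerate(seq))
    return hypot(x, y), degrees(atan2(y, x))
```

For instance, `sliding_window_profile("LLKK", TOY_SCALE, 2)` gives `[1.0, 0.0, -1.0]`, a profile that falls as the window moves from the `L` residues to the `K` residues.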
13.
Complexity in biology. Exceeding the limits of reductionism and determinism using complexity theory
Mazzocchi F 《EMBO reports》2008,9(1):10-14
14.
CASTOR: clustering algorithm for sequence taxonomical organization and relationships.
Given a set of related proteins, two important problems in biology are the inference of protein subsets such that members of one subset share a common function and the identification of protein regions that possess functional significance. The former is typically approached by hierarchical bottom-up clustering based on pairwise sequence similarity and various linkage rules. The latter is typically approached in a supervised manner, based on global multiple sequence alignment. However, the two problems are inextricably linked, since functional subsets are usually characterized by distinctive functional regions. This paper introduces CASTOR, an automatic and unsupervised system that addresses both problems simultaneously and efficiently. It identifies protein regions that are likely to have functional significance by discovering and refining statistically significant motifs. It infers likely functional protein subsets and their relationships based on the presence of the discovered motifs in a top-down and recursive manner, allowing the identification of both hierarchical and nonhierarchical subset relationships. This is, to our knowledge, the first system that approaches both problems simultaneously in a top-down, systematic manner. CASTOR's performance is evaluated against the G-protein coupled receptor superfamily. The identified protein regions lead to a taxonomical organization of this superfamily that is in remarkable agreement with a biologically motivated one and which outperforms those produced by bottom-up clustering methods. We also find that conventional hierarchical representations may fail to accurately describe the complexity of evolutionary development responsible for the final organization of a complex protein family. In particular, many functional relationships governing distant subfamilies of such a protein family may not be represented hierarchically.
15.
Burnett Leslie; Basten Antony; Hensley William J. 《Bioinformatics (Oxford, England)》1985,1(3):153-160
A new computer search strategy has been devised for high-resolution nucleotide sequence analysis. The strategy differs from those used by earlier sequence analysing programs in that it is exhaustive and capable of detecting all possible homologies and other types of relationships between or within sequences irrespective of the pattern of matches and mismatches encountered. The implementation of this strategy into a working algorithm is described. Received on March 1, 1985; accepted on April 24, 1985
16.
We describe a new approach to multiple sequence alignment using genetic algorithms and an associated software package called SAGA. The method involves evolving a population of alignments in a quasi evolutionary manner and gradually improving the fitness of the population as measured by an objective function which measures multiple alignment quality. SAGA uses an automatic scheduling scheme to control the usage of 22 different operators for combining alignments or mutating them between generations. When used to optimise the well known sums of pairs objective function, SAGA performs better than some of the widely used alternative packages. This is seen with respect to the ability to achieve an optimal solution and with regard to the accuracy of alignment by comparison with reference alignments based on sequences of known tertiary structure. The general attraction of the approach is the ability to optimise any objective function that one can invent.
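The sums-of-pairs objective mentioned in this abstract scores a multiple alignment by adding up the scores of all pairwise alignments it induces. A minimal column-wise version is sketched below. The match, mismatch and gap values and the linear gap treatment are placeholder assumptions; the abstract does not specify the substitution matrix, sequence weights or gap penalties that SAGA uses.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows.

    Each column contributes the score of every pair of symbols in it.
    A pair of two gaps scores zero; a residue against a gap scores `gap`.
    """
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score
```

A function of this shape can serve as the fitness measure in a genetic algorithm, since it takes any candidate alignment and returns a single number to maximize.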
17.
W Just 《Journal of computational biology》2001,8(6):615-623
It is shown that the multiple alignment problem with SP-score is NP-hard for each scoring matrix in a broad class M that includes most scoring matrices actually used in biological applications. The problem remains NP-hard even if sequences can only be shifted relative to each other and no internal gaps are allowed. It is also shown that there is a scoring matrix M(0) such that the multiple alignment problem for M(0) is MAX-SNP-hard, regardless of whether or not internal gaps are allowed.
18.
T G Dewey 《Journal of computational biology》2001,8(2):177-190
An algorithm for aligning biological sequences is presented that is an adaptation of the sequence generating function approach used in the statistical mechanics of biopolymers. This algorithm uses recursion relationships developed from a partition function formalism of alignment probabilities. It is implemented within a dynamic programming format that closely resembles the forward algorithm used in hidden Markov models (HMM). The algorithm aligns sequences or structures according to the statistically dominant alignment path and will be referred to as the SDP algorithm. An advantage of this method over previous ones is that it allows more complicated and physically realistic gap penalty functions to be incorporated into the algorithm in a facile manner. The performance of this algorithm in a case study of aligning the heavy and light chain from the variable region of an immunoglobulin is investigated.
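A forward-style recursion over alignments, in which every cell sums the weights of all alignment paths reaching it, can be sketched as below. This is a generic partition function over global alignments with a constant gap weight, written only to show the shape of such a dynamic program. It is not the SDP recursion of the paper, it does not extract the statistically dominant path, and the function and parameter names are hypothetical.

```python
def alignment_partition_function(a, b, pair_weight, gap_weight):
    """Sum of weights over all global alignments of a and b.

    pair_weight(x, y) is the statistical weight of aligning residues x and y;
    gap_weight is the weight of one gapped position. A path's weight is the
    product of its step weights, and the result sums over all paths.
    """
    rows, cols = len(a) + 1, len(b) + 1
    z = [[0.0] * cols for _ in range(rows)]
    z[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i > 0 and j > 0:
                z[i][j] += z[i - 1][j - 1] * pair_weight(a[i - 1], b[j - 1])
            if i > 0:
                z[i][j] += z[i - 1][j] * gap_weight
            if j > 0:
                z[i][j] += z[i][j - 1] * gap_weight
    return z[-1][-1]
```

Setting every weight to 1 makes the function count alignment paths, which gives 3 for two sequences of length 1 and 13 for two sequences of length 2.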
19.
Bolshoy A 《Applied bioinformatics》2003,2(2):103-112
This is a review of the methods based on counting oligomers in nucleotide and amino acid sequences. Such methods are analogous to the formal linguistic analysis of human texts. This review includes methods based on the calculation of observed occurrences (frequencies) of oligomers and their distribution, as well as those based on deviations between the observed and the expected occurrences (contrast words, genome signatures) in biological sequences. Both types of methods have a wide range of sensitivity and can identify homologous as well as functionally and taxonomically related sequences.
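Counting oligomers and comparing observed with expected occurrences can be illustrated as follows. The expectation used here assumes independent positions with the sequence's own single-letter frequencies, which is only one of several background models in use; the review summarized above covers others, and the function names are hypothetical.

```python
from collections import Counter
from math import prod


def kmer_counts(seq, k):
    """Observed counts of every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def observed_expected_ratio(seq, k):
    """Observed/expected ratio for each k-mer that occurs in seq.

    Expected counts assume independent positions with the sequence's own
    single-letter frequencies (a zero-order background model).
    """
    n = len(seq)
    letter_freq = {letter: count / n for letter, count in Counter(seq).items()}
    windows = n - k + 1
    ratios = {}
    for word, observed in kmer_counts(seq, k).items():
        expected = windows * prod(letter_freq[letter] for letter in word)
        ratios[word] = observed / expected
    return ratios
```

Ratios well above or below 1 mark over-represented and under-represented words, the kind of deviation that contrast-word and genome-signature methods build on.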
20.
BALSA: Bayesian algorithm for local sequence alignment
The Smith–Waterman algorithm yields a single alignment, which, albeit optimal, can be strongly affected by the choice of the scoring matrix and the gap penalties. Additionally, the scores obtained are dependent upon the lengths of the aligned sequences, requiring a post-analysis conversion. To overcome some of these shortcomings, we developed a Bayesian algorithm for local sequence alignment (BALSA), that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters and all possible alignments. The algorithm can return both the joint and the marginal optimal alignments, samples of alignments drawn from the posterior distribution and the posterior probabilities of gap penalties and scoring matrices. Furthermore, it automatically adjusts for variations in sequence lengths. BALSA was compared with SSEARCH, to date the best performing dynamic programming algorithm in the detection of structural neighbors. Using the SCOP databases PDB40D-B and PDB90D-B, BALSA detected 19.8 and 41.3% of remote homologs whereas SSEARCH detected 18.4 and 38%, respectively, at an error rate of 1% errors per query over the databases.
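For reference, the Smith–Waterman baseline that BALSA is contrasted with can be sketched as a score-only dynamic program. The match, mismatch and linear gap values below are arbitrary placeholders, and they are exactly the kind of fixed choices that the abstract says the Bayesian approach averages over. This sketch is not BALSA and does not reproduce SSEARCH.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    Keeps only the previous row, so memory is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

Changing `match`, `mismatch` or `gap` can change which local alignment is optimal, which illustrates the parameter sensitivity the abstract describes.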