Found 20 similar articles (search time: 0 ms)
1.
MOTIVATION: Pairwise alignment of protein sequences and local similarity searches produce many false positives because of compositionally biased regions, also called low-complexity regions (LCRs), of amino acid residues. Masking and filtering such regions significantly improves the reliability of homology searches and, consequently, functional predictions. Most of the available algorithms are based on a statistical approach. We wished to investigate the structural properties of LCRs in biological sequences and develop an algorithm for filtering them. RESULTS: We present an algorithm for detecting and masking LCRs in protein sequences to improve the quality of database searches. We developed the algorithm based on the complexity analysis of subsequences delimited by a pair of identical, repeating subsequences. Given a protein sequence, the algorithm first computes the suffix tree of the sequence. It then collects repeating subsequences from the tree. Finally, the algorithm iteratively tests whether each subsequence delimited by a pair of repeating subsequences meets a given criterion. Test results with 1000 proteins from 20 families in Pfam show that the repeating subsequences are a good indicator for the low-complexity regions, and that the algorithm based on such structural information competes strongly with others. AVAILABILITY: http://bioinfo.knu.ac.kr/research/CARD/ CONTACT: swshin@bioinfo.knu.ac.kr
2.
Summary: DNA sequences reassociating within a Cot value of 1.8×10⁻¹ and those producing a light satellite in a CsCl density gradient were isolated from Vicia faba DNA and hybridized in situ on squashes of roots of the same species. Silver grains were seen to be scattered over both the interphase nuclei and the metaphase chromosomes after hybridization with fast renaturing DNA sequences, indicating these are fairly regularly interspersed in the V. faba genome. Clustered labeling occurred after hybridization with satellite DNA sequences, indicating these are clustered in the genome. The localization of satellite DNA in chromosomes appeared to correspond closely to the position of the bright bands detectable after staining with quinacrine mustard. After hybridization with both DNA probes, labeling intensity over the nuclei of meristematic cells was higher than that over the nuclei of differentiating and/or differentiated cells. These results are discussed in relation to the structure of the cell nucleus, the mechanism of quinacrine banding and to previous data suggesting underrepresentation of nuclear repeated DNA sequences in differentiating V. faba root cells.
3.
A fast method to predict protein interaction sites from sequences (Total citations: 15; self-citations: 0; citations by others: 15)
A simple method for predicting residues involved in protein interaction sites is proposed. In the absence of any structural report, the procedure identifies linear stretches of sequences as "receptor-binding domains" (RBDs) by analysing hydrophobicity distribution. The sequences of two databases of non-homologous interaction sites eliciting various biological activities were tested; 59-80% were detected as RBDs. A statistical analysis of amino acid frequencies was carried out in known interaction sites and in predicted RBDs. RBDs were predicted from the 80,000 sequences of the Swissprot database. In both cases, arginine is the most frequently occurring residue. The RBD procedure can also detect residues involved in specific interaction sites such as the DNA-binding (95% detected) and Ca-binding domains (83% detected). We report two recent analyses, from the prediction of RBDs in sequences to the experimental demonstration of the functional activities. The examples concern a retroviral Gag protein and a penicillin-binding protein. We propose that this method is a quick way to predict protein interaction sites from sequences and is helpful for guiding experiments such as site-specific mutagenesis, two-hybrid systems or the synthesis of inhibitors.
4.
5.
6.
A greedy algorithm for aligning DNA sequences. (Total citations: 39; self-citations: 0; citations by others: 39)
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.
7.
8.
Nicola Prunella; Liuni Sabino; Attimonelli Marcella; Pasole Graziano 《Bioinformatics (Oxford, England)》1993,9(5):541-545
A new string searching algorithm is presented aimed at searching for the occurrence of character patterns in longer character texts. The algorithm, specifically designed for nucleic acid sequence data, is essentially derived from the Boyer-Moore method (Comm. ACM, 20, 762-772, 1977). Both pattern and text data are compressed so that the natural 4-letter alphabet of nucleic acid sequences is considerably enlarged. The string search starts from the last character of the pattern and proceeds in large jumps through the text to be searched. The data compression and searching algorithm allows one to avoid searching for patterns not present in the text as well as to inspect, for each pattern, all text characters until the exact match with the text is found. These considerations are supported by empirical evidence and comparisons with other methods.
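The right-to-left comparison and jump behaviour described in this abstract can be illustrated with a minimal sketch. It assumes the Horspool simplification of Boyer-Moore and works on the plain 4-letter alphabet; the compression step that enlarges the alphabet is not reproduced, and the function name `horspool_search` is hypothetical rather than taken from the paper.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every exact occurrence of pattern in text.

    Illustrative Boyer-Moore-Horspool sketch, not the authors' algorithm.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for each character in pattern[:-1], the distance
    # from its rightmost occurrence to the last pattern position.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        # Compare right to left, starting from the last pattern character.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Jump by the shift of the text character aligned with the pattern
        # end; characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

With a 4-letter alphabet the jumps are usually short, which is one reason an enlarged alphabet, as the abstract proposes, can help.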
9.
10.
Tandemly polymerized regulatory elements, antisense RNA segments or ribozymes are potentially useful in selective gene silencing. However, existing methods of tandemly polymerizing short DNA segments are laborious. We present a procedure that can create cloned arrays of 40-70 monomer units in two steps. We have created long arrays of regulatory elements and potential ribozyme sequences. Silencing of human immunodeficiency virus (HIV-1) activation by tandem arrays of a regulatory element in human immune system cells and in other human and monkey cells is discussed.
11.
We present a fast algorithm to search for repeating fragments within protein sequences. The technique is based on an extension of the Smith-Waterman algorithm that allows the calculation of sub-optimal alignments of a sequence against itself. We are able to estimate the statistical significance of all sub-optimal alignment scores. We also rapidly determine the length of the repeating fragment and the number of times it is found in a sequence. The technique is applied to sequences in the Swissprot database, and to 16 complete genomes. We find that eukaryotic proteins contain more internal repeats than those of prokaryotic and archaeal organisms. The finding that 18% of yeast sequences and 28% of the known human sequences contain detectable repeats emphasizes the importance of internal duplication in protein evolution.
12.
A method to locate protein coding sequences in DNA of prokaryotic systems. (Total citations: 10; self-citations: 2; citations by others: 10)
cDNA sequence data from E. coli phages, for which complete genome sequences are known, have been analysed. From this analysis thirteen triplets have been identified as markers to distinguish protein-coding frames from fortuitous open reading frames. The region of -18 to +18 nucleotides around ATG/GTG has been analysed and used to distinguish initiator codons from internal ATG/GTG. With the aid of the criteria defined above, a method has been developed to locate protein coding sequences by a combination of 'gene search by signal' and 'gene search by content' approaches. Application of this method to prokaryotic systems, including those which were not part of our data base, indicates that it is quite accurate and general in nature.
13.
The random-breakage mapping method [Game et al. (1990) Nucleic Acids Res., 18, 4453-4461] was applied to DNA sequences in human fibroblasts. The methodology involves NotI restriction endonuclease digestion of DNA from irradiated cells, followed by pulsed-field gel electrophoresis, Southern blotting and hybridization with DNA probes recognizing the single copy sequences of interest. The Southern blots show a band for the unbroken restriction fragments and a smear below this band due to radiation-induced random breaks. This smear pattern contains two discontinuities in intensity at positions that correspond to the distance of the hybridization site to each end of the restriction fragment. By analyzing the positions of those discontinuities we confirmed the previously mapped position of the probe DXS1327 within a NotI fragment on the X chromosome, thus demonstrating the validity of the technique. We were also able to position the probes D21S1 and D21S15 with respect to the ends of their corresponding NotI fragments on chromosome 21. A third chromosome 21 probe, D21S11, has previously been reported to be close to D21S1, although an uncertainty about a second possible location existed. Since both probes D21S1 and D21S11 hybridized to a single NotI fragment and yielded a similar smear pattern, this uncertainty is removed by the random-breakage mapping method.
14.
Pattern-matching algorithms are a powerful tool for finding similarities and relationships among the steadily growing amount of known protein sequences. We present a fast, sensitive pattern-matching algorithm that describes a pattern by its physico-chemical properties rather than by occurrence of amino acids, using a fast, dynamic programming algorithm. Selected examples will demonstrate applications and advantages of our approach.
15.
T K Blackwell, J Huang, A Ma, L Kretzner, F W Alt, R N Eisenman, H Weintraub 《Molecular and cellular biology》1993,13(9):5216-5224
Using an in vitro binding-site selection assay, we have demonstrated that c-Myc-Max complexes bind not only to canonical CACGTG or CATGTG motifs that are flanked by variable sequences but also to noncanonical sites that consist of an internal CG or TG dinucleotide in the context of particular variations in the CA--TG consensus. None of the selected sites contain an internal TA dinucleotide, suggesting that Myc proteins necessarily bind asymmetrically in the context of a CAT half-site. The noncanonical sites can all be bound by proteins of the Myc-Max family but not necessarily by the related CACGTG- and CATGTG-binding proteins USF and TFE3. Substitution of an arginine that is conserved in these proteins into MyoD (MyoD-R) changes its binding specificity so that it recognizes CACGTG instead of the MyoD cognate sequence (CAGCTG). However, like USF and TFE3, MyoD-R does not bind to all of the noncanonical c-Myc-Max sites. Although this R substitution changes the internal dinucleotide specificity of MyoD, it does not significantly alter its wild-type binding sequence preferences at positions outside of the CA--TG motif, suggesting that it does not dramatically change other important amino acid-DNA contacts; this observation has important implications for models of basic-helix-loop-helix protein-DNA binding.
16.
I V Gar'kavtsev, T G Tsvetkova, N A Liapunova 《Molekuliarnaia genetika, mikrobiologiia i virusologiia》1989,(5):11-15
A new approach to screening of the repeated human DNA sequences tandemly arranged in the genome is described. Efficiency of the developed approach for search of tandemly arranged DNA sequences is corroborated by the obtained experimental data.
17.
The possible addition of extra sequences to simian virus 40 (SV40) DNA was analyzed by electron microscopy in two different cell systems, productively infected monkey cells and activated heterokaryons of monkey and transformed mouse 3T3 cells. We found that the closed circular DNA fraction, extracted from monkey cells at 70 h after infection with nondefective SV40 at a multiplicity of infection of 6 PFU/cell, contained oversized molecules (1.1 to 2.0 fractional lengths of SV40 DNA) constituting about 8% of the molecules having lengths equal to or shorter than SV40 dimer DNA. The oversized molecules had the entire SV40 sequence. The added DNA was heterogeneous in length. The sites of addition were not specific with reference to the EcoRI site. These results suggest that recombination between monkey and SV40 DNAs or partial duplication of SV40 DNA occurs at many sites on the SV40 chromosome. The integrated SV40 DNA is excised and replicates in activated heterokaryons. In this system, besides SV40 DNA we found heterogeneous undersized and oversized molecules containing SV40 sequences in the closed circular DNA population. Additions differing in size appeared to be overlapping and to have occurred at a preferential site on the SV40 chromosome. These results support the hypothesis that host DNA can be added to SV40 DNA at the site of integration at the time of excision.
18.
P Taylor 《Nucleic acids research》1986,14(1):437-441
This paper describes a comprehensive program for translating one or two DNA sequences into amino acid sequences. Written in FORTRAN, it was designed for maximum flexibility of use and easy maintenance, modification and portability. It has full comments throughout.
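The core operation such a program performs, translating a DNA sequence into amino acids, can be sketched in a few lines. This is an illustration in Python rather than the FORTRAN program the abstract describes; it assumes the standard genetic code, and the names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate a DNA string in the given reading frame (0, 1 or 2).

    Stop codons are written as '*', codons containing unrecognised
    characters as 'X'; a trailing incomplete codon is dropped.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

For example, `translate("ATGGCCTAA")` yields `"MA*"`, and the same sequence read in frame 1 yields `"WP"`.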
19.
W Bains 《Nucleic acids research》1986,14(1):159-177
I describe a computer program which can align a large number of nucleic acid sequences with one another. The program uses a heuristic, iterative algorithm which has been tested extensively, and is found to produce useful alignments of a variety of sequence families. The algorithm is fast enough to be practical for the analysis of large numbers of sequences, and is implemented in a program which contains a variety of other functions to facilitate the analysis of the aligned result.
20.
A Markov analysis of DNA sequences (Total citations: 12; self-citations: 0; citations by others: 12)
H Almagor 《Journal of theoretical biology》1983,104(4):633-645
We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
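The "correlation question" in this abstract can be made concrete with a small sketch. It assumes a first-order Markov chain, under which the expected count of a triplet xyz is approximately N(xy)·N(yz)/N(y), and compares that with the observed count. The finite-length correction that the abstract stresses is not included, and the function name `doublet_predicted_triplets` is hypothetical.

```python
from collections import Counter


def doublet_predicted_triplets(seq: str) -> dict[str, tuple[int, float]]:
    """Map each triplet to (observed count, first-order Markov expectation).

    The expectation for triplet xyz is N(xy) * N(yz) / N(y), using raw
    counts from the sequence with no finite-length correction.
    """
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    tri = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    result = {}
    for left, n_left in di.items():
        for right, n_right in di.items():
            # Two doublets chain into a triplet when they share the middle base.
            if left[1] != right[0]:
                continue
            triplet = left + right[1]
            expected = n_left * n_right / mono[left[1]]
            result[triplet] = (tri.get(triplet, 0), expected)
    return result
```

Large gaps between observed and expected counts would suggest that doublet frequencies alone do not determine the triplet frequencies, although for short sequences such gaps may simply reflect statistical noise.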