Found 20 similar articles.
1.
The number of amino acid sequences in protein databases such as Swiss-Prot, UniProt, and PIR is increasing very rapidly, but structures for only some of these sequences are available in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph-theoretic techniques for clustering amino acid sequences. A similarity graph is defined, and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks to group amino acid sequences into subsets based on a distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity (sequences in the same cluster are highly similar to each other) and separation (sequences in different clusters have low similarity to each other). We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that, for a given set of proteins, the number of clusters we obtained is close to the number of superfamilies in that set, there are fewer singletons, and the method correctly groups most remote homologs.
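The abstract does not say how the similarity graph is built or how clusters are extracted from it. As a minimal sketch, assuming that an edge joins two sequences whenever their pairwise score meets a threshold and that each cluster is one connected component, the idea could look like this (the function name, the score table, and the threshold are all hypothetical):

```python
from itertools import combinations


def cluster_by_similarity(names, similarity, threshold):
    """Return the connected components of the thresholded similarity graph.

    Assumption: an edge exists when similarity(a, b) >= threshold. The
    paper's actual graph construction is not given in the abstract.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Toy pairwise scores (hypothetical values, not data from the paper).
scores = {("A", "B"): 0.9, ("B", "C"): 0.8, ("D", "E"): 0.7, ("A", "D"): 0.1}


def toy_similarity(a, b):
    return scores.get((a, b), scores.get((b, a), 0.0))


clusters = cluster_by_similarity(["A", "B", "C", "D", "E"], toy_similarity, 0.5)
```

Connected components give disjoint clusters by construction, which covers the separation criterion, but a single chain of moderately similar pairs can merge distant sequences, so homogeneity is not guaranteed by this sketch.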
2.
Raphael B, Liu LT, Varghese G 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2004,1(2):91-94
Buhler and Tompa (2002) introduced the random projection algorithm for the motif discovery problem and demonstrated that this algorithm performs well on both simulated and biological samples. We describe a modification of the random projection algorithm, called the uniform projection algorithm, which utilizes a different choice of projections. We replace the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections. We show that this change in selection of projections leads to improved performance on motif discovery problems. Furthermore, the uniform projection algorithm is directly applicable to other problems where the random projection algorithm has been used, including comparison of protein sequence databases.
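The abstract does not define the coverage measure that the greedy heuristic equalizes. The sketch below is therefore a simplified stand-in rather than the published algorithm: it assumes a projection is a set of k positions within an l-mer and that coverage means how many projections include each position. The function names are hypothetical.

```python
def uniform_projections(length, k, count):
    """Greedily choose `count` projections of k positions each.

    Assumption: coverage is counted per position; the paper's own
    criterion may differ.
    """
    coverage = [0] * length
    projections = []
    for _ in range(count):
        # Take the k least-covered positions, breaking ties by index.
        by_coverage = sorted(range(length), key=lambda p: (coverage[p], p))
        chosen = tuple(sorted(by_coverage[:k]))
        for p in chosen:
            coverage[p] += 1
        projections.append(chosen)
    return projections, coverage


def project(lmer, positions):
    """Hash key for an l-mer: its letters at the projected positions."""
    return "".join(lmer[p] for p in positions)


projections, coverage = uniform_projections(length=6, k=2, count=3)
key = project("ACGTAC", projections[1])
```

In a projection-based motif finder, l-mers that share a key under some projection fall into the same bucket, and heavily populated buckets are used as starting points for refinement.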
3.
With various ‘omics’ data becoming available recently, new challenges and opportunities have arisen for research on the assembly of next-generation sequences. To exploit these opportunities, we developed a next-generation sequence clustering method that focuses on the interdependency between genomics and proteomics data. Under the assumption that we can obtain next-generation read sequences and proteomics data of a target species, we mapped the read sequences against protein sequences and found physically adjacent reads based on a machine learning-based read assignment method. We measured the performance of our method by using simulated read sequences and collected protein sequences of Escherichia coli (E. coli). Here, we concentrated on the actual adjacency of the clustered reads in the E. coli genome and found that (i) the proposed method improves the performance of read clustering and (ii) the use of proteomics data has the potential to enhance the performance of genome assemblers. These results demonstrate that the integrative approach is effective for the accurate grouping of adjacent reads in a genome, which will result in a better genome assembly.
4.
A Markov analysis of DNA sequences (Cited by: 12; self-citations: 0; citations by others: 12)
H Almagor 《Journal of theoretical biology》1983,104(4):633-645
We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
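The "correlation question" can be illustrated with a small sketch. Under a first-order Markov model, the expected frequency of a triplet xyz is f(xy) · f(yz) / f(y), so comparing this prediction with the observed triplet frequencies indicates how much the doublets already explain. The code below is only an illustration with hypothetical function names, and it omits the finite-length correction that the abstract stresses is essential for real sequences.

```python
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def expected_triplets(seq):
    """Triplet frequencies predicted from doublets by a first-order chain.

    Uses f(xyz) = f(xy) * f(yz) / f(y). No finite-length correction is
    applied (a simplification relative to the paper).
    """
    f1 = kmer_freqs(seq, 1)
    f2 = kmer_freqs(seq, 2)
    return {
        xy + yz[1]: f2[xy] * f2[yz] / f1[xy[1]]
        for xy in f2
        for yz in f2
        if xy[1] == yz[0]
    }


toy_seq = "ACGT" * 50  # toy input, not a natural sequence
observed = kmer_freqs(toy_seq, 3)
predicted = expected_triplets(toy_seq)
```

For this periodic toy sequence the doublets determine the triplets almost exactly; for natural sequences the residual between `observed` and `predicted` is what has to be weighed against statistical noise.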
5.
6.
A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase (Cited by: 1; self-citations: 0; citations by others: 1)
A simple and rapid method for determining nucleotide sequences in single-stranded DNA by primed synthesis with DNA polymerase is described. It depends on the use of Escherichia coli DNA polymerase I and DNA polymerase from bacteriophage T4 under conditions of different limiting nucleoside triphosphates and concurrent fractionation of the products according to size by ionophoresis on acrylamide gels. The method was used to determine two sequences in bacteriophage φX174 DNA using the synthetic decanucleotide A-G-A-A-A-T-A-A-A-A and a restriction enzyme digestion product as primers.
7.
8.
9.
Clustering is a major tool for microarray gene expression data analysis. The existing clustering methods fall mainly into two categories: parametric and nonparametric. The parametric methods generally assume a mixture of parametric subdistributions. When the mixture distribution approximately fits the true data-generating mechanism, the parametric methods perform well, but not when the two deviate non-negligibly. On the other hand, the nonparametric methods, which usually make no distributional assumptions, are robust but less efficient. In an attempt to use the known mixture form to increase efficiency, while dropping assumptions about the unknown subdistributions to enhance robustness, we propose a semiparametric method for clustering. The proposed approach has the form of a parametric mixture, with no assumptions about the subdistributions. The subdistributions are estimated nonparametrically, with constraints imposed only on the modes. An expectation-maximization (EM) algorithm along with a classification step is invoked to cluster the data, and a modified Bayesian information criterion (BIC) is employed to guide the determination of the optimal number of clusters. Simulation studies are conducted to assess the performance and the robustness of the proposed method. The results show that the proposed method yields a reasonable partition of the data. As an illustration, the proposed method is applied to a real microarray data set to cluster genes.
10.
W. Grant Cooper 《Biochemical genetics》1995,33(5-6):173-181
A model explaining properties exhibited by fragile-X DNA systems arises from observations that time-dependent base substitutions are expressed at G-C sites but not at A-T sites (Biochem. Genet. 32:383, 1994). [CGG]n sequences are classified as most sensitive to evolutionary base substitution processes involving time-dependent populating of G-C sites with enol-imine states having enhanced stability. Increased density of these states in oocyte DNA would introduce a ground-state collapsed double helix of reduced energy that would inhibit strand separation by the replicase. Evolutionarily altered G in CGG triplets allows CGG to be transcribed as CTG, an initiation codon. This will cause reinitiation of DNA synthesis, thereby adding additional CGG units to the collapsed double helix. This situation would not occur in slower-evolving male haploid DNA that replicates frequently.
11.
Here we propose a weighted measure for the similarity analysis of DNA sequences. It is based on LZ complexity and (0,1) characteristic sequences of DNA sequences. This weighted measure enables biologists to extract similarity information from biological sequences according to their requirements. For example, with this weighted measure, one can obtain either the full similarity information or a similarity analysis from a given biological aspect. Moreover, the length of the DNA sequences is not a limitation. The application of the weighted measure to the similarity analysis of β-globin genes from nine species shows its flexibility.
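The two ingredients named in the abstract can be sketched as follows. The choice of base classes (purine, amino, weak hydrogen bond) is an assumption based on groupings commonly used for (0,1) characteristic sequences, and the complexity function counts phrases in the Lempel-Ziv (1976) exhaustive parsing. The weighting scheme and the resulting distance are not described in the abstract and are not reproduced here.

```python
# Assumed base groupings; the abstract does not say which ones are used.
BASE_CLASSES = {"purine": "AG", "amino": "AC", "weak_h_bond": "AT"}


def characteristic_sequence(dna, ones):
    """Map a DNA string to a (0,1) string: 1 if the base is in `ones`."""
    return "".join("1" if base in ones else "0" for base in dna)


def lz_complexity(s):
    """Number of phrases in the Lempel-Ziv (1976) parsing of s."""
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from earlier text,
        # where the copy source may overlap the phrase itself.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


purine_seq = characteristic_sequence("ACGT", BASE_CLASSES["purine"])
c = lz_complexity("0001101001000101")
```

The binary string in the usage line is the standard worked example from the LZ complexity literature, which parses into six phrases (0, 001, 10, 100, 1000, 101).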
12.
Background
Data clustering analysis has been extensively applied to extract information from gene expression profiles obtained with DNA microarrays. To this aim, existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis. However, previous studies revealed that microarray datasets have very diverse structures, some of which may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point, and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process.
13.
14.
A method has been developed which combines in situ hybridization with serial sectioning of a tissue. This procedure allows the analysis of three-dimensional distribution of specific DNA sequences within nuclei.
15.
L V Gening, T V Begetova, T G Gazarian, M V Alekseeva, K G Gazarian 《Molekuliarnaia genetika, mikrobiologiia i virusologiia》1987,(7):16-18
Phage clones containing a gene coding for bovine growth hormone were isolated from a bovine genomic library. Comparison of the 5' and 3' regions flanking the bovine growth hormone gene by Southern blot hybridization revealed that they share homology. Screening the bovine genomic library with a nick-translated DNA fragment from the 5' flanking region leads to the conclusion that this sequence is present in 0.1% of clones. Each analysed clone carrying the sequence contains several copies of it.
16.
We describe a new method that is well-suited for the determination of the methylation level of repetitive sequences such as human Alu elements. We have applied the method to the analysis of cell and tissue DNAs and expect it to have wide utility in studies of DNA methylation in cancer and other disease states, in monitoring response to epigenetic cancer therapies and in epidemiological studies. Only 1 ng DNA is needed for a duplex, one-tube real-time PCR in which methylation level and the amount of input DNA are concurrently measured. The relative cutting by the methylation-sensitive enzyme BstUI is compared with that of the methylation-insensitive enzyme DraI to give a measure of DNA methylation. The method depends upon the use of 5'-tailed, 3'-blocked oligonucleotides called facilitator oligonucleotides (Foligos). Only cut DNAs with specific matching sequences at their 3' ends can copy the tails of the Foligos and thus become tagged and available for subsequent PCR. Both the tagging and PCR are carried out by the same enzyme, Taq DNA polymerase. Because amplification only occurs if suitable ends have been generated in the target DNA, we have called this method end-specific PCR (ESPCR). ESPCR avoids the bisulfite treatment step that is usually required to measure methylation.
17.
Telomeric repeat sequences (Cited by: 6; self-citations: 0; citations by others: 6)
Chromosomes not only carry transcribed genes and their regulatory DNA sequences, but also contain regions that are required for the stability and maintenance of the chromosome as a unit. These include centromeres, telomeres and origins of replication. It is clear for replication origins and centromeres that the positions of these chromosomal organelles are determined by sites of the appropriate DNA sequences, but also that functional performance requires one or more contributing proteins. Telomeres are also structurally complex, with one or more DNA components, including simple telomeric repeats and more complex telomere-associated sequences, as well as one or more specific proteins that recognize these sequences. Accumulating evidence suggests that the simple telomeric repeats are required in most, but not all species, although they are not sufficient to determine the chromosomal position of a telomere.
18.
Detection of CAG repeat DNA sequences by pyrene-functionalized pyrrole-imidazole polyamides (Cited by: 1; self-citations: 0; citations by others: 1)
Bando T, Fujimoto J, Minoshima M, Shinohara K, Sasaki S, Kashiwazaki G, Mizumura M, Sugiyama H 《Bioorganic & medicinal chemistry》2007,15(22):6937-6942
Five N-methylpyrrole-N-methylimidazole (Py-Im) polyamides possessing a fluorescent pyrene were synthesized by Fmoc solid-phase synthesis using Py/Im monomers and pyrenylbutyl-pyrrole monomer compound 9. The steady state fluorescence of conjugates 1-5 was examined in the presence and absence of (CAG)₁₂-containing oligodeoxynucleotides (ODNs) 1 and 2. Of the conjugates, conjugate 1 showed no background emission around 470 nm in the absence of ODNs, and a clear increase of emission at 475 nm was observed upon addition of ODNs 1 and 2. The emission of conjugate 1 at 475 nm increased linearly with the concentration of ODN and the number of CAG repeats. The results indicate that conjugate 1 efficiently forms a pyrene excimer upon binding in the minor groove of DNA.
19.
In this study, a simple 4^k-dimensional feature representation vector is proposed to reconstruct phylogenetic trees, where k is the length of a word. The vector is composed of elements that characterize the relative difference between a biological sequence and a sequence generated by an independent random process. In addition, the variance of a vector obtained by averaging every column of the feature representation matrix is used to determine an appropriate word length. In our experiments, reliable results were always obtained at word lengths below 7, which keeps the computational complexity low. Phylogenetic trees of 24 transferrins and 48 hepatitis E viruses reconstructed at word length 6 are in good agreement with previous studies, which shows that our method is efficient and powerful.
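The abstract does not give the exact formula for the vector elements. As a hedged sketch, assume each of the 4^k elements is (observed - expected) / expected, where the expected frequency of a word under the independent random model is the product of its single-nucleotide frequencies. The function name is hypothetical, and the tree-building step is omitted.

```python
from collections import Counter
from itertools import product


def feature_vector(seq, k):
    """4**k relative deviations of k-word frequencies from independence.

    Assumption: element = (observed - expected) / expected, with the
    expected value taken from single-base frequencies of the same
    sequence. The paper's definition may differ.
    """
    n_words = len(seq) - k + 1
    word_counts = Counter(seq[i:i + k] for i in range(n_words))
    base_freq = {base: seq.count(base) / len(seq) for base in "ACGT"}
    vector = []
    for word in map("".join, product("ACGT", repeat=k)):
        observed = word_counts[word] / n_words
        expected = 1.0
        for base in word:
            expected *= base_freq[base]
        # Words that cannot occur under the model get a zero entry.
        vector.append((observed - expected) / expected if expected > 0 else 0.0)
    return vector


vec = feature_vector("ACGTACGTAC", 2)
```

The vector length grows as 4^k (4096 entries at k = 6), which is one reason a short word length keeps the computation cheap.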
20.
A quantitative method for analyzing specific DNA sequences directly from whole cells (Cited by: 10; self-citations: 0; citations by others: 10)
A quick, accurate assay for specific DNA sequences is described in which whole cells are treated with 0.4 M sodium hydroxide at 80 °C. DNA is relatively resistant to alkaline hydrolysis, whereas proteins and RNA are degraded rapidly. The DNA in NaOH is then transferred through a slot directly onto a nylon membrane and hybridized with a probe. Since the procedure is so simple, many samples can be analyzed in a short time. A single-copy gene can be detected in as few as 1000 cells and, since the DNA from 10^5 cells can be loaded through a single slot, the sensitivity is sufficient to detect one specific DNA sequence per 100 cells. Accurate quantitative analysis can be achieved by normalizing the amount of DNA available for hybridization in each slot, using a probe derived from total DNA.
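The sensitivity figure follows from simple arithmetic, sketched below on the assumption that the hybridization signal scales linearly with the number of target copies loaded (the variable names are illustrative):

```python
# Figures taken from the abstract.
cells_for_detection = 1_000   # cells needed to detect a single-copy gene
cells_per_slot = 100_000      # 10^5 cells loadable through one slot

# A full slot holds 100 times the detection minimum, so the target only
# needs to be present in 1 of every 100 loaded cells (assuming linearity).
min_positive_fraction = cells_for_detection / cells_per_slot
cells_per_detectable_copy = cells_per_slot // cells_for_detection
```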