1.
In this paper, we propose a nongraphical representation for protein secondary structures. By counting the frequency of occurrence of all possible four-tuples (i.e., four-letter words) of a protein secondary structure sequence, we construct a set of 3x3 matrices for the corresponding protein secondary structure sequence. Furthermore, the leading eigenvalues of these matrices are computed and considered as invariants for the protein secondary structure sequences. To illustrate the utility of our approach, we apply it to a set of real data to distinguish protein structural classes. The result indicates that it can be used to complement the classification of protein secondary structures.

2.
The information capacity of a nucleotide sequence is defined through the specific entropy of its frequency dictionary, determined with respect to another dictionary containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.
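One way to make this definition concrete is sketched below. The abstract does not give the formula, so the sketch assumes the usual maximum-entropy reconstruction, in which the expected frequency of a q-letter word is f(prefix) · f(suffix) / f(overlap), and it reads the sequence as a ring so that dictionaries of different word lengths stay consistent. The function names are illustrative.

```python
import math
from collections import Counter

def frequency_dictionary(seq, q):
    """Frequencies of all q-letter words, reading seq as a ring (an assumption)."""
    if not 1 <= q <= len(seq):
        raise ValueError("q must be between 1 and len(seq)")
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(len(seq)))
    return {word: c / len(seq) for word, c in counts.items()}

def information_capacity(seq, q):
    """Specific entropy of the real q-word dictionary against the one
    reconstructed from (q-1)-words (assumed maximum-entropy continuation)."""
    if q < 2:
        raise ValueError("q must be at least 2")
    real = frequency_dictionary(seq, q)
    shorter = frequency_dictionary(seq, q - 1)
    overlap = frequency_dictionary(seq, q - 2) if q > 2 else None
    total = 0.0
    for word, f in real.items():
        # Most probable continuation: join prefix and suffix over their overlap.
        expected = shorter[word[:-1]] * shorter[word[1:]]
        if overlap is not None:
            expected /= overlap[word[1:-1]]
        total += f * math.log(f / expected)
    return total
```

For the periodic string `"ACGT" * 5`, pairs carry information beyond single letters, while triplets are fully predicted by pairs, so the value drops to zero at q = 3.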

3.
Aita T  Husimi Y  Nishigaki K 《Bio Systems》2011,106(2-3):67-75
To measure the similarity or dissimilarity between two given biological sequences, several papers have proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, we count the appearance frequencies of all the K-tuple words throughout each of the two given sequences. Then, the two sequences are transformed into their respective word-composition vectors. Next, a distance metric, for example the angle between the two vectors, is calculated. A significant issue is to determine the optimal word size K. With a mathematical model of mutational events (including substitutions, insertions, deletions and duplications) that occur in sequences, we analyzed how the angle between the composition vectors depends on the mutational events. We also considered the optimal word size (i.e., resolution) from our original approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
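The three steps described here (count K-tuples, build composition vectors, take the angle) can be sketched as follows. This is a minimal illustration using raw overlapping word counts; the paper may normalize or correct the counts differently, and the function names are assumptions.

```python
import math
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def composition_angle(seq_a, seq_b, k):
    """Angle in radians between the K-tuple composition vectors of two sequences."""
    a, b = word_counts(seq_a, k), word_counts(seq_b, k)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least one k-letter word")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

Sequences that share no K-tuple words are orthogonal (angle π/2), while identical sequences give an angle of essentially zero.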

4.
Protein glycosylation is an important post-translational event that plays a pivotal role in protein folding and protein trafficking. We describe a dictionary-based and a rule-based approach to mine 'mentions' of protein glycosylation in text. The dictionary-based approach relies on a set of manually curated dictionaries specially constructed to address this task. Abstracts are screened for 'mentions' of words from these dictionaries, which are then scored and classified on the basis of a threshold. The rule-based approach also relies on the words in the dictionaries to derive the features used for classification. The performance of the system using both approaches has been evaluated on a manually curated corpus of 3133 abstracts. The evaluation suggests that the rule-based approach outperforms the dictionary-based approach.
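The dictionary-based pipeline (screen for dictionary words, score, classify by threshold) might look roughly like the sketch below. The term list, the weights, and the threshold are all hypothetical placeholders, not the curated dictionaries or scoring scheme of the paper.

```python
import re

# Hypothetical dictionary of terms and weights, for illustration only.
GLYCO_TERMS = {
    "glycosylation": 2.0,
    "glycosylated": 2.0,
    "n-linked": 1.0,
    "o-linked": 1.0,
    "glycan": 1.0,
}

def score_abstract(text, terms=GLYCO_TERMS):
    """Sum the weights of all dictionary terms mentioned in the text."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    return sum(terms.get(token, 0.0) for token in tokens)

def mentions_glycosylation(text, threshold=2.0):
    """Classify an abstract as positive when its score reaches the threshold."""
    return score_abstract(text) >= threshold
```

In practice the threshold would be tuned on a labeled corpus such as the 3133 abstracts mentioned above.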

5.
We present IceMorph, a semi-supervised morphosyntactic analyzer of Old Icelandic. In addition to machine-read corpora and dictionaries, it applies a small set of declension prototypes to map corpus words to dictionary entries. A web-based GUI allows expert users to modify and augment data through an online process. A machine learning module incorporates prototype data, edit-distance metrics, and expert feedback to continuously update part-of-speech and morphosyntactic classification. An advantage of the analyzer is its ability to achieve competitive classification accuracy with minimum training data.

6.
The information capacity of nucleotide sequences measures the unexpectedness of a continuation of a given string of nucleotides, and is thus soundly related to a variety of biological issues. A continuation is defined so as to maximize the entropy of the ensemble of such continuations. The capacity is defined as the mutual entropy of the real frequency dictionary of a sequence with respect to the one bearing the most expected continuations; it does not depend on the length of the strings contained in a dictionary. Various genomes exhibit a multi-minima pattern in the dependence of information capacity on string length, reflecting an order within a sequence. The strings whose expected frequency deviates significantly from the real one are the words of increased information value. Such words exhibit a non-random distribution along a sequence, making it possible to retrieve the correlation between a structure and a function encoded within a sequence.
Alexander S. Shchepanovsky

7.
《Genomics》2022,114(4):110414
Classification of viruses into their taxonomic ranks (e.g., order, family, and genus) provides a framework to organize an abundant population of viruses. Next-generation metagenomic sequencing technologies lead to a rapid increase in generating sequencing data of viruses which require bioinformatics tools to analyze the taxonomy. Many metagenomic taxonomy classifiers have been developed to study microbiomes, but it is particularly challenging to assign the taxonomy of diverse virus sequences and there is a growing need for dedicated methods to be developed that are optimized to classify virus sequences into their taxa. For taxonomic classification of viruses from metagenomic sequences, we developed VirusTaxo using diverse (e.g., 402 DNA and 280 RNA) genera of viruses. VirusTaxo has an average accuracy of 93% at genus level prediction in DNA and RNA viruses. VirusTaxo outperformed existing taxonomic classifiers of viruses where it assigned taxonomy of a larger fraction of metagenomic contigs compared to other methods. Benchmarking of VirusTaxo on a collection of SARS-CoV-2 sequencing libraries and metavirome datasets suggests that VirusTaxo can characterize virus taxonomy from highly diverse contigs and provide a reliable decision on the taxonomy of viruses.

8.
The frequencies of "words", oligonucleotides within nucleotide sequences, reflect the genetic information contained in the sequence "texts". Nucleotide sequences are characteristically represented by their contrast word vocabularies. Comparison of the sequences by correlating their contrast vocabularies is shown to reflect well the relatedness (unrelatedness) between the sequences. A single value, the linguistic similarity between the sequences, is suggested as a measure of sequence relatedness. Sequences as short as 1000 bases can be characterized and quantitatively related to other sequences by this technique. The linguistic sequence similarity value is used for analysis of taxonomically and functionally diverse nucleotide sequences. The similarity value is shown to be very sensitive to the relatedness of the source species, thus providing a convenient tool for taxonomic classification of species by their sequence vocabularies. Functionally diverse sequences appear distinct by their linguistic similarity values. This can be a basis for a quick screening technique for functional characterization of the sequences and for mapping functionally distinct regions in long sequences.

9.
Comparative analysis of partial tuf sequences was evaluated for the identification and differentiation of lactobacilli. Comparison of the amino acid sequences allowed differentiation between species and also between the subspecies of Lactobacillus delbrueckii. The nucleotide sequence comparison allowed differentiation between other subspecies and between some strains. Lactobacilli from several collections and isolates from dairy samples were clearly identified by comparison of short tuf sequences with those of the type strains. In evaluating the taxonomy of the Lactobacillus casei-related taxa, different tuf amino acid signatures are in favour of a classification into three distinct species. The type strain designation for the L. casei species is discussed.

10.
The information capacity of nucleotide sequences is defined through the calculation of the specific entropy of their frequency dictionary. The specific entropy of the frequency dictionary is calculated against the reconstructed dictionary; the latter bears the most probable continuations of the shorter strings. The developed measure makes it possible to distinguish the sequences both from random ones and from those with a high level of (rather simple) order. Some implications of the developed methodology in the fields of genetics, bioinformatics, and molecular biology are discussed.

11.
A complementary DNA library prepared from the 12S polyadenylated RNAs extracted from interferon-induced KG-1 cells, a human myeloblast cell line, was screened for the presence of induction-specific sequences. Clones that exhibited strong positive signals were separated by hybridization criteria into nine classes. Clones from classes I through IV consisted of about 78% of the total and unexpectedly were found to resemble human mitochondrial ribosomal RNA genes.

12.
A detailed knowledge of the mapping between sequence and structure spaces in populations of RNA molecules is essential to better understand their present-day functional properties, to envisage a plausible early evolution of RNA in a prebiotic chemical environment and to improve the design of in vitro evolution experiments, among others. Analysis of natural RNAs, as well as in vitro and computational studies, shows that certain RNA structural motifs are much more abundant than others, pointing to a complex relation between sequence and structure. Within this framework, we have investigated computationally the structural properties of a large pool (10^8 molecules) of single-stranded, 35 nt-long, random RNA sequences. The secondary structures obtained are ranked and classified into structure families. The number of structures in the main families is analytically calculated and compared with the numerical results. This permits a quantification of the fraction of structure space covered by a large pool of sequences. We further show that the number of structural motifs and their frequency are highly unbalanced with respect to the nucleotide composition: simple structures such as stem-loops and hairpins arise from sequences depleted in G, while more complex structures require an enrichment of G. In general, we observe a strong correlation between subfamilies (characterized by a fixed number of paired nucleotides) and nucleotide composition. Our results are compared to the structural repertoire obtained in a second pool where isolated base pairs are prohibited.

13.
Most of the gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a high number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are the non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful to perform further analyses on the genomic sequences themselves. The proposed approach has been applied to some complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which have revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.
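The abstract does not define its compression index, so the sketch below swaps in a simple stand-in to illustrate the general idea: an LZ78-style parse builds a dictionary of words from the sequence, and the number of dictionary words per nucleotide serves as a crude index. Both the parsing scheme and the index definition are assumptions, not the paper's method.

```python
def lz78_phrases(seq):
    """Parse seq LZ78-style and return the distinct phrases in order of discovery.
    A trailing fragment that repeats a known phrase is dropped."""
    seen = set()
    phrases = []
    current = ""
    for symbol in seq:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases

def compression_index(seq):
    """Dictionary words per symbol: lower values indicate more repetitive sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return len(lz78_phrases(seq)) / len(seq)
```

A highly repetitive stretch yields few dictionary words per nucleotide, while a more varied stretch yields more; a classifier along these lines would compare the index of a region against values typical of coding and non-coding sequence.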

14.
RNA molecules, which are found in all living cells, fold into characteristic structures that account for their diverse functional activities. Many of these RNA structures consist of a collection of fundamental RNA motifs. The various combinations of RNA basic components form different RNA classes and define their unique structural and functional properties. The availability of many genome sequences makes it possible to search computationally for functional RNAs. Biological experiments indicate that functional RNAs have characteristic RNA structural motifs represented by specific combinations of base pairings and conserved nucleotides in the loop regions. Searching for those well-ordered RNA structures and their homologues in genomic sequences is very helpful for the understanding of RNA-based gene regulation. In this paper, we consider the following problem: given an RNA sequence with a known secondary structure, efficiently determine candidate segments in genomic sequences that can potentially form RNA secondary structures similar to the given RNA secondary structure. Our new bottom-up approach first searches for all potential stem-loops similar to those of the given RNA secondary structure and then, based on the located stem-loops, detects potential homologous structural RNAs in genomic sequences.

15.
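Primers psaAL and psaAR were designed from the psaA gene sequence of Phaeocystis antarctica retrieved from GenBank and used to amplify by PCR and sequence a psaA gene fragment from Phaeocystis globosa, yielding a 629 bp DNA sequence. The psaA fragment sequences of P. globosa strains P1 and P2 and of P. antarctica were aligned with Clustal X. The alignment showed no insertions or deletions in the P. globosa psaA fragment, and the nucleotide difference rate was 3.34%. The amino acid sequences and RNA secondary structures corresponding to the psaA genes of P. globosa and P. antarctica were inferred with the DNAStar analysis software. The amino acid sequences differed little: only 1 of the 209 amino acids changed, an amino acid variation rate of 0.48%. Apart from some similar domains, the RNA secondary structures showed a certain degree of difference, which may be of reference value for molecular classification studies of Phaeocystis. Because the psaA gene fragment and amino acid sequences obtained are extremely conserved at the species level, they are not suitable for molecular classification among species of the genus Phaeocystis.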
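The reported rates follow from simple site counting over an alignment: 1 changed residue in 209 gives 0.48%, and 3.34% would correspond to about 21 differing sites if it was computed over all 629 nucleotide positions (an assumption; the abstract does not state the denominator). A minimal sketch of the calculation, with an illustrative function name:

```python
def difference_rate(aligned_a, aligned_b):
    """Percentage of aligned positions at which two equal-length sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for x, y in zip(aligned_a, aligned_b) if x != y)
    return 100.0 * mismatches / len(aligned_a)
```

The function expects sequences that are already aligned, for example the output of Clustal X with no gaps, as reported for this fragment.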

16.
Some earlier studies suggested an evolutionary relationship between the Raphidophyceae (chloromonads) and Xanthophyceae (yellow-green algae), whereas other studies suggested relationships with different algal classes or the oomycete fungi. To evaluate the relationships, we determined the complete nucleotide sequences of the 18S ribosomal RNA gene from the raphidophytes Vacuolaria virescens, Chattonella subsalsa, and Heterosigma carterae, and the xanthophytes Vaucheria bursata, Botrydium stoloniferum, Botrydiopsis intercedens, and Xanthonema debile. The results showed that the Xanthophyceae were most closely related to the Phaeophyceae. A cladistic analysis of combined data sets (nucleotide sequences, ultrastructure, and pigments) suggested the Raphidophyceae are the sister taxon to the Phaeophyceae-Xanthophyceae clade, but the bootstrap value was low (40%). The raphidophyte genera were united with high (100%) bootstrap values, supporting a hypothesis based upon ultrastructural features that marine and freshwater raphidophytes form a monophyletic group. We examined the relationship between Vaucheria, a siphonous xanthophyte alga, and the oomycetes, and we confirmed that Vaucheria is a member of the class Xanthophyceae. Partial nucleotide sequences of the 18S rRNA gene from eight xanthophytes (including Bumilleriopsis filiformis, Heterococcus caespitosus, and Mischococcus sphaerocephalus) produce a phylogeny that is not congruent with the current morphology-based classification scheme.

17.
J G Williams  R Hoffman  S Penman 《Cell》1977,11(4):901-907
The poly(A)-containing messenger RNAs of normal diploid fibroblasts and their SV40-transformed progeny cells are compared by cross-hybridizing cDNA. We find a high degree of homology between the mRNA from normal and transformed cells. Despite imperfections in the procedure, the technique permits the conclusion that, at most, 3% of the mRNA in the transformed cell has sequences not present in the normal parental cell. Furthermore, much of the difference appears to occur in low and intermediate complexity classes of mRNA molecules. Extensive homology in the mRNA sequences of disparate cell lines may be a general phenomenon, and even HeLa cell mRNA is nearly identical to that of diploid human fibroblasts.

18.
Poly(A)+ mRNA from mouse hepatoma ascites cell cytoplasm is characterized by three frequency classes: an abundant frequency class of a limited number of different nucleotide sequences, a less abundant frequency class of a larger number of different nucleotide sequences, and a rare frequency class containing a high number of different nucleotide sequences. [3H]cDNA synthesized on this poly(A)+ mRNA template hybridizes with some of the DNAs of the putative transcribable euchromatin fraction at a significantly faster rate than with total DNA if residual contaminating RNA is not removed. Following NaOH incubation to remove such RNA, the cDNA probe hybridized with essentially the same rate to the euchromatin fractions and total DNA. Nick translation of the nuclease-sensitive sequences of chromatin demonstrated that, even with limited nuclease digestion, the excised sequences rapidly converted to small oligonucleotides. The nick-translatable, small chromatin segments showed no enrichment for transcribable sequences. Chromatin segments, which distribute to the 50S-70S glycerol gradient fractions and which satisfy several of the presumptive criteria for enrichment for transcribable sequences, therefore show no enrichment for sequences complementary to the cDNA for poly(A)+ mRNA.

19.
Prokaryotic, eukaryotic and mitochondrial DNA sequences with a total length of 300,000 nucleotides have been analyzed to find out whether stretches of alternating purines and pyrimidines are unusual in terms of occurrence, composition and base sequence. Alternating runs longer than 5 nucleotides are significantly under-represented in the natural sequences as compared to random ones. Octanucleotides are the most deficient, occurring at only 60% of the frequency expected in random sequences. An unexpectedly high proportion of these octamers consists of alternating tetramers with the repeat structure (PuPyPuPy)₂ or (PyPuPyPu)₂. DNA stretches containing such sequences can potentially form an S1 nuclease-sensitive slippage (staggered loop) structure, which might serve as a locally unstacked intermediate in the B- to Z-DNA conformational transition.
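Finding the stretches discussed here amounts to scanning for maximal runs in which each base belongs to the opposite class (purine or pyrimidine) from its predecessor. The sketch below is a minimal illustration; the function name is an assumption, the default minimum length of 6 mirrors "longer than 5 nucleotides", and any character other than A or G is treated as a pyrimidine, so input is assumed to contain only A, C, G and T.

```python
PURINES = frozenset("AG")

def alternating_runs(seq, min_len=6):
    """Return (start, length) for each maximal purine/pyrimidine alternating run
    of at least min_len nucleotides."""
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the sequence end or where two neighbours share a class.
        if i == len(seq) or (seq[i] in PURINES) == (seq[i - 1] in PURINES):
            if i - start >= min_len:
                runs.append((start, i - start))
            start = i
    return runs
```

Comparing the run counts in a natural sequence with those in shuffled copies of the same sequence would give an estimate of under-representation of the kind reported above.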

20.
The computer system mRNA-FAST (mRNA: Function, Activity, STructure; http://wwwmgs.bionet.nsc.ru/mgs/dbases/trsig/) is described. The system has been developed to analyze nucleotide sequences of mRNA and to measure their essential properties. The system compiles a database of translation signals, including nucleotide sequences of the regulatory regions with structural and experimental information on their specific activities. It also contains programs to search for local homology between mRNA and translation signals, to search for potential signals based on analysis of oligonucleotide dictionaries, and to model RNA secondary structure. Possible applications of the mRNA-FAST system are discussed.
