Similar literature
 20 similar documents found (search time: 15 ms)
1.
Background: To understand biological cellular systems, it is important to analyze interactions between protein residues and RNA bases. A method based on conditional random fields (CRFs) was previously developed for predicting contacts between residues and bases; it takes multiple sequence alignments for the given protein and RNA sequences, respectively, and learns a model with many parameters describing relationships between neighboring residue-base pairs by maximizing the pseudo-likelihood function. Methods: In this paper, we propose a novel CRF-based model with more complicated dependency relationships between random variables than the previous model, but with fewer parameters, in order to avoid overfitting the training data. Results: We performed cross-validation experiments to evaluate the proposed model and averaged the AUC (area under the receiver operating characteristic curve) scores. The results suggest that the proposed CRF-based model without L1-norm regularization (lasso) outperforms the existing model with and without the lasso under several input observations to the CRFs. Conclusions: We proposed a novel stochastic model for predicting protein-RNA residue-base contacts and improved the prediction accuracy in terms of the AUC score. This implies that more dependency relationships in a CRF can be controlled with fewer parameters.
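The evaluation described above (AUC per cross-validation fold, then the average) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `auc` and `mean_auc` and the toy scores are assumptions, and the AUC is computed with the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive example scores higher, ties counting one half).

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average AUC over cross-validation folds; each fold is (scores, labels)."""
    return sum(auc(scores, labels) for scores, labels in folds) / len(folds)


# Hypothetical contact scores for two folds (1 = contact, 0 = no contact).
folds = [
    ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    ([0.9, 0.1], [1, 0]),
]
print(mean_auc(folds))
```

The pairwise form is quadratic in the number of examples; for large test sets a rank-based computation would be preferable.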

2.
MOTIVATION: The fast-growing number of protein structures in the Protein Data Bank that carry the information needed to predict protein-protein interaction sites motivates us to develop methods for identifying the residues that participate in protein-protein interactions. We compare a conditional random fields (CRFs)-based method with conventional classification-based methods, which omit the relation between the labels of neighboring residues, to show the advantages of the CRFs-based method in predicting protein-protein interaction sites. RESULTS: The prediction of protein-protein interaction sites is solved as a sequential labeling problem by applying CRFs with features including the protein sequence profile and residue accessible surface area. The CRFs-based method achieves performance comparable to that of state-of-the-art methods when 1276 nonredundant hetero-complex protein chains are used as the training and test set. Experimental results show that the CRFs-based method is a powerful and robust protein-protein interaction site prediction method and can be used to guide biologists in designing specific experiments on proteins. AVAILABILITY: http://www.insun.hit.edu.cn/~mhli/site_CRFs/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
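The difference between labeling each residue independently and labeling the sequence jointly can be illustrated with Viterbi decoding over a linear chain. This is only a sketch under stated assumptions: the scores below are invented, label 0/1 stands for non-interface/interface, and the function name `viterbi` is illustrative; it is not the trained model or feature set from the abstract.

```python
def viterbi(emission, transition):
    """Best label sequence for a linear chain.

    emission[t][y]   : score of label y at position t
    transition[a][b] : score of label a followed by label b
    """
    n_labels = len(transition)
    best = list(emission[0])  # best score of any path ending in each label
    back = []                 # back-pointers, one list per position t >= 1
    for t in range(1, len(emission)):
        prev, best, ptr = best, [], []
        for y in range(n_labels):
            a = max(range(n_labels), key=lambda k: prev[k] + transition[k][y])
            ptr.append(a)
            best.append(prev[a] + transition[a][y] + emission[t][y])
        back.append(ptr)
    # Trace back from the best final label.
    y = max(range(n_labels), key=lambda k: best[k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


# Toy scores: the middle residue weakly prefers label 1 on its own,
# but neighboring labels are rewarded for agreeing.
emission = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transition = [[1.0, -1.0], [-1.0, 1.0]]

independent = [max(range(2), key=lambda y: e[y]) for e in emission]
print(independent)                    # per-residue argmax
print(viterbi(emission, transition))  # joint decoding
```

With these toy numbers the per-residue classifier flips the middle label, while joint decoding keeps the three labels consistent, which is the kind of neighbor dependency that window-free classifiers ignore.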

3.
SUMMARY: Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently proposed extension to conditional random fields (CRFs) that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches with that of Maximum Entropy (MaxEnt) and normal CRFs on three datasets, and all four methods improved on the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely used F-measure, while the dictionary HMMs performed best at finding entities that actually appear in the dictionary, the measure of most interest in our intended application. AVAILABILITY: Dictionary HMMs were implemented in Java. Algorithms are available through the information extraction package MINORTHIRD at http://minorthird.sourceforge.net

4.
Zhu F  Shen B 《PloS one》2012,7(6):e39230
Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use an SVM to separate biological terms from non-biological terms, before we use CRFs to determine the types of biological terms, which makes full use of the power of the SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of the SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop a useful algorithm and apply two rules. To ensure biological terms with a maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach also gradually extends the context so that more contextual information can be included. We examined the performance of four approaches with the GENIA corpus and the JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.

5.
MOTIVATION: Protein secondary structure prediction is an important step towards understanding how proteins fold in three dimensions. Recent analysis by information theory indicates that the correlation between neighboring secondary structures is much stronger than that between neighboring amino acids. In this article, we focus on the combination problem for sequences, i.e. combining the scores or assignments from single or multiple prediction systems under the constraint of a whole sequence, as a target for improvement in protein secondary structure prediction. RESULTS: We apply several graphical chain models to solve the combination problem and show that they are consistently more effective than the traditional window-based methods. In particular, conditional random fields (CRFs) moderately improve the predictions for helices and, more importantly, for beta sheets, which are the major bottleneck for protein secondary structure prediction.

6.
Accurate tertiary structures are very important for the functional study of non-coding RNA molecules. However, predicting RNA tertiary structures is extremely challenging, because of the large conformation space to be explored and the lack of an accurate scoring function for differentiating the native structure from decoys. The fragment-based conformation sampling method (e.g. FARNA) has the shortcoming that the limited size of a fragment library makes it infeasible to represent all possible conformations well. A recent dynamic Bayesian network method, BARNACLE, overcomes the issue of fragment assembly. In addition, neither of these methods makes use of sequence information in sampling conformations. Here, we present a new probabilistic graphical model, conditional random fields (CRFs), to model the RNA sequence-structure relationship, which enables us to accurately estimate the probability of an RNA conformation from sequence. Coupled with a novel tree-guided sampling scheme, our CRF model is then applied to RNA conformation sampling. Experimental results show that our CRF method can model the RNA sequence-structure relationship well and that sequence information is important for conformation sampling. Our method, named TreeFolder, generates a much higher percentage of native-like decoys than FARNA and BARNACLE, although we use the same simple energy function as BARNACLE. CONTACT: zywang@ttic.edu; j3xu@ttic.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.
Tagging biomedical entities such as gene, protein, cell, and cell-line is the first step and an important prerequisite in biomedical literature mining. In this paper, we describe our hybrid named entity tagging approach, BCC-NER (bidirectional, contextual clues named entity tagger for gene/protein mention recognition). BCC-NER comprises three modules. The first module is for text processing, which includes basic NLP pre-processing, feature extraction, and feature selection. The second module is for training and model building with bidirectional conditional random fields (CRF) to parse the text in both directions (forward and backward) and integrate the backward and forward trained models using the margin-infused relaxed algorithm (MIRA). The third and final module is for post-processing to achieve better performance, which includes surrounding text features, parenthesis mismatching, and a two-tier abbreviation algorithm. Evaluated on the BioCreative II GM test corpus, BCC-NER achieves a precision of 89.95, a recall of 84.15 and an overall F-score of 86.95, which is higher than that of the other currently available open source taggers.
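The reported F-score is consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition (the abstract does not state which F-measure was used); the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for BCC-NER (in percent).
print(round(f_score(89.95, 84.15), 2))  # 86.95
```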

8.
MOTIVATION: Order and Disorder prediction using Conditional Random Fields (OnD-CRF) is a new method for accurately predicting the transition between structured and mobile or disordered regions in proteins. OnD-CRF applies CRFs relying on features which are generated from the amino acid sequence and from secondary structure prediction. Benchmarking results based on CASP7 targets, and evaluation with respect to several CASP criteria, rank the OnD-CRF model highest among the fully automatic server group. AVAILABILITY: http://babel.ucmp.umu.se/ond-crf/

9.
Medicago truncatula is a fast-emerging model for the study of legume functional biology. We used the tobacco retrotransposon Tnt1 to tag the Medicago genome and generated over 7600 independent lines representing an estimated 190 000 insertion events. Tnt1 inserted on average at 25 different locations per genome during tissue culture, and insertions were stable during subsequent generations in soil. Analysis of 2461 Tnt1 flanking sequence tags (FSTs) revealed that Tnt1 appears to prefer gene-rich regions. The proportion of Tnt1 insertions in coding sequences was 34.1%, compared to the 15.9% expected if insertions were random. However, Tnt1 showed neither unique target site specificity nor strong insertion hot spots, although some genes were more frequently tagged than others. Forward-genetic screening of 3237 R1 lines resulted in identification of visible mutant phenotypes in approximately 30% of the regenerated lines. Tagging efficiency appears to be high, as all of the 20 mutants examined so far were found to be tagged. Taking the properties of Tnt1 into account and assuming 1.7 kb for the average M. truncatula gene size, we estimate that approximately 14 000–16 000 lines would be sufficient for 90% gene tagging coverage in M. truncatula. This is in contrast to more than 500 000 lines required to achieve the same saturation level using T-DNA tagging. Our data demonstrate that Tnt1 is an efficient insertional mutagen in M. truncatula, and could be a primary choice for other plant species with large genomes.
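A saturation estimate of this kind is often made with the formula P = 1 − (1 − x/G)^n, where x is the gene size, G the genome size and n the number of independent insertions. The sketch below uses that formula as an assumption; the abstract does not say which calculation the authors used. The gene size (1.7 kb), insertions per line (25) and the coding-sequence preference (34.1% observed against 15.9% expected) come from the abstract, whereas the genome size of 500 000 kb and the use of the preference ratio as a simple multiplier are assumptions made only for illustration, so the output is not expected to reproduce the published 14 000–16 000 figure.

```python
import math


def coverage(insertions, gene_kb, genome_kb, gene_bias=1.0):
    """Probability that a gene of size gene_kb carries at least one insertion."""
    hit = gene_bias * gene_kb / genome_kb  # per-insertion chance of hitting the gene
    return 1.0 - (1.0 - hit) ** insertions


def lines_needed(target, gene_kb, genome_kb, insertions_per_line, gene_bias=1.0):
    """Smallest number of lines giving at least `target` tagging probability."""
    hit = gene_bias * gene_kb / genome_kb
    insertions = math.log(1.0 - target) / math.log(1.0 - hit)
    return math.ceil(insertions / insertions_per_line)


GENOME_KB = 500_000       # ASSUMPTION: illustrative genome size, not from the abstract
GENE_BIAS = 34.1 / 15.9   # ASSUMPTION: preference ratio applied as a plain multiplier

print(lines_needed(0.90, 1.7, GENOME_KB, 25))                       # random insertion
print(lines_needed(0.90, 1.7, GENOME_KB, 25, gene_bias=GENE_BIAS))  # gene-biased insertion
```

The comparison of the two calls shows the qualitative point of the abstract: a preference for genic regions and many insertions per line both reduce the number of lines needed for a given coverage.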

10.
11.
Perceptual learning of visual features occurs when multiple stimuli are presented in a fixed sequence (temporal patterning), but not when they are presented in random order (roving). This points to the need for proper stimulus coding in order for learning of multiple stimuli to occur. We examined the stimulus coding rules for learning with multiple stimuli. Our results demonstrate that: (1) stimulus rhythm is necessary for temporal patterning to take effect during practice; (2) learning consolidation is subject to disruption by roving up to 4 h after each practice session; (3) importantly, after completion of temporal-patterned learning, performance is undisrupted by extended roving training; (4) roving is ineffective if each stimulus is presented for five or more consecutive trials; and (5) roving is also ineffective if each stimulus has a distinct identity. We propose that for multi-stimulus learning to occur, the brain needs to conceptually “tag” each stimulus, in order to switch attention to the appropriate perceptual template. Stimulus temporal patterning assists in tagging stimuli and switching attention through its rhythmic stimulus sequence.

12.
One of the first steps in understanding a protein's function is to determine its localization; however, the methods for localizing proteins in some systems have not kept pace with the developments in other fields, creating a bottleneck in the analysis of the large datasets that are generated in the post-genomic era. To address this, we developed tools for tagging proteins in trypanosomatids. We made a plasmid that, when coupled with long primer PCR, can be used to produce transgenes at their endogenous loci encoding proteins tagged at either terminus or within the protein coding sequence. This system can also be used to generate deletion mutants to investigate the function of different protein domains. We show that the length of homology required for successful integration precluded long primer PCR tagging in Leishmania mexicana. Hence, we developed plasmids and a fusion PCR approach to create gene tagging amplicons with sufficiently long homologous regions for targeted integration, suitable for use in trypanosomatids with less efficient homologous recombination than Trypanosoma brucei. Importantly, we have automated the primer design, developed universal PCR conditions and optimized the workflow to make this system reliable, efficient and scalable such that whole genome tagging is now an achievable goal.

13.
M de Zamaroczy  G Bernardi 《Gene》1992,122(1):91-99
The introns of three genes (oxi3, cob and 21S) from the mitochondrial (mt) genome of Saccharomyces cerevisiae contain closed reading frames (CRFs). In the present work, we have analyzed the oligodeoxyribonucleotide (oligo; isostich) patterns of these sequences. We have shown that the relative amounts of di- to hexanucleotides, when compared to random sequences having the same sizes and compositions, exhibit the same deviations as the intergenic noncoding sequences of the mt genome (except for the CRFs from the 21S intron). In contrast, intronic open reading frames (ORFs) showed oligo patterns which were generally quite distinct from those of CRFs, although some similarities could be detected in some cases (especially for aI5 alpha). The mt introns of yeast, therefore, are endowed with a mosaic structure, in which CRFs derive from mt intergenic sequences, whereas ORFs have a different origin (indicated as exogenous by other evidence) yet show, in some cases, the effects of 'sequence assimilation' with CRFs.
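Comparing oligonucleotide frequencies with those expected in a random sequence of the same size and composition can be sketched as an observed/expected ratio per oligo. The code below is an illustration only: it uses the analytic expectation under independent bases rather than whatever randomization the authors applied, and the function name `oligo_deviation` and the toy sequence are assumptions.

```python
from collections import Counter
from math import prod


def oligo_deviation(seq, k):
    """Observed/expected ratio for every k-mer that occurs in seq.

    The expected count assumes bases are drawn independently with the
    sequence's own base composition (a random sequence of equal size
    and composition).
    """
    seq = seq.upper()
    n = len(seq)
    freq = {base: count / n for base, count in Counter(seq).items()}
    windows = n - k + 1
    observed = Counter(seq[i:i + k] for i in range(windows))
    return {
        oligo: count / (windows * prod(freq[b] for b in oligo))
        for oligo, count in observed.items()
    }


# Toy sequence: strictly alternating A/T, so AT and TA are over-represented
# and AA/TT never occur.
ratios = oligo_deviation("ATATATAT", 2)
for oligo, ratio in sorted(ratios.items()):
    print(oligo, round(ratio, 2))
```

Ratios above 1 mark over-represented oligos and ratios below 1 under-represented ones; comparing such profiles between sequence classes is one simple way to look for shared deviations.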

14.
MOTIVATION: A large amount of biomolecular network data for multiple species has been generated by high-throughput experimental techniques, including undirected and directed networks such as protein-protein interaction networks, gene regulatory networks and metabolic networks. There are many conserved, functionally similar modules and pathways among biomolecular networks in different species; therefore, it is important to analyze the similarity between the biomolecular networks. Network querying approaches aim at efficiently discovering similar subnetworks among different species. However, many existing methods only partially solve this problem. RESULTS: In this article, a novel approach to the network querying problem based on a conditional random fields (CRFs) model is presented, which can handle both undirected and directed networks, acyclic and cyclic networks and any number of insertions/deletions. The CRF method is fast and can query pathways in a large network in seconds using a PC. To evaluate the CRF method, extensive computational experiments are conducted on simulated and real data, and the results are compared with those of existing network querying methods. All results show that the CRF method is useful and efficient for finding conserved, functionally similar modules and pathways in multiple biomolecular networks.

15.
Transposon display for active DNA transposons in rice   Cited by: 2 (self-citations: 0; citations by others: 2)
Transposon display (TD) is a powerful technique to identify the integration site of transposons in gene tagging as a functional genomic tool for elucidating gene function. Although active endogenous DNA transposons have been used extensively for gene tagging in maize, only two active endogenous DNA transposons in rice have been identified, the 0.43-kb element mPing of the MITE family and the 0.6-kb nDart element of the hAT family. The nDart transposition was shown to be induced by crossing with a line containing its autonomous element aDart and stabilized by segregating aDart under natural growth conditions, while mPing-related elements were shown to transpose in cultured cells, plants regenerated from an anther culture, and gamma-ray-irradiated plants. No somaclonal variation should occur in nDart-promoted gene tagging because no tissue culture was involved in nDart activation. As an initial step to develop an effective tagging system using nDart in rice, we tried to visualize GC-rich nDart-related elements comprising 18 nDart-related sequences of 0.6-kb and 63 nDart-related elements longer than 2 kb in Nipponbare by TD. Comparing the observed bands in TD with the anticipated virtual bands of the nDart-related elements based upon the available rice genome sequence, we have improved our TD protocol by optimizing the PCR amplification conditions and are able to visualize approximately 87% of the anticipated bands produced from the nDart-related elements. To compare the visualization efficiency of these nDart-related elements with that of 50 mPing elements and a unique Ping sequence in Nipponbare, we also tried to visualize the mPing-related elements; all mPing-related elements are easily visualized. Based on these results, we discuss the parameters affecting the visualization efficiencies of these rice DNA transposons. We also discuss the utilization of nDart elements in gene tagging for functional genomics in rice.

16.
We have screened a total of 5,500 T-DNA tagging rice lines in which a beta-glucuronidase (GUS) gene sequence was randomly inserted as a transgene into the plant genome. Histochemical GUS assays were carried out to select the T-DNA tagging rice lines that show GUS expression in the anther. Of the tagging lines screened, three lines (about 0.05%) were found to express GUS specifically in the anther. Microscopic observation of the anther-expressed lines showed specific expression patterns of GUS in the anther, with either gametophytic or sporophytic specificity. Southern blot analysis revealed that the integration copy number of the transgene was 2.3 on average. The detailed expression patterns were analyzed and discussed.

17.
Plant tagnology     
Transposable elements have been used as an effective mutagen and as a tool to clone tagged genes. Insertion of a transposable element into a gene can lead to loss- or gain-of-function, changes in expression pattern, or can have no effect on gene function at all, depending on whether the insertion took place in coding or non-coding regions of the gene. Cloning transposable elements from different plant species has made them available as a tool for the isolation of tagged genes using homologous or heterologous tagging strategies. Based on these transposons, new elements have been engineered bearing reporter genes that can be used for expression analysis of the tagged gene, or resistance genes that can be used to select for knockout insertions. While many genes have been cloned using transposon tagging following traditional forward genetics strategies, gene cloning has ceased to be the rate-limiting step in the process of determining sequence–function relations in several important plant model species. Large-scale insertion mutagenesis and identification of insertion sites following a reverse genetics strategy appears to be the best method for unravelling the biological role of the thousands of genes with unknown functions identified by genome or expressed sequence tag (EST) sequencing projects. Here we review the progress in forward tagging technologies and discuss reverse genetics strategies and their applications in different model species.

18.
Mutation in the Caenorhabditis elegans gene osm-6 was previously shown to result in defects in the ultrastructure of sensory cilia and defects in chemosensory and mechanosensory behaviors. We have cloned osm-6 by transposon tagging and transformation rescue and have identified molecular lesions associated with five osm-6 mutations. The osm-6 gene encodes a protein that is 40% identical in amino acid sequence to a predicted mammalian protein of unknown function. We fused osm-6 with the gene for green fluorescent protein (GFP); the fusion gene rescued the osm-6 mutant phenotype and showed accumulation of GFP in ciliated sensory neurons exclusively. The OSM-6::GFP protein was localized to cytoplasm, including processes and dendritic endings where sensory cilia are situated. Mutations in other genes known to cause ciliary defects led to changes in the appearance of OSM-6::GFP in dendritic endings or, in the case of daf-19, reduced OSM-6::GFP accumulation. We conclude from an analysis of genetic mosaics that osm-6 acts cell autonomously in affecting cilium structure.

19.
ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora.

20.
A significant fraction of the nuclear DNA of all eukaryotes is occupied by simple sequence repeats (SSRs) or microsatellites. This type of sequence has sparked great interest as a means of studying genetic variation, linkage mapping, gene tagging and evolution. Although SSRs at different positions in a gene help determine the regulation of expression and the function of the protein produced, little attention has been paid to the chromosomal organisation and distribution of these sequences, even in model species. This review discusses the main achievements in the characterisation of long-range SSR organisation in the chromosomes of Triticum aestivum L., Secale cereale L., and Hordeum vulgare L. (all members of Triticeae). We have detected SSRs using an improved FISH technique based on the random primer labelling of synthetic oligonucleotides (15-24 bases) in multi-colour experiments. Detailed information on the presence and distribution of AC, AG and all the possible classes of trinucleotide repeats has been acquired. These data have revealed the motif-dependent and non-random chromosome distributions of SSRs in the different genomes, and allowed the correlation of particular SSRs with chromosome areas characterised by specific features (e.g., heterochromatin, euchromatin and centromeres) in all three species. The present review provides a detailed comparative study of the distribution of these SSRs in each of the seven chromosomes of the genomes A, B and D of wheat, H of barley and R of rye. The importance of SSRs in plant breeding and their possible role in chromosome structure, function and evolution are discussed.
