Similar articles
Found 20 similar articles (search time: 15 ms)
1.
The identification of genes involved in host-pathogen interactions is important for the elucidation of mechanisms of disease resistance and host susceptibility. A traditional way to classify the origin of genes sampled from a pool of mixed cDNA is through sequence similarity to known genes from either the pathogen or host organism or other closely related species. This approach does not work when the identified sequence has no close homologues in the sequence databases. In our previous studies, we classified genes using their codon frequencies. This method, however, explicitly required the prediction of CDS regions and thus could not be applied to sequences composed of the non-coding regions of genes. In this study, we show that the use of sliding-window triplet frequencies extends the application of the algorithm to both coding and non-coding sequences and also increases the prediction accuracy of a Support Vector Machine classifier from 95.6 ± 0.3 to 96.5 ± 0.2. Thus the use of triplet frequencies reduced the classification error of the new method (from 4.4 to 3.5) by more than 20% compared to our previous approach. We also describe a functional analysis of the sequences, which detected gene families with a significantly higher or lower probability of being correctly classified than the average accuracy of the method. The server to perform classification of EST sequences using triplet frequencies is available at http://mips.gsf.de/proj/est3.
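To illustrate the sliding-window triplet frequencies described in this abstract, here is a minimal Python sketch. It is not the authors' implementation: the function name `triplet_frequencies`, the window step of one nucleotide, and the choice to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from itertools import product

BASES = "ACGT"
# All 64 possible nucleotide triplets in a fixed order, so that every
# sequence maps to a feature vector of the same length.
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]


def triplet_frequencies(seq):
    """Return a 64-element list of overlapping triplet frequencies.

    The window slides one nucleotide at a time (an assumption), so no
    reading frame or CDS prediction is required. Windows containing
    characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        # No valid window (sequence too short or fully ambiguous).
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]
```

Because the vector length is fixed at 64 regardless of sequence length, such vectors could in principle be passed directly to any SVM implementation as input features.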

2.
3.
Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVM(light). Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contributes to their distinctive codon composition profile. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms.

4.
5.
In order to study gene expression in a reproductive organ, we constructed a cDNA library of mature flower buds in Lotus japonicus, and characterized expressed sequence tags (ESTs) of 842 randomly selected clones. The EST sequences were clustered into 718 non-redundant groups. From BLAST and FASTA search analyses of both protein and DNA databases, 58.5% of the EST groups showed significant sequence similarities to known genes. Several genes encoding these EST clones were identified as pollen-specific genes, such as pectin methylesterase, ascorbate oxidase, and polygalacturonase, and as homologous genes involved in pollen-pistil interaction. Comparison of these EST sequences with those derived from the whole plant of L. japonicus revealed that 64.8% of EST sequences from the flower buds were not found in EST sequences of the whole plant. Taken together, the EST data from flower buds generated in this study are useful in dissecting gene expression in the floral organs of L. japonicus.

6.
7.
Identification and characterization of new plant microRNAs using EST analysis   (cited 50 times in total: 0 self-citations, 50 citations by others)
Seventy-five previously known plant microRNAs (miRNAs) were classified into 14 families according to their gene sequence identity. A total of 18,694 plant expressed sequence tags (ESTs) were found in the GenBank EST databases by comparing all previously known Arabidopsis miRNAs to GenBank's plant EST databases with BLAST algorithms. After removing the EST sequences with high numbers (more than 2) of mismatched nucleotides, a total of 812 EST contigs were identified. After predicting and scoring the RNA secondary structure of the 812 EST sequences using mFold software, 338 new potential miRNAs were identified in 60 plant species. miRNAs are widespread. Some microRNAs may be highly conserved in the plant kingdom, and they may have the same ancestor in very early evolution. There is no nucleotide substitution in most miRNAs among many plant species. Some of the newly identified potential miRNAs may be induced and regulated by environmental biotic and abiotic stresses. Some may be preferentially expressed in specific tissues, and are regulated by developmental switching. These findings suggest that EST analysis is a good alternative strategy for identifying new miRNA candidates, their targets, and other genes. A large number of miRNAs exist in different plant species and play important roles in plant developmental switching and plant responses to environmental abiotic and biotic stresses as well as signal transduction. Environmental stresses and developmental switching may be the signals for synthesis and regulation of miRNAs in plants. A model for miRNA induction and expression, and gene regulation by miRNA is hypothesized.
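The mismatch filter mentioned in this abstract (discarding ESTs with more than 2 mismatched nucleotides) can be sketched in Python as follows. This is only a simplified stand-in: the study used BLAST, whereas the sketch below performs a plain ungapped scan, and the function names `count_mismatches`, `best_hit_mismatches` and `filter_ests` are hypothetical.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def best_hit_mismatches(query, est):
    """Smallest mismatch count of `query` against any ungapped window of `est`.

    Returns None if the EST is shorter than the query.
    """
    n = len(query)
    if len(est) < n:
        return None
    return min(count_mismatches(query, est[i:i + n])
               for i in range(len(est) - n + 1))


def filter_ests(query, ests, max_mismatches=2):
    """Keep ESTs whose best ungapped match has at most `max_mismatches`."""
    kept = []
    for est in ests:
        m = best_hit_mismatches(query, est)
        if m is not None and m <= max_mismatches:
            kept.append(est)
    return kept
```

The default threshold of 2 mirrors the cutoff stated in the abstract; gapped alignments and reverse-complement matches, which a BLAST search would also consider, are deliberately left out of this sketch.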

8.
A novel approach for gene classification, which adopts codon usage bias as the input feature vector for classification by support vector machines (SVM), is proposed. The DNA sequence is first converted to a 59-dimensional feature vector where each element corresponds to the relative synonymous usage frequency of a codon. As the input to the classifier is independent of sequence length and variance, our approach is useful when the sequences to be classified are of different lengths, a condition under which homology-based methods tend to fail. The method is demonstrated by using 1,841 Human Leukocyte Antigen (HLA) sequences which are classified into two major classes: HLA-I and HLA-II; each major class is further subdivided into sub-groups of HLA-I and HLA-II molecules. Using codon usage frequencies, binary SVM achieved an accuracy rate of 99.3% for HLA major class classification and multi-class SVM achieved accuracy rates of 99.73% and 98.38% for sub-class classification of HLA-I and HLA-II molecules, respectively. The results show that gene classification based on codon usage bias is consistent with the molecular structures and biological functions of HLA molecules.
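The 59-dimensional feature vector described in this abstract can be illustrated with a short Python sketch of relative synonymous codon usage (RSCU). It is written under stated assumptions and is not the authors' code: it uses the standard genetic code, excludes the three stop codons and the single-codon amino acids Met and Trp (which leaves 59 codons), and assigns 0 to codons of amino acids absent from the sequence. The name `rscu_vector` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group synonymous codons by amino acid, dropping stop codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS.setdefault(aa, []).append(codon)

# Met and Trp have one codon each, which leaves 59 informative codons.
FEATURE_CODONS = sorted(c for codons in SYNONYMS.values()
                        if len(codons) > 1 for c in codons)


def rscu_vector(cds):
    """Return the 59 RSCU values of an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    counts = dict.fromkeys(CODON_TABLE, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases
            counts[codon] += 1
    rscu = {}
    for codons in SYNONYMS.values():
        if len(codons) == 1:
            continue
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Observed count divided by the count expected if all
            # synonymous codons were used equally.
            rscu[c] = counts[c] * len(codons) / total if total else 0.0
    return [rscu[c] for c in FEATURE_CODONS]
```

For example, in the toy sequence `CTGCTGCTGTTA` leucine is encoded three times by CTG and once by TTA; with six synonymous leucine codons, the RSCU of CTG is 3 / (4/6) = 4.5 and that of TTA is 1.5.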

9.
Knowledge of the three-dimensional structure of a protein is essential for describing and understanding its function. Today, a large number of known protein sequences contrasts with a small number of identified structures. Thus, the need arises to predict structure from sequence without using time-consuming experimental identification. In this paper the performance of Support Vector Machines (SVMs) is compared to Neural Networks and to standard statistical classification methods such as Discriminant Analysis and Nearest Neighbor Classification. We show that SVMs can beat the competing methods on a dataset of 268 protein sequences to be classified into a set of 42 fold classes. We discuss misclassification with respect to biological function and similarity. In a second step we examine the performance of SVMs if the embedding is varied from frequencies of single amino acids to frequencies of triplets of amino acids. This work shows that SVMs provide a promising alternative to standard statistical classification and prediction methods in functional genomics.

10.
Plant microRNA: a small regulatory molecule with big impact   (cited 20 times in total: 0 self-citations, 20 citations by others)

11.
12.
Helicosporidia are obligate invertebrate pathogens with a unique and highly adapted mode of infection. The evolutionary history of Helicosporidia has been uncertain, but several recent molecular phylogenetic studies have shown an unexpectedly close relationship to green algae, and specifically to the opportunistic pathogen Prototheca. To date, molecular sequences from Helicosporidia are restricted to those genes used for phylogenetic reconstruction and genes related to the existence and function of its cryptic plastid. We have therefore conducted a small expressed sequence tag (EST) project on Helicosporidium sp., yielding about 700 unique sequences. We have examined the functional distribution of known genes, the distribution of EST abundance, and the prevalence of previously unknown gene sequences. To demonstrate the potential utility of large amounts of data, we have used ribosomal proteins to test whether the phylogenetic position of Helicosporidium inferred from a small number of genes is broadly supported by a large number of genes. We conducted phylogenetic analyses on 69 ribosomal proteins and found that 98% supported the green algal origin of Helicosporidia and 80% supported a specific relationship with Prototheca. Overall, these data multiply the available molecular information from Helicosporidium 100-fold, which should provide the basis for new insights into these unusual but interesting parasites.

13.
14.
Rhizoctonia solani is a ubiquitous basidiomycetous soilborne fungal pathogen causing damping-off of seedlings, aerial blights and postharvest diseases. To gain insight into the molecular mechanisms of pathogenesis, a global approach based on analysis of expressed sequence tags (ESTs) was undertaken. To get broad gene-expression coverage, two normalized EST libraries were developed from mycelia grown under high nitrogen-induced virulent and low nitrogen/methylglucose-induced hypovirulent conditions. A pilot-scale assessment of gene diversity was made from the sequence analyses of the two libraries. A total of 2280 cDNA clones was sequenced that corresponded to 220 unique sequence sets or clusters (contigs) and 805 singlets, making up a total of 1025 unique genes identified from the two virulence-differentiated cDNA libraries. From the total sequences, 295 genes (38.7%) exhibited strong similarities with genes in public databases and were categorized into 11 functional groups. Approximately 61.3% of the R. solani ESTs have no apparent homologs in publicly available fungal genome databases and are considered unique genes. We have identified several cDNAs with potential roles in fungal pathogenicity, virulence, signal transduction, vegetative incompatibility and mating, drug resistance, lignin degradation, bioremediation and morphological differentiation. A codon-usage table has been formulated based on 14694 R. solani EST codons. Further analysis of ESTs might provide insights into virulence mechanisms of R. solani AG 4 as well as roles of these genes in development, saprophytic colonization and ecological adaptation of this important fungal plant pathogen.

15.
For comprehensive analysis of genes expressed in a model legume, Lotus japonicus, a total of 22,983 5' end expressed sequence tags (ESTs) were accumulated from normalized and size-selected cDNA libraries constructed from young (2 weeks old) plants. The EST sequences were clustered into 7137 non-redundant groups. A similarity search against the public non-redundant protein database indicated that 3302 groups showed similarity to genes of known function, 1143 groups to hypothetical genes, and 2692 were novel sequences. Homologues of 5 nodule-specific genes which have been reported in other legume species were contained in the collected ESTs, suggesting that the EST source generated in this study will become a useful tool for identification of genes related to legume-specific biological processes. The sequence data of individual ESTs are available at the web site: http://www.kazusa.or.jp/en/plant/lotus/EST/.

16.
MOTIVATION: A whole set of Expressed Sequence Tags (ESTs) from the Sf9 cell line of Spodoptera frugiperda is presented here for the first time. In this way, we aim to identify both conserved and specific genes of this pest species. We also expect this analysis to reveal a class of protein sequences that provides a tool to explore genomic features and phylogeny of Lepidoptera. RESULTS: The ESTs display both housekeeping and developmentally regulated genes, and a high percentage of sequences with unknown function. Among the identified ORFs, almost all ribosomal proteins (RPs) were found with high EST redundancy and hence sequence accuracy. The codon usage found among RP genes is on average surprisingly much less biased in Lepidoptera than in other organisms. Other Spodoptera genes also displayed a low bias, suggesting a general genome expression feature in this lepidopteran. We also found that the L35A and L36 RP sequences, respectively, display 40 and 10 amino-acid insertions, both being present only in insects. Sequence analysis suggests that they are probably not subjected to a strong selective pressure and may be good phylogenetic markers for Lepidoptera. Most interestingly, the Lepidoptera sequences of 9 RP genes displayed a specific signature different from the canonical one. We conclude that the RP family allows valuable comparative genomics and phylogeny of Lepidoptera. AVAILABILITY: All EST sequence data are available from the private 'Spodo-Base' upon request.

17.
18.
A large-scale comparative genomic analysis of unisequence sets obtained from an Ustilago maydis EST collection was performed against publicly available EST and genomic sequence datasets from 21 species. We annotated 70% of the collection based on similarity to known sequences and recognized protein signatures. Distinct grouping of the ESTs, defined by the presence or absence of similar sequences in the species examined, allowed the identification of U. maydis sequences present only (1) in fungal species, (2) in plants but not animals, (3) in animals but not plants, or (4) in all three eukaryotic lineages assessed. We also identified 215 U. maydis genes that are found in the ascomycete but not in the basidiomycete genome sequences searched. Candidate genes were identified for further functional characterization. These include 167 basidiomycete-specific sequences, 58 fungal pathogen-specific sequences (including 37 basidiomycete pathogen-specific sequences), and 18 plant pathogen-specific sequences, as well as two sequences present only in other plant pathogen and plant species. Supplemental Excel Table 1 used for analysis and the derivation of Fig. 3 as well as supplemental Tables 2 and 3 are available at All ESTs used in this analysis have been submitted to GenBank. The accession numbers are CF638289–CF645747, CF663122–CF663127, and CD487847–CD490309 (Supplemental Table 3).

19.
20.

Copyright©北京勤云科技发展有限公司  京ICP备09084417号