Similar literature
20 similar articles found (search time: 31 ms)
3.

Background

DNA clustering is an important technique for automatically discovering inherent relationships among large sets of DNA sequences, but clustering quality still leaves considerable room for improvement. The sequence similarity metric is one of the key factors in clustering. Alignment-free methods are a popular way to compute DNA sequence similarity: rather than matching strings directly, they typically map a sequence into a feature space based on the probability distribution of its words. Existing alignment-free models, such as k-tuple, use only word frequency information and ignore other useful information contained in a DNA sequence, such as the classification and position of nucleotide bases. Combining several kinds of information is expected to yield better data mining results. We therefore present a new alignment-free model that uses compound information to improve DNA clustering quality.
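As a rough illustration of the word-frequency idea behind k-tuple style models, the following minimal sketch maps a sequence to a vector of relative k-word frequencies and compares two sequences by Euclidean distance. The function names, the default k, and the choice of distance are assumptions made for illustration; they are not taken from the abstract.

```python
from collections import Counter
from itertools import product


def ktuple_frequencies(seq: str, k: int = 2) -> list[float]:
    """Return relative frequencies of all 4**k DNA words, in lexicographic order."""
    seq = seq.upper()
    # Slide a window of width k over the sequence; ignore windows with non-ACGT symbols.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two equal-length feature vectors (an assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Because only word counts enter the vector, two sequences with the same words in a different order receive identical features, which is the limitation the abstract points to.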

Results

This paper proposes the Category-Position-Frequency (CPF) model, which uses the word frequency, position, and classification information of the nucleotide bases in a DNA sequence. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases and then produces a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conducted DNA clustering experiments on several datasets and compared CPF with mainstream alignment-free models, including k-tuple, DMk, TSM, AMI, and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings.
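The conversion step can be sketched as follows, assuming the three categories are the common binary groupings of bases (purine/pyrimidine, amino/keto, weak/strong); the abstract does not name them. With a window of two over a two-letter alphabet, each converted sequence has four possible words, which gives 3 × 4 = 12 features. This sketch substitutes plain overlapping word counts for the paper's entropy-based feature values, which are not described in enough detail here to reproduce.

```python
from itertools import product

# Assumed binary groupings of the four bases (not specified in the abstract).
CATEGORIES = {
    "purine_pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino_keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak_strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def category_sequences(seq: str) -> dict[str, str]:
    """Convert one DNA sequence into three two-letter category sequences."""
    seq = seq.upper()
    return {
        name: "".join(mapping[base] for base in seq if base in mapping)
        for name, mapping in CATEGORIES.items()
    }


def category_word_counts(seq: str, window: int = 2) -> list[int]:
    """Count overlapping words of length `window` in each category sequence.

    Plain counts stand in for the entropy-based values used by the CPF model.
    """
    converted = category_sequences(seq)
    features = []
    for name, mapping in CATEGORIES.items():
        text = converted[name]
        alphabet = sorted(set(mapping.values()))
        for letters in product(alphabet, repeat=window):
            word = "".join(letters)
            features.append(
                sum(text[i:i + window] == word for i in range(len(text) - window + 1))
            )
    return features
```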

Conclusions

The following conclusions can be drawn from the experiments. (1) A model that combines several kinds of information performs better than a model based on word frequency alone. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which greatly benefits system performance. (3) The CPF model achieves efficient, stable performance and generalizes broadly.

5.

Background

Brassica napus is the third leading source of vegetable oil in the world, after soybean and oil palm. The accumulation of gene sequences, especially expressed sequence tags (ESTs) from plant cDNA libraries, has provided a rich resource for gene discovery, including the discovery of potential antimicrobial peptides (AMPs). In this study, we used ESTs, including those generated from B. napus cDNA libraries of seeds and pathogen-challenged leaves and those deposited in public databases, as a model for the in silico identification and subsequent in vitro confirmation of putative AMP activities through a highly efficient system of recombinant AMP prokaryotic expression.

Results

In total, 35,788 ESTs were generated from cDNA libraries of pathogen-challenged leaves and 187,272 ESTs from seeds of B. napus, and 644,998 B. napus ESTs were downloaded from the EST database of PlantGDB. Together they formed 201,200 unigenes. First, all the known AMPs from the AMP databank (APD2 database) were individually queried against all the unigenes using the BLASTX program. A total of 972 unigenes that matched the 27 known AMP sequences in the APD2 database were extracted and annotated using the Blast2GO program. Among these, the 237 unigenes from B. napus pathogen-challenged leaves had the highest ratio (1.15%) within their unigene dataset, 13 times that of the B. napus seed unigene dataset (0.09%) and 2.3 times that of the public EST dataset. About 87% of each EST library consisted of lipid-transfer proteins (LTPs; 32% of total unigenes), defensins, histones, endochitinases, and gibberellin-regulated proteins. The most abundant unigenes in the leaf library were endochitinase and defensin, whereas LTP and histone were the most abundant in the public EST library. After masking of repeat sequences, 606 peptides that were orthologous matches to different AMP families were found. The phylogeny and conserved structural motifs of seven AMP families were also analysed. To investigate the antimicrobial activities of the predicted peptides, 31 potential AMP genes belonging to different AMP families were selected for testing after bioinformatic identification. The AMP genes were all optimized according to Escherichia coli codon usage and synthesized by a one-step polymerase chain reaction method. The results showed that 28 recombinant AMPs displayed the expected antimicrobial activities against E. coli, Micrococcus luteus, and Sclerotinia sclerotiorum strains.

Conclusion

The study not only significantly expanded the number of known and predicted peptides but also contributes to the long-term genetic improvement of B. napus for increased resistance to diverse pathogens. These results show that the high-throughput method developed here, which combines an in silico procedure with a recombinant AMP prokaryotic expression system, is efficient for identifying new AMPs from genome or EST sequence databases.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1849-x) contains supplementary material, which is available to authorized users.

9.
CP Li, ZG Yu, GS Han, KH Chu. PLoS ONE, 2012, 7(7): e42154

Background

The composition vector (CV) method has proven to be a reliable and fast alignment-free method for analyzing large COI barcoding datasets. In this study, we modify this method to analyze multi-gene datasets for plant DNA barcoding. The modified method includes an adjustable-weighted algorithm for the vector distance, based on the ratio of the sequence lengths of the candidate genes for each pair of taxa.
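One possible reading of the length-based weighting is sketched below: per-gene distances for a pair of taxa are combined into a single distance, with each gene weighted by its share of the pair's total sequence length. The abstract does not give the actual formula, so the weighting rule, the function name `weighted_distance`, and the input layout are all assumptions for illustration only.

```python
def weighted_distance(
    gene_distances: dict[str, float],
    gene_lengths: dict[str, tuple[int, int]],
) -> float:
    """Combine per-gene distances for one pair of taxa into a single distance.

    gene_distances[g] is the distance between the two taxa for gene g.
    gene_lengths[g] is (length in taxon A, length in taxon B) for gene g.
    Assumed rule: each gene's weight is its share of the pair's total length.
    """
    pair_lengths = {g: sum(gene_lengths[g]) for g in gene_distances}
    total = sum(pair_lengths.values())
    if total == 0:
        raise ValueError("at least one gene must have non-zero length")
    return sum(gene_distances[g] * pair_lengths[g] / total for g in gene_distances)
```

Under this assumed rule, a longer gene such as matK contributes more to the combined distance than a shorter one, and the weights adapt to each pair of taxa because they depend on that pair's sequence lengths.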

Methodology/Principal Findings

Three datasets were tested: a matK+rbcL dataset with 2,083 sequences, a matK+rbcL dataset with 397 sequences, and a matK+rbcL+trnH-psbA dataset with 397 sequences. The success rates of grouping sequences at the genus/species level with the modified CV approach were always higher than those with the traditional K2P/NJ method. For the matK+rbcL datasets, the modified CV approach outperformed the K2P/NJ approach by 7.9% in both the 2,083-sequence and 397-sequence datasets; for the matK+rbcL+trnH-psbA dataset, it outperformed the traditional approach by 16.7%.

Conclusions

We conclude that the modified CV approach is an efficient method for analyzing large multi-gene datasets for plant DNA barcoding. Source code, implemented in C++ and supported on MS Windows, is freely available for download at http://math.xtu.edu.cn/myphp/math/research/source/Barcode_source_codes.zip.

12.

Background

Neurocysticercosis is a disease caused by the oral ingestion of eggs of the human parasitic worm Taenia solium. Although drugs are available, their use is controversial because of side effects and poor efficiency. An expressed sequence tag (EST) library describes the gene expression profile and mRNA sequences of a specific organism and stage. Such information can be used to find new targets for drug development and to better understand the biology of the parasite.

Methods and Findings

Here, an EST library consisting of 5760 sequences from the pig cysticerca stage was constructed. The library contained 1650 unique sequences, of which 845 (52%) were novel to T. solium and not identified in other EST libraries. Furthermore, 918 sequences (55%) were of unknown function. Among the 25 most frequently expressed sequences, 6 had no relevant similarity to other sequences in the GenBank NR DNA database. Putative signal peptides were also predicted, and 4 of the 25 were predicted to carry a signal peptide. The proposed vaccine and diagnostic targets T24, Tsol18/HP6, and Tso31d were also identified among the 25 most frequently expressed sequences.

Conclusions

An EST library has been produced from pig cysticerca and analyzed. More than half of the distinct ESTs sequenced had no suggested function, and 845 novel EST sequences were identified. The library increases knowledge of which genes are expressed and at what level. It can also support research in areas such as drug and diagnostic development and parasite fitness, for example via immune modulation.

15.
Lai D, Li H, Fan S, Song M, Pang C, Wei H, Liu J, Wu D, Gong W, Yu S. PLoS ONE, 2011, 6(12): e28676

Background

Upland cotton, Gossypium hirsutum L., is one of the world's most important economic crops. In the absence of the entire genomic sequence, a large number of expressed sequence tag (EST) resources of upland cotton have been generated and used in several studies. However, information about the flower development of this species is rare.

Methodology/Principal Findings

To clarify the molecular mechanism of flower development in upland cotton, 22,915 high-quality ESTs were generated from a normalized, full-length cDNA library constructed from pooled RNA isolated from shoot apexes, squares, and flowers, and were assembled into 14,373 unique sequences consisting of 4,563 contigs and 9,810 singletons. Comparative analysis indicated that 5,352 unique sequences had no high-degree matches in the public cotton database. Functional annotation identified several upland cotton homologs of flowering-related genes in our library. The majority of these genes were specifically expressed in flowering-related tissues. Three GhSEP (G. hirsutum L. SEPALLATA) genes that determine floral organ development were cloned, and quantitative real-time PCR (qRT-PCR) revealed that these genes were expressed preferentially in squares or flowers. Furthermore, 670 new putative microsatellites with flanking sequences sufficient for primer design were identified in 645 unigenes. Twenty-five EST-derived simple sequence repeats were randomly selected for validation and transferability testing in 17 Gossypium species. Of these, 23 were identified as true-to-type simple sequence repeat loci and were highly transferable among Gossypium species.
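A microsatellite search of the kind mentioned above can be sketched with a regular expression that looks for perfect tandem repeats of a short motif. The motif lengths, the minimum repeat count, and the function name `find_ssrs` are illustrative assumptions; the abstract does not state which tool or thresholds were used, and flanking-sequence checks for primer design are not covered.

```python
import re


def find_ssrs(
    seq: str,
    min_repeats: int = 5,
    motif_lengths: tuple[int, ...] = (2, 3),
) -> list[tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq."""
    seq = seq.upper()
    hits = []
    for m in motif_lengths:
        # A motif of length m followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (m, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs made of one repeated base (e.g. "AA"): these are
            # mononucleotide runs, not di- or trinucleotide repeats.
            if len(set(motif)) == 1:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // m))
    return sorted(hits)
```

For example, `find_ssrs("GGCC" + "AT" * 6 + "GGCC")` reports a single AT repeat of six units starting at position 4.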

Conclusions/Significance

A high-quality, normalized, full-length cDNA library with a total of 14,373 unique ESTs was generated to provide sequence information for gene discovery and marker development related to upland cotton flower development. These EST resources form a valuable foundation for gene expression profiling analysis, functional analysis of newly discovered genes, genetic linkage, and quantitative trait loci analysis.

16.
He WY, Rao ZC, Zhou DH, Zheng SC, Xu WH, Feng QL. PLoS ONE, 2012, 7(3): e33621

Background

Of a total of 3,081 assembled expressed sequence tag (EST) sequences, representing 6,815 high-quality ESTs identified in three cDNA libraries constructed with RNA isolated from the midgut of Spodoptera litura, 1,039 ESTs showed significant hits and 1,107 ESTs did not show significant hits in BLAST searches. It is of interest to clarify whether the ESTs that did not show hits are functional in S. litura.

Results

Twenty “no-hit” ESTs containing at least one putative open reading frame were selected for further expression analysis. Northern blot analysis showed that six of the selected ESTs are expressed in the larval midgut of this insect at different levels, suggesting that these ESTs represent true mRNA products, whereas the other 14 ESTs could not be detected. Homologues of the four larval midgut-predominant genes (Slmg2, Slmg7, Slmg9 and Slmg17) were detected in the genomes of other lepidopteran insects but not in Drosophila melanogaster. A novel gene, Slmg7, is expressed at a high level specifically in the midgut during each of the larval stages. Slmg7 is a single-copy gene and encodes a 143-amino-acid protein. The SLMG7 protein was localized to the cytoplasm of Spli-221 cells.
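Screening sequences for a putative open reading frame can be sketched as a scan of the three forward reading frames for an ATG followed in frame by a stop codon. The minimum length, the restriction to the forward strand, and the function name `find_orfs` are assumptions for illustration; the abstract does not describe how ORFs were predicted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs, with end exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon, and must contain at least `min_codons` codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

A fuller screen would also scan the reverse complement, since an EST may be sequenced from either strand.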

Conclusions

Six ESTs from the no-hit list are transcribed into mRNA and are mainly expressed in the midgut of S. litura. Slmg7 is a novel gene whose protein product localizes to the cytoplasm.

17.

Background

Phytophthora infestans (Mont.) de Bary causes late blight of potato and tomato, and has a broad host range within the Solanaceae family. Most studies of the Phytophthora – Solanum pathosystem have focused on gene expression in the host and have not analyzed pathogen gene expression in planta.

Methodology/Principal Findings

We describe in detail an in silico approach to mine ESTs from inoculated host plants deposited in a database in order to identify particular pathogen sequences associated with disease. We identified candidate effector genes through mining of 22,795 ESTs corresponding to P. infestans cDNA libraries in compatible and incompatible interactions with hosts from the Solanaceae family.

Conclusions/Significance

We annotated genes of P. infestans expressed in planta associated with late blight using different approaches and assigned putative functions to 373 out of the 501 sequences found in the P. infestans genome draft, including putative secreted proteins, domains associated with pathogenicity and poorly characterized proteins ideal for further experimental studies. Our study provides a methodology for analyzing cDNA libraries and provides an understanding of the plant – oomycete pathosystems that is independent of the host, condition, or type of sample by identifying genes of the pathogen expressed in planta.
