20 similar articles found. Search took 15 ms.
1.
Background
We apply a relatively new machine learning method, the Support Vector Machine (SVM), to predict protein structural class. The SVM method is applied to a database derived from SCOP, in which protein domains are classified on the basis of known structures, their evolutionary relationships, and the principles that govern their 3-D structure.

2.
Background
The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods have been designed to annotate sequences on a large scale. However, these methods can either only be applied to specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions.

Results
We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVMs) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good-quality annotation, and additionally we assigned a confidence value to each predicted GO term.

Conclusions
We present a complete automated annotation system that overcomes many of the usual problems by applying the controlled vocabulary of the Gene Ontology and an established classification method to large and well-described sequence data sets. In a case study, the function of Xenopus laevis contig sequences was predicted; the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus.

3.
4.
5.
SUMMARY: We describe an open source library written in the R programming language for Medline literature data mining. This MedlineR library includes programs to query Medline through the NCBI PubMed database; to construct the co-occurrence matrix; and to visualize the network topology of query terms. The open source nature of this library allows users to extend it freely in the statistical programming language of R. To demonstrate its utility, we have built an application to analyze term-association by using only 10 lines of code. We provide MedlineR as a library foundation for bioinformaticians and statisticians to build more sophisticated literature data mining applications. AVAILABILITY: The library is available from http://dbsr.duke.edu/pub/MedlineR.
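MedlineR itself is an R library; purely as an illustration of the term co-occurrence matrix it constructs, here is a minimal Python sketch. The gene names and abstract texts below are invented, and no PubMed query is performed:

```python
from itertools import combinations

def cooccurrence_matrix(documents, terms):
    """Count how often each pair of query terms appears in the same document.

    `documents` is a list of abstract strings; `terms` the query terms.
    Returns a dict-of-dicts: matrix[a][b] = number of co-mentions.
    """
    matrix = {a: {b: 0 for b in terms} for a in terms}
    for doc in documents:
        text = doc.lower()
        present = [t for t in terms if t.lower() in text]
        for a, b in combinations(present, 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix

# Invented stand-ins for fetched PubMed abstracts.
docs = [
    "TP53 regulates apoptosis and interacts with MDM2.",
    "MDM2 is a negative regulator of TP53.",
    "BRCA1 is involved in DNA repair.",
]
m = cooccurrence_matrix(docs, ["TP53", "MDM2", "BRCA1"])
print(m["TP53"]["MDM2"])  # 2
```

In MedlineR the matrix would be built from real PubMed query results and then passed on to the network-visualization step.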
6.
Rebecca F Halperin, Phillip Stafford, Jack S Emery, Krupa Arun Navalkar, Stephen Albert Johnston. BMC Bioinformatics 2012, 13(1):1
Background
Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task.

7.
Kulkarni AV, Williams NS, Lian Y, Wren JD, Mittelman D, Pertsemlidis A, Garner HR. Bioinformatics (Oxford, England) 2002, 18(11):1410-1417
ARROGANT (ARRay OrGANizing Tool) is a software tool developed to facilitate the identification, annotation and comparison of large collections of genes or clones. The objective is to enable users to compile gene/clone collections from different databases, allowing them to design experiments and analyze the collections as well as associated experimental data efficiently. ARROGANT can relate different sequence identifiers to their common reference sequence using the UniGene database, allowing for the comparison of data from two different microarray experiments. ARROGANT has been successfully used to analyze microarray expression data for colon cancer, to compile genes potentially related to cardiac diseases for subsequent resequencing (to identify single nucleotide polymorphisms, SNPs), to design a new comprehensive human cDNA microarray for cancer, to combine and compare expression data generated by different microarrays and to provide annotation for genes on custom and Affymetrix chips.
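The identifier reconciliation that ARROGANT performs can be pictured as a join through a shared cluster table. The sketch below is an illustrative Python analog, not ARROGANT's actual code; the probe IDs and UniGene-style cluster names are invented:

```python
def map_to_clusters(probe_ids, probe_to_cluster):
    """Translate platform-specific probe IDs to shared cluster IDs (None if unmapped)."""
    return {p: probe_to_cluster.get(p) for p in probe_ids}

def comparable_genes(platform_a, platform_b, probe_to_cluster):
    """Cluster IDs measured on both platforms, i.e. directly comparable genes."""
    a = {probe_to_cluster[p] for p in platform_a if p in probe_to_cluster}
    b = {probe_to_cluster[p] for p in platform_b if p in probe_to_cluster}
    return a & b

# Hypothetical lookup table; in ARROGANT this role is played by UniGene.
lookup = {"A_001": "Hs.1", "A_002": "Hs.2", "B_010": "Hs.1", "B_011": "Hs.3"}
shared = comparable_genes(["A_001", "A_002"], ["B_010", "B_011"], lookup)
print(shared)  # {'Hs.1'}
```

Once both experiments are expressed in the shared cluster namespace, their expression values can be combined or compared row by row.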
8.
Background
Exogenous short interfering RNAs (siRNAs) induce a gene knockdown effect in cells by interacting with naturally occurring RNA processing machinery. However, not all siRNAs induce this effect equally. Several heterogeneous kinds of machine learning techniques and feature sets have been applied to modeling siRNAs and their ability to induce knockdown. There is growing agreement as to which techniques produce maximally predictive models, yet there is little consensus on methods for comparing among predictive models. There are also few comparative studies that address the effect that the choice of learning technique, feature set or cross-validation approach has on finding and discriminating among predictive models.

Principal Findings
Three learning techniques were used to develop predictive models for effective siRNA sequences: Artificial Neural Networks (ANNs), General Linear Models (GLMs) and Support Vector Machines (SVMs). Five feature mapping methods were also used to generate models of siRNA activities. The two factors of learning technique and feature mapping were evaluated by a complete 3×5 factorial ANOVA. Overall, both learning technique and feature mapping contributed significantly to the observed variance in predictive models, but to differing degrees for precision and accuracy, as well as across different kinds and levels of model cross-validation.

Conclusions
The methods presented here provide a robust statistical framework for comparing among models developed under distinct learning techniques and feature sets for siRNAs. Further comparisons among current or future modeling approaches should apply these or other statistically equivalent methods to critically evaluate the performance of proposed models. The ANN and GLM techniques tend to be more sensitive to the inclusion of noisy features, whereas the SVM technique is more robust under large numbers of features for measures of model precision and accuracy. The features found to result in maximally predictive models are not consistent across learning techniques, suggesting that care should be taken in the interpretation of feature relevance. In the models developed here, there are statistically differentiable combinations of learning techniques and feature mapping methods, where the SVM technique under a specific combination of features significantly outperforms all of the best combinations of features within the ANN and GLM techniques.

9.
Lee Tzong-Yi, Chang Wen-Chi, Hsu Justin Bo-Kai, Chang Tzu-Hao, Shien Dray-Ming. BMC Genomics 2012, 13(1):1-12
Background
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library-preparation procedures that employ PCR amplification have been shown to cause uneven read coverage, particularly across AT- and GC-rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low-quantity starting material and tolerant to extremely high AT content sequences.

Results
We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of the sequence data generated, we show that our optimized conditions, which involve a PCR additive (TMAC), produce amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC-neutral templates.

Conclusion
We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity of either extreme of base composition. This development will greatly benefit the sequencing of clinical samples, which often require amplification due to the low mass of starting DNA.

10.
11.
Haiyang Chen, Yanguo Teng, Jinsheng Wang, Liuting Song, Rui Zuo. Biological Trace Element Research 2013, 151(3):462-470
In this study, positive matrix factorization (PMF) combined with support vector machines (SVMs) was used to identify possible sources of, and apportion contributions to, trace element pollution in surface sediments from the Jinjiang River, Southeastern China. Using diagnostic tools, four significant factors were extracted from sediment samples collected in December 2010 at 15 different sites. By treating source identification as a pattern recognition problem, the factor loadings derived from PMF were classified by SVM classifiers that had been trained and validated with fingerprints of eight potential source categories. Using SVM, industrial wastewater from lead ore mining and metal handicraft manufacture, atmospheric deposition, and natural background were identified as the main sources of trace element pollution in surface sediments from the Jinjiang River; this was confirmed by visually comparing compound patterns and the differences between the predicted and actual fractional compositions. Apportionment results showed that lead ore mining made the largest contribution (33.62%), followed by atmospheric deposition (30.99%), metal handicraft manufacture (30.09%), and natural background (5.29%).
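The two-stage receptor-modeling idea above (factorize the data matrix, then classify each factor profile against known source fingerprints) can be sketched as follows. This is not the authors' code: scikit-learn's NMF stands in for PMF, and the data, fingerprints and source labels are randomly generated for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic sediment data: 15 sites x 8 trace elements (non-negative).
X = rng.uniform(0.1, 1.0, size=(15, 8))

# Stage 1: factor the data matrix (NMF as a stand-in for PMF).
nmf = NMF(n_components=4, init="random", random_state=0, max_iter=1000)
contributions = nmf.fit_transform(X)   # 15 x 4 per-site factor contributions
profiles = nmf.components_             # 4 x 8 factor profiles (element patterns)

# Stage 2: classify each factor profile against known source fingerprints.
# The fingerprint matrix and labels below are invented.
fingerprints = rng.uniform(0.1, 1.0, size=(8, 8))
labels = ["mining", "manufacture", "deposition", "background",
          "mining", "manufacture", "deposition", "background"]
clf = SVC(kernel="rbf").fit(fingerprints, labels)
sources = clf.predict(profiles)
print(sources)
```

With real data, the contribution matrix would then be summed per identified source to give the apportionment percentages reported in the abstract.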
12.
13.
14.
15.
Transient gene expression in suspension HEK-293 cells: application to large-scale protein production
Baldi L, Muller N, Picasso S, Jacquet R, Girard P, Thanh HP, Derow E, Wurm FM. Biotechnology Progress 2005, 21(1):148-153
Recent advances in genomics, proteomics, and structural biology have raised a general need for significant amounts of pure recombinant protein (r-protein). Because proper protein folding is in some cases difficult to obtain in bacteria, several methods have been established to produce large amounts of r-proteins by transgene expression in mammalian cells. We have developed three nonviral DNA transfer protocols for suspension-adapted HEK-293 and CHO cells: (1) a calcium phosphate-based method (Ca-Pi), (2) a calcium-mediated method called Calfection, and (3) a polyethylenimine-based method (PEI). The first two methods have already been scaled up to 14 L and 100 L for HEK-293 cells in bioreactors. The third method, entirely serum-free, has been successfully applied to both suspension-adapted CHO and HEK-293 cells. We describe here the application of this technology to the transient expression, in suspension-cultivated HEK-293 EBNA cells, of a subset of more than 20 secreted r-proteins, including antibodies, dimeric proteins, and tagged proteins of varying complexity. Most of the proteins were expressed from different plasmid vectors within 5-10 days of the DNA becoming available. Transfections were successfully performed from small scale (1 mL in 12-well microtiter plates) up to the 2 L scale. The results reported made it possible to establish an optimized cell culture and transfection protocol that minimizes batch-to-batch variation in protein expression. The work presented here demonstrates the applicability and robustness of transient transfection technology for the expression of a variety of recombinant proteins.
16.
A novel tool for computer-aided design of single-site mutations in proteins and peptides is presented. It proceeds by performing in silico all possible point mutations in a given protein or protein region and estimating the stability changes with linear combinations of database-derived potentials, whose coefficients depend on the solvent accessibility of the mutated residues. Upon completion, it yields a list of the most stabilizing, destabilizing or neutral mutations. This tool is applied to mouse, hamster and human prion proteins to identify the point mutations that are the most likely to stabilize their cellular form. The selected mutations are essentially located in the second helix, which presents an intrinsic preference to form beta-structures, with the best mutations being T183-->F, T192-->A and Q186-->A. The T183 mutation is predicted to be by far the most stabilizing one, but should be considered with care as it blocks the glycosylation of N181 and this blockade is known to favor the cellular to scrapie conversion. Furthermore, following the hypothesis that the first helix might induce the formation of hydrophilic beta-aggregates, several mutations that are neutral with respect to the structure's stability but improve the helix hydrophobicity are selected, among which is E146-->L. These mutations are intended as good candidates to undergo experimental tests.
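The stability estimator described above, a linear combination of database-derived potential terms with coefficients that depend on solvent accessibility, can be sketched as follows. The coefficients and potential values here are invented for illustration; they are not the tool's published parameters:

```python
def ddg_estimate(potential_changes, accessibility, coeff_buried, coeff_exposed):
    """Stability change as a linear combination of potential terms.

    Coefficients interpolate between a 'buried' and an 'exposed' set
    according to the mutated residue's solvent accessibility (0..1).
    """
    w = [(1 - accessibility) * b + accessibility * e
         for b, e in zip(coeff_buried, coeff_exposed)]
    return sum(wi * dp for wi, dp in zip(w, potential_changes))

# Three hypothetical potential terms (e.g. torsion, distance, accessibility
# potentials), each giving a change upon mutation.
dP = [0.8, -0.3, 0.5]
ddg = ddg_estimate(dP, accessibility=0.2,
                   coeff_buried=[1.0, 0.5, 0.2],
                   coeff_exposed=[0.3, 0.9, 0.6])
print(ddg)
```

Ranking all candidate point mutations by such an estimate, most stabilizing to most destabilizing, yields the mutation list the tool reports.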
17.
Obika S, Yu W, Shimoyama A, Uneda T, Miyashita K, Doi T, Imanishi T. Bioorganic & Medicinal Chemistry 2001, 9(2):245-254
Cationic triglycerides 1Aa-1Cb, which have a symmetrical structure, were efficiently synthesized and formulated into cationic liposomes with the co-lipid dioleoylphosphatidylethanolamine (DOPE) and/or dilauroylphosphatidylcholine (DLPC). A plasmid encoding luciferase was delivered into CHO cells using these cationic liposomes. Our symmetrical cationic triglycerides showed high transfection activity when DOPE was used as the co-lipid. Among the symmetrical cationic triglycerides synthesized here, 1Ab and 1Ac, which have an oleoyl group at the 1- and 3-positions of the glycerol backbone and a relatively long linker connecting the 2-hydroxy group of glycerol with the quaternary ammonium head group, were found to be the most suitable for gene delivery into cells. The transfection activity of the symmetrical cationic triglyceride 1Ab was comparable with that of its asymmetrical congener 6 and several times higher than that of Lipofectin.
18.
The trend toward high-throughput techniques in molecular biology and the explosion of online scientific data threaten to overwhelm the ability of researchers to take full advantage of available information. This problem is particularly severe in the rapidly expanding area of gene expression experiments, for example, those carried out with cDNA microarrays or oligonucleotide chips. We present an Internet-based hypertext program, MedMiner, which filters and organizes large amounts of textual and structured information returned from public search engines like GeneCards and PubMed. We demonstrate the value of the approach for the analysis of gene expression data, but MedMiner can also be extended to other areas involving molecular genetic or pharmacological information. More generally still, MedMiner can be used to organize the information returned from any arbitrary PubMed search.
19.
20.
A report of the 6th Georgia Tech-Oak Ridge National Lab International Conference on Bioinformatics, 'In silico Biology: Gene Discovery and Systems Genomics', Atlanta, USA, 15-17 November 2007.

Technological developments have had a profound impact on biology during the past decade, spectacularly augmenting our ability to survey and interrogate biological phenomena. In particular, they have increased the capacity for data generation by several orders of magnitude and made computation a necessary partner of biology. The sixth meeting in the biennial series of bioinformatics conferences co-sponsored by the Georgia Institute of Technology in Atlanta and the Oak Ridge National Laboratory addressed the challenges that this technology-driven avalanche of data poses to bioinformatics - increasing the complexity of longstanding problems and creating new ones.