1.
Proteins do not carry out their functions alone. Instead, they often act by participating in macromolecular complexes and play different functional roles depending on the other members of the complex. It is therefore interesting to identify co-complex relationships. Although protein complexes can be identified in a high-throughput manner by experimental technologies such as affinity purification coupled with mass spectrometry (APMS), these large-scale datasets often suffer from high false positive and false negative rates. Here, we present a computational method that predicts co-complexed protein pair (CCPP) relationships using kernel methods from heterogeneous data sources. We show that a diffusion kernel based on random walks on the full network topology yields good performance in predicting CCPPs from protein interaction networks. In the setting of direct ranking, a diffusion kernel performs much better than the mutual clustering coefficient. In the setting of SVM classifiers, a diffusion kernel performs much better than a linear kernel. We also show that combining complementary information improves the performance of our CCPP recognizer. A summation of three diffusion kernels based on two-hybrid, APMS, and genetic interaction networks and three sequence kernels achieves better performance than the sequence kernels or diffusion kernels alone. Inclusion of additional features achieves a still better ROC(50) of 0.937. Assuming a negative-to-positive ratio of 600:1, the final classifier achieves 89.3% coverage at an estimated false discovery rate of 10%. Finally, we applied our prediction method to two recently described APMS datasets. We find that our predicted positives are highly enriched with CCPPs that are identified by both datasets, suggesting that our method successfully identifies true CCPPs. An SVM classifier trained from heterogeneous data sources provides accurate predictions of CCPPs in yeast. This computational approach thereby provides an inexpensive means of identifying protein complexes that extends and complements high-throughput experimental data.
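The coverage and false discovery rate quoted in the abstract above depend on the assumed class imbalance. The abstract does not say how the rate was estimated; the following is a minimal sketch that assumes the standard relation between true positive rate, false positive rate, and the negative-to-positive ratio. The function names `fdr` and `fpr_needed` are illustrative, not taken from the paper.

```python
def fdr(tpr, fpr, neg_to_pos):
    """Expected false discovery rate when negatives outnumber positives
    by neg_to_pos to 1 (standard relation; an assumption, not the paper's code)."""
    false_pos = neg_to_pos * fpr  # expected false positives per actual positive pair
    return false_pos / (false_pos + tpr)


def fpr_needed(tpr, target_fdr, neg_to_pos):
    """False positive rate that yields target_fdr at the given coverage (tpr)."""
    return tpr * target_fdr / ((1.0 - target_fdr) * neg_to_pos)


# Figures quoted in the abstract: 89.3% coverage, 10% FDR, 600:1 ratio.
required = fpr_needed(0.893, 0.10, 600)
print(f"required FPR: {required:.2e}")
print(f"check FDR:    {fdr(0.893, required, 600):.3f}")
```

Under these assumptions the classifier would have to keep its false positive rate below roughly two in ten thousand, which illustrates why a heavily skewed ratio makes a 10% false discovery rate demanding.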
2.
Continuing improvements in DNA sequencing technologies are providing us with vast amounts of genomic data from an ever-widening range of organisms. The resulting challenge for bioinformatics is to interpret this deluge of data and place it back into its biological context. Biological networks provide a conceptual framework with which we can describe part of this context, namely the different interactions that occur between the molecular components of a cell. Here, we review the computational methods available to predict biological networks from genomic sequence data and discuss how they relate to high-throughput experimental methods.
3.
Francesc Sardà-Palomera, Lluís Brotons, Dani Villero, Henk Sierdsema, Stuart E. Newson, Frédéric Jiguet 《Biodiversity and Conservation》2012,21(11):2927-2948
Field monitoring can vary from simple volunteer opportunistic observations to professional standardised monitoring surveys, leading to a trade-off between data quality and data collection costs. Such variability in data quality may result in biased predictions obtained from species distribution models (SDMs). We aimed to identify the limitations of different monitoring data sources for developing species distribution maps and to evaluate their potential for spatial data integration in a conservation context. Using Maxent, SDMs were generated from three different bird data sources in Catalonia, which differ in the degree of standardisation and available sample size. In addition, an alternative approach for modelling species distributions was applied, which combined the three data sources at a large spatial scale and then downscaled the result to the required resolution. Finally, SDM predictions were used to identify species richness and high quality areas (hotspots) from different treatments. Models were evaluated by using high quality Atlas information. We show that both sample size and the survey methodology used to collect the data are important in delivering robust information on species distributions. Models based on standardised monitoring provided higher accuracy with a lower sample size, especially when modelling common species. Accuracy of models from opportunistic observations substantially increased when modelling uncommon species, giving similar accuracy to a more standardised survey. Although downscaling data through an SDM approach appears to be a useful tool in cases of data shortage or low data quality and heterogeneity, it will tend to overestimate species distributions. In order to identify distributions of species, data with different quality may be appropriate. However, to identify biodiversity hotspots, high quality information is needed.
4.
Dramatic improvements in high throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function than proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross validation study we show that, at 50% precision, the integrated model increases recall by 18% compared to using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement vs. PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
5.
6.
Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data (total citations: 5; self-citations: 0; citations by others: 5)
MOTIVATION: DNA microarray data analysis has been used previously to identify marker genes which discriminate cancer from normal samples. However, due to the limited sample size of each study, there are few common markers among different studies of the same cancer. With the rapid accumulation of microarray data, it is of great interest to integrate inter-study microarray data to increase sample size, which could lead to the discovery of more reliable markers. RESULTS: We present a novel, simple method of integrating different microarray datasets to identify marker genes and apply the method to prostate cancer datasets. In this study, by applying a new statistical method, referred to as the top-scoring pair (TSP) classifier, we have identified a pair of robust marker genes (HPN and STAT6) by integrating microarray datasets from three different prostate cancer studies. Cross-platform validation shows that the TSP classifier built from the marker gene pair, which simply compares relative expression values, achieves high accuracy, sensitivity and specificity on independent datasets generated using various array platforms. Our findings suggest a new model for the discovery of marker genes from accumulated microarray data and demonstrate how the great wealth of microarray data can be exploited to increase the power of statistical analysis. CONTACT: leixu@jhu.edu.
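The abstract above describes the TSP classifier only as comparing relative expression values of a gene pair. A minimal sketch of that idea, assuming the usual top-scoring-pair scoring rule (the pair whose ordering frequency differs most between the two classes), might look like this. The function names and the toy expression values are hypothetical and are not data from the study.

```python
from itertools import combinations


def train_tsp(samples, labels):
    """Pick the gene pair (i, j) whose ordering x[i] < x[j] best separates
    class 0 from class 1. Returns (i, j, class predicted when x[i] < x[j]).
    Assumes both classes have at least one sample."""
    groups = {c: [s for s, y in zip(samples, labels) if y == c] for c in (0, 1)}
    best = None
    for i, j in combinations(range(len(samples[0])), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p = {c: sum(s[i] < s[j] for s in g) / len(g) for c, g in groups.items()}
        score = abs(p[0] - p[1])
        if best is None or score > best[0]:
            best = (score, i, j, 0 if p[0] > p[1] else 1)
    _, i, j, class_if_less = best
    return i, j, class_if_less


def predict_tsp(model, sample):
    """Classify one sample by the relative order of the two selected genes."""
    i, j, class_if_less = model
    return class_if_less if sample[i] < sample[j] else 1 - class_if_less


# Made-up expression values for three genes (columns) in four samples (rows).
toy_samples = [[1, 5, 3], [2, 6, 3], [5, 1, 3], [6, 2, 4]]
toy_labels = [0, 0, 1, 1]
model = train_tsp(toy_samples, toy_labels)
print(model, predict_tsp(model, [9.0, 0.5, 2.0]))
```

Because only the ordering within a sample matters, a rule of this kind is unaffected by any monotone rescaling of a whole array, which is one plausible reason it transfers across platforms.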
7.
Protein-protein interaction (PPI) prediction is a central task in achieving a better understanding of cellular and intracellular processes. Because high-throughput experimental methods are both expensive and time-consuming, and are also known to suffer from incompleteness and noise, many computational methods have been developed, with varying degrees of success. However, the inference of PPI networks from multiple heterogeneous data sources remains a great challenge. In this work, we developed a novel method based on approximate Bayesian computation and modified differential evolution sampling (ABC-DEP) and a regularized Laplacian (RL) kernel. The method enables inference of PPI networks from topological properties and multiple heterogeneous features, including gene expression and Pfam domain profiles, in the form of weighted kernels. The optimal weights are obtained by ABC-DEP, and the kernel fusion built from the optimal weights serves as input to RL to infer missing or new edges in the PPI network. Detailed comparisons with control methods have been made, and the results show that the accuracy of PPI prediction measured by AUC is increased by up to 23% compared to a baseline without optimal weights. The method can provide insights into the relations between PPIs and various feature kernels and demonstrates a strong capability for predicting faraway interactions that cannot be well detected by the traditional RL method.
8.
BACKGROUND: Pathway analysis of a set of genes represents an important area in large-scale omic data analysis. However, the application of traditional pathway enrichment methods to next-generation sequencing (NGS) data is prone to several potential biases, including genomic/genetic factors (e.g., the particular disease and gene length) and environmental factors (e.g., personal life-style and frequency and dosage of exposure to mutagens). Therefore, novel methods are urgently needed for these new data types, especially for individual-specific genome data. METHODOLOGY: In this study, we proposed a novel method for the pathway analysis of NGS mutation data by explicitly taking into account the gene-wise mutation rate. We estimated the gene-wise mutation rate based on the individual-specific background mutation rate along with the gene length. Taking the mutation rate as a weight for each gene, our weighted resampling strategy builds the null distribution for each pathway while matching the gene length patterns. The empirical P value obtained then provides an adjusted statistical evaluation. PRINCIPAL FINDINGS/CONCLUSIONS: We applied our weighted resampling method to a lung adenocarcinoma dataset and a glioblastoma dataset, and compared it to other widely applied methods. By explicitly adjusting for gene length, the weighted resampling method performs as well as the standard methods for significant pathways with strong evidence. Importantly, our method could effectively reject many marginally significant pathways detected by standard methods, including several long-gene-based, cancer-unrelated pathways. We further demonstrated that by reducing such biases, pathway crosstalk for each individual and pathway co-mutation map across multiple individuals can be objectively explored and evaluated. This method performs pathway analysis in a sample-centered fashion, and provides an alternative way for accurate analysis of cancer-personalized genomes. It can be extended to other types of genomic data (genotyping and methylation) that have similar bias problems.
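The weighted resampling idea in the abstract above can be illustrated with a short sketch. It is not the authors' procedure: it assumes each gene already has a positive weight standing in for its mutation rate, draws random gene sets of the pathway's size with probability tied to those weights, and reports an empirical P value. The explicit gene-length matching step is omitted, and all names (`weighted_sample`, `empirical_p`, the gene labels) are hypothetical.

```python
import random


def weighted_sample(genes, weights, k, rng):
    """Draw k distinct genes, favouring high weights (Efraimidis-Spirakis keys).
    All weights must be positive."""
    keyed = [(rng.random() ** (1.0 / weights[g]), g) for g in genes]
    keyed.sort(reverse=True)
    return [g for _, g in keyed[:k]]


def empirical_p(pathway, mutated, weights, n_resamples=1000, seed=0):
    """Empirical P value for the number of mutated genes in a pathway,
    against a null built from weight-biased random gene sets."""
    rng = random.Random(seed)
    genes = sorted(weights)
    observed = len(set(pathway) & set(mutated))
    hits = 0
    for _ in range(n_resamples):
        null_set = weighted_sample(genes, weights, len(pathway), rng)
        if len(set(mutated).intersection(null_set)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up weights: g1 and g2 stand in for long, frequently mutated genes.
toy_weights = {"g1": 5.0, "g2": 4.0, "g3": 1.0, "g4": 1.0, "g5": 0.5, "g6": 0.5}
print(empirical_p(["g1", "g2"], {"g1", "g2", "g3"}, toy_weights))
```

Because heavily weighted genes are drawn more often, a pathway made up mostly of such genes faces a stricter null than it would under uniform resampling, which is the intuition behind rejecting long-gene-based pathways.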
9.
10.
Knowing the effects of climate and habitat on the distributions of insect pests and their natural enemies would help target the search for natural enemies, increase establishment of intentional introductions, and improve risk assessment for accidental introductions and the effects of climate change. Most existing methods used to predict geographical distributions of insects either involve subjective comparisons of climate or require data concerning insect responses to climate. Here we have used geographical distributions of insects to develop statistical models for the effects of climate and habitat on these distributions. We tested this approach using six insect pests found in the United States: Ostrinia nubilalis (European corn borer), Diuraphis noxia (Russian wheat aphid), Helicoverpa zea (Corn earworm), Leptinotarsa decemlineata (Colorado potato beetle), Solenopsis invicta (Red imported fire ant), and Conotrachelus nenuphar (Plum curculio). By randomly separating the data into model-building and test sets, we were able to estimate prediction accuracy. For each species, a unique combination of predictor variables was identified. The models correctly predicted presence for more than 92% of the data on each insect species. The models correctly predicted absence for 59% to 77% of the data on five of six species. Absence predictions were poor for H. zea (21% correct), because distribution data were limited and inaccurate. Predictions of insect absence were more difficult because absence data were less abundant and perhaps less reliable. This approach offers potential for the analysis of existing data to produce predictions about insect establishment. However, accurate prediction depends heavily on data quality, and in particular, more data are needed from locations where insects are sampled but not found.
11.
Finding edging genes from microarray data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: A set of genes and their gene expression levels are used to classify disease and normal tissues. Due to the massive number of genes in a microarray, there are a large number of edges to divide different classes of genes in microarray space. The edging genes (EGs) can be co-regulated genes; they can also be on the same pathway or deregulated by the same non-coding genes, such as siRNA or miRNA. Every gene in EGs is vital for identifying a tissue's class. A change in one EG's expression may cause a tissue alteration from normal to disease and vice versa. Finding EGs is therefore of biological importance. In this work, we propose an algorithm to effectively find these EGs. RESULT: We tested our algorithm with five microarray datasets. The results are compared with the border-based algorithm which was used to find gene groups and subsequently divide different classes of tissues. Our algorithm finds a significantly larger number of EGs than does the border-based algorithm. As our algorithm prunes irrelevant patterns at earlier stages, its time and space complexities are much lower than those of the border-based algorithm. AVAILABILITY: The proposed algorithm is implemented in C++ on the Linux platform. The EGs in five microarray datasets are calculated. The preprocessed datasets and the discovered EGs are available at http://www3.it.deakin.edu.au/~phoebe/microarray.html.
12.
The qualitative dimension of gene expression data and its heterogeneous nature in cancerous specimens can be accounted for by phylogenetic modeling that incorporates the directionality of altered gene expressions, complex patterns of expressions among a group of specimens, and data-based rather than specimen-based gene linkage. Our phylogenetic modeling approach is a double algorithmic technique that includes polarity assessment that brings out the qualitative value of the data, followed by maximum parsimony analysis that is most suitable for the data heterogeneity of cancer gene expression. We demonstrate that polarity assessment of expression values into derived and ancestral states, via outgroup comparison, reduces experimental noise; reveals dichotomously expressed asynchronous genes; and allows data pooling as well as comparability of intra- and interplatforms. Parsimony phylogenetic analysis of the polarized values produces a multidimensional classification of specimens into clades that reveal shared derived gene expressions (the synapomorphies); provides better assessment of ontogenic pathways and phyletic relatedness of specimens; efficiently utilizes dichotomously expressed genes; produces highly predictive class recognition; illustrates gene linkage and multiple developmental pathways; provides higher concordance between gene lists; and projects the direction of change among specimens. A further implication of this phylogenetic approach is that it may transform the microarray into a diagnostic, prognostic, and predictive tool.
13.
Khimani AH, Mhashilkar AM, Mikulskis A, O'Malley M, Liao J, Golenko EE, Mayer P, Chada S, Killian JB, Lott ST 《BioTechniques》2005,38(5):739-745
Biological maintenance of cells under variable conditions should affect gene expression of only certain genes while leaving the rest unchanged. The latter, termed "housekeeping genes," by definition must reflect no change in their expression levels during cell development, treatment, or disease state anomalies. However, deviations from this rule have been observed. Using DNA microarray technology, we report here variations in expression levels of certain housekeeping genes in prostate cancer and a colorectal cancer gene therapy model system. To highlight, differential expression was observed for ribosomal protein genes in the prostate cancer cells and beta-actin in treated colorectal cells. High-throughput differential gene expression analysis via microarray technology and quantitative PCR has become a common platform for classifying variations in similar types of cancers, response to chemotherapy, identifying disease markers, etc. Therefore, normalization of the system based on housekeeping genes, such as those reported here in cancer, must be approached with caution.
14.
Lenore Cowen, Phil Bradley, Matthew Menke, Jonathan King, Bonnie Berger 《Journal of computational biology》2002,9(2):261-276
A method is presented that uses beta-strand interactions to predict the parallel right-handed beta-helix super-secondary structural motif in protein sequences. A program called BetaWrap implements this method and is shown to score known beta-helices above non-beta-helices in the Protein Data Bank in cross-validation. It is demonstrated that BetaWrap learns each of the seven known SCOP beta-helix families, when trained primarily on beta-structures that are not beta-helices, together with structural features of known beta-helices from outside the family. BetaWrap also predicts many bacterial proteins of unknown structure to be beta-helices; in particular, these proteins serve as virulence factors, adhesins, and toxins in bacterial pathogenesis and include cell surface proteins from Chlamydia and the intestinal bacterium Helicobacter pylori. The computational method used here may generalize to other beta-structures for which strand topology and profiles of residue accessibility are well conserved.
15.
16.
Direct sequencing of environmental DNA (metagenomics) has great potential for describing the 16S rRNA gene diversity of microbial communities. However, current approaches using this 16S rRNA gene information to describe community diversity suffer from low taxonomic resolution or chimera problems. Here we describe a new strategy that involves stringent assembly and data filtering to reconstruct full-length 16S rRNA genes from metagenomic pyrosequencing data. Simulations showed that reconstructed 16S rRNA genes provided a true picture of the community diversity, had minimal rates of chimera formation and gave taxonomic resolution down to genus level. The strategy was furthermore compared to PCR-based methods to determine the microbial diversity in two marine sponges. This showed that about 30% of the abundant phylotypes reconstructed from metagenomic data failed to be amplified by PCR. Our approach is readily applicable to existing metagenomic datasets and is expected to lead to the discovery of new microbial phylotypes.
17.
Genome-wide association studies (GWAS) for quantitative traits and disease in humans and other species have shown that there are many loci that contribute to the observed resemblance between relatives. GWAS to date have mostly focussed on discovery of genes or regulatory regions harbouring causative polymorphisms, using single SNP analyses and setting stringent type-I error rates. Genome-wide marker data can also be used to predict genetic values and therefore predict phenotypes. Here, we propose a Bayesian method that utilises all marker data simultaneously to predict phenotypes. We apply the method to three traits: coat colour, %CD8 cells, and mean cell haemoglobin, measured in a heterogeneous stock mouse population. We find that a model that contains both additive and dominance effects, estimated from genome-wide marker data, is successful in predicting unobserved phenotypes and is significantly better than a prediction based upon the phenotypes of close relatives. Correlations between predicted and actual phenotypes were in the range of 0.4 to 0.9 when half of the families were used to estimate effects and the other half for prediction. Posterior probabilities of SNPs being associated with coat colour were high for regions that are known to contain loci for this trait. The prediction of phenotypes using large samples, high-density SNP data, and appropriate statistical methodology is feasible and can be applied in human medicine, forensics, or artificial selection programs.
18.
Secondary structure in heterogeneous nuclear RNA: involvement of regions from repeated DNA sites (total citations: 14; self-citations: 0; citations by others: 14)
W Jelinek, G Molloy, R Fernandez-Munoz, M Salditt, J E Darnell 《Journal of molecular biology》1974,82(3):361-370
Heterogeneous nuclear RNA was found to contain regions of secondary structure based on a relative resistance to nuclease treatment compared with mRNA or poliovirus RNA and a shift in density toward double-stranded RNA early in the course of nuclease digestion. The regions involved in this secondary structure are enriched for RNA segments transcribed from repeated sites in the DNA. Thus, to maximize hybridization to repetitive sites heterogeneous nuclear RNA molecules must be both denatured and fragmented. Some of the self-complementary regions in heterogeneous nuclear RNA are released by alkali denaturation and fragmentation below 1500 nucleotides but maximum release is not achieved until fragmentation below 500 nucleotides. These results indicate that these self-complementary regions (“loops” plus “stems”) are mainly below 500 nucleotides in length.
19.
Ethan DH Kim, Ashish Sabharwal, Adrian R Vetta, Mathieu Blanchette 《Algorithms for molecular biology : AMB》2010,5(1):34