Similar Articles

20 similar articles found (search time: 31 ms)
1.
The identification of molecules involved in tumor initiation and progression is fundamental for understanding the biology of the disease and, as a consequence, for the clinical management of patients. In the present work we describe an optimized proteomic approach for the identification of molecules involved in the progression of Chronic Lymphocytic Leukemia (CLL). In detail, leukemic cell lysates are resolved by 2-dimensional electrophoresis (2DE) and visualized as “spots” on the 2DE gels. Comparative analysis of proteomic maps allows the identification of differentially expressed proteins (in terms of abundance and post-translational modifications), which are picked, isolated and identified by mass spectrometry (MS). The biological function of the identified candidates can be tested by different assays (e.g. migration, adhesion and F-actin polymerization) that we have optimized for primary leukemic cells.

2.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet it is difficult to set these parameters a priori. To address this issue, in this paper we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge similar clusters, our framework can split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and the finite mixture model. Within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to existing self-splitting algorithms and traditional algorithms. Three main properties of the proposed SMART framework are: (1) it needs no parameters dependent on the respective dataset or a priori knowledge about the datasets; (2) it is extendible to many different applications; (3) it offers superior performance compared with counterpart algorithms.

3.
4.
Quantification of LC-MS peak intensities assigned during peptide identification in a typical comparative proteomics experiment will vary from run to run of the instrument due to both technical and biological variation. Thus, normalization of peak intensities across an LC-MS proteomics dataset is a fundamental pre-processing step. However, the downstream analysis of LC-MS proteomics data can be dramatically affected by the normalization method selected. Current normalization procedures for LC-MS proteomics data derive normalization values from subsets of the full collection of identified peptides. The distribution of these normalization values is unknown a priori. If they are not independent of the biological factors associated with the experiment, the normalization process can introduce bias into the data, possibly affecting downstream statistical biomarker discovery. We present a novel approach to evaluating normalization strategies that includes the peptide selection component associated with the derivation of normalization values. Our approach evaluates the effect of normalization on the between-group variance structure in order to identify normalization methods that improve the structure of the data without introducing bias into the normalized peak intensities.
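The abstract above leaves the normalization procedure itself abstract; as a point of reference, one of the simplest strategies that such evaluation frameworks compare, global median normalization of log-intensities, can be sketched as follows (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def median_normalize(intensities):
    """Global median normalization: shift each run (column) in log space
    so that its median log2 intensity matches the overall median.

    `intensities`: 2-D array (peptides x runs) of raw peak intensities;
    non-positive entries are treated as missing (NaN) and ignored.
    """
    log_x = np.log2(np.where(intensities > 0, intensities, np.nan))
    run_medians = np.nanmedian(log_x, axis=0)   # one median per run
    global_median = np.nanmedian(run_medians)
    return log_x - run_medians + global_median

# toy example: the second run is systematically twice as intense
raw = np.array([[100.0, 200.0],
                [400.0, 800.0],
                [250.0, 500.0]])
norm = median_normalize(raw)
# after normalization, the per-run median log2 intensities coincide
```

The abstract's point is precisely that the choice among such strategies is not neutral: each implies an assumption about which peptides carry only technical variation.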

5.
Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of “significant genes.” One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set, and is widely used to make sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask “Which biological process is over-represented in my set of interesting genes or proteins?” we can also ask “Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?” For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases, blood coagulation disorders, that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO.
In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis, the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.

What to Learn in This Chapter

  • Review the commonly used approach of Gene Ontology based enrichment analysis
  • Understand the pitfalls associated with current approaches
  • Understand the national infrastructure available for using alternative ontologies for enrichment analysis
  • Learn about a generalized enrichment analysis workflow and its application using disease ontologies
This article is part of the “Translational Bioinformatics” collection for PLOS Computational Biology.
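The counting statistic underlying most enrichment analyses of this kind is a one-sided hypergeometric (Fisher) test: given a universe of N genes of which K carry an annotation term, how surprising is it to see k carriers in a significant set of size n? A self-contained sketch with purely hypothetical counts:

```python
from math import comb

def enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k): the probability that a
    random set of n genes, drawn from a universe of N genes of which K are
    annotated with the term, contains at least k annotated genes."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# hypothetical counts: 12 of 50 significant genes carry the term,
# versus 300 of 10,000 genes in the universe (about 1.5 expected by chance)
p = enrichment_p(12, 50, 300, 10000)
```

In practice one must also correct for testing many terms at once (for example with Benjamini-Hochberg FDR control), which is among the pitfalls this chapter discusses.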

6.
7.
Normalization of single cell RNA-seq data remains a challenging task. The performance of different methods can vary greatly between datasets when unwanted factors and biology are associated. Most normalization methods also only remove the effects of unwanted variation from the cell embedding, but not from the gene-level data typically used for differential expression (DE) analysis to identify marker genes. We propose RUV-III-NB, a method that can be used to remove unwanted variation from both the cell embedding and gene-level counts. Using pseudo-replicates, RUV-III-NB explicitly takes into account potential association with biology when removing unwanted variation. The method can be used for both UMI and read counts and returns adjusted counts that can be used for downstream analyses such as clustering, DE and pseudotime analyses. Using published datasets with different technological platforms, kinds of biology and levels of association between biology and unwanted variation, we show that RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve DE analyses, and lead to results exhibiting greater concordance with independent datasets of the same kind. The performance of RUV-III-NB is consistent and is not sensitive to the number of factors assumed to contribute to the unwanted variation.

8.
An important step in the proteomic analysis of missing proteins is the use of a wide range of tissues, optimal extraction, and the processing of protein material in order to ensure the highest sensitivity in downstream protein detection. This work describes a purification protocol for identifying low-abundance proteins in human chorionic villi using the proposed “1DE-gel concentration” method. This involves the removal of SDS in a short electrophoresis run in a stacking gel without protein separation. Following the in-gel digestion of the obtained holistic single protein band, we used the peptide mixture for further LC–MS/MS analysis. Statistically significant results were derived from six datasets, containing three treatments, each from two tissue sources (elective or missed abortions). The 1DE-gel concentration increased the coverage of the chorionic villus proteome. Our approach allowed the identification of 15 low-abundance proteins, of which some had not been previously detected via the mass spectrometry of trophoblasts. In the post hoc data analysis, we found a dubious or uncertain protein (PSG7) encoded on human chromosome 19 according to neXtProt. A proteomic sample preparation workflow with the 1DE-gel concentration can be used as a prospective tool for uncovering the low-abundance part of the human proteome.

9.
We developed PathAct, a novel method for pathway analysis to investigate the biological and clinical implications of gene expression profiles. The advantage of PathAct in comparison with conventional pathway analysis methods is that it can estimate pathway activity levels for individual patients quantitatively, in the form of a pathway-by-sample matrix. This matrix can then be used for further analyses such as hierarchical clustering. To evaluate the feasibility of PathAct, comparison with frequently used gene-enrichment analysis methods was conducted using two public microarray datasets. Dataset #1 was from breast cancer patients, and we investigated pathways associated with triple-negative breast cancer by PathAct, compared with those obtained by gene set enrichment analysis (GSEA). Dataset #2 was another breast cancer dataset with disease-free survival (DFS) for each patient. The contribution of each pathway to prognosis was investigated by our method as well as by the Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis. In dataset #1, four of the six pathways that satisfied p < 0.05 and FDR < 0.30 by GSEA were also included in those obtained by the PathAct method. For dataset #2, two pathways (“Cell Cycle” and “DNA replication”) of the four pathways found by PathAct were commonly identified by DAVID analysis. Thus, we confirmed a good degree of agreement between PathAct and conventional methods. Moreover, several further statistical analyses, such as hierarchical clustering by pathway activity, correlation analysis between pathways, and survival analysis, were conducted.

10.
We propose a novel strategy for incorporating hierarchical supervised label information into nonlinear dimensionality reduction techniques. Specifically, we extend t-SNE, UMAP, and PHATE to include known or predicted class labels and demonstrate the efficacy of our approach on multiple single-cell RNA sequencing datasets. Our approach, “Haisu,” is applicable across domains and methods of nonlinear dimensionality reduction. In general, the mathematical effect of Haisu can be summarized as a variable perturbation of the high dimensional space in which the original data is observed. We thereby preserve the core characteristics of the visualization method and only change the manifold to respect known or assumed class labels when provided. Our strategy is designed to aid in the discovery and understanding of underlying patterns in a dataset that is heavily influenced by parent-child relationships. We show that using our approach can also help in semi-supervised settings where labels are known for only some datapoints (for instance when only a fraction of the cells are labeled). In summary, Haisu extends existing popular visualization methods to enable a user to incorporate labels known a priori into a visualization, including their hierarchical relationships as defined by a user input graph.

11.

Background

Model selection is a vital part of most phylogenetic analyses, and accounting for the heterogeneity in evolutionary patterns across sites is particularly important. Mixture models and partitioning are commonly used to account for this variation, and partitioning is the most popular approach. Most current partitioning methods require some a priori partitioning scheme to be defined, typically guided by known structural features of the sequences, such as gene boundaries or codon positions. Recent evidence suggests that these a priori boundaries often fail to adequately account for variation in rates and patterns of evolution among sites. Furthermore, new phylogenomic datasets such as those assembled from ultra-conserved elements lack obvious structural features on which to define a priori partitioning schemes. The upshot is that, for many phylogenetic datasets, partitioned models of molecular evolution may be inadequate, thus limiting the accuracy of downstream phylogenetic analyses.

Results

We present a new algorithm that automatically selects a partitioning scheme via the iterative division of the alignment into subsets of similar sites based on their rates of evolution. We compare this method to existing approaches using a wide range of empirical datasets, and show that it consistently leads to large increases in the fit of partitioned models of molecular evolution when measured using AICc and BIC scores. In doing so, we demonstrate that some related approaches to solving this problem may have been associated with a small but important bias.

Conclusions

Our method provides an alternative to traditional approaches to partitioning, such as dividing alignments by gene and codon position. Because our method is data-driven, it can be used to estimate partitioned models for all types of alignments, including those that are not amenable to traditional approaches to partitioning.
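The improvements in model fit reported above are measured with information criteria such as AICc, which weigh the likelihood gained by splitting a subset against the extra parameters the split introduces. A minimal sketch of that computation (the likelihood values below are hypothetical, not from the paper):

```python
def aicc(log_likelihood, n_params, n_sites):
    """Small-sample corrected Akaike information criterion (lower is better)."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    correction = (2.0 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    return aic + correction

# hypothetical step of the iterative search: splitting a subset doubles the
# parameter count, so the split is kept only if the likelihood gain pays for it
one_subset = aicc(log_likelihood=-5200.0, n_params=10, n_sites=1500)
two_subsets = aicc(log_likelihood=-5150.0, n_params=20, n_sites=1500)
split_accepted = two_subsets < one_subset
```

The same comparison with BIC simply swaps the penalty term, which is why the two criteria can disagree near the margin.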

12.
Thermoanaerobacter spp. have long been considered suitable Clostridium thermocellum coculture partners for improving lignocellulosic biofuel production through consolidated bioprocessing. However, studies using “omic”-based profiling to better understand carbon utilization and biofuel producing pathways have been limited to only a few strains thus far. To better characterize carbon and electron flux pathways in the recently isolated, xylanolytic strain, Thermoanaerobacter thermohydrosulfuricus WC1, label-free quantitative proteomic analyses were combined with metabolic profiling. SWATH-MS proteomic analysis quantified 832 proteins in each of six proteomes isolated from mid-exponential-phase cells grown on xylose, cellobiose, or a mixture of both. Despite encoding genes consistent with a carbon catabolite repression network observed in other Gram-positive organisms, simultaneous consumption of both substrates was observed. Lactate was the major end product of fermentation under all conditions despite the high expression of gene products involved with ethanol and/or acetate synthesis, suggesting that carbon flux in this strain may be controlled via metabolite-based (allosteric) regulation or is constrained by metabolic bottlenecks. Cross-species “omic” comparative analyses confirmed similar expression patterns for end-product-forming gene products across diverse Thermoanaerobacter spp. It also identified differences in cofactor metabolism, which potentially contribute to differences in end-product distribution patterns between the strains analyzed. The analyses presented here improve our understanding of T. thermohydrosulfuricus WC1 metabolism and identify important physiological limitations to be addressed in its development as a biotechnologically relevant strain in ethanologenic designer cocultures through consolidated bioprocessing.

13.
14.

Background

The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder.

Methodology/Principal Findings

Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities several orders of magnitude higher, as shown with gene expression microarray datasets in which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables.

Conclusions

Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.

15.
Advancements in mass spectrometry‐based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much‐needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step‐by‐step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques that enable control of batch effect adjustment quality. We provide an R package, "proBatch", containing functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets each encompassing hundreds of samples and consisting of multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.

16.
Many recent microarrays hold an enormous number of probe sets, thus raising many practical and theoretical problems in controlling the false discovery rate (FDR). Biologically, it is likely that most probe sets are associated with un-expressed genes, so the measured values are simply noise due to non-specific binding; also many probe sets are associated with non-differentially-expressed (non-DE) genes. In an analysis to find DE genes, these probe sets contribute to the false discoveries, so it is desirable to filter out these probe sets prior to analysis. In the methodology proposed here, we first fit a robust linear model for probe-level Affymetrix data that accounts for probe and array effects. We then develop a novel procedure called FLUSH (Filtering Likely Uninformative Sets of Hybridizations), which excludes probe sets that have statistically small array-effects or large residual variance. This filtering procedure was evaluated on a publicly available data set from a controlled spiked-in experiment, as well as on a real experimental data set of a mouse model for retinal degeneration. In both cases, FLUSH filtering improves the sensitivity in the detection of DE genes compared to analyses using unfiltered, presence-filtered, intensity-filtered and variance-filtered data. A freely-available package called FLUSH implements the procedures and graphical displays described in the article.
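FLUSH's exact criteria (small array effects and large residual variance from a robust probe-level model) are not reproduced here, but the general idea of removing likely uninformative probe sets before DE testing can be illustrated with a plain across-array variance filter, one of the baselines mentioned above (the data and threshold are synthetic):

```python
import numpy as np

def variance_filter(expression, quantile=0.8):
    """Keep features whose across-array variance exceeds the given quantile
    of all feature variances; a crude stand-in for filtering probe sets
    whose signal is mostly non-specific noise."""
    variances = expression.var(axis=1)
    threshold = np.quantile(variances, quantile)
    keep = variances > threshold
    return expression[keep], keep

rng = np.random.default_rng(0)
flat = rng.normal(0.0, 0.01, size=(80, 6))    # near-constant "noise" probe sets
varying = rng.normal(0.0, 1.0, size=(20, 6))  # probe sets with real signal
data = np.vstack([flat, varying])
filtered, keep = variance_filter(data, quantile=0.8)
```

As the abstract notes, such simple variance filters are precisely the baselines FLUSH is reported to outperform, because they ignore the probe- and array-effect structure of Affymetrix data.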

17.
Genome-wide RNA expression data provide a detailed view of an organism's biological state; hence, a dataset measuring expression variation between genetically diverse individuals (eQTL data) may provide important insights into the genetics of complex traits. However, with data from a relatively small number of individuals, it is difficult to distinguish true causal polymorphisms from the large number of possibilities. The problem is particularly challenging in populations with significant linkage disequilibrium, where traits are often linked to large chromosomal regions containing many genes. Here, we present a novel method, Lirnet, that automatically learns a regulatory potential for each sequence polymorphism, estimating how likely it is to have a significant effect on gene expression. This regulatory potential is defined in terms of “regulatory features”—including the function of the gene and the conservation, type, and position of genetic polymorphisms—that are available for any organism. The extent to which the different features influence the regulatory potential is learned automatically, making Lirnet readily applicable to different datasets, organisms, and feature sets. We apply Lirnet both to the human HapMap eQTL dataset and to a yeast eQTL dataset and provide statistical and biological results demonstrating that Lirnet produces significantly better regulatory programs than other recent approaches. We demonstrate in the yeast data that Lirnet can correctly suggest a specific causal sequence variation within a large, linked chromosomal region. In one example, Lirnet uncovered a novel, experimentally validated connection between Puf3—a sequence-specific RNA binding protein—and P-bodies—cytoplasmic structures that regulate translation and RNA stability—as well as the particular causative polymorphism, a SNP in Mkt1, that induces the variation in the pathway.  相似文献   

18.

Objectives

Epidermal growth factor receptor (EGFR) gene mutations in tumors predict tumor response to EGFR tyrosine kinase inhibitors (EGFR-TKIs) in non-small-cell lung cancer (NSCLC). However, obtaining tumor tissue for mutation analysis is challenging. Here, we aimed to detect serum peptides/proteins associated with EGFR gene mutation status, and test whether a classification algorithm based on serum proteomic profiling could be developed to analyze EGFR gene mutation status to aid therapeutic decision-making.

Patients and Methods

Serum collected from 223 stage IIIB or IV NSCLC patients with known EGFR gene mutation status in their tumors prior to therapy was analyzed by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) and ClinProTools software. Differences in serum peptides/proteins between patients with EGFR gene TKI-sensitive mutations and wild-type EGFR genes were detected in a training group of 100 patients; based on this analysis, a serum proteomic classification algorithm was developed to classify EGFR gene mutation status and tested in an independent validation group of 123 patients. The correlation between EGFR gene mutation status, as identified with the serum proteomic classifier and response to EGFR-TKIs was analyzed.

Results

Nine peptide/protein peaks were significantly different between NSCLC patients with EGFR gene TKI-sensitive mutations and wild-type EGFR genes in the training group. A genetic algorithm model consisting of five peptides/proteins (m/z 4092.4, 4585.05, 1365.1, 4643.49 and 4438.43) was developed from the training group to separate patients with EGFR gene TKI-sensitive mutations and wild-type EGFR genes. The classifier exhibited a sensitivity of 84.6% and a specificity of 77.5% in the validation group. In the 81 patients from the validation group treated with EGFR-TKIs, 28 (59.6%) of 47 patients whose matched samples were labeled as “mutant” by the classifier and 3 (8.8%) of 34 patients whose matched samples were labeled as “wild” achieved an objective response (p<0.0001). Patients whose matched samples were labeled as “mutant” by the classifier had a significantly longer progression-free survival (PFS) than patients whose matched samples were labeled as “wild” (p=0.001).

Conclusion

Peptides/proteins related to EGFR gene mutation status were found in the serum. Classification of EGFR gene mutation status using the serum proteomic classifier established in the present study in patients with stage IIIB or IV NSCLC is feasible and may predict tumor response to EGFR-TKIs.
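The performance figures above follow the standard definitions of sensitivity and specificity; a minimal sketch with hypothetical confusion-matrix counts (not the study's actual counts):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical validation counts: 44 of 52 mutant cases called "mutant",
# 55 of 71 wild-type cases called "wild"
sens, spec = sensitivity_specificity(tp=44, fn=8, tn=55, fp=16)
```

For a serum surrogate of a tissue test, the clinically relevant trade-off is between these two rates: a false "wild" call may deny a patient an effective EGFR-TKI, while a false "mutant" call exposes a patient to an ineffective one.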

19.
Behçet’s disease (BD) is a chronic, relapsing, multisystemic inflammatory disorder with unanswered questions regarding its etiology/pathogenesis and classification. Distinct manifestation-based subsets, pronounced geographical variations in expression, and discrepant immunological abnormalities raised the question of whether Behçet’s is “a disease or a syndrome”. To answer this question, we aimed to display and compare the molecular mechanisms underlying distinct subsets of BD. For this purpose, the expression data of the gene expression profiling and association study on BD by Xavier et al (2013) were retrieved from the GEO database and reanalysed with gene expression data analysis/visualization and bioinformatics enrichment tools. There were 15 BD patients (B) and 14 controls (C). Three subsets of BD patients were generated: MB (isolated mucocutaneous manifestations, n = 7), OB (ocular involvement, n = 4), and VB (large vein thrombosis, n = 4). Class comparison analyses yielded the following numbers of differentially expressed genes (DEGs); B vs C: 4, MB vs C: 5, OB vs C: 151, VB vs C: 274, MB vs OB: 215, MB vs VB: 760, OB vs VB: 984. Venn diagram analysis showed that there were no common DEGs in the intersection “MB vs C” ∩ “OB vs C” ∩ “VB vs C”. Cluster analyses successfully clustered distinct expressions of BD. During gene ontology term enrichment analyses, categories with relevance to IL-8 production (MB vs C) and immune response to microorganisms (OB vs C) were differentially enriched. Distinct subsets of BD display distinct expression profiles and different disease-associated pathways. Based on these clear discrepancies, the designation “Behçet’s syndrome” (BS) should be encouraged and future research should take into consideration the immunogenetic heterogeneity of BS subsets.
Four gene groups, namely, negative regulators of inflammation (CD69, CLEC12A, CLEC12B, TNFAIP3), neutrophil granule proteins (LTF, OLFM4, AZU1, MMP8, DEFA4, CAMP), antigen processing and presentation proteins (CTSS, ERAP1), and regulators of immune response (LGALS2, BCL10, ITCH, CEACAM8, CD36, IL8, CCL4, EREG, NFKBIZ, CCR2, CD180, KLRC4, NFAT5) appear to be instrumental in BS immunopathogenesis.
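The Venn-diagram step described above reduces to set intersections over the DEG lists; a sketch with small, purely illustrative sets (the gene symbols are borrowed from the abstract, but their assignment to comparisons is hypothetical, and the real lists contained 5, 151, and 274 genes):

```python
# hypothetical DEG sets for the three subset-versus-control comparisons
mb_vs_c = {"IL8", "CD69", "TNFAIP3", "CLEC12A", "NFKBIZ"}
ob_vs_c = {"LTF", "OLFM4", "CTSS", "ERAP1", "CD69"}
vb_vs_c = {"CD36", "CCL4", "CCR2", "BCL10", "NFKBIZ"}

triple = mb_vs_c & ob_vs_c & vb_vs_c      # genes shared by all three comparisons
pairwise_mb_ob = mb_vs_c & ob_vs_c        # genes shared by MB and OB only
```

An empty triple intersection, as reported in the study, is the set-theoretic signature of the claim that the three subsets engage largely distinct disease pathways.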

20.

Introduction

Failure to properly account for normal systematic variations in OMICS datasets may result in misleading biological conclusions. Accordingly, normalization is a necessary step in the proper preprocessing of OMICS datasets. In this regard, an optimal normalization method will effectively reduce unwanted biases and increase the accuracy of downstream quantitative analyses. However, it is currently unclear which normalization method is best, since each algorithm addresses systematic noise in different ways.

Objective

Determine an optimal choice of a normalization method for the preprocessing of metabolomics datasets.

Methods

Nine MVAPACK normalization algorithms were compared with simulated and experimental NMR spectra modified with added Gaussian noise and random dilution factors. Methods were evaluated based on their ability to recover the intensities of the true spectral peaks and the reproducibility of true classifying features from an orthogonal projections to latent structures discriminant analysis (OPLS-DA) model.

Results

Most normalization methods (except histogram matching) performed equally well at modest levels of signal variance. Only probabilistic quotient (PQ) and constant sum (CS) maintained the highest level of peak recovery (>67%) and correlation with true loadings (>0.6) at maximal noise.

Conclusion

PQ and CS performed the best at recovering peak intensities and reproducing the true classifying features for an OPLS-DA model regardless of spectral noise level. Our findings suggest that performance is largely determined by the level of noise in the dataset, while the effect of dilution factors was negligible. A minimal allowable noise level of 20% was also identified for a valid NMR metabolomics dataset.
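Probabilistic quotient normalization, the best performer in the comparison above, first applies a constant-sum step and then divides each spectrum by the median of its feature-wise quotients against a reference spectrum to undo dilution; a minimal sketch (the toy spectra are illustrative):

```python
import numpy as np

def pqn(spectra):
    """Probabilistic quotient normalization (PQN), minimal sketch.

    `spectra`: 2-D array (samples x spectral features). Each spectrum is
    scaled to constant sum, then divided by the median of its feature-wise
    quotients against the median reference spectrum."""
    x = spectra / spectra.sum(axis=1, keepdims=True)   # constant-sum step
    reference = np.median(x, axis=0)                   # median reference spectrum
    quotients = x / reference
    dilution = np.median(quotients, axis=1, keepdims=True)
    return x / dilution

# toy data: the second sample is a 2x dilution of the first
base = np.array([4.0, 10.0, 6.0, 20.0])
spectra = np.vstack([base, base / 2.0])
norm = pqn(spectra)
```

The median quotient makes PQN robust to the minority of features that genuinely change between groups, which is why it can outperform constant sum alone when real biological differences are present.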
