Similar literature
20 similar articles found.
1.
We study statistical methods to detect cancer genes that are over- or down-expressed in some but not all samples in a disease group. This has proven useful in cancer studies where oncogenes are activated only in a small subset of samples. We propose the outlier robust t-statistic (ORT), which is intuitively motivated from the t-statistic, the most commonly used differential gene expression detection method. Using real and simulation studies, we compare the ORT to the recently proposed cancer outlier profile analysis (Tomlins and others, 2005) and the outlier sum statistic of Tibshirani and Hastie (2006). The proposed method often has more detection power and smaller false discovery rates. Supplementary information can be found at http://www.biostat.umn.edu/~baolin/research/ort.html.

2.
We have devised a novel analysis approach, percentile analysis for differential gene expression (PADGE), for identifying genes differentially expressed between two groups of heterogeneous samples. PADGE was designed to compare expression profiles of sample subgroups at a series of percentile cutoffs and to examine the trend of relative expression between sample groups as expression level increases. Simulation studies showed that PADGE has more statistical power than t-statistics, cancer outlier profile analysis (COPA) (Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. Science 310: 644-648, 2005), and kurtosis (Teschendorff AE, Naderi A, Barbosa-Morais NL, Caldas C. Bioinformatics 22: 2269-2275, 2006). Application of PADGE to microarray data sets in tumor tissues demonstrated its utility in prioritizing cancer genes encoding potential therapeutic targets or diagnostic markers. A web application was developed for researchers to analyze a large gene expression data set from heterogeneous biological samples and identify differentially expressed genes between subsets of sample classes using PADGE and other available approaches. Availability: http://www.cgl.ucsf.edu/Research/genentech/padge/.

3.
Outlier sums for differential gene expression analysis   Total citations: 1, self-citations: 0, citations by others: 1
We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer outlier profile analysis proposal of Tomlins and others (2005).

4.
MOTIVATION: Classification of biological samples by microarrays is a topic of much interest. A number of methods have been proposed and successfully applied to this problem. It has recently been shown that classification by nearest centroids provides an accurate predictor that may outperform much more complicated methods. The 'Prediction Analysis of Microarrays' (PAM) approach is one such example, which the authors strongly motivate by its simplicity and interpretability. In this spirit, I seek to assess the performance of classifiers simpler than even PAM. RESULTS: I surprisingly show that the modified t-statistics and shrunken centroids employed by PAM tend to increase misclassification error when compared with their simpler counterparts. Based on these observations, I propose a classification method called 'Classification to Nearest Centroids' (ClaNC). ClaNC ranks genes by standard t-statistics, does not shrink centroids and uses a class-specific gene-selection procedure. Because of these modifications, ClaNC is arguably simpler and easier to interpret than PAM, and it can be viewed as a traditional nearest centroid classifier that uses specially selected genes. I demonstrate that ClaNC error rates tend to be significantly less than those for PAM, for a given number of active genes. AVAILABILITY: Point-and-click software is freely available at http://students.washington.edu/adabney/clanc.
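The abstract above describes a traditional nearest centroid classifier without centroid shrinkage. The following is a minimal Python sketch of that general idea only, not the ClaNC software: it assumes the genes have already been selected, uses Euclidean distance (an assumption, since the abstract does not name a distance), and the function names `class_centroids` and `predict` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def class_centroids(samples, labels):
    """Return the per-class mean expression vector (no shrinkage applied)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign sample x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: dist(centroids[y], x))


# Toy data: two pre-selected genes, two classes (illustrative values only).
samples = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
labels = ["normal", "normal", "tumor", "tumor"]
centroids = class_centroids(samples, labels)
print(predict(centroids, [1.0, 1.1]))
print(predict(centroids, [5.0, 4.5]))
```

The gene-ranking and class-specific gene-selection steps mentioned in the abstract are deliberately left out, because the abstract does not specify them in enough detail to reproduce.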

5.
6.
7.
Atlantic salmon of Eastern Canada were once of considerable importance to aboriginal, recreational, and commercial fisheries, yet many populations are now in decline, particularly those of the inner Bay of Fundy (iBoF), which were recently listed as endangered. We investigated whether nonneutral SNPs could be used to assign individual Atlantic salmon accurately to either the iBoF or the outer Bay of Fundy (oBoF) metapopulations because this has been difficult with existing neutral markers. We first searched for markers under diversifying selection by genotyping eight captively bred Bay of Fundy (BoF) populations for 320 SNP loci with the Sequenom MassARRAY system and then analysed the data set with four different F(ST) outlier detection programs. Three outlier loci were identified by both BayesFST and BayeScan whereas seven outlier loci, including the three previously mentioned, were identified by both Fdist and Arlequin. A subset of 14 nonneutral SNPs was more accurate (85% accuracy) than a subset of 67 neutral SNPs (75% accuracy) at assigning individual salmon back to their metapopulation. We then chose a subset of nine outlier SNP markers and used them to inexpensively genotype archived DNA samples from seven wild BoF populations using Invader chemistry. Hierarchical AMOVA of these independent wild samples corroborated our previous findings of significant genetic differentiation between iBoF and oBoF salmon metapopulations. Our research shows that identifying and using outlier loci is an important step towards achieving the goal of consistently and accurately distinguishing iBoF from oBoF Atlantic salmon, which will aid in their conservation.

8.
Microarrays have thousands to tens-of-thousands of gene features, but only a few hundred patient samples are available. The fundamental problem in microarray data analysis is identifying genes whose disruption causes congenital or acquired disease in humans. In this paper, we propose a new evolutionary method that can efficiently select a subset of potentially informative genes for support vector machine (SVM) classifiers. The proposed evolutionary method uses SVM with a given subset of gene features to evaluate the fitness function, and new subsets of features are selected based on the estimates of generalization error of SVMs and frequency of occurrence of the features in the evolutionary approach. Thus, in theory, selected genes reflect to some extent the generalization performance of SVM classifiers. We compare our proposed method with several existing methods and find that the proposed method can obtain better classification accuracy with a smaller number of selected genes than the existing methods.

9.
The extent of focal chromosomal copy number aberrations (CNAs) in cancer has been uncovered through technical innovations, and this discovery has been critical for the identification of new cancer driver genes in genomics projects such as TCGA and ICGC. Unlike constitutive copy number variations (CNVs), focal CNAs are the result of many selection events during the evolution of cancer genomes. Therefore, it is possible that a single gene in a focal CNA gives the tumor a selective growth advantage. This concept has been instrumental in the discovery of new cancer driver genes. However, focal CNAs lack a consensus definition; therefore, we propose one based on pragmatic considerations. We also describe different strategies to identify focal CNAs and procedures to distinguish them from large CNAs and CNVs.

10.
Head and Neck Squamous Cell Carcinoma (HNSCC) is the fifth most common cancer, annually affecting over half a million people worldwide. Presently, there are no accepted biomarkers for clinical detection and surveillance of HNSCC. In this work, a comprehensive genome-wide analysis of epigenetic alterations in primary HNSCC tumors was employed in conjunction with cancer-specific outlier statistics to define novel biomarker genes which are differentially methylated in HNSCC. The 37 identified biomarker candidates were top-scoring outlier genes with prominent differential methylation in tumors, but with no signal in normal tissues. These putative candidates were validated in independent HNSCC cohorts from our institution and TCGA (The Cancer Genome Atlas). Using the top candidates, ZNF14, ZNF160, and ZNF420, an assay was developed for detection of HNSCC in primary tissue and saliva samples with 100% specificity when compared to normal control samples. Given the high detection specificity, the analysis of ZNF DNA methylation in combination with other DNA methylation biomarkers may be useful in the clinical setting for HNSCC detection and surveillance, particularly in high-risk patients. Several additional candidates identified through this work can be further investigated toward future development of a multi-gene panel of biomarkers for the surveillance and detection of HNSCC.

11.
Discovering statistically significant biclusters in gene expression data   Total citations: 1, self-citations: 0, citations by others: 1
In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method to detect significant biclusters in large expression datasets. Our approach is graph theoretic coupled with statistical modelling of the data. Under plausible assumptions, our algorithm is polynomial and is guaranteed to find the most significant biclusters. We tested our method on a collection of yeast expression profiles and on a human cancer dataset. Cross validation results show high specificity in assigning function to genes based on their biclusters, and we are able to annotate in this way 196 uncharacterized yeast genes. We also demonstrate how the biclusters lead to detecting new concrete biological associations. In cancer data we are able to detect and relate finer tissue types than was previously possible. We also show that the method outperforms the biclustering algorithm of Cheng and Church (2000).

12.
Zhang F, Chen JY. BMC Genomics. 2010, 11(Z2): S12

Background

Breast cancer is the second most common type of cancer worldwide, after lung cancer. Plasma proteome profiling may have a higher chance of identifying protein changes between plasma samples, such as those from normal subjects and breast cancer patients. Breast cancer cell lines have long been used by researchers as a model system for identifying protein biomarkers. Comparing the set of proteins that change in plasma with previously published findings from proteomic analysis of human breast cancer cell lines may identify a subset of candidate protein biomarkers with higher confidence.

Results

In this study, we analyzed a liquid chromatography (LC) coupled tandem mass spectrometry (MS/MS) proteomics dataset from plasma samples of 40 healthy women and 40 women diagnosed with breast cancer. Using a two-sample t-statistic and a permutation procedure, we identified 254 statistically significant, differentially expressed proteins, among which 208 are over-expressed and 46 are under-expressed in breast cancer plasma. We validated this result against previously published proteomic results of human breast cancer cell lines and signaling pathways to derive 25 candidate protein biomarkers in a panel. Using the pathway analysis, we observed that the 25 “activated” plasma proteins were present in several cancer pathways, including ‘Complement and coagulation cascades’, ‘Regulation of actin cytoskeleton’, and ‘Focal adhesion’, and match well with previously reported studies. Additional gene ontology analysis of the 25 proteins also showed that cellular metabolic process and response to external stimulus (especially proteolysis and acute inflammatory response) were enriched functional annotations of the proteins identified in the breast cancer plasma samples. By cross-validation using two additional proteomics studies, we obtained 86% and 83% similarities in pathway-protein matrix between the first study and the two testing studies, which is much better than the similarity we measured with proteins.

Conclusions

We presented a ‘systems biology’ method to identify, characterize, analyze and validate panel biomarkers in breast cancer proteomics data, which includes 1) t-statistics and a permutation procedure, 2) network, pathway and function annotation analysis, and 3) cross-validation of multiple studies. Our results showed that the systems biology approach is essential to understanding the molecular mechanisms of panel protein biomarkers.
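Step 1 of the method above combines a two-sample t-statistic with a permutation procedure. The sketch below illustrates that combination for a single protein under stated assumptions: it uses the Welch (unequal-variance) form of the t-statistic, a two-sided test, and 2000 random label shuffles, none of which are specified in the abstract. The function names and the toy intensities are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch two-sample t-statistic; assumes each group has nonzero variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided p-value from randomly reshuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(t_statistic(perm_a, perm_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy intensities for one protein (illustrative values, not study data).
cancer = [5.1, 5.3, 4.9, 5.6, 5.2, 5.4]
healthy = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3]
p_separated = permutation_p_value(cancer, healthy)
p_similar = permutation_p_value([1, 2, 3, 4, 5, 6], [1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
print(p_separated, p_similar)
```

In a proteome-wide analysis the same procedure would be repeated per protein, and some multiple-testing adjustment would normally follow; the abstract does not say which one was used.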

13.

Background

Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own.

Results

We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the ‘wrong’ (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.

14.
A number of circular regression models have been proposed in the literature. In recent years, there has been strong interest in outlier detection in circular regression. An outlier detection procedure can be developed by defining a new statistic in terms of the circular residuals. In this paper, we propose a new measure which transforms the circular residuals into linear measures using a trigonometric function. We then employ the row deletion approach to identify the observations that affect the measure the most, which are candidate outliers. The corresponding cut-off points and the performance of the detection procedure when applied to Down and Mardia’s model are studied via simulations. For illustration, we apply the procedure to circadian data.
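As a rough illustration of the idea above, the sketch below maps circular residuals to a linear scale and then applies row deletion. It rests on several assumptions that the abstract does not confirm: the trigonometric transform is taken to be 1 - cos(residual), the summary measure is the mean of those values, and the fitted angles are held fixed rather than refitted after each deletion. The function names are hypothetical.

```python
from math import cos, pi


def circular_distance(observed, fitted):
    """Map a circular residual to [0, 2]: 0 if the angles agree, 2 if opposite."""
    return 1.0 - cos(observed - fitted)


def deletion_effects(observed, fitted):
    """Change in the mean circular distance when each observation is dropped.

    Simplification: fitted values are not recomputed after deletion.
    """
    d = [circular_distance(o, f) for o, f in zip(observed, fitted)]
    n = len(d)
    full = sum(d) / n
    return [abs(full - (sum(d) - d[i]) / (n - 1)) for i in range(n)]


# Toy angles in radians; the last observation points the opposite way.
fitted = [0.1, 0.2, 0.3, 0.4]
observed = [0.1, 0.2, 0.3, 0.4 + pi]
effects = deletion_effects(observed, fitted)
candidate = max(range(len(effects)), key=effects.__getitem__)
print(candidate, effects)
```

A cut-off for deciding when an effect is large enough would have to come from simulation, as the abstract describes; no cut-off is assumed here.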

15.

Background

Meta-analysis of gene expression microarray datasets presents significant challenges for statistical analysis. We developed and validated a new bioinformatic method for the identification of genes upregulated in subsets of samples of a given tumour type (‘outlier genes’), a hallmark of potential oncogenes.

Methodology

A new statistical method (the gene tissue index, GTI) was developed by modifying and adapting algorithms originally developed for statistical problems in economics. We compared the potential of the GTI to detect outlier genes in meta-datasets with four previously defined statistical methods, COPA, the OS statistic, the t-test and ORT, using simulated data. We demonstrated that the GTI performed as well as existing methods in a single study simulation. Next, we evaluated the performance of the GTI in the analysis of combined Affymetrix gene expression data from several published studies covering 392 normal samples of tissue from the central nervous system, 74 astrocytomas, and 353 glioblastomas. According to the results, the GTI was better able than most of the previous methods to identify known oncogenic outlier genes. In addition, the GTI identified 29 novel outlier genes in glioblastomas, including TYMS and CDKN2A. The over-expression of these genes was validated in vivo by immunohistochemical staining data from clinical glioblastoma samples. Immunohistochemical data were available for 65% (19 of 29) of these genes, and 17 of these 19 genes (90%) showed a typical outlier staining pattern. Furthermore, raltitrexed, a specific inhibitor of TYMS used in the therapy of tumour types other than glioblastoma, also effectively blocked cell proliferation in glioblastoma cell lines, thus highlighting this outlier gene candidate as a potential therapeutic target.

Conclusions/Significance

Taken together, these results support the GTI as a novel approach to identify potential oncogene outliers and drug targets. The algorithm is implemented in an R package (Text S1).

16.
X Yuan, J Zhang, L Yang, S Zhang, B Chen, Y Geng, Y Wang. PLoS ONE. 2012, 7(7): e41082
Somatic copy number alteration (CNA) is a common phenomenon in cancer genomes. Distinguishing significant consensus events (SCEs) from random background CNAs in a set of subjects has proven to be a valuable tool to study cancer. In order to identify SCEs with an acceptable type I error rate, better computational approaches should be developed based on reasonable statistics and null distributions. In this article, we propose a new approach named TAGCNA for identifying SCEs in somatic CNAs that may encompass cancer driver genes. TAGCNA employs a peel-off permutation scheme to generate a reasonable null distribution based on a prior step of selecting tag CNA markers from the genome being considered. We demonstrate the statistical power of TAGCNA on simulated ground truth data, and validate its applicability using two publicly available cancer datasets: lung and prostate adenocarcinoma. TAGCNA identifies SCEs that are known to be involved with proto-oncogenes (e.g. EGFR, CDK4) and tumor suppressor genes (e.g. CDKN2A, CDKN2B), and provides many additional SCEs with potential biological relevance in these data. TAGCNA can be used to analyze the significance of CNAs in various cancers. It is implemented in R and is freely available at http://tagcna.sourceforge.net/.

17.
MOTIVATION: DNA microarray experiments, generating thousands of gene expression measurements, are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples, and an exploration of the data for mis-labeled or questionable tissue results. RESULTS: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97,802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets. AVAILABILITY: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm.

18.
Chromosomal translocations are common in cancer, and in some cases may be causal in the progression of the disease. Using microarrays, in which the expression of thousands of genes is simultaneously measured, could potentially allow one to detect recurrent translocations for a particular cancer type. Standard statistical tests, such as the t-test, are not suited for detecting these translocations, but a simple test based on robust centering and scaling of the data to help detect outlier samples, followed by a search for pairs of samples with mutually exclusive outliers, may be used to find genes involved in recurrent translocations. We have implemented this method, termed Cancer Outlier Profile Analysis (COPA), in an R package (that we call the copa package), and show its applicability on a publicly available dataset. AVAILABILITY: http://www.bioconductor.org
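The two steps named above, robust centering and scaling followed by a check for mutually exclusive outliers, can be sketched in a few lines. This is an illustration in Python rather than the R copa package, and several details are assumptions: the median and the median absolute deviation (MAD) stand in for the robust center and scale, the threshold of 5 is arbitrary, and exclusivity is checked between the outlier sample sets of two genes. All function names are hypothetical.

```python
from statistics import median


def robust_scale(values):
    """Center by the median and scale by the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust scaling is undefined for this gene")
    return [(v - med) / mad for v in values]


def outlier_samples(values, threshold=5.0):
    """Indices of samples whose robustly scaled expression exceeds threshold."""
    return {i for i, z in enumerate(robust_scale(values)) if z > threshold}


def mutually_exclusive(outliers_a, outliers_b):
    """True if both genes have outliers and no sample is an outlier in both."""
    return bool(outliers_a) and bool(outliers_b) and outliers_a.isdisjoint(outliers_b)


# Toy expression values for two genes across eight samples (illustrative only).
gene_a = [1, 2, 3, 2, 1, 2, 50, 2]
gene_b = [1, 2, 3, 2, 40, 2, 1, 2]
out_a, out_b = outlier_samples(gene_a), outlier_samples(gene_b)
print(out_a, out_b, mutually_exclusive(out_a, out_b))
```

Median and MAD are used because, unlike the mean and standard deviation, they are barely moved by the few extreme samples the method is trying to find.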

19.
Wang Y, Sun G, Ji Z, Xing C, Liang Y. PLoS ONE. 2012, 7(1): e29860
In previous work, we proposed a method for detecting differential gene expression based on the change-point of the expression profile. This non-parametric change-point method gave promising results in both a simulation study and an experiment on public datasets. However, its performance is still limited by low sensitivity to the right bound, and the statistical significance of the statistics has not been fully explored. To overcome the insensitivity to the right bound, we modified the original method by adding a weight function to the D(n) statistic. A simulation study showed that the weighted change-point statistics method is significantly better than the original NPCPS in terms of ROC, false positive rate, and change-point estimate. The mean absolute error of the change-point estimated by the weighted change-point method was 0.03, a reduction of more than 50% compared with the original 0.06, and the mean FPR was reduced by more than 55%. The experiment on microarray Dataset I resulted in 3974 differentially expressed genes out of a total of 5293 genes; the experiment on microarray Dataset II resulted in 9983 differentially expressed genes among a total of 12576 genes. In summary, the method proposed here is an effective modification of the previous method, especially when only a small subset of cancer samples has DGE.

20.