Similar literature
 20 similar articles found (search time: 455 ms)
1.
The most widely used statistical methods for finding differentially expressed genes (DEGs) are essentially univariate. In this study, we present a new T² statistic for analyzing microarray data. We implemented our method using a multiple forward search (MFS) algorithm that is designed for selecting a subset of feature vectors in high-dimensional microarray datasets. The proposed T² statistic is a corollary to that originally developed for multivariate analyses and possesses two prominent statistical properties. First, our method takes into account the multidimensional structure of microarray data. Using the information hidden in gene interactions allows for finding genes whose differential expression is not marginally detectable by univariate testing methods. Second, the statistic has a close relationship to discriminant analyses for classification of gene expression patterns. Our search algorithm sequentially maximizes the gene expression difference/distance between two groups of genes. Including such a set of DEGs among the initial feature variables may increase the power of classification rules. We validated our method using a spike-in HGU95 dataset from Affymetrix. The utility of the new method was demonstrated by applying it to the analysis of gene expression patterns in human liver cancers and breast cancers. Extensive bioinformatics analyses and cross-validation of the DEGs identified in the application datasets showed the significant advantages of our new algorithm.
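
As a rough illustration of the multivariate idea in the abstract above, the sketch below computes the classical two-sample Hotelling T² statistic for a small set of genes. It is not the authors' statistic or their MFS search: the function names, the data layout (rows are samples, columns are genes), and the use of the textbook pooled-covariance form are all assumptions made for the example. It also assumes more samples than genes, so that the pooled covariance matrix is invertible.

```python
from statistics import mean


def pooled_covariance(x, y):
    """Return (pooled covariance, mean of x, mean of y); rows = samples."""
    p = len(x[0])
    mx = [mean(col) for col in zip(*x)]
    my = [mean(col) for col in zip(*y)]
    s = [[0.0] * p for _ in range(p)]
    for group, m in ((x, mx), (y, my)):
        for row in group:
            d = [v - mu for v, mu in zip(row, m)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    dof = len(x) + len(y) - 2
    return [[v / dof for v in row] for row in s], mx, my


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 = n1*n2/(n1+n2) * d' S^-1 d."""
    s, mx, my = pooled_covariance(x, y)
    d = [a - b for a, b in zip(mx, my)]
    z = solve(s, d)  # z = S^-1 d without forming the inverse
    n1, n2 = len(x), len(y)
    return n1 * n2 / (n1 + n2) * sum(di * zi for di, zi in zip(d, z))
```

With a single gene the statistic reduces to the square of the pooled two-sample t statistic, which is a convenient sanity check for the sketch.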

2.
MOTIVATION: One problem with discriminant analysis of DNA microarray data is that each sample is represented by a very large number of genes, many of which are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called the LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from the leave-one-out procedure of LS-SVMs (least squares support vector machines), and, as the upper bound for leave-one-out classification results, it reflects to some extent the generalization performance of gene subsets. RESULTS: We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and Mahalanobis class separability measure, and with other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method, while its computational complexity is at the level of the filter method. AVAILABILITY: A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9). CONTACT: ekzmao@ntu.edu.sg.
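
The abstract above names Fisher's ratio as one of the filter criteria used for comparison. A minimal sketch of that kind of per-gene filter follows; it does not implement the LS Bound measure itself. The form used here, squared mean difference over the sum of the class sample variances, is one common definition and is an assumption, as are the function names and the dictionary layout of the expression data.

```python
from statistics import mean, variance


def fisher_ratio(a, b):
    """Squared difference of class means over the sum of class variances."""
    return (mean(a) - mean(b)) ** 2 / (variance(a) + variance(b))


def rank_genes(expr, labels):
    """Rank genes by Fisher's ratio, best first.

    expr: dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels, aligned with the sample order
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = fisher_ratio(a, b)
    return sorted(scores, key=scores.get, reverse=True)
```

A filter like this scores each gene independently, which is why it is cheap but can miss genes that are informative only in combination.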

3.

Background  

In microarray gene expression profiling experiments, differentially expressed genes (DEGs) are detected from among tens of thousands of genes on an array using statistical tests. It is important to control the number of false positives or errors that are present in the resultant DEG list. To date, more than 20 different multiple testing methods have been reported that compute overall Type I error rates in microarray experiments. However, these methods share the following dilemma: they have low power in cases where only a small number of DEGs exist among a large number of total genes on the array.

4.

Purpose

The classification between different gait patterns is a frequent task in gait assessment. The base vectors, usually found using principal component analysis (PCA), are instead obtained by an iterative application of the support vector machine (SVM). The aim was to use classifiability instead of variability to build a subspace (SVM space) that contains the information about classifiable aspects of a movement. The first discriminant of the SVM space is compared to a discriminant found by an independent component analysis (ICA) in the SVM space.

Methods

Eleven runners ran using shoes with different midsoles. Kinematic data, representing the movements during stance phase when wearing the two shoes, were used as input to a PCA and SVM. The data space was decomposed by an iterative application of the SVM into orthogonal discriminants that were able to classify the two movements. The orthogonal discriminants spanned a subspace, the SVM space. It represents the part of the movement that allowed classifying the two conditions. The data in the SVM space were reconstructed for a visual assessment of the movement difference. An ICA was applied to the data in the SVM space to obtain a single discriminant. Cohen's d effect size was used to rank the PCA vectors that could be used to classify the data, the first SVM discriminant, and the ICA discriminant.

Results

The SVM base contains all the information that discriminates the movement of the two shod conditions. It was shown that the SVM base contains some redundancy and a single ICA discriminant was found by applying an ICA in the SVM space.

Conclusions

A combination of PCA, SVM and ICA is best suited to extract all parts of the gait pattern that discriminate between the two movements and to find a discriminant for the classification of dichotomous kinematic data.
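
The Methods above rank candidate discriminants by Cohen's d. A minimal sketch of that ranking step is given below, assuming the pooled-standard-deviation form of Cohen's d for two independent groups; the abstract does not say which variant was used (a paired form is also plausible for the same runners in two shoes), and the function names and input layout are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (independent-groups form)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def rank_by_effect_size(projections):
    """Rank discriminants by |d|, largest first.

    projections: dict mapping a discriminant name to a pair
    (scores under condition 1, scores under condition 2), where each score
    is one trial's data projected onto that discriminant.
    """
    effect = {name: abs(cohens_d(a, b)) for name, (a, b) in projections.items()}
    return sorted(effect, key=effect.get, reverse=True)
```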

5.
MOTIVATION: Differentially expressed gene (DEG) lists detected from different microarray studies for the same disease are often highly inconsistent. Even in technical replicate tests using identical samples, DEG detection still shows very low reproducibility. It is often believed that current small microarray studies will largely introduce false discoveries. RESULTS: Based on a statistical model, we show that even in technical replicate tests using identical samples, it is highly likely that the selected DEG lists will be very inconsistent in the presence of small measurement variations. Therefore, the apparently low reproducibility of DEG detection from current technical replicate tests does not indicate low quality of microarray technology. We also demonstrate that heterogeneous biological variations existing in real cancer data will further reduce the overall reproducibility of DEG detection. Nevertheless, in small subsamples from both simulated and real data, the actual false discovery rate (FDR) for each DEG list tends to be low, suggesting that each separately determined list may comprise mostly true DEGs. Rather than simply counting the overlaps of the discovery lists from different studies for a complex disease, novel metrics are needed for evaluating the reproducibility of discoveries characterized by correlated molecular changes. Supplementary information: Supplementary data are available at Bioinformatics online.

6.
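Objective: To explore the feasibility of applying statistical learning methods to the large volume of data obtained from psychological tests and, based on the findings, to further explore the personality traits of the flight profession, providing new ideas for the selection and evaluation of flight personnel. Methods: 1020 male subjects were randomly sampled from an airline, comprising 510 flight personnel and 510 non-flight personnel, and were assessed with the Cattell 16 Personality Factor test. The resulting 16 factor scores were learned with a support vector machine on randomly partitioned training and test groups, and the learning results were analyzed. Results: Four factors were selected as the characteristic features for classification; the classifier built on a linear support vector machine achieved a mean accuracy of 64% under cross-validation. Conclusion: The classifier built with SVM has a certain degree of reliability and validity.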

7.
8.
Over the last decade, many analytical methods and tools have been developed for microarray data. The detection of differentially expressed genes (DEGs) among different treatment groups is often a primary purpose of microarray data analysis. In addition, association studies investigating the relationship between genes and a phenotype of interest, such as survival time, are also popular in microarray data analysis. Phenotype association analysis provides a list of phenotype-associated genes (PAGs). However, it is sometimes necessary to identify genes that are both DEGs and PAGs. We consider the joint identification of DEGs and PAGs in microarray data analyses. The first approach we used was a naïve approach that detects DEGs and PAGs separately and then identifies the genes in the intersection of the lists of PAGs and DEGs. The second approach we considered was a hierarchical approach that detects DEGs first and then chooses PAGs from among the DEGs, or vice versa. In this study, we propose a new model-based approach for the joint identification of DEGs and PAGs. Unlike the previous two-step approaches, the proposed method simultaneously identifies genes that are both DEGs and PAGs. This method uses standard regression models but adopts a different null hypothesis from ordinary regression models, which allows us to perform joint identification in one step. The proposed model-based methods were evaluated using experimental data and simulation studies. The proposed methods were used to analyze a microarray experiment in which the main interest lies in detecting genes that are both DEGs and PAGs, where DEGs are identified between two diet groups and PAGs are associated with four phenotypes reflecting the expression of leptin, adiponectin, insulin-like growth factor 1, and insulin. Model-based approaches provided a larger number of genes that are both DEGs and PAGs than other methods. Simulation studies showed that they have more power than other methods.
Through analysis of data from experimental microarrays and simulation studies, the proposed model-based approach was shown to provide a more powerful result than the naïve approach and the hierarchical approach. Since our approach is model-based, it is very flexible and can easily handle different types of covariates.
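
The two baseline strategies described in the abstract above, the naïve intersection and the hierarchical two-step selection, can be sketched as follows. This is only an illustration of the baselines, not of the proposed model-based method. The per-gene p-values are taken as given, and the Bonferroni thresholds are an assumption chosen to show how the two baselines can differ; the abstract does not state which tests or corrections were used.

```python
def naive_joint(deg_p, pag_p, alpha=0.05):
    """Detect DEGs and PAGs separately over all genes, then intersect."""
    m = len(deg_p)
    degs = {g for g, p in deg_p.items() if p < alpha / m}
    pags = {g for g, p in pag_p.items() if p < alpha / m}
    return degs & pags


def hierarchical_joint(deg_p, pag_p, alpha=0.05):
    """Detect DEGs first, then test phenotype association only among DEGs."""
    m = len(deg_p)
    degs = {g for g, p in deg_p.items() if p < alpha / m}
    if not degs:
        return set()
    # The second stage corrects only for the number of DEGs carried forward.
    return {g for g in degs if pag_p[g] < alpha / len(degs)}
```

Because the second stage of the hierarchical version tests fewer genes, its threshold is less strict, so it can retain genes that the naïve intersection drops.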

9.
Alzheimer's Disease (AD) is one of the most common causes of dementia, mostly affecting the elderly population. Currently, there is no proper diagnostic tool or method available for the detection of AD. The present study used two distinct data sets of AD genes, which could be potential biomarkers in the diagnosis. The differentially expressed genes (DEGs) curated from both datasets were used for machine learning classification, tissue expression annotation and co-expression analysis. Further, CNPY3, GPR84, HIST1H2AB, HIST1H2AE, IFNAR1, LMO3, MYO18A, N4BP2L1, PML, SLC4A4, ST8SIA4, TLE1 and N4BP2L1 were identified as highly significant DEGs and exhibited co-expression with other query genes. Moreover, a tissue expression study found that these genes are also expressed in brain tissue. In addition to the earlier studies for marker gene identification, we have considered a different set of machine learning classifiers to improve the accuracy of the analysis. Among the six classification algorithms, J48 emerged as the best classifier, which could be used for differentiating healthy and diseased samples. SMO/SVM and Logit Boost followed J48 in classification accuracy.

10.
Lung cancer is a leading cause of cancer-related death worldwide. The aim of this study was to identify target genes and specific biomarkers for the identification and treatment of different types of lung cancer using DNA microarrays. Gene expression profile GSE6044 and miRNA microarray profile GSE17681 were downloaded from the Gene Expression Omnibus database. The differentially expressed genes (DEGs) and miRNAs were screened with the multtest package in R. Then, functional enrichment analysis of the identified DEGs was performed. Furthermore, the verified target genes of the screened miRNAs were selected from the miRTarBase and miRecords databases, and a miRNA-target gene regulation network was constructed. APOE, CDC6 and ATP2B1 were involved in most of the functions obtained for adenocarcinomas, small cell lung cancer and squamous cell carcinomas, respectively. The target DEGs of differentially expressed hsa-miR-29a included FGG in adenocarcinoma, RAN and COL4A1 in small cell lung cancer, and GLUL in squamous cell carcinoma. The target DEGs of hsa-miR-7 were SNCA and SLC7A5 in adenocarcinoma and small cell lung cancer, respectively. ICAM1 and KIT were the target DEGs of hsa-miR-222 in adenocarcinoma and squamous cell carcinoma. The miRNAs and their differentially expressed target genes have the potential to be used in the clinic for the diagnosis and treatment of different kinds of lung cancer in the future.

11.

Background

Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than when the two tasks are performed independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own.

Results

We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the ‘wrong’ (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.

12.
Lyu Yafei, Li Qunhua. 《BMC Bioinformatics》2016,17(1):51-60
Determining differentially expressed genes (DEGs) between biological samples is key to understanding how genotype gives rise to phenotype. RNA-seq and microarray are the two main technologies for profiling gene expression levels. However, considerable discrepancy has been found between DEGs detected using the two technologies. Integrating data across these two platforms has the potential to improve the power and reliability of DEG detection. We propose a rank-based semi-parametric model to determine DEGs using information across different sources and apply it to the integration of RNA-seq and microarray data. By incorporating both the significance of differential expression and the consistency across platforms, our method effectively detects DEGs with moderate but consistent signals. We demonstrate the effectiveness of our method using simulation studies, MAQC/SEQC data and a synthetic microRNA dataset. Our integration method is not only robust to noise and heterogeneity in the data, but also adaptive to the structure of the data. In our simulations and real data studies, our approach shows higher discriminative power and identifies more biologically relevant DEGs than eBayes, DESeq and some commonly used meta-analysis methods.

13.
MOTIVATION: The problem of class prediction has received a tremendous amount of attention in the literature recently. In the context of DNA microarrays, where the task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile, a problem of particular importance is the diagnosis of cancer type based on microarray data. One method of classification that has been very successful in cancer diagnosis is the support vector machine (SVM). The SVM has been shown (through simulations) to be superior to other methods, such as classical discriminant analysis; however, it suffers from the drawback that the solution is implicit and therefore difficult to interpret. In order to remedy this difficulty, an analysis of variance decomposition using structured kernels is proposed and is referred to as the structured polychotomous machine. This technique utilizes Newton-Raphson to find estimates of coefficients, followed by the Rao and Wald tests, respectively, for addition and deletion of import vectors. RESULTS: The proposed method is applied to microarray data and simulation data. The major breakthrough of our method is efficiency, in that only a minimal number of genes that accurately predict the classes are selected. It has been verified that the selected genes serve as legitimate markers for cancer classification from a biological point of view. AVAILABILITY: All source codes used are available on request from the authors.

14.
One of the essential issues in microarray data analysis is to identify differentially expressed genes (DEGs) under different experimental treatments. In this article, a statistical procedure is proposed to identify DEGs in gene expression data, with or without missing observations, from microarray experiments with one or two treatment factors. An F statistic based on Henderson method III was constructed to test the significance of differential expression for each gene under different treatment levels. The cutoff P value was adjusted to control the experiment-wise false discovery rate. A human acute leukemia dataset collected from 38 leukemia patients was reanalyzed by the proposed method. Comparison with the results from significance analysis of microarrays (SAM) and microarray analysis of variance (MAANOVA) indicated that the proposed method performs similarly to MAANOVA for data with one treatment factor, but MAANOVA cannot directly handle missing data. In addition, a mouse brain dataset collected from six brain regions of two inbred strains (two treatment factors) was reanalyzed to identify genes with distinct region-specific expression patterns. The results showed that the proposed method could identify more distinct region-specific expression patterns than the previous analysis of the same dataset. Moreover, a computer program was developed and incorporated in the software QTModel, which is freely available at .
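
The abstract above adjusts the P value cutoff to control the false discovery rate but does not say which procedure was used. As one widely used example, the sketch below implements the Benjamini-Hochberg step-up procedure; treating it as the adjustment here is an assumption, and the function name and return convention are illustrative only.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level q.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= k * q / m, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank  # keep the largest rank that passes
    return sorted(order[:k])
```

Note the step-up behavior: a p-value that fails its own threshold is still rejected if a larger p-value further down the sorted list passes its threshold.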

15.
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in a samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in the successful diagnosis of particular cancer types. In this article, we present an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. A real-coded encoding of the cluster centers is used, and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach is proposed to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier. The final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms on three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are promising and may have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.

16.
MOTIVATION: One particular application of microarray data is to uncover the molecular variation among cancers. One feature of microarray studies is that the number n of samples collected is relatively small compared to the number p of genes per sample, which is usually in the thousands. In statistical terms, this very large number of predictors compared to a small number of samples or observations makes the classification problem difficult. An efficient way to solve this problem is to use dimension reduction statistical techniques in conjunction with nonparametric discriminant procedures. RESULTS: We view the classification problem as a regression problem with few observations and many predictor variables. We use an adaptive dimension reduction method for generalized semi-parametric regression models that allows us to solve the 'curse of dimensionality' problem arising in the context of expression data. The predictive performance of the resulting classification rule is illustrated on two well-known data sets in the microarray literature: the leukemia data, which is known to contain classes that are easily 'separable', and the colon data set.

17.

Background

Breast cancer is one of the leading causes of death for women. It is of great necessity to develop effective methods for breast cancer detection and diagnosis. Recent studies have focused on gene-based signatures for outcome prediction. The kernel SVM has attracted a lot of attention for its discriminative power in dealing with small-sample pattern recognition problems. But how to select or construct an appropriate kernel for a specified problem still needs further investigation.

Results

Here we propose a novel kernel (Hadamard Kernel) in conjunction with Support Vector Machines (SVMs) to address the problem of breast cancer outcome prediction using gene expression data. The Hadamard Kernel outperforms the classical kernels and the correlation kernel in terms of Area under the ROC Curve (AUC) values, where a number of real-world data sets are adopted to test the performance of the different methods.

Conclusions

Hadamard Kernel SVM is effective for breast cancer predictions, either in terms of prognosis or diagnosis. It may benefit patients by guiding therapeutic options. Apart from that, it would be a valuable addition to the current SVM kernel families. We hope it will contribute to the wider biology and related communities.

18.
《Genomics》2020,112(2):1761-1767
We performed a multivariate meta-analysis of microarray data in Crohn's disease (CD) and ulcerative colitis (UC), which are the main forms of inflammatory bowel disease (IBD). They share similar symptoms but differ in the location and extent of inflammation and in complications. We identified 249 differentially expressed genes (DEGs) in CD and 38 in UC at a false discovery rate of 1%. Twenty of the DEGs were common to both diseases. A multivariate test identified 260 DEGs associated with IBD, 53 of which were not found in either disorder individually. We identified important molecular pathways implicated in the pathogenesis of IBD, such as the JAK/STAT and interferon-gamma signaling pathways, and genes involved in cell adhesion, apoptosis and carcinogenesis. Among others, BCAT1 and GZMB are interesting novel DEGs that deserve further investigation in experimental models. The method could also be useful in other cases of meta-analysis of gene expression data.

19.
20.
Although many scholars have utilized high-throughput microarrays to delineate gene expression patterns after spinal cord injury (SCI), no study has evaluated gene changes in the raphe magnus (RM) and somatomotor cortex (SMTC), two areas of the brain primarily affected by SCI. In the present study, we aimed to analyze the differentially expressed genes (DEGs) of RM and SMTC between an SCI model and sham-injured controls at 4 h, 24 h, 7 days, 14 days, 28 days, and 3 months, using microarray dataset GSE2270 downloaded from the Gene Expression Omnibus and the unpaired significance analysis of microarrays method. A protein–protein interaction (PPI) network was constructed for DEGs at crucial time points, and significant biological functions were enriched using DAVID. The results indicated that more DEGs were identified at 14 days in RM and at 4 h/3 months in SMTC after SCI. In the PPI network for DEGs at 14 days in RM, interleukin 6, glyceraldehyde-3-phosphate dehydrogenase (GAPDH), FBJ murine osteosarcoma viral oncogene homolog (FOS), tumor necrosis factor, and nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor) were the top 5 hub genes; in the PPI network for DEGs at 3 months in SMTC, the top 5 hub genes were ubiquitin B, Ras-related C3 botulinum toxin substrate 1 (rho family, small GTP binding protein Rac1), FOS, Janus kinase 2 and vascular endothelial growth factor A. Hedgehog and Wnt signaling pathways were the top 2 significant pathways in RM. These hub DEGs and pathways may be potential therapeutic targets for SCI.
