Similar literature
20 similar documents found (search time: 62 ms)
1.
2.
Gene selection and classification of microarray data using random forest   (Total citations: 9; self-citations: 0; citations by others: 9)

Background  

Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future diagnostic use in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

3.
Summary: Best Linear Prediction (BLP) was used to predict breeding values for 1,396 parents from progeny test data in an operational slash pine breeding program. BLP rankings of parents were compared to rankings of averaged standard scores, a common approach in forestry. Using BLP rankings, selection of higher ranking parents tends to choose parents in a larger number of more precise progeny tests. The trend is the opposite with standard scores; higher ranking parents tend to be those in fewer, less precise tests. BLP and a related methodology, Best Linear Unbiased Prediction (BLUP), were developed by dairy cattle breeders and have not been used widely outside of animal breeding for predicting breeding values from messy progeny test data. Application of either of these techniques usually requires simplifying assumptions to keep the problem computationally tractable. The more appropriate technique for a given application depends upon which set of assumptions is better for the given problem. An assumption of homogeneous genetic and error variances and covariances, generally made by animal breeders when applying BLUP, was inappropriate for our data. We employed an approach that treated fixed effects as known and treated the same trait measured in different environments as different traits with heterogeneous variance structures. As tree improvement programs become more complex, the ease with which BLP and BLUP handle messy data and incorporate diverse sources of information should make these techniques appealing to forest tree breeders.

4.
The Nature Reserve Selection Problem arises in the study of biodiversity conservation. Subject to budgetary constraints, the problem is to select a set of regions to conserve so that the phylogenetic diversity of the set of species contained within those regions is maximized. Recently, it was shown in a paper by Moulton et al. that this problem is NP-hard. In this paper, we establish a tight polynomial-time approximation algorithm for the Nature Reserve Selection Problem. Furthermore, we resolve a question on the computational complexity of a related problem left open in Moulton et al.

5.
The presence of heritable variation is a prerequisite for evolution, but natural selection typically reduces genetic variation. Variation can be maintained in traits under selection through spatial or temporal variation in fitness surfaces, frequency-dependent selection, or disruptive selection. We evaluated the maintenance of variation in the enantiomeric blend of pheromones employed by the bark beetle Ips pini (Say). In natural populations, we quantified fitness surfaces for mating success and progeny production. We investigated the effects of paternal pheromone blend on offspring survival by comparing the spatial scales at which pheromone blends and larval mortality agents vary. Males with extreme pheromone blends obtained up to 1.8 times as many mates who each laid equivalent numbers of eggs, producing strong disruptive selection on male pheromone blend. In combination with imperfect assortative mating that continually produces intermediate genotypes, this fitness surface is sufficient to maintain variation in a heritable trait that is strongly linked to fitness. The ultimate explanation for female preference is unknown but could be because of selection for reduced mortality from specialist predators that prefer common prey pheromone blends. Selection is most likely occurring at the scale of small resource patches within pine stands. Selection at coarser scales (pine stands) is unlikely because pheromone blends did not vary among pine stands. Selection at finer scales (within logs) is unlikely because males of similar enantiomeric blends were not aggregated on logs, and male pheromone blend did not affect the spacing to neighboring galleries. This study documents a rare case of diversifying selection in natural populations.

6.
We propose a new method to assess the merit of any set of scientific papers in a given field based on the citations they receive. Given a field and a citation impact indicator, such as the mean citation or the h-index, the merit of a given set of articles is identified with the probability that a randomly drawn set of articles from a given pool of articles in that field has a lower citation impact according to the indicator in question. The method allows for comparisons between sets of articles of different sizes and fields. Using a dataset acquired from Thomson Scientific that contains the articles published in the periodical literature in the period 1998–2007, we show that the novel approach yields rankings of research units different from those obtained by a direct application of the mean citation or the h-index.
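
The probability described in this abstract can be approximated by simulation. The following is a minimal sketch, not the authors' procedure: it assumes the random sets have the same size as the set being assessed and are drawn without replacement, and the names `merit` and `h_index` are hypothetical.

```python
import random
from statistics import mean


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def merit(unit_citations, pool_citations, indicator=mean, draws=10_000, seed=0):
    """Estimate the share of random same-size sets from the pool whose
    citation impact (under `indicator`) is lower than that of the unit.

    Assumption: random sets are drawn without replacement and have the
    same size as the unit; the abstract does not spell out these details.
    """
    rng = random.Random(seed)
    target = indicator(unit_citations)
    size = len(unit_citations)
    lower = sum(
        indicator(rng.sample(pool_citations, size)) < target
        for _ in range(draws)
    )
    return lower / draws
```

Passing `indicator=h_index` instead of the default `mean` gives the h-index variant; because the result is a probability, sets of different sizes or from different fields end up on the same 0 to 1 scale.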

7.
ABSTRACT: BACKGROUND: The ancestries of genes form gene trees which do not necessarily have the same topology as the species tree due to incomplete lineage sorting. Available algorithms determining the probability of a gene tree given a species tree require exponential computational runtime. RESULTS: In this paper, we provide a polynomial time algorithm to calculate the probability of a ranked gene tree topology for a given species tree, where a ranked tree topology is a tree topology with the internal vertices being ordered. The probability of a gene tree topology can thus be calculated in polynomial time if the number of orderings of the internal vertices is a polynomial number. However, the complexity of calculating the probability of a gene tree topology with an exponential number of rankings for a given species tree remains unknown. CONCLUSIONS: Polynomial algorithms for calculating ranked gene tree probabilities may become useful in developing methodology to infer species trees based on a collection of gene trees, leading to a more accurate reconstruction of ancestral species relationships.
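
To make the "number of orderings of the internal vertices" concrete, here is a small sketch that counts the rankings of a rooted tree topology. It is not the algorithm of the paper; it uses the standard hook-length formula for trees (the number of orderings in which every vertex precedes its descendants equals n! divided by the product of subtree sizes, counting internal vertices only). The nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
from math import factorial


def count_internal(tree):
    """Number of internal vertices in a tree given as nested tuples
    (leaves are any non-tuple value, e.g. strings)."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + sum(count_internal(child) for child in tree)


def count_rankings(tree):
    """Number of orderings of the internal vertices in which every
    vertex comes before all of its descendants."""
    def subtree_sizes(node):
        # Collect, for each internal vertex, the number of internal
        # vertices in the subtree rooted at it.
        if not isinstance(node, tuple):
            return []
        sizes = [count_internal(node)]
        for child in node:
            sizes.extend(subtree_sizes(child))
        return sizes

    sizes = subtree_sizes(tree)
    denominator = 1
    for size in sizes:
        denominator *= size
    return factorial(len(sizes)) // denominator
```

A caterpillar topology such as `((("a", "b"), "c"), "d")` has a single ranking, while the balanced topology `(("a", "b"), ("c", "d"))` has two, which illustrates why the number of rankings depends on tree shape.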

8.
Amino acid propensities for secondary structures have been used since the 1970s, when Chou and Fasman evaluated them within datasets of a few tens of proteins and developed a method to predict the secondary structure of proteins. The method is still in use even though prediction methods have evolved toward very different approaches and higher reliability. Propensity for secondary structures is an intrinsic property of an amino acid, and it is used for generating new algorithms and prediction methods. Our work therefore aimed to investigate which protein dataset is best for evaluating amino acid propensities: a larger but non-homogeneous set, or smaller but homogeneous sets, i.e., all-alpha, all-beta, and alpha-beta proteins. As a first analysis, we evaluated amino acid propensities for helix, beta-strand, and coil in more than 2000 proteins from the PDBselect dataset. With these propensities, secondary structure predictions performed with a method very similar to that of Chou and Fasman gave better results than the original method, which was based on propensities derived from the few tens of X-ray protein structures available in the 1970s. In a refined analysis, we subdivided the PDBselect dataset into three secondary structural classes, i.e., all-alpha, all-beta, and alpha-beta proteins. For each class, the amino acid propensities for helix, beta-strand, and coil were calculated and used to predict secondary structure elements for proteins belonging to the same class, using resubstitution and jackknife tests. This second round of predictions further improved the results of the first round. Therefore, amino acid propensities for secondary structures became more reliable as the homogeneity of the protein dataset used to evaluate them increased. Our results also indicate that all algorithms using propensities for secondary structure can still be improved to obtain better predictive results.
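
As an illustration of what a propensity is, the sketch below computes the commonly used Chou-Fasman-style ratio: the fraction of an amino acid's occurrences found in a given structural state, divided by the overall fraction of residues in that state. The input format (parallel strings of residues and per-residue state labels) and the function name are assumptions; the abstract does not describe the data layout used by the authors.

```python
from collections import Counter


def propensities(sequences, structures):
    """Return {(amino_acid, state): propensity} from aligned strings.

    `sequences` holds one-letter residue strings; `structures` holds
    per-residue state labels of the same length (for example 'H' for
    helix, 'E' for strand, 'C' for coil). A value above 1 means the
    residue is over-represented in that state.
    """
    pair_counts = Counter()
    aa_counts = Counter()
    state_counts = Counter()
    for seq, states in zip(sequences, structures):
        if len(seq) != len(states):
            raise ValueError("sequence and structure lengths differ")
        for aa, state in zip(seq, states):
            pair_counts[(aa, state)] += 1
            aa_counts[aa] += 1
            state_counts[state] += 1

    total = sum(state_counts.values())
    return {
        (aa, state): (pair_counts[(aa, state)] / aa_counts[aa])
        / (state_counts[state] / total)
        for aa in aa_counts
        for state in state_counts
    }
```

Computing the table separately for all-alpha, all-beta, and alpha-beta subsets only requires calling the function once per subset.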

9.
Analysis of multivariate data sets from, for example, microarray studies frequently results in lists of genes which are associated with some response of interest. The biological interpretation is often complicated by the statistical instability of the obtained gene lists, which may partly be due to the functional redundancy among genes, implying that multiple genes can play exchangeable roles in the cell. In this paper, we use the concept of exchangeability of random variables to model this functional redundancy and thereby account for the instability. We present a flexible framework to incorporate the exchangeability into the representation of lists. The proposed framework supports straightforward comparison between any 2 lists. It can also be used to generate new more stable gene rankings incorporating more information from the experimental data. Using 2 microarray data sets, we show that the proposed method provides more robust gene rankings than existing methods with respect to sampling variations, without compromising the biological significance of the rankings.

10.
Due to the great variety of preprocessing tools in two-channel expression microarray data analysis, it is difficult to choose the most appropriate one for a given experimental setup. In our study, two independent two-channel in-house microarray experiments as well as a publicly available dataset were used to investigate the influence of the selection of preprocessing methods (background correction, normalization, and duplicate spots correlation calculation) on the discovery of differentially expressed genes. We show that both the list of differentially expressed genes and the expression values of selected genes depend significantly on the preprocessing approach applied. The choice of normalization method had the highest impact on the results. We propose a simple but efficient approach to increase the reliability of the results: two theoretically distinct normalization methods are applied to the same dataset, and the intersection of the resulting lists of differentially expressed genes is used to obtain a more accurate estimate of the genes that were in fact differentially expressed.
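
The intersection step proposed in this abstract is simple to express in code. The sketch below assumes each normalization pipeline has already produced a list of gene identifiers; the function name and the gene symbols in the usage note are hypothetical.

```python
def consensus_genes(*gene_lists):
    """Return the genes present in every list, in the order of the first list."""
    if not gene_lists:
        return []
    # Genes reported under all normalization methods.
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [gene for gene in gene_lists[0] if gene in common]
```

For example, `consensus_genes(["TP53", "MYC", "EGFR"], ["EGFR", "TP53", "BRCA1"])` keeps only the genes reported by both pipelines. Taking the intersection trades sensitivity for reliability: genes detected under only one normalization are dropped.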

11.
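Predicting the subcellular localization of apoptosis proteins using grouped weight encoding   (Total citations: 2; self-citations: 1; citations by others: 1)
Starting from the physicochemical properties of amino acids and drawing on the ideas of "coarse-graining" and "grouping" from physics, we propose a new feature extraction method for protein sequences: grouped weight encoding. Using the component-coupled algorithm as the classifier, we study the subcellular localization of apoptosis proteins from their primary sequences. On the dataset used by Zhou and Doctor, the overall prediction accuracies in the re-substitution and jackknife tests were 98.0% and 85.7%, respectively, which are 7.2% and 13.2% higher than the overall accuracies obtained with amino acid composition and the component-coupled algorithm. On the dataset used by Chen Yingli and Li Qianzhong, the overall prediction accuracies in the re-substitution and jackknife tests were 94.0% and 80.1%, respectively, which are 5.9% and 2.0% higher than the overall accuracies obtained with dipeptide composition and the increment of diversity algorithm. On our own newly compiled dataset, the overall prediction accuracies in the re-substitution and jackknife tests were 97.33% and 75.11%, respectively. The experimental results show that grouped weight encoding of protein sequences is an effective feature extraction method for studying the localization of apoptosis proteins.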

12.
Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods:
  • Support Vector Machine Recursive Feature Elimination (SVMRFE)
  • Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
  • Gradient based Leave-one-out Gene Selection (GLGS)
To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with the learning classifier. Overall, our approach outperforms the other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.

13.
Manual selection of single particles in images acquired using cryo-electron microscopy (cryoEM) will become a significant bottleneck when datasets of a hundred thousand or even a million particles are required for structure determination at near atomic resolution. Algorithm development of fully automated particle selection is thus an important research objective in the cryoEM field. A number of research groups are making promising new advances in this area. Evaluation of algorithms using a standard set of cryoEM images is an essential aspect of this algorithm development. With this goal in mind, a particle selection "bakeoff" was included in the program of the Multidisciplinary Workshop on Automatic Particle Selection for cryoEM. Twelve groups participated by submitting the results of testing their own algorithms on a common dataset. The dataset consisted of 82 defocus pairs of high-magnification micrographs, containing keyhole limpet hemocyanin particles, acquired using cryoEM. The results of the bakeoff are presented in this paper along with a summary of the discussion from the workshop. It was agreed that establishing benchmark particles and using bakeoffs to evaluate algorithms are useful in promoting algorithm development for fully automated particle selection, and that the infrastructure set up to support the bakeoff should be maintained and extended to include larger and more varied datasets, and more criteria for future evaluations.

14.
Analysis of recursive gene selection approaches from microarray data   (Total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Finding a small subset of the most predictive genes from microarray data for disease prediction is a challenging problem. Support vector machines (SVMs) have been found to be successful with a recursive procedure in selecting important genes for cancer prediction. However, it is not well understood how much of the success depends on the choice of the specific classifier and how much on the recursive procedure. We answer this question by examining multiple classifiers [SVM, ridge regression (RR) and Rocchio] with feature selection in recursive and non-recursive settings on three DNA microarray datasets (ALL-AML Leukemia data, Breast Cancer data and GCM data). RESULTS: We found recursive RR most effective. On the AML-ALL dataset, it achieved zero error rate on the test set using only three genes (selected from over 7000), which is more encouraging than the best published result (zero error rate using 8 genes by recursive SVM). On the Breast Cancer dataset and the two largest categories of the GCM dataset, the results achieved by recursive RR are also very encouraging. A further analysis of the experimental results shows that different classifiers penalize redundant features to different extents and that this property plays an important role in the recursive feature selection process. The RR classifier tends to penalize redundant features to a much larger extent than the SVM does. This may be the reason why recursive RR performs better in selecting genes.

15.
We analyze the frequencies of synonymous codons in animal mitochondrial genomes, focusing particularly on mammals and fish. The frequencies of bases at 4-fold degenerate sites are found to be strongly influenced by context-dependent mutation, which causes correlations between pairs of neighboring bases. There is a pattern of excess of certain dinucleotides and deficit of others that is consistent across large numbers of species, despite the wide variation of single-nucleotide frequencies among species. In many bacteria, translational selection is an important influence on codon usage. In order to test whether translational selection also plays a role in mitochondria, we need to control for context-dependent mutation. Selection for translational accuracy can be detected by comparison of codon usage in conserved and variable sites in the same genes. We give a test of this type that works in the presence of context-dependent mutation. There is very little evidence for translational accuracy selection in the mitochondrial genes considered here. Selection for translational efficiency might lead to preference for codons that match the limited repertoire of anticodons on the mitochondrial tRNAs. This is difficult to detect because the effect would usually be in the same direction in comparable codon families and so would not cause an observable difference in codon usage between families. Several lines of evidence suggest that this type of selection is weak in most cases. However, we found several cases where unusual bases occur at the wobble position of the tRNA, and in these cases, some evidence for selection on codon usage was found. We discuss the way that these unusual cases are associated with codon reassignments in the mitochondrial genetic code.

16.
We have used a 692-case dataset, collected retrospectively by a single observer, to develop decision support systems for the cytodiagnosis of fine needle aspirates of breast lesions. In this study, we use a 322-case dataset that was prospectively collected by multiple observers in a working clinical environment to test two predictive systems, using logistic regression and the multilayer perceptron (MLP) type of neural network. Ten observed features and the patient age were used as input features. The systems were developed using a training set and test set from the single-observer dataset and then applied to the multiple-observer dataset. For the independent test cases from the single-observer dataset, with a threshold set for no false positives on the training set, logistic regression produced a sensitivity of 82% (95% confidence interval 73-91) and a predictive value of a positive result (PV+) of 98% (95-99); the values for the MLP were 79% (69-89) and 100%, respectively. However, the performance on the prospective multiple-observer dataset was much worse, with a sensitivity of 72% (65-80) and a PV+ of 97% (94-99) for logistic regression, and 67% (60-75) and 91% (85-97) for the MLP. These results suggest that there is considerable interobserver variability for the defined features and that this system is unsuitable for further development in the clinical environment unless this problem can be overcome.
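
The two performance measures reported in this abstract are proportions computed from a confusion table. The sketch below shows how they might be computed with an approximate 95% confidence interval. The interval uses the normal (Wald) approximation, which is an assumption; the abstract does not state which interval method the authors used, and the function names are hypothetical.

```python
from math import sqrt


def proportion_with_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using the normal approximation,
    with the bounds clipped to the range 0 to 1."""
    p = successes / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity(true_pos, false_neg):
    """Share of truly positive cases that the system flags as positive."""
    return proportion_with_ci(true_pos, true_pos + false_neg)


def positive_predictive_value(true_pos, false_pos):
    """Share of positive calls that are truly positive (PV+)."""
    return proportion_with_ci(true_pos, true_pos + false_pos)
```

Setting the decision threshold so that no false positives occur on the training set, as described above, pushes PV+ toward 100% at the cost of lower sensitivity.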

17.
The intense interest in the intrinsically disordered proteins in the life science community, together with the remarkable advancements in predictive technologies, have given rise to the development of a large number of computational predictors of intrinsic disorder from protein sequence. While the growing number of predictors is a positive trend, we have observed a considerable difference in predictive quality among predictors for individual proteins. Furthermore, variable predictor performance is often inconsistent between predictors for different proteins, and the predictor that shows the best predictive performance depends on the unique properties of each protein sequence. We propose a computational approach, DISOselect, to estimate the predictive performance of 12 selected predictors for individual proteins based on their unique sequence-derived properties. This estimation informs the users about the expected predictive quality for a selected disorder predictor and can be used to recommend methods that are likely to provide the best quality predictions. Our solution does not depend on the results of any disorder predictor; the estimations are made based solely on the protein sequence. Our solution significantly improves predictive performance, as judged with a test set of 1,000 proteins, when compared to other alternatives. We have empirically shown that by using the recommended methods the overall predictive performance for a given set of proteins can be improved by a statistically significant margin. DISOselect is freely available for non-commercial users through the webserver at http://biomine.cs.vcu.edu/servers/DISOselect/ .

18.
19.
Retrospective case–control studies are more susceptible to selection bias than other epidemiologic studies as by design they require that both cases and controls are representative of the same population. However, as case and control recruitment processes are often different, it is not always obvious that the necessary exchangeability conditions hold. Selection bias typically arises when the selection criteria are associated with the risk factor under investigation. We develop a method which produces bias-adjusted estimates for the odds ratio. Our method hinges on 2 conditions. The first is that a variable that separates the risk factor from the selection criteria can be identified. This is termed the "bias breaking" variable. The second condition is that data can be found such that a bias-corrected estimate of the distribution of the bias breaking variable can be obtained. We show by means of a set of examples that such bias breaking variables are not uncommon in epidemiologic settings. We demonstrate using simulations that the estimates of the odds ratios produced by our method are consistently closer to the true odds ratio than standard odds ratio estimates using logistic regression. Further, by applying it to a case–control study, we show that our method can help to determine whether selection bias is present and thus confirm the validity of study conclusions when no evidence of selection bias can be found.

20.
MOTIVATION: Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. RESULTS: We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existing patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. AVAILABILITY: Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).
