Similar Documents
20 similar documents found (search time: 296 ms)
1.
Among the numerous artificial intelligence approaches, k-Nearest Neighbor algorithms, genetic algorithms, and artificial neural networks are considered among the most common and effective methods for classification problems. The present study presents the results of implementing a novel hybrid feature selection-classification model using these methods. The purpose is to benefit from the synergies obtained by combining these technologies to develop classification models. Such a combination creates an opportunity to exploit the strengths of each algorithm and to compensate for their deficiencies. To develop the proposed model, with the aim of obtaining the best array of features, feature ranking techniques such as Fisher's discriminant ratio and class separability criteria were first used to prioritize features. Second, the resulting arrays of top-ranked features were used as the initial population of a genetic algorithm to produce optimal feature arrays. Third, using a modified k-Nearest Neighbor method as well as an improved backpropagation neural network, classification proceeded on the optimal feature arrays selected by the genetic algorithm. The performance of the proposed model was compared with thirteen well-known classification models on seven datasets, and statistical analysis was performed using the Friedman test followed by post-hoc tests. The experimental findings indicated that the proposed hybrid model yielded significantly better classification performance than all 13 classification methods. Finally, the performance of the proposed model was benchmarked against the best results reported for state-of-the-art classifiers, in terms of classification accuracy, on the same datasets. This comprehensive comparison revealed that the classification accuracy of the proposed model is desirable, promising, and competitive with existing state-of-the-art classification models.
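The GA-wrapper scheme described above can be sketched in a few dozen lines: feature subsets are encoded as binary masks, a genetic algorithm evolves them, and the leave-one-out accuracy of a k-Nearest Neighbor classifier serves as the fitness function. This is a minimal NumPy illustration of the general idea, not the authors' implementation; the population size, mutation rate, and other settings are arbitrary choices.

```python
import numpy as np

def knn_loo_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of a k-NN classifier on the selected features."""
    Xs = X[:, mask]
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                                  # hold out sample i
        votes = np.bincount(y[np.argsort(d)[:k]], minlength=y.max() + 1)
        correct += votes.argmax() == y[i]
    return correct / len(y)

def ga_feature_selection(X, y, pop_size=20, generations=15, seed=0):
    """Evolve binary feature masks; fitness = k-NN leave-one-out accuracy."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)
    pop[:, 0] = True                                   # avoid empty masks at start

    def fitness(p):
        return np.array([knn_loo_accuracy(X, y, m) if m.any() else 0.0 for m in p])

    for _ in range(generations):
        elite = pop[np.argsort(fitness(pop))[::-1][: pop_size // 2]]  # selection
        children = elite.copy()
        partners = elite[rng.permutation(len(elite))]
        for child, partner in zip(children, partners):
            cp = rng.integers(1, n_feat)               # one-point crossover
            child[cp:] = partner[cp:]
        children ^= rng.random(children.shape) < 0.05  # bit-flip mutation
        pop = np.vstack([elite, children])
    return pop[fitness(pop).argmax()]
```

In the paper, the initial population comes from the Fisher-ratio and class-separability ranking step rather than from random masks as above.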

2.
Fung ES, Ng MK. Bioinformation. 2007;2(5):230-234.
One application of discriminant analysis on microarray data is to classify patient and normal samples based on gene expression values. The analysis is especially important in medical trials and in the diagnosis of cancer subtypes. The main contribution of this paper is a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that separate patient and normal samples. An l2-l1 norm minimization method is incorporated into the discriminant process to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets have shown that the new algorithm generates classification results as good as other classification methods while effectively determining the genes relevant for classification. We demonstrate the gene selection ability and computational effectiveness of the proposed algorithm, and give experimental results illustrating the usefulness of the proposed model.
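The per-gene weighting idea can be illustrated with the classical Fisher discriminant ratio, a simple stand-in for the paper's l2-l1 norm minimization: genes whose class means are far apart relative to their within-class variance receive large weights.

```python
import numpy as np

def fisher_gene_weights(X, y):
    """Fisher discriminant ratio per gene (column) for a two-class problem:
    (m1 - m2)^2 / (s1^2 + s2^2). A larger weight marks a more discriminative gene."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # guard against zero variance
    return num / den
```

Sorting genes by this weight and keeping the top entries gives a simple filter-style gene selection baseline.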

3.
A genotype calling algorithm for Affymetrix SNP arrays
MOTIVATION: A classification algorithm based on a multi-chip, multi-SNP approach is proposed for Affymetrix SNP arrays. Current procedures for calling genotypes on SNP arrays process all the features associated with one chip and one SNP at a time. Using a large training sample where the genotype labels are known, we develop a supervised learning algorithm that obtains more accurate classification results on new data. The proposed method, RLMM, is based on a robustly fitted linear model and uses the Mahalanobis distance for classification. Chip-to-chip non-biological variance is reduced through normalization. This model-based algorithm captures the similarities across genotype groups and probes, as well as across thousands of SNPs, for accurate classification. We apply RLMM to Affymetrix 100K SNP array data, present classification results, and compare them with genotype calls obtained from the Affymetrix procedure DM as well as with the publicly available genotype calls from the HapMap project.

4.
We propose a model-based clustering method for high-dimensional longitudinal data via regularization. This study was motivated by the Trial of Activity in Adolescent Girls (TAAG), which aimed to examine multilevel factors related to changes in physical activity by following a cohort of 783 girls over 10 years from adolescence to early adulthood. Our goal is to identify the intrinsic grouping of subjects with similar patterns of physical activity trajectories and the most relevant predictors within each group. Previous analyses conducted clustering and variable selection in two steps, while our new method performs both tasks simultaneously. Within each cluster, a linear mixed-effects model (LMM) is fitted with a doubly penalized likelihood to induce sparsity in parameter estimation and effect selection. Large-sample joint properties are established, allowing the dimensions of both fixed and random effects to increase at an exponential rate in the sample size, for a general class of penalty functions. Assuming subjects are drawn from a Gaussian mixture distribution, model effects and cluster labels are estimated via a coordinate descent algorithm nested inside the Expectation-Maximization (EM) algorithm. The Bayesian Information Criterion (BIC) is used to determine the optimal number of clusters and the values of the tuning parameters. Our numerical studies show that the new method performs well and can accommodate complex data with multilevel and/or longitudinal effects.
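The BIC-based choice of the number of clusters can be illustrated on a deliberately simple case: a one-dimensional Gaussian mixture fitted by EM. The paper's model is a far richer penalized linear mixed-effects mixture, so this sketch only shows the model-selection mechanics (EM inner loop, BIC outer loop).

```python
import numpy as np

def gmm_loglik(x, w, mu, var):
    """Log-likelihood and per-component densities of a 1-D Gaussian mixture."""
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum(), dens

def em_gmm_1d(x, k, n_iter=100):
    """Fit a 1-D Gaussian mixture with k components by EM; return the log-likelihood."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))      # spread-out initial means
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        _, dens = gmm_loglik(x, w, mu, var)            # E-step: responsibilities
        r = dens / dens.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0) + 1e-12                     # M-step: update parameters
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return gmm_loglik(x, w, mu, var)[0]

def bic_best_k(x, k_max=4):
    """BIC = -2*loglik + p*log(n); each component contributes a mean and a
    variance, plus k-1 free mixing weights in total."""
    scores = [(-2 * em_gmm_1d(x, k) + (3 * k - 1) * np.log(len(x)), k)
              for k in range(1, k_max + 1)]
    return min(scores)[1]
```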

5.

Background

Using a hybrid approach for gene selection and classification is common, as the results are generally better than performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the selected gene sets still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data or address outlier detection only in isolation.

Results

We tackle the outlier detection problem based on the previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external leave-one-out cross-validation framework is developed to replace the internal cross-validation of the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier detected all the outliers found in the original paper (for which the data were provided for analysis), and the genes selected after outlier removal were shown to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets; MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing the detected outliers and flipping some of the remaining samples' labels. Almost all of the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier is sufficiently powerful to detect outliers in high-dimensional microarray datasets.
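The L1-norm SVM of step 3 owes its gene-selection behavior to the L1 penalty, which drives the weights of uninformative genes to exactly zero. A minimal sketch of that mechanism, using proximal subgradient descent on the hinge loss (a generic solver for illustration, not the one used in the paper):

```python
import numpy as np

def l1_svm(X, y, lam=0.5, lr=0.01, n_iter=2000):
    """Linear SVM with an L1 penalty via proximal subgradient descent.
    Labels y must be in {-1, +1}; the soft-threshold step pushes the
    weights of uninformative genes to exactly zero."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        active = y * (X @ w + b) < 1                 # samples violating the margin
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w, b
```

The genes with nonzero weight after training form the selected, non-redundant set.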

6.

Background

Mixtures of beta distributions are a flexible tool for modeling data with values on the unit interval, such as methylation levels. However, maximum likelihood parameter estimation with beta distributions suffers from problems because of singularities in the log-likelihood function if some observations take the values 0 or 1.

Methods

While ad-hoc corrections have been proposed to mitigate this problem, we propose a different approach to parameter estimation for beta mixtures where such problems do not arise in the first place. Our algorithm combines latent variables with the method of moments instead of maximum likelihood, which has computational advantages over the popular EM algorithm.

Results

As an application, we demonstrate that methylation state classification is more accurate when using adaptive thresholds from beta mixtures than non-adaptive thresholds on observed methylation levels. We also demonstrate that we can accurately infer the number of mixture components.

Conclusions

The hybrid algorithm, combining likelihood-based component un-mixing with moment-based parameter estimation, is a robust and efficient method for beta mixture estimation. We provide an implementation of the method (“betamix”) as open source software under the MIT license.
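The method-of-moments step can be sketched for a single, possibly weighted, beta component (the weights playing the role of the latent-variable responsibilities): matching the first two moments gives closed-form parameter estimates, and observations at exactly 0 or 1 cause no singularities, unlike maximum likelihood. This is an illustrative sketch, not the betamix implementation.

```python
import numpy as np

def beta_moments(x, w=None):
    """Estimate Beta(a, b) parameters by matching the first two
    (optionally weighted) sample moments:
        a = m * t,  b = (1 - m) * t,  where t = m(1 - m)/v - 1."""
    w = np.ones_like(x) if w is None else w
    m = np.average(x, weights=w)                 # weighted mean
    v = np.average((x - m) ** 2, weights=w)      # weighted variance
    t = m * (1 - m) / v - 1
    return m * t, (1 - m) * t
```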

7.
Weakly-supervised learning has recently emerged in the classification context, where true labels are often scarce or unreliable. However, this learning setting has not yet been extensively analyzed for regression problems, which are typical in macroecology. We further define a novel computational setting of structurally noisy and incomplete target labels, which arises, for example, when a multi-output regression task defines a distribution such that the outputs must sum to unity. We propose an algorithmic approach to reduce noise in the target labels and improve predictions. We evaluate this setting with a case study in global vegetation modelling, which involves building a model to predict the distribution of vegetation cover from climatic conditions based on global remote sensing data. We compare the performance of the proposed approach to several incomplete-target baselines. The results indicate that the error in the targets can be reduced by our proposed partial-imputation algorithm. We conclude that handling structural incompleteness in the target labels, instead of using only complete observations for training, helps to better capture global associations between vegetation and climate.
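A minimal example of exploiting the sum-to-unity structure: when some fractions of a compositional target vector are missing, the constraint pins down their total mass, and a simple baseline splits that mass evenly among them. This is a toy illustration of the structural idea, not the partial-imputation algorithm of the paper.

```python
import numpy as np

def impute_composition(y):
    """Fill NaN entries of a vector of fractions that must sum to 1.
    The sum-to-unity constraint fixes the total mass of the missing
    entries; as a simple baseline, split it evenly among them."""
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)
    if missing.any():
        remainder = max(0.0, 1.0 - np.nansum(y))   # mass left for missing entries
        y[missing] = remainder / missing.sum()
    return y
```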

8.

Background

Sequence database searching has been the dominant method for peptide identification, in which large numbers of peptide spectra generated from LC-MS/MS experiments are searched by a search engine against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide-spectrum matches (PSMs) remains a challenge.

Results

A novel scoring method, named FC-Ranker, is developed to assign a nonnegative weight to each target PSM based on the likelihood that it is correct. Specifically, the scores of PSMs are updated iteratively using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops.

Conclusions

Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.

9.
Clustering of multivariate data is a commonly used technique in ecology, and many approaches to clustering are available. The results from a clustering algorithm are uncertain, but few clustering approaches explicitly acknowledge this uncertainty. One exception is Bayesian mixture modelling, which treats all results probabilistically, and allows comparison of multiple plausible classifications of the same data set. We used this method, implemented in the AutoClass program, to classify catchments (watersheds) in the Murray Darling Basin (MDB), Australia, based on their physiographic characteristics (e.g. slope, rainfall, lithology). The most likely classification found nine classes of catchments. Members of each class were aggregated geographically within the MDB. Rainfall and slope were the two most important variables that defined classes. The second-most likely classification was very similar to the first, but had one fewer class. Increasing the nominal uncertainty of continuous data resulted in a most likely classification with five classes, which were again aggregated geographically. Membership probabilities suggested that a small number of cases could be members of either of two classes. Such cases were located on the edges of groups of catchments that belonged to one class, with a group belonging to the second-most likely class adjacent. A comparison of the Bayesian approach to a distance-based deterministic method showed that the Bayesian mixture model produced solutions that were more spatially cohesive and intuitively appealing. The probabilistic presentation of results from the Bayesian classification allows richer interpretation, including decisions on how to treat cases that are intermediate between two or more classes, and whether to consider more than one classification. The explicit consideration and presentation of uncertainty makes this approach useful for ecological investigations, where both data and expectations are often highly uncertain.  
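The probabilistic treatment of class membership can be sketched for one-dimensional Gaussian classes: each case receives a posterior probability for every class, and cases with two comparable probabilities are exactly the 'intermediate' catchments discussed above. AutoClass handles mixed discrete/continuous data and infers the classes themselves; this is only a minimal illustration of the membership-probability idea.

```python
import numpy as np

def membership_probs(x, means, sds, priors):
    """Posterior membership probability of a 1-D observation under
    Gaussian classes; near-equal probabilities flag 'intermediate' cases."""
    means, sds, priors = map(np.asarray, (means, sds, priors))
    like = priors * np.exp(-0.5 * ((x - means) / sds) ** 2) / sds
    return like / like.sum()   # normalize to posterior probabilities
```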

10.
As a rapidly developing research direction in computer vision (CV), algorithms for tasks such as image classification and object detection have achieved notable progress. Improving the accuracy and efficiency of algorithms for the fine-grained identification of plant diseases and birds is essential to the dynamic monitoring of agricultural environments. In this study, building on computer vision detection and classification algorithms and combining the architecture and ideas of CNN models, the mainstream Transformer model was optimized and the CA-Transformer (Transformer Combined with Channel Attention) model was proposed to improve the identification and classification of critical areas. The main contributions are as follows: (1) a C-Attention mechanism is proposed to strengthen feature extraction within each patch and the communication between features, so that the entire network can be fully attentive while reducing computational overhead; (2) a weight-sharing method is proposed to transfer parameters between different layers, improving the reusability of model data, and a knowledge distillation step is added to reduce problems such as excessive parameter counts and overfitting; (3) Token Labeling is proposed to generate score labels according to the position of each Token, and the total loss function of this study is defined according to the CA-Transformer model structure. The performance of the proposed CA-Transformer was compared with current mainstream models on datasets of different scales, and ablation experiments were performed. The results show that the CA-Transformer reaches an accuracy of 82.89% and an mIoU of 53.17%, and exhibits good transfer learning ability, indicating that the model performs well in fine-grained visual categorization tasks. In the context of increasingly diverse ecological information, this study can serve as a reference for practical applications of ecological informatics.

11.
Ovarian cancer recurs at a rate of 75%, within a few months or up to several years after therapy. Early recurrence, though it responds better to treatment, is difficult to detect. Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry has shown the potential to accurately identify disease biomarkers and aid early diagnosis. A major challenge in the interpretation of SELDI-TOF data is the high dimensionality of the feature space. To tackle this problem, we have developed a multi-step data processing method composed of t-test, binning, and backward feature selection. A new algorithm, support vector machine-Markov blanket/recursive feature elimination (SVM-MB/RFE), is presented for the backward feature selection. This method integrates minimum-weight feature elimination by SVM-RFE with information-theoretic removal of redundant and irrelevant features by the Markov blanket. Subsequently, SVM was used for classification. We applied the biomarker selection algorithm to 113 serum samples to identify early relapse in ovarian cancer patients after primary therapy. To validate the performance of the proposed algorithm, experiments were carried out in comparison with several other feature selection and classification algorithms.
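The recursive feature elimination half of SVM-RFE can be sketched as follows. A ridge least-squares fit stands in here for the linear SVM, since RFE only needs a weight magnitude per remaining feature; the Markov-blanket redundancy filter of SVM-MB/RFE is omitted from this toy version.

```python
import numpy as np

def rfe_rank(X, y, n_keep=2):
    """Recursive feature elimination: repeatedly fit a linear model and
    drop the feature with the smallest |weight| until n_keep remain
    (ridge least squares stands in for the linear SVM of SVM-RFE)."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        Xs = X[:, remaining]
        w = np.linalg.solve(Xs.T @ Xs + 1e-3 * np.eye(len(remaining)), Xs.T @ y)
        remaining.pop(int(np.argmin(np.abs(w))))   # eliminate the weakest feature
    return remaining
```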

12.
13.
A genetic algorithm (GA) for feature selection, in conjunction with a neural network, was applied to predict protein structural classes based on single amino acid and all dipeptide composition frequencies. These sequence parameters were encoded as input features for the GA in the feature selection procedure and classified with a three-layered neural network to predict protein structural classes. The system was established by optimizing the classification performance of the neural network, which was used as the evaluation function. Self-consistency and jackknife tests on a database containing 498 proteins were used to verify the performance of this hybrid method, and the results were compared with prior work. The adoption of a hybrid model encompassing genetic and neural technologies proved to be a promising approach to the task of protein structural class prediction.

14.
The high tumor heterogeneity makes it very challenging to identify key tumorigenic pathways as therapeutic targets. The integration of multiple omics data is a promising approach to identify driving regulatory networks in patient subgroups. Here, we propose a novel conceptual framework to discover patterns of miRNA-gene networks frequently up- or down-regulated in a group of patients, and to use such networks for patient stratification in hepatocellular carcinoma (HCC). We developed an integrative subgraph mining approach, called iSubgraph, and identified altered regulatory networks frequently observed in HCC patients. The miRNA and gene expression profiles were jointly analyzed in a graph structure. We defined a method to transform microarray data into a graph representation that encodes miRNA and gene expression levels as well as the interactions between them. The iSubgraph algorithm is capable of detecting cooperative regulation of miRNAs and genes even if it occurs only in some patients. Next, the miRNA-mRNA modules were used in an unsupervised class prediction model to discover HCC subgroups via patient clustering by mixture models. The robustness analysis of the mixture model showed that the class predictions are highly stable. Moreover, Kaplan-Meier survival analysis revealed that the HCC subgroups identified by the algorithm have different survival characteristics. Pathway analyses of the miRNA-mRNA co-modules identified by the algorithm demonstrate key roles of Myc, E2F1, let-7, TGFB1, TNF and EGFR in HCC subgroups. Thus, our method can integrate various omics data derived from different platforms and with different dynamic scales to better define molecular tumor subtypes. iSubgraph is available as MATLAB code at http://www.cs.umd.edu/~ozdemir/isubgraph/.

15.
A novel gene selection algorithm based on gene regulation probability is proposed. In this algorithm, a probabilistic model is established to estimate gene regulation probabilities using maximum likelihood estimation, and these probabilities are then used to select the key genes related to the class distinction. Application to the leukemia dataset suggests that the defined gene regulation probability can identify the key genes for the acute lymphoblastic leukemia (ALL)/acute myeloid leukemia (AML) class distinction, and the results of our proposed algorithm are competitive with those of previous algorithms.

16.
An exciting biological advancement over the past few years is the use of microarray technologies to measure simultaneously the expression levels of thousands of genes. The bottleneck now is how to extract useful information from the resulting large amounts of data. An important and common task in analyzing microarray data is to identify genes with altered expression under two experimental conditions. We propose a nonparametric statistical approach, called the mixture model method (MMM), to handle the problem when there is a small number of replicates under each experimental condition. Specifically, we propose estimating the distributions of a t-type test statistic and its null statistic using finite normal mixture models. A comparison of these two distributions by means of a likelihood ratio test, or simply using the tail of the null distribution, can identify genes with significantly changed expression. Several methods are proposed to effectively control the false positives. The methodology is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle ear infection.
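The building blocks of MMM, a t-type statistic per gene and its null counterpart, can be sketched directly. MMM fits finite normal mixtures to both distributions; the sketch below instead generates the null by label permutation and simply compares a gene's statistic with the tail of the null, which already flags changed expression.

```python
import numpy as np

def gene_t_stats(X, y):
    """Per-gene two-sample t-type statistic between conditions y==0 and y==1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def null_t_stats(X, y, n_perm=200, seed=0):
    """Null statistics obtained by recomputing the t-type statistic
    under random permutations of the condition labels."""
    rng = np.random.default_rng(seed)
    return np.concatenate([gene_t_stats(X, rng.permutation(y))
                           for _ in range(n_perm)])
```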

17.
Risk assessment for quantitative responses using a mixture model
Razzaghi M, Kodell RL. Biometrics. 2000;56(2):519-527.
A problem that frequently occurs in biological experiments with laboratory animals is that some subjects are less susceptible to the treatment than others. A mixture model has traditionally been proposed to describe the distribution of responses in treatment groups for such experiments. Using a mixture dose-response model, we derive an upper confidence limit on additional risk, defined as the excess risk over the background risk due to an added dose. Our focus will be on experiments with continuous responses for which risk is the probability of an adverse effect defined as an event that is extremely rare in controls. The asymptotic distribution of the likelihood ratio statistic is used to obtain the upper confidence limit on additional risk. The method can also be used to derive a benchmark dose corresponding to a specified level of increased risk. The EM algorithm is utilized to find the maximum likelihood estimates of model parameters and an extension of the algorithm is proposed to derive the estimates when the model is subject to a specified level of added risk. An example is used to demonstrate the results, and it is shown that by using the mixture model a more accurate measure of added risk is obtained.

18.
Pharmacogenomics and clinical studies that measure the temporal expression levels of patients can identify important pathways and biomarkers that are activated during disease progression or in response to treatment. However, researchers face a number of challenges when trying to combine expression profiles from these patients. Unlike studies that rely on lab animals or cell lines, individuals vary in their baseline expression and in their response rate. In this paper we present a generative model for such data. Our model represents patient expression data at two levels: a gene level, which corresponds to a common response pattern, and a patient level, which accounts for patient-specific expression patterns and response rates. Using an EM algorithm, we infer the parameters of the model. We used our algorithm to analyze multiple sclerosis patients' response to interferon-beta. As we show, our algorithm improves upon prior methods for combining patient data and correctly identifies patient-specific response patterns.

19.
Naskar M, Das K, Ibrahim JG. Biometrics. 2005;61(3):729-737.
A very general class of multivariate life distributions is considered for analyzing failure time clustered data that are subject to censoring and multiple modes of failure. Conditional on cluster-specific quantities, the joint distribution of the failure time and event indicator can be expressed as a mixture of the distribution of time to failure due to a certain type (or specific cause), and the failure type distribution. We assume here the marginal probabilities of various failure types are logistic functions of some covariates. The cluster-specific quantities are subject to some unknown distribution that causes frailty. The unknown frailty distribution is modeled nonparametrically using a Dirichlet process. In such a semiparametric setup, a hybrid method of estimation is proposed based on the i.i.d. Weighted Chinese Restaurant algorithm that helps us generate observations from the predictive distribution of the frailty. The Monte Carlo ECM algorithm plays a vital role for obtaining the estimates of the parameters that assess the extent of the effects of the causal factors for failures of a certain type. A simulation study is conducted to study the consistency of our methodology. The proposed methodology is used to analyze a real data set on HIV infection of a cohort of female prostitutes in Senegal.

20.
Purpose
Accurate detection and treatment of Coronary Artery Disease is mainly based on invasive Coronary Angiography, which could be avoided provided that a robust, non-invasive detection methodology emerged. Despite the progress of computational systems, this remains a challenging issue. The present research investigates Machine Learning and Deep Learning methods in competing with the medical experts' diagnostic yield. Although the highly accurate detection of Coronary Artery Disease, even by experts, is presently implausible, developing Artificial Intelligence models to compete with the human eye and expertise is the first step towards a state-of-the-art Computer-Aided Diagnostic system.
Methods
A set of 566 patient samples is analysed. The dataset contains Polar Maps derived from scintigraphic Myocardial Perfusion Imaging studies, clinical data, and Coronary Angiography results. The latter is considered the reference standard. For the classification of the medical images, the InceptionV3 Convolutional Neural Network is employed, while, for the categorical and continuous features, Neural Networks and a Random Forest classifier are proposed.
Results
The research suggests that an optimal strategy competing with the medical expert's accuracy involves a hybrid multi-input network composed of InceptionV3 and a Random Forest. This method matches the expert's accuracy, which is 79.15% on this particular dataset.
Conclusion
Image classification using deep learning methods can cooperate with clinical data classification methods to enhance the robustness of the predictive model, aiming to compete with the medical expert's ability to identify Coronary Artery Disease subjects in a large-scale patient dataset.
