1.
Feature selection is the problem of finding the subset of features that has the greatest impact on predicting class labels. It is especially valuable for high-dimensional datasets. This paper proposes a filter feature selection method for high-dimensional binary medical datasets: Colon, Central Nervous System (CNS), GLI_85, and SMK_CAN_187. The proposed method has three stages. First, the whale algorithm is used to discard irrelevant features. Second, the remaining features are ranked by a frequency-based heuristic called Mutual Congestion. Third, majority voting is applied to the best feature subsets constructed by forward feature selection with threshold τ = 10. This work provides evidence that Mutual Congestion on its own is a powerful predictor of class labels. Furthermore, applying the whale algorithm increases the overall accuracy of Mutual Congestion in most cases. The findings also show that the proposed method improves prediction while selecting fewer features than state-of-the-art methods. https://github.com/hnematzadeh
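The majority-voting stage mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the toy predictions, and the tie-break rule are all assumptions made for illustration, and the whale algorithm and Mutual Congestion ranking are not reproduced here.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label predictions into one label per sample."""
    combined = []
    # zip(*...) groups the votes that all models cast for the same sample.
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break rule (an assumption): pick the smallest label among the tied ones.
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical predictions from three classifiers, each trained on a
# different candidate feature subset (rows = models, columns = samples).
subset_predictions = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
]
print(majority_vote(subset_predictions))  # prints [0, 0, 1]
```

Each column is decided independently, so a single poorly chosen feature subset cannot flip a label unless other subsets agree with it.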
2.
Cluster Computing - Cloud computing is a preferred option for organizations around the globe; it offers scalable, internet-based computing resources as a flexible service. Security is a key...
3.
Background: Microarray experiments are becoming a powerful tool for clinical diagnosis, as they have the potential to discover gene expression
patterns that are characteristic of a particular disease. To date, this problem has received the most attention in the context
of cancer research, especially in tumor classification. Various feature selection methods and classifier design strategies
have also been widely used and compared. However, most published articles on tumor classification have applied a certain
technique to a certain dataset, and recently several researchers have compared these techniques on several public datasets.
However, it has been verified that differently selected features reflect different aspects of the dataset, and some selected features
can obtain better solutions on certain problems. At the same time, faced with a large amount of microarray data and
little prior knowledge, it is difficult to find the intrinsic characteristics using traditional methods. In this paper, we attempt
to introduce a combinational feature selection method in conjunction with ensemble neural networks to generally improve the
accuracy and robustness of sample classification.
5.
Background: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in
reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same
time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature
selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations
between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results.
In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are
hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.
6.
Variable selection is critical in competing risks regression with high-dimensional data. Although penalized variable selection methods and other machine learning-based approaches have been developed, many of these methods suffer from instability in practice. This paper proposes a novel method named Random Approximate Elastic Net (RAEN). Under the proportional subdistribution hazards model, RAEN provides a stable and generalizable solution to the large-p-small-n variable selection problem for competing risks data. Our general framework allows the proposed algorithm to be applied to other time-to-event regression models, including competing risks quantile regression and accelerated failure time models. Through extensive simulations, we show that variable selection and parameter estimation improve markedly with the new, computationally intensive algorithm. A user-friendly R package, RAEN, is developed for public use. We also apply our method to a cancer study to identify influential genes associated with death or progression from bladder cancer.
7.
Apoptosis proteins have a central role in the development and the homeostasis of an organism. These proteins are very important
for understanding the mechanism of programmed cell death. The function of an apoptosis protein is closely related to its subcellular
location. It is crucial to develop powerful tools to predict apoptosis protein locations, given the rapidly widening gap between
the number of known structural proteins and the number of known sequences in the protein databank. In this study, amino acid
pair compositions with different spacings are used to construct feature sets for representing protein samples. A feature selection
approach based on binary particle swarm optimization is applied to extract effective features. An ensemble classifier is
used as the prediction engine, of which the basic classifier is the fuzzy K-nearest neighbor. Each basic classifier is trained
with a different feature set. Two datasets often used in prior works are selected to validate the performance of the proposed approach.
The results obtained by the jackknife test are quite encouraging, indicating that the proposed method might become a potentially
useful tool for predicting the subcellular location of apoptosis proteins, or at least can play a complementary role to the existing methods
in the relevant areas. The supplementary information and software written in Matlab are available by contacting the corresponding
author.
8.
A major challenge for genomewide disease association studies is the high cost of genotyping a large number of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, that captures most of the variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve a computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than that selected by block-based methods. Supplementary website: http://htsnp.stanford.edu/FSFS/.
9.
A correlation-based approach is introduced for enhancing the ability of structure-scoring methods to identify and distinguish native-like conformations. The proposed method relies on a funnel-shaped scoring function that decreases steadily toward the native state. It takes advantage of the idea that, when a structure from a given ensemble is used as an approximation of the native state for the entire ensemble, the structure closest to the native basin yields the highest correlation coefficient between a given score and the distance to that structure. The method is applied successfully to a number of different test cases, which demonstrate not only substantial improvements in the correlation of the score with the distance from the true native state but also the selection of more native-like structures compared to the original score.
10.
ABSTRACT: BACKGROUND: Myocardial ischemia can develop into more serious diseases. Early detection of the ischemic syndrome in the electrocardiogram (ECG), performed more accurately and automatically, can prevent it from developing into a catastrophic disease. To this end, we propose a new method, which employs wavelets and simple feature selection. METHODS: For training and testing, the European ST-T database is used, which comprises 367 ischemic ST episodes in 90 records. We first remove baseline wandering and detect the time positions of QRS complexes by a method based on the discrete wavelet transform. Next, for each heartbeat, we extract three features that can be used for differentiating ST episodes from normal: 1) the area between QRS offset and T-peak points, 2) the normalized and signed sum from QRS offset to effective zero voltage point, and 3) the slope from QRS onset to offset point. We average the feature values over five successive beats to reduce the effects of outliers. Finally, we apply classifiers to those features. RESULTS: We evaluated the algorithm with kernel density estimation (KDE) and support vector machine (SVM) methods. Sensitivity and specificity for KDE were 0.939 and 0.912, respectively. The KDE classifier detects 349 ischemic ST episodes out of a total of 367 ST episodes. Sensitivity and specificity of SVM were 0.941 and 0.923, respectively. The SVM classifier detects 355 ischemic ST episodes. CONCLUSIONS: We proposed a new method for detecting ischemia in ECG. It combines signal processing techniques for removing baseline wandering and detecting the time positions of QRS complexes by discrete wavelet transform with explicit feature extraction from the morphology of ECG waveforms. It was shown that the number of selected features was sufficient to discriminate ischemic ST episodes from normal ones. We also showed how the proposed KDE classifier can automatically select kernel bandwidths, meaning that the algorithm does not require any numerical parameter values to be supplied in advance. In the case of the SVM classifier, one has to select a single parameter.
11.
The small ubiquitin-like modifier (SUMO) proteins are a class of proteins that can be attached to a range of other proteins. The sumoylation of a protein is an important posttranslational modification. Thus, the prediction of the sumoylation site of a given protein is significant. Here we employed a combined method to perform this task. We predicted the sumoylation site of a protein by a two-stage procedure. At the first stage, whether a protein would be sumoylated was predicted; at the second stage, the sumoylation sites of the protein were predicted if it was determined to be modified by SUMO at the first stage. At the first stage, we encoded a protein with protein families (PFAM) and trained the predictor with nearest network algorithm (NNA); at the second stage, we encoded nonapeptides (peptides that contain nine residues) of the protein containing the lysine residues with Amino Acid Index, and trained the predictor with NNA. The predictor was tested by the k-fold cross-validation method. The highest accuracy of the second-stage predictor was 99.55% when 12 features were incorporated in the predictor. The corresponding Matthews Correlation Coefficient was 0.7952. These results indicate that the method is a promising tool to predict the sumoylation site of a protein. Finally, the features used in the predictor are discussed. The software is available on request.
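A very high accuracy can coexist with a noticeably lower Matthews Correlation Coefficient when the classes are imbalanced, as is typical for modification-site prediction. The sketch below computes both metrics from a confusion matrix using the standard definitions. The counts are hypothetical and chosen only to illustrate the effect; they are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical, heavily imbalanced counts: 10 true sites among 1000 candidates.
tp, fn, fp, tn = 8, 2, 2, 988
print(round(accuracy(tp, tn, fp, fn), 3), round(matthews_corrcoef(tp, tn, fp, fn), 3))
# prints 0.996 0.798
```

Because true negatives dominate the total, four errors barely move the accuracy, while the MCC reflects that two of the ten real sites were missed and two of the ten predicted sites were wrong.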
12.
Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. 
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
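The three-level partitioning described in this abstract (discovery, training, validation) can be sketched as a simple index split. This is an illustrative sketch only, not the published pipeline: the function name `three_way_split`, the split fractions, and the seed are assumptions.

```python
import random


def three_way_split(n_items, fractions=(0.2, 0.6, 0.2), seed=0):
    """Split item indices into disjoint discovery, training and validation partitions.

    The fractions are illustrative assumptions, not values from the paper.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_discovery = int(n_items * fractions[0])
    n_training = int(n_items * fractions[1])
    discovery = indices[:n_discovery]
    training = indices[n_discovery:n_discovery + n_training]
    validation = indices[n_discovery + n_training:]  # whatever remains
    return discovery, training, validation


discovery, training, validation = three_way_split(10)
# Motif finding would see only `discovery`, classifier fitting only `training`,
# and performance assessment only `validation`.
print(len(discovery), len(training), len(validation))  # prints 2 6 2
```

Keeping the discovery partition separate means the features are derived from sequences that the classifier never trains on or is scored against.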
13.
Background: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight
into the processes underlying the data. Other advantages of feature selection include the ability of the classification system
to attain good or even better solutions using a restricted subset of features, and faster classification. Thus, robust methods
for fast feature selection are of key importance in extracting knowledge from complex biological data.
14.
GalNAc-transferase can catalyze the biosynthesis of O-linked oligosaccharides. The specificity of GalNAc-transferase is composed of nine amino acid residues denoted by R4, R3, R2, R1, R0, R1', R2', R3', R4'. To predict whether the reducing monosaccharide will be covalently linked to the central residue R0(Ser or Thr), a new method based on feature selection has been proposed in our work. 277 nonapeptides from reference [Chou KC. A sequence-coupled vector-projection model for predicting the specificity of GalNAc-transferase. Protein Sci 1995;4:1365-83] are chosen for training set. Each nonapeptide is represented by hundreds of amino acid properties collected by Amino Acid Index database (http://www.genome.jp/aaindex) and transformed into a numeric vector with 4554 features. The Maximum Relevance Minimum Redundancy (mRMR) method combining with Incremental Feature Selection (IFS) and Feature Forward Selection (FFS) are then applied for feature selection. Nearest Neighbor Algorithm (NNA) is used to build prediction models. The optimal model contains 54 features and its correct rate tested by Jackknife cross-validation test reaches 91.34%. Final feature analysis indicates that amino acid residues at position R3' play the most important role in the recognition of GalNAc-transferase specificity, which were confirmed by the experiments [Elhammer AP, Poorman RA, Brown E, Maggiora LL, Hoogerheide JG, Kezdy FJ. The specificity of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase as inferred from a database of in vivo substrates and from the in vitro glycosylation of proteins and peptides. J Biol Chem 1993;268:10029-38; O'Connell BC, Hagen FK, Tabak LA. The influence of flanking sequence on the O-glycosylation of threonine in vitro. J Biol Chem 1992;267:25010-8; Yoshida A, Suzuki M, Ikenaga H, Takeuchi M. Discovery of the shortest sequence motif for high level mucin-type O-glycosylation. J Biol Chem 1997;272:16884-8]. 
Our method can be used as a tool for predicting O-glycosylation sites and for investigating GalNAc-transferase specificity, which is useful for designing competitive inhibitors of GalNAc-transferase. The prediction software is available upon request.
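The combination of a feature ranking, Incremental Feature Selection and a jackknife-tested nearest neighbor classifier can be sketched as follows. This is a simplified illustration under stated assumptions: the ranking is taken as given (mRMR itself is not implemented), squared Euclidean distance is assumed for the nearest neighbor step, and the function names and toy data are hypothetical.

```python
def nn_jackknife_accuracy(X, y, feature_idx):
    """Leave-one-out (jackknife) accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # the held-out sample may not vote for itself
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feature_idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features):
    """Evaluate the top-k ranked features for k = 1..N and keep the best k."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        acc = nn_jackknife_accuracy(X, y, ranked_features[:k])
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]
ranking = [0, 1]  # assumed to come from a ranking method such as mRMR
print(incremental_feature_selection(X, y, ranking))  # prints ([0], 1.0)
```

On the toy data, adding the noisy second feature pulls nearest neighbors across the class boundary, so the single-feature model wins.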
15.
Methionine aminopeptidase and N-terminal acetyltransferase are two enzymes that contribute most to the N-terminal acetylation, which has long been recognized as a frequent and important kind of co-translational modifications [R.A. Bradshaw, W.W. Brickey, K.W. Walker, N-terminal processing: the methionine aminopeptidase and N alpha-acetyl transferase families, Trends Biochem. Sci. 23 (1998) 263-267]. The combined action of these two enzymes leads to two types of N-terminal acetylated proteins that are with/without the initiator methionine after the N-terminal acetylation. To accurately predict these two types of N-terminal acetylation, a new method based on feature selection has been developed. 1047 N-terminal acetylated and non-acetylated decapeptides retrieved from Swiss-Prot database (http://cn.expasy.org) are encoded into feature vectors by amino acid properties collected in Amino Acid Index database (http://www.genome.jp/aaindex). The Maximum Relevance Minimum Redundancy method (mRMR) combining with Incremental Feature Selection (IFS) and Feature Forward Selection (FFS) is then applied to extract informative features. Nearest Neighbor Algorithm (NNA) is used to build prediction models. Tested by Jackknife Cross-Validation, the correct rate of predictors reach 91.34% and 75.49% for each type, which are both better than that of 84.41% and 62.99% acquired by using motif methods [S. Huang, R.C. Elliott, P.S. Liu, R.K. Koduri, J.L. Weickmann, J.H. Lee, L.C. Blair, P. Ghosh-Dastidar, R.A. Bradshaw, K.M. Bryan, et al., Specificity of cotranslational amino-terminal processing of proteins in yeast, Biochemistry 26 (1987) 8242-8246; R. Yamada, R.A. Bradshaw, Rat liver polysome N alpha-acetyltransferase: substrate specificity, Biochemistry 30 (1991) 1017-1021]. Furthermore, the analysis of the informative features indicates that at least six downstream residues might have effect on the rules that guide the N-terminal acetylation, besides the penultimate residue. 
The software is available upon request.
16.
Wavelet transform has been widely applied to extract characteristic information in spike sorting. As the wavelet coefficients used to distinguish various spike shapes are often disorganized, effective unsupervised methods for selecting the most discriminative features are still lacking. In this paper, we propose an unsupervised feature selection method that employs kernel density estimation to select those wavelet coefficients with bimodal or multimodal distributions. The method is tested on a simulated spike data set, and the average misclassification rate after fuzzy C-means clustering is greatly reduced, which indicates that this kernel density estimation-based feature selection approach is effective.
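One way to picture the selection rule in this abstract is to estimate each coefficient's density with a Gaussian kernel and count the local maxima of the estimate. The sketch below does this on a fixed grid. The mode-counting criterion, the fixed bandwidth, the grid size, and all names are assumptions made for illustration; the paper's exact procedure is not given in the abstract.

```python
import math


def gaussian_kde(values, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = len(values) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
        for x in grid
    ]


def count_modes(values, bandwidth, grid_points=200):
    """Count local maxima of the density estimate on an evenly spaced grid."""
    lo = min(values) - 3.0 * bandwidth
    hi = max(values) + 3.0 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    density = gaussian_kde(values, bandwidth, grid)
    # '>' on the left and '>=' on the right counts a flat two-point peak once.
    return sum(
        1
        for i in range(1, grid_points - 1)
        if density[i] > density[i - 1] and density[i] >= density[i + 1]
    )


def select_multimodal(features, bandwidth):
    """Keep the features whose estimated density has at least two modes."""
    return [name for name, vals in features.items() if count_modes(vals, bandwidth) >= 2]


# Hypothetical wavelet coefficients across spikes: coef_a is unimodal, coef_b bimodal.
coefficients = {
    "coef_a": [-0.2, -0.1, 0.0, 0.1, 0.2],
    "coef_b": [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2],
}
print(select_multimodal(coefficients, bandwidth=0.5))  # prints ['coef_b']
```

The result depends on the bandwidth: a value that is too small produces spurious modes, while one that is too large merges real ones, so in practice it would be chosen from the data.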
17.
The thermostability of proteins is particularly relevant for enzyme engineering. Developing a computational method to identify mesophilic proteins would be helpful for protein engineering and design. In this work, we developed a support vector machine-based method to predict thermophilic proteins using information on amino acid distribution and selected amino acid pairs. A reliable benchmark dataset including 915 thermophilic proteins and 793 non-thermophilic proteins was constructed for training and testing the proposed models. Results showed that 93.8% of thermophilic proteins and 92.7% of non-thermophilic proteins could be correctly predicted using jackknife cross-validation. The high prediction success rate suggests that this model can be applied to the design of stable proteins.
18.
MOTIVATION: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, the elimination of irrelevant and redundant information in a reasonable amount of time has a number of advantages. It enables the classification system to achieve good or even better solutions with a restricted subset of features, allows for faster classification, and helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge. RESULTS: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method performs a fast detection of relevant feature subsets using the technique of constrained feature subsets. Compared to traditional greedy methods, the gain in speed can be up to one order of magnitude, with results comparable to or even better than those of the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small number of discriminative features (or feature dependencies), but where the initial set of potential discriminative features is rather large.
19.
Background: Clustering is one of the most commonly used methods for discovering hidden structure in microarray gene expression data. Most
current methods for clustering samples are based on distance metrics utilizing all genes. This has the effect of obscuring
clustering in samples that may be evident only when looking at a subset of genes, because noise from irrelevant genes dominates
the signal from the relevant genes in the distance calculation.
20.
Feature selection (FS) is a real-world problem that can be solved using optimization techniques. These techniques propose solutions for building a predictive model that minimizes the classifier's prediction errors by selecting informative or important features and discarding redundant, noisy, and irrelevant attributes in the original dataset. A new hybrid feature selection method, called SCAGA, is proposed using the Sine Cosine Algorithm (SCA) and Genetic Algorithm (GA). Typically, optimization methods have two main search strategies: exploration of the search space and exploitation to determine the optimal solution. The proposed SCAGA achieved better performance by balancing the exploitation and exploration strategies of the search space. The proposed SCAGA has also been evaluated using the following criteria: classification accuracy, worst fitness, mean fitness, best fitness, the average number of features, and standard deviation. Moreover, the maximum classification accuracy and the minimal number of features were obtained in the results. The results were also compared with a basic Sine Cosine Algorithm (SCA) and other related approaches published in the literature, such as Ant Lion Optimization and Particle Swarm Optimization. The comparison showed that the results obtained by the SCAGA method were the best overall across the tested datasets from the UCI machine learning repository.