Similar articles
1.
2.
Qiao X  Liu Y 《Biometrics》2009,65(1):159-168
Summary. In multicategory classification, standard techniques typically treat all classes equally. This treatment can be problematic when the dataset is unbalanced in the sense that certain classes have very small class proportions compared to others. The minority classes may be ignored or discounted during the classification process due to their small proportions. This can be a serious problem if those minority classes are important. In this article, we study the problem of unbalanced classification and propose new criteria to measure classification accuracy. Moreover, we propose three different weighted learning procedures: two one-step weighted procedures and one adaptive weighted procedure. We demonstrate the advantages of the new procedures, using multicategory support vector machines, through simulated and real datasets. Our results indicate that the proposed methodology can handle unbalanced classification problems effectively.

3.
4.

Aim

To improve the accuracy of inferences on habitat associations and distribution patterns of rare species by combining machine‐learning, spatial filtering and resampling to address class imbalance and spatial bias of large volumes of citizen science data.

Innovation

Modelling rare species’ distributions is a pressing challenge for conservation and applied research. Often, a large number of surveys are required before enough detections occur to model distributions of rare species accurately, resulting in a data set with a high proportion of non‐detections (i.e. class imbalance). Citizen science data can provide a cost‐effective source of surveys but likely suffer from class imbalance. Citizen science data also suffer from spatial bias, likely from preferential sampling. To correct for class imbalance and spatial bias, we used spatial filtering to under‐sample the majority class (non‐detection) while maintaining all of the limited information from the minority class (detection). We investigated the use of spatial under‐sampling with randomForest models and compared it to common approaches used for imbalanced data, the synthetic minority oversampling technique (SMOTE), weighted random forest and balanced random forest models. Model accuracy was assessed using kappa, Brier score and AUC. We demonstrate the method by evaluating habitat associations and seasonal distribution patterns using citizen science data for a rare species, the tricoloured blackbird (Agelaius tricolor).

Main Conclusions

Spatial under‐sampling increased the accuracy of each model and outperformed the approach typically used to direct under‐sampling in the SMOTE algorithm. Our approach is the first to characterize winter distribution and movement of tricoloured blackbirds. Our results show that tricoloured blackbirds are positively associated with grassland, pasture and wetland habitats, and negatively associated with high elevations or evergreen forests during both winter and breeding seasons. The seasonal differences in distribution indicate that individuals move to the coast during the winter, as suggested by historical accounts.
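The spatial under-sampling idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the filtering details, so the grid-cell rule below (keep every detection, keep one randomly chosen non-detection per grid cell) is an assumption, and the name `spatial_undersample` and the record layout are hypothetical.

```python
import random

def spatial_undersample(records, cell_size, seed=0):
    """Keep all detections and at most one non-detection per grid cell.

    records: list of (x, y, detected) tuples; this layout is an assumption.
    cell_size: side length of the square grid cells, in the units of x and y.
    """
    rng = random.Random(seed)
    # The minority class (detections) is kept in full.
    detections = [r for r in records if r[2]]
    # Group the majority class (non-detections) by the grid cell they fall in.
    by_cell = {}
    for r in records:
        if not r[2]:
            cell = (int(r[0] // cell_size), int(r[1] // cell_size))
            by_cell.setdefault(cell, []).append(r)
    # Draw one non-detection at random from each occupied cell.
    kept = [rng.choice(group) for group in by_cell.values()]
    return detections + kept

# Hypothetical survey records (illustrative only).
surveys = [
    (0.1, 0.1, False), (0.2, 0.3, False), (0.4, 0.4, False),
    (5.5, 5.5, False), (0.3, 0.3, True),
]
filtered = spatial_undersample(surveys, cell_size=1.0)
```

In this toy input the three clustered non-detections collapse to one, which reduces both the class imbalance and the weight of heavily surveyed locations.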

5.
Camera traps are a method for monitoring wildlife and they collect a large number of pictures. The number of images collected of each species usually follows a long-tail distribution, i.e., a few classes have a large number of instances, while a lot of species have just a small percentage. Although in most cases these rare species are the ones of interest to ecologists, they are often neglected when using deep-learning models because these models require a large number of images for the training. In this work, a simple and effective framework called Square-Root Sampling Branch (SSB) is proposed, which combines two classification branches that are trained using square-root sampling and instance sampling to improve long-tail visual recognition, and this is compared to state-of-the-art methods for handling this task: square-root sampling, class-balanced focal loss, and balanced group softmax. To achieve a more general conclusion, the methods for handling long-tail visual recognition were systematically evaluated in four families of computer vision models (ResNet, MobileNetV3, EfficientNetV2, and Swin Transformer) and four camera-trap datasets with different characteristics. Initially, a robust baseline with the most recent training tricks was prepared and, then, the methods for improving long-tail recognition were applied. Our experiments show that square-root sampling was the method that most improved the performance for minority classes by around 15%; however, this was at the cost of reducing the majority classes' accuracy by at least 3%. Our proposed framework (SSB) demonstrated itself to be competitive with the other methods and achieved the best or the second-best results for most of the cases for the tail classes; but, unlike the square-root sampling, the loss in the performance of the head classes was minimal, thus achieving the best trade-off among all the evaluated methods. 
Our experiments also show that Swin Transformer can achieve high performance for rare classes without applying any additional method for handling imbalance, and attains an overall accuracy of 88.76% for the WCS dataset and 94.97% for Snapshot Serengeti using a location-based training/test partition. Despite the improvement in the tail classes' performance, our experiments highlight the need for better methods for handling long-tail visual recognition in camera-trap images, since state-of-the-art approaches achieve poor performance, especially in classes with just a few training instances.
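As a rough illustration of the sampling schemes named in this abstract, the sketch below assumes the usual definition in which class c is drawn with probability proportional to n_c raised to a power: 1 for instance sampling and 0.5 for square-root sampling. The function name and the counts are hypothetical and do not come from the paper.

```python
def sampling_probs(class_counts, power):
    """Per-class sampling probability proportional to n_c ** power.

    power=1.0 gives instance sampling, power=0.5 gives square-root sampling,
    and power=0.0 gives class-balanced sampling.
    """
    weights = {c: n ** power for c, n in class_counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Hypothetical long-tailed camera-trap image counts (illustrative only).
counts = {"zebra": 8100, "hyena": 900, "aardwolf": 100}
instance_probs = sampling_probs(counts, 1.0)
sqrt_probs = sampling_probs(counts, 0.5)
```

With these counts the rarest class goes from roughly 1% of draws under instance sampling to roughly 8% under square-root sampling, which is the kind of rebalancing that trades head-class accuracy for tail-class accuracy.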

6.
Kurgan LA  Zhang T  Zhang H  Shen S  Ruan J 《Amino acids》2008,35(3):551-564
Structural class categorizes proteins based on the amount and arrangement of the constituent secondary structures. The knowledge of structural classes is applied in numerous important predictive tasks that address structural and functional features of proteins. We propose novel structural class assignment methods that use one-dimensional (1D) secondary structure as the input. The methods are designed based on a large set of low-identity sequences for which secondary structure is predicted from their sequence (PSSAsc model) or assigned based on their tertiary structure (SSAsc). The secondary structure is encoded using a comprehensive set of features describing count, content, and size of secondary structure segments, which are fed into a small decision tree that uses ten features to perform the assignment. The proposed models were compared against seven secondary structure-based and ten sequence-based structural class predictors. Using the 1D secondary structure, SSAsc and PSSAsc can assign proteins to the four main structural classes, while the existing secondary structure-based assignment methods can predict only three classes. Empirical evaluation shows that the proposed models are quite promising. Using the structure-based assignment performed in SCOP (structural classification of proteins) as the gold standard, the accuracy of SSAsc and PSSAsc equals 76% and 75%, respectively. We show that the use of the secondary structure predicted from the sequence as an input does not have a detrimental effect on the quality of structural class assignment when compared with using secondary structure derived from tertiary structure. Therefore, PSSAsc can be used to perform the automated assignment of structural classes based on the sequences.

7.
Considerable effort has gone into classifying residues as binding sites in proteins using machine learning methods. The prediction task can be translated into the computational challenge of assigning each residue the label binding site or non‐binding site. Observational data come from various, possibly highly correlated, sources. They include the structure of the protein but not the structure of the complex. The model class of conditional random fields (CRFs) has previously been used successfully for protein binding site prediction. Here, a new CRF‐approach is presented that models the dependencies of residues using a general graphical structure defined as a neighborhood graph, so the model makes fewer independence assumptions on the labels than sequential labeling approaches. A novel node feature “change in free energy” is introduced into the model, which is then denoted by ΔF‐CRF. Parameters are trained with an online large‐margin algorithm. Using the standard feature class relative accessible surface area alone, the general graph‐structure CRF already achieves higher prediction accuracy than the linear chain CRF of Li et al. ΔF‐CRF performs significantly better on a large range of false positive rates than the support‐vector‐machine‐based program PresCont of Zellner et al. on a homodimer set containing 128 chains. ΔF‐CRF has a broader scope than PresCont since it is not constrained to protein subgroups and requires no multiple sequence alignment. The improvement is attributed to the advantageous combination of the novel node feature with the standard feature and to the adopted parameter training method. Proteins 2015; 83:844–852. © 2015 Wiley Periodicals, Inc.

8.
The classical approaches for protein structure prediction rely either on homology of the protein sequence with a template structure or on ab initio calculations for energy minimization. These methods suffer from disadvantages such as the lack of availability of homologous template structures or intractably large conformational search space, respectively. The recently proposed fragment library based approaches first predict the local structures, which can be used in conjunction with the classical approaches of protein structure prediction. The accuracy of the predictions is dependent on the quality of the fragment library. In this work, we have constructed a library of local conformation classes purely based on geometric similarity. The local conformations are represented using Geometric Invariants, properties that remain unchanged under transformations such as translation and rotation, followed by dimension reduction via principal component analysis. The local conformations are then modeled as a mixture of Gaussian probability distribution functions (PDFs). Each of the Gaussian PDFs corresponds to a conformational class with the centroid representing the average structure of that class. We find 46 classes when we use an octapeptide as a unit of local conformation. The protein 3-D structure can now be described as a sequence of local conformational classes. Further, it was of interest to see whether the local conformations can be predicted from the amino acid sequences. To that end, we have analyzed the correlation between sequence features and the conformational classes.

9.
The covarion hypothesis of molecular evolution proposes that selective pressures on an amino acid or nucleotide site change through time, thus causing changes of evolutionary rate along the edges of a phylogenetic tree. Several kinds of Markov models for the covarion process have been proposed. One model, proposed by Huelsenbeck (2002), has 2 substitution rate classes: the substitution process at a site can switch between a single variable rate, drawn from a discrete gamma distribution, and a zero invariable rate. A second model, suggested by Galtier (2001), assumes rate switches among an arbitrary number of rate classes but switching to and from the invariable rate class is not allowed. The latter model allows for some sites that do not participate in the rate-switching process. Here we propose a general covarion model that combines features of both models, allowing evolutionary rates not only to switch between variable and invariable classes but also to switch among different rates when they are in a variable state. We have implemented all 3 covarion models in a maximum likelihood framework for amino acid sequences and tested them on 23 protein data sets. We found significant likelihood increases for all data sets for the 3 models, compared with a model that does not allow site-specific rate switches along the tree. Furthermore, we found that the general model fit the data better than the simpler covarion models in the majority of the cases, highlighting the complexity in modeling the covarion process. The general covarion model can be used for comparing tree topologies, molecular dating studies, and the investigation of protein adaptation.

10.
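We use complex-network methods to explore how sequence features influence protein structure. Because a protein's sequence has an important and complex influence on its structure, we model the relationship between protein structure and sequence features as a complex system. Weighted networks with sequence features as nodes are built using the cross-correlation coefficient, normalized mutual information and transfer entropy. Network centrality measures are then used to analyse how the centrality distributions of these weighted networks differ between protein structural types, and thereby to explore the differences in sequence features among proteins of different structural types. We find that the sequence-feature networks corresponding to different structural types show both commonalities and differences. The article discusses in detail the network centrality distribution for each structural type, as well as the commonalities and differences between structural types. The results are significant for research on the relationship between protein sequence and structure, and for structural classification in particular.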

11.
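Individual tree biomass models based on tree origin, site classes and age groups   Total citations: 4, self-citations: 0, citations by others: 4
Li Haikui  Ning Jinkui 《Acta Ecologica Sinica》2012,32(3):740-757
Large samples of measured data for Masson pine (Pinus massoniana) and larch (Larix) were used as modelling samples, and independently drawn samples were used as validation samples. The samples were classified by origin, site and age group. Two allometric equations compatible with volume were fitted by ordinary least squares and by two weighted least squares methods, and then validated, for total above-ground biomass, the biomass of each above-ground component and below-ground biomass. The results were analysed using eight statistics, including the coefficient of determination, root mean square error, total relative error and estimation precision. For total above-ground biomass of both species, classification by site gave the best model fit and applicability; the VAR model was better for Masson pine, whereas the CAR model was better for larch; and the two weighted least squares methods behaved inconsistently between the modelling and validation samples. In the modelling samples, weighted regression 2 (weight function 1/f^0.5) was slightly better than weighted regression 1 (weight function 1/y^0.5), but in the validation samples weighted regression 1 was clearly better than weighted regression 2. Among the combinations that were best both in fit on the modelling samples and in test results on the validation samples, only weighted regression 1 appeared. For the above-ground component biomass of both species, model fit and applicability were best for stem wood, worst for foliage, and intermediate for branches and bark; the effects of sample classification, model type and weighted least squares method on stem wood biomass followed the same pattern as for total above-ground biomass. For the best combination of sample classification, model type and weighted least squares method, tested with the validation samples, the total relative error did not exceed ±10.0% for branches, ±5.0% for bark, ±30.0% for foliage of Masson pine and ±20.0% for foliage of larch. For below-ground (root) biomass of both species, classification by age group gave the best model fit, and models compatible with volume were generally better than models compatible with total above-ground biomass.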

12.
Chen Q  Ibrahim JG 《Biometrics》2006,62(1):177-184
We consider a class of semiparametric models for the covariate distribution and missing data mechanism for missing covariate and/or response data for general classes of regression models including generalized linear models and generalized linear mixed models. Ignorable and nonignorable missing covariate and/or response data are considered. The proposed semiparametric model can be viewed as a sensitivity analysis for model misspecification of the missing covariate distribution and/or missing data mechanism. The semiparametric model consists of a generalized additive model (GAM) for the covariate distribution and/or missing data mechanism. Penalized regression splines are used to express the GAMs as a generalized linear mixed effects model, in which the variance of the corresponding random effects provides an intuitive index for choosing between the semiparametric and parametric model. Maximum likelihood estimates are then obtained via the EM algorithm. Simulations are given to demonstrate the methodology, and a real data set from a melanoma cancer clinical trial is analyzed using the proposed methods.

13.
《Behavioural processes》1986,13(3):205-215
Children learned matching-to-sample tasks to establish two equivalence classes. Then, one member from each class appeared in a sequence procedure, thereby acquiring the ordinal properties “first” and “second”. When the remaining members in the two equivalence classes were placed in the sequence context, subjects responded in appropriate order without additional training. The data suggest a basic mechanism which can account for the production of new sequence behavior which has no explicit history of training.

14.
Predicting the cofactors of oxidoreductases plays an important role in inferring their catalytic mechanism. Feature extraction is a critical part of prediction systems, requiring raw sequence data to be transformed into appropriate numerical feature vectors while minimizing information loss. In this paper, we present an amino acid composition distribution method for extracting useful features from primary sequence, with the k-nearest neighbor algorithm used as the classifier. The overall prediction accuracy evaluated by 10-fold cross-validation reached 90.74%. Compared with eight other feature extraction methods, the improvement in overall prediction accuracy ranged from 3.49% to 15.74%. Our experimental results confirm that the proposed method is very useful and may be used for other bioinformatics predictions. Interestingly, when features extracted by our method and Chou's amphiphilic pseudo-amino acid composition were combined, the overall accuracy could reach 92.53%.
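The general pipeline in this abstract (sequence to numerical feature vector, then k-nearest neighbor) can be sketched as follows. Note that the sketch uses plain amino acid composition as a stand-in feature, not the authors' amino acid composition distribution method, whose details are not given here. The function names, toy sequences and cofactor labels are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq (a 20-dim vector)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def knn_predict(train, query_seq, k=3):
    """Majority label among the k training sequences closest to query_seq.

    train: list of (sequence, label) pairs.
    Distance is Euclidean distance between composition vectors.
    """
    q = composition(query_seq)
    dists = sorted((math.dist(composition(s), q), label) for s, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set with hypothetical cofactor labels (illustrative only).
train = [("AAAA", "NAD"), ("AAAC", "NAD"), ("WWWW", "FAD"), ("WWWY", "FAD")]
predicted = knn_predict(train, "AAAA", k=3)
```

Any richer feature extractor can replace `composition` without changing the classifier, which is why feature design is the main lever in this kind of system.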

15.
Neural network schemes for detecting rare events in human genomic DNA   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Many problems in molecular biology as well as other areas involve detection of rare events in unbalanced data. We develop two sample stratification schemes in conjunction with neural networks for rare event detection in such databases. Sample stratification is a technique for making each class in a sample have equal influence on decision making. The first scheme proposed stratifies a sample by adding up the weighted sum of the derivatives during the backward pass of training. The second scheme proposed uses a technique of modified bootstrap aggregating. After training neural networks with multiple sets of bootstrapped examples of the rare event classes and subsampled examples of common event classes, multiple voting for classification is performed. RESULTS: These two schemes make rare event classes have a better chance of being included in the sample used for training neural networks and thus improve the classification accuracy for rare event detection. The experimental performance of the two schemes using two sets of human DNA sequences as well as another set of Gaussian data indicates that the proposed schemes have the potential to significantly improve the accuracy of neural networks in recognizing rare events.
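The second scheme (modified bootstrap aggregating) can be sketched roughly as below. The abstract does not state the bag sizes, so drawing equal numbers of rare and common examples per bag is an assumption, and the names `balanced_bags` and `majority_vote` are hypothetical. Training the neural networks themselves is left out.

```python
import random
from collections import Counter

def balanced_bags(rare, common, n_bags, seed=0):
    """Build n_bags training sets for an ensemble.

    Rare-class examples are bootstrapped (sampled with replacement);
    common-class examples are subsampled (without replacement) down to
    the same size. Equal sizes per class are an assumption of this sketch.
    """
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        rare_part = rng.choices(rare, k=len(rare))
        common_part = rng.sample(common, k=len(rare))
        bags.append(rare_part + common_part)
    return bags

def majority_vote(predictions):
    """Combine the class labels predicted by the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical example identifiers: 5 rare events, 100 common events.
bags = balanced_bags(list(range(5)), list(range(100, 200)), n_bags=3)
```

One model would be trained per bag, and `majority_vote` would then combine their predictions for each test example.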

16.
Liu Z  Tan M 《Biometrics》2008,64(4):1155-1161
SUMMARY: In medical diagnosis, the diseased and nondiseased classes are usually unbalanced and one class may be more important than the other depending on the diagnosis purpose. Most standard classification methods, however, are designed to maximize the overall accuracy and cannot incorporate different costs to different classes explicitly. In this article, we propose a novel nonparametric method to directly maximize the weighted specificity and sensitivity of the receiver operating characteristic curve. Combining advances in machine learning, optimization theory, and statistics, the proposed method has excellent generalization properties and assigns different error costs to different classes explicitly. We present experiments that compare the proposed algorithms with support vector machines and regularized logistic regression using data from a study on HIV-1 protease as well as six publicly available datasets. Our main conclusion is that the performance of the proposed algorithm is significantly better in most cases than that of the other classifiers tested. A software package in MATLAB is available upon request.
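To make the objective in this abstract concrete, the sketch below evaluates a weighted sum of sensitivity and specificity and picks the score threshold that maximizes it. This is only an illustration of the criterion, not the authors' nonparametric method. The form w * sensitivity + (1 - w) * specificity and all names are assumptions.

```python
def weighted_sens_spec(scores, labels, threshold, w):
    """Return w * sensitivity + (1 - w) * specificity at a given threshold.

    A case is predicted positive when its score is >= threshold.
    labels are 1 (diseased) or 0 (nondiseased).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return w * tp / pos + (1 - w) * tn / neg

def best_threshold(scores, labels, w):
    """Observed score that maximizes the weighted criterion (lowest wins ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: weighted_sens_spec(scores, labels, t, w))

# Hypothetical classifier scores and true labels (illustrative only).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
```

Raising w favors sensitivity and pushes the chosen threshold down; lowering w favors specificity and pushes it up, which is how differing error costs enter the decision.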

17.
This work examined whether currently available electromyography (EMG) driven models, which are calibrated to satisfy joint moments about one single degree of freedom (DOF), could provide the same musculotendon unit (MTU) force solution when driven by the same input data but calibrated about a different DOF. We then developed a novel and comprehensive EMG-driven model of the human lower extremity that used EMG signals from 16 muscle groups to drive 34 MTUs and satisfy the resulting joint moments simultaneously produced about four DOFs during different motor tasks. This also led to the development of a calibration procedure that allowed identifying a set of subject-specific parameters that ensured physiological behavior for the 34 MTUs. Results showed that currently available single-DOF models did not provide the same unique MTU force solution for the same input data. On the other hand, the MTU force solution predicted by our proposed multi-DOF model satisfied joint moments about multiple DOFs without loss of accuracy compared to single-DOF models corresponding to each of the four DOFs. The predicted MTU force solution (1) was a function of experimentally measured EMGs, (2) was the result of physiological MTU excitation, (3) reflected different MTU contraction strategies associated with different motor tasks, (4) coordinated a greater number of MTUs than currently available single-DOF models, and (5) was not specific to the dynamics of an individual DOF. Therefore, our proposed methodology has the potential to produce a more dynamically consistent and generalizable MTU force solution than was possible using single-DOF EMG-driven models. This will help better address the important scientific questions previously approached using single-DOF EMG-driven modeling. Furthermore, it might have applications in the development of human-machine interfaces for assistive devices.

18.
Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
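Two of the metrics named in this abstract can be computed directly from a binary confusion matrix. The sketch below uses the definition of balanced accuracy given in the abstract (the average of true positive rate and true negative rate) and the standard formula for the Matthews correlation coefficient. The function name and the counts are hypothetical.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (balanced accuracy, Matthews correlation coefficient)."""
    tpr = tp / (tp + fn)  # true positive rate (sensitivity)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    balanced_accuracy = (tpr + tnr) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return balanced_accuracy, mcc

# Hypothetical imbalanced test set: 100 positives, 10 negatives,
# and a classifier that labels every case positive.
bal_acc, mcc = binary_metrics(tp=100, fn=0, tn=0, fp=10)
```

Plain accuracy for this degenerate classifier would be 100/110, about 0.91, while balanced accuracy is 0.5 and MCC is 0, which is why these metrics are preferred when class proportions are skewed or unknown.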

19.
Machine learning or deep learning models have been widely used for taxonomic classification of metagenomic sequences and many studies reported high classification accuracy. Such models are usually trained based on sequences in several training classes in hope of accurately classifying unknown sequences into these classes. However, when deploying the classification models on real testing data sets, sequences that do not belong to any of the training classes may be present and are falsely assigned to one of the training classes with high confidence. Such sequences are referred to as out-of-distribution (OOD) sequences and are ubiquitous in metagenomic studies. To address this problem, we develop a deep generative model-based method, MLR-OOD, that measures the probability of a testing sequence belonging to OOD by the likelihood ratio of the maximum of the in-distribution (ID) class conditional likelihoods and the Markov chain likelihood of the testing sequence measuring the sequence complexity. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves state-of-the-art performance, demonstrating the generality of MLR-OOD to various types of microbial data sets. It is also shown that MLR-OOD is robust to the GC content, which is a major confounding effect for OOD detection of genomic sequences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
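The likelihood ratio described in this abstract can be written in log space as the maximum ID class-conditional log-likelihood minus a Markov chain log-likelihood of the test sequence. The sketch below is an illustration under stated assumptions, not the MLR-OOD implementation: the class log-likelihoods are taken as given inputs, and the Markov chain is fitted by maximum likelihood to the test sequence itself. All names are hypothetical.

```python
import math
from collections import Counter

def markov_loglik(seq, order=1):
    """Log-likelihood of seq under a Markov chain of the given order whose
    transition probabilities are maximum-likelihood estimates from seq itself.
    The first `order` symbols are treated as given."""
    ctx_counts = Counter()
    trans_counts = Counter()
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]
        ctx_counts[ctx] += 1
        trans_counts[(ctx, seq[i])] += 1
    return sum(n * math.log(n / ctx_counts[ctx])
               for (ctx, _), n in trans_counts.items())

def mlr_score(class_logliks, seq, order=1):
    """Log likelihood ratio: best ID class log-likelihood minus the
    Markov chain log-likelihood of the sequence."""
    return max(class_logliks) - markov_loglik(seq, order)

# Hypothetical class-conditional log-likelihoods for one test sequence.
score = mlr_score([-12.0, -9.5], "ACACACAC")
```

Under this sketch, a low score would suggest that no training class explains the sequence well once its own complexity is accounted for.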

