Similar articles
Found 20 similar articles (search time: 31 ms)
1.
AIMS: Four bacterial source tracking (BST) methods, enterobacterial repetitive intergenic consensus sequence polymerase chain reaction (ERIC-PCR), automated ribotyping using HindIII, Kirby-Bauer antibiotic resistance analysis (KB-ARA) and pulsed-field gel electrophoresis (PFGE), were directly compared using the same collection of Escherichia coli isolates. The data sets from each BST method and from composite methods were compared for library accuracy and their ability to identify water isolates. METHODS AND RESULTS: Potential sources of faecal pollution were identified by watershed sanitary surveys. Domestic sewage and faecal samples from pets, cattle, avian livestock, other nonavian livestock, avian wildlife and nonavian wildlife sources were collected for isolation of E. coli. A total of 2275 E. coli isolates from 813 source samples were screened using ERIC-PCR to exclude clones and to maximize library diversity, resulting in 883 isolates from 745 samples selected for the library. The selected isolates were further analysed using automated ribotyping with HindIII, KB-ARA and PFGE. A total of 555 E. coli isolates obtained from 412 water samples were analysed by the four BST methods. A composite data set of the four BST methods gave the highest rates of correct classification (RCCs) and fewer unidentified isolates than any single method alone. RCCs for the four-method composite data set and a seven-way split of source classes ranged from 22% for avian livestock to 83% for domestic sewage. Two-method composite data sets were also found to be better than individual methods, having RCCs similar to the four-method composite and identification of the same major sources of faecal pollution. CONCLUSIONS: The use of BST composite data sets may be more beneficial than the use of single methods. SIGNIFICANCE AND IMPACT OF THE STUDY: This is one of the first comprehensive comparisons using composite data from several BST methods.
While the four-method approach provided the most desirable BST results, the use of two-method composite data sets may yield comparable BST results while providing for cost, labour and time savings.
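
As a rough illustration of how a per-source rate of correct classification (RCC) can be tallied, the Python sketch below compares known and assigned source classes for a handful of isolates. The function names, the toy labels and the use of an unweighted mean across sources are assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def rates_of_correct_classification(true_sources, assigned_sources):
    """Return {source: fraction of that source's isolates assigned correctly}."""
    totals = Counter(true_sources)
    correct = Counter(
        t for t, a in zip(true_sources, assigned_sources) if t == a
    )
    return {source: correct[source] / totals[source] for source in totals}


def mean_rcc(rcc_by_source):
    """Unweighted mean of the per-source RCCs (one possible summary)."""
    return sum(rcc_by_source.values()) / len(rcc_by_source)


# Hypothetical toy labels, not data from the study.
true_sources = ["sewage", "sewage", "sewage", "sewage",
                "cattle", "cattle", "pets", "pets"]
assigned = ["sewage", "sewage", "sewage", "cattle",
            "cattle", "pets", "pets", "pets"]

rcc = rates_of_correct_classification(true_sources, assigned)
for source, rate in rcc.items():
    print(f"{source}: {rate:.0%}")
print(f"mean RCC: {mean_rcc(rcc):.0%}")
```

Reporting the rate per source, as the abstract does, keeps a poorly identified class (such as avian livestock) from being hidden behind a well identified one (such as domestic sewage).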

2.
MOTIVATION: Novel methods, both molecular and statistical, are urgently needed to take advantage of recent advances in biotechnology and the human genome project for disease diagnosis and prognosis. Mass spectrometry (MS) holds great promise for biomarker identification and genome-wide protein profiling. It has been demonstrated in the literature that biomarkers can be identified to distinguish normal individuals from cancer patients using MS data. Such progress is especially exciting for the detection of early-stage ovarian cancer patients. Although various statistical methods have been utilized to identify biomarkers from MS data, there has been no systematic comparison among these approaches in their relative ability to analyze MS data. RESULTS: We compare the performance of several classes of statistical methods for the classification of cancer based on MS spectra. These methods include: linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor classifier, bagging and boosting classification trees, support vector machine, and random forest (RF). The methods are applied to ovarian cancer and control serum samples from the National Ovarian Cancer Early Detection Program clinic at Northwestern University Hospital. We found that RF outperforms other methods in the analysis of MS data.

3.
Classification is one of the most widely applied tasks in ecology. Ecologists have to deal with noisy, high-dimensional data that often are non-linear and do not meet the assumptions of conventional statistical procedures. To overcome this problem, machine-learning methods have been adopted as ecological classification methods. We compared five machine-learning based classification techniques (classification trees, random forests, artificial neural networks, support vector machines, and automatically induced rule-based fuzzy models) in a biological conservation context. The study case was that of the ocellated turkey (Meleagris ocellata), a bird endemic to the Yucatan peninsula that has suffered considerable decreases in local abundance and distributional area during the last few decades. On a grid of 10 × 10 km cells superimposed on the peninsula, we analysed relationships between environmental and social explanatory variables and ocellated turkey abundance changes between 1980 and 2000. Abundance was expressed in three (decrease, no change, and increase) and 14 more detailed abundance change classes, respectively. Modelling performance varied considerably between methods, with random forests and classification trees being the most efficient ones as measured by overall classification error and the normalised mutual information index. Artificial neural networks yielded the worst results along with linear discriminant analysis, which was included as a conventional statistical approach. We not only evaluated classification accuracy but also characteristics such as time effort, classifier comprehensibility and method intricacy, aspects that determine the success of a classification technique among ecologists and conservation biologists as well as for the communication with managers and decision makers.
We recommend the combined use of classification trees and random forests due to the easy interpretability of classifiers and the high comprehensibility of the method.

4.
AIMS: The accuracy of ribotyping and antibiotic resistance analysis (ARA) for prediction of sources of faecal bacterial pollution in an urban southern California watershed was determined using blinded proficiency samples. METHODS AND RESULTS: Antibiotic resistance patterns and HindIII ribotypes of Escherichia coli (n = 997), and antibiotic resistance patterns of Enterococcus spp. (n = 3657) were used to construct libraries from sewage samples and from faeces of seagulls, dogs, cats, horses and humans within the watershed. The three libraries were analysed to determine the accuracy of host source prediction. The internal accuracy of the libraries (average rate of correct classification, ARCC) with six source categories was 44% for E. coli ARA, 69% for E. coli ribotyping and 48% for Enterococcus ARA. Each library's predictive ability towards isolates that were not part of the library was determined using a blinded proficiency panel of 97 E. coli and 99 Enterococcus isolates. Twenty-eight per cent (by ARA) and 27% (by ribotyping) of the E. coli proficiency isolates were assigned to the correct source category. Sixteen per cent were assigned to the same source category by both methods, and 6% were assigned to the correct category. Addition of 2480 E. coli isolates to the ARA library did not improve the ARCC or proficiency accuracy. In contrast, 45% of Enterococcus proficiency isolates were correctly identified by ARA. CONCLUSIONS: None of the methods performed well enough on the proficiency panel to be judged ready for application to environmental samples. SIGNIFICANCE AND IMPACT OF THE STUDY: Most microbial source tracking (MST) studies published have demonstrated library accuracy solely by the internal ARCC measurement. Low rates of correct classification for E. coli proficiency isolates compared with the ARCCs of the libraries indicate that testing of bacteria from samples that are not represented in the library, such as blinded proficiency samples, is necessary to accurately measure predictive ability. The library-based MST methods used in this study may not be suited for determination of the source(s) of faecal pollution in large, urban watersheds.

5.
Linear discriminant analysis (LDA) is frequently used for classification/prediction problems in physical anthropology, but it is unusual to find examples where researchers consider the statistical limitations and assumptions required for this technique. In these instances, it is difficult to know whether the predictions are reliable. This paper considers a nonparametric alternative to predictive LDA: binary, recursive (or classification) trees. This approach has the advantage that data transformation is unnecessary, cases with missing predictor variables do not require special treatment, prediction success is not dependent on data meeting normality conditions or covariance homogeneity, and variable selection is intrinsic to the methodology. Here I compare the efficacy of classification trees with LDA, using typical morphometric data. With data from modern hominoids, the results show that both techniques perform nearly equally. With complete data sets, LDA may be a better choice, as is shown in this example, but with missing observations, classification trees perform outstandingly well, whereas commercial discriminant analysis programs do not predict classifications for cases with incompletely measured predictor variables and generally are not designed to address the problem of missing data. Testing of data prior to analysis is necessary, and classification trees are recommended either as a replacement for LDA or as a supplement whenever data do not meet relevant assumptions. Classification trees are highly recommended as an alternative to LDA whenever the data set contains important cases with missing predictor variables.

6.
MOTIVATION: The desire to compare molecular phylogenies has stimulated the design of numerous tests. Most of these tests are formulated in a frequentist framework, and it is not known how they compare with Bayes procedures. I propose here two new Bayes tests that either compare pairs of trees (Bayes hypothesis test, BHT), or test each tree against an average of the trees included in the analysis (Bayes significance test, BST). RESULTS: The algorithm, based on a standard Metropolis-Hastings sampler, integrates nuisance parameters out and estimates the probability of the data under each topology. These quantities are used to estimate Bayes factors for composite vs. composite hypotheses. Based on two data sets, the BHT and BST are shown to construct similar confidence sets to the bootstrap and the Shimodaira-Hasegawa test, respectively. This suggests that the known difference among previous tests is mainly due to the null hypothesis considered.
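
To make the role of the per-topology data probabilities concrete, the Python sketch below forms a pairwise log Bayes factor from two log marginal likelihoods and, assuming equal prior weight on each topology, converts a set of them into posterior probabilities. The numeric values are hypothetical, the equal-prior choice is an assumption, and the code does not reproduce the paper's Metropolis-Hastings estimator or its BST averaging scheme.

```python
import math


def log_bayes_factor(log_ml_a, log_ml_b):
    """Log Bayes factor of topology A over topology B.

    Inputs are log marginal likelihoods, i.e. log P(data | topology) with
    nuisance parameters already integrated out.
    """
    return log_ml_a - log_ml_b


def posterior_probabilities(log_mls):
    """Posterior topology probabilities under a uniform prior (assumption).

    The largest value is subtracted before exponentiating to avoid underflow.
    """
    largest = max(log_mls)
    weights = [math.exp(value - largest) for value in log_mls]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical log marginal likelihoods for three candidate trees.
log_mls = [-1000.0, -1002.0, -1005.0]

print(log_bayes_factor(log_mls[0], log_mls[1]))
print([round(p, 4) for p in posterior_probabilities(log_mls)])
```

Working on the log scale matters here because marginal likelihoods of sequence alignments are far too small to represent directly as floating-point numbers.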

7.
OBJECTIVE: To study the discriminatory capacity of textural variables to classify the nuclei of breast tumor cells as benign or malignant, using a statistical approach. STUDY DESIGN: Image analysis techniques were used to automatically segment nuclei of cells obtained by fine needle aspiration and Papanicolaou stained. The sample comprised 95 cases of malignant lesions and 47 cases of benign lesions (approximately 25 nuclei per case), and 27 textural variables were measured. Two methods were used to analyze the data: classification and regression trees (CART) and discriminant analysis. RESULTS: The variance in gray levels was the most decisive variable in the CART analysis, correctly classifying 57% and 97% of benign and malignant cases, respectively. Discriminant analysis yielded the best results, correctly classifying 79% and 85% of benign and malignant cases, respectively. CONCLUSION: The classifier obtained by a statistical approach to the textural analysis of Papanicolaou-stained nuclei did not prove useful for diagnostic discrimination. Staining techniques that are not chromatin specific are highly variable, and other features have proven more effective with this type of staining.

8.
Ecologists collect their data manually by visiting multiple sampling sites. Since there can be multiple species at these sites, manually classifying them can be a daunting task. Much of the literature has focused on statistical methods for classifying single species, and very few studies address the classification of multiple species. In addition, we noted that classifying multiple species results in a multi-class imbalance problem. This study proposes a machine learning approach to classify multiple species in population ecology. In particular, bagging (random forests (RF) and bagging classification trees (bagCART)) and boosting (boosting classification trees (bootCART), gradient boosting machines (GBM) and adaptive boosting classification trees (AdaBoost)) classifiers were evaluated for their performance on an imbalanced dataset of multiple fish species. The recall and F1-score performance metrics were used to select the best classifier for the dataset. The bagging classifiers (RF and bagCART) achieved high performance on the imbalanced dataset, while the boosting classifiers (bootCART, GBM and AdaBoost) achieved lower performance. We found that some machine learning classifiers were sensitive to the imbalanced dataset and hence require data resampling to improve their performance. After resampling, the bagging classifiers (RF and bagCART) had high performance compared to the boosting classifiers (bootCART, GBM and AdaBoost). The strong performance of the bagging classifiers suggests that they can be used for classifying multiple species in ecological studies.
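
The choice of recall and F1-score over plain accuracy can be illustrated with a short Python sketch. It computes both metrics per class for a small, deliberately imbalanced label set; the species labels and predictions are hypothetical toy values, and the helper name is an assumption rather than anything from the study.

```python
def per_class_recall_f1(y_true, y_pred):
    """Return {class: (recall, f1)} computed one class at a time."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[cls] = (recall, f1)
    return scores


# Hypothetical imbalanced toy data: 8 common vs 2 rare specimens.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_recall_f1(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for cls, (recall, f1) in scores.items():
    print(f"{cls}: recall={recall:.2f}, F1={f1:.2f}")
```

In this toy case overall accuracy looks high even though only half of the rare class is recovered, which is the kind of masking that per-class recall and F1 are meant to expose.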

9.
SELDI-TOF-MS is rapidly gaining popularity as a screening tool for clinical applications of proteomics. Application of adequate statistical techniques in all the stages from measurement to information is obligatory. One of the statistical methods often used in proteomics is classification: the assignment of subjects to discrete categories, for example healthy or diseased. Lately, many new classification methods have been developed, often specifically for the analysis of X-omics data. For proteomics studies a good strategy for evaluating classification results is of prime importance, because usually the number of objects will be small and it would be wasteful to set aside part of these as a 'mere' test set. The present paper offers such a strategy in the form of a protocol which can be used for choosing among different statistical classification methods and obtaining figures of merit of their performance. This paper also illustrates the usefulness of proteomics in a clinical setting (serum samples from Gaucher disease patients) when used in combination with an appropriate classification method.

10.
Vocal individuality varies between species and/or ontogenetic stages depending on the need for vocal recognition, but the estimation of individual differences also depends on the method of analysis. We studied pair-specific differences in duets produced by mating pairs of the Siberian crane Grus leucogeranus. We quantitatively described the duet structure and compared visual and statistical classification methods of pair identification by duet. Three methods were used: discriminant analysis, classification trees and visual classification of spectrograms. We found significant interpair differences. The pairs differ in duet structure, that is, in the ratio of male- to female-initiated duets and in the ratio of the number of male to female calls; the temporal-frequency characteristics of duets are pair-specific, too. All methods showed high interpair differences, which exceeded random values significantly. A stepwise discriminant analysis based on 11 parameters assigned 97.3% of duets correctly. Human observers correctly assigned 80.7% of spectrograms. Our data provide a basis for remote monitoring of this endangered species with a wild population of only 3,000 birds.

11.
Efron-type measures of prediction error for survival analysis (cited by: 3; self-citations: 0; citations by others: 3)
Gerds TA, Schumacher M. Biometrics, 2007, 63(4): 1283-1287
Estimates of the prediction error play an important role in the development of statistical methods and models, and in their applications. We adapt the resampling tools of Efron and Tibshirani (1997, Journal of the American Statistical Association 92, 548-560) to survival analysis with right-censored event times. We find that flexible rules, like artificial neural nets, classification and regression trees, or regression splines, can be assessed and compared to less flexible rules in the same data where they are developed. The methods are illustrated with data from a breast cancer trial.

12.
A CART-based approach to discover emerging patterns in microarray data (cited by: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Cancer diagnosis using gene expression profiles requires supervised learning and gene selection methods. Of the many suggested approaches, the method of emerging patterns (EPs) has the particular advantage of explicitly modeling interactions among genes, which improves classification accuracy. However, finding useful (i.e. short and statistically significant) EPs is typically very hard. METHODS: Here we introduce a CART-based approach to discover EPs in microarray data. The method is based on growing decision trees from which the EPs are extracted. This approach combines pattern search with a statistical procedure based on Fisher's exact test to assess the significance of each EP. Subsequently, sample classification based on the inferred EPs is performed using maximum-likelihood linear discriminant analysis. RESULTS: Using simulated data as well as gene expression data from colon and leukemia cancer experiments we assessed the performance of our pattern search algorithm and classification procedure. In the simulations, our method recovers a large proportion of known EPs while for real data it is comparable in classification accuracy with three top-performing alternative classification algorithms. In addition, it assigns statistical significance to the inferred EPs and allows the patterns to be ranked while simultaneously avoiding overfitting of the data. The new approach therefore provides a versatile and computationally fast tool for elucidating local gene interactions as well as for classification. AVAILABILITY: A computer program written in the statistical language R implementing the new approach is freely available from the web page http://www.stat.uni-muenchen.de/~socher/
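
As a hedged sketch of how Fisher's exact test can attach a significance level to a single pattern, the Python code below computes a one-sided p-value for a 2 × 2 table of pattern presence against sample class. The table layout, the one-sided alternative and the counts are assumptions chosen for illustration; the authors' program is written in R and its internals are not shown in the abstract.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    where X is the count in the top-left cell.
    """
    row1 = a + b          # e.g. samples in class 1
    col1 = a + c          # e.g. samples in which the pattern is present
    n = a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, col1)


# Hypothetical counts: a pattern present in 8 of 10 tumour samples
# and in 1 of 10 normal samples.
p_value = fisher_exact_greater(8, 2, 1, 9)
print(f"one-sided p = {p_value:.5f}")
```

Because the test is exact, it stays valid for the small sample counts typical of microarray studies, where a chi-squared approximation would be unreliable.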

13.
Statistical methods for microarray assays (cited by: 1; self-citations: 0; citations by others: 1)
This paper briefly reviews statistical methods used in the area of DNA microarray studies. All stages of the experiment are taken into account: planning, data collection, data preprocessing, analysis and validation. Among the methods of data analysis, the algorithms for estimating differential expression, multivariate approaches, clustering methods, as well as classification and discrimination are reviewed. The need is stressed for routine statistical data processing protocols and for exploring links between microarray data analysis and quantitative genetic models.

14.
Biochemical systems analysis of genome-wide expression data (cited by: 6; self-citations: 0; citations by others: 6)
MOTIVATION: Modern methods of genomics have produced an unprecedented amount of raw data. The interpretation and explanation of these data constitute a major, well-recognized challenge. RESULTS: Biochemical Systems Theory (BST) is the mathematical basis of a well-established methodological framework for analyzing networks of biochemical reactions. An existing BST model of yeast glycolysis is used here to explain and interpret the glycolytic gene expression pattern of heat shocked yeast. Our analysis demonstrates that the observed gene expression profile satisfies the primary goals of increased ATP, trehalose, and NADPH production, while maintaining intermediate metabolites at reasonable levels. Based on a systematic exploration of alternative, hypothetical expression profiles, we show that the observed profile outperforms other profiles. CONCLUSION: BST is a useful framework for combining DNA microarray data with enzymatic process information to yield new insights into metabolic pathway regulation. AVAILABILITY: All analyses were executed with the software PLAS©, which is freely available at http://correio.cc.fc.ul.pt/~aenf/plas.html for academic use. CONTACT: VoitEO@MUSC.edu

15.
Human microbiome research characterizes the microbial content of samples from human habitats to learn how interactions between bacteria and their host might impact human health. In this work a novel parametric statistical inference method based on object-oriented data analysis (OODA) for analyzing HMP data is proposed. OODA is an emerging area of statistical inference where the goal is to apply statistical methods to objects such as functions, images, and graphs or trees. The data objects that pertain to this work are taxonomic trees of bacteria built from analysis of 16S rRNA gene sequences (e.g. using RDP); there is one such object for each biological sample analyzed. Our goal is to model and formally compare a set of trees. The contribution of our work is threefold: first, a weighted tree structure to analyze RDP data is introduced; second, using a probability measure to model a set of taxonomic trees, we introduce an approximate MLE procedure for estimating model parameters and we derive LRT statistics for comparing the distributions of two metagenomic populations; and third the Jumpstart HMP data is analyzed using the proposed model providing novel insights and future directions of analysis.

16.
A major task in the statistical analysis of genetic data such as gene expressions and single nucleotide polymorphisms (SNPs) is to predict whether a patient has a certain disease, or from which of several known subtypes of a disease a patient suffers. A large number of discrimination methods have been proposed in the literature and have been applied to genetic data to tackle this task. In this paper, we give an overview on the most popular of these procedures in the analysis of genetic data. Moreover, we describe how these methods for supervised classification can be combined with variable selection approaches to reduce the number of genetic features from several thousands to as few as possible to form a concise classification rule. Finally, we show how the resulting statistical models can be validated. (© 2008 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)

17.
Arachidonic acid (ARA, 5,8,11,14-cis-eicosatetraenoic acid) is widely used in medicine, pharmaceutics, cosmetics, dietary nutrition, agriculture, and other fields. Microbiological production of ARA is of increased interest since the natural sources (pig liver, adrenal glands, and egg yolk) cannot satisfy the growing demand. Mechanisms for ARA biosynthesis as well as the regulation of enzymes involved in this process are considered. This review summarizes literature data concerning individual stages of microbiological ARA production, methods for screening active producer strains, physiological regulation of ARA synthesis in micromycetes (the effect of growth phase, medium composition, pH, temperature, and aeration), and effective technologies of fermentation and product recovery. Information on the whole biotechnological process from strain selection to the ARA yield improvement and purification of the end product is presented.

18.
Question: (1) Which remote sensing classifications most successfully identify aspen using multitemporal Landsat 5 TM images and airborne lidar data? (2) How has aspen distribution changed in southwestern Idaho? (3) Are topographic variables and conifer encroachment correlated with aspen changes? Location: Reynolds Creek Experimental Watershed in southwestern Idaho, USA. Methods: Multi-temporal Landsat 5 TM and lidar data were used individually and fused together. The best classification model was compared with a 1965 aspen map and tree ring data. Conifer encroachment was examined via image-based change detection and field mapping. Lidar-derived topographic variables were correlated with aspen change patterns using quantile regression models. Results: The best Landsat 5 TM classification was a normalized difference vegetation index (NDVI)-based approach with 92% overall accuracy. The lidar classification of tree presence/absence performed with 100% overall accuracy. Fusing the lidar classification with various Landsat 5 TM classifications improved overall accuracies by 3 to 6%. Among the fusion models, the NDVI-lidar fusion performed best with 96% overall accuracy. Change detection indicated a 69% decline in aspen cover, but a 179% increase in aspen cover in other areas of the watershed. Conifers have completely replaced 17% of the aspen, while 93% of the remaining aspen stands have young Douglas-fir and western juniper trees underneath the aspen canopy. Aspen significantly decreased (P-values <0.05) with increasing elevation (up to 2150 m) and decreasing slope. Conclusions: Landsat 5 TM data used with an NDVI-based approach provide an accurate method to classify aspen distribution. Landsat 5 TM classifications can be further improved via fusion with lidar data. Aspen change patterns are spatially variable: while aspen is drastically declining in some parts of this watershed, aspen is increasing in other areas.
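
For readers unfamiliar with the index, NDVI is defined as (NIR − Red) / (NIR + Red), which for Landsat 5 TM corresponds to bands 4 and 3. The Python sketch below computes it for a few pixels, applies a threshold and scores the result as overall accuracy. The reflectance values, the 0.5 threshold and the reference labels are hypothetical; the study's actual multitemporal classification rule is not described in the abstract.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def overall_accuracy(predicted, reference):
    """Fraction of pixels whose predicted class matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical (NIR, red) surface reflectances for four pixels.
pixels = [(0.50, 0.08), (0.45, 0.10), (0.30, 0.20), (0.25, 0.22)]
threshold = 0.5  # assumed cut-off, chosen only for this example

predicted = [ndvi(nir, red) > threshold for nir, red in pixels]
reference = [True, False, False, False]  # hypothetical field labels

print([round(ndvi(nir, red), 3) for nir, red in pixels])
print(f"overall accuracy: {overall_accuracy(predicted, reference):.2f}")
```

A single-date threshold like this cannot separate aspen from other green vegetation on its own, which is presumably why the study combines multitemporal imagery with lidar-derived tree presence.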

19.
Statistical inferences in phylogeography (cited by: 2; self-citations: 0; citations by others: 2)
In conventional phylogeographic studies, historical demographic processes are elucidated from the geographical distribution of individuals represented on an inferred gene tree. However, the interpretation of gene trees in this context can be difficult as the same demographic/geographical process can randomly lead to multiple different genealogies. Likewise, the same gene trees can arise under different demographic models. This problem has led to the emergence of many statistical methods for making phylogeographic inferences. A popular phylogeographic approach based on nested clade analysis is challenged by the fact that a certain amount of the interpretation of the data is left to the subjective choices of the user, and it has been argued that the method performs poorly in simulation studies. More rigorous statistical methods based on coalescence theory have been developed. However, these methods may also be challenged by computational problems or poor model choice. In this review, we will describe the development of statistical methods in phylogeographic analysis, and discuss some of the challenges facing these methods.

20.
The evolution of five island populations of Green gecko, representing inter- and intra-specific divergence, was studied using biochemical data, scalation and shape. The data were numerically analysed using ordination analyses for the phenetic classification and Wagner trees to hypothesize the phylogeny. These studies revealed three phenetic groups corresponding to three monophyletic lineages. The numerical analysis of morphological data agreed with the numerical analysis of biochemical data. It is concluded that the classification based on biochemical affinities differed from the previous classification based on conventional analysis of morphology due to methodological and philosophical differences rather than differences between morphological and biochemical evolution.
The ordination analyses were very congruent between data sets (biochemical, shape, scalation, total) and the Wagner trees were generally congruent between data sets. Some Wagner trees based on scalation data were incongruent. The phenetic and cladistic classifications corresponded to each other but differed from the conventional classification. The phylogenetic analysis of the total data set indicated that the three specific lineages showed relatively equal anagenesis. However, anagenic divergence differed markedly between character types. It is suggested that a range of character types be used when studying anagenesis.
