Similar Articles
20 similar articles retrieved.
1.
MOTIVATION: Normalization of microarray data is essential for multiple-array analyses. Several normalization protocols have been proposed based on different biological or statistical assumptions. A fundamental question is whether a given protocol has effectively normalized an array and, for a given array, how to choose the method that normalizes the data most effectively. RESULTS: We propose several techniques to compare the effectiveness of different normalization methods. We approach the problem by constructing statistics to test whether there are any systematic biases in the expression profiles among duplicated spots within an array. The test statistics involve estimating the genewise variances. This is accomplished by using several novel methods, including empirical Bayes methods for moderating the genewise variances and smoothing methods for aggregating variance information. P-values are estimated based on a normal or chi approximation. With estimated P-values, we can choose the most appropriate method to normalize a specific array and assess the extent to which systematic biases due to variations in experimental conditions have been removed. The effectiveness and validity of the proposed methods are convincingly illustrated by a carefully designed simulation study. The method is further illustrated by applications to human placenta cDNAs comprising a large number of clones with replications, a customized microarray experiment carrying just a few hundred genes in a study of the molecular roles of interferons in tumors, and the Agilent microarrays carrying tens of thousands of probes hybridized with total RNA samples in the MAQC project on the reproducibility, sensitivity and specificity of the data. AVAILABILITY: Code implementing the method in the statistical package R is available from the authors.
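A minimal R sketch (not the authors' implementation, which is available from them) of the kind of duplicate-spot bias check the abstract describes; the simulated data, the single pooled standard deviation and the normal approximation are all simplifications of the paper's empirical Bayes treatment.

```r
## Minimal sketch (not the authors' implementation): check an array for a
## systematic bias between duplicated spots after a given normalization.
## `dup` is assumed to be a genes x 2 matrix of normalized log-intensities,
## one column per duplicate spot of the same clone; here it is simulated.
set.seed(1)
dup <- matrix(rnorm(2000), ncol = 2)

d <- dup[, 1] - dup[, 2]                 # within-array duplicate differences
z <- sqrt(length(d)) * mean(d) / sd(d)   # standardized mean difference
p <- 2 * pnorm(-abs(z))                  # normal-approximation P-value
p                                        # small p suggests residual systematic bias

## The paper moderates genewise variances (empirical Bayes, smoothing) rather
## than using the single pooled standard deviation above, and uses the resulting
## P-values to choose among candidate normalization methods for each array.
```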

2.

Background

In diagnostic studies, a single and error-free test that can be used as the reference (gold) standard often does not exist. One solution is the use of panel diagnosis, i.e., a group of experts who assess the results from multiple tests to reach a final diagnosis in each patient. Although panel diagnosis, also known as consensus or expert diagnosis, is frequently used as the reference standard, guidance on preferred methodology is lacking. The aim of this study is to provide an overview of methods used in panel diagnoses and to provide initial guidance on the use and reporting of panel diagnosis as reference standard.

Methods and Findings

PubMed was systematically searched for diagnostic studies applying a panel diagnosis as the reference standard, published up to May 31, 2012. We included diagnostic studies in which the final diagnosis was made by two or more persons based on results from multiple tests. General study characteristics and details of panel methodology were extracted. Eighty-one studies were included, most of which reported on psychiatric (37%) and cardiovascular (21%) diseases. Data extraction was hampered by incomplete reporting; one or more pieces of critical information about the panel reference standard methodology were missing in 83% of studies. In most studies (75%), the panel consisted of three or fewer members. Panel members were blinded to the index test results in 31% of studies. Reproducibility of the decision process was assessed in 17 (21%) studies. Reported details on panel constitution, the information available for diagnosis and the methods of decision making varied considerably between studies.

Conclusions

Methods of panel diagnosis varied substantially across studies and many aspects of the procedure were either unclear or not reported. On the basis of our review, we identified areas for improvement and developed a checklist and flow chart to provide initial guidance for researchers conducting and reporting studies that involve panel diagnosis.

3.
MOTIVATION: Current methods for multiplicity adjustment do not make use of the graph structure of Gene Ontology (GO) when testing for association of the expression profiles of GO terms with a response variable. RESULTS: We propose a multiple testing method, called the focus level procedure, that preserves the graph structure of GO. The procedure is constructed as a combination of a closed testing procedure with Holm's method. It requires the user to choose a 'focus level' in the GO graph, which reflects the level of specificity of the terms in which the user is most interested. This choice also determines the level in the GO graph at which the procedure has most power. We prove that the procedure strongly controls the family-wise error rate without any additional assumptions on the joint distribution of the test statistics used. We also present an algorithm to calculate multiplicity-adjusted P-values. Because the focus level procedure preserves the structure of the GO graph, it does not generally preserve the ordering of the raw P-values in the adjusted P-values. AVAILABILITY: The focus level procedure has been implemented in the globaltest and GlobalAncova packages, both of which are available from www.bioconductor.org.
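The full focus level procedure lives in the globaltest and GlobalAncova Bioconductor packages named in the abstract; as a point of reference, the Holm step it combines with closed testing is available in base R. A small sketch with hypothetical GO-term P-values:

```r
## Holm's step-down adjustment, the ingredient the focus level procedure
## combines with closed testing over the GO graph (hypothetical raw P-values).
p.raw <- c(GO.A = 0.001, GO.B = 0.004, GO.C = 0.012, GO.D = 0.030, GO.E = 0.200)
p.adjust(p.raw, method = "holm")
## The full procedure, including the choice of focus level, is provided by the
## globaltest and GlobalAncova packages on Bioconductor.
```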

4.
The analysis of microarray data often involves performing a large number of statistical tests, usually at least one test per queried gene. Each test has a certain probability of reaching an incorrect inference; therefore, it is crucial to estimate or control error rates that measure the occurrence of erroneous conclusions when reporting and interpreting the results of a microarray study. In recent years, many innovative statistical methods have been developed to estimate or control various error rates for microarray studies, and researchers need guidance in choosing the appropriate statistical methods for analysing these types of data sets. This review describes a family of methods that use a set of P-values to estimate or control the false discovery rate and similar error rates. Finally, these methods are classified in a manner that suggests the appropriate method for specific applications, and diagnostic procedures that can identify problems in the analysis are described.
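As one concrete member of the family of P-value-based methods reviewed (not any specific method from the review), the Benjamini-Hochberg adjustment in base R applied to hypothetical P-values:

```r
## Benjamini-Hochberg FDR control in base R, as one concrete member of the
## family of P-value-based methods reviewed (hypothetical P-values).
set.seed(2)
p <- c(runif(900), rbeta(100, 0.5, 20))   # 900 null genes plus 100 with signal
p.bh <- p.adjust(p, method = "BH")        # adjusted P-values
sum(p.bh < 0.05)                          # genes declared significant at 5% FDR
```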

5.
We consider the problematic relationship between publication success and statistical significance in the light of analyses in which we examine the distribution of published probability (P) values across the statistical 'significance' range, below the 5% probability threshold. P-values are often judged according to whether they lie beneath traditionally accepted thresholds (<0.05, <0.01, <0.001, <0.0001); we examine how these thresholds influence the distribution of reported absolute P-values in published scientific papers, the majority in the biological sciences. We collected published P-values from three leading journals and summarized their distribution using the frequencies falling across and within these four threshold values between 0.05 and 0. These published frequencies were then fitted to three complementary null models, which allowed us to predict the expected proportions of P-values in the top and bottom half of each inter-threshold interval (i.e. those lying just below, as opposed to just above, each P-value threshold). Statistical comparison of these predicted proportions against those actually observed provides the first empirical evidence for a remarkable excess of probability values being cited on, or just below, each threshold relative to the smoothed theoretical distributions. The pattern is consistent across thresholds and journals, and for whichever theoretical approach is used to generate our expected proportions. We discuss this novel finding and its implications for solving the problems of publication bias and selective reporting in evolutionary biology.
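A simplified sketch of this kind of threshold check, assuming only that the P-value density is roughly flat across a narrow window around the threshold (the paper instead fits smoothed null distributions); the "published" P-values are simulated:

```r
## Simplified threshold check: compare counts in narrow windows just below and
## just above a threshold, assuming the P-value density is locally flat so the
## two counts are expected to be equal. Simulated "published" P-values with an
## excess planted just below 0.05.
set.seed(3)
p.pub <- c(runif(500, 0, 0.1), runif(40, 0.045, 0.05))

threshold <- 0.05
delta     <- 0.005
below <- sum(p.pub > threshold - delta & p.pub <= threshold)
above <- sum(p.pub > threshold & p.pub <= threshold + delta)
binom.test(below, below + above, p = 0.5)   # excess mass just below the threshold?
```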

6.
Dalmasso C, Génin E, Trégouet DA. Genetics. 2008;180(1):697-702
In the context of genome-wide association studies, where hundreds of thousands of polymorphisms are tested, stringent thresholds on the raw association test P-values are generally used to limit false-positive results. Instead of using thresholds based on raw P-values, as in the Bonferroni and sequential Sidak (SidakSD) corrections, we propose here to use a weighted-Holm procedure with weights depending on the allele frequency of the polymorphisms. This method is shown to substantially improve the power to detect associations, in particular by favoring the detection of rare variants with high genetic effects over more frequent ones with lower effects.
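A sketch of a generic weighted-Holm procedure in R; the allele-frequency-based weights below are an illustrative choice, not the paper's exact weighting scheme.

```r
## Generic weighted-Holm procedure: order hypotheses by p/w and reject while
## p_(j) <= w_(j) * alpha / (sum of weights not yet rejected). The weights used
## below, favoring rarer variants, are only an illustrative choice.
weighted.holm <- function(p, w, alpha = 0.05) {
  stopifnot(length(p) == length(w), all(w > 0))
  reject    <- logical(length(p))
  remaining <- sum(w)
  for (j in order(p / w)) {
    if (p[j] <= w[j] * alpha / remaining) {
      reject[j] <- TRUE
      remaining <- remaining - w[j]
    } else break
  }
  reject
}

set.seed(4)
maf <- runif(10, 0.01, 0.5)              # minor allele frequencies of 10 SNPs
p   <- c(1e-7, 2e-6, runif(8))           # association test P-values
w   <- (1 / maf) / sum(1 / maf)          # illustrative weights favoring rare variants
weighted.holm(p, w)
```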

7.
The impacts of sediment contaminants can be evaluated by different lines of evidence, including toxicity tests and ecological community studies. Responses from 10 different toxicity assays/tests were combined to arrive at a “site score.” We employed a relatively simple summary measure, pooled P-values, in which we quantify a potential decrement in response at a contaminated site relative to nominally clean reference sites. The response-specific P-values were defined relative to a “null” distribution of responses at the reference sites and were then pooled using standard meta-analytic methods. Ecological community data were also evaluated using an analogous strategy. A distribution of distances of the reference sites from the centroid of the reference sites was obtained. The distance of each test site from this centroid was then calculated, and the proportion of reference distances that exceed the test-site distance was used to define an empirical P-value for that test site. A plot of the toxicity P-value versus the community P-value was used to identify sites based on both alteration in community structure and toxicity, that is, by weight of evidence. This approach provides a useful strategy for examining multiple lines of evidence that should be accessible to the broader scientific community. The use of a large collection of reference sites to empirically define P-values is appealing in that parametric distributional assumptions are avoided, although this comes at the cost of assuming the reference sites provide an appropriate comparison group for the test sites.
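A sketch of the two building blocks described, empirical reference-based P-values and meta-analytic pooling via Fisher's method, on hypothetical assay data; the paper's full site-score construction is not reproduced.

```r
## Building blocks only: (1) an empirical P-value for a test-site response
## relative to the reference-site distribution, and (2) Fisher's method for
## pooling the response-specific P-values (which assumes they are independent).
## Hypothetical data; lower responses are taken to indicate greater toxicity.
empirical.p <- function(test.value, reference.values) {
  (sum(reference.values <= test.value) + 1) / (length(reference.values) + 1)
}
fisher.pool <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

set.seed(5)
ref  <- matrix(rnorm(30 * 10, mean = 100, sd = 10), nrow = 30)  # 30 reference sites x 10 assays
test <- rnorm(10, mean = 88, sd = 10)                           # responses at one test site

p.each <- mapply(function(x, j) empirical.p(x, ref[, j]), test, seq_along(test))
fisher.pool(p.each)    # pooled evidence of a decrement across the 10 assays
```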

8.
Analyzing gene expression data in terms of gene sets: methodological issues (total citations: 3; self-citations: 0; citations by others: 3)
MOTIVATION: Many statistical tests have been proposed in recent years for analyzing gene expression data in terms of gene sets, usually from Gene Ontology. These methods are based on widely different methodological assumptions. Some approaches test differential expression of each gene set against differential expression of the rest of the genes, whereas others test each gene set on its own. Also, some methods are based on a model in which the genes are the sampling units, whereas others treat the subjects as the sampling units. This article aims to clarify the assumptions behind the different approaches and to indicate a preferred methodology for gene set testing. RESULTS: We identify some crucial assumptions that are needed by the majority of methods. P-values derived from methods that use a model taking the genes as the sampling units are easily misinterpreted, as they are based on a statistical model that does not resemble the biological experiment actually performed. Furthermore, because these models rely on a crucial and unrealistic assumption of independence between genes, the P-values derived from them can be wildly anti-conservative, as a simulation experiment shows. We also argue that methods that competitively test each gene set against the rest of the genes create an unnecessary rift between single-gene testing and gene set testing.
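A small simulation in the spirit of the one mentioned, showing how a test that treats genes as independent sampling units becomes anti-conservative when genes share subject-level correlation; this is not the authors' simulation design.

```r
## Small simulation in the spirit of the one mentioned: with no true group
## difference but correlation between genes (a shared subject-level effect),
## a t-test that treats genes as independent sampling units rejects far more
## often than its nominal 5% level.
set.seed(6)
n.subj <- 10; n.gene <- 50; n.sim <- 2000
type1 <- replicate(n.sim, {
  subj.effect <- rnorm(n.subj)                              # shared subject-level noise
  expr  <- matrix(rnorm(n.subj * n.gene), n.subj) + subj.effect
  group <- rep(c(0, 1), each = n.subj / 2)                  # labels carry no true effect
  per.gene.diff <- colMeans(expr[group == 1, ]) - colMeans(expr[group == 0, ])
  t.test(per.gene.diff)$p.value < 0.05                      # genes as "sampling units"
})
mean(type1)   # empirical type I error, well above 0.05
```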

9.
Surveillance is critical to mounting an appropriate and effective response to pandemics. However, aggregated case report data suffer from reporting delays and can lead to misleading inferences. Unlike aggregated case reports, line list data form a table containing individual features, such as the dates of symptom onset and of reporting for each reported case, and are a good source for modeling delays. Current methods for modeling reporting delays are not particularly appropriate for line list data, which typically have missing symptom onset dates that are non-ignorable for modeling reporting delays. In this paper, we develop a Bayesian approach that dynamically integrates imputation and estimation for line list data. Specifically, this Bayesian approach can accurately estimate the epidemic curve and instantaneous reproduction numbers, even when most symptom onset dates are missing. The approach is also robust to deviations from model assumptions, such as changes in the reporting delay distribution or incorrect specification of the maximum reporting delay. We apply the approach to COVID-19 line list data from Massachusetts and find that the reproduction number estimates correspond more closely to the control measures than estimates based on the reported curve.
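For illustration only, a single-imputation sketch of the basic idea of filling missing onset dates from the observed delay distribution; the paper's method is a joint Bayesian model that propagates imputation uncertainty, which this sketch does not do.

```r
## Illustration only (not the paper's Bayesian method): single imputation of
## missing onset dates by resampling reporting delays observed among complete
## cases, then tabulating the onset-based epidemic curve. Hypothetical line list;
## a full analysis would propagate the imputation uncertainty.
set.seed(7)
n      <- 500
onset  <- as.Date("2020-03-01") + rpois(n, 20)      # symptom onset dates
report <- onset + rpois(n, 5)                       # reporting dates
onset[sample(n, 300)] <- NA                         # most onset dates missing

obs.delay <- as.numeric(report - onset)[!is.na(onset)]     # delays from complete cases
miss      <- is.na(onset)
onset.imp <- onset
onset.imp[miss] <- report[miss] - sample(obs.delay, sum(miss), replace = TRUE)
head(table(onset.imp))                              # imputed onset-based epidemic curve
```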

10.
Alves G, Yu YK. PLoS ONE. 2011;6(8):e22647
Given the expanding availability of scientific data and tools to analyze them, combining different assessments of the same piece of information has become increasingly important for the social, biological, and even physical sciences. This task demands, to begin with, a method-independent standard, such as the P-value, that can be used to assess the reliability of a piece of information. Good's formula and Fisher's method combine independent P-values with, respectively, unequal and equal weights. Both approaches may be regarded as limiting instances of a general case of combining P-values from m groups: P-values within each group are weighted equally, while the weight varies by group. When some of the weights become nearly degenerate, as cautioned by Good, numerical instability occurs in the computation of the combined P-values. We deal explicitly with this difficulty by deriving a controlled expansion, in powers of differences in inverse weights, that provides both accurate statistics and stable numerics. We illustrate the utility of this systematic approach with a few examples. In addition, we provide an alternative derivation for the probability distribution function of the general case and show how the analytic formula obtained reduces to both Good's and Fisher's methods as special cases. A C++ program, which computes the combined P-values with equal numerical stability regardless of whether the weights are (nearly) degenerate or not, is available for download from our group website http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/CoinedPValues.html.
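The two limiting cases can be written down directly in R; Good's formula is coded below in its textbook closed form, which is exactly where cancellation errors appear as weights approach one another, and the paper's stabilized expansion is not reproduced.

```r
## Fisher's method (equal weights) and Good's formula (distinct weights) for
## combining independent P-values. Good's formula is written in its closed form,
## which is precisely where cancellation errors appear as weights approach one
## another; the paper's stabilized expansion is not reproduced here.
fisher.combine <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}
good.combine <- function(p, w) {          # requires all weights distinct
  stopifnot(length(unique(w)) == length(w))
  s <- sum(-w * log(p))                   # weighted sum of -log P-values
  lambda <- sapply(seq_along(w), function(i) w[i]^(length(w) - 1) / prod(w[i] - w[-i]))
  sum(lambda * exp(-s / w))
}

p <- c(0.01, 0.20, 0.03)
fisher.combine(p)
good.combine(p, w = c(1, 2, 3))
good.combine(p, w = c(1, 1 + 1e-8, 1 + 2e-8))   # nearly degenerate weights: unstable
```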

11.
Modeling correlated or highly stratified multiple-response data is a common data analysis task in many applications, such as large epidemiological studies or multisite cohort studies. The generalized estimating equations (GEE) method is a popular statistical method for analyzing these kinds of data, because it can accommodate many types of unmeasured dependence among outcomes. Collecting large amounts of highly stratified or correlated response data is time-consuming; thus, a more aggressive sampling strategy that can accelerate this process, such as the active-learning methods found in the machine-learning literature, will always be beneficial. In this study, we integrate adaptive sampling and variable selection features into a sequential procedure for modeling correlated response data. Besides reporting the statistical properties of the proposed procedure, we use both synthesized and real data sets to demonstrate the usefulness of our method.
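The sequential, active-learning procedure is the paper's contribution and is not reproduced here; the sketch below (geepack assumed installed) shows only the kind of underlying GEE fit that such a procedure would refit as newly sampled clusters arrive.

```r
## Only the underlying GEE fit is sketched here (geepack assumed installed);
## the paper's sequential procedure with adaptive sampling and variable
## selection would refit a model like this as newly sampled clusters arrive.
library(geepack)

set.seed(8)
dat <- data.frame(id = rep(1:50, each = 4),        # 50 clusters of 4 correlated responses
                  x1 = rnorm(200),
                  x2 = rbinom(200, 1, 0.5))
dat$y <- rbinom(200, 1, plogis(0.5 * dat$x1 - 0.3 * dat$x2))

fit <- geeglm(y ~ x1 + x2, id = id, data = dat,
              family = binomial, corstr = "exchangeable")
summary(fit)
```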

12.

Background

Advanced intercross lines (AILs) are segregating populations created using a multi-generation breeding protocol for fine mapping quantitative trait loci (QTL) in mice and other organisms. Applying QTL mapping methods designed for intercross and backcross populations, often followed by naïve permutation of individuals and phenotypes, does not account for the AIL family structure in which the final generations have been expanded, and it leads to inappropriately low significance thresholds. The critical problem with naïve mapping approaches in AIL populations is that the individual is not an exchangeable unit.

Methodology/Principal Findings

The effect of family structure has immediate implications for optimal AIL creation (many crosses, few animals per cross, and population expansion before the final generation), and we discuss these implications and the utility of AIL populations for QTL fine mapping. We also describe Genome Reshuffling for Advanced Intercross Permutation (GRAIP), a method for analyzing AIL data that accounts for family structure. GRAIP permutes a more interchangeable unit in the final-generation crosses, the parental genome, and simulates regeneration of a permuted AIL population based on the exchanged parental identities. GRAIP determines appropriate genome-wide significance thresholds and locus-specific P-values for AILs and other populations with similar family structures. We contrast GRAIP with naïve permutation using a large, densely genotyped mouse AIL population (1333 individuals from 32 crosses). A naïve permutation using coat color as a model phenotype demonstrates high false-positive locus identification and uncertain significance levels, which are corrected using GRAIP. GRAIP also detects an established hippocampus weight locus and a new locus, Hipp9a.

Conclusions and Significance

GRAIP determines appropriate genome-wide significance thresholds and locus-specific P-values for AILs and other populations with similar family structures. The effect of family structure has immediate implications for optimal AIL creation, and we discuss these implications and the utility of AIL populations.
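GRAIP itself permutes parental genomes and regenerates the AIL genotypes, which is not reproduced here; the sketch below only illustrates why the choice of exchangeable unit matters, by contrasting naive individual-level permutation with permutation at the family level on simulated data that have family structure but no true locus effect.

```r
## Not GRAIP itself (which permutes parental genomes and regenerates genotypes):
## this only illustrates why the exchangeable unit matters. Phenotype and marker
## genotype both cluster by family, but the marker has no true effect.
set.seed(9)
n.fam <- 32; per.fam <- 40; n <- n.fam * per.fam
fam   <- rep(seq_len(n.fam), each = per.fam)
g.fam <- rbinom(n.fam, 2, 0.5)                     # founder-derived marker genotype per family
geno  <- g.fam[fam]
pheno <- rnorm(n.fam)[fam] + rnorm(n, sd = 0.5)    # family effects only, no locus effect

obs <- abs(cor(geno, pheno))
naive.null <- replicate(1000, abs(cor(geno, sample(pheno))))        # individuals exchangeable
fam.null   <- replicate(1000, abs(cor(sample(g.fam)[fam], pheno)))  # families exchangeable

mean(naive.null >= obs)   # naive P-value: too small, since only 32 families are independent
mean(fam.null >= obs)     # family-level P-value: reflects the true exchangeable unit
```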

13.

Aim

Species distribution data play a pivotal role in the study of ecology, evolution, biogeography and biodiversity conservation. Although large amounts of location data are available and accessible from public databases, data quality remains problematic. Of the potential sources of error, positional errors are critical for spatial applications, particularly where these errors place observations beyond the environmental or geographical range of species. These outliers need to be identified, checked and removed to improve data quality and minimize the impact on subsequent analyses. Manually checking all species records within large multispecies datasets is prohibitively costly. This work investigates algorithms that may assist in the efficient vetting of outliers in such large datasets.

Location

We used real, spatially explicit environmental data derived from the western part of Victoria, Australia, and simulated species distributions within this same region.

Methods

By adapting species distribution modelling (SDM), we developed a pseudo-SDM approach for detecting outliers in species distribution data, which we implemented with random forest (RF) and support vector machine (SVM) classifiers, resulting in two new methods: RF_pdSDM and SVM_pdSDM. Using virtual species, we compared eight existing multivariate outlier detection methods with these two new methods under various conditions.

Results

The two new methods based on the pseudo-SDM approach had higher true skill statistic (TSS) values than the other approaches, with TSS values always exceeding 0. More than 70% of the true outliers in datasets for species with low or intermediate prevalence can be identified by checking the 10% of data points with the highest outlier scores.

Main conclusions

Pseudo-SDM-based methods were more effective than the other outlier detection methods. However, this outlier detection procedure can only be considered a screening tool, and putative outliers must be examined by experts to determine whether they are actual errors or important records within an inherently biased set of data.
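One reading of the pseudo-SDM idea described under Methods above (not the authors' RF_pdSDM code; the randomForest package is assumed installed): separate occurrence records from random background points in environmental space and score each record by one minus its out-of-bag predicted suitability, so environmentally unusual records are flagged for vetting first.

```r
## One reading of the pseudo-SDM idea (not the authors' RF_pdSDM implementation;
## randomForest assumed installed): separate occurrence records from random
## background points in environmental space and rank records by 1 minus their
## out-of-bag predicted suitability, so environmentally unusual records are
## checked first.
library(randomForest)

set.seed(10)
occ <- data.frame(temp = rnorm(200, 15, 2), rain = rnorm(200, 800, 100))     # occurrences
occ[1:5, ] <- data.frame(temp = rnorm(5, 28, 1), rain = rnorm(5, 200, 50))   # planted outliers
bg  <- data.frame(temp = runif(500, 5, 30), rain = runif(500, 100, 1200))    # background points

x <- rbind(occ, bg)
y <- factor(rep(c("presence", "background"), c(nrow(occ), nrow(bg))))
rf <- randomForest(x, y, ntree = 500)

suitability   <- rf$votes[seq_len(nrow(occ)), "presence"]   # out-of-bag vote proportions
outlier.score <- 1 - suitability
head(order(outlier.score, decreasing = TRUE), 10)           # records to vet first
```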

14.
A CART-based approach to discover emerging patterns in microarray data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Cancer diagnosis using gene expression profiles requires supervised learning and gene selection methods. Of the many suggested approaches, the method of emerging patterns (EPs) has the particular advantage of explicitly modeling interactions among genes, which improves classification accuracy. However, finding useful (i.e. short and statistically significant) EPs is typically very hard. METHODS: Here we introduce a CART-based approach to discover EPs in microarray data. The method is based on growing decision trees from which the EPs are extracted. This approach combines pattern search with a statistical procedure based on Fisher's exact test to assess the significance of each EP. Subsequently, sample classification based on the inferred EPs is performed using maximum-likelihood linear discriminant analysis. RESULTS: Using simulated data as well as gene expression data from colon and leukemia cancer experiments, we assessed the performance of our pattern search algorithm and classification procedure. In the simulations, our method recovers a large proportion of known EPs, while for real data it is comparable in classification accuracy to three top-performing alternative classification algorithms. In addition, it assigns statistical significance to the inferred EPs and allows the patterns to be ranked while simultaneously avoiding overfitting of the data. The new approach therefore provides a versatile and computationally fast tool for elucidating local gene interactions as well as for classification. AVAILABILITY: A computer program written in the statistical language R implementing the new approach is freely available from the web page http://www.stat.uni-muenchen.de/~socher/
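A sketch of the two ingredients only, a CART tree grown with rpart and Fisher's exact test applied to one candidate pattern; the pattern below is a hand-picked conjunction for illustration, not one extracted by the authors' algorithm.

```r
## The two ingredients only (not the authors' algorithm): a CART tree grown with
## rpart on expression data, and Fisher's exact test applied to one candidate
## emerging pattern; the pattern is a hand-picked conjunction for illustration.
library(rpart)

set.seed(11)
n   <- 100
dat <- data.frame(geneA = rnorm(n), geneB = rnorm(n), geneC = rnorm(n),
                  label = factor(rep(c("tumor", "normal"), each = n / 2)))
dat$geneA[dat$label == "tumor"] <- dat$geneA[dat$label == "tumor"] + 1.5  # differential gene

tree <- rpart(label ~ geneA + geneB + geneC, data = dat, method = "class",
              control = rpart.control(minsplit = 10))
print(tree)                                        # splits suggest candidate patterns

pattern <- dat$geneA > 0.5 & dat$geneB < 1         # illustrative candidate pattern
fisher.test(table(pattern, dat$label))             # significance of the pattern
```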

15.
In this paper, we describe a method for the statistical reconstruction of a large DNA sequence from a set of sequenced fragments. We assume that the fragments have been assembled and address the problem of determining the degree to which the reconstructed sequence is free from errors, i.e., its accuracy. A consensus distribution is derived from the assembled fragment configuration based upon the rates of sequencing errors in the individual fragments. The consensus distribution can be used to find a minimally redundant consensus sequence that meets a prespecified confidence level, either base by base or across any region of the sequence. A likelihood-based procedure for the estimation of the sequencing error rates, which utilizes an iterative EM algorithm, is described. Prior knowledge of the error rates is easily incorporated into the estimation procedure. The methods are applied to a set of assembled sequence fragments from the human G6PD locus. We close the paper with a brief discussion of the relevance and practical implications of this work.
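A sketch of the consensus distribution for a single aligned position, assuming the fragment error rates are known (the paper estimates them with an EM algorithm), a uniform prior over the four bases, and miscalls equally likely to be any of the three wrong bases.

```r
## Consensus distribution for a single aligned position, assuming the fragment
## error rates are known (the paper estimates them by EM), a uniform prior over
## A, C, G, T, and miscalls equally likely to be any of the three wrong bases.
consensus.posterior <- function(calls, error.rates, bases = c("A", "C", "G", "T")) {
  logpost <- sapply(bases, function(b) {
    sum(ifelse(calls == b, log(1 - error.rates), log(error.rates / 3)))
  })
  post <- exp(logpost - max(logpost))
  post / sum(post)
}

## three fragments cover this position; the error-prone third fragment disagrees
consensus.posterior(calls = c("G", "G", "T"), error.rates = c(0.01, 0.02, 0.15))
```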

16.
To facilitate decision support in freshwater ecosystem protection and restoration management, habitat suitability models can be very valuable. Data-driven methods such as artificial neural networks (ANNs) are particularly useful in this context, given their time-efficient development and relatively high reliability. However, the specialized and technical literature on neural network modelling offers a variety of model development criteria for selecting model architecture, training procedure, etc. This may lead to confusion among ecosystem modellers and managers regarding the optimal training and validation methodology. This paper focuses on the analysis of ANN development and application for predicting macroinvertebrate communities, a species group commonly used in freshwater assessment worldwide. The review reflects on the different aspects of model development and application based on a selection of 26 papers reporting the use of ANN models for the prediction of macroinvertebrates. This analysis revealed that the applied model training and validation methodologies can often be improved and, moreover, that crucial steps in the modelling process are often poorly documented. Therefore, suggestions to improve model development, assessment and application in ecological river management are presented. In particular, data pre-processing determines to a large extent the reliability of the induced models and their predictive relevance. The same holds for the validation criteria, which need to be better tuned to the practical simulation requirements. Moreover, the use of sensitivity methods can help to extract knowledge on the habitat preferences of species and allows peer review by ecological experts. The selection of relevant input variables remains a critical challenge as well. Model coupling is a crucial missing step for linking human activities, hydrology, physical habitat conditions, water quality and ecosystem status. This last aspect is probably the most valuable for enabling decision support in water management based on ANN models.

17.
The intracoelomic surgical implantation of telemetry transmitters in fish is becoming the “standard” tagging approach for most field telemetry studies. Consequently, efforts must be made to ensure that the welfare of the fish is maintained and that fish do not experience significant mortality or sublethal impairments in health, behavior or physiology as a result of surgical procedures. It is therefore essential to adequately report information relating to all aspects of the surgical procedure to enable the reader to make an accurate interpretation of study results. We conducted a quantitative literature review aimed at characterizing trends in data reporting by examining a sample of fish telemetry studies published in peer-reviewed outlets during the last 20 years. We used a repeatability score, based on 16 predetermined criteria, to evaluate the reporting of surgical procedures in telemetry studies. The majority of studies failed to report basic information relating to the surgical procedures used. Repeatability scores were highly variable between studies, ranging from 0% to 93.8%. No single study provided complete information (mean repeatability score = 50.7%), and repeatability showed no trend over time. Some study information was consistently well reported (e.g. tag size and dimensions, the type of anaesthetic used and the location of the incision). In contrast, the type of suture knots, the duration or level of anaesthesia and the precautions taken to minimize infection were consistently left out of the methods sections of most telemetry studies. Our review was confounded by the large proportion of studies that cited other sources for their surgical methods, many of which themselves lacked complete information. We recommend that future electronic tagging studies involving intracoelomic implantation include the minimum reporting standards presented in this paper. Increasing the detail of reporting will improve the quality of the data presented, minimize welfare and ethical concerns and allow transparency for study repeatability.

18.
Connective tissues are biological composites comprising collagen fibrils embedded in (and reinforcing) the hydrated proteoglycan-rich (PG) gel within the extracellular matrices (ECMs). Age-related changes in the mechanical properties of tissues are often associated with changes in the structure of the ECM, namely fibril diameter. However, quantitative attempts to correlate fibril diameter with mechanical properties have yielded inconclusive evidence. Here, we describe a novel approach, based on the rule of mixtures for fiber composites, to evaluate the dependence of age-related changes in tendon tensile strength (σ) and stiffness (E) on the collagen fibril cross-sectional area fraction (ρ), which is related to the fibril volume fraction. Tail tendons from C57BL6 mice aged 1.6–35.3 months were stretched to failure to determine σ and E. Parallel measurements of ρ as a function of age were made using transmission electron microscopy. Mathematical models (rule of mixtures) of fibrils reinforcing a PG gel in tendons were used to investigate the influence of ρ on ageing changes in σ and E. The magnitudes of σ, E and ρ increased rapidly from 1.6 to 4.0 months (P-values < 0.05) before reaching an age-independent plateau from 4.0 to 29.0 months (P-values > 0.05); this trend continued for E and ρ (P-values > 0.05) from 29.0 to 35.3 months, but not for σ, which decreased gradually (P-values < 0.05). Linear regression analysis revealed that age-related changes in σ and E correlated positively with ρ (P-values < 0.05). The collagen fibril cross-sectional area fraction ρ is thus a significant predictor of ageing changes in σ and E in the tail tendons of C57BL6 mice.
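For reference, the standard Voigt (parallel) rule of mixtures that such models start from; the paper's exact formulation may include additional efficiency factors (e.g., for fibril orientation or length) that are not shown here.

```latex
% Standard Voigt (parallel) rule of mixtures for a fibril-reinforced gel;
% the paper's model may include efficiency factors not shown here.
E_{\mathrm{tendon}} \approx \rho\,E_{\mathrm{fibril}} + (1-\rho)\,E_{\mathrm{gel}},
\qquad
\sigma_{\mathrm{tendon}} \approx \rho\,\sigma_{\mathrm{fibril}} + (1-\rho)\,\sigma_{\mathrm{gel}}
```

Under this form, both σ and E scale roughly linearly with the fibril area fraction ρ, consistent with the positive correlations reported.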

19.
Sequential soil coring is a commonly used approach to measure seasonal root biomass and necromass, from which root production can be estimated by the maximum–minimum, sum-of-changes, compartment-flow model, and/or decision matrix methods. Among these methods, the decision matrix is the most frequently used. However, the decision matrix often underestimates fine root production and is frequently misused in research because its underlying logic is inadequately documented. In this paper, we review the decision matrix method and provide the mathematical logic behind the construction of the matrix, from which not only root production but also mortality and decomposition rates can be calculated. To ease its use for large datasets, we developed simplified equations that facilitate computation of root production, mortality and decomposition in MS Excel or R. We also present a worked example using empirical data from boreal forests to show the proper calculation of root production, mortality and decomposition. The simplified decision matrix presented here should promote its application in ecology, especially for large datasets.
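The paper's simplified decision-matrix equations are not reproduced here; as a point of comparison, the maximum–minimum method named in the abstract is trivial to compute on hypothetical sequential-coring data (one common variant uses live biomass only rather than biomass plus necromass).

```r
## The paper's simplified decision-matrix equations are not reproduced here.
## For comparison, the maximum-minimum method named in the abstract applied to
## hypothetical sequential-coring data (one common variant uses live biomass
## only rather than biomass + necromass).
cores <- data.frame(
  month     = c("May", "Jun", "Jul", "Aug", "Sep", "Oct"),
  biomass   = c(120, 150, 180, 175, 160, 140),   # live fine roots, g m^-2
  necromass = c(30, 35, 40, 55, 70, 80)          # dead fine roots, g m^-2
)
total <- cores$biomass + cores$necromass
production.maxmin <- max(total) - min(total)     # annual production estimate, g m^-2 yr^-1
production.maxmin
```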

20.
Background, aim, and scope

Cross-category weighting is one possible way to facilitate internal decision making when dealing with ambiguous impact assessment results, with simple additive weighting being a commonly used method. Yet the question of whether the methods applied today can, in fact, identify the most “environmentally friendly” alternative from a group perspective remains unanswered. The aim of this paper is to propose a new method for group decision making that ensures the effective identification of the most preferable alternative.

Materials and methods

Common approaches for deducing a single set of weighting factors for application in a group decision situation (e.g., arithmetic mean, consensus) are discussed based on simple mathematics, empirical data, and thought experiments. After proposing an extended definition of “effectiveness” in group decision making, the paper recommends the use of social choice theory, whose main focus is to identify the most preferable alternative based on individuals' rankings of alternatives. The procedure is further supplemented by a Monte Carlo analysis to facilitate the assessment of the result's robustness.

Results

The general feasibility of the method is demonstrated. It generates a complete ranking of alternatives, which does not contain cardinal single scores. In terms of effectiveness, the mathematical structure of the procedure ensures the eligibility for compromise of the group decision proposal. The sensitivity analysis supports the decision makers in understanding the robustness of the proposed group ranking.

Discussion

The method is based upon an extended definition of effectiveness, which acknowledges eligibility for compromise as the core requirement in group decision contexts. It is shown that the multi-attribute decision-making (MADM) methods used in life cycle assessment (LCA) today do not necessarily meet this requirement because of their mathematical structure. Further research should focus on empirical proof that the generated group results are indeed more eligible for compromise than results generated by current methods that utilize an averaged group weighting set. This is closely related to the question of under which mathematical constraints it is even possible to generate an essentially different result.

Conclusions

The paper describes a new multi-attribute group decision support system (MGDSS) for the identification of the most preferable alternative(s) for use in panel-based LCA studies. The main novelty is that it refrains from deducing a single set of weighting factors that is supposed to represent the panel as a whole. Instead, it applies voting rules that stem from social choice theory. Because of its mathematical structure, the procedure is deemed superior to common approaches in terms of its effectiveness.

Recommendations and perspectives

The described method may be recommended for use in internal, panel-based LCA studies. In addition, the basic approach of the method, the combination of MADM methods with social choice theory, can be recommended for use in all situations where multi-attribute decisions are to be made in a group context.
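The abstract does not commit to a particular voting rule, so the sketch below uses the Borda count as a familiar stand-in from social choice theory, with a crude Monte Carlo perturbation of the panel rankings standing in for the robustness analysis.

```r
## Illustration: the Borda count as one social-choice rule (the abstract does not
## name a specific rule), with a crude Monte Carlo perturbation of the panel
## rankings standing in for the robustness analysis.
## rankings[i, j] = rank that panellist i gives alternative j (1 = best).
borda <- function(rankings) {
  colSums(ncol(rankings) - rankings)   # higher Borda score = more preferred by the group
}

rankings <- rbind(p1 = c(A = 1, B = 2, C = 3),
                  p2 = c(A = 2, B = 1, C = 3),
                  p3 = c(A = 1, B = 3, C = 2))
borda(rankings)                        # group ranking proposal

set.seed(12)
winner <- replicate(1000, {
  perturbed <- rankings
  for (i in seq_len(nrow(perturbed))) {
    if (runif(1) < 0.3) {              # occasionally swap two of a panellist's ranks
      j <- sample(ncol(perturbed), 2)
      perturbed[i, j] <- perturbed[i, rev(j)]
    }
  }
  colnames(rankings)[which.max(borda(perturbed))]
})
table(winner) / 1000                   # how often each alternative wins under perturbation
```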
