Similar articles
Found 20 similar articles; search time: 15 ms
1.
The most common tests for types and antitypes in configural frequency analysis are normal approximations of exact tests. This paper discusses such statistics under the complete independence model and under the fixed margins model. It turns out that these test statistics are not acceptable when the number of simultaneously performed tests is large or when the expected frequencies are small. In these cases, the use of exact tests is advocated and some existing computer programs for such tests are indicated. A normal approximation based on the strong version of the De Moivre-Laplace limit theorem is also discussed. Empirical examples are given from longitudinal data describing the psychological development of boys.
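
The contrast between the normal approximation and the exact test can be sketched for a single cell. This is a minimal illustration only: the cell counts, the expected probability and the Bonferroni adjustment are made-up assumptions, and the function names are hypothetical, not taken from the paper or its programs.

```python
from math import comb, erfc, sqrt


def exact_upper_tail(observed, n, p):
    """P(X >= observed) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(observed, n + 1))


def normal_upper_tail(observed, n, p):
    """Upper-tail p-value from the z approximation (no continuity correction)."""
    z = (observed - n * p) / sqrt(n * p * (1 - p))
    return 0.5 * erfc(z / sqrt(2))


# Hypothetical cell: 5 of 100 cases where the base model expects only 1.5.
n, p, observed = 100, 0.015, 5
cells = 8                 # assumed 2 x 2 x 2 table, one test per cell
alpha = 0.05 / cells      # Bonferroni-adjusted level (an assumption here)

p_exact = exact_upper_tail(observed, n, p)
p_normal = normal_upper_tail(observed, n, p)
print(f"exact={p_exact:.4f} normal={p_normal:.4f} alpha={alpha:.5f}")
```

With a small expected frequency and an adjusted significance level, the approximate tail probability in this made-up case falls below the level while the exact one does not, which is the kind of disagreement the abstract warns about.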

2.
Longitudinal Configural Frequency Analysis (CFA) seeks to identify, at the manifest variable level, those temporal patterns that are observed more frequently (CFA types) or less frequently (CFA antitypes) than expected with reference to a base model. This article discusses, compares, and extends two base models of interest in longitudinal data analysis. The first of these, Prediction CFA (P-CFA), is a base model that can be used in the configural analysis of both cross-sectional and longitudinal data. This model takes the associations among predictors and among criteria into account. The second base model, Auto-Association CFA (A-CFA), was specifically designed for longitudinal data. This model takes the auto-associations among repeatedly observed variables into account. Both models are extended to accommodate covariates, for example, stratification variables. Application examples are given using data from a longitudinal study of domestic violence. It is illustrated that CFA is able to yield results that are not redundant with results from log-linear modeling or multinomial regression. It is concluded that CFA is particularly useful in the context of person-oriented research.

3.
Configural Frequency Analysis (CFA) is being increasingly used by psychologists and other researchers to test for the presence of combinations of categorical variables which occur more frequently or less frequently than expected under a particular model of chance. Configurations which occur more frequently than chance are known as “Types”; configurations which are conspicuous by their absence or rarity are known as “Antitypes”. Most configural frequency test theory consists of binomial tests applied to the cells of a cross-tabulation table. The wide variety of statistical tests described in papers and books on CFA are approximations to the binomial test, due to the computational intensity associated with performing binomial tests directly (VON EYE, 1990b). This paper advocates direct computation of binomial probabilities instead of the usual approximations used in CFA. Mathematical relationships of the binomial distribution with the F and incomplete beta distributions are described which enable the researcher to efficiently compute binomial probabilities using functions available in common statistical software. The classical inference approach adopted by traditional CFA makes it difficult to draw conclusions regarding the likely prevalence rates of types or antitypes in the reference population. It is also not possible to exploit additional information about the sample which, while not known precisely, is known with a degree of confidence and can aid in the identification of types and antitypes. A Bayesian conjugate distributions approach based on the incomplete beta distribution is proposed. Bayesian extensions of this model to both classical CFA and a sequential CFA analysis advanced by KIESER and VICTOR (1991) are described.
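
The relationship with the incomplete beta distribution mentioned above can be checked numerically through the standard identity P(X ≥ k) = I_p(k, n − k + 1) for X ~ Binomial(n, p), where I is the regularized incomplete beta function. The sketch below is an illustration under stated assumptions, not the paper's procedure: the function names are hypothetical, and the incomplete beta is obtained by simple Simpson integration only to keep the example dependency-free.

```python
from math import comb, exp, lgamma


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def reg_inc_beta(x, a, b, steps=2000):
    """Regularized incomplete beta I_x(a, b) via composite Simpson's rule.

    Assumes a >= 1 and b >= 1 (so the integrand is bounded) and an even
    number of steps.
    """
    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    h = x / steps
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return total * h / 3 / exp(log_beta)


# Hypothetical cell: at least 5 of 20 cases with expected probability 0.1.
n, k, p = 20, 5, 0.1
print(binom_upper_tail(k, n, p), reg_inc_beta(p, k, n - k + 1))
```

In practice a library routine for the regularized incomplete beta function (for example `scipy.special.betainc`) would replace the hand-rolled integration; the point is only that one function call yields the binomial tail.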

4.
For the analysis of cross-classifications having ordered categories, this paper proposes a model which is more parsimonious than the linear-by-linear association (LL) model (that is, which can be described in terms of fewer parameters than the LL model). In a special case, this model is more parsimonious than the uniform association (U) model. Under this model, the expected frequency on a log scale is a linear function of row and column variables for fixed column and row variables, respectively. For the well-known operation and dumping severity data, the parsimonious U model proposed here fits well, and new interpretations are added.

5.
Dominici F. Biometrics 2000, 56(2): 546-553
We propose a methodology for estimating the cell probabilities in a multiway contingency table by combining partial information from a number of studies when not all of the variables are recorded in all studies. We jointly model the full set of categorical variables recorded in at least one of the studies, and we treat the variables that are not reported as missing dimensions of the study-specific contingency table. For example, we might be interested in combining several cohort studies in which the incidence in the exposed and nonexposed groups is not reported for all risk factors in all studies while the overall numbers of cases and cohort size is always available. To account for study-to-study variability, we adopt a Bayesian hierarchical model. At the first stage of the model, the observation stage, data are modeled by a multinomial distribution with fixed total number of observations. At the second stage, we use the logistic normal (LN) distribution to model variability in the study-specific cells' probabilities. Using this model and data augmentation techniques, we reconstruct the contingency table for each study regardless of which dimensions are missing, and we estimate population parameters of interest. Our hierarchical procedure borrows strength from all the studies and accounts for correlations among the cells' probabilities. The main difficulty in combining studies recording different variables is in maintaining a consistent interpretation of parameters across studies. The approach proposed here overcomes this difficulty and at the same time addresses the uncertainty arising from the missing dimensions. We apply our modeling strategy to analyze data on air pollution and mortality from 1987 to 1994 for six U.S. cities by combining six cross-classifications of low, medium, and high levels of mortality counts, particulate matter, ozone, and carbon monoxide with the complication that four of the six cities do not report all the air pollution variables. 
Our goals are to investigate the association between air pollution and mortality by reconstructing the tables with missing dimensions, to determine the most harmful pollutant combinations, and to make predictions about these key issues for a city other than the six sampled. We find that, for high levels of ozone and carbon monoxide, the number of cases with a high number of deaths increases as the level of particulate matter, PM10, increases and that the most harmful combinations correspond to high levels of PM10, confirming prior findings that levels of PM10 higher than the NAAQS standard are harmful.

6.
Genomics 2019, 111(6): 1387-1394
To decipher the genetic architecture of human disease, various types of omics data are generated. Two common types of omics data are genotypes and gene expression. Often genotype data for a large number of individuals and gene expression data for a few individuals are generated due to biological and technical reasons, leading to unequal sample sizes for different omics data. The lack of a standard statistical procedure for integrating such datasets motivates us to propose a two-step multi-locus association method using latent variables. Our method is more powerful than single/separate omics data analysis and it comprehensively unravels deep-seated signals through a single statistical model. Extensive simulation confirms that it is robust to various genetic models, as its power increases with sample size and number of associated loci. It provides p-values very fast. Application to a real dataset on psoriasis identifies 17 novel SNPs, functionally related to psoriasis-associated genes, at a much smaller sample size than standard GWAS.

7.
A generalizing, analytic model is developed for 2-way cross-classifications with fixed class sizes. The model is expressed in closed form as a simple operator formula, and is readily extended to n-way cross-classifications, and to such cases where one or more cells are vacuous or fixed. The model permits easy derivation, by means of simple differential operators, of the exact power moments, product moments, and factorial moments of the cell frequencies. From this, the moments of the exact sampling distribution of the conventional χ²-statistic can be computed, which, in turn, leads to a reappraisal of the Chi-square approximation for sparse and isotropic contingency tables. Here, the Gamma distribution is considered, and numerical results are presented that would suggest preference of the Gamma approximation over the Chi-square in such cases.

8.

Background

The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. Technologies like mass spectrometry are commonly being used in proteomic research. Mass spectrometry signals show the proteomic profiles of the individuals under study at a given time. These profiles correspond to the recording of a large number of proteins, much larger than the number of individuals. These variables come in addition to or to complete classical clinical variables. The objective of this study is to evaluate and compare the predictive ability of new and existing models combining mass spectrometry data and classical clinical variables. This study was conducted in the context of binary prediction.

Results

To achieve this goal, simulated data as well as a real dataset dedicated to the selection of proteomic markers of steatosis were used to evaluate the methods. The proposed methods meet the challenge of high-dimensional data and the selection of predictive markers by using penalization methods (Ridge, Lasso) and dimension reduction techniques (PLS), as well as a combination of both strategies through sparse PLS in the context of a binary class prediction. The methods were compared in terms of mean classification rate and their ability to select the true predictive values. These comparisons were done on clinical-only models, mass-spectrometry-only models and combined models.

Conclusions

It was shown that models which combine both types of data can be more efficient than models that use only clinical or mass spectrometry data when the sample size of the dataset is large enough.

9.
The relationships among landscape characteristics and plant diversity in tropical forests may be used to predict biodiversity. To identify and characterize them, the number of species, as well as Shannon and Simpson diversity indices were calculated from 157 sampling quadrats (17,941 individuals sampled) while the vegetation classes were obtained from multi-spectral satellite image classification in four landscapes located in the southeast of Quintana Roo, Mexico. The mean number of species of trees, shrubs and vines as well as the mean value of the total number of species and the other two diversity indices were calculated for four vegetation classes in every one of the four landscapes. In addition, the relationships between landscape pattern metrics of patch types and diversity indices were explored. The multiple statistical analyses revealed significant predictor variables for the three diversity indices. Moreover, the shape, similarity and edge contrast metrics of patch types might serve as useful indicators for the number of species and the other two diversity variables at the landscape scale. Although the associations between the three diversity indices and patch type metrics showed similar behavior, some differences were observed. The Shannon diversity index, with its greater sensitivity to rare species, should be considered as having a greater importance in interpretation analysis than the Simpson index.
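
The two diversity indices named above can be sketched from a vector of per-species counts. The abstract does not say which variants were used, so this sketch assumes the natural-log Shannon index H = −Σ pᵢ ln pᵢ and the Gini-Simpson form 1 − Σ pᵢ²; the function names and quadrat counts are hypothetical.

```python
from math import log


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over species with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2); one of several common variants."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


# Hypothetical quadrats: the second adds a single individual of a rare species.
without_rare = [50, 50]
with_rare = [50, 50, 1]

shannon_gain = shannon_index(with_rare) / shannon_index(without_rare)
simpson_gain = simpson_index(with_rare) / simpson_index(without_rare)
print(f"Shannon ratio {shannon_gain:.3f}, Simpson ratio {simpson_gain:.3f}")
```

Adding one individual of a rare species raises the Shannon value proportionally more than the Simpson value in this made-up case, which illustrates the sensitivity to rare species that the abstract refers to.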

10.
The number and relationships of reproducing individuals create the observed genetic heterogeneity within a social insect colony. These are referred to as sociogenetic organization and were studied in the red ants M. ruginodis and M. lobicornis. Direct observations of the queen numbers were obtained by excavating colonies. The effective number of reproducing individuals was estimated from genetic relatedness based on genotype frequency data. Sociogenetic organization of colonies of both species is simple. The number of queens is low, single mating of queens is the rule and queen to queen variation in worker production is minor. The important variables of sociogenetic organization are the number and relatedness of coexisting queens in polygynous colonies. Queen nestmates are related on average by 0.405 in polygynous colonies of M. ruginodis, showing that colonies recruit their own daughters as new reproductives. The distribution of queen number in M. ruginodis indicates that the study population contains both microgyna and macrogyna types of the species. The large proportion of colonies where the resident queen(s) is not the mother of the workers shows that the average life span of a queen is short and colonies are serially polygynous.

11.
Habitat destruction and degradation are important drivers of biodiversity loss within agro-ecosystems. However, little is known about the effect of farming practices and the value of woody hedgerows on Lepidoptera in North America. The purpose of this work was to study moth diversity in woody hedgerows and croplands of organic and conventional farms. In addition, the influence of vegetation composition and abiotic variables on species richness, abundance, and composition was examined. Moths were sampled with light traps during six weeks in the summer of 2001. Vegetation data and abiotic variables were obtained for all sites. In total, 26,020 individuals from 12 families and 408 species were captured. Most species were uncommon. Only 35 species included >100 individuals while for 71% of species <10 individuals were found. The Noctuidae represented 221 species and 85% of all individuals captured. Woody hedgerows harbored more species and in greater number than croplands. There was no significant difference in moth diversity between organic and conventional farms, except that the Notodontidae were significantly more species rich in organic than in conventional sites. Results show that species richness, abundance, and composition were greatly influenced by habitat types (hedgerow versus crop field) and abiotic variables (minimum temperature which was correlated to moon illumination, rainfall, and cloud cover). Moth species composition was significantly correlated to vegetation composition. This study broadens our understanding of the factors driving moth diversity and expands our knowledge of their geographic range. The maintenance of noncrop habitats such as woody hedgerows within agro-ecosystems seems paramount to preserving the biodiversity and abundance of many organisms, including moths.

12.
This paper reports on aerial surveys conducted to estimate the relative abundance and trend in growth of the southern right whale (Eubalaena australis) population from Península Valdés. The number of whales counted tripled from 1999 to 2016. We modeled the number of whales, the number of calves, the number of solitary individuals and the number of individuals in breeding groups using as predictive variables the year, Julian day, and Julian day² by means of generalized linear models. The rate of increase decreased from near 7% in 2007 to 0.06% and 2.30% for total number of whales and number of calves, respectively, in 2016. Trends in the rates of increase for total number of whales and number of calves were negative (−0.732% and −0.376%, respectively). The habitat use of the whales changed over the years, with mothers and calves using the near‐shore strip more heavily, resulting in a decreasing trend for solitary individuals and breeding groups in near‐shore waters. We conclude that whales are still increasing their abundance, while the rate of increase is decreasing. Differences in the rates of increase of the group types and changes in habitat use are thought to be the consequence of a density‐dependence process.

13.

Background  

With the advent of high throughput biotechnology data acquisition platforms such as micro arrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking.

14.
Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. 
Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.

15.
Bjørnstad A, Westad F, Martens H. Hereditas 2004, 141(2): 149-165
The utility of a relatively new multivariate method, bi-linear modelling by cross-validated partial least squares regression (PLSR), was investigated in the analysis of QTL. The distinguishing feature of PLSR is to reveal reliable covariance structures in data of different types with regard to the same set of objects. Two matrices X (here: genetic markers) and Y (here: phenotypes) are interactively decomposed into latent variables (PLS components, or PCs) in a way which facilitates statistically reliable and graphically interpretable model building. Natural collinearities between input variables are utilized actively to stabilise the modelling, instead of being treated as a statistical problem. The importance of cross-validation/jack-knifing as an intuitively appealing way to avoid overfitting is emphasized. Two datasets from chromosomal mapping studies of different complexity were chosen for illustration (QTL for tomato yield and for oat heading date). Results from PLSR analysis were compared to published results and to results using the package PLABQTL in these data sets. In all cases PLSR gave at least similar explained validation variances as the reported studies. An attractive feature is that PLSR allows the analysis of several traits/replicates in one analysis, and the direct visual identification of individuals with desirable marker genotypes. It is suggested that PLSR may be useful in structural and functional genomics and in marker assisted selection, particularly in cases with a limited number of objects.

16.
Aim

To evaluate the strength of evidence for hypotheses explaining the relationship between climate and species richness in forest plots. We focused on the effect of energy availability which has been hypothesized to influence species richness: (1) via the effect of productivity on the total number of individuals (the more individuals hypothesis, MIH); (2) through the effect of temperature on metabolic rate (metabolic theory of biodiversity, MTB); or (3) by imposing climatic limits on species distributions.

Location

Global.

Methods

We utilized a unique data set of 370 ‘Gentry‐style’ forest plots comprising tree counts and individual stem measurements, covering tropical and temperate forests across all six forested continents. We analysed variation in plot species richness and species richness controlled for the number of individuals by using rarefaction. Ordinary least squares (OLS) regression and spatial regressions were used to explore the relative performance of different sets of environmental variables.

Results

Species richness patterns do not differ whether we use raw number of species or number of species controlled for number of individuals, indicating that number of individuals is not the proximate driver of species richness. Productivity‐related variables (actual evapotranspiration, net primary productivity, normalized difference vegetation index) perform relatively poorly as correlates of tree species richness. The best predictors of species richness consistently include the minimum temperature and precipitation values together with the annual means of these variables.

Main conclusion

Across the world's forests there is no evidence to support the MIH, and very limited evidence for a prominent role of productivity as a driver of species richness patterns. The role of temperature is much more important, although this effect is more complex than originally assumed by the MTB. Variation in forest plot diversity appears to be mostly affected by variation in the minimum climatic values. This is consistent with the ‘climatic tolerance hypothesis’ that climatic extremes have acted as a strong constraint on species distribution and diversity.
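
Controlling richness for the number of individuals by rarefaction can be sketched with the classical expected-richness formula E[S_n] = Σᵢ (1 − C(N − Nᵢ, n) / C(N, n)), where Nᵢ is the count of species i and N the plot total. The abstract does not state which rarefaction variant the authors used, so this is an assumption, and the function name and plot counts are hypothetical.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of species in a random subsample of n individuals.

    Sampling is without replacement from a plot whose per-species stem
    counts are given in `counts`.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the plot total")
    # comb(total - c, n) is 0 whenever the subsample cannot avoid species c.
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)


# Two hypothetical plots with different stem totals, compared at 50 stems.
plot_a = [60, 30, 5, 3, 1, 1]
plot_b = [20, 15, 10, 5]
print(rarefied_richness(plot_a, 50), rarefied_richness(plot_b, 50))
```

Comparing plots at a common subsample size removes the part of the richness difference that is due only to one plot containing more stems.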

17.
Assessment of PLSDA cross validation   Cited by: 3 (self-citations: 0, citations by others: 3)
Classifying groups of individuals based on their metabolic profile is one of the main topics in metabolomics research. Due to the low number of individuals compared to the large number of variables, this is not an easy task. PLSDA is one of the data analysis methods used for the classification. Unfortunately this method eagerly overfits the data and rigorous validation is necessary. The validation however is far from straightforward. In this paper we will discuss a strategy based on cross model validation and permutation testing to validate the classification models. It is also shown that overly optimistic results are obtained when the validation is not done properly. Furthermore, we advocate against the use of PLSDA score plots for inference of class differences.
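
The permutation-testing idea can be sketched as follows: the cross-validated accuracy obtained with the true class labels is compared with the accuracies obtained after shuffling the labels. This is a deliberately simplified illustration and not the paper's procedure: a nearest-class-mean rule on one variable stands in for PLSDA, plain leave-one-out replaces cross model validation, and the data and function names are hypothetical.

```python
import random


def loo_accuracy(values, labels):
    """Leave-one-out accuracy of a nearest-class-mean rule on one variable."""
    hits = 0
    for i, held_out in enumerate(values):
        means = {}
        for lab in set(labels):
            rest = [v for j, (v, l) in enumerate(zip(values, labels))
                    if j != i and l == lab]
            if rest:
                means[lab] = sum(rest) / len(rest)
        predicted = min(means, key=lambda lab: abs(held_out - means[lab]))
        hits += predicted == labels[i]
    return hits / len(values)


def permutation_p_value(values, labels, n_perm=200, seed=0):
    """Share of label shufflings that classify at least as well as the real labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(values, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_accuracy(values, shuffled) >= observed:
            as_good += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (as_good + 1) / (n_perm + 1)


# Hypothetical, clearly separated groups measured on a single variable.
values = [0.1, 0.3, 0.2, 0.4, 0.25, 2.1, 2.3, 2.2, 2.4, 2.25]
labels = ["a"] * 5 + ["b"] * 5
accuracy, p_value = permutation_p_value(values, labels)
print(accuracy, p_value)
```

Every modelling step, including any variable selection or tuning, has to be repeated inside each permutation and each cross-validation fold; otherwise the comparison inherits the optimism the abstract warns about.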

18.
19.
A practical methodology is presented for the exact combinatory evaluation of 2-way cross-classifications. The combinatory distribution is based on a test-statistic relating to the information content of an observed table. Several examples of exact evaluation accompany this presentation.

20.
With the loss of natural wetlands, artificial wetlands are becoming increasingly important as habitat for waterbirds. We investigated the relationships between waterbirds and various biophysical parameters on artificial wetlands in an Australian urban valley. The densities (birds per hectare) of several species were correlated (mostly positively) with wetland area, and correlations were observed between certain species and other physical and water chemistry variables. Waterbird community structure, based on both abundance (birds per wetland) and density data, was most consistently positively correlated with the relative amount of wetland perimeter that was vegetated, surface area, distance to nearest wetland, public accessibility and shoreline irregularity. We also compared the relative use of the two types of urban wetlands, namely urban lakes and stormwater treatment wetlands, and found for both abundance and density that the number of individuals and species did not vary significantly between wetland types but that significant differences were observed for particular species and feeding guilds, with no species or guild being more abundant or found in greater density on an urban lake than a stormwater treatment wetland. Designing wetlands to provide a diversity of habitat will benefit most species.
