Similar articles
20 similar articles found (search time: 31 ms)
1.
Machine learning and statistical classifiers are increasingly applied to the complex, high-dimensional biological data produced by high-throughput technologies. Understanding how the characteristics of large, complex microarray datasets affect the predictive performance of classifiers is computationally intensive and under-investigated, yet vital for determining the optimal number of biomarkers for classification aimed at improved detection, diagnosis, and therapeutic monitoring of disease. We investigate the impact of microarray data characteristics on the predictive performance of various classification rules using simulation studies. Our investigation using Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour shows that the predictive performance of classifiers is strongly influenced by training set size, biological and technical variability, replication, fold change and correlation between biomarkers. The optimal number of biomarkers for a classification problem should therefore be estimated with the impact of all these factors taken into account. A database of average generalization errors was built for various combinations of these factors; it can be used to estimate the optimal number of biomarkers for a given level of predictive accuracy as a function of those factors. Examples show that curves from real biological data resemble those from simulated data with corresponding data characteristics. An R package, optBiomarker, implementing the method is freely available for academic use from the Comprehensive R Archive Network (http://www.cran.r-project.org/web/packages/optBiomarker/).
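As a rough illustration of the kind of simulation study described above, the sketch below (plain NumPy, not the optBiomarker package) trains a 1-nearest-neighbour classifier on simulated two-class "expression" data and shows how test error varies with training set size. The mean shift `shift` stands in for fold change on a log scale, and all parameter values are made-up toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=20, shift=1.0):
    """Two-class 'expression' data: p biomarkers, class-1 means shifted
    by `shift` (a stand-in for fold change on the log scale)."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, p)) + shift * y[:, None]
    return X, y

def knn_error(X_tr, y_tr, X_te, y_te):
    """Test error of a 1-nearest-neighbour classifier."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    pred = y_tr[d.argmin(axis=1)]
    return float((pred != y_te).mean())

X_te, y_te = simulate(500)
# Generalization error for increasing training set sizes
errors = {n: knn_error(*simulate(n), X_te, y_te) for n in (10, 50, 200)}
print(errors)
```

Repeating such runs over grids of sample size, fold change, variability and correlation is essentially how a database of average generalization errors can be assembled.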

2.
Principal component analysis and exploratory factor analysis of polychoric correlations are well-known approaches for determining the dimensionality of ordered categorical items. Their application has been viewed as problematic, however, because the polychoric correlation matrix may be indefinite. A possible solution to this problem is the application of smoothing algorithms. This study compared the effects of three smoothing algorithms, based respectively on the Frobenius norm, on adaptation of the eigenvalues and eigenvectors, and on minimum-trace factor analysis, on the accuracy of several variants of parallel analysis by means of a simulation study. We simulated datasets that varied in respondent sample size, item set size, underlying factor model, skewness of the response distributions and number of response categories per item. We found that parallel analysis and principal component analysis of smoothed polychoric and Pearson correlations detected the number of major factors in the simulated datasets most accurately among the methods investigated. Of the algorithms for smoothing polychoric correlation matrices, we recommend the one based on minimum-trace factor analysis.
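The eigenvalue-adaptation idea can be sketched in a few lines: clip the negative eigenvalues of an indefinite pseudo-correlation matrix and rescale back to unit diagonal. This is a generic NumPy illustration of that family of smoothers, not the exact algorithms compared in the study.

```python
import numpy as np

def smooth_corr(R, eps=1e-6):
    """Smooth an indefinite correlation matrix by clipping negative
    eigenvalues (one common eigenvalue-adaptation scheme), then
    rescaling so the diagonal is exactly 1 again."""
    vals, vecs = np.linalg.eigh(R)
    vals = np.clip(vals, eps, None)      # force positive definiteness
    S = vecs @ np.diag(vals) @ vecs.T
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)            # restore unit diagonal

# An indefinite "pseudo-correlation" matrix (smallest eigenvalue < 0),
# the kind of matrix that can arise from pairwise polychoric estimates
R = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.9],
              [0.2, 0.9, 1.0]])
print(np.linalg.eigvalsh(R).min())               # negative
S = smooth_corr(R)
print(np.linalg.eigvalsh(S).min(), np.diag(S))   # non-negative, unit diagonal
```

A parallel analysis can then be run on `S` instead of `R` without the indefiniteness breaking downstream eigen-decompositions.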

3.
V. A. Mashin. Biophysics, 2011, 56(2):286-297
The methodological problems of factor analysis of the heart rate spectrum are considered: the multivariate normality assumption, the factorability of the intercorrelation matrix, criteria for determining the number of factors, and the validity of the factor analysis model. Both unnormalized and normalized variables, and the matrices of Pearson's and Spearman's correlation coefficients, were used. In the factor analysis (principal components method with Varimax rotation), the limitations and applicability of different statistical criteria and procedures were explored, as were the meaning of the selected factors and the possibility of using the intercorrelation matrix of Spearman coefficients for unnormalized variables.

4.
The paper deals with the structural organization of psychophysiological and psychological characteristics of personality. The following characteristics were estimated: psychomotor, intellectual and creative abilities, features of imagination, and the basic personality traits of extraversion-introversion and neuroticism. Using Q-technique factor analysis of the first, second and third orders, the authors singled out three types of psychophysiological and psychological human organization, each given a generalized interpretation. They conclude that these types reflect more generalized processes than factors derived with R-technique factor analysis.

5.
Areas of endemism are central to cladistic biogeography. The concept has been much debated, and from this debate has emerged the generally accepted definition of an area to which at least two species are endemic. Protocols for locating areas of endemism have been neglected, and to date no attempt has been made to develop optimality criteria against which to evaluate competing hypotheses of areas of endemism. Here various protocols for finding areas of endemism are evaluated: protocols based on both phenetic and parsimony analyses, using both unweighted data and data weighted by various criteria. The optimality criteria used to compare the performance of the methods include the number of species included in the areas of endemism, the number of areas delimited, and the degree of distributional congruence of the species restricted to each area of endemism. These methods are applied to the African Restionaceae in the Cape Floristic Region. Parsimony methods using weighted data are shown to perform best on the combination of all three optimality criteria. By varying the weighting parameters, the size of the areas of endemism can be varied. This provides a very useful tool for locating areas of endemism that satisfy prespecified scale criteria.

6.
Population stratification may confound the results of genetic association studies among unrelated individuals from admixed populations. Several methods have been proposed to estimate ancestral information in admixed populations and to adjust for population stratification in genetic association tests. We evaluate the performance of three such methods: maximum likelihood estimation, ADMIXMAP and Structure, using various simulated data sets and real data from Latino subjects participating in a genetic study of asthma. All three methods provide similar information on the accuracy of ancestral estimates and control the type I error rate at approximately the same level. The most important factor in determining the accuracy of the ancestry estimate, and in minimizing the type I error rate, is the number of markers used to estimate ancestry. We demonstrate that approximately 100 ancestry informative markers (AIMs) are required to obtain ancestry estimates that correlate with the true individual ancestral proportions at coefficients above 0.9. In addition, after accounting for ancestry information in association tests, the excess type I error rate is controlled at the 5% level when 100 markers are used to estimate ancestry. However, since the effect of admixture on the type I error rate worsens with sample size, the accuracy of ancestry estimates also needs to increase to make the appropriate correction. Using data from the Latino subjects, we also apply these methods to an association study between body mass index and 44 AIMs. These simulations are meant to provide practical guidelines for investigators conducting association studies in admixed populations.

7.
The precise, quantitative portrayal of intraindividual change is a key goal of developmental researchers. Research designs particularly appropriate to selected aspects of this goal are multivariate, replicated, single-subject, repeated measures (MRSRM) designs. Data collected with these designs may be factor analyzed with P-technique factor analysis to elucidate, for each individual studied, patterns of systematic change. The nature of the factor patterns obtained for each individual can then be compared across individuals to determine the relative idiosyncrasy or generality of patterns of change. This paper extends an earlier review (Luborsky & Mintz, 1972) of studies using multivariate, single-subject, repeated measures designs and P-technique factor analysis. The emphasis here is on both more recent studies and the value of subject replication in creating a confluence of idiographic and nomothetic approaches to the study of behavior and behavioral development across the lifespan.

8.
9.
McKinney SA, Joo C, Ha T. Biophysical Journal, 2006, 91(5):1941-1951
The analysis of single-molecule fluorescence resonance energy transfer (FRET) trajectories has become a topic of significant biophysical interest. In deducing the transition rates between the various states of a system from time-binned data, researchers have relied on simple but often arbitrary methods of extracting rates from FRET trajectories. Although these methods have proven satisfactory for well-separated, low-noise, two- or three-state systems, they become less reliable when applied to systems of greater complexity. We have developed an analysis scheme that casts single-molecule time-binned FRET trajectories as hidden Markov processes, allowing one to determine, based on probability alone, the most likely FRET-value distributions of states and their interconversion rates, while simultaneously determining the most likely time sequence of underlying states for each trajectory. Together with a transition density plot and the Bayesian information criterion, we can also determine the number of different states present in a system in addition to the state-to-state transition probabilities. Here we present the algorithm and test its limitations with various simulated data and previously reported Holliday junction data. The algorithm is then applied to the analysis of the binding and dissociation of three RecA monomers on a DNA construct.
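The core of such hidden-Markov analysis, decoding the most likely state sequence from a noisy time-binned trace, can be sketched with plain NumPy Viterbi decoding under Gaussian emissions. The FRET means, noise level and transition matrix below are made-up toy values, and this is only the decoding step, not the authors' full maximum-likelihood estimation.

```python
import numpy as np

def viterbi(obs, means, sigma, log_pi, log_A):
    """Most likely hidden-state path for a Gaussian-emission HMM
    (the decoding step of HMM analysis of time-binned FRET traces)."""
    obs = np.asarray(obs)
    # log emission probabilities, shape (states, time)
    log_b = (-0.5 * ((obs[None, :] - means[:, None]) / sigma) ** 2
             - np.log(sigma * np.sqrt(2 * np.pi)))
    K, T = log_b.shape
    delta = log_pi + log_b[:, 0]
    psi = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[:, t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):            # backtrack
        path[t - 1] = psi[t, path[t]]
    return path

# Toy two-state trace: low FRET (~0.2) then high FRET (~0.8), plus noise
rng = np.random.default_rng(1)
true = np.array([0] * 50 + [1] * 50)
trace = np.where(true == 0, 0.2, 0.8) + rng.normal(0, 0.05, 100)
means, sigma = np.array([0.2, 0.8]), 0.05
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.98, 0.02], [0.02, 0.98]]))
path = viterbi(trace, means, sigma, log_pi, log_A)
print((path == true).mean())   # fraction of bins decoded correctly
```

In a full analysis the emission means and transition matrix would themselves be fitted (e.g. by Baum-Welch) rather than supplied by hand.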

10.
Gene set methods aim to assess the overall evidence of association of a set of genes with a phenotype, such as disease or a quantitative trait. Multiple approaches for gene set analysis of expression data have been proposed. They can be divided into two types: competitive and self-contained. Benefits of self-contained methods include that they can be used for genome-wide, candidate gene, or pathway studies, and they have been reported to be more powerful than competitive methods. We therefore investigated ten self-contained methods that can be used for continuous, discrete and time-to-event phenotypes. To assess the power and type I error rate of the various previously proposed and novel approaches, an extensive simulation study was completed in which the scenarios varied according to: the number of genes in a gene set, the number of genes associated with the phenotype, effect sizes, correlation between expression of genes within a gene set, and the sample size. In addition to the simulated data, the various methods were applied to a pharmacogenomic study of the drug gemcitabine. Simulation results demonstrated that, overall, Fisher's method and the global model with random effects have the highest power for a wide range of scenarios, while the analysis based on the first principal component and the Kolmogorov-Smirnov test tended to have the lowest power. The methods investigated here are likely to play an important role in identifying pathways that contribute to complex traits.
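Fisher's method itself is simple enough to sketch: assuming independent per-gene p-values, the statistic -2*sum(log p) is referred to a chi-square distribution with 2m degrees of freedom under the global null. The sketch below uses the closed-form chi-square survival function for even degrees of freedom; it illustrates the combining rule only, not the authors' full simulation pipeline.

```python
import math

def fisher_combine(pvals):
    """Fisher's method for a self-contained gene set test:
    -2 * sum(log p) ~ chi^2 with 2m d.o.f. under the global null."""
    m = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)
    # chi^2_{2m} survival function, closed form for even d.o.f.:
    # P(X >= stat) = exp(-x) * sum_{i=0}^{m-1} x^i / i!,  x = stat / 2
    x = stat / 2.0
    term, s = 1.0, 1.0
    for i in range(1, m):
        term *= x / i
        s += term
    return math.exp(-x) * s

print(fisher_combine([0.5, 0.5]))            # null-like p-values: large p
print(fisher_combine([0.001, 0.002, 0.01]))  # concordant signal: small p
```

Note the independence assumption: correlated expression within a gene set inflates the statistic, which is one reason the simulation study above varies within-set correlation.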

11.
SUMMARY. 1. The reliability of the simple frequency, Janetschek, Cassie and Dyar's law methods for determining or corroborating instars of mayflies and stoneflies was evaluated using data from published studies, a population of Baetisca rogersi, and populations simulated using random numbers and generated normal distributions. 2. The Janetschek and Cassie methods are variations of the simple frequency method that offer no significant advantage. Modes of the Cassie method, thought to represent instars, are much more difficult or impossible to detect than the corresponding peaks of the other two methods. 3. Overlap in size between adjacent instars can lead to false instar peaks or modes in frequency plots. The potential for overlap in mayflies and stoneflies is greatly increased, compared with other insects, because of their large number of instars and known developmental variability. The normal distribution simulations demonstrated that instar size variability as low as 5-7.5% COV (coefficient of variability) may lead to false instar peaks when the number of instars is in the typical range. These simulations also indicated that even simple frequency plots with distinct peaks may yield inaccurate instar determinations. 4. The number of size classes used in an analysis was correlated with the number of peaks or modes revealed. The number of peaks greater than zero in the Janetschek plots for the Baetisca rogersi population varied from 5 to 53 as the number of size classes was varied from 20 to 188. Similarly, for the random number simulations, the number of peaks varied from 6 to 41 as the number of size classes varied from 22 to 127. 5. Dyar's law semi-logarithmic plots do not corroborate instars determined through frequency methods, because the uniform spacing of 'instar' data points is the direct result of the uniform spacing of peaks in frequency plots of most data sources (including random numbers), whether or not the peaks actually indicate instars. Moreover, Dyar's law plots will 'corroborate' different numbers of instars depending on the peak selection criteria used. The potential for corroborating instars through supplemental rearing and best-fit analysis is discussed. 6. The future of mayfly and stonefly instar determination lies in increased and more rigorous application of the rearing and Palmen body (mayflies only) methods.

12.
13.
Conservation genetics: beyond the maintenance of marker diversity
One of the major problems faced by conservation biologists is the allocation of scarce resources to an overwhelmingly large number of species in need of preservation efforts. Both demographic and genetic information have been brought to bear on this problem; however, the role of information obtained from genetic markers has largely been limited to the characterization of gene frequencies and patterns of diversity. While the genetic consequences of rarity may contribute to endangerment, it is widely recognized that demographic factors are often more important. Because patterns of genetic marker variation are influenced by the same demographic factors of interest to the conservation biologist, it is possible to extract useful demographic information from genetic marker data. Such an approach may be productive for determining plant mating systems, inbreeding depression, effective population size, and metapopulation structure. In many cases, however, data consisting only of marker frequencies are inadequate for these purposes. Development of genealogy-based analytical methods, coupled with studies of DNA sequence variation within and among populations, is likely to yield the most information on demographic processes from genetic marker data. Indeed, in some cases it may be the only means of obtaining information on the long-term demographic properties that may be most useful for determining the future prospects of a species of interest.

14.
Objectives: This study focuses on experimental analysis, and corresponding mathematical simulation, of the in vitro proliferation of HUVECs (human umbilical vein endothelial cells) in the presence of various drugs. Materials and methods: HUVECs seeded in Petri dishes were expanded to confluence. Temporal profiles of total cell count, obtained by classic haemocytometry, and cell size distributions, measured with an electronic Coulter counter, are quantitatively simulated by a model based on the population balance approach. The influence of drugs on cell proliferation is simulated with suitable kinetic equations. Results and discussion: The model's parameters were determined by comparison with experimental data on cell population expansion and cell size distribution in the absence of drugs. The inhibition constant for each drug was estimated by comparing experimental data with model results for the temporal profiles of total cell count. The reliability of the model and its predictive capability were tested by simulating cell size distributions for experiments performed in the presence of drugs. The proposed model should be useful for interpreting the effects of selected drugs on the expansion of readily available human cells.

15.
Eighteen quantitative finger and palmar dermatoglyphic traits were analyzed to determine genetic effects and common familial environmental influences in a large twin sample (358 nuclear pedigrees of MZ and DZ twins). Genetic analysis based on principal factors included variance decomposition and bivariate variance decomposition analysis. Factor 1 (digital pattern size) is especially remarkable for its degree of universality. The genetic analysis revealed that all three extracted factors have a significant proportion of additive genetic variance (72.9% to 93.5%). The main result of the bivariate variance decomposition analysis is a significant correlation in residual variance between the digital pattern size factor (Factor 1) and the finger pattern intensity factor (Factor 4), and between the palmar main lines factor (Factor 2) and the a-b ridge count (Factor 3), but there was no significant correlation in the genetic variance of the factors.

16.
The choice of an appropriate sample size for a study is a notoriously neglected topic in behavioural research, even though it is of utmost importance and the rules of action are more than clear, or are they? They may be clear if a formal power analysis is concerned. However, with the educated guesswork usually applied in behavioural studies there are various trade-offs, and the degrees of freedom are extensive. An analysis of 119 original studies haphazardly chosen from five leading behavioural journals suggests that the selected sample size reflects an influence of constraints more often than a rational optimization process. As predicted, field work involves greater samples than studies conducted in captivity, and invertebrates are used in greater numbers than vertebrates when the approach is similar. However, whether the study employs observational or experimental means seems to matter less for determining the number of subjects. This is surprising because, in contrast to mere observations, experiments make it possible to reduce random variation in the data, which is an essential precondition for economizing on sample size. By pointing to inconsistent patterns, this article aims to stimulate thought and discussion among behavioural researchers on this crucial issue, for which apparently neither standard procedures nor conventions have yet been established. This is an issue of concern for authors, referees and editors alike.

17.
A common problem in molecular phylogenetics is choosing a model of DNA substitution that does a good job of explaining the DNA sequence alignment without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models are examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. For each data set, we compare the model with the best Bayes factor to the best models chosen using AIC and BIC. We find that the best model under any of these criteria is not necessarily the most complicated one; models with an intermediate number of substitution types typically do best. Moreover, almost all of the models that are chosen as best do not constrain a transition rate to be the same as a transversion rate, suggesting that it is the transition/transversion rate bias that plays the largest role in determining which models are selected. Importantly, the reversible jump Markov chain Monte Carlo algorithm described here allows estimation of phylogeny (and other phylogenetic model parameters) to be performed while accounting for uncertainty in the model of DNA substitution.
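The AIC/BIC comparison mentioned above reduces to a one-line formula once each model's maximized log-likelihood is in hand. In the sketch below the log-likelihoods and free-parameter counts for three classic substitution models are invented illustration values, not results from the article's 16 alignments.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 log L (lower is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k log n - 2 log L."""
    return k * math.log(n) - 2 * loglik

# Hypothetical fits of three nested substitution models to an
# alignment of n = 1000 sites: (max log-likelihood, extra free
# rate/frequency parameters relative to JC69). Values are made up.
models = {"JC69": (-5120.0, 0), "HKY85": (-5050.0, 4), "GTR": (-5048.0, 8)}
n = 1000

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(models[m][0], models[m][1], n))
print(best_aic, best_bic)
```

With these toy numbers both criteria prefer the intermediate model: GTR's extra parameters buy too little likelihood, echoing the article's finding that models of intermediate complexity typically do best.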

18.
The columnar arrangement of dividing cells in the epiphyseal cartilage plates of growing bones provides a model of a linear proliferation system. One factor determining the rate of cell production, and hence the rate of growth, is the size of the proliferating population; in this one-dimensional system, that size is equal to the length of the proliferation zone. Two possible mechanisms for a differentiation control that sets a limit on the length of this zone have been tested in computer simulations. While a diffusion-gradient control is consistent with cell kinetic measurements, a division limit based on an inheritable growth substance is shown to require further development before the model fits experimental data. Cell division in the columns produces linear clones of cells. If the final length of a bone is set by a limit on the number of divisions that the cartilage stem cells can make, then the number of cells per clone is crucial in determining overall bone growth. The parameters that affect linear clone size have been investigated in computer simulations. Clone size depends largely on the division rate of stem cells relative to that of proliferation zone cells, but the data on stem cell division rates are generally unreliable. The analysis could be applied to other linear proliferating systems.

19.
DNA microarray gene expression and microarray-based comparative genomic hybridization (aCGH) have been widely used for biomedical discovery. Because of the large number of genes and the complex nature of biological networks, various analysis methods have been proposed. One such method is "gene shaving," a procedure that identifies subsets of genes with coherent expression patterns and large variation across samples. Since combining genomic information from multiple sources can improve classification and prediction of diseases, in this paper we propose a new method, "ICA gene shaving" (ICA, independent component analysis), for jointly analyzing gene expression and copy number data. First, we used ICA to analyze joint measurements (gene expression and copy number) of a biological system and to project the data onto statistically independent biological processes. Next, we used these results to identify patterns of variation in the data and then applied an iterative shaving method. We investigated the properties of the proposed method by analyzing both simulated and real data, and demonstrated its robustness to noise using the simulated data. Using breast cancer data, we showed that our method is superior to the Generalized Singular Value Decomposition (GSVD) gene shaving method for identifying genes associated with breast cancer.

20.
Hui M, Li J, Wen X, Yao L, Long Z. PLoS ONE, 2011, 6(12):e29274

Background

Independent Component Analysis (ICA) has been widely applied to the analysis of fMRI data. Accurate estimation of the number of independent components of fMRI data is critical to reduce over- or under-fitting. Although various methods based on Information Theoretic Criteria (ITC) have been used to estimate the intrinsic dimension of fMRI data, the relative performance of different ITC in the context of the ICA model has not been fully investigated, especially considering the properties of fMRI data. The present study explores and evaluates the performance of various ITC for fMRI data with varied white noise levels, colored noise levels, temporal data sizes and degrees of spatial smoothness.

Methodology

Experiments on both simulated and real fMRI data with varied Gaussian white noise levels, first-order auto-regressive (AR(1)) noise levels, temporal data sizes and degrees of spatial smoothness were carried out to thoroughly explore and evaluate the performance of different traditional ITC.

Principal Findings

Results indicate that the performance of the ITCs depends on the noise level, temporal data size and spatial smoothness of the fMRI data. 1) High white noise levels may lead to underestimation by all criteria, with MDL/BIC underestimating most severely at higher Gaussian white noise levels. 2) Colored noise may result in overestimation, which is intensified by an increase in the AR(1) coefficient rather than in the SD of the AR(1) noise; MDL/BIC shows the least overestimation. 3) A larger temporal data size improves estimation under the white noise model but tends to cause more severe overestimation under the AR(1) noise model. 4) Spatial smoothing results in overestimation under both noise models.

Conclusions

1) No ITC is perfect for all fMRI data, owing to its complicated noise structure. 2) If only white noise is present in the data, AIC is preferred when the noise level is high; otherwise, the Laplace approximation is a better choice. 3) When colored noise is present in the data, MDL/BIC outperforms the other criteria.
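A standard estimator in this ITC family is the Wax-Kailath rule, which scores each candidate dimension by how equal the remaining "noise" eigenvalues of the data covariance are, plus an AIC or MDL penalty. The sketch below is a generic toy on simulated mixed sources (not the authors' evaluation pipeline), with made-up dimensions and noise level.

```python
import numpy as np

def estimate_ncomp(X, criterion="MDL"):
    """Wax-Kailath information-theoretic estimate of the number of
    sources from the eigenvalues of the sample covariance."""
    N, p = X.shape
    ev = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]
    scores = []
    for k in range(p):
        tail = ev[k:]                       # assumed noise eigenvalues
        g = np.exp(np.mean(np.log(tail)))   # geometric mean
        a = np.mean(tail)                   # arithmetic mean
        ll = N * (p - k) * np.log(a / g)    # log-likelihood term (>= 0)
        # MDL penalty; for AIC we use the equivalent argmin form pen/2
        pen = 0.5 * k * (2 * p - k) * np.log(N) if criterion == "MDL" \
              else k * (2 * p - k)
        scores.append(ll + pen)
    return int(np.argmin(scores))

rng = np.random.default_rng(2)
N, p, q = 2000, 10, 3                       # samples, channels, true sources
S = rng.normal(size=(N, q))                 # independent sources
A = rng.normal(size=(q, p))                 # mixing matrix
X = S @ A + 0.1 * rng.normal(size=(N, p))   # white Gaussian noise
print(estimate_ncomp(X, "MDL"), estimate_ncomp(X, "AIC"))
```

Re-running with colored (AR(1)) noise or spatially smoothed data breaks the equal-noise-eigenvalue assumption, which is exactly the overestimation mechanism the findings above describe.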


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号