Similar literature
1.
Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p > n), microarray data pose big challenges for statistical analysis. An obvious problem arising from 'large p, small n' is over-fitting: just by chance, we are likely to find some non-differentially expressed genes that classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in microarray data analysis. The SAM statistics proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple and intuitive and have proved useful in empirical studies. Recently Wu proposed the penalized t/F-statistics with shrinkage by formally using L1-penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using L1-penalized regression models. We also show that penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
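The link between an L1 penalty and a shrunken centroid can be illustrated with soft-thresholding, which is the closed-form solution of L1-penalized least squares for a single coefficient. The sketch below is only an illustration under that assumption: the function name `soft_threshold`, the example scores and the threshold `delta` are hypothetical and are not taken from the paper.

```python
def soft_threshold(d: float, delta: float) -> float:
    """Shrink d toward zero by delta; values within delta of zero become exactly 0."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0


# Hypothetical standardized differences between one class centroid and the
# overall centroid, one value per gene (made-up numbers for illustration).
d_scores = [2.5, -0.4, 0.9, -3.1]
delta = 1.0  # shrinkage threshold (assumed; normally chosen by cross-validation)

shrunken = [soft_threshold(d, delta) for d in d_scores]
print(shrunken)
```

Genes whose score is shrunk to exactly zero no longer contribute to classification, which is how the penalty performs gene selection.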

2.
Computations for genome scans need to adapt to the increasing use of dense diallelic markers as well as of full-chromosome multipoint linkage analysis with either diallelic or multiallelic markers. Whereas suitable exact-computation tools are available for use with small pedigrees, equivalent exact computation for larger pedigrees remains infeasible. Markov chain Monte Carlo (MCMC)-based methods currently provide the only computationally practical option. To date, no systematic comparison of the performance of MCMC-based programs is available, nor have these programs been systematically evaluated for use with dense diallelic markers. Using simulated data, we evaluate the performance of two MCMC-based linkage-analysis programs (lm_markers from the MORGAN package and SimWalk2) under a variety of analysis conditions. Pedigrees consisted of 14, 52, or 98 individuals in 3, 5, or 6 generations, respectively, with increasing amounts of missing data in larger pedigrees. One hundred replicates of markers and trait data were simulated on a 100-cM chromosome, with up to 10 multiallelic and up to 200 diallelic markers used simultaneously for computation of multipoint LOD scores. Exact computation was available for comparison in most situations, and comparison with a perfectly informative marker or interprogram comparison was available in the remaining situations. Our results confirm the accuracy of both programs in multipoint analysis with multiallelic markers on pedigrees of varied sizes and missing-data patterns, but there are some computational differences. In contrast, for large numbers of dense diallelic markers, only the lm_markers program was able to provide accurate results within a computationally practical time. Thus, programs in the MORGAN package are the first available to provide a computationally practical option for accurate linkage analyses in genome scans with both large numbers of diallelic markers and large pedigrees.

3.
The genetic management of captive populations to conserve genetic variation is currently based on analyses of individual pedigrees to infer inbreeding and kinship coefficients and the values of individuals as breeders. Such analyses require that individual pedigrees are known and that individual pairing (mating) can be controlled. Many species in captivity, however, breed in groups for various reasons, such as space constraints and fertility considerations for species living naturally in social groups, and thus have no pedigrees available for traditional genetic analyses and management. In the absence of individual pedigrees, such group breeding populations can still be genetically monitored, evaluated and managed by suitable population genetics models using population-level information (such as census data). This article presents a simple genetic model of group breeding populations to demonstrate how to estimate the genetic variation maintained within and among populations and to optimise management based on these estimates. A numerical example is provided to illustrate the use of the proposed model. Some issues relevant to group breeding, such as the development and robustness evaluation of the population genetics model appropriate for a particular species under specific management and recording systems and the genetic monitoring with markers, are also briefly discussed.

4.

Background

Populational linkage disequilibrium and within-family linkage are commonly used for QTL mapping and marker assisted selection. The combination of both results in more robust and accurate locations of the QTL, but models proposed so far have been either single marker, complex in practice or well fit to a particular family structure.

Results

We herein present linear model theory to derive the additive effects of the QTL alleles in any member of a general pedigree, conditional on observed markers and pedigree, accounting for possible linkage disequilibrium among QTLs and markers. The model is based on association analysis in the founders; further, the additive effect of the QTLs transmitted to the descendants is a weighted (by the probabilities of transmission) average of the substitution effects of the founders' haplotypes. The model allows for incomplete linkage disequilibrium between QTL and markers in the founders. Two submodels are presented: a simple and easy-to-implement Haley-Knott type regression for half-sib families, and a general mixed (variance component) model for general pedigrees. The model can use information from all markers. The performance of the regression method is compared by simulation with a more complex IBD method by Meuwissen and Goddard. Numerical examples are provided.

Conclusion

The linear model theory provides a useful framework for QTL mapping with dense marker maps. Results show similar accuracies but a bias of the IBD method towards the center of the region. Computations for the linear regression model are extremely simple, in contrast with IBD methods. Extensions of the model to genomic selection and multi-QTL mapping are straightforward.

5.
Evidence for a major gene influence on persistent developmental stuttering (cited by: 1; self-citations: 0; citations by others: 1)
Stuttering is a complex developmental speech disorder of unknown etiology. There is a substantial aggregation of stuttering in families, suggesting a genetic component to the disorder. However, the exact mode of transmission is still unknown. An earlier study of 56 multigenerational pedigrees ascertained through single adult probands (38 males and 18 females) found that biological relatives of persistent developmental stutterers have an approximately 10-fold higher risk than the general population; risk is higher for male relatives, and the proband's sex does not affect recurrence and relative risks. In the present paper we conduct a complex segregation analysis of the same data, using the logistic regression model of the SAGE software. Based on comparisons of model likelihoods, the Mendelian model was selected over all other nongenetic models and the general transmission model. This model was further refined into the most parsimonious model, which shows an autosomal dominant major gene effect influenced by two covariates: sex and affection status of parents. With this model applied to 47 informative multiplex pedigrees, a power calculation based on linkage simulation produced an average lod score of 6.8 for 10-cM density genome scan markers. These results give impetus for a genomewide linkage analysis of susceptibility to persistent developmental stuttering.

6.
The availability of high-density panels of molecular markers has prompted the adoption of genomic selection (GS) methods in animal and plant breeding. In GS, parametric, semi-parametric and non-parametric regression models are used for predicting quantitative traits. This article shows how to use neural networks with radial basis functions (RBFs) for prediction with dense molecular markers. We illustrate the use of the linear Bayesian LASSO regression model and of two non-linear regression models, reproducing kernel Hilbert spaces (RKHS) regression and radial basis function neural networks (RBFNN), on simulated data and real maize lines genotyped with 55,000 markers and evaluated for several trait-environment combinations. The empirical results of this study indicated that the three models showed similar overall prediction accuracy, with a slight and consistent superiority of RKHS and RBFNN over the additive Bayesian LASSO model. Results from the simulated data indicate that RKHS and RBFNN models captured epistatic effects; however, adding non-signal (redundant) predictors (interaction between markers) can adversely affect the predictive accuracy of the non-linear regression models.
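Both RKHS regression and RBF neural networks are commonly built on a Gaussian radial basis function of the squared distance between marker genotype vectors. The sketch below only shows how such a kernel matrix could be computed; the 0/1/2 genotype coding, the `bandwidth` value and the function names are illustrative assumptions, not details from the article.

```python
import math


def rbf_kernel(x, z, bandwidth):
    """Gaussian RBF: exp(-bandwidth * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-bandwidth * sq_dist)


def kernel_matrix(genotypes, bandwidth):
    """Pairwise kernel values for a list of marker genotype vectors."""
    return [[rbf_kernel(x, z, bandwidth) for z in genotypes] for x in genotypes]


# Three hypothetical lines genotyped at four markers, coded 0/1/2 (made-up data).
lines = [
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [2, 0, 0, 2],
]
K = kernel_matrix(lines, bandwidth=0.25)
for row in K:
    print([round(v, 3) for v in row])
```

Lines with similar marker profiles receive kernel values near 1 and dissimilar lines values near 0, so predictions borrow information mainly from genetically similar individuals.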

7.
An extraordinarily large number of single nucleotide polymorphisms (SNPs) are now available in humans as well as in other model organisms. Technological advancements may soon make it feasible to assay hundreds of SNPs in virtually any organism of interest. One potential application of SNPs is the determination of pairwise genetic relationships in populations without known pedigrees. Although microsatellites are currently the marker of choice for this purpose, the number of independently segregating microsatellite markers that can be feasibly assayed is limited. Thus, it can be difficult to distinguish reliably some classes of relationship (e.g. full-sibs from half-sibs) with microsatellite data alone. We assess, via Monte Carlo computer simulation, the potential for using a large panel of independently segregating SNPs to infer genetic relationships, following the analytical approach of Blouin et al. (1996). We have explored a 'best case scenario' in which 100 independently segregating SNPs are available. For discrimination among single-generation relationships or for the identification of parent-offspring pairs, it appears that such a panel of moderately polymorphic SNPs (minor allele frequency of 0.20) will provide discrimination power equivalent to only 16-20 independently segregating microsatellites. Although newly available analytical methods that can account for tight genetic linkage between markers will, in theory, allow improved estimation of relationships using thousands of SNPs in highly dense genomic scans, in practice such studies will only be feasible in a handful of model organisms. Given the comparable amount of effort required for the development of both types of markers, it seems that microsatellites will remain the marker of choice for relationship estimation in nonmodel organisms, at least for the foreseeable future.

8.
Yi N, George V, Allison DB. Genetics, 2003, 164(3): 1129-1138
In this article, we utilize stochastic search variable selection methodology to develop a Bayesian method for identifying multiple quantitative trait loci (QTL) for complex traits in experimental designs. The proposed procedure entails embedding multiple regression in a hierarchical normal mixture model, where latent indicators for all markers are used to identify the multiple markers. The markers with significant effects can be identified as those with a higher posterior probability of inclusion in the model. A simple and easy-to-use Gibbs sampler is employed to generate samples from the joint posterior distribution of all unknowns, including the latent indicators, genetic effects for all markers, and other model parameters. The proposed method was evaluated using simulated data and illustrated using a real data set. The results demonstrate that the proposed method works well under typical situations of most QTL studies in terms of number of markers and marker density.

9.
Dense sets of hundreds of thousands of markers have been developed for genome-wide association studies. These marker sets are also beneficial for linkage analysis of large, deep pedigrees containing distantly related cases. It is impossible to analyse jointly all genotypes in large pedigrees using the Lander–Green algorithm; however, as marker density increases, it becomes less crucial to analyse all individuals' genotypes simultaneously. In this report, an approximate multipoint non-parametric technique is described, in which large pedigrees are split into many small pedigrees, each containing just two cases. This technique is demonstrated using phased data from the International HapMap Project to simulate sets of 10,000, 50,000 and 250,000 markers, showing that it becomes increasingly accurate as more markers are genotyped. This method allows routine linkage analysis of large families with dense marker sets and represents a more easily applied alternative to Markov chain Monte Carlo methods.

10.
SUMMARY: Differential gene expression detection using microarrays has received considerable research interest recently. Many methods have been proposed, including variants of F-statistics, non-parametric approaches and empirical Bayesian methods. The SAM statistic has been shown to perform well in empirical studies. SAM is essentially an ad hoc shrinkage method. The idea is that for small-sample microarray data, it is often useful to pool information across genes to improve efficiency. Under a Bayesian framework, Smyth formally derived test statistics with shrinkage using hierarchical models. In this paper we cast differential gene expression detection in the familiar framework of the linear regression model. Commonly used test statistics correspond to using least squares to estimate the regression parameters. Based on the vast literature on linear models, we can naturally consider other alternatives. Here we explore penalized linear regression. We propose the penalized t-/F-statistics for two-class microarray data based on [Formula: see text] penalty. We show that the penalized test statistics make intuitive sense, and through applications we illustrate their good performance. AVAILABILITY: Supplementary information including program codes, more detailed analysis results and R functions for the proposed methods can be found at http://www.biostat.umn.edu/~baolin/research CONTACT: baolin@biostat.umn.edu SUPPLEMENTARY INFORMATION: http://www.biostat.umn.edu/~baolin/research.

11.
Aulchenko YS, de Koning DJ, Haley C. Genetics, 2007, 177(1): 577-585
For pedigree-based quantitative trait loci (QTL) association analysis, a range of methods utilizing within-family variation, such as transmission-disequilibrium test (TDT)-based methods, have been developed. In scenarios where stratification is not a concern, methods exploiting between-family variation in addition to within-family variation, such as the measured genotype (MG) approach, have greater power. Application of MG methods can be computationally demanding (especially for large pedigrees), making genomewide scans practically infeasible. Here we suggest a novel approach for genomewide pedigree-based QTL association analysis: genomewide rapid association using mixed model and regression (GRAMMAR). The method first obtains residuals adjusted for family effects and subsequently analyzes the association between these residuals and genetic polymorphisms using rapid least-squares methods. At the final step, the selected polymorphisms may be followed up with the full MG analysis. In a simulation study, we compared type 1 error, power, and operational characteristics of the proposed method with those of MG and TDT-based approaches. For moderately heritable (30%) traits in human pedigrees, the power of the GRAMMAR and the MG approaches is similar and is much higher than that of TDT-based approaches. When using tabulated thresholds, the proposed method is less powerful than MG for very high heritabilities and pedigrees including large sibships like those observed in livestock pedigrees. However, there is little or no difference in empirical power of MG and the proposed method. In any scenario, GRAMMAR is much faster than MG and enables rapid analysis of hundreds of thousands of markers.

12.
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties including the standard error formulae are provided. A simulation study shows that the new algorithm not only has more accurate or at least comparable estimation, but also is more robust than traditional stepwise variable selection. The proposed methods are applied to analyze health care demand in Germany using the open-source R package mpath.
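The three penalties named above differ in how strongly they penalize large coefficients. The sketch below writes out their standard textbook forms for a single coefficient; the function names and the default values a = 3.7 and gamma = 3.0 are conventional choices assumed here for illustration, and the code does not reproduce the mpath package.

```python
def lasso_penalty(t: float, lam: float) -> float:
    """LASSO: penalty grows linearly with |t| without bound."""
    return lam * abs(t)


def scad_penalty(t: float, lam: float, a: float = 3.7) -> float:
    """SCAD: linear near zero, quadratic blend, then constant for |t| > a * lam."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(t: float, lam: float, gamma: float = 3.0) -> float:
    """MCP: penalty rate decays linearly to zero, constant for |t| > gamma * lam."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2


# Compare the penalties on a small and a large coefficient (lam = 1.0 assumed).
for coef in (0.5, 10.0):
    print(coef, lasso_penalty(coef, 1.0), scad_penalty(coef, 1.0), mcp_penalty(coef, 1.0))
```

SCAD and MCP flatten out for large coefficients, so strong effects are shrunk less than under LASSO, where the penalty keeps growing.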

13.
Lin S, Ding J, Dong C, Liu Z, Ma ZJ, Wan S, Xu Y. BMC Genetics, 2005, 6(Z1): S76
We compare and contrast the performance of SIMPLE, a Monte Carlo-based software package, with that of several other methods for linkage and haplotype analyses, focusing on the simulated data from the New York City population. First, a whole-genome scan study based on the microsatellite markers was performed using GENEHUNTER. Because GENEHUNTER had to drop individuals for many of the pedigrees, we performed a follow-up study focusing on several regions of interest using SIMPLE, which can handle all pedigrees in their entirety. Second, three haplotyping programs, including that in SIMPLE, were used to reconstruct haplotypic configurations in pedigrees. SIMPLE emerges clearly as a preferred tool, as it can handle large pedigrees and produces haplotypic configurations without double recombinant haplotypes. For this study, we had knowledge of the simulating models at the time we performed the analysis.

14.
Estimating the genetic variance available for traits informs us about a population’s ability to evolve in response to novel selective challenges. In selfing species, theory predicts a loss of genetic diversity that could lead to an evolutionary dead-end, but empirical support remains scarce. Genetic variability in a trait is estimated by correlating the phenotypic resemblance with the proportion of the genome that two relatives share identical by descent (‘realized relatedness’). The latter is traditionally predicted from pedigrees (ΦA: expected value) but can also be estimated using molecular markers (average number of alleles shared). Nevertheless, evolutionary biologists, unlike animal breeders, remain cautious about using marker-based relatedness coefficients to study complex phenotypic traits in populations. In this paper, we review published results comparing five different pedigree-free methods and use simulations to test individual-based models (hereafter called animal models) using marker-based relatedness coefficients, with a special focus on the influence of mating systems. Our literature review confirms that Ritland’s regression method is unreliable, but suggests that animal models with marker-based estimates of relatedness and genomic selection are promising and that more testing is required. Our simulations show that using molecular markers instead of pedigrees in animal models seriously worsens the estimation of heritability in outcrossing populations, unless a very large number of loci is available. In selfing populations the results are less biased. More generally, populations with high identity disequilibrium (consanguineous or bottlenecked populations) could be propitious for using marker-based animal models, but are also more likely to deviate from the standard assumptions of quantitative genetics models (non-additive variance).

15.
Jannink JL. Genetics, 2007, 176(1): 553-561
Association studies are designed to identify main effects of alleles across a potentially wide range of genetic backgrounds. To control for spurious associations, effects of the genetic background itself are often incorporated into the linear model, either in the form of subpopulation effects in the case of structure or in the form of genetic relationship matrices in the case of complex pedigrees. In this context epistatic interactions between loci can be captured as an interaction effect between the associated locus and the genetic background. In this study I developed genetic and statistical models to tie the locus by genetic background interaction idea back to more standard concepts of epistasis when genetic background is modeled using an additive relationship matrix. I also simulated epistatic interactions in four-generation randomly mating pedigrees and evaluated the ability of the statistical models to identify when a biallelic associated locus was epistatic to other loci. Under additive-by-additive epistasis, when interaction effects of the associated locus were quite large (explaining 20% of the phenotypic variance), epistasis was detected in 79% of pedigrees containing 320 individuals. The epistatic model also predicted the genotypic value of progeny better than a standard additive model in 78% of simulations. When interaction effects were smaller (although still fairly large, explaining 5% of the phenotypic variance), epistasis was detected in only 9% of pedigrees containing 320 individuals and the epistatic and additive models were equally effective at predicting the genotypic values of progeny. Epistasis was detected with the same power whether the overall epistatic effect was the result of a single pairwise interaction or the sum of nine pairwise interactions, each generating one ninth of the epistatic variance. 
The power to detect epistasis was highest (94%) at low QTL minor allele frequency, fell to a minimum (60%) at minor allele frequency of about 0.2, and then plateaued at about 80% as alleles reached intermediate frequencies. The power to detect epistasis declined when the linkage disequilibrium between the DNA marker and the functional polymorphism was not complete.

16.
Cai T, Huang J, Tian L. Biometrics, 2009, 65(2): 394-404
Summary. In the presence of high-dimensional predictors, it is challenging to develop reliable regression models that can be used to accurately predict future outcomes. Further complications arise when the outcome of interest is an event time, which is often not fully observed due to censoring. In this article, we develop robust prediction models for event time outcomes by regularizing Gehan's estimator for the accelerated failure time (AFT) model (Tsiatis, 1996, Annals of Statistics 18, 305–328) with the least absolute shrinkage and selection operator (LASSO) penalty. Unlike existing methods based on inverse probability weighting and the Buckley and James estimator (Buckley and James, 1979, Biometrika 66, 429–436), the proposed approach does not require additional assumptions about the censoring and always yields a solution that is convergent. Furthermore, the proposed estimator leads to a stable regression model for prediction even if the AFT model fails to hold. To facilitate the adaptive selection of the tuning parameter, we detail an efficient numerical algorithm for obtaining the entire regularization path. The proposed procedures are applied to a breast cancer dataset to derive a reliable regression model for predicting patient survival based on a set of clinical prognostic factors and gene signatures. Finite sample performances of the procedures are evaluated through a simulation study.

17.
Genetically modified mouse strains derived from embryonic stem (ES) cells are powerful tools for gene function analysis. ES cells from the C57BL/6 mouse strain are not widely used to generate mouse models despite the advantage of a defined genetic background. We assessed genetic variation in six such ES cell lines with 275 SSLP markers. Compared to C57BL/6, Bruce4 differed at 34 SSLP markers and had significant heterozygosity on three chromosomes. The BL/6#3 and Dale1 ES cell lines differed at only 3 SSLP markers. The C2 and WB6d ES cell lines differed at 6 SSLP markers. It is important to compare the efficiency of producing mouse models with available C57BL/6 ES cells relative to standard 129 mouse strain ES cells. We assessed genetic stability (the tendency of cells to become aneuploid) in 110 gene-targeted ES cell clones from the most widely used C57BL/6 ES cell line, Bruce4, and 710 targeted 129 ES cell clones. Bruce4 clones were more likely to be aneuploid and unsuitable for ES cell-mouse chimera production. Despite their tendency to aneuploidy and consequent inefficiency, use of Bruce4 ES cells can be valuable for models requiring behavioral studies and other mouse models that benefit from a defined C57BL/6 background.

18.
Bayesian LASSO for quantitative trait loci mapping (cited by: 7; self-citations: 1; citations by others: 6)
Yi N, Xu S. Genetics, 2008, 179(2): 1045-1055
The mapping of quantitative trait loci (QTL) aims to identify molecular markers or genomic loci that influence the variation of complex traits. The problem is complicated by the fact that QTL data usually contain a large number of markers across the entire genome, most of which have little or no effect on the phenotype. In this article, we propose several Bayesian hierarchical models for mapping multiple QTL that simultaneously fit and estimate all possible genetic effects associated with all markers. The proposed models use prior distributions for the genetic effects that are scale mixtures of normal distributions with mean zero and variances distributed to give each effect a high probability of being near zero. We consider two types of priors for the variances, exponential and scaled inverse-χ² distributions, which result in a Bayesian version of the popular least absolute shrinkage and selection operator (LASSO) model and the well-known Student's t model, respectively. Unlike most applications where fixed values are preset for hyperparameters in the priors, we treat all hyperparameters as unknowns and estimate them along with other parameters. Markov chain Monte Carlo (MCMC) algorithms are developed to simulate the parameters from the posteriors. The methods are illustrated using well-known barley data.
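The connection between an exponential prior on the variances and the LASSO can be written out in one line. The identity below is the standard normal scale-mixture representation of the double-exponential (Laplace) density; the symbols \beta_j, \tau_j^2 and \lambda, and the rate parameterization \lambda^2/2, are notation assumed here and may differ from the parameterization used in the article.

```latex
\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2), \qquad
\tau_j^2 \sim \mathrm{Exp}\!\left(\tfrac{\lambda^2}{2}\right)
\;\Longrightarrow\;
p(\beta_j)
= \int_0^\infty \frac{1}{\sqrt{2\pi\tau_j^2}}\,
  e^{-\beta_j^2 / (2\tau_j^2)} \cdot
  \frac{\lambda^2}{2}\, e^{-\lambda^2 \tau_j^2 / 2}\, d\tau_j^2
= \frac{\lambda}{2}\, e^{-\lambda |\beta_j|}
```

Because the negative log of this marginal prior is \lambda |\beta_j| plus a constant, the posterior mode under it corresponds to an L1-penalized estimate, which is why the exponential variance prior yields a Bayesian version of the LASSO.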

19.
Fan R, Jung J, Jin L. Genetics, 2006, 172(1): 663-686
In this article, population-based regression models are proposed for high-resolution linkage disequilibrium mapping of quantitative trait loci (QTL). Two regression models, the "genotype effect model" and the "additive effect model," are proposed to model the association between the markers and the trait locus. The marker can be either diallelic or multiallelic. If only one marker is used, the method is similar to a classical setting by Nielsen and Weir, and the additive effect model is equivalent to the haplotype trend regression (HTR) method by Zaykin et al. If two/multiple marker data with phase ambiguity are used in the analysis, the proposed models can be used to analyze the data directly. By analytical formulas, we show that the genotype effect model can be used to model the additive and dominance effects simultaneously, whereas the additive effect model accounts for the additive effect only. On the basis of the two models, F-test statistics are proposed to test association between the QTL and markers. By a simulation study, we show that the two models have reasonable type I error rates for a data set of moderate sample size. The noncentrality parameter approximations of the F-test statistics are derived to enable power calculation and comparison. By a simulation study, it is found that the noncentrality parameter approximations of the F-test statistics work very well. Using the noncentrality parameter approximations, we compare the power of the two models with that of the HTR. In addition, a simulation study is performed to make a comparison on the basis of the haplotype frequencies of 10 SNPs of angiotensin-1 converting enzyme (ACE) genes.

20.
James E. Hixson. Genetica, 1987, 73(1-2): 85-90
Nonhuman primates are particularly useful as animal models for common human diseases in which both genetic and environmental factors play important roles. The recent development of DNA markers (restriction fragment length polymorphisms, RFLPs) greatly increases the power of linkage analysis to detect major genes that affect quantitative phenotypes, including those related to diseases. This paper summarizes a strategy for using RFLPs in linkage analysis of baboon pedigrees to identify genes that control lipoprotein phenotype, which in turn is predictive of susceptibility to atherosclerosis. This strategy also can be applied to other common human diseases for which nonhuman primate models exist.
