Similar articles
 20 similar articles found (search time: 0 ms)
1.
The prediction of phenotypic traits using high-density genomic data has many applications, such as the selection of plants and animals of commercial interest, and it is expected to play an increasing role in medical diagnostics. Statistical models used for this task are usually tested using cross-validation, which implicitly assumes that new individuals (whose phenotypes we would like to predict) originate from the same population the genomic prediction model is trained on. In this paper we propose an approach based on clustering and resampling to investigate the effect of increasing genetic distance between training and target populations when predicting quantitative traits. This is important for plant and animal genetics, where genomic selection programs rely on the precision of predictions in future rounds of breeding. Therefore, estimating how quickly predictive accuracy decays is important in deciding which training population to use and how often the model has to be recalibrated. We find that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations. We illustrate this relationship using simulations and a collection of data sets from mice, wheat and human genetics.

2.
Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman’s coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness-diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico, using simulations of a Wright–Fisher model for a realistic range of mutation rates and selection strengths. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with a false discovery rate of 0.1–0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.

3.
In the context of genetics and breeding research on multiple phenotypic traits, reconstructing the directional or causal structure between phenotypic traits is a prerequisite for quantifying the effects of genetic interventions on the traits. Current approaches mainly exploit the genetic effects at quantitative trait loci (QTLs) to learn about causal relationships among phenotypic traits. A requirement for using these approaches is that at least one unique QTL has been identified for each trait studied. However, in practice, especially for molecular phenotypes such as metabolites, this prerequisite is often not met due to limited sample sizes, high noise levels and small QTL effects. Here, we present a novel heuristic search algorithm called the QTL+phenotype supervised orientation (QPSO) algorithm to infer causal directions for edges in undirected phenotype networks. The two main advantages of this algorithm are: first, it does not require QTLs for each and every trait; second, it takes into account associated phenotypic interactions in addition to detected QTLs when orienting undirected edges between traits. We evaluate and compare the performance of QPSO with another state-of-the-art approach, the QTL-directed dependency graph (QDG) algorithm. Simulation results show that our method has broader applicability and leads to more accurate overall orientations. We also illustrate our method with a real-life example involving 24 metabolites and a few major QTLs measured on an association panel of 93 tomato cultivars. Matlab source code implementing the proposed algorithm is freely available upon request.

4.
We describe a new way to develop evidence of causes of biological effects using field-based species sensitivity distributions (SSDs) and show how evidence can be compared when genera or effect endpoints are different among potentially causal agents. To evaluate whether a cause is sufficient to elicit an effect, we developed a general SSD. A cause was judged sufficient if the intensity of the stressor at the site predicted the observed proportion of extirpation. To evaluate whether an effect is specific to a cause, we developed site-specific SSDs using field-based effect levels of genera occurring in the locality of the study. An effect was judged specific to a cause if susceptible genera were absent and tolerant genera were present. Field-based SSDs were used to assess nutrients and conductivity. Other associations were used to assess metals, sediment, dissolved oxygen, and temperature. A case study at Pigeon Roost Creek, Tennessee, USA, illustrates how the SSDs are used to infer multiple causes. A weight-of-evidence analysis identified nutrients and sediment as probable causes, but another unidentified agent appears to be acting as well. This inferential approach has broad application, and the causal models for conductivity, nutrients, and deposited sediment can be used at other locations.

5.
Associating phenotypic traits and quantitative trait loci (QTL) to causative regions of the underlying genome is a key goal in agricultural research. InterStoreDB is a suite of integrated databases designed to assist in this process. The individual databases are species independent and generic in design, providing access to curated datasets relating to plant populations, phenotypic traits, genetic maps, marker loci and QTL, with links to functional gene annotation and genomic sequence data. Each component database provides access to associated metadata, including data provenance and parameters used in analyses, thus providing users with information to evaluate the relative worth of any associations identified. The databases include CropStoreDB, for management of population, genetic map, QTL and trait measurement data, SeqStoreDB for sequence-related data and AlignStoreDB, which stores sequence alignment information and allows navigation between genetic and genomic datasets. Genetic maps are visualized and compared using the CMAP tool, and functional annotation from sequenced genomes is provided via an EnsEMBL-based genome browser. This framework facilitates navigation of the multiple biological domains involved in genetics and genomics research in a transparent manner within a single portal. We demonstrate the value of InterStoreDB as a tool for Brassica research. InterStoreDB is available from: http://www.interstoredb.org

6.
7.
Computational inference of novel therapeutic value for existing drugs, i.e., drug repositioning, offers great promise for faster and lower-risk drug development. Previous research has indicated that chemical structures, target proteins, and side effects can provide rich information for assessing drug similarity and, in turn, disease similarity. However, each data source is informative in its own way, and integrating them promises to reposition drugs more accurately. Here, we propose a new method for drug repositioning, PreDR (Predict Drug Repositioning), which integrates molecular structure, molecular activity, and phenotype data. Specifically, we characterize each drug by profiling it in chemical structure, target protein, and side-effect space, and define a kernel function to correlate drugs with diseases. We then train a support vector machine (SVM) to computationally predict novel drug-disease interactions. PreDR is validated on a well-established drug-disease network with 1,933 interactions among 593 drugs and 313 diseases. By cross-validation, we find that chemical structure, drug target, and side-effect information are all predictive of drug-disease relationships. More experimentally observed drug-disease interactions can be revealed by integrating these three data sources. Comparison with existing methods demonstrates that PreDR is competitive in both accuracy and coverage. Follow-up database search and pathway analysis indicate that our new predictions are worthy of further experimental validation. In particular, several novel predictions are supported by clinical trials databases, which shows the significant prospects of PreDR in future drug treatment. In conclusion, our new method, PreDR, can serve as a useful tool in drug discovery to efficiently identify novel drug-disease interactions. In addition, our heterogeneous data integration framework can be applied to other problems.

8.
Exome sequencing has been widely used in detecting pathogenic nonsynonymous single nucleotide variants (SNVs) for human inherited diseases. However, traditional statistical genetics methods are ineffective in analyzing exome sequencing data, owing to the large number of sequenced variants, the presence of a non-negligible fraction of pathogenic rare variants or de novo mutations, and the limited size of affected and normal populations. Indeed, the prevalent applications of exome sequencing call for an effective computational method for identifying causative nonsynonymous SNVs from a large number of sequenced variants. Here, we propose a bioinformatics approach called SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein-protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING calculates the statistical significance that an SNV is causative for a query disease and hence provides a means of prioritizing candidate SNVs. With a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose genetic bases are either partly known or completely unknown and effective for diseases with a variety of inheritance styles. In applications of our method to real exome sequencing data sets, we show the capability of SPRING in detecting causative de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua.edu.cn/spring.

9.
Comparative genome-scale analyses of protein-coding gene sequences are employed to examine evidence for whole-genome duplication and horizontal gene transfer. For this purpose, an orthogroup should be delineated to infer evolutionary history regarding each gene, and results of all orthogroup analyses need to be integrated to infer a genome-scale history. An orthogroup is a set of genes descended from a single gene in the last common ancestor of all species under consideration. However, such analyses confront several problems: 1) Analytical pipelines to infer all gene histories with methods comparing species and gene trees are not fully developed, and 2) without detailed analyses within orthogroups, evolutionary events of paralogous genes in the same orthogroup cannot be distinguished for genome-wide integration of results derived from multiple orthogroup analyses. Here I present an analytical pipeline, ORTHOSCOPE* (star), to infer evolutionary histories of animal/plant genes from genome-scale data. ORTHOSCOPE* estimates a tree for a specified gene, detects speciation/gene duplication events that occurred at nodes belonging to only one lineage leading to a species of interest, and then integrates results derived from gene trees estimated for all query genes in genome-wide data. Thus, ORTHOSCOPE* can be used to detect species nodes just after whole-genome duplications as a first step of comparative genomic analyses. Moreover, by examining the presence or absence of genes belonging to species lineages with dense taxon sampling available from the ORTHOSCOPE web version, ORTHOSCOPE* can detect genes lost in specific lineages and horizontal gene transfers. This pipeline is available at https://github.com/jun-inoue/ORTHOSCOPE_STAR.

10.
We present a tool for repetitive, marker-free, site-specific integration in Lactococcus lactis, in which a nonreplicating plasmid vector (pKV6) carrying a phage attachment site (attP) can be integrated into a bacterial attachment site (attB). The novelty of the tool described here is the inclusion of a minimal bacterial attachment site (attBmin), two mutated loxP sequences (lox66 and lox71) allowing for removal of undesirable vector elements (antibiotic resistance marker), and a counterselection marker (oroP) for selection of loxP recombination on the pKV6 vector. When transformed into L. lactis expressing the phage TP901-1 integrase, pKV6 integrates with high frequency into the chromosome, where it is flanked by attL and attR hybrid attachment sites. After expression of Cre recombinase from a plasmid that is not able to replicate in L. lactis, loxP recombinants can be selected for by using 5-fluoroorotic acid. The introduced attBmin site can subsequently be used for a second round of integration. To examine if attP recombination was specific to the attB site, integration was performed in strains containing the attB, attL, and attR sites or the attL and attR sites only. Only attP-attB recombination was observed when all three sites were present. In the absence of the attB site, a low frequency of attP-attL recombination was observed. To demonstrate the functionality of the system, the xylose utilization genes (xylABR and xylT) from L. lactis strain KF147 were integrated into the chromosome of L. lactis strain MG1363 in two steps.

11.
12.
Despite significant advances in invertebrate phylogenomics over the past decade, the higher-level phylogeny of Pycnogonida (sea spiders) remains elusive. Due to the inaccessibility of some small-bodied lineages, few phylogenetic studies have sampled all sea spider families. Previous efforts based on a handful of genes have yielded unstable tree topologies. Here, we inferred the relationships of 89 sea spider species using targeted capture of the mitochondrial genome, 56 conserved exons, 101 ultraconserved elements, and 3 nuclear ribosomal genes. We inferred molecular divergence times by integrating morphological data for fossil species to calibrate 15 nodes in the arthropod tree of life. This integration of data classes resolved the basal topology of sea spiders with high support. The enigmatic family Austrodecidae was resolved as the sister group to the remaining Pycnogonida and the small-bodied family Rhynchothoracidae as the sister group of the robust-bodied family Pycnogonidae. Molecular divergence time estimation recovered a basal divergence of crown group sea spiders in the Ordovician. Comparison of diversification dynamics with other marine invertebrate taxa that originated in the Paleozoic suggests that sea spiders and some crustacean groups exhibit resilience to mass extinction episodes, relative to mollusk and echinoderm lineages.

13.
14.
15.
Relating Amino Acid Sequence to Phenotype: Analysis of Peptide-Binding Data   Total citations: 1, self-citations: 0, citations by others: 1
We illustrate data analytic concerns that arise in the context of relating genotype, as represented by amino acid sequence, to phenotypes (outcomes). The present application examines whether peptides that bind to a particular major histocompatibility complex (MHC) class I molecule have characteristic amino acid sequences. However, the concerns identified and addressed are considerably more general. It is recognized that simple rules for predicting binding based solely on preferences for specific amino acids in certain (anchor) positions of the peptide's amino acid sequence are generally inadequate and that binding is potentially influenced by all sequence positions as well as between-position interactions. The desire to elucidate these more complex prediction rules has spawned various modeling attempts, the shortcomings of which provide motivation for the methods adopted here. Because of (i) this need to model between-position interactions, (ii) amino acids constituting a highly (20) multilevel unordered categorical covariate, and (iii) there frequently being numerous such covariates (i.e., positions) comprising the sequence, standard regression/classification techniques are problematic due to the proliferation of indicator variables required for encoding the sequence position covariates and attendant interactions. These difficulties have led to analyses based on (continuous) properties (e.g., molecular weights) of the amino acids. However, there is potential information loss in such an approach if the properties used are incomplete and/or do not capture the mechanism underlying association with the phenotype. Here we demonstrate that handling unordered categorical covariates with numerous levels and accompanying interactions can be done effectively using classification trees and recently devised bump-hunting methods. 
We further tackle the question of whether observed associations are attributable to amino acid properties as well as addressing the assessment and implications of between-position covariation.
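The "proliferation of indicator variables" mentioned in this abstract can be made concrete with a small counting sketch. It assumes reference-level (dummy) coding, so a 20-level covariate needs 19 indicators, and it assumes a peptide length of 9 purely for illustration; the abstract does not state the peptide length, and the function name `indicator_counts` is hypothetical.

```python
from math import comb


def indicator_counts(n_positions, n_levels=20):
    """Count dummy variables needed to encode a peptide for regression.

    Assumes reference-level coding: each position with `n_levels`
    unordered categories needs (n_levels - 1) indicator variables.
    Returns (main-effect indicators, pairwise-interaction indicators).
    """
    per_position = n_levels - 1
    main_effects = n_positions * per_position
    # Every pair of positions contributes one indicator per combination
    # of non-reference levels at the two positions.
    pairwise = comb(n_positions, 2) * per_position ** 2
    return main_effects, pairwise


if __name__ == "__main__":
    # Illustrative peptide length only; not taken from the abstract.
    main_effects, pairwise = indicator_counts(9)
    print(f"main effects: {main_effects}, pairwise interactions: {pairwise}")
```

Even with only two-way interactions, the number of indicators quickly exceeds the number of peptides typically available, which is one way to read the abstract's motivation for tree-based methods.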

16.
Since metabolome data are derived from the underlying metabolic network, reverse engineering of such data to recover the network topology is of wide interest. The Lyapunov equation constrains the link between data and network by coupling the covariance of the data with the strength of interactions (the Jacobian matrix). This equation, when expressed as a linear set of equations at steady state, provides a basis for inferring the network structure given the covariance matrix of the data. The sparse structure of metabolic networks points to reactions which are active based on minimal enzyme production, hinting at sparsity as a cellular objective. Therefore, for a given covariance matrix, we solved the Lyapunov equation to calculate the Jacobian matrix, simultaneously using minimization of the Euclidean norm of the residuals and maximization of sparsity (the number of zeros in the Jacobian matrix) as objective functions, to infer directed small-scale networks from three kingdoms of life (bacteria, fungi, mammals). The inference performance of the approach was found to be promising, with a false positive rate of zero and a true positive rate of almost one. The effect of missing data on the results was additionally analyzed, revealing superiority over similarity-based approaches, which infer undirected networks. Our findings suggest that the covariance of metabolome data implies an underlying network with the sparsest pattern. The theoretical analysis forms a framework for further investigation of sparsity-based inference of metabolic networks from real metabolome data.
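For orientation, the constraint described in this abstract is usually written in the following form. This is a sketch in commonly used notation rather than the paper's own: $J$ is the Jacobian matrix and $C$ the covariance matrix of the metabolite data, as in the abstract, while the fluctuation matrix $D$ and the residual $R(J)$ are symbols assumed here for illustration.

```latex
% Steady-state Lyapunov equation linking the Jacobian J to the covariance C;
% D (fluctuation matrix) is an assumed symbol, not named in the abstract.
\[
  J\,C + C\,J^{\top} = -2D
\]
% The two objectives named in the abstract, stated for the residual R(J):
\[
  R(J) = J\,C + C\,J^{\top} + 2D, \qquad
  \min_{J}\ \lVert R(J) \rVert_{2}
  \quad\text{and}\quad
  \max_{J}\ \bigl|\{(i,j) : J_{ij} = 0\}\bigr|
\]
```

Because $C$ is symmetric, the equation alone does not determine $J$ uniquely, which is why an additional criterion such as sparsity is needed to select a network.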

17.
18.
Advances in sequencing technologies are allowing genome-wide association studies at an ever-growing scale. The interpretation of these studies requires dealing with statistical and combinatorial challenges, owing to the multi-factorial nature of human diseases and the huge space of genomic markers that are being monitored. Recently, it was proposed that using protein–protein interaction network information could help in tackling these challenges by restricting attention to markers or combinations of markers that map to close proteins in the network. In this review, we survey techniques for integrating genomic variation data with network information to improve our understanding of complex diseases and reveal meaningful associations.

19.
The debate on the causal association between vitamin D status, measured as serum concentration of 25-hydroxyvitamin D (25[OH]D), and various health outcomes warrants investigation in large-scale health surveys. Measuring the 25(OH)D concentration for each participant is not always feasible, because of the logistics of blood collection and the costs of vitamin D testing. To address this problem, past research has used predicted 25(OH)D concentration, based on multivariable linear regression, as a proxy for unmeasured vitamin D status. We restate this approach in a mathematical framework, to deduce its possible pitfalls. Monte Carlo simulation and real data from the National Health and Nutrition Examination Survey 2005–06 are used to confirm the deductions. The results indicate that variables that are used in the prediction model (for 25[OH]D concentration) but not in the model for the health outcome (called instrumental variables) play an essential role in the identification of an effect. Such variables should be unrelated to the health outcome other than through vitamin D; otherwise the estimate of interest will be biased. The approach of predicted 25(OH)D concentration derived from multivariable linear regression may be valid. However, careful verification that the instrumental variables are unrelated to the health outcome is required.
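The role of the instrumental variable described in this abstract can be illustrated with a small Monte Carlo sketch. Everything below is a simplified assumption rather than the authors' setup: a single instrument `z`, an unmeasured confounder `u`, unit-variance Gaussian noise, and the hypothetical helper names `simulate`, `ols` and `predicted_exposure_estimate`. The exposure `x` stands in for 25(OH)D, and `gamma` controls whether the instrument affects the outcome directly.

```python
import random


def ols(x, y):
    """Simple least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(n, beta, gamma, seed):
    """Draw (z, x, y): x depends on z and a confounder u; y on x, u and gamma*z."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi, ui = rng.gauss(0, 1), rng.gauss(0, 1)
        xi = zi + ui + rng.gauss(0, 1)
        yi = beta * xi + gamma * zi + ui + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def predicted_exposure_estimate(z, x, y):
    """Stage 1: predict x from z. Stage 2: regress y on the predicted x."""
    slope, intercept = ols(z, x)
    x_hat = [intercept + slope * zi for zi in z]
    return ols(x_hat, y)[0]


if __name__ == "__main__":
    beta = 0.5  # assumed true effect of the exposure on the outcome
    z, x, y = simulate(50_000, beta, gamma=0.0, seed=1)
    print("naive regression:   ", round(ols(x, y)[0], 3))
    print("valid instrument:   ", round(predicted_exposure_estimate(z, x, y), 3))
    z, x, y = simulate(50_000, beta, gamma=0.5, seed=1)
    print("invalid instrument: ", round(predicted_exposure_estimate(z, x, y), 3))
```

Under these assumptions the large-sample limits are `beta + 1/3` for the naive regression, `beta` when the instrument is valid, and `beta + gamma` when the instrument also affects the outcome directly, which mirrors the bias the abstract warns about.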

20.
Laboratory inbred mouse models are a valuable resource to identify quantitative trait loci (QTL) for complex reproductive performance traits. Advances in mouse genomics and high-density single nucleotide polymorphism mapping have enabled genome-wide association studies to identify genes linked with specific phenotypes. Gene expression profiles of reproductive tissues also provide potentially useful information for identifying genes that play an important role. We have developed a highly fecund inbred strain, QSi5, with accompanying genotyping for comparative analysis of reproductive performance. Here we analyzed the QSi5 phenotype using a comparative analysis with fecundity data derived from 22 inbred strains of mice from the Mouse Phenome Project, and integration with published expression data from mouse ovary development. Using a haplotype association approach, 400 fecundity-associated regions (FDR < 0.05) with 499 underlying genes were identified. The most significant associations were located on Chromosomes 14, 8, and 6, and the genes underlying these regions were extracted. When these genes were analyzed for expression in an ovarian development profile (GSE6916), several distinctive co-expression patterns across each developmental stage were identified. The genetic analysis also refined 21 fecundity-associated intervals on Chromosomes 1, 6, 9, 13, and 17 that overlapped with previously reported reproductive performance QTL. The combined use of phenotypic and in silico data with an integrative genomic analysis provides a powerful tool for elucidating the molecular mechanisms underlying fecundity.
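The abstract reports regions at FDR < 0.05 without naming the procedure used. As an illustration only, the sketch below shows the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate; whether the authors used it is an assumption, and the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= alpha * k / m, then reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            largest_passing_rank = rank
    rejected = [False] * m
    for index in order[:largest_passing_rank]:
        rejected[index] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical per-region p-values, for illustration only.
    p_values = [0.039, 0.001, 0.2, 0.008]
    print(benjamini_hochberg(p_values))
```

Note that the rule rejects every hypothesis up to the largest passing rank, even if some smaller-ranked p-values individually exceed their own thresholds.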
