Similar articles
 20 similar articles found; search took 437 ms
1.
Siu H, Jin L, Xiong M. PLoS ONE 2012, 7(1): e29901
The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low-coverage pilot dataset in the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were the LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and that LLE provides a powerful tool for population structure analysis.

2.
3.
The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.

This study presents the first uniformly assembled, comprehensively described and searchable dataset of 661,405 bacterial genomes; this resource will empower more scientists to harness the multitude of data in public sequencing archives, but also reveals the biased composition of these archives, with 90% of the data originating from just 20 species.
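
The abstract above mentions MinHash indices for estimating genomic distance. As a rough illustration of the general idea only (not the resource's actual indexing code), the sketch below estimates the Jaccard similarity of two sequences from their k-mer sets. The function names, the k-mer length of 4, the 64 hash functions and the use of salted BLAKE2b are all assumptions chosen for this example.

```python
import hashlib


def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Seeded 64-bit hash; salted BLAKE2b is an illustrative choice only.
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=4, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)
```

Two sequences that share most of their k-mers agree at most sketch positions, so a fixed-size sketch can stand in for a full genome when comparing many genomes at once.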

4.
On the evolution of multigene families   Total citations: 1, self-citations: 0, citations by others: 1
Multigene families are classified into three groups: small families, as exemplified by hemoglobin genes of mammals; middle-sized multigene families, by genes of mammalian histocompatibility antigens; and large multigene families, by variable region genes of immunoglobulins. Facts and theories on these evolving multigene families are reviewed, with special reference to the population genetics of their concerted evolution. It is shown that multigene families are evolving under continued occurrence of unequal (but homologous) crossing-over and gene conversion, and that mechanisms for maintaining genetic variability are totally different from the conventional models of population genetics. Thus, in view of the widespread occurrence of multigene families in genomes of higher organisms, the evolutionary theory based mainly on change of gene frequency at each locus would appear to need considerable revision.

5.
There is an abundance of malaria genetic data being collected from the field, yet using these data to understand the drivers of regional epidemiology remains a challenge. A key issue is the lack of models that relate parasite genetic diversity to epidemiological parameters. Classical models in population genetics characterize changes in genetic diversity in relation to demographic parameters, but fail to account for the unique features of the malaria life cycle. In contrast, epidemiological models, such as the Ross-Macdonald model, capture malaria transmission dynamics but do not consider genetics. Here, we have developed an integrated model encompassing both parasite evolution and regional epidemiology. We achieve this by combining the Ross-Macdonald model with an intra-host continuous-time Moran model, thus explicitly representing the evolution of individual parasite genomes in a traditional epidemiological framework. We implement the model as a stochastic simulation and use it to explore relationships between measures of parasite genetic diversity and parasite prevalence, a widely used metric of transmission intensity. First, we explore how varying parasite prevalence influences genetic diversity at equilibrium. We find that multiple genetic diversity statistics are correlated with prevalence, but the strength of the relationships depends on whether variation in prevalence is driven by host- or vector-related factors. Next, we assess the responsiveness of a variety of statistics to malaria control interventions, finding that those related to mixed infections respond quickly (∼months) whereas other statistics, such as nucleotide diversity, may take decades to respond. These findings provide insights into the opportunities and challenges associated with using genetic data to monitor malaria epidemiology.
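
For readers unfamiliar with the Moran model named in this abstract, the following is a minimal sketch of its simplest form: a neutral, discrete-event version in which one randomly chosen individual reproduces and one randomly chosen individual dies, keeping the population size constant. It is not the authors' integrated model, which is continuous-time, intra-host and coupled to Ross-Macdonald transmission dynamics. The function names and the `max_steps` cap are hypothetical.

```python
import random


def moran_step(pop, rng):
    """One Moran event: a random individual reproduces, a random one is replaced."""
    parent = rng.choice(pop)
    dead = rng.randrange(len(pop))
    pop[dead] = parent  # population size stays constant


def simulate_until_fixation(pop, rng, max_steps=100_000):
    """Apply Moran events until one allele remains or max_steps is reached."""
    pop = list(pop)  # work on a copy so the caller's list is untouched
    for step in range(max_steps):
        if len(set(pop)) == 1:
            return pop, step
        moran_step(pop, rng)
    return pop, max_steps


# Usage: ten parasite genomes carrying alleles "A" or "B".
final_pop, steps = simulate_until_fixation(["A"] * 5 + ["B"] * 5, random.Random(1))
```

Genetic drift alone removes variation in this sketch; the full model in the abstract adds transmission between hosts and vectors, which is what ties diversity statistics to prevalence.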

6.
7.
Associating phenotypic traits and quantitative trait loci (QTL) to causative regions of the underlying genome is a key goal in agricultural research. InterStoreDB is a suite of integrated databases designed to assist in this process. The individual databases are species independent and generic in design, providing access to curated datasets relating to plant populations, phenotypic traits, genetic maps, marker loci and QTL, with links to functional gene annotation and genomic sequence data. Each component database provides access to associated metadata, including data provenance and parameters used in analyses, thus providing users with information to evaluate the relative worth of any associations identified. The databases include CropStoreDB, for management of population, genetic map, QTL and trait measurement data, SeqStoreDB for sequence-related data and AlignStoreDB, which stores sequence alignment information, and allows navigation between genetic and genomic datasets. Genetic maps are visualized and compared using the CMAP tool, and functional annotation from sequenced genomes is provided via an EnsEMBL-based genome browser. This framework facilitates navigation of the multiple biological domains involved in genetics and genomics research in a transparent manner within a single portal. We demonstrate the value of InterStoreDB as a tool for Brassica research. InterStoreDB is available from: http://www.interstoredb.org

8.
In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e., microbial core and gene core, correspondingly). In the proposed method, the identification of major functionality groups is achieved by generative topic modeling, which is able to extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model considers each sample as a “document,” which has a mixture of functional groups, while each functional group (also known as a “latent topic”) is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data will uncover the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of “N-mer” features (DNA subreads obtained by composition-based approaches). The model considers each genome as a mixture of latent genetic patterns (latent topics), while each functional pattern is a weighted mixture of the “N-mer” features; thus, the existence of core genomes can be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
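
The generative view described in this abstract (a sample is a mixture of functional groups, and each functional group is a weighted mixture of species) can be sketched as a two-stage sampling procedure. The code below only illustrates that generative direction; it does not perform the inference step the paper relies on, and the function name, topic names and weights are all made up for the example.

```python
import random


def sample_taxon_counts(topic_mix, topics, n_reads, rng):
    """Draw n_reads taxa: pick a topic from the sample's mixture,
    then pick a taxon from that topic's species weights."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topic_names]
    counts = {}
    for _ in range(n_reads):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        taxa = list(topics[topic])
        taxon = rng.choices(taxa, weights=[topics[topic][x] for x in taxa])[0]
        counts[taxon] = counts.get(taxon, 0) + 1
    return counts


# Hypothetical functional groups ("latent topics") over species.
topics = {
    "group_1": {"species_a": 0.7, "species_b": 0.3},
    "group_2": {"species_c": 0.5, "species_d": 0.5},
}
# One sample ("document") that is mostly group_1.
sample_mix = {"group_1": 0.8, "group_2": 0.2}
counts = sample_taxon_counts(sample_mix, topics, 1000, random.Random(0))
```

Fitting a topic model runs this process in reverse: given only the observed taxon counts per sample, it estimates the mixtures and the per-topic species weights.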

9.
The major online single nucleotide polymorphism (SNP) databases freely available as research tools for genetic analysis are explained, reviewed, and compared. An outline is given of the search strategies that can be used with the most extensive current SNP databases, National Centre for Biotechnology Information (NCBI) dbSNP and HapMap, to help the user secure the most appropriate data for the needs of clinical genetics and population genetics research. A range of online tools that can be useful in designing SNP genotyping assays are also detailed.

10.
Although there has been great success in identifying disease genes for simple, monogenic Mendelian traits, deciphering the genetic mechanisms involved in complex diseases remains challenging. One major approach is to identify configurations of interacting factors such as single nucleotide polymorphisms (SNPs) that confer susceptibility to disease. Traditional methods, such as the multiple dimensional reduction method and the combinatorial partitioning method, provide good tools to decipher such interactions amid a disease population with a single genetic cause. However, these traditional methods have not managed to resolve the issue of genetic heterogeneity, which is believed to be a very common phenomenon in complex diseases. There is rarely prior knowledge of the genetic heterogeneity of a disease, and traditional methods based on estimation over the entire population are unlikely to succeed in the presence of heterogeneity. We present a novel Boosted Generative Modeling (BGM) approach for modeling the structure of the interactions leading to diseases in the context of genetic heterogeneity. Our BGM method bridges the ensemble and generative modeling approaches to genetic association studies under a case-control design. Generative modeling is employed to model the interaction network configuration and the causal relationships, while boosting is used to address the genetic heterogeneity problem. We apply our method to simulated data of complex diseases. The results indicate that our method is capable of modeling the structure of interaction networks among disease-susceptible loci and of addressing genetic heterogeneity issues where traditional methods, such as the multiple dimensional reduction method, fail to apply. Our BGM method provides an exploratory tool that identifies the variables (e.g., disease-susceptible loci) that are likely to correlate and contribute to the disease.

11.
Mitochondria are subcellular organelles in which oxidative phosphorylation and other important biochemical functions take place within the cell. Within these organelles is a mitochondrial (mt) genome, which is distinct from, but cooperates with, the nuclear genome of the cell. Studying mt genomes has implications for various fundamental areas, including mt biochemistry, physiology and molecular biology. Importantly, the mt genome is a rich source of markers for population genetic and systematic studies. To date, more than 696 mt genomes have been sequenced for a range of metazoan organisms. However, few of these are from parasitic nematodes, despite their socioeconomic importance and the need for fundamental investigations into areas such as nematode genetics, systematics and ecology. In this article, we review knowledge and recent progress in mt genomics of parasitic nematodes, summarize applications of mt gene markers to the study of population genetics, systematics, epidemiology and evolution of key nematodes, and highlight some prospects and opportunities for future research.

12.
Gene flow and recombination in admixed populations produce genomes that are mosaic combinations of chromosome segments inherited from different source populations, that is, chromosome segments with different genetic ancestries. The statistical problem of estimating genetic ancestry from DNA sequence data has been widely studied, and analyses of genetic ancestry have facilitated research in molecular ecology and ecological genetics. In this review, we describe and compare different model‐based statistical methods used to infer genetic ancestry. We describe the conceptual and mathematical structure of these models and highlight some of their key differences and shared features. We then discuss recent empirical studies that use estimates of genetic ancestry to analyse population histories, the nature and genetic basis of species boundaries, and the genetic architecture of traits. These diverse studies demonstrate the breadth of applications that rely on genetic ancestry estimates and typify the genomics‐enabled research that is becoming increasingly common in molecular ecology. We conclude by identifying key research areas where future studies might further advance this field.

13.

Background  

Castor bean (Ricinus communis) is an agricultural crop and garden ornamental that is widely cultivated and has been introduced worldwide. Understanding population structure and the distribution of castor bean cultivars has been challenging because of limited genetic variability. We analyzed the population genetics of R. communis in a worldwide collection of plants from germplasm and from naturalized populations in Florida, U.S. To assess genetic diversity we conducted survey sequencing of the genomes of seven diverse cultivars and compared the data to a reference genome assembly of a widespread cultivar (Hale). We determined the population genetic structure of 676 samples using single nucleotide polymorphisms (SNPs) at 48 loci.

14.
Decoding models, such as those underlying multivariate classification algorithms, have been increasingly used to infer cognitive or clinical brain states from measures of brain activity obtained by functional magnetic resonance imaging (fMRI). The practicality of current classifiers, however, is restricted by two major challenges. First, due to the high data dimensionality and low sample size, algorithms struggle to separate informative from uninformative features, resulting in poor generalization performance. Second, popular discriminative methods such as support vector machines (SVMs) rarely afford mechanistic interpretability. In this paper, we address these issues by proposing a novel generative-embedding approach that incorporates neurobiologically interpretable generative models into discriminative classifiers. Our approach extends previous work on trial-by-trial classification for electrophysiological recordings to subject-by-subject classification for fMRI and offers two key advantages over conventional methods: it may provide more accurate predictions by exploiting discriminative information encoded in 'hidden' physiological quantities such as synaptic connection strengths; and it affords mechanistic interpretability of clinical classifications. Here, we introduce generative embedding for fMRI using a combination of dynamic causal models (DCMs) and SVMs. We propose a general procedure of DCM-based generative embedding for subject-wise classification, provide a concrete implementation, and suggest good-practice guidelines for unbiased application of generative embedding in the context of fMRI. We illustrate the utility of our approach by a clinical example in which we classify moderately aphasic patients and healthy controls using a DCM of thalamo-temporal regions during speech processing. 
Generative embedding achieves a near-perfect balanced classification accuracy of 98% and significantly outperforms conventional activation-based and correlation-based methods. This example demonstrates how disease states can be detected with very high accuracy and, at the same time, be interpreted mechanistically in terms of abnormalities in connectivity. We envisage that future applications of generative embedding may provide crucial advances in dissecting spectrum disorders into physiologically more well-defined subgroups.

15.
This brief review aims to illustrate how theory can aid in our understanding of the factors that determine the regulation and stability of parasite abundance, and influence the impact of control measures. The current generation of models is obviously crude and ignores much biological detail, but these models are often able to capture qualitative trends observed in real communities. As such, their analysis and investigation can provide important conceptual insights or, in some circumstances, they can be of value in a predictive role (e.g. the impact of chemotherapy in human communities). This field of research, however, is still in its infancy and much remains to be done to improve biological realism in model formulation and to extend the methods of analysis and interpretation. In the latter context, for example, the current analytical methods for the study of the dynamical properties of non-linear systems of differential and partial differential equations are inadequate for many areas of biological application. Future advances in applied mathematics will, therefore, be of great importance. As far as biological realism is concerned, three areas require urgent attention. The first concerns the treatment of heterogeneity in worm loads within host communities. The generative factors of parasite aggregation are many and varied and little is understood at present of how these processes influence a parasite's population response to perturbation induced, for example, by control measures. Stochastic models are required to examine this problem but current work in this area is very limited. The second area concerns immunity to parasitic infection. Few models take account of the substantive body of experimental work which attests to the significance of host responses (both specific and non-specific) to parasite invasion as determinants of parasite abundance within both an individual host and in the community at large. 
A start has been made in the investigation of models which mimic acquired immunity and immunological “memory” but much refinement and elaboration is needed (Anderson & May, 1985a). In particular, the next generation of models should address the details of antibody-antigen and cell-antigen interactions in individual hosts as well as the broader questions concerning herd immunity. Heterogeneity in immunological responsiveness as a consequence of host nutritional status or genetic background must also be considered. The final topic is that of population genetics. Geneticists invariably consider changes in gene frequencies without reference to changes in parasite or host abundance, ecologists and epidemiologists have tended to study changes in abundance without reference to changes in genetic structure, while immunologists have focused on the mechanisms of resistance to parasitic infection without reference to population or genetic changes. It is becoming increasingly apparent that host genetic background and genetic heterogeneity within parasite populations (e.g. the malarial parasites of man) are important determinants of observed population events (Medley & Anderson, 1985). Future research must attempt to meld the areas of genetics, population dynamics and immunology. Such an integration presents a fascinating challenge.

16.
There have been many recent developments in malaria genetics, with much information coming from the field of drug resistance. The findings of classical genetic crossing experiments, together with data from sequencing Plasmodium genomes and the powerful new tools of population genetics, are beginning to explain the ingenuity of Plasmodium at evading the control measures used against it so far.

17.
Alterations to the genetic code – codon reassignments – have occurred many times in life’s history, despite the fact that genomes are coadapted to their genetic codes and therefore alterations are likely to be maladaptive. A potential mechanism for adaptive codon reassignment, which could trigger either a temporary period of codon ambiguity or a permanent genetic code change, is the reactivation of a pseudogene by a nonsense suppressor mutant transfer RNA. I examine the population genetics of each stage of this process and find that pseudogene rescue is plausible and also readily explains some features of extant variability in genetic codes.

18.
Demographic processes directly affect patterns of genetic variation within contemporary populations as well as future generations, allowing for demographic inference from patterns of both present-day and past genetic variation. Advances in laboratory procedures, sequencing and genotyping technologies in the past decades have resulted in massive increases in high-quality genome-wide genetic data from present-day populations and allowed retrieval of genetic data from archaeological material, also known as ancient DNA. This has resulted in an explosion of work exploring past changes in population size, structure, continuity and movement. However, as genetic processes are highly stochastic, patterns of genetic variation only indirectly reflect demographic histories. As a result, past demographic processes need to be reconstructed using an inferential approach. This usually involves comparing observed patterns of variation with model expectations from theoretical population genetics. A large number of approaches have been developed based on different population genetic models that each come with assumptions about the data and underlying demography. In this article I review some of the key models and assumptions underlying the most commonly used approaches for past demographic inference and their consequences for our ability to link the inferred demographic processes to the archaeological and climate records. This article is part of the theme issue ‘Cross-disciplinary approaches to prehistoric demography’.

19.
Limitations of next-generation genome sequence assembly   Total citations: 1, self-citations: 0, citations by others: 1
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.

20.

Background  

Mutation rate (μ) per generation per locus is an important parameter in the models of population genetics. Studies of mutation rate and its variation help to elucidate the extent and distribution of genetic variation, to infer evolutionary relationships among closely related species, and to better understand the genetic variation of genomes. However, patterns of rate variation of microsatellite loci are still poorly understood in plant species. Furthermore, how their mutation rates vary in di-, tri-, and tetra-nucleotide repeats within species is largely uninvestigated across related plant genomes.
