Similar articles
 20 similar articles found (search time: 15 ms)
1.

Background

A minor but significant fraction of samples subjected to next-generation sequencing methods are either mixed up or cross-contaminated. These events can lead to false or inconclusive results. We have therefore developed SASI-Seq, a process whereby a set of uniquely barcoded DNA fragments is added to samples destined for sequencing. From the final sequencing data, one can verify that all the reads derive from the original sample(s) and not from contaminants or other samples.

Results

By adding a mixture of three uniquely barcoded amplicons, of different sizes spanning the range of insert sizes one would normally use for Illumina sequencing, at a spike-in level of approximately 0.1%, we demonstrate that these fragments remain intimately associated with the sample. They can be detected following even the tightest size selection regimes or exome enrichment and can report the occurrence of sample mix-ups and cross-contamination. As a consequence of this work, we have designed a set of 384 eleven-base Illumina barcode sequences that are at least 5 changes apart from each other, allowing for single-error correction and very low levels of barcode misallocation due to sequencing error.
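The barcode property described above (every pair at least 5 changes apart, with single-error correction) can be illustrated with a minimal Python sketch. It assumes that "changes" means substitutions, so Hamming distance applies; the four barcodes and the function names are hypothetical and are not taken from the published set of 384.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read, barcodes, max_errors=1):
    """Return the single barcode within max_errors of the read, else None."""
    hits = [b for b in barcodes if hamming(read, b) <= max_errors]
    return hits[0] if len(hits) == 1 else None


# Hypothetical 11-base barcodes, for illustration only.
BARCODES = ["AAAAAAAAAAA", "CCCCCAAAAAA", "AAAAAACCCCC", "GGGGGGGGGGG"]

corrected = assign_barcode("AAAAAAAAAAT", BARCODES)   # one sequencing error
rejected = assign_barcode("CCAAAAAAAAA", BARCODES)    # too far from every barcode
```

With a minimum distance of 5, a read carrying one error is still at least 4 changes away from every other barcode, which is why correcting only single errors leaves a wide margin against misallocation.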

Conclusion

SASI-Seq is a simple, inexpensive and flexible tool that enables sample assurance, allows deconvolution of sample mix-ups and reports levels of cross-contamination between samples throughout NGS workflows.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-110) contains supplementary material, which is available to authorized users.

2.
MOTIVATION: Gene expression data have become an instrumental resource in describing the molecular state associated with various cellular phenotypes and responses to environmental perturbations. The utility of expression profiling has been demonstrated in partitioning clinical states, predicting the class of unknown samples and in assigning putative functional roles to previously uncharacterized genes based on profile similarity. However, gene expression profiling has had only limited success in identifying therapeutic targets. This is partly due to the fact that current methods based on fold-change focus only on single genes in isolation, and thus cannot convey causal information. In this paper, we present a technique for analysis of expression data in a graph-theoretic framework that relies on associations between genes. We describe the global organization of these networks and biological correlates of their structure. We go on to present a novel technique for the molecular characterization of disparate cellular states that adds a new dimension to the fold-based methods and conclude with an example application to a human medulloblastoma dataset. RESULTS: We have shown that expression networks generated from large model-organism expression datasets are scale-free and that the average clustering coefficient of these networks is several orders of magnitude higher than would be expected for similarly sized scale-free networks, suggesting an inherent hierarchical modularity similar to that previously identified in other biological networks. Furthermore, we have shown that these properties are robust with respect to the parameters of network construction. We have demonstrated an enrichment of genes having lethal knockout phenotypes in the high-degree (i.e. hub) nodes in networks generated from aggregate condition datasets; using process-focused Saccharomyces cerevisiae datasets we have demonstrated additional high-degree enrichments of condition-specific genes encoding proteins known to be involved in or important for the processes interrogated by the microarrays. These results demonstrate the utility of network analysis applied to expression data in identifying genes that are regulated in a state-specific manner. We conclude by showing that a sample application to a human clinical dataset prominently identified a known therapeutic target. AVAILABILITY: Software implementing the methods for network generation presented in this paper is available for academic use by request from the authors in the form of compiled Linux binary executables.
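The network quantities mentioned in this abstract (node degree and clustering coefficient) can be sketched in a few lines of Python. The abstract does not say how gene-gene associations were scored, so this sketch assumes an absolute Pearson correlation threshold; the gene names, expression values and the 0.9 cutoff are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def build_network(expr, threshold=0.9):
    """Link two genes when |r| meets the threshold (an assumed association rule)."""
    adj = {gene: set() for gene in expr}
    for g, h in combinations(expr, 2):
        if abs(pearson(expr[g], expr[h])) >= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Hypothetical expression profiles across four conditions.
EXPR = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, -1, -1, 1],
}
network = build_network(EXPR)
degree = {gene: len(nbrs) for gene, nbrs in network.items()}
```

Genes with the highest values in `degree` would be the hub candidates; a real analysis would also compare the observed clustering coefficient with that of randomized networks of the same size.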

3.
Cancer genomes frequently contain somatic copy number alterations (SCNA) that can significantly perturb the expression level of affected genes and thus disrupt pathways controlling normal growth. In melanoma, many studies have focussed on the copy number and gene expression levels of the BRAF, PTEN and MITF genes, but little has been done to identify new genes using these parameters at the genome-wide scale. Using karyotyping, SNP and CGH arrays, and RNA-seq, we have identified SCNA affecting gene expression ('SCNA-genes') in seven human metastatic melanoma cell lines. We showed that the combination of these techniques is useful to identify candidate genes potentially involved in tumorigenesis. Since few of these alterations were recurrent across our samples, we used a protein network-guided approach to determine whether any pathways were enriched in SCNA-genes in one or more samples. From this unbiased genome-wide analysis, we identified 28 significantly enriched pathway modules. Comparison with two large, independent melanoma SCNA datasets showed less than 10% overlap at the individual gene level, but network-guided analysis revealed 66% shared pathways, including all but three of the pathways identified in our data. Frequently altered pathways included WNT, cadherin signalling, angiogenesis and melanogenesis. Additionally, our results emphasize the potential of the EPHA3 and FRS2 gene products, involved in angiogenesis and migration, as possible therapeutic targets in melanoma. Our study demonstrates the utility of network-guided approaches, for both large and small datasets, to identify pathways recurrently perturbed in cancer.

4.
Class and biomarker discovery continue to be among the preeminent goals in gene microarray studies of cancer. We have developed a new data mining technique, which we call Binary State Pattern Clustering (BSPC), that is specifically adapted for these purposes, with cancer and other categorical datasets. BSPC is capable of uncovering statistically significant sample subclasses and associated marker genes in a completely unsupervised manner. This is accomplished through the application of a digital paradigm, where the expression level of each potential marker gene is treated as being representative of its discrete functional state. Multiple genes that divide samples into states along the same boundaries form a kind of gene-cluster that has an associated sample-cluster. BSPC is an extremely fast deterministic algorithm that scales well to large datasets. Here we describe results of its application to three publicly available oligonucleotide microarray datasets. Using an alpha-level of 0.05, clusters reproducing many of the known sample classifications were identified along with associated biomarkers. In addition, a number of simulations were conducted using shuffled versions of each of the original datasets, noise-added datasets, as well as completely artificial datasets. The robustness of BSPC was compared to that of three other publicly available clustering methods: ISIS, CTWC and SAMBA. The simulations demonstrate BSPC's substantially greater noise tolerance and confirm the accuracy of our calculations of statistical significance.

5.
6.
The major mental disorders, schizophrenia and bipolar disorder, are substantially heritable. Recent genomic studies have identified a small number of common and rare risk genes contributing to both disorders and support epidemiological evidence that genetic susceptibility overlaps between them. Prompted by the question of whether risk genes cluster in specific molecular pathways or implicate discrete mechanisms, we and others have developed hypothesis-free methods of investigating genome-wide association datasets at a pathway level. The application of our method to the 212 experimentally derived pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database identified significant association between the cell adhesion molecule (CAM) pathway and both schizophrenia and bipolar disorder susceptibility across three GWAS datasets. Interestingly, a similar approach applied to an autistic spectrum disorders (ASDs) sample identified a similar pathway and involved many of the same genes. Disruption of a number of these genes (including NRXN1, CNTNAP2 and CASK) is known to cause diverse neurodevelopmental brain disorder phenotypes including schizophrenia, autism, learning disability and specific language disorder. Taken together, these studies bring the CAM pathway sharply into focus for more comprehensive DNA sequencing to identify the critical genes, and investigate their relationships and interaction with environmental risk factors in the expression of many seemingly different neurodevelopmental disorders.

7.
The identification of genome-wide cis-regulatory modules (CRMs) and characterization of their associated epigenetic features are fundamental steps toward the understanding of gene regulatory networks. Although integrative analysis of available genome-wide information can provide new biological insights, the lack of novel methodologies has become a major bottleneck. Here, we present a comprehensive analysis tool called combinatorial CRM decoder (CCD), which utilizes the publicly available information to identify and characterize genome-wide CRMs in a species of interest. CCD first defines a set of the epigenetic features which is significantly associated with a set of known CRMs as a code called ‘trace code’, and subsequently uses the trace code to pinpoint putative CRMs throughout the genome. Using 61 genome-wide data sets obtained from 17 independent mouse studies, CCD successfully catalogued ∼12 600 CRMs (five distinct classes) including polycomb repressive complex 2 target sites as well as imprinting control regions. Interestingly, we discovered that ∼4% of the identified CRMs belong to at least two different classes named ‘multi-functional CRM’, suggesting their functional importance for regulating spatiotemporal gene expression. From these examples, we show that CCD can be applied to any potential genome-wide datasets and therefore will shed light on unveiling genome-wide CRMs in various species.

8.
To date, genome-wide association studies have identified thousands of statistically significant associations between genetic variants and phenotypes related to a myriad of traits and diseases. A key goal for human-genetics research is to translate these associations into functional mechanisms. Popular gene-set analysis tools, like MAGMA, map variants to genes they might affect, and then integrate genome-wide association study data (that is, variant-level associations for a phenotype) to score genes for association with a phenotype. Gene scores are subsequently used in competitive gene-set analyses to identify biological processes that are enriched for phenotype association. By default, variants are mapped to genes in their proximity. However, many variants that affect phenotypes are thought to act at regulatory elements, which can be hundreds of kilobases away from their target genes. Thus, we explored the idea of augmenting a proximity-based mapping scheme with publicly available datasets of regulatory interactions. We used MAGMA to analyze genome-wide association study data for ten different phenotypes, and evaluated the effects of augmentation by comparing numbers, and identities, of genes and gene sets detected as statistically significant between mappings. We detected several pitfalls and confounders of such “augmented analyses”, and introduced ways to control for them. Using these controls, we demonstrated that augmentation with datasets of regulatory interactions only occasionally strengthened the enrichment for phenotype association amongst (biologically relevant) gene sets for different phenotypes. Still, in such cases, genes and regulatory elements responsible for the improvement could be pinpointed. For instance, using brain regulatory interactions for augmentation, we were able to implicate two acetylcholine receptor subunits involved in post-synaptic chemical transmission, namely CHRNB2 and CHRNE, in schizophrenia. Collectively, our study presents a critical approach for integrating regulatory interactions into gene-set analyses for genome-wide association study data, by introducing various controls to distinguish genuine results from spurious discoveries.
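The difference between proximity-based and augmented variant-to-gene mapping can be made concrete with a small Python sketch. It is not MAGMA's implementation: the function name, the data layout, the coordinates and the 10 kb window are all hypothetical, chosen only to show how a distal regulatory interaction adds a gene that proximity alone would miss.

```python
def map_variants(variants, genes, window=0, interactions=()):
    """Map each variant to nearby genes, optionally adding regulatory targets.

    variants:     dict of variant id -> (chromosome, position)
    genes:        dict of gene name -> (chromosome, start, end)
    interactions: iterable of (chromosome, start, end, target gene), i.e. a
                  regulatory element interval and the gene it is linked to
    """
    mapping = {variant: set() for variant in variants}
    for variant, (chrom, pos) in variants.items():
        # Proximity rule: variant lies within the gene body plus a flanking window.
        for gene, (gchrom, start, end) in genes.items():
            if chrom == gchrom and start - window <= pos <= end + window:
                mapping[variant].add(gene)
        # Augmentation: variant lies inside a regulatory element linked to a gene.
        for rchrom, rstart, rend, gene in interactions:
            if chrom == rchrom and rstart <= pos <= rend:
                mapping[variant].add(gene)
    return mapping


# Hypothetical coordinates, for illustration only.
VARIANTS = {"v1": ("1", 1_500), "v2": ("1", 900_000)}
GENES = {"GENE_A": ("1", 1_000, 2_000), "GENE_B": ("1", 500_000, 510_000)}
ELEMENTS = [("1", 899_000, 901_000, "GENE_B")]

proximity_only = map_variants(VARIANTS, GENES, window=10_000)
augmented = map_variants(VARIANTS, GENES, window=10_000, interactions=ELEMENTS)
```

Because augmentation changes how many variants are assigned to each gene, comparing results between the two mappings needs the kind of controls the abstract describes; the sketch covers only the mapping step.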

9.
10.
Multi-locus profiles of genetic risk, so-called “genetic risk scores,” can be used to translate discoveries from genome-wide association studies into tools for population health research. We developed a genetic risk score for obesity from results of 16 published genome-wide association studies of obesity phenotypes in European-descent samples. We then evaluated this genetic risk score using data from the Atherosclerosis Risk in Communities (ARIC) cohort GWAS sample (N = 10,745, 55% female, 77% white, 23% African American). Our 32-locus GRS was a statistically significant predictor of body mass index (BMI) and obesity among ARIC whites [for BMI, r = 0.13, p < 1 × 10⁻³⁰; for obesity, area under the receiver operating characteristic curve (AUC) = 0.57 (95% CI 0.55–0.58)]. The GRS predicted differences in obesity risk net of demographic, geographic, and socioeconomic information. The GRS performed less well among African Americans. The genetic risk score we derived from GWAS provides a molecular measurement of genetic predisposition to elevated BMI and obesity. [Supplemental materials are available for this article. Go to the publisher's online edition of Biodemography and Social Biology for the following resource: Supplement to Development & Evaluation of a Genetic Risk Score for Obesity.]
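A genetic risk score of this kind is commonly computed as a sum of risk-allele counts across loci, optionally weighted by per-locus effect sizes. The Python sketch below shows both variants; the abstract does not state which weighting the authors used, and the locus identifiers, counts and weights are hypothetical.

```python
def genetic_risk_score(risk_allele_counts, weights=None):
    """Sum risk-allele counts (0, 1 or 2 per locus), optionally weighted.

    risk_allele_counts: dict of locus id -> number of risk alleles carried
    weights:            optional dict of locus id -> effect size (assumed input)
    """
    for locus, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"invalid allele count at {locus}: {count}")
    if weights is None:
        return sum(risk_allele_counts.values())
    return sum(weights[locus] * count for locus, count in risk_allele_counts.items())


# Hypothetical three-locus genotype and effect sizes.
GENOTYPE = {"locus_a": 2, "locus_b": 1, "locus_c": 0}
WEIGHTS = {"locus_a": 0.5, "locus_b": 0.25, "locus_c": 1.0}

unweighted = genetic_risk_score(GENOTYPE)
weighted = genetic_risk_score(GENOTYPE, WEIGHTS)
```

For a 32-locus score the unweighted version ranges from 0 to 64; the score for each person can then be correlated with BMI or used as a predictor of obesity status.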

11.
Psoriasis, an immune-mediated, inflammatory disease of the skin and joints, provides an ideal system for expression quantitative trait locus (eQTL) analysis, because it has a strong genetic basis and disease-relevant tissue (skin) is readily accessible. To better understand the role of genetic variants regulating cutaneous gene expression, we identified 841 cis-acting eQTLs using RNA extracted from skin biopsies of 53 psoriatic individuals and 57 healthy controls. We found substantial overlap between cis-eQTLs of normal control, uninvolved psoriatic, and lesional psoriatic skin. Consistent with recent studies and with the idea that control of gene expression can mediate relationships between genetic variants and disease risk, we found that eQTL SNPs are more likely to be associated with psoriasis than are randomly selected SNPs. To explore the tissue specificity of these eQTLs and hence to quantify the benefits of studying eQTLs in different tissues, we developed a refined statistical method for estimating eQTL overlap and used it to compare skin eQTLs to a published panel of lymphoblastoid cell line (LCL) eQTLs. Our method accounts for the fact that most eQTL studies are likely to miss some true eQTLs as a result of power limitations and shows that ~70% of cis-eQTLs in LCLs are shared with skin, as compared with the naive estimate of < 50% sharing. Our results provide a useful method for estimating the overlap between various eQTL studies and provide a catalog of cis-eQTLs in skin that can facilitate efforts to understand the functional impact of identified susceptibility variants on psoriasis and other skin traits.

12.
Xu W, Wang M, Zhang X, Wang L, Feng H. Bioinformation 2008, 2(7):301-303
Gene selection aims to detect the genes that are most significantly expressed under different conditions in expression data. The current challenge in gene selection is the comparison of a large number of genes with limited patient samples. Thus it is not a trivial task for simple statistical analysis. Various statistical measurements are adopted by filter methods applied in gene selection studies. Their ability to discriminate phenotypes is crucial in classification and selection. Here we describe the standard deviation error distribution (SDED) method for gene selection. It utilizes within-class and among-class variations in gene expression data. We tested the method using 4 leukemia datasets available in the public domain. The method was compared with the GS2 and CHO methods. The prediction accuracies of SDED are better than those of both GS2 and CHO for different datasets: they are 0.8-4.2% and 1.6-8.4% higher than those of GS2 and CHO, respectively. The related OMIM annotations and KEGG pathway analyses verified that SDED can pick out 4.0% and 6.1% more genes with biological significance than GS2 and CHO, respectively.

13.
Cluster-Rasch models for microarray gene expression data   (total citations: 1; self-citations: 0; citations by others: 1)
Li H, Hong F. Genome Biology 2001, 2(8):research0031.1-research003113

Background

We propose two different formulations of the Rasch statistical models to the problem of relating gene expression profiles to the phenotypes. One formulation allows us to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes; this model can also be used for future prediction. The other formulation provides an alternative way of identifying genes that are over- or underexpressed from their expression levels in tissue or cell samples of a given tissue or cell type.

Results

We illustrate the methods on available datasets of a classification of acute leukemias and of 60 cancer cell lines. For tumor classification, the results are comparable to those previously obtained. For the cancer cell lines dataset, we found four clusters of genes that are related to drug response for many of the 90 drugs that we considered. In addition, for each type of cell line, we identified genes that are over- or underexpressed relative to other genes.

Conclusions

The cluster-Rasch model provides a probabilistic model for describing gene expression patterns across samples and can be used to relate gene expression profiles to phenotypes.

14.
15.
Late-onset Alzheimer’s disease (LOAD) is the most common type of dementia causing irreversible brain damage to the elderly and presents a major public health challenge. Clinical research and genome-wide association studies have suggested a potential contribution of the endocytic pathway to AD, with an emphasis on common loci. However, the contribution of rare variants in this pathway to AD has not been thoroughly investigated. In this study, we focused on the effect of rare variants on AD by first applying a rare-variant gene-set burden analysis using genes in the endocytic pathway on over 3,000 individuals with European ancestry from three large whole-genome sequencing (WGS) studies. We identified significant associations of rare-variant burden within the endocytic pathway with AD, which were successfully replicated in independent datasets. We further demonstrated that this endocytic rare-variant enrichment is associated with neurofibrillary tangles (NFTs) and age-related phenotypes, increasing the risk of more severe brain damage, earlier age-at-onset, and earlier age-of-death. Next, by aggregating rare variants within each gene, we sought to identify single endocytic genes associated with AD and NFTs. Careful examination using NFTs revealed one significantly associated gene, ANKRD13D. To identify functional associations, we integrated bulk RNA-Seq data from over 600 brain tissues and found two endocytic expression genes (eGenes), HLA-A and SLC26A7, that displayed significant influences on their gene expression. Differential expression of these three identified genes between AD patients and controls was further examined by incorporating scRNA-Seq data from 48 post-mortem brain samples, which demonstrated distinct expression patterns across cell types. Taken together, our results demonstrated a strong rare-variant effect in the endocytic pathway on AD risk and progression and a functional effect of gene expression alteration at both bulk and single-cell resolution, which may bring more insight and serve as valuable resources for future AD genetic studies, clinical research, and therapeutic targeting.

16.
17.
It is an assumption of large, population-based datasets that samples are annotated accurately whether they correspond to known relationships or unrelated individuals. These annotations are key for a broad range of genetics applications. While many methods are available to assess relatedness that involve estimates of identity-by-descent (IBD) and/or identity-by-state (IBS) allele-sharing proportions, we developed a novel approach that estimates IBD0, 1, and 2 based on observed IBS within windows. When combined with genome-wide IBS information, it provides an intuitive and practical graphical approach with the capacity to analyze datasets with thousands of samples without prior information about relatedness between individuals or haplotypes. We applied the method to a commonly used Human Variation Panel consisting of 400 nominally unrelated individuals. Surprisingly, we identified identical, parent-child, and full-sibling relationships and reconstructed pedigrees. In two instances non-sibling pairs of individuals in these pedigrees had unexpected IBD2 levels, as well as multiple regions of homozygosity, implying inbreeding. This combined method allowed us to distinguish related individuals from those having atypical heterozygosity rates and determine which individuals were outliers with respect to their designated population. Additionally, it becomes increasingly difficult to identify distant relatedness using genome-wide IBS methods alone. However, our IBD method further identified distant relatedness between individuals within populations, supported by the presence of megabase-scale regions lacking IBS0 across individual chromosomes. We benchmarked our approach against the hidden Markov model of a leading software package (PLINK), showing improved calling of distantly related individuals, and we validated it using a known pedigree from a clinical study. The application of this approach could improve genome-wide association, linkage, heterozygosity, and other population genomics studies that rely on SNP genotype data.
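The observed IBS states that this approach starts from can be counted per window with a short Python sketch. It covers only the IBS tally, not the authors' IBD estimation; it assumes unphased biallelic SNP genotypes coded as alternate-allele counts (0, 1 or 2), and the genotype vectors and window size are hypothetical.

```python
from collections import Counter


def ibs_state(g1: int, g2: int) -> int:
    """Alleles shared identical-by-state at one SNP (0, 1 or 2)."""
    return 2 - abs(g1 - g2)


def windowed_ibs(a, b, window):
    """Return (IBS0, IBS1, IBS2) counts for consecutive windows of SNPs."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have equal length")
    counts = []
    for start in range(0, len(a), window):
        pairs = zip(a[start:start + window], b[start:start + window])
        tally = Counter(ibs_state(x, y) for x, y in pairs)
        counts.append((tally[0], tally[1], tally[2]))
    return counts


# Hypothetical genotypes for two individuals at six SNPs.
PERSON_A = [0, 1, 2, 2, 0, 1]
PERSON_B = [2, 1, 2, 1, 0, 0]

per_window = windowed_ibs(PERSON_A, PERSON_B, window=3)
# Windows with no IBS0 observations are candidates for shared segments.
no_ibs0 = [i for i, (ibs0, _, _) in enumerate(per_window) if ibs0 == 0]
```

A long run of windows without any IBS0 site is the kind of signal the abstract cites as support for distant relatedness, since two individuals who share at least one haplotype across a region cannot be opposite homozygotes there.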

18.
19.
20.