期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Epitopemap: a web application for integrated whole proteome epitope prediction

Damien Farrell Stephen V Gordon 《BMC bioinformatics》2015,16(1)

Background

Predictions of MHC binding affinity are commonly used in immunoinformatics for T cell epitope prediction. There are multiple available methods, some of which provide web access. However there is currently no convenient way to access the results from multiple methods at the same time or to execute predictions for an entire proteome at once.

Results

We designed a web application that allows integration of multiple epitope prediction methods for any number of proteins in a genome. The tool is a front-end for various freely available methods. Features include visualisation of results from multiple predictors within proteins in one plot, genome-wide analysis and estimates of epitope conservation.

Conclusions

We present a self contained web application, Epitopemap, for calculating and viewing epitope predictions with multiple methods. The tool is easy to use and will assist in computational screening of viral or bacterial genomes.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0659-0) contains supplementary material, which is available to authorized users. 相似文献

2.

The importance of phenotypic data analysis for genomic prediction - a case study comparing different spatial models in rye

Angela-Maria Bernal-Vasquez Jens M?hring Malthe Schmidt Manfred Sch?nleben Chris-Carolin Sch?n Hans-Peter Piepho 《BMC genomics》2014,15(1)

Background

Genomic prediction is becoming a daily tool for plant breeders. It makes use of genotypic information to make predictions used for selection decisions. The accuracy of the predictions depends on the number of genotypes used in the calibration; hence, there is a need of combining data across years. A proper phenotypic analysis is a crucial prerequisite for accurate calibration of genomic prediction procedures. We compared stage-wise approaches to analyse a real dataset of a multi-environment trial (MET) in rye, which was connected between years only through one check, and used different spatial models to obtain better estimates, and thus, improved predictive abilities for genomic prediction. The aims of this study were to assess the advantage of using spatial models for the predictive abilities of genomic prediction, to identify suitable procedures to analyse a MET weakly connected across years using different stage-wise approaches, and to explore genomic prediction as a tool for selection of models for phenotypic data analysis.

Results

Using complex spatial models did not significantly improve the predictive ability of genomic prediction, but using row and column effects yielded the highest predictive abilities of all models. In the case of MET poorly connected between years, analysing each year separately and fitting year as a fixed effect in the genomic prediction stage yielded the most realistic predictive abilities. Predictive abilities can also be used to select models for phenotypic data analysis. The trend of the predictive abilities was not the same as the traditionally used Akaike information criterion, but favoured in the end the same models.

Conclusions

Making predictions using weakly linked datasets is of utmost interest for plant breeders. We provide an example with suggestions on how to handle such cases. Rather than relying on checks we show how to use year means across all entries for integrating data across years. It is further shown that fitting of row and column effects captures most of the heterogeneity in the field trials analysed.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-646) contains supplementary material, which is available to authorized users. 相似文献

3.

Identification of copy number variants from exome sequence data

Pubudu Saneth Samarakoon Hanne S?rmo Sorte Bj?rn Evert Kristiansen Tove Skodje Ying Sheng Geir E Tj?nnfjord Barbro Stadheim Asbj?rg Stray-Pedersen Olaug Kristin R?dningen Robert Lyle 《BMC genomics》2014,15(1)

Background

With advances in next generation sequencing technologies and genomic capture techniques, exome sequencing has become a cost-effective approach for mutation detection in genetic diseases. However, computational prediction of copy number variants (CNVs) from exome sequence data is a challenging task. Whilst numerous programs are available, they have different sensitivities, and have low sensitivity to detect smaller CNVs (1–4 exons). Additionally, exonic CNV discovery using standard aCGH has limitations due to the low probe density over exonic regions. The goal of our study was to develop a protocol to detect exonic CNVs (including shorter CNVs that cover 1–4 exons), combining computational prediction algorithms and a high-resolution custom CGH array.

Results

We used six published CNV prediction programs (ExomeCNV, CONTRA, ExomeCopy, ExomeDepth, CoNIFER, XHMM) and an in-house modification to ExomeCopy and ExomeDepth (ExCopyDepth) for computational CNV prediction on 30 exomes from the 1000 genomes project and 9 exomes from primary immunodeficiency patients. CNV predictions were tested using a custom CGH array designed to capture all exons (exaCGH). After this validation, we next evaluated the computational prediction of shorter CNVs. ExomeCopy and the in-house modified algorithm, ExCopyDepth, showed the highest capability in detecting shorter CNVs. Finally, the performance of each computational program was assessed by calculating the sensitivity and false positive rate.

Conclusions

In this paper, we assessed the ability of 6 computational programs to predict CNVs, focussing on short (1–4 exon) CNVs. We also tested these predictions using a custom array targeting exons. Based on these results, we propose a protocol to identify and confirm shorter exonic CNVs combining computational prediction algorithms and custom aCGH experiments.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-661) contains supplementary material, which is available to authorized users. 相似文献

4.

MAC: identifying and correcting annotation for multi-nucleotide variations

Lei Wei Lu T. Liu Jacob R. Conroy Qiang Hu Jeffrey M. Conroy Carl D. Morrison Candace S. Johnson Jianmin Wang Song Liu 《BMC genomics》2015,16(1)

Background

Next-Generation Sequencing (NGS) technologies have rapidly advanced our understanding of human variation in cancer. To accurately translate the raw sequencing data into practical knowledge, annotation tools, algorithms and pipelines must be developed that keep pace with the rapidly evolving technology. Currently, a challenge exists in accurately annotating multi-nucleotide variants (MNVs). These tandem substitutions, when affecting multiple nucleotides within a single protein codon of a gene, result in a translated amino acid involving all nucleotides in that codon. Most existing variant callers report a MNV as individual single-nucleotide variants (SNVs), often resulting in multiple triplet codon sequences and incorrect amino acid predictions. To correct potentially misannotated MNVs among reported SNVs, a primary challenge resides in haplotype phasing which is to determine whether the neighboring SNVs are co-located on the same chromosome.

Results

Here we describe MAC (Multi-Nucleotide Variant Annotation Corrector), an integrative pipeline developed to correct potentially mis-annotated MNVs. MAC was designed as an application that only requires a SNV file and the matching BAM file as data inputs. Using an example data set containing 3024 SNVs and the corresponding whole-genome sequencing BAM files, we show that MAC identified eight potentially mis-annotated SNVs, and accurately updated the amino acid predictions for seven of the variant calls.

Conclusions

MAC can identify and correct amino acid predictions that result from MNVs affecting multiple nucleotides within a single protein codon, which cannot be handled by most existing SNV-based variant pipelines. The MAC software is freely available and represents a useful tool for the accurate translation of genomic sequence to protein function. 相似文献

5.

Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle

Rianne van Binsbergen Mario P. L. Calus Marco C. A. M. Bink Fred A. van Eeuwijk Chris Schrooten Roel F. Veerkamp 《遗传、选种与进化》2015,47(1)

Background

In contrast to currently used single nucleotide polymorphism (SNP) panels, the use of whole-genome sequence data is expected to enable the direct estimation of the effects of causal mutations on a given trait. This could lead to higher reliabilities of genomic predictions compared to those based on SNP genotypes. Also, at each generation of selection, recombination events between a SNP and a mutation can cause decay in reliability of genomic predictions based on markers rather than on the causal variants. Our objective was to investigate the use of imputed whole-genome sequence genotypes versus high-density SNP genotypes on (the persistency of) the reliability of genomic predictions using real cattle data.

Methods

Highly accurate phenotypes based on daughter performance and Illumina BovineHD Beadchip genotypes were available for 5503 Holstein Friesian bulls. The BovineHD genotypes (631,428 SNPs) of each bull were used to impute whole-genome sequence genotypes (12,590,056 SNPs) using the Beagle software. Imputation was done using a multi-breed reference panel of 429 sequenced individuals. Genomic estimated breeding values for three traits were predicted using a Bayesian stochastic search variable selection (BSSVS) model and a genome-enabled best linear unbiased prediction model (GBLUP). Reliabilities of predictions were based on 2087 validation bulls, while the other 3416 bulls were used for training.

Results

Prediction reliabilities ranged from 0.37 to 0.52. BSSVS performed better than GBLUP in all cases. Reliabilities of genomic predictions were slightly lower with imputed sequence data than with BovineHD chip data. Also, the reliabilities tended to be lower for both sequence data and BovineHD chip data when relationships between training animals were low. No increase in persistency of prediction reliability using imputed sequence data was observed.

Conclusions

Compared to BovineHD genotype data, using imputed sequence data for genomic prediction produced no advantage. To investigate the putative advantage of genomic prediction using (imputed) sequence data, a training set with a larger number of individuals that are distantly related to each other and genomic prediction models that incorporate biological information on the SNPs or that apply stricter SNP pre-selection should be considered.

Electronic supplementary material

The online version of this article (doi:10.1186/s12711-015-0149-x) contains supplementary material, which is available to authorized users. 相似文献

6.

HotSpotter: efficient visualization of driver mutations

Jason Roszik Scott E Woodman 《BMC genomics》2014,15(1)

Background

Driver mutations are positively selected during the evolution of cancers. The relative frequency of a particular mutation within a gene is typically used as a criterion for identifying a driver mutation. However, driver mutations may occur with relative infrequency at a particular site, but cluster within a region of the gene. When analyzing across different cancers, particular mutation sites or mutations within a particular region of the gene may be of relatively low frequency in some cancers, but still provide selective growth advantage.

Results

This paper presents a method that allows rapid and easy visualization of mutation data sets and identification of potential gene mutation hotspot sites and/or regions. As an example, we identified hotspot regions in the NFE2L2 gene that are potentially functionally relevant in endometrial cancer, but would be missed using other analyses.

Conclusions

HotSpotter is a quick, easy-to-use visualization tool that delivers gene identities with associated mutation locations and frequencies overlaid upon a large cancer mutation reference set. This allows the user to identify potential driver mutations that are less frequent in a cancer or are localized in a hotspot region of relatively infrequent mutations.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-1044) contains supplementary material, which is available to authorized users. 相似文献

7.

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

Yohan Kim John Sidney S?ren Buus Alessandro Sette Morten Nielsen Bjoern Peters 《BMC bioinformatics》2014,15(1)

Background

It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set.

Results

We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.

Conclusion

It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-241) contains supplementary material, which is available to authorized users. 相似文献

8.

IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy

Franziska Zickmann Bernhard Y Renard 《BMC genomics》2015,16(1)

Background

Gene prediction is a challenging but crucial part in most genome analysis pipelines. Various methods have evolved that predict genes ab initio on reference sequences or evidence based with the help of additional information, such as RNA-Seq reads or EST libraries. However, none of these strategies is bias-free and one method alone does not necessarily provide a complete set of accurate predictions.

Results

We present IPred (Integrative gene Prediction), a method to integrate ab initio and evidence based gene identifications to complement the advantages of different prediction strategies. IPred builds on the output of gene finders and generates a new combined set of gene identifications, representing the integrated evidence of the single method predictions.

Conclusion

We evaluate IPred in simulations and real data experiments on Escherichia Coli and human data. We show that IPred improves the prediction accuracy in comparison to single method predictions and to existing methods for prediction combination.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1315-9) contains supplementary material, which is available to authorized users. 相似文献

9.

Are special read alignment strategies necessary and cost-effective when handling sequencing reads from patient-derived tumor xenografts?

Kai-Yuen Tso Sau Dan Lee Kwok-Wai Lo Kevin Y Yip 《BMC genomics》2014,15(1)

Background

Patient-derived tumor xenografts in mice are widely used in cancer research and have become important in developing personalized therapies. When these xenografts are subject to DNA sequencing, the samples could contain various amounts of mouse DNA. It has been unclear how the mouse reads would affect data analyses. We conducted comprehensive simulations to compare three alignment strategies at different mutation rates, read lengths, sequencing error rates, human-mouse mixing ratios and sequenced regions. We also sequenced a nasopharyngeal carcinoma xenograft and a cell line to test how the strategies work on real data.

Results

We found the "filtering" and "combined reference" strategies performed better than aligning reads directly to human reference in terms of alignment and variant calling accuracies. The combined reference strategy was particularly good at reducing false negative variants calls without significantly increasing the false positive rate. In some scenarios the performance gain of these two special handling strategies was too small for special handling to be cost-effective, but it was found crucial when false non-synonymous SNVs should be minimized, especially in exome sequencing.

Conclusions

Our study systematically analyzes the effects of mouse contamination in the sequencing data of human-in-mouse xenografts. Our findings provide information for designing data analysis pipelines for these data.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-1172) contains supplementary material, which is available to authorized users. 相似文献

10.

A meta-learning approach for B-cell conformational epitope prediction

Yuh-Jyh Hu Shun-Chien Lin Yu-Lung Lin Kuan-Hui Lin Shun-Ning You 《BMC bioinformatics》2014,15(1)

Background

One of the major challenges in the field of vaccine design is identifying B-cell epitopes in continuously evolving viruses. Various tools have been developed to predict linear or conformational epitopes, each relying on different physicochemical properties and adopting distinct search strategies. We propose a meta-learning approach for epitope prediction based on stacked and cascade generalizations. Through meta learning, we expect a meta learner to be able integrate multiple prediction models, and outperform the single best-performing model. The objective of this study is twofold: (1) to analyze the complementary predictive strengths in different prediction tools, and (2) to introduce a generic computational model to exploit the synergy among various prediction tools. Our primary goal is not to develop any particular classifier for B-cell epitope prediction, but to advocate the feasibility of meta learning to epitope prediction. With the flexibility of meta learning, the researcher can construct various meta classification hierarchies that are applicable to epitope prediction in different protein domains.

Results

We developed the hierarchical meta-learning architectures based on stacked and cascade generalizations. The bottom level of the hierarchy consisted of four conformational and four linear epitope prediction tools that served as the base learners. To perform consistent and unbiased comparisons, we tested the meta-learning method on an independent set of antigen proteins that were not used previously to train the base epitope prediction tools. In addition, we conducted correlation and ablation studies of the base learners in the meta-learning model. Low correlation among the predictions of the base learners suggested that the eight base learners had complementary predictive capabilities. The ablation analysis indicated that the eight base learners differentially interacted and contributed to the final meta model. The results of the independent test demonstrated that the meta-learning approach markedly outperformed the single best-performing epitope predictor.

Conclusions

Computational B-cell epitope prediction tools exhibit several differences that affect their performances when predicting epitopic regions in protein antigens. The proposed meta-learning approach for epitope prediction combines multiple prediction tools by integrating their complementary predictive strengths. Our experimental results demonstrate the superior performance of the combined approach in comparison with single epitope predictors.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0378-y) contains supplementary material, which is available to authorized users. 相似文献

11.

Genomic landscape of rat strain and substrain variation

Roel Hermsen Joep de Ligt Wim Spee Francis Blokzijl Sebastian Sch?fer Eleonora Adami Sander Boymans Stephen Flink Ruben van Boxtel Robin H van der Weide Tim Aitman Norbert Hübner Marieke Simonis Boris Tabakoff Victor Guryev Edwin Cuppen 《BMC genomics》2015,16(1)

Background

Since the completion of the rat reference genome in 2003, whole-genome sequencing data from more than 40 rat strains have become available. These data represent the broad range of strains that are used in rat research including commonly used substrains. Currently, this wealth of information cannot be used to its full extent, because the variety of different variant calling algorithms employed by different groups impairs comparison between strains. In addition, all rat whole genome sequencing studies to date used an outdated reference genome for analysis (RGSC3.4 released in 2004).

Results

Here we present a comprehensive, multi-sample and uniformly called set of genetic variants in 40 rat strains, including 19 substrains. We reanalyzed all primary data using a recent version of the rat reference assembly (RGSC5.0 released in 2012) and identified over 12 million genomic variants (SNVs, indels and structural variants) among the 40 strains. 28,318 SNVs are specific to individual substrains, which may be explained by introgression from other unsequenced strains and ongoing evolution by genetic drift. Substrain SNVs may have a larger predicted functional impact compared to older shared SNVs.

Conclusions

In summary we present a comprehensive catalog of uniformly analyzed genetic variants among 40 widely used rat inbred strains based on the RGSC5.0 assembly. This represents a valuable resource, which will facilitate rat functional genomic research. In line with previous observations, our genome-wide analyses do not show evidence for contribution of multiple ancestral founder rat subspecies to the currently used rat inbred strains, as is the case for mouse. In addition, we find that the degree of substrain variation is highly variable between strains, which is of importance for the correct interpretation of experimental data from different labs.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1594-1) contains supplementary material, which is available to authorized users. 相似文献

12.

Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed population leads to greater accuracy of across-breed genomic predictions

Kathryn E Kemper Coralie M Reich Philip J Bowman Christy J vander Jagt Amanda J Chamberlain Brett A Mason Benjamin J Hayes Michael E Goddard 《遗传、选种与进化》2015,47(1)

Background

Genomic selection is increasingly widely practised, particularly in dairy cattle. However, the accuracy of current predictions using GBLUP (genomic best linear unbiased prediction) decays rapidly across generations, and also as selection candidates become less related to the reference population. This is likely caused by the effects of causative mutations being dispersed across many SNPs (single nucleotide polymorphisms) that span large genomic intervals. In this paper, we hypothesise that the use of a nonlinear method (BayesR), combined with a multi-breed (Holstein/Jersey) reference population will map causative mutations with more precision than GBLUP and this, in turn, will increase the accuracy of genomic predictions for selection candidates that are less related to the reference animals.

Results

BayesR improved the across-breed prediction accuracy for Australian Red dairy cattle for five milk yield and composition traits by an average of 7% over the GBLUP approach (Australian Red animals were not included in the reference population). Using the multi-breed reference population with BayesR improved accuracy of prediction in Australian Red cattle by 2 – 5% compared to using BayesR with a single breed reference population. Inclusion of 8478 Holstein and 3917 Jersey cows in the reference population improved accuracy of predictions for these breeds by 4 and 5%. However, predictions for Holstein and Jersey cattle were similar using within-breed and multi-breed reference populations. We propose that the improvement in across-breed prediction achieved by BayesR with the multi-breed reference population is due to more precise mapping of quantitative trait loci (QTL), which was demonstrated for several regions. New candidate genes with functional links to milk synthesis were identified using differential gene expression in the mammary gland.

Conclusions

QTL detection and genomic prediction are usually considered independently but persistence of genomic prediction accuracies across breeds requires accurate estimation of QTL effects. We show that accuracy of across-breed genomic predictions was higher with BayesR than with GBLUP and that BayesR mapped QTL more precisely. Further improvements of across-breed accuracy of genomic predictions and QTL mapping could be achieved by increasing the size of the reference population, including more breeds, and possibly by exploiting pleiotropic effects to improve mapping efficiency for QTL with small effects.

Electronic supplementary material

The online version of this article (doi:10.1186/s12711-014-0074-4) contains supplementary material, which is available to authorized users. 相似文献

13.

Evaluation of variant identification methods for whole genome sequencing data in dairy cattle

Christine F Baes Marlies A Dolezal James E Koltes Beat Bapst Eric Fritz-Waters Sandra Jansen Christine Flury Heidi Signer-Hasler Christian Stricker Rohan Fernando Ruedi Fries Juerg Moll Dorian J Garrick James M Reecy Birgit Gredler 《BMC genomics》2014,15(1)

Background

Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays.

Results

The total number of SNVs identified varied by software and method, with single (multi) sample results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single (multi) sample results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single sample methods than multi sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition/transversion ratio.

Conclusions

Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implication of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-948) contains supplementary material, which is available to authorized users. 相似文献

14.

Deep sequencing of multiple regions of glial tumors reveals spatial heterogeneity for mutations in clinically relevant genes

Akash Kumar Evan A Boyle Mari Tokita Andrei M Mikheev Michelle C Sanger Emily Girard John R Silber Luis F Gonzalez-Cuyar Joseph B Hiatt Andrew Adey Choli Lee Jacob O Kitzman Donald E Born Daniel L Silbergeld James M Olson Robert C Rostomily Jay Shendure 《Genome biology》2014,15(12)

Background

The extent of intratumoral mutational heterogeneity remains unclear in gliomas, the most common primary brain tumors, especially with respect to point mutation. To address this, we applied single molecule molecular inversion probes targeting 33 cancer genes to assay both point mutations and gene amplifications within spatially distinct regions of 14 glial tumors.

Results

We find evidence of regional mutational heterogeneity in multiple tumors, including mutations in TP53 and RB1 in an anaplastic oligodendroglioma and amplifications in PDGFRA and KIT in two glioblastomas (GBMs). Immunohistochemistry confirms heterogeneity of TP53 mutation and PDGFRA amplification. In all, 3 out of 14 glial tumors surveyed have evidence for heterogeneity for clinically relevant mutations.

Conclusions

Our results underscore the need to sample multiple regions in GBM and other glial tumors when devising personalized treatments based on genomic information, and furthermore demonstrate the importance of measuring both point mutation and copy number alteration while investigating genetic heterogeneity within cancer samples.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-014-0530-z) contains supplementary material, which is available to authorized users. 相似文献

15.

Whole genome sequencing of an ethnic Pathan (Pakhtun) from the north-west of Pakistan

Muhammad Ilyas Jong-Soo Kim Jesse Cooper Young-Ah Shin Hak-Min Kim Yun Sung Cho Seungwoo Hwang Hyunho Kim Jaewoo Moon Oksung Chung JeHoon Jun Achal Rastogi Sanghoon Song Junsu Ko Andrea Manica Ziaur Rahman Tayyab Husnain Jong Bhak 《BMC genomics》2015,16(1)

Background

Pakistan covers a key geographic area in human history, being both part of the Indus River region that acted as one of the cradles of civilization and as a link between Western Eurasia and Eastern Asia. This region is inhabited by a number of distinct ethnic groups, the largest being the Punjabi, Pathan (Pakhtuns), Sindhi, and Baloch.

Results

We analyzed the first ethnic male Pathan genome by sequencing it to 29.7-fold coverage using the Illumina HiSeq2000 platform. A total of 3.8 million single nucleotide variations (SNVs) and 0.5 million small indels were identified by comparing with the human reference genome. Among the SNVs, 129,441 were novel, and 10,315 nonsynonymous SNVs were found in 5,344 genes. SNVs were annotated for health consequences and high risk diseases, as well as possible influences on drug efficacy. We confirmed that the Pathan genome presented here is representative of this ethnic group by comparing it to a panel of Central Asians from the HGDP-CEPH panels typed for ~650 k SNPs. The mtDNA (H2) and Y haplogroup (L1) of this individual were also typical of his geographic region of origin. Finally, we reconstruct the demographic history by PSMC, which highlights a recent increase in effective population size compatible with admixture between European and Asian lineages expected in this geographic region.

Conclusions

We present a whole-genome sequence and analyses of an ethnic Pathan from the north-west province of Pakistan. It is a useful resource to understand genetic variation and human migration across the whole Asian continent.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1290-1) contains supplementary material, which is available to authorized users. 相似文献

16.

Novel Point and Combo-Mutations in the Genome of Hepatitis B Virus-Genotype D: Characterization and Impact on Liver Disease Progression to Hepatocellular Carcinoma

Somenath Datta Alip Ghosh Debanjali Dasgupta Amit Ghosh Shrabasti Roychoudhury Gaurav Roy Soumyojit Das Kausik Das Subash Gupta Keya Basu Analabha Basu Simanti Datta Abhijit Chowdhury Soma Banerjee 《PloS one》2014,9(10)

Background

The contribution of chronic hepatitis B virus (HBV) infection in the pathogenesis of hepatocellular carcinoma (HCC) through progressive stages of liver fibrosis is exacerbated by the acquisition of naturally occurring mutations in its genome. This study has investigated the prevalence of single and combo mutations in the genome of HBV-genotype D from treatment naïve Indian patients of progressive liver disease stages and assessed their impact on the disease progression to HCC.

Methods

The mutation profile was determined from the sequence analysis of the full-length HBV genome and compared with the reference HBV sequences. SPSS 16.0 and R software were used to delineate their statistical significance in predicting HCC occurrence.

Results

Age was identified as associated risk factor for HCC development in chronic hepatitis B (CHB) patients (p≤0.01). Beyond the classical mutations in basal core promoter (BCP) (A1762T/G1764A) and precore (G1862T), persistence of progressively accumulated mutations in enhancer-I, surface, HBx and core were showed significant association to liver disease progression. BCP_T1753C, core_T147C, surface_L213I had contributed significantly in the disease progression to HCC (p<0.05) in HBeAg positive patients whereas precore_T1858C, core_I116L, core_P130Q and preS1_S98T in HBeAg negative patients. Furthermore, the effect of individual mutation was magnified by the combination with A1762T/G1764A in HCC pathogenesis. Multivariate risk analysis had confirmed that core_P130Q [OR 20.71, 95% CI (1.64–261.77), p = 0.019] in B cell epitope and core_T147C [OR 14.58, 95% CI (1.17–181.76), p = 0.037] in CTL epitope were two independent predictors of HCC in HBeAg positive and negative patients respectively.

Conclusions

Thus distinct pattern of mutations distributed across the entire HBV genome may be useful in predicting HCC in high-risk CHB patients and pattern of mutational combinations may exert greater impact on HCC risk prediction more accurately than point mutations and hence these predictors may support the existing surveillance strategies in proper management of the patients. 相似文献

17.

Using whole-genome sequences of the LG/J and SM/J inbred mouse strains to prioritize quantitative trait genes and nucleotides

Igor Nikolskiy Donald F Conrad Sung Chun Justin C Fay James M Cheverud Heather A Lawson 《BMC genomics》2015,16(1)

相似文献

18.

Towards a Rigorous Assessment of Systems Biology Models: The DREAM3 Challenges

Robert J. Prill Daniel Marbach Julio Saez-Rodriguez Peter K. Sorger Leonidas G. Alexopoulos Xiaowei Xue Neil D. Clarke Gregoire Altan-Bonnet Gustavo Stolovitzky 《PloS one》2010,5(2)

Background

Systems biology has embraced computational modeling in response to the quantitative nature and increasing scale of contemporary data sets. The onslaught of data is accelerating as molecular profiling technology evolves. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) is a community effort to catalyze discussion about the design, application, and assessment of systems biology models through annual reverse-engineering challenges.

Methodology and Principal Findings

We describe our assessments of the four challenges associated with the third DREAM conference which came to be known as the DREAM3 challenges: signaling cascade identification, signaling response prediction, gene expression prediction, and the DREAM3 in silico network challenge. The challenges, based on anonymized data sets, tested participants in network inference and prediction of measurements. Forty teams submitted 413 predicted networks and measurement test sets. Overall, a handful of best-performer teams were identified, while a majority of teams made predictions that were equivalent to random. Counterintuitively, combining the predictions of multiple teams (including the weaker teams) can in some cases improve predictive power beyond that of any single method.

Conclusions

DREAM provides valuable feedback to practitioners of systems biology modeling. Lessons learned from the predictions of the community provide much-needed context for interpreting claims of efficacy of algorithms described in the scientific literature. 相似文献

19.

Selection of haplotype variables from a high-density marker map for genomic prediction

Beatriz CD Cuyabano Guosheng Su Mogens S. Lund 《遗传、选种与进化》2015,47(1)

Background

Using haplotype blocks as predictors rather than individual single nucleotide polymorphisms (SNPs) may improve genomic predictions, since haplotypes are in stronger linkage disequilibrium with the quantitative trait loci than are individual SNPs. It has also been hypothesized that an appropriate selection of a subset of haplotype blocks can result in similar or better predictive ability than when using the whole set of haplotype blocks. This study investigated genomic prediction using a set of haplotype blocks that contained the SNPs with large effects estimated from an individual SNP prediction model. We analyzed protein yield, fertility and mastitis of Nordic Holstein cattle, and used high-density markers (about 770k SNPs). To reach an optimum number of haplotype variables for genomic prediction, predictions were performed using subsets of haplotype blocks that contained a range of 1000 to 50 000 main SNPs.

Results

The use of haplotype blocks improved the prediction reliabilities, even when selection focused on only a group of haplotype blocks. In this case, the use of haplotype blocks that contained the 20 000 to 50 000 SNPs with the highest effect was sufficient to outperform the model that used all individual SNPs as predictors (up to 1.3 % improvement in prediction reliability for mastitis, compared to individual SNP approach), and the achieved reliabilities were similar to those using all haplotype blocks available in the genome data (from 0.6 % lower to 0.8 % higher reliability).

Conclusions

Haplotype blocks used as predictors can improve the reliability of genomic prediction compared to the individual SNP model. Furthermore, the use of a subset of haplotype blocks that contains the main SNP effects from genomic data could be a feasible approach to genomic prediction in dairy cattle, given an increase in density of genotype data available. The predictive ability of the models that use a subset of haplotype blocks was similar to that obtained using either all haplotype blocks or all individual SNPs, with the benefit of having a much lower computational demand. 相似文献

20.

Whole genome profiling of spontaneous and chemically induced mutations in Toxoplasma gondii

Andrew Farrell Bradley I Coleman Brian Benenati Kevin M Brown Ira J Blader Gabor T Marth Marc-Jan Gubbels 《BMC genomics》2014,15(1)

Background

Next generation sequencing is helping to overcome limitations in organisms less accessible to classical or reverse genetic methods by facilitating whole genome mutational analysis studies. One traditionally intractable group, the Apicomplexa, contains several important pathogenic protozoan parasites, including the Plasmodium species that cause malaria.Here we apply whole genome analysis methods to the relatively accessible model apicomplexan, Toxoplasma gondii, to optimize forward genetic methods for chemical mutagenesis using N-ethyl-N-nitrosourea (ENU) and ethylmethane sulfonate (EMS) at varying dosages.

Results

By comparing three different lab-strains we show that spontaneously generated mutations reflect genome composition, without nucleotide bias. However, the single nucleotide variations (SNVs) are not distributed randomly over the genome; most of these mutations reside either in non-coding sequence or are silent with respect to protein coding. This is in contrast to the random genomic distribution of mutations induced by chemical mutagenesis. Additionally, we report a genome wide transition vs transversion ratio (ti/tv) of 0.91 for spontaneous mutations in Toxoplasma, with a slightly higher rate of 1.20 and 1.06 for variants induced by ENU and EMS respectively. We also show that in the Toxoplasma system, surprisingly, both ENU and EMS have a proclivity for inducing mutations at A/T base pairs (78.6% and 69.6%, respectively).

Conclusions

The number of SNVs between related laboratory strains is relatively low and managed by purifying selection away from changes to amino acid sequence. From an experimental mutagenesis point of view, both ENU (24.7%) and EMS (29.1%) are more likely to generate variation within exons than would naturally accumulate over time in culture (19.1%), demonstrating the utility of these approaches for yielding proportionally greater changes to the amino acid sequence. These results will not only direct the methods of future chemical mutagenesis in Toxoplasma, but also aid in designing forward genetic approaches in less accessible pathogenic protozoa as well.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-354) contains supplementary material, which is available to authorized users. 相似文献