Similar literature
19 similar articles found.
1.
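Objective: To build an analysis pipeline for predicting distant kinship based on an identity-by-descent (IBD) segment algorithm and to evaluate its prediction accuracy. Methods: A total of 253 pedigree samples were genotyped with a high-density single nucleotide polymorphism (SNP) array, and the IBD segment-based pipeline was used to predict kinship between every pair of individuals and to evaluate prediction accuracy. SNP sites were randomly removed to assess the effect of the number of sites on the accuracy of the algorithm. Results: For 1st- to 7th-degree kinship, the IBD segment algorithm achieved a mean confidence-interval accuracy of 94.72% and a prediction reliability of 99.77%; false negatives appeared when predicting kinship of the 6th degree and beyond. Prediction accuracy declined to some extent as the number of SNPs decreased. Conclusion: The IBD segment algorithm can be used to predict kinship within seven degrees and has important application value in fields such as population genetics and forensic genetics.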

2.
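Objective: The QuickTargSeq fully integrated forensic DNA rapid on-site detection system is the first independently developed on-site rapid testing instrument in China. It can be applied to InDel-based population inference and completes rapid, automated "sample-in, result-out" InDel genotyping in about 2 h. This paper evaluates the performance of the InDel population-inference microfluidic chip assay in order to provide a reference for practical application. Methods: Using the InDel population-inference microfluidic chip assay, the sensitivity, tolerance to interfering substances, success rate, genotyping accuracy, precision, sizing accuracy, peak balance, and sample-type adaptability of the system were validated, and the population origin of the test samples was inferred. Results: For 138 samples, the fully integrated detection success rate was 95.65% and the genotyping accuracy was 98.85%. Complete InDel profiles were obtained with ≥5 ng of DNA template; the optimal collection for buccal swabs was 8 scrapes on each of the left and right inner cheeks, and the optimal input for blood cards was 6 punches (Φ = 2 mm). The mean heterozygote peak height ratio across all loci was 0.86. Over 10 runs, the standard deviation of allelic ladder fragment sizes was within 0.3 bp, and the sizing difference between sample alleles and the corresponding allelic ladder alleles was within 0.5 bp. Conclusion: The system can accurately genotype buccal swab, blood card, saliva card, and cigarette butt samples and can accurately infer the population origin of samples.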

3.
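Objective: To evaluate the accuracy of predicting kinship between individuals using an identity-by-state (IBS) algorithm. Methods: A total of 253 samples were genotyped genome-wide with the Illumina GSA array, and pairwise IBS sharing statistics computed from the high-density single nucleotide polymorphism (SNP) data were used to predict kinship. Through...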
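The abstract above is truncated, so the exact IBS statistic it uses is not known. As a minimal sketch of the general idea, the following assumes genotypes coded as 0/1/2 copies of one allele and counts sites where a pair of individuals shares 0, 1, or 2 alleles by state. The function name `ibs_summary` and the use of `None` for missing calls are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def ibs_summary(genotypes_a, genotypes_b):
    """Count IBS0/IBS1/IBS2 sites for two individuals.

    Genotypes are coded as 0, 1, or 2 copies of one allele; None marks a
    missing call (an assumption of this sketch). The number of alleles
    shared by state at a site is 2 - |a - b|.
    """
    counts = Counter()
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip sites with a missing call in either individual
        counts[2 - abs(a - b)] += 1
    n_sites = sum(counts.values())
    if n_sites == 0:
        raise ValueError("no sites with calls in both individuals")
    # Mean proportion of alleles shared by state across usable sites.
    mean_ibs = (counts[1] + 2 * counts[2]) / (2 * n_sites)
    return {
        "ibs0": counts[0],
        "ibs1": counts[1],
        "ibs2": counts[2],
        "mean_ibs": mean_ibs,
    }


if __name__ == "__main__":
    print(ibs_summary([0, 1, 2, 2, None], [0, 1, 0, 1, 2]))
```

Close relatives tend to show few IBS0 sites and a high mean IBS, which is why such pairwise summaries can serve as inputs to a kinship classifier.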

4.
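《遗传》 (Hereditas (Beijing)), 2021, (9)
Inferring the population origin of a sample can play an important role in forensic investigation; an ideal inference system achieves high accuracy with a relatively small set of genetic markers. This study collected 428 ancestry informative SNPs (AISNPs) that distinguish three northern East Asian populations (northern Han Chinese, Japanese, and Koreans), obtained their genotypes in 307 samples from the three populations, and further reduced the marker set using per-locus Fst values and clustering of allele frequencies, finally yielding a panel of 49 AISNPs. Leave-one-out validation of the 49-AISNP panel on the 307 samples showed inference accuracy above 99% in each of the northern Han, Japanese, and Korean populations. The 49-AISNP panel will help to further distinguish sub-populations within East Asia.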
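The abstract above ranks markers by per-locus Fst but does not say which estimator was used. A minimal sketch, assuming a biallelic SNP, equal population weights, and the classical heterozygosity-based definition Fst = (H_T - H_S) / H_T, might look as follows. The function name `fst_biallelic` and the example frequencies are hypothetical.

```python
def fst_biallelic(allele_freqs):
    """Heterozygosity-based Fst for one biallelic SNP.

    allele_freqs holds the frequency of the same allele in each population.
    Populations are weighted equally, which is an assumption of this sketch.
    """
    if len(allele_freqs) < 2:
        raise ValueError("need at least two populations")
    if any(p < 0.0 or p > 1.0 for p in allele_freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")
    k = len(allele_freqs)
    p_bar = sum(allele_freqs) / k
    # Expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # monomorphic everywhere: no differentiation to measure
    # Mean expected heterozygosity within populations.
    h_within = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / k
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical frequencies in three populations.
    print(fst_biallelic([0.2, 0.5, 0.8]))
```

Markers with higher Fst separate the populations more strongly, so a panel can be thinned by keeping the top-ranked loci and then checking accuracy with leave-one-out validation.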

5.
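Objective: Male pattern baldness (MPB), also known as androgenetic alopecia (AGA), is a common type of hair loss in men, and about 80% of the phenotypic variance can be explained by genetic factors. Current research on genetic prediction of MPB is based mainly on European populations, with few studies in East Asian populations. This study validated MPB-associated loci identified in Europeans in a Chinese population and built genetic prediction models. Methods: The association of 486 single nucleotide polymorphism (SNP) loci linked to MPB in Europeans was examined in 312 Han Chinese men, and the associated loci were screened using stepwise regression and Lasso regression, respectively. Prediction models were built with logistic regression and evaluated by ten-fold cross-validation. The prediction accuracy for MPB of four commonly used classifiers (logistic regression, k-nearest neighbors, random forest, and support vector machine) was then compared. Results: A total of 174 SNPs were significantly associated with MPB in Han Chinese men (P<0.05). The two screening methods yielded sets of 22 SNPs and 25 SNPs, respectively, from which 22-SNP and 25-SNP logistic regression prediction models were built. Measured by AUC (area under the ROC curve), the two models predicted MPB with accuracies of 0.85 and 0.84, which fell to 0.81 and 0.77 after ten-fold cross-validation. With age added as a predictor, the AUC of both models reached a maximum of 0.89. In these runs, the logistic regression model clearly outperformed the other classifiers in this study. Conclusion: Overall, although the accuracy of the prediction models has not yet reached the level expected for clinical use, SNPs still have great potential for genetic prediction of MPB and can inform early diagnosis, clinical intervention, and forensic applications.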
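The models above are compared by AUC. As a small illustration of what that number measures, the sketch below computes AUC directly as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, counting ties as one half. The function name `auc` and the toy labels and scores are assumptions for illustration and are unrelated to the study data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels are 1 (case) or 0 (control); scores are predicted risks.
    This O(n_pos * n_neg) form is meant for clarity, not for large datasets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a ten-fold cross-validation, this quantity would be computed on each held-out fold, which is why cross-validated AUC is usually lower than AUC measured on the training data.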

6.
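Objective: Hair shafts are common biological evidence at crime scenes, but for lack of an effective method of individual identification they have not played a role in case investigation or court proceedings. Single amino acid polymorphisms (SAPs) in the hair shaft proteome carry information on individual genetic differences and can be applied to individual identification. Methods: To study individual SAP differences in hair shaft evidence, 12 hair shaft samples of 2 cm length (6 individuals, 2 hairs each) were pretreated with an ionic liquid and analysed by LC-MS/MS to characterise the protein composition of the hair shaft. The mass spectrometry data were then searched against a self-built East Asian SAP protein sequence database; using a self-built SAP-to-SNP annotation table, the nsSNP genotype corresponding to each SAP was derived and compared with nsSNP results from exome sequencing to verify the accuracy of SAP detection. Finally, the verified SAP genotypes were used to calculate random match probabilities. Results: A total of 321 SAPs were obtained from the 12 samples, with a mean of (131±17) per sample. The random match probabilities for the 6 individuals ranged from 1.4×10⁻⁴ to 1.0×10⁻⁹. Conclusion: This paper established a method for detecting SAPs in hair shaft proteins in the East Asian population and verified its capability for individual identification, providing a powerful tool and a new approach for hair shaft-based individual identification in forensic science.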

7.
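Objective: Long non-coding RNAs play important roles in heredity, metabolism, and the regulation of gene expression. However, determining RNA tertiary structure by conventional experimental methods is time-consuming, costly, and technically demanding, and computational prediction of RNA tertiary structure has seen no breakthrough in the past decade. New prediction algorithms are therefore needed to predict RNA tertiary structure accurately. This paper develops a base contact map prediction method that can be used to improve the accuracy of RNA tertiary structure prediction. Methods: To exploit physicochemical features of RNA, deep learning with multi-layer fully convolutional neural networks and recurrent neural networks was applied to predict contact probabilities between RNA bases, and an attention mechanism was used to handle the interdependent features among bases in the RNA sequence. Results: By combining multi-layer neural networks with the attention mechanism, the method effectively captures both local and global information in the RNA features, improving the robustness and generalisation ability of the model. Test calculations show that, for the four standard cutoffs based on sequence length L (L/10, L/5, L/2, and L), the prediction accuracy of the base contact map reached 0.84, 0.82, 0.82, and 0.75, respectively. Conclusion: A deep learning prediction algorithm based on the attention mechanism can improve the accuracy of RNA base contact map prediction and thereby help in predicting RNA tertiary structure.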
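The L/10, L/5, L/2, and L figures above are most naturally read as top-k precision: rank all base pairs by predicted contact probability, keep the k highest, and report the fraction that are true contacts. The abstract does not spell out its exact protocol, so the sketch below is an assumption-laden illustration; the function name `top_k_precision`, the `min_separation` parameter, and the toy matrices are hypothetical.

```python
def top_k_precision(prob, truth, k, min_separation=1):
    """Precision of the k most confident predicted contacts.

    prob[i][j] is the predicted contact probability and truth[i][j] is 1 for
    a true contact, else 0. Only pairs with j - i >= min_separation in the
    upper triangle are ranked (the separation cutoff is an assumption).
    """
    n = len(prob)
    ranked = [
        (prob[i][j], i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
    ]
    ranked.sort(key=lambda item: item[0], reverse=True)
    top = ranked[:k]
    if not top:
        raise ValueError("no candidate pairs to evaluate")
    return sum(truth[i][j] for _, i, j in top) / len(top)


if __name__ == "__main__":
    # Toy example for a sequence of length L = 4.
    prob = [
        [0.0, 0.9, 0.1, 0.8],
        [0.0, 0.0, 0.7, 0.2],
        [0.0, 0.0, 0.0, 0.6],
        [0.0, 0.0, 0.0, 0.0],
    ]
    truth = [
        [0, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ]
    L = len(prob)
    print(top_k_precision(prob, truth, k=L // 2))
    print(top_k_precision(prob, truth, k=L))
```

Precision typically falls as k grows from L/10 to L, because lower-confidence pairs enter the evaluated set; the reported drop from 0.84 to 0.75 follows that pattern.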

8.
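Abstract. Objective: To explore the association of single nucleotide polymorphisms (SNPs) and phylogenetic typing of the ferric uptake regulator (Fur) gene of Helicobacter pylori (Hp) with gastric cancer. Methods: A total of 150 Hp strains preserved at Qingdao Municipal Hospital from 2011 to 2018 (59 from gastric cancer and 91 from gastritis) were selected; the fur gene was amplified by polymerase chain reaction (PCR), followed by first-generation sequencing and SNP analysis. The fur gene sequences of 226 East Asian subgroup Hp strains were downloaded from the NCBI database, and MEGA 5.0 was used to analyse SNPs and construct a Neighbour-Joining phylogenetic tree of the fur gene to establish phylogenetic types. Results: PCR amplification of the fur gene was positive in 98.7% (148/150) of Hp strains. Sequence analysis identified a synonymous A→G SNP at position 351 of the fur gene (SNP A351G); the frequency of the G allele was significantly higher in strains from gastric cancer than in strains from gastritis (χ²=5.161, P=0.023), and strains carrying this allele had a markedly increased risk of gastric cancer (OR=2.4). In the Neighbour-Joining tree of the East Asian Hp fur gene, strains were divided by evolutionary distance into two subtypes, type I and type II; the proportion of gastric cancer-derived strains was significantly higher in type I than in type II (χ²=41.8, P=9.9×10⁻¹¹), and patients infected with fur type I Hp had a significantly increased risk of gastric cancer (OR=4.7). Conclusion: Hp strains carrying fur SNP A351G confer a significantly increased risk of gastric cancer, and fur phylogenetic type I shows some correlation with gastric cancer risk.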
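The abstract above reports odds ratios and χ² statistics but not the underlying 2×2 counts. As a minimal sketch of how both quantities follow from such a table, the code below uses the standard formulas OR = ad/bc and the Pearson χ² without continuity correction. The function name `odds_ratio_and_chi2` and the counts in the example are hypothetical and are not taken from the study.

```python
def odds_ratio_and_chi2(a, b, c, d):
    """Odds ratio and Pearson chi-square for a 2x2 table.

    a = carriers among cases,     b = carriers among controls,
    c = non-carriers among cases, d = non-carriers among controls.
    No continuity correction is applied (an assumption of this sketch).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive in this simple sketch")
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi2


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(odds_ratio_and_chi2(20, 10, 10, 20))
```

With one degree of freedom, a χ² above about 3.84 corresponds to P<0.05, which is the threshold the reported statistics are being compared against.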

9.
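Objective: Prediction models based on position-specific scoring matrices (PSSM) have achieved good results, and various PSSM-based optimisation methods continue to be developed, but accuracy remains relatively low. To further improve prediction accuracy, this paper carries out further work based on the convolutional neural network (CNN) algorithm. Methods: Promoter sequences were converted into numeric matrices using PSSM and classified with a CNN. Sigma38, Sigma54, and Sigma70 promoter sequences of Escherichia coli K-12 (E. coli K-12, hereafter E. coli) were used as the positive set, and sequences from coding and non-coding regions as the negative set. Results: In binary classification of E. coli promoters, accuracy reached 99% and the success rate of promoter prediction was close to 100%. In three-class classification of Sigma38, Sigma54, and Sigma70 promoters, prediction accuracy was 98%, and accuracy for each sequence type was above 98%. Finally, in four-class classification of the three promoter types together with either coding or non-coding sequences, prediction accuracy was 0.98; ten-fold cross-validation on balanced samples of the three Sigma promoters gave prediction precision above 0.95 in every case, with a Hamming distance of 0.016 and a Kappa coefficient of 0.97. Conclusion: Compared with other classification algorithms such as the support vector machine (SVM), the CNN classifier has advantages, and thanks to these advantages the encoding scheme can also be simplified.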
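The Kappa coefficient quoted above corrects raw accuracy for agreement expected by chance. A minimal sketch of Cohen's kappa computed from a confusion matrix is given below; the function name `cohen_kappa` and the example matrix are hypothetical, and the abstract does not state how its Kappa value was computed.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    kappa = (p_observed - p_expected) / (1 - p_expected).
    """
    k = len(confusion)
    if any(len(row) != k for row in confusion):
        raise ValueError("confusion matrix must be square")
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: product of row and column marginals for each class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical two-class confusion matrix.
    print(cohen_kappa([[45, 5], [5, 45]]))
```

On balanced classes, kappa sits below raw accuracy (0.8 versus 0.9 in this toy case), which is consistent with the abstract reporting 0.97 alongside an accuracy of 0.98.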

10.
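Objective: The stable isotope composition of human tissue is related to an individual's diet, environment, and metabolic state during the period of tissue growth. Once grown, human hair no longer exchanges material with the body; it is chemically stable and easy to collect, making it a good subject for studying the stable isotope composition of human tissue. The oxygen and hydrogen in the human body come mainly from ingested water and food, and their stable isotope compositions are recorded in hair in the form of keratin. Differences in the oxygen and hydrogen stable isotope composition of hair among residents of different regions can be used to infer a person's diet, region of residence, and movement history, which is of great significance in forensic science and related fields. Methods: An elemental analyser-isotope ratio mass spectrometer (EA-IRMS) was used to measure and analyse oxygen and hydrogen stable isotope ratios in hair samples from long-term residents of different regions of China. Results: δ¹⁸O and δ²H in residents' hair differed significantly between some cities, and overall δ¹⁸O and δ²H were significantly positively correlated. Discriminant analysis of the stable isotope data to infer the geographic origin of hair gave an overall cross-validated accuracy of 63.9%; when carbon and nitrogen stable isotope data were added, accuracy rose substantially, with the overall cross-validated accuracy reaching 76.0%. The discriminating power of the discriminant function model increased markedly as more stable isotopes were included. Conclusion: The multilayer perceptron neural network model built on stable isotope data for the four elements had an overall discrimination accuracy of 82.8%, and the radial basis function neural network model 78.8%; of the three mathematical models for provenance inference, the multilayer perceptron neural network model had the highest accuracy.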

11.

Background

Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the software programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data is becoming increasingly available, enabling the comprehensive analysis of genomes, including identifying rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease-causing loci that were previously undetectable with sparser genetic data.

Results

Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influence the resolution at which IBD can be detected. This includes microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. This compares to similar power to detect segments of length 1.0 cM or higher with microarray genotype data. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.

Conclusion

We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.

12.
Sequencing family DNA samples provides an attractive alternative to population-based designs to identify rare variants associated with human disease due to the enrichment of causal variants in pedigrees. Previous studies showed that genotype calling accuracy can be improved by modeling family relatedness compared to standard calling algorithms. Current family-based variant calling methods use sequencing data on single variants and ignore the identity-by-descent (IBD) sharing along the genome. In this study we describe a new computational framework to accurately estimate the IBD sharing from the sequencing data, and to utilize the inferred IBD among family members to jointly call genotypes in pedigrees. Through simulations and application to real data, we showed that IBD can be reliably estimated across the genome, even at very low coverage (e.g. 2X), and genotype accuracy can be dramatically improved. Moreover, the improvement is more pronounced for variants with low frequencies, especially at low to intermediate coverage (e.g. 10X to 20X), making our approach effective in studying rare variants in cost-effective whole genome sequencing in pedigrees. We hope that our tool is useful to the research community for identifying rare variants for human disease through family-based sequencing.

13.
Restriction‐site associated DNA sequencing (RADSeq) facilitates rapid generation of thousands of genetic markers at relatively low cost; however, several sources of error specific to RADSeq methods often lead to biased estimates of allele frequencies and thereby to erroneous population genetic inference. Estimating the distribution of sample allele frequencies without calling genotypes was shown to improve population inference from whole genome sequencing data, but the ability of this approach to account for RADSeq‐specific biases remains unexplored. Here we assess to what extent genotype‐free methods of allele frequency estimation affect demographic inference from empirical RADSeq data. Using the well‐studied pied flycatcher (Ficedula hypoleuca) as a study system, we compare allele frequency estimation and demographic inference from whole genome sequencing data with that from RADSeq data matched for samples using both genotype‐based and genotype‐free methods. The demographic history of pied flycatchers as inferred from RADSeq data was highly congruent with that inferred from whole genome resequencing (WGS) data when allele frequencies were estimated directly from the read data. In contrast, when allele frequencies were derived from called genotypes, RADSeq‐based estimates of most model parameters fell outside the 95% confidence interval of estimates derived from WGS data. Notably, more stringent filtering of the genotype calls tended to increase the discrepancy between parameter estimates from WGS and RADSeq data. The results from this study demonstrate the ability of genotype‐free methods to improve allele frequency spectrum‐ (AFS‐) based demographic inference from empirical RADSeq data and highlight the need to account for uncertainty in NGS data regardless of sequencing method.

14.
The detection of genetic segments that are identical by descent (IBD) in Genome-Wide Association Studies has proven successful in pinpointing genetic relatedness between reportedly unrelated individuals and leveraging such regions to shortlist candidate genes. These techniques depend on high-density genotyping arrays and their effectiveness in diverse sequence data is largely unknown. Due to decreasing costs and increasing effectiveness of high throughput techniques for whole-exome sequencing, an influx of exome sequencing data has become available. Studies using exomes and IBD-detection methods within known pedigrees have shown that IBD can be useful in finding hidden genetic candidates where known relatives are available. We set out to examine the viability of using IBD-detection in whole exome sequencing data in population-wide studies. In doing so, we extend GERMLINE, a method to detect IBD from exome sequencing data by finding small slices of matching alleles between pairs of individuals and extending them into full IBD segments. This algorithm allows for efficient population-wide detection in dense data. We apply this algorithm to a cohort of Crohn's Disease cases where whole-exome and GWAS array data is available. We confirm that GWAS-based detected segments are highly accurate and predictive of underlying shared variation. Where segments inferred from GWAS are expected to be of high accuracy, we compare exome-based detection accuracy of multiple detection strategies. We find detection accuracy to be prohibitively low in all assessments, both in terms of segment sensitivity and specificity. Even after isolating relatively long segments beyond 10 cM, exome-based detection continued to offer poor specificity/sensitivity tradeoffs. We hypothesize that the variable coverage and platform biases of exome capture account for this decreased accuracy and look toward whole genome sequencing data as a higher quality source for detecting population-wide IBD.

15.
It has become clear that hybridization between species is much more common than previously recognized. As a result, we now know that the genomes of many modern species, including our own, are a patchwork of regions derived from past hybridization events. Increasingly, researchers are interested in disentangling which regions of the genome originated from each parental species using local ancestry inference methods. Due to the diverse effects of admixture, this interest is shared across disparate fields, from human genetics to research in ecology and evolutionary biology. However, local ancestry inference methods are sensitive to a range of biological and technical parameters which can impact accuracy. Here we present paired simulation and ancestry inference pipelines, mixnmatch and ancestryinfer, to help researchers plan and execute local ancestry inference studies. mixnmatch can simulate arbitrarily complex demographic histories in the parental and hybrid populations, selection on hybrids, and technical variables such as coverage and contamination. ancestryinfer takes as input sequencing reads from simulated or real individuals, and implements an efficient local ancestry inference pipeline. We perform a series of simulations with mixnmatch to pinpoint factors that influence accuracy in local ancestry inference and highlight useful features of the two pipelines. mixnmatch is a powerful tool for simulations of hybridization while ancestryinfer facilitates local ancestry inference on real or simulated data.

16.
Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are very computationally demanding, requiring large computational time for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to widely used computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers local ancestries comparable to the 1000 Genomes consensus local ancestry calls in the real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs.

17.
Dou J  Zhao X  Fu X  Jiao W  Wang N  Zhang L  Hu X  Wang S  Bao Z 《Biology direct》2012,7(1):17-9
ABSTRACT: BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number of SNPs in non-model organisms with no or limited genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. Little effort, however, has yet been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome. RESULTS: Here we describe an improved maximum likelihood (ML) algorithm called iML, which can achieve high genotyping accuracy for SNP calling in non-model organisms without a reference genome. The iML algorithm incorporates the mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulation and real sequencing datasets, we demonstrate that in comparison with ML or a threshold approach, iML can remarkably improve the accuracy of de novo SNP genotyping and is especially powerful for reference-free genotyping in diploid genomes with high repeat contents. CONCLUSIONS: The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions, and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in non-model organisms without a reference genome.

18.
In non‐model organisms, evolutionary questions are frequently addressed using reduced representation sequencing techniques due to their low cost, ease of use, and because they do not require genomic resources such as a reference genome. However, evidence is accumulating that such techniques may be affected by specific biases, questioning the accuracy of obtained genotypes, and as a consequence, their usefulness in evolutionary studies. Here, we introduce three strategies to estimate genotyping error rates from such data: through the comparison to high quality genotypes obtained with a different technique, from individual replicates, or from a population sample when assuming Hardy‐Weinberg equilibrium. Applying these strategies to data obtained with Restriction site Associated DNA sequencing (RAD‐seq), arguably the most popular reduced representation sequencing technique, revealed per‐allele genotyping error rates that were much higher than sequencing error rates, particularly at heterozygous sites that were wrongly inferred as homozygous. As we exemplify through the inference of genome‐wide and local ancestry of well characterized hybrids of two Eurasian poplar (Populus) species, such high error rates may lead to wrong biological conclusions. By properly accounting for these error rates in downstream analyses, either by incorporating genotyping errors directly or by recalibrating genotype likelihoods, we were nevertheless able to use the RAD‐seq data to support biologically meaningful and robust inferences of ancestry among Populus hybrids. Based on these findings, we strongly recommend carefully assessing genotyping error rates in reduced representation sequencing experiments, and to properly account for these in downstream analyses, for instance using the tools presented here.

19.

Background

The advent of low cost next generation sequencing has made it possible to sequence a large number of dairy and beef bulls which can be used as a reference for imputation of whole genome sequence data. The aim of this study was to investigate the accuracy and speed of imputation from a high density SNP marker panel to whole genome sequence level. Data contained 132 Holstein, 42 Jersey, 52 Nordic Red and 16 Brown Swiss bulls with whole genome sequence data; 16 Holstein, 27 Jersey and 29 Nordic Reds had previously been typed with the bovine high density SNP panel and were used for validation. We investigated the effect of enlarging the reference population by combining data across breeds on the accuracy of imputation, and the accuracy and speed of both IMPUTE2 and BEAGLE using either genotype probability reference data or pre-phased reference data. All analyses were done on Bovine autosome 29 using 387,436 bi-allelic variants and 13,612 SNP markers from the bovine HD panel.

Results

A combined breed reference population led to higher imputation accuracies than did a single breed reference. The highest accuracy of imputation for all three test breeds was achieved when using BEAGLE with un-phased reference data (mean genotype correlations of 0.90, 0.89 and 0.87 for Holstein, Jersey and Nordic Red respectively), but IMPUTE2 with un-phased reference data gave similar accuracies for Holsteins and Nordic Red. Pre-phasing the reference data only led to a minor decrease in the imputation accuracy, but gave a large improvement in computation time. Pre-phasing with BEAGLE was substantially faster than pre-phasing with SHAPEIT2 (2.5 hours vs. 52 hours for 242 individuals), and imputation with pre-phased data was faster in IMPUTE2 than in BEAGLE (5 minutes vs. 50 minutes per individual).

Conclusion

Combining reference populations across breeds is a good option to increase the size of the reference data and in turn the accuracy of imputation when only few animals are available. Pre-phasing the reference data only slightly decreases the accuracy but gives substantial improvements in speed. Using BEAGLE for pre-phasing and IMPUTE2 for imputation is a fast and accurate strategy.
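Imputation accuracy in the abstract above is summarised as a mean genotype correlation. The abstract does not say whether the correlation was taken per animal or per variant, so the following is only a sketch of the basic quantity: the Pearson correlation between true genotypes (0/1/2) and imputed values. The function name `genotype_correlation` and the example vectors are hypothetical.

```python
import math


def genotype_correlation(true_genotypes, imputed_values):
    """Pearson correlation between true genotypes and imputed genotypes.

    true_genotypes are coded 0/1/2; imputed_values may be best-guess
    genotypes or dosages in [0, 2]. Both sequences must align site by site.
    """
    n = len(true_genotypes)
    if n != len(imputed_values) or n < 2:
        raise ValueError("need two aligned sequences with at least two sites")
    mean_true = sum(true_genotypes) / n
    mean_imp = sum(imputed_values) / n
    cov = sum(
        (t - mean_true) * (g - mean_imp)
        for t, g in zip(true_genotypes, imputed_values)
    )
    var_true = sum((t - mean_true) ** 2 for t in true_genotypes)
    var_imp = sum((g - mean_imp) ** 2 for g in imputed_values)
    if var_true == 0.0 or var_imp == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_true * var_imp)


if __name__ == "__main__":
    # Hypothetical validation genotypes and imputed dosages.
    print(genotype_correlation([0, 1, 2, 1, 0], [0.1, 0.9, 1.8, 1.2, 0.0]))
```

Correlation is often preferred over simple concordance for this purpose because it is less inflated by rare variants, where always guessing the common homozygote already yields high concordance.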
