Similar literature
1.
MOTIVATION: Human single nucleotide polymorphisms (SNPs) are the most frequent type of genetic variation in the human population. One of the most important goals of SNP projects is to understand which human genotype variations are related to Mendelian and complex diseases. Great interest is focused on non-synonymous coding SNPs (nsSNPs), which are responsible for single point mutations in proteins. nsSNPs can be neutral or disease associated. It is known that the mutation of only one residue in a protein sequence can be related to a number of pathological conditions of dramatic social impact such as Alzheimer's, Parkinson's and Creutzfeldt-Jakob's diseases. The quality and completeness of presently available SNP databases allow the application of machine learning techniques to predict the onset of human diseases due to single point protein mutation starting from the protein sequence. RESULTS: In this paper, we develop a method based on support vector machines (SVMs) that, starting from the protein sequence information, can predict whether a new phenotype derived from an nsSNP can be related to a genetic disease in humans. Using a dataset of 21 185 single point mutations, 61% of which are disease-related, out of 3587 proteins, we show that our predictor can reach more than 74% accuracy in the specific task of predicting whether a single point mutation can be disease related or not. Our method, although based on less information, outperforms other web-available predictors implementing different approaches. AVAILABILITY: A beta version of the web tool is available at http://gpcr.biocomp.unibo.it/cgi/predictors/PhD-SNP/PhD-SNP.cgi
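
As a rough sanity check on the figures quoted in this abstract, the sketch below estimates how many of the 21 185 mutations are disease-related and what accuracy a trivial predictor that always answers "disease-related" would reach. The function and variable names are illustrative and are not part of PhD-SNP; the count is rounded from the 61% figure, and 0.74 is used as a lower bound because the abstract says "more than 74%".

```python
def majority_baseline(total, positive_fraction):
    """Return (approximate positive count, accuracy of always predicting the majority class)."""
    positives = round(total * positive_fraction)
    majority = max(positives, total - positives)
    return positives, majority / total

# Figures taken from the abstract: 21 185 mutations, 61% disease-related.
positives, baseline = majority_baseline(21185, 0.61)

# Lower bound on the improvement of the reported accuracy over the trivial baseline.
gain = 0.74 - baseline

print(positives, round(baseline, 2), round(gain, 2))
```

Under these assumptions the reported accuracy sits roughly 13 percentage points above the majority-class baseline, which is the margin that matters on a dataset this unbalanced.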

2.
The use of dense SNPs to predict the genetic value of an individual for a complex trait is often referred to as “genomic selection” in livestock and crops, but is also relevant to human genetics to predict, for example, complex genetic disease risk. The accuracy of prediction depends on the strength of linkage disequilibrium (LD) between SNPs and causal mutations. If sequence data were used instead of dense SNPs, accuracy should increase because causal mutations are present, but demographic history and long-term negative selection also influence accuracy. We therefore evaluated genomic prediction, using simulated sequence in two contrasting populations: one reducing from an ancestrally large effective population size (Ne) to a small one, with high LD common in domestic livestock, while the second had a large constant-sized Ne with low LD similar to that in some human or outbred plant populations. There were two scenarios in each population; causal variants were either neutral or under long-term negative selection. For large Ne, sequence data led to a 22% increase in accuracy relative to ∼600K SNP chip data with a Bayesian analysis and a more modest advantage with a BLUP analysis. This advantage increased when causal variants were influenced by negative selection, and accuracy persisted when 10 generations separated reference and validation populations. However, in the reducing Ne population, there was little advantage for sequence even with negative selection. This study demonstrates the joint influence of demography and selection on accuracy of prediction and improves our understanding of how best to exploit sequence for genomic prediction.

3.
Understanding the mechanism of protein stability change is a challenging task. Recently, the prediction of protein stability changes caused by single point mutations has become a topic of interest in molecular biology. However, it is desirable to extract further knowledge from large databases to gain new insight into the nature of these changes. This paper presents an interpretable prediction tree method (named iPTREE-2) that can accurately predict changes of protein stability upon mutation from sequence-based information and analyze sequence characteristics from the viewpoint of composition and order. Because it is based on a regression tree algorithm, iPTREE-2 can identify important factors and derive rules for the purpose of data mining. On a dataset of 1859 different single point mutations from the thermodynamic database ProTherm, iPTREE-2 yields a correlation coefficient of 0.70 between predicted and experimental values. In the data mining task, detailed analysis of sequences reveals possible compositional specificity of residues in different ranges of stability change and implies the existence of certain patterns. In building rules, we found that the residues at the mutated position in the wild-type and mutant proteins play an important role. The present study demonstrates that iPTREE-2 can serve the purpose of predicting protein stability change, especially when one requires more understandable knowledge.

4.
The development of methods to assess the impact of amino acid mutations on human health has become an important goal in biomedical research, due to the growing number of nonsynonymous SNPs identified. Within this context, computational methods constitute a valuable tool, because they can easily process large numbers of mutations and give useful, almost cost-free, information on their pathological character. In this paper we present a computational approach to the prediction of disease-associated amino acid mutations, using only sequence-based information (amino acid properties, evolutionary information, secondary structure and accessibility predictions, and database annotations) and neural networks as a model building tool. Mutations are predicted to be either pathological or neutral. Our results show that the method has a good overall success rate, 83%, that can reach 95% when trained for specific proteins. The methodology is fast and flexible enough to provide good estimates of the pathological character of large sets of nonsynonymous SNPs, but can also be easily adapted to give more precise predictions for proteins of special biomedical interest.

5.
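Single nucleotide polymorphisms (SNPs) are DNA sequence polymorphisms caused by variation of a single nucleotide at the genome level, that is, variation of a single base in the DNA sequence; they are the most common type of variation in the human genome. The main aim of SNP research is to understand the genetics of human phenotypic variation, particularly in relation to human genetic diseases. Non-synonymous SNPs (nsSNPs) are a class of SNPs: single-nucleotide mutations in coding regions that change the corresponding amino acid sequence after translation. Because nsSNPs may affect protein function, they are considered a major cause of human genetic diseases. It is therefore important to distinguish disease-associated nsSNPs from neutral ones. Drawing on studies from China and abroad on the prediction of disease-associated nsSNPs, this paper analyzes the features used in prediction, summarizes the feature selection methods used to optimize those features, and gives an overview of the classifiers used in the prediction process.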

6.
7.
Whole-genome resequencing technology has improved rapidly during recent years and is expected to improve further such that the sequencing of an entire human genome sequence for $1000 is within reach. Our main aim here is to use whole-genome sequence data for the prediction of genetic values of individuals for complex traits and to explore the accuracy of such predictions. This is relevant for the fields of plant and animal breeding and, in human genetics, for the prediction of an individual's risk for complex diseases. Here, population history and genomic architectures were simulated under the Wright–Fisher population and infinite-sites mutation model, and prediction of genetic value was by the genomic selection approach, where a Bayesian nonlinear model was used to predict the effects of individual SNPs. The Bayesian model assumed a priori that only few SNPs are causative, i.e., have an effect different from zero. When using whole-genome sequence data, accuracies of prediction of genetic value were >40% increased relative to the use of dense ∼30K SNP chips. At equal high density, the inclusion of the causative mutations yielded an extra increase of accuracy of 2.5–3.7%. Predictions of genetic value remained accurate even when the training and evaluation data were 10 generations apart. Best linear unbiased prediction (BLUP) of SNP effects does not take full advantage of the genome sequence data, and nonlinear predictions, such as the Bayesian method used here, are needed to achieve maximum accuracy. On the basis of theoretical work, the results could be extended to more realistic genome and population sizes.
Genome resequencing technologies are currently developing at a very rapid rate, which we for simplicity call genome sequencing even though it is used on a species with a reference sequence.
The current generation sequencing technology is two orders of magnitude faster and more cost effective than the technologies used for the sequencing of the human genome (Shendure and Ji 2008; TenBosch and Grody 2008). Future technologies are expected to reduce cost by another 100-fold so that sequencing an entire human genome for $1000 is considered achievable in the near future (Mardis 2008). The question arises: How can we make best use of entire genome sequence data on many individuals? One use will be the ability to predict the genetic value of an individual for complex traits. In the fields of animal and plant breeding, this would be of great practical benefit because most important traits are complex, quantitative traits, i.e., traits that are affected by many genes and by the environment. In humans the promise of personalized medicine relies on the ability to predict an individual's genetic risk for complex, multifactorial diseases, such as Crohn's disease (Barrett et al. 2008), and the ability to predict response to alternative treatments. The first aim of this article is to explore the accuracy of this prediction using the full genome sequence of the individual.
The use of high-density SNP genotype data to predict genetic value, called genomic selection, was first proposed by Meuwissen et al. (2001). In its most sophisticated form, a Bayesian model was used to predict the effects of thousands of SNPs on the total genetic value simultaneously, where a priori it was assumed that only few SNPs were useful for predicting the trait [because they were in linkage disequilibrium (LD) with mutations causing variation in the trait], while many SNPs were not useful. Even among the SNPs that were useful for prediction, it was assumed that the distribution of effects was not normal because there were occasionally SNPs in LD with quantitative trait loci (QTL) that may occasionally have very large effect.
To model this, the distribution of SNP effects was assumed to follow a distribution with thicker tails than the normal distribution (e.g., the t-distribution is often used). In the case of whole-genome sequence data, the polymorphisms that are causing the genetic differences between the individuals are among those being analyzed. For the sake of simplicity we call all polymorphisms in the sequence data SNPs while recognizing that other types of polymorphisms such as indels will be included. Assuming that the causal SNPs are included in the analysis simplifies the prior distribution of the SNP effects, because the effects of all the other SNPs, even if they are in LD with the causal SNPs, are expected to disappear. Thus, the prior distribution simplifies to the fact that some SNPs are expected to be causative and have an effect drawn from the distribution of the gene effects. The distribution of gene effects is investigated extensively in the evolutionary and other literature and is reported to be gamma (Hayes and Goddard 2001) or exponentially distributed (Erickson et al. 2004; Rocha et al. 2004), where the latter is a special form of the gamma distribution. On the downside, whole-genome sequence data will contain millions of SNPs and it may be difficult for genomic selection to separate the relatively few causative SNPs from all the others.
Meuwissen et al. (2001) also investigated a model in which all SNPs were assumed to have an effect drawn from the same normal distribution [the so-called genome-wide best linear unbiased prediction (GWBLUP) model]. Although this model seems biologically implausible, it has been found to perform well in data from dairy cattle (VanRaden et al. 2009).
However, we hypothesize that with sequence level data the BLUP model will not perform as well as models that assume that only some causal SNPs need to be included in the model.
The aims here are to investigate the following: how accurately genetic values for complex traits can be predicted by genomic selection when whole-genome sequence data are available on a large number of individuals; whether it makes a difference to have the whole-genome sequence available, including the causative mutations, vs. very dense SNP marker genotypes; whether the estimates of the SNP effects can be used on individuals that are many generations separated from the data set in which they were estimated; the effect of the statistical model used on accuracy of prediction; and how accurately causative mutations can be detected and mapped. Because whole-genome sequence data on many individuals are not yet available, and because we needed to know the true genetic values of the individuals, the aforementioned questions were investigated by computer simulations of whole-genome sequence data.

8.
Large collections of single nucleotide polymorphisms (SNPs) have recently been identified from a number of livestock genomes. This raises the possibility that SNP arrays might be useful for analysis in related species for which few genetic markers are currently available. To address the likely success of such an approach, the aim of this study was to examine the threshold number and position of flanking mutations which act to prevent genotype calls being produced. Sequence diversity was measured across 16 loci containing SNPs known either to work successfully between species or fail between species. In pairwise comparisons between domestic and wild sheep, sequence divergence surrounding working SNP assays was significantly lower than that surrounding non‐functional assays. In addition, the location of flanking mismatches tended to be closer to the target SNP in loci that failed to generate genotype calls across species. The magnitude of sequence divergence observed for both working and non‐functional assays was compared with the divergence separating domestic sheep from European Mouflon, African Barbary, goat and cattle. The results suggest that the utility of SNP arrays for analysis of shared polymorphism will be restricted to closely related pairs of species. Analysis across more divergent species will, however, be successful for other objectives, such as the identification of the ancestral state of SNPs.

9.
SNAP: predict effect of non-synonymous polymorphisms on function (cited 1 time: 0 self-citations, 1 by others)
Many genetic variations are single nucleotide polymorphisms (SNPs). Non-synonymous SNPs are 'neutral' if the resulting point-mutated protein is not functionally discernible from the wild type and 'non-neutral' otherwise. The ability to identify non-neutral substitutions could significantly aid targeting disease-causing detrimental mutations, as well as SNPs that increase the fitness of particular phenotypes. Here, we introduced comprehensive data sets to assess the performance of methods that predict SNP effects. Alongside these, we introduced SNAP (screening for non-acceptable polymorphisms), a neural network-based method for the prediction of the functional effects of non-synonymous SNPs. SNAP needs only sequence information as input, but benefits from functional and structural annotations, if available. In a cross-validation test on over 80,000 mutants, SNAP identified 80% of the non-neutral substitutions at 77% accuracy and 76% of the neutral substitutions at 80% accuracy. This constituted an important improvement over other methods; the improvement rose to over ten percentage points for mutants for which existing methods disagreed. Possibly even more importantly, SNAP introduced a well-calibrated measure for the reliability of each prediction. This measure will allow users to focus on the most accurate predictions and/or the most severe effects. Available at http://www.rostlab.org/services/SNAP.
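
The paired figures in this abstract ("identified 80% ... at 77% accuracy") can be read as recall and precision for each class; that reading is an assumption made here, not something the abstract states outright. Under it, a single summary number per class can be sketched with the standard F1 formula. The function name `f1_score` is illustrative and does not come from SNAP.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Assumed reading of the abstract: "X% identified" = recall, "at Y% accuracy" = precision.
non_neutral_f1 = f1_score(precision=0.77, recall=0.80)
neutral_f1 = f1_score(precision=0.80, recall=0.76)

print(round(non_neutral_f1, 3), round(neutral_f1, 3))
```

The two values come out close to each other, which is consistent with a predictor that is roughly balanced between the neutral and non-neutral classes.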

10.
Single nucleotide polymorphisms (SNPs) are the most frequent variation in the human genome. Nonsynonymous SNPs that lead to missense mutations can be neutral or deleterious, and several computational methods have been presented that predict the phenotype of human missense mutations. These methods use sequence‐based and structure‐based features in various combinations, relying on different statistical distributions of these features for deleterious and neutral mutations. One structure‐based feature that has not been studied significantly is the accessible surface area within biologically relevant oligomeric assemblies. These assemblies are different from the crystallographic asymmetric unit for more than half of X‐ray crystal structures. We find that mutations in the core of proteins or in the interfaces in biological assemblies are significantly more likely to be disease‐associated than those on the surface of the biological assemblies. For structures with more than one protein in the biological assembly (whether the same sequence or different), we find the accessible surface area from biological assemblies provides a statistically significant improvement in prediction over the accessible surface area of monomers from protein crystal structures (P = 6e‐5). When adding this information to sequence‐based features such as the difference between wildtype and mutant position‐specific profile scores, the improvement from biological assemblies is statistically significant but much smaller (P = 0.018). Combining this information with sequence‐based features in a support vector machine leads to 82% accuracy on a balanced dataset of 50% disease‐associated mutations from SwissVar and 50% neutral mutations from human/primate sequence differences in orthologous proteins. Proteins 2013. © 2012 Wiley Periodicals, Inc.

11.

Background  

Human genetic variations primarily result from single nucleotide polymorphisms (SNPs) that occur approximately every 1000 bases in the overall human population. The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in the protein product may account for nearly half of the known genetic variations linked to inherited human diseases. One of the key problems of medical genetics today is to identify nsSNPs that underlie disease-related phenotypes in humans. As such, the development of computational tools that can identify such nsSNPs would enhance our understanding of genetic diseases and help predict the disease.

12.
Using genetic variation to study human disease. (cited 14 times: 0 self-citations, 14 by others)
The generation of a draft sequence of the human genome has spawned a unique opportunity to investigate the role of genetic variation in human diseases. The difference between any two human genomes has been estimated to be less than 0.1% overall, but still, this means that there are at least several million nucleotide differences per individual. The study of single nucleotide polymorphisms (SNPs), the most common type of variant, is likely to contribute substantially to deciphering genetic determinants of common and rare diseases. The effort to identify SNPs has been accelerated by three developments: the availability of sequence data from the genome project, improved informatic tools for searching the former and high-throughput genotype platforms. With these new tools in hand, dissecting the genetics of disease will rapidly move forward, although a number of formidable challenges will have to be met to see its promise realized in clinical medicine.

13.
Viral evolution remains a major obstacle to the effectiveness of antiviral treatments. The ability to predict this evolution will help in the early detection of drug-resistant strains and will potentially facilitate the design of more efficient antiviral treatments. Various tools have been utilized in genome studies to achieve this goal. One of these tools is machine learning, which facilitates the study of structure-activity relationships, secondary and tertiary structure evolution prediction, and sequence error correction. This work proposes a novel machine learning technique for the prediction of the possible point mutations that appear on alignments of primary RNA sequence structure. It predicts the genotype of each nucleotide in the RNA sequence, and proves that a nucleotide in an RNA sequence changes based on the other nucleotides in the sequence. A neural network technique is used to predict new strains, and a rough set theory based algorithm is then introduced to extract these point mutation patterns. This algorithm is applied to a time series of aligned RNA isolates of the Newcastle virus. Two different data sets from two sources are used in the validation of these techniques. The results show that the accuracy of this technique in predicting the nucleotides in the new generation is as high as 75%. The mutation rules are visualized for the analysis of the correlation between different nucleotides in the same RNA sequence.

14.
Expressed sequence tag (EST) libraries from members of the Penaeidae family and brine shrimp (Artemia franciscana) are currently the primary source of sequence data for shrimp species. Penaeid shrimp are the most commonly farmed worldwide, but selection methods for improving shrimp are limited. A better understanding of shrimp genomics is needed for farmers to use genetic markers to select the best breeding animals. The ESTs from Litopenaeus vannamei have been previously mined for single nucleotide polymorphisms (SNPs). This present study took publicly available ESTs from nine shrimp species, excluding L. vannamei, clustered them with CAP3, predicted SNPs within them using SNPidentifier, and then analyzed whether the SNPs were intra- or interspecies. Major goals of the project were to predict SNPs that may distinguish shrimp species, locate SNPs that may segregate in multiple species, and determine the genetic similarities between L. vannamei and the other shrimp species based on their EST sequences. Overall, 4,597 SNPs were predicted from 4,600 contigs with 703 of them being interspecies SNPs, 735 of them possibly predicting species' differences, and 18 of them appearing to segregate in multiple species. While sequences appear relatively well conserved, SNPs do not appear to be well conserved across shrimp species.

15.
As the largest set of sequence variants, single-nucleotide polymorphisms (SNPs) constitute powerful assets for mapping genes and mutations related to common diseases and for pharmacogenetic studies. A major goal in human genetics is to establish a high-density map of the genome containing several hundred thousand SNPs. Here we assayed 3.7 Mb (154,397 bp in 24 alleles) of chromosome 14 expressed sequence tags (ESTs) and sequence-tagged sites, for sequence variation in DNA samples from 12 African individuals. We identified and mapped 480 biallelic markers (459 SNPs and 21 small insertions and deletions), equally distributed between EST and non-EST classes. Extensive research in public databases also yielded 604 chromosome 14 SNPs (dbSNPs), 520 of which could be mapped and 19 of which are common between CNG (i.e., identified at the Centre National de Génotypage) and dbSNP polymorphisms. We present a dense map of SNP variation of human chromosome 14 based on 981 nonredundant biallelic markers present among 1345 radiation hybrid mapped sequence objects. Next, bioinformatic tools allowed 945 significant sequence alignments to chromosome 14 contigs, giving the precise chromosome sequence position for 70% of the mapped sequences and SNPs. In addition, these tools also permitted the identification and mapping of 273 SNPs in 159 known genes. The availability of this SNP map will permit a wide range of genetic studies on a complete chromosome. The recognition of 45 genes with multiple SNPs, by allowing the construction of haplotypes, should facilitate pharmacogenetic studies in the corresponding regions.
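
The counts quoted in this abstract are internally consistent, and the arithmetic is short enough to check directly. The sketch below only recombines numbers from the abstract; the variable names are illustrative, and treating the 981 nonredundant markers as "CNG markers plus mapped dbSNPs minus the shared ones" is an interpretation that happens to reproduce the stated total.

```python
# Assayed sequence: 154,397 bp screened in each of 24 alleles (12 diploid individuals).
bp_per_allele = 154_397
alleles = 24
total_bp = bp_per_allele * alleles

# Markers identified in this study: SNPs plus small insertions/deletions.
cng_markers = 459 + 21

# Mapped dbSNP entries, and those shared with the markers found in this study.
mapped_dbsnp = 520
shared = 19

# Union of the two marker sets, counting shared markers once.
nonredundant = cng_markers + mapped_dbsnp - shared

print(total_bp, cng_markers, nonredundant)
```

The product is about 3.7 Mb, matching the figure in the abstract, and the union of the two marker sets gives the 981 nonredundant markers on which the map is based.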

16.
17.
• Premise of the study: Next-generation sequencing (NGS) technologies are frequently used for resequencing and mining of single nucleotide polymorphisms (SNPs) by comparison to a reference genome. In crop species such as chickpea (Cicer arietinum) that lack a reference genome sequence, NGS-based SNP discovery is a challenge. Therefore, unlike probability-based statistical approaches for consensus calling and by comparison with a reference sequence, a coverage-based consensus calling (CbCC) approach was applied and two genotypes were compared for SNP identification. • Methods: A CbCC approach is used in this study with four commonly used short read alignment tools (Maq, Bowtie, Novoalign, and SOAP2) and 15.7 and 22.1 million Illumina reads for chickpea genotypes ICC4958 and ICC1882, together with the chickpea transcriptome assembly (CaTA). • Key results: A nonredundant set of 4543 SNPs was identified between two chickpea genotypes. Experimental validation of 224 randomly selected SNPs showed superiority of Maq among individual tools, as 50.0% of SNPs predicted by Maq were true SNPs. For combinations of two tools, greatest accuracy (55.7%) was reported for Maq and Bowtie, with a combination of Bowtie, Maq, and Novoalign identifying 61.5% true SNPs. SNP prediction accuracy generally increased with increasing read depth. • Conclusions: This study provides a benchmark comparison of tools as well as read depths for four commonly used tools for NGS SNP discovery in a crop species without a reference genome sequence. In addition, a large number of SNPs have been identified in chickpea that would be useful for molecular breeding.
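
The key results compare single tools with combinations of tools. One plausible way to combine call sets is to keep only the SNPs reported by every tool in the combination; whether the study combined tools exactly this way is an assumption. The sketch below illustrates that idea on invented data: the contig names, positions, and the `validated` set are hypothetical and are not taken from the study.

```python
# Hypothetical SNP calls per alignment tool, keyed by (contig, position).
# These values are invented for illustration; they are not data from the study.
calls = {
    "Maq": {("contig1", 120), ("contig1", 455), ("contig2", 78)},
    "Bowtie": {("contig1", 120), ("contig2", 78), ("contig3", 9)},
    "Novoalign": {("contig1", 120), ("contig1", 455), ("contig2", 78), ("contig4", 301)},
}

# Hypothetical set of experimentally validated SNPs.
validated = {("contig1", 120), ("contig2", 78)}


def consensus(call_sets, tools):
    """SNPs reported by every tool in `tools` (set intersection)."""
    return set.intersection(*(call_sets[t] for t in tools))


def validation_rate(predicted, validated_snps):
    """Fraction of predicted SNPs that were confirmed experimentally."""
    if not predicted:
        return 0.0
    return len(predicted & validated_snps) / len(predicted)


for tools in (("Maq",), ("Maq", "Bowtie"), ("Maq", "Bowtie", "Novoalign")):
    shared = consensus(calls, tools)
    print(tools, len(shared), round(validation_rate(shared, validated), 2))
```

In this toy data, requiring agreement between tools shrinks the call set and raises the validated fraction, which mirrors the direction of the trend reported in the abstract without reproducing its numbers.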

18.
Proteins destined for secretion or membrane compartments possess signal peptides for insertion into the membrane. The signal peptide is therefore critical for localization and function of cell surface receptors and ligands that mediate cell-cell communication. About 4% of all human proteins listed in the UniProt database have signal peptide domains at their N termini. A comprehensive literature survey was performed to retrieve functional and disease-associated genetic variants in the signal peptide domains of human proteins. In 21 human proteins we have identified 26 disease-associated mutations within their signal peptide domains, 14 of which have been experimentally shown to impair the signal peptide function and thus influence protein transportation. We took advantage of SignalP 3.0 predictions to characterize the signal peptide prediction score differences between the mutant and the wild-type alleles of each mutation, as well as 189 previously uncharacterized single nucleotide polymorphisms (SNPs) found to be located in the signal peptide domains of 165 human proteins. Comparisons of signal peptide prediction outcomes of mutations and SNPs have implicated SNPs potentially impacting the signal peptide function, and thus the cellular localization of the human proteins. The majority of the top candidate proteins represented membrane and secreted proteins that are associated with molecular transport, cell signaling and cell to cell interaction processes of the cell. This is the first study that systematically characterizes genetic variation occurring in the signal peptides of all human proteins. This study represents a useful strategy for prioritization of SNPs occurring within the signal peptide domains of human proteins. Functional evaluation of candidates identified herein may reveal effects on major cellular processes including immune cell function, cell recognition and adhesion, and signal transduction.

19.
Dong Hui, Qian Haitao, Liu Xiaoli, Cong Bin. 《昆虫知识》 (Kunchong Zhishi), 2011, 48(1): 167-173
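Single nucleotide polymorphisms (SNPs) are DNA sequence polymorphisms caused by variation of a single nucleotide at the level of the chromosomal genome, including point mutations arising from single-base transitions or transversions, in which at least one allele has a frequency of no less than 1%. They usually occur in biallelic form and are stable and reliable. In current insect genome research, work on SNP markers has concentrated on model organisms such as Drosophila, mosquito vectors, and the silkworm. This paper reviews the use of SNP markers in insect species identification, genetic map construction, population genetics, and the molecular mechanisms of insecticide resistance, and concludes with the prospects for applying SNPs in population genetics, marker-assisted selection, and evolutionary biology.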

20.
Single nucleotide polymorphisms (SNPs) have rarely been exploited in nonhuman and nonmodel organism genetic studies. This is due partly to difficulties in finding SNPs in species where little DNA sequence data exist, as well as to a lack of robust and inexpensive genotyping methods. We have explored one SNP discovery method for molecular ecology, evolution, and conservation studies to evaluate the method and its limitations for population genetics in mammals. We made use of 'CATS' (or 'EPIC') primers to screen for novel SNPs in mammals. Most of these primer sets were designed from primates and/or rodents, for amplifying intron regions from conserved genes. We have screened 202 loci in 16 representatives of the major mammalian clades. Polymerase chain reaction (PCR) success correlated with phylogenetic distance from the human and mouse sequences used to design most primers; for example, specific PCR products from primates and the mouse amplified the most consistently and the marsupial and armadillo amplifications were least successful. Approximately 24% (opossum) to 65% (chimpanzee) of primers produced usable PCR product(s) in the mammals tested. Products produced generally high but variable levels of readable sequence and similarity to the expected genes. In a preliminary screen of chimpanzee DNA, 12 SNPs were identified from six (of 11) sequenced regions, yielding a SNP on average every 400 base pairs (bp). Given the progress in genome sequencing, and the large numbers of CATS-like primers published to date, this approach may yield sufficient SNPs per species for population and conservation genetic studies in nonmodel mammals and other organisms.
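
The chimpanzee figures imply a rough total for the amount of sequence screened, which the abstract does not state. The sketch below backs it out from the two quoted numbers; it assumes the "one SNP every 400 bp" average was taken over all 11 sequenced regions, which is a guess, and the variable names are illustrative.

```python
snps_found = 12
bp_per_snp = 400          # average spacing quoted in the abstract
sequenced_regions = 11    # assumed to be the denominator of the average

# Total readable sequence implied by the quoted SNP density.
implied_total_bp = snps_found * bp_per_snp

# Average readable length per region under the same assumption.
implied_bp_per_region = implied_total_bp / sequenced_regions

print(implied_total_bp, round(implied_bp_per_region))
```

Under that assumption the preliminary screen covered on the order of 4.8 kb in total, so the density estimate rests on a small amount of sequence.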
