Similar articles
 20 similar articles found.
1.
2.
3.

Background  

Single nucleotide polymorphisms (SNPs) are important tools in studying complex genetic traits and genome evolution. Computational strategies for SNP discovery make use of the large number of sequences present in public databases (in most cases as expressed sequence tags (ESTs)) and are considered to be faster and more cost-effective than experimental procedures. A major challenge in computational SNP discovery is distinguishing allelic variation from sequence variation between paralogous sequences, in addition to recognizing sequencing errors. For the majority of the public EST sequences, trace or quality files are lacking, which makes detection of reliable SNPs even more difficult because it has to rely on sequence comparisons only.

4.
5.
6.
Using the Phred/Phrap/Polyphred/Consed pipeline established in the National Livestock Research Institute of Korea, we predicted candidate coding single nucleotide polymorphisms (cSNPs) from 7,600 expressed sequence tags (ESTs) derived from three cDNA libraries (liver, M. longissimus dorsi, and intermuscular fat) of Hanwoo (Korean native cattle) steers. From the 7,600 ESTs, 829 contigs comprising more than two EST reads were assembled using the Phrap assembler. Based on the contig analysis, 201 candidate cSNPs were identified in 129 contigs, in which transitions (69%) outnumbered transversions (31%). To verify whether the predicted cSNPs are real, 17 SNPs involved in lipid and energy metabolism were selected from the ESTs. Twelve of these were confirmed to be real, while five were identified as artifacts, possibly due to EST sequencing errors. Further analysis of the 12 verified cSNPs was performed using the program BLASTX. Five were identified as nonsynonymous cSNPs, five were synonymous cSNPs, and two SNPs were located in 3'-UTRs. Our data indicated that a relatively high SNP prediction rate (71%) from a large EST database could produce abundant cSNPs rapidly, which can be used as valuable genetic markers in cattle.
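The transition/transversion split and the 71% confirmation figure reported above can be illustrated with a minimal sketch. The function names `classify_substitution` and `confirmation_rate` are hypothetical and are not part of the Phred/Phrap/Polyphred/Consed pipeline; the only assumption is the standard definition that a transition swaps two purines (A, G) or two pyrimidines (C, T).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def confirmation_rate(confirmed: int, tested: int) -> float:
    """Fraction of tested candidate SNPs that were experimentally confirmed."""
    return confirmed / tested


if __name__ == "__main__":
    print(classify_substitution("A", "G"))
    print(classify_substitution("A", "C"))
    # 12 of the 17 tested cSNPs were confirmed in the abstract.
    print(f"{confirmation_rate(12, 17):.0%}")
```

With the counts from the abstract, 12 confirmed out of 17 tested rounds to the quoted 71%.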

7.
8.
AutoSNP is a program to detect single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels) in expressed sequence tag (EST) data. The program uses d2cluster and cap3 to cluster and align EST sequences, and uses redundancy to differentiate between candidate SNPs and sequence errors. Candidate polymorphisms are identified as occurring in multiple reads within an alignment. For each candidate SNP, two measures of confidence are calculated: the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment. AVAILABILITY: The program was written in PERL and is freely available to non-commercial users by request from the authors.
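The redundancy idea described above can be sketched in a few lines. This is not AutoSNP's code (which is written in PERL and not reproduced here); it is an illustrative Python sketch that assumes the reads are already aligned to equal length, and the function name `candidate_snps` and the default threshold `min_redundancy=2` are hypothetical.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_redundancy=2):
    """Return alignment columns where two or more bases are each seen in
    at least `min_redundancy` reads.

    A base seen in only one read is treated as a likely sequencing error.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    candidates = []
    for pos in range(length):
        counts = Counter(read[pos].upper() for read in aligned_reads)
        # Keep only real bases (gaps and N are ignored) with enough support.
        supported = {
            base: n
            for base, n in counts.items()
            if base in "ACGT" and n >= min_redundancy
        }
        if len(supported) >= 2:
            candidates.append((pos, supported))
    return candidates


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTA", "ACGTC"]
    print(candidate_snps(reads))
```

In the toy alignment, column 1 (C in three reads, T in two) is reported, while the lone C in the last column is discarded as a probable error. The co-segregation measure mentioned in the abstract is not modelled here.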

9.
Tan EC, Li H. Gene 2006, 376(2):268-280
Most of the studies on single nucleotide variations are on substitutions rather than insertions/deletions. In this study, we examined the distribution and characteristics of single nucleotide insertions/deletions (SNindels), using data available from dbSNP for all the human chromosomes. There are almost 300,000 SNindels in the database, of which only 0.8% are validated. They occur at the frequency of 0.887 per 10 kb on average for the whole genome, or approximately 1 for every 11,274 bp. More than half occur in regions with mononucleotide repeats, the longest of which is 47 bases. Overall, the mononucleotide repeats involving C and G are much shorter than those for A and T. About 12% are surrounded by palindromes. There is a general correlation between chromosome size and total number for each chromosome. Inter-chromosomal variation in density ranges from 0.6 to 21.7 per kilobase. The overall spectrum shows a very high proportion of SNindels of types -/A and -/T, at over 81%. The proportion of -/A and -/T SNindels for each chromosome is correlated with its AT content. Less than half of the SNindels are within or near known genes and even fewer (<0.183%) are in coding regions, and more than 1.4% of -/C and -/G are in coding regions compared to 0.2% for -/A and -/T types. SNindels of -/A and -/T types make up 80% of those found within untranslated regions but less than 40% of those within coding regions. A separate analysis using the subset of 2324 validated SNindels showed a slightly lower AT bias of 74%; SNindels not within mononucleotide repeats showed an even lower AT bias of 58%. Density of validated SNindels is 0.007/10 kb overall and 90% are found within or near genes. Among all chromosomes, Y has the lowest numbers and densities for all SNindels, validated SNindels, and SNindels not within repeats.
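The two ways the abstract states the genome-wide frequency (0.887 per 10 kb, and roughly one per 11,274 bp) are the same quantity expressed in different units. A minimal sketch of the conversion follows; the function name `bp_per_event` is hypothetical.

```python
def bp_per_event(density_per_10kb: float) -> float:
    """Convert a density in events per 10 kb into average spacing in bp."""
    if density_per_10kb <= 0:
        raise ValueError("density must be positive")
    return 10_000 / density_per_10kb


if __name__ == "__main__":
    # Genome-wide SNindel density quoted in the abstract.
    print(round(bp_per_event(0.887)))
```

Dividing 10,000 bp by 0.887 gives about 11,274 bp between SNindels on average, matching the figure in the text.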

10.
Amino acid similarity often needs to be considered in DNA sequence comparison to elucidate gene functions. We propose a Smith-Waterman-like algorithm which considers amino acid similarity and insertions/deletions in sequences at the DNA level and at the protein level in a hybrid manner. The algorithm is applied to cDNA sequences of Oryza sativa and those of Arabidopsis thaliana. The results are compared with the results of application of NCBI's tblastx program (which compares the sequences in the BLAST manner after translation). It is shown that the present algorithm is very helpful in discovering nucleotide insertions/deletions originating from experimental errors as well as amino acid insertions/deletions due to evolutionary reasons.
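For orientation, the classical Smith-Waterman recurrence that the proposed algorithm builds on can be sketched as follows. This is the plain nucleotide-level local alignment score with a linear gap penalty, not the authors' hybrid DNA/protein method; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between sequences a and b.

    Uses the standard recurrence
        H[i][j] = max(0, H[i-1][j-1] + s(a_i, b_j), H[i-1][j] + gap, H[i][j-1] + gap)
    and keeps only two rows of the matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

A hybrid method as described in the abstract would additionally score translated codons with an amino acid similarity matrix; that part is not attempted here.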

11.
Single nucleotide polymorphisms (SNPs) are useful for characterizing allelic variation, for genome-wide mapping, and as a tool for marker-assisted selection. Discovery of SNPs through de novo sequencing is inefficient within cultivated tomato (Lycopersicon esculentum Mill.) because the polymorphism rate is more than ten-fold lower than the sequencing error rate. The availability of expressed sequence tag (EST) data has made it feasible to discover putative SNPs in silico prior to experimental verification. By exploiting redundancy among EST data available for different varieties among 148,373 tomato ESTs, we have identified candidate SNPs for use within cultivated germplasm pools. A total of 1,245 contigs having three EST sequences of Rio Grande and three EST sequences of TA496 were used for SNP discovery. We detected 1 SNP for every 8,500 bases analyzed, with 101 candidate SNPs in 44 genes identified. Sixty-six SNPs could be recognized by restriction enzymes, and subsequent experimental verification using restriction digestion or CEL I digestion confirmed 83% of the putative polymorphisms tested. SNPs between TA496 and Rio Grande have a high probability (53%) of detecting polymorphisms between other L. esculentum varieties. Twenty-six SNPs in 18 unigenes were mapped to specific chromosomes. Two SNPs, LEOH23 and LEOH37, were shown to be linked to quantitative trait loci contributing to fruit color within elite breeding populations. These results suggest that the growing databases of DNA sequence will yield information that facilitates improvement within the germplasm pools that have contributed to productive modern varieties.

12.
13.
14.
Expressed sequence tags (ESTs) provide researchers with a quick and inexpensive route for discovering new genes, data on gene expression and regulation, and also provide genic markers that help in constructing genome maps. Cacao is an important perennial crop of the humid tropics. Cacao EST sequences, as available in the public domain, were downloaded and assembled into contigs. Microsatellites were located in these ESTs and contigs using five software tools (MISA, TRA, TROLL, SSRIT and SSR primer). MISA gave maximum coverage of SSRs in cacao ESTs and contigs, although TRA was able to detect higher order (>5-mer) repeats. The frequency of SSRs was one per 26.9 kb in the known set of ESTs. One-third of the repeats in EST-contigs were found to be trimeric. A few rare repeats, such as a 21-mer repeat, were also located. A/T repeats were most abundant among the mononucleotide repeats and the AG/GA/TC/CT type was the most frequent among dimerics. Flanking primers were designed using the Primer3 program and verified experimentally for PCR amplification. The results of the study are made freely available in an online database (). Seven primer pairs that amplified genomic DNA isolated from leaves were used to screen a representative set of 12 accessions of cacao.
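The kind of search that SSR finders perform can be illustrated with a small regular-expression sketch. It does not reproduce MISA, TRA, TROLL, SSRIT or SSR primer; the function name `find_ssrs` and the thresholds (unit length 2 to 6 bases, at least 4 tandem copies) are assumptions chosen for the example.

```python
import re


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 4):
    """Find perfect tandem repeats; return (start, unit, copies) tuples."""
    # The lazy quantifier prefers the shortest repeat unit, so AGAGAGAG is
    # reported as (AG)4 rather than (AGAG)2.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs matched as a longer unit (e.g. AA repeated).
            continue
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    print(find_ssrs("CCAGAGAGAGTT"))
```

The example sequence contains four copies of AG starting at offset 2. Mononucleotide repeats, which the abstract also reports, would need a separate pass with their own minimum length.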

15.
16.
The availability of large expressed sequence tag (EST) databases has led to a revolution in the way new genes are identified. Mining of these databases using known protein sequences as queries is a powerful technique for discovering orthologous and paralogous genes. The scientist is often confronted, however, by an enormous amount of search output owing to the inherent redundancy of EST data. In addition, high search sensitivity often cannot be achieved using only a single member of a protein superfamily as a query. In this paper a technique for addressing both of these issues is described. Assembled EST databases are queried with every member of a protein superfamily, the results are integrated and false positives are pruned from the set. The result is a set of assemblies enriched in members of the protein superfamily under consideration. The technique is applied to the G protein-coupled receptor (GPCR) superfamily in the construction of a GPCR Resource. A novel full-length human GPCR identified from the GPCR Resource is presented, illustrating the utility of the method.

17.
A single nucleotide polymorphism (SNP) is a difference in DNA sequence between individuals and provides abundant information about genetic variation. Large-scale discovery of high-frequency SNPs is being undertaken using various methods. However, the publicly available SNP data sometimes need to be verified. If only a particular gene locus is of interest, locus-specific polymerase chain reaction amplification may be useful. A problem with this method is that the secondary peak has to be measured. We have analyzed trace data from conventional sequencing equipment and found an applicable rule to discern SNPs from noise. The rule is applied to multiply aligned sequences with traces, and the peak heights of the traces are compared between samples. We have developed software that integrates this function to automatically identify SNPs. The software works accurately for high-quality sequences and can also detect SNPs in low-quality sequences. Further, it can determine allele frequency, display this information as a bar graph and assign corresponding nucleotide combinations. It is also designed so that a person can easily verify and edit sequences on the screen. It is very useful for identifying de novo SNPs in a DNA fragment of interest.

18.
We made use of 81,635 expressed sequence tags (ESTs) derived from 12 different cDNA libraries of the silkworm, Bombyx mori, inbred strain Dazao (P50), to identify high-quality candidate single nucleotide polymorphisms (SNPs). PHRAP assembly yielded 12,980 contigs, of which 11,537 were assembled from more than one read, and 101 candidate SNPs and 27 single-base insertions/deletions were identified in 117 contigs assembled from 1576 high-quality reads that were base-called with PHRED and screened on the basis of the neighborhood quality standard (NQS). We also predicted 40 SNPs in coding regions (cSNPs), of which 26 were predicted to lead to non-synonymous amino acid changes and 14 to synonymous substitutions. The transition/transversion ratio of 1.66:1 differs from that of other insects. In this first SNP analysis of a lepidopteran, B. mori, the single nucleotide polymorphism density is estimated from sequence diversity to be 1.3 × 10⁻³. This analysis shows that expressed sequences from multiple libraries may provide an abundant source of comparative reads to mine for cSNPs from the silkworm genome.

19.
The alpine plant Arabis alpina is an emerging model in ecological genomics that is well suited to identifying the genes involved in local adaptation to contrasting environmental conditions, a subject which remains poorly understood at the molecular level. This study presents the assembly of a pool of A. alpina genomic fragments using next-generation sequencing technologies. These contigs cover 172 Mb of the A. alpina genome (i.e. 50% of the genome) and were shown to contain sequences giving positive hits against 96% of the 458 CEGMA core genes (Core Eukaryotic Genes Mapping Approach), a set of highly conserved eukaryotic genes. Regions presenting high nucleic sequence identity with 77% of the genes of the close relative Arabidopsis thaliana were found, with an unbiased distribution across the different functional categories of A. thaliana genes. This new resource was tested using a resequencing assay to identify polymorphic sites. Sixteen samples were successfully analysed and 127,041 single-nucleotide polymorphisms identified. This contig data set will contribute to improving our understanding of the ecology of Arabis alpina, thus constituting an important resource for future ecological genomic studies.

20.
Three closely related variants of rat (Rattus norvegicus) mtDNA have been shown to differ in the number of T residues found in a run of Ts (light strand) which spans the junction between the tRNA-Cys and tRNA-Tyr genes. The number of Ts in the repeat varies from 6 to 8 in these DNAs. Another, less closely related, R. norvegicus variant has a run of 5 Ts at this site, and in the related species, Rattus rattus, a run of 4 Ts is found. In R. norvegicus mtDNA, runs of 5 As and 5 Gs are found just to the 3' side of the variable T repeat, and it is suggested that the three runs of repeated nucleotides may stabilize heteroduplexes which result from strand slippage and which give rise to the insertions and/or deletions. Among 17 mtDNA clones derived from an individual with the 8T repeat, one clone was found which possessed a 9T repeat. This variant may represent an additional DNA type originally present within the individual.
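Measuring the length of a homopolymer run, such as the variable T repeat described above, can be sketched as follows. The function name `homopolymer_runs` and the example sequence are hypothetical; the sequence is invented for illustration and is not the rat mtDNA sequence.

```python
from itertools import groupby


def homopolymer_runs(seq: str, base: str, min_len: int = 1):
    """Return (start, length) for each run of `base` at least `min_len` long."""
    base = base.upper()
    runs = []
    pos = 0
    # groupby collapses consecutive identical characters into one group.
    for char, group in groupby(seq.upper()):
        length = len(list(group))
        if char == base and length >= min_len:
            runs.append((pos, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Invented sequence: a run of 6 Ts followed by runs of 5 As and 5 Gs.
    print(homopolymer_runs("ACTTTTTTGAAAAAGGGGG", "T", min_len=4))
```

Comparing the reported run length across variants or clones would show differences like the 6T, 7T and 8T types in the abstract.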
