Similar articles
 Found 20 similar articles (search time: 20 ms)
1.
MOTIVATION: Expressed sequence tag (EST) surveys are an efficient way to characterize large numbers of genes from an organism. The rate of gene discovery in an EST survey depends on the degree of redundancy of the cDNA libraries from which sequences are obtained. However, few statistical methods have been developed to assess and compare redundancies of various libraries from preliminary EST surveys. RESULTS: We consider statistics for the comparison of EST libraries based upon the frequencies with which genes occur in subsamples of reads. These measures are useful in determining which one of several libraries is more likely to yield new genes in future reads and what proportion of additional reads one might want to take from the libraries in order to be likely to obtain new genes. One approach is to compare single sample measures that have been successfully used in species estimation problems, such as coverage of a library, defined as the proportion of the library that is represented in the given sample of reads. Another single library measure is an estimate of the expected number of additional genes that will be found in a new sample of reads. We also propose statistics that jointly use data from all the libraries. Analogous formulas for coverage and the expected numbers of new genes are presented. These measures consider coverage in a single library based upon reads from all libraries and similarly, the expected numbers of new genes that will be discovered by taking reads from all libraries with fixed proportions. Together, the statistics presented provide useful comparative measures for the libraries that can be used to guide sampling from each of the libraries to maximize the rate of gene discovery. Finally, we present tests for whether genes are equally represented or expressed in a set of libraries. Binomial and chi-square tests are presented for gene-by-gene comparisons of expression. Overall tests of the equality of proportional representation are presented and multiple comparisons issues are addressed. These methods can be used to evaluate changes in gene expression reflected in the composition of EST libraries prepared from different tissue types or cells exposed to different environmental conditions. AVAILABILITY: Software will be made available at http://www.mathstat.dal.ca/~tsusko
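The coverage measure described in this abstract can be illustrated with a classical estimator from the species-estimation literature, the Good-Turing estimate: one minus the fraction of reads that belong to genes seen exactly once. Whether the paper uses exactly this estimator is an assumption; the function name and the toy read lists below are hypothetical and serve only as a minimal sketch.

```python
from collections import Counter


def good_turing_coverage(gene_ids):
    """Estimate sample coverage of a library as 1 - n1 / n.

    gene_ids: one gene (cluster) identifier per EST read.
    n  = total number of reads,
    n1 = number of genes observed exactly once (singletons).
    """
    n = len(gene_ids)
    if n == 0:
        raise ValueError("need at least one read")
    counts = Counter(gene_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n


# Hypothetical preliminary surveys of two libraries (6 reads each).
reads_a = ["g1", "g1", "g2", "g2", "g2", "g3"]  # 1 singleton
reads_b = ["g1", "g2", "g3", "g4", "g4", "g5"]  # 4 singletons

coverage_a = good_turing_coverage(reads_a)
coverage_b = good_turing_coverage(reads_b)
```

Under this estimate the library with the lower coverage (here the second one) is the more redundant-free of the two, so further reads from it would be expected to yield new genes more often.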

2.
3.
Gene expression analysis by signature pyrosequencing   Total citations: 3, self-citations: 0, citations by others: 3

4.
EST expression profiling provides an attractive tool for studying differential gene expression, but cDNA libraries' origins and EST data quality are not always known or reported. Libraries may originate from pooled or mixed tissues; EST clustering, EST counts, library annotations and analysis algorithms may contain errors. Traditional data analysis methods, including research into tissue-specific gene expression, assume EST counts to be correct and libraries to be correctly annotated, which is not always the case. Therefore, a method capable of assessing the quality of expression data based on that data alone would be invaluable for assessing the quality of EST data and determining their suitability for mRNA expression analysis. Here we report an approach to the selection of a small generic subset of 244 UniGene clusters suitable for identification of the tissue of origin for EST libraries and quality control of the expression data using EST expression information alone. We created a small expression matrix of UniGene IDs using two rounds of selection followed by two rounds of optimisation. Our selection procedures differ from traditional approaches to finding "tissue-specific" genes and our matrix yields consistently high positive correlation values for libraries with confirmed tissues of origin and can be applied for tissue typing and quality control of libraries as small as just a few hundred total ESTs. Furthermore, we can pick up tissue correlations between related tissues, e.g. brain and peripheral nervous tissue, heart and muscle tissues, and identify tissue origins for a few libraries of uncharacterised tissue identity. It was possible to confirm tissue identity for some libraries which have been derived from cancer tissues or have been normalised. Tissue matching is affected strongly by cancer progression or library normalisation and our approach may potentially be applied for elucidating the stage of normalisation in normalised libraries or for cancer staging.
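The tissue-typing idea in this abstract, correlating a library's EST counts over a fixed set of marker clusters against reference tissue profiles, can be sketched as follows. The abstract does not state which correlation measure or matrix layout the authors use, so the Pearson correlation, the four-marker profiles and all names here are assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def best_matching_tissue(library_counts, reference_profiles):
    """Return the reference tissue whose profile correlates best with the library."""
    scores = {
        tissue: pearson(library_counts, profile)
        for tissue, profile in reference_profiles.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical EST counts over four marker clusters (the paper uses 244).
reference_profiles = {
    "brain": [9, 1, 0, 2],
    "heart": [0, 8, 7, 1],
    "liver": [1, 0, 2, 9],
}
unknown_library = [12, 2, 1, 3]

tissue, scores = best_matching_tissue(unknown_library, reference_profiles)
```

A low best score for a library with a known annotation would, in this sketch, be a hint that the library is mixed, mislabelled or of poor quality.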

5.
6.
MOTIVATION: In gene discovery projects based on EST sequencing, effective post-sequencing identification methods are important in determining tissue sources of ESTs within pooled cDNA libraries. In the past, such identification efforts have been characterized by higher than necessary failure rates due to the presence of errors within the subsequence containing the oligo tag intended to define the tissue source for each EST. RESULTS: A large-scale EST-based gene discovery program at The University of Iowa has led to the creation of a unique software method named UITagCreator usable in the creation of large sets of synthetic tissue identification tags. The identification tags provide error detection and correction capability and, in conjunction with automated annotation software, result in a substantial improvement in the accurate identification of the tissue source in the presence of sequencing and base-calling errors. These identification rates are favorable, relative to past paradigms. AVAILABILITY: The UITagCreator source code and installation instructions, along with detection software usable in concert with created tag sets, are freely available at http://genome.uiowa.edu/pubsoft/software.html CONTACT: tomc@eng.uiowa.edu
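The abstract does not describe how UITagCreator builds its tags, so the following is only a generic sketch of how a tag set can provide error detection and correction: if every pair of tags differs in at least three positions (Hamming distance 3), any single base-calling error still leaves the read closer to its true tag than to any other. The greedy construction and the function names are assumptions, not the tool's actual algorithm.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_tag_set(length, min_distance, alphabet="ACGT"):
    """Greedily collect tags that are pairwise at least min_distance apart."""
    tags = []
    for candidate in product(alphabet, repeat=length):
        tag = "".join(candidate)
        if all(hamming(tag, t) >= min_distance for t in tags):
            tags.append(tag)
    return tags


def decode_tag(observed, tags, max_errors=1):
    """Return the unique tag within max_errors of observed, or None."""
    matches = [t for t in tags if hamming(observed, t) <= max_errors]
    return matches[0] if len(matches) == 1 else None


tags = build_tag_set(length=5, min_distance=3)

# Simulate one base-calling error in the first position of a tag.
true_tag = tags[1]
wrong_base = "C" if true_tag[0] != "C" else "G"
observed = wrong_base + true_tag[1:]
recovered = decode_tag(observed, tags)
```

With a minimum distance of 3, one error per tag can be corrected and two can be detected; longer tags or larger minimum distances trade tag-set size for robustness.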

7.
Expressed sequence tag (EST) data provide a powerful tool for identification of transcribed DNA sequences. However, as ESTs are relatively short, many exons are poorly covered by ESTs, thus reducing the utility of EST data. Recently, signature sequence tag (SST) fingerprints were proposed as an alternative to EST fingerprints. Given a fingerprint set of probes, the SST of a clone is the subset of probes from the fingerprint set that hybridize with the clone. We demonstrate that besides being a powerful technique for screening cDNA libraries, SST technology provides for very accurate gene predictions. Even with a small fingerprint set (600-800 probes), SST-based gene recognition outperforms many conventional and EST-based methods. The increase in the size of the fingerprint set to 1500 probes provides almost perfect gene recognition. Even more importantly, SST-based gene predictions miss very few exons and, therefore, provide an opportunity to bypass the cDNA sequencing step on the way from finished genomic sequence to mutation detection in gene-hunting projects. Because SST data can be obtained in a highly parallel and inexpensive way, SST technology has the potential to complement EST technology for gene hunting.
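The SST definition given in this abstract (the subset of a fixed probe set that hybridizes with a clone) can be made concrete with a toy model. Here hybridization is approximated by an exact substring match on one strand, which is a simplifying assumption; the probe set, clone sequences and function names are hypothetical.

```python
def sst_fingerprint(clone, probes):
    """Return the set of probes that 'hybridize' with the clone.

    Hybridization is modelled as exact substring occurrence, which is
    a simplification of real probe binding.
    """
    return frozenset(p for p in probes if p in clone)


def jaccard(a, b):
    """Similarity of two fingerprints: shared probes over all probes seen."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprint set of short probes and two overlapping clones.
probes = ["ACGT", "GGCA", "TTAG", "CCGA"]
clone_1 = "AAACGTTTGGCAAA"
clone_2 = "TTGGCAAATTAGCC"

sst_1 = sst_fingerprint(clone_1, probes)
sst_2 = sst_fingerprint(clone_2, probes)
similarity = jaccard(sst_1, sst_2)
```

Clones from the same gene would be expected to share probes, so comparing fingerprints gives a cheap way to group clones without sequencing them.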

8.
Although yield trials for switchgrass (Panicum virgatum L.), a potentially high value biofuel feedstock crop, are currently underway throughout North America, the genetic tools for crop improvement in this species are still in the early stages of development. Identification of high-density molecular markers, such as single nucleotide polymorphisms (SNPs), that are amenable to high-throughput genotyping approaches, is the first step in a quantitative genetics study of this model biofuel crop species. We generated and sequenced expressed sequence tag (EST) libraries from thirteen diverse switchgrass cultivars representing both upland and lowland ecotypes, as well as tetraploid and octoploid genomes. We followed this with reduced genomic library preparation and massively parallel sequencing of the same samples using the Illumina Genome Analyzer technology platform. EST libraries were used to generate unigene clusters and establish a gene-space reference sequence, thus providing a framework for assembly of the short sequence reads. SNPs were identified utilizing these scaffolds. We used a custom software program for alignment and SNP detection and identified over 149,000 SNPs across the 13 short-read sequencing libraries (SRSLs). Approximately 25,000 additional SNPs were identified from the entire EST collection available for the species. This sequencing effort generated data that are suitable for marker development and for estimation of population genetic parameters, such as nucleotide diversity and linkage disequilibrium. Based on these data, we assessed the feasibility of genome wide association mapping and genomic selection applications in switchgrass. Overall, the SNP markers discovered in this study will help facilitate quantitative genetics experiments and greatly enhance breeding efforts that target improvement of key biofuel traits and development of new switchgrass cultivars.

9.
We determined 36 310 bovine expressed sequence tag (EST) sequences using 10 different cDNA libraries. For massive EST sequencing, we devised a new system with two major features. First, we constructed cDNA libraries in which the poly(A) tails were removed using nested deletion at the 3′-ends. This permitted high quality reading of sequences from the 3′-end of the cDNA, which is otherwise difficult to do. Second, we increased throughput by sequencing directly on templates generated by colony PCR. Using this system, we determined 600 cDNA sequences per day. The read-out length was >450 bases in >90% of the sequences. Furthermore, we established a data management system for analyses, storage and manipulation of the sequence data. Finally, 16 358 non-redundant ESTs were derived from ~6900 independent genes. These data will facilitate the construction of a precise comparative map across mammalian species and the isolation of the functional genes that govern economic traits. This system is applicable to other organisms, including livestock, for which EST data are limited.

10.
11.
12.
13.
The statistical analysis of array comparative genomic hybridization (CGH) data has now shifted to the joint assessment of copy number variations at the cohort level. Considering multiple profiles gives the opportunity to correct for systematic biases observed on single profiles, such as probe GC content or the so-called "wave effect." In this article, we extend the segmentation model developed in the univariate case to the joint analysis of multiple CGH profiles. Our contribution is multiple: we propose an integrated model to perform joint segmentation, normalization, and calling for multiple array CGH profiles. This model shows great flexibility, especially in the modeling of the wave effect that gives a likelihood framework to approaches proposed by others. We propose a new dynamic programming algorithm for break point positioning, as well as a model selection criterion based on a modified Bayesian information criterion proposed in the univariate case. The performance of our method is assessed using simulated and real data sets. Our method is implemented in the R package cghseg.
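For orientation, the classical univariate problem that this abstract extends can be sketched as follows: given one profile and a fixed number of segments, dynamic programming finds the break points that minimize the within-segment sum of squared deviations from the segment mean. This is the textbook single-profile version, not the joint multi-profile algorithm proposed in the paper and not the cghseg implementation; the function name and test signal are hypothetical.

```python
def segment_profile(signal, n_segments):
    """Return the break points (segment start indices, excluding 0) that
    minimize the total within-segment squared error for n_segments segments."""
    n = len(signal)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(signal)")

    # Prefix sums give the squared error of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for v in signal:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Squared error of signal[i:j] around its mean (requires j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting signal[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i

    # Walk back from the end to recover each segment's start index.
    starts, j = [], n
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        starts.append(j)
    return sorted(starts)[1:]  # drop the leading 0


# Hypothetical log-ratio profile with a gained region at positions 4 to 6.
profile = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 0.0, 0.05]
breakpoints = segment_profile(profile, n_segments=3)
```

The triple loop costs O(K n^2) for K segments and n probes, which is why the number of segments is usually chosen afterwards with a penalized criterion such as the modified BIC mentioned in the abstract.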

14.
15.
DNA methylation is an important epigenetic modification that has essential roles in cellular processes including gene regulation, development and disease and is widely dysregulated in most types of cancer. Recent advances in sequencing technology have enabled the measurement of DNA methylation at single nucleotide resolution through methods such as whole-genome bisulfite sequencing and reduced representation bisulfite sequencing. In DNA methylation studies, a key task is to identify differences under distinct biological contexts, for example, between tumor and normal tissue. A challenge in sequencing studies is that the number of biological replicates is often limited by the costs of sequencing. The small number of replicates leads to unstable variance estimation, which can reduce the accuracy of detecting differentially methylated loci (DML). Here we propose a novel statistical method to detect DML when comparing two treatment groups. The sequencing counts are described by a lognormal-beta-binomial hierarchical model, which provides a basis for information sharing across different CpG sites. A Wald test is developed for hypothesis testing at each CpG site. Simulation results show that the proposed method yields improved DML detection compared to existing methods, particularly when the number of replicates is low. The proposed method is implemented in the Bioconductor package DSS.
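To show the general shape of a Wald test at a single CpG site, the sketch below compares the methylation proportions of two groups using plain binomial variances. It deliberately omits the lognormal-beta-binomial hierarchy and the cross-site information sharing that the abstract describes, so it is a simplified stand-in rather than the DSS method; the function name and counts are hypothetical.

```python
import math


def wald_test_two_groups(meth_1, total_1, meth_2, total_2):
    """Two-sided Wald test for a difference in methylation proportion.

    meth_*:  methylated read counts pooled over a group's replicates.
    total_*: total read counts covering the CpG site in that group.
    Uses binomial variances only (no over-dispersion term).
    """
    p1, p2 = meth_1 / total_1, meth_2 / total_2
    se = math.sqrt(p1 * (1 - p1) / total_1 + p2 * (1 - p2) / total_2)
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical site: 80 of 100 reads methylated in tumor, 40 of 100 in normal.
z, p_value = wald_test_two_groups(80, 100, 40, 100)
```

With few replicates the binomial variance used here tends to understate the true variability between samples, which is the problem the hierarchical model in the abstract is meant to address.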

16.
New perspectives on glutamine synthetase in grasses   Total citations: 2, self-citations: 0, citations by others: 2
Members of the glutamine synthetase (GS) gene family have now been characterized in many crop species such as wheat, rice, and maize. Studies have shown that cytosolic GS isoforms are involved in nitrogen remobilization during leaf senescence and emphasized a role in seed production particularly in small grain crop species. Data from the sequencing of genomes for model crops and expressed sequence tag (EST) libraries from non-model species have strengthened the idea that the cytosolic GS genes are organized in three functionally and phylogenetically conserved subfamilies. Using a bioinformatic approach, the considerable publicly available information on high throughput gene expression was mined to search for genes having patterns of expression similar to GS. Interesting new hypotheses have emerged from searching for co-expressed genes across multiple unfiltered experimental data sets in rice. This approach should inform new experimental designs and studies to explore the regulation of the GS gene family further. It is expected that understanding the regulation of GS under varied climatic conditions will emerge as an important new area considering the results from recent studies that have shown nitrogen assimilation to be critical to plant acclimation to high CO2 concentrations.

17.
18.
MOTIVATION: High data accuracy is essential to large-scale gene discovery projects. The data should not only be trustworthy but should also be correctly annotated for the various features they contain. Sequence errors are inherent in single-pass sequences such as ESTs obtained from automated sequencing. These errors further complicate the automated identification of EST-related sequencing. A tool is required to prepare the data prior to advanced annotation processing and submission to public databases. RESULTS: This paper describes ESTprep, a program designed to preprocess expressed sequence tag (EST) sequences. It identifies the location of features present in ESTs and allows the sequence to pass only if it meets various quality criteria. Use of ESTprep has resulted in substantial improvement in accurate EST feature identification and fidelity of results submitted to GenBank. AVAILABILITY: The program is freely available for download from http://genome.uiowa.edu/pubsoft/software.html

19.
Perspectives on the molecular epidemiology of aerodigestive tract cancers   Total citations: 8, self-citations: 0, citations by others: 8
Improving laboratory techniques and the greater availability of genetic data have led to a flurry of publications from molecular epidemiologic studies on aerodigestive tract cancers. Inconsistent results have been observed in studies of sequence variants, due to limitations such as small sample size, possible detection of false positives, moderate prior probabilities that each SNP confers a substantial increase in cancer risk, and publication bias. Meta- and pooled-analyses were shown to be effective in elucidating modest increases in aerodigestive tract cancer risk attributable to sequence variants. Phenotypic assays developed to quantify an individual's DNA repair capacity have been applied to epidemiological studies on aerodigestive tract cancers. Epigenetic events have also been studied in tumor progression and as susceptibility factors for aerodigestive tract cancers, in smaller scale studies. It is imperative that limitations of previous studies are addressed for future research in the molecular epidemiology of aerodigestive tract cancers. Some recommendations for future research are to: (i) incorporate multiple markers of different types (e.g. genotype and phenotype data), (ii) enhance statistical power by conducting studies with larger sample size, and developing consortia to coordinate research efforts, (iii) improve marker selection via a hybrid strategy of incorporating data on evolutionary biology and physico-chemical properties of amino acids, with haplotype/tag SNP data, (iv) employ novel statistical methods such as hierarchical modeling with Bayesian adjustments, false positive reporting probability and modeling of complex pathways. Consortia have been initiated for head and neck cancer (International Head and Neck Cancer Epidemiology Consortium (INHANCE)) and lung cancer (International Lung Cancer Consortium (ILCCO)) with the aim to share comparable data, to focus on rare subgroups such as nonsmokers and to coordinate laboratory analyses. Such collaborative efforts and integration across disciplines will be essential in contributing to the elucidation of genetic susceptibility to aerodigestive tract cancers.

20.
Metabarcoding of environmental samples on second‐generation sequencing platforms has rapidly become a valuable tool for ecological studies. This approach relies fundamentally on being able to track tagged amplicons back to the samples from which they originated. In this study, we address the problem of sequences in metabarcoding sequencing outputs with false combinations of used tags (tag jumps). Unless these sequences can be identified and excluded from downstream analyses, tag jumps creating sequences with false, but already used tag combinations, can cause incorrect assignment of sequences to samples and artificially inflate diversity. In this study, we document and investigate tag jumping in metabarcoding studies on Illumina sequencing platforms by amplifying mixed‐template extracts obtained from bat droppings and leech gut contents with tagged generic arthropod and mammal primers, respectively. We found that an average of 2.6% and 2.1% of sequences had tag combinations that could be explained by tag jumping in the leech and bat diet study, respectively. We suggest that tag jumping can happen during blunt‐ending of pools of tagged amplicons during library build and as a consequence of chimera formation during bulk amplification of tagged amplicons during library index PCR. We argue that tag jumping and contamination between libraries represent a considerable challenge for Illumina‐based metabarcoding studies, and suggest measures to avoid false assignment of tag jumping‐derived sequences to samples.
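One simple consequence of the tag-jump problem described in this abstract is that a study can only detect jumps that land on tag combinations it never assigned to a sample. The sketch below separates reads by that criterion and reports the fraction carrying unused combinations. It is an illustration under assumed data structures (tuples of forward tag, reverse tag and sequence), not the authors' pipeline, and all names are hypothetical.

```python
def split_by_tag_combination(reads, used_combinations):
    """Separate reads whose (forward, reverse) tag pair was assigned to a
    sample from reads carrying a pair that was never used."""
    used = set(used_combinations)
    kept, discarded = [], []
    for fwd_tag, rev_tag, sequence in reads:
        target = kept if (fwd_tag, rev_tag) in used else discarded
        target.append((fwd_tag, rev_tag, sequence))
    return kept, discarded


# Hypothetical design: only matching tag pairs were assigned to samples.
used_combinations = {("T1", "T1"), ("T2", "T2")}
reads = [
    ("T1", "T1", "ACGT"),
    ("T2", "T2", "GGCC"),
    ("T1", "T2", "ACGT"),  # unused pair: consistent with a tag jump
    ("T2", "T2", "GGCA"),
]

kept, discarded = split_by_tag_combination(reads, used_combinations)
unused_fraction = len(discarded) / len(reads)
```

Jumps that happen to produce a combination already assigned to another sample pass this filter unnoticed, which is why the abstract stresses experimental measures in addition to bioinformatic filtering.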

Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号