Similar articles
20 similar articles found (search time: 15 ms)
1.
Sequence analysis and editing for bisulphite genomic sequencing projects   (Total citations: 6; self-citations: 1; citations by others: 5)
Bisulphite genomic sequencing is a widely used technique for detailed analysis of the methylation status of a region of DNA. It relies upon the selective deamination of unmethylated cytosine to uracil after treatment with sodium bisulphite, usually followed by PCR amplification of the chosen target region. Since this two-step procedure replaces all unmethylated cytosine bases with thymine, PCR products derived from unmethylated templates contain only three types of nucleotide, in unequal proportions. This can create a number of technical difficulties (e.g. for some base-calling methods) and impedes manual analysis of sequencing results (since the long runs of T or A residues are difficult to align visually with the parent sequence). To facilitate the detailed analysis of bisulphite PCR products (particularly using multiple cloned templates), we have developed a visually intuitive program that identifies the methylation status of CpG dinucleotides by analysis of raw sequence data files produced by MegaBACE or ABI sequencers as well as Staden SCF trace files and plain text files. The program then also collates and presents data derived from independent templates (e.g. separate clones). This results in a considerable reduction in the time required for completion of a detailed genomic methylation project.
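The chemistry described above (unmethylated C reads as T after bisulphite treatment and PCR, while methylated CpG cytosines are protected) can be illustrated with a minimal simulation. This is not the published program, just a sketch of the conversion rule it must account for:

```python
# Sketch: simulate bisulphite conversion of a template. Unmethylated
# cytosines deaminate and are read as T after PCR; cytosines in CpG
# context are optionally treated as methylated and therefore protected.
def bisulphite_convert(seq, methylated_cpg=True):
    out = []
    for i, base in enumerate(seq):
        if base == "C":
            is_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
            out.append("C" if (methylated_cpg and is_cpg) else "T")
        else:
            out.append(base)
    return "".join(out)

print(bisulphite_convert("ACGTCCACG"))         # → ACGTTTACG
print(bisulphite_convert("ACGTCCACG", False))  # → ATGTTTATG
```

Note how the fully unmethylated product contains only three base types (A, T, G), which is exactly the property that complicates base calling and visual alignment.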

2.

Background  

Trace or chromatogram files (raw data) are produced by automatic nucleic acid sequencing equipment or sequencers. Each file contains information which can be interpreted by specialised software to reveal the sequence (base calling). This is done by the sequencer proprietary software or publicly available programs. Depending on the size of a sequencing project, the number of trace files can vary from just a few to thousands of files. Sequencing quality assessment on various criteria is important at the stage preceding clustering and contig assembly. Two major publicly available packages, Phred and Staden, are used by preAssemble to perform sequence quality processing.
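The quality-processing step referred to above typically clips low-confidence bases from read ends using per-base Phred scores. As a minimal, illustrative stand-in (real pipelines use the Phred/Staden trimming algorithms, which are more sophisticated):

```python
# Sketch: trim leading and trailing bases whose Phred quality score falls
# below a threshold. A simplification of the quality clipping performed
# before clustering and contig assembly.
def trim_low_quality(seq, quals, threshold=20):
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]

print(trim_low_quality("NNACGTNN", [5, 8, 30, 40, 40, 30, 9, 4]))
```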

3.
The estimation of prediction quality is important because without quality measures, it is difficult to determine the usefulness of a prediction. Currently, methods for ligand binding site residue predictions are assessed in the function prediction category of the biennial Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment, utilizing the Matthews Correlation Coefficient (MCC) and Binding-site Distance Test (BDT) metrics. However, the assessment of ligand binding site predictions using such metrics requires the availability of solved structures with bound ligands. Thus, we have developed a ligand binding site quality assessment tool, FunFOLDQA, which utilizes protein feature analysis to predict ligand binding site quality prior to the experimental solution of the protein structures and their ligand interactions. The FunFOLDQA feature scores were combined using: simple linear combinations, multiple linear regression and a neural network. The neural network produced significantly better results for correlations to both the MCC and BDT scores, according to Kendall's τ, Spearman's ρ and Pearson's r correlation coefficients, when tested on both the CASP8 and CASP9 datasets. The neural network also produced the largest Area Under the Curve (AUC) score when Receiver Operating Characteristic (ROC) analysis was undertaken for the CASP8 dataset. Furthermore, the FunFOLDQA algorithm incorporating the neural network is shown to add value to FunFOLD, when both methods are employed in combination. This results in a statistically significant improvement over all of the best server methods, the FunFOLD method (6.43%), and one of the top manual groups (FN293) tested on the CASP8 dataset. The FunFOLDQA method was also found to be competitive with the top server methods when tested on the CASP9 dataset.
To the best of our knowledge, FunFOLDQA is the first attempt to develop a method that can be used to assess ligand binding site prediction quality, in the absence of experimental data.
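The simplest of the three score-combination schemes mentioned (a linear combination fitted by least squares) can be sketched for a single feature; the actual FunFOLDQA method also used multiple regression and a neural network:

```python
# Sketch: ordinary least squares for y ≈ a + b·x, i.e. fitting one feature
# score (x) against an observed quality score (y). Illustrative only; the
# published method combines several features.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

print(fit_linear([0, 1, 2, 3], [1, 3, 5, 7]))  # → (1.0, 2.0)
```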

4.
Modern applications of Sanger DNA sequencing often require converting a large number of chromatogram trace files into high-quality DNA sequences for downstream analyses. Relatively few nonproprietary software tools are available to assist with this process. SeqTrace is a new, free, and open-source software application that is designed to automate the entire workflow by facilitating easy batch processing of large numbers of trace files. SeqTrace can identify, align, and compute consensus sequences from matching forward and reverse traces, filter low-quality base calls, and end-trim finished sequences. The software features a graphical interface that includes a full-featured chromatogram viewer and sequence editor. SeqTrace runs on most popular operating systems and is freely available, along with supporting documentation, at http://seqtrace.googlecode.com/.
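The core of forward/reverse consensus calling, as described above, is to reverse-complement the reverse read and pick the higher-confidence base at each position. A toy sketch (not SeqTrace's algorithm; it assumes the two reads are already aligned and equal in length):

```python
# Sketch: quality-weighted consensus of a forward read and a reverse read.
# Assumes pre-aligned, equal-length reads — a simplification of what a
# trace-processing tool actually does.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def consensus(fwd, fwd_q, rev, rev_q):
    rc, rc_q = revcomp(rev), rev_q[::-1]
    return "".join(f if fq >= rq else r
                   for f, fq, r, rq in zip(fwd, fwd_q, rc, rc_q))
```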

5.
6.
Aim
The aim of this work was to design and evaluate a software tool for analysis of a patient's respiration, with the goal of optimizing the effectiveness of motion management techniques during radiotherapy imaging and treatment.
Materials and methods
A software tool which analyses patient respiratory data files (.vxp files) created by the Varian Real-Time Position Management System (RPM) was developed. The software, called RespAnalysis, was created in MATLAB and provides four modules, one each for determining respiration characteristics, providing breathing coaching (biofeedback training), comparing pre- and post-training characteristics and performing a fraction-by-fraction assessment. The modules analyse respiratory traces to determine signal characteristics and specifically use a Sample Entropy algorithm as the key means to quantify breathing irregularity. Simulated respiratory signals, as well as 91 patient RPM traces, were analysed with RespAnalysis to test the viability of using the Sample Entropy for predicting breathing regularity.
Results
Retrospective assessment of patient data demonstrated that the Sample Entropy metric was a predictor of periodic irregularity in respiration data; however, it was found to be insensitive to amplitude variation. Additional waveform statistics assessing the distribution of signal amplitudes over time, coupled with the Sample Entropy method, were found to be useful in assessing breathing regularity.
Conclusions
The RespAnalysis software tool presented in this work uses the Sample Entropy method to analyse patient respiratory data recorded for motion management purposes in radiation therapy. This is applicable during treatment simulation and during subsequent treatment fractions, providing a way to quantify breathing irregularity, as well as assess the need for breathing coaching.
It was demonstrated that the Sample Entropy metric correlated with the irregularity of the patient's respiratory motion in terms of periodicity, whilst other metrics, such as percentage deviation of inhale/exhale peak positions, provided insight into respiratory amplitude regularity.
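Sample Entropy, the key irregularity measure used above, is defined as SampEn(m, r) = −ln(A/B), where B counts pairs of length-m templates within Chebyshev distance r of each other (self-matches excluded) and A counts the same for length m+1. A self-contained sketch (note r is an absolute tolerance here; implementations often scale it by the signal's standard deviation):

```python
import math
import random

# Sketch of Sample Entropy SampEn(m, r). Higher values indicate a less
# regular (less self-similar) signal; a strictly periodic trace scores
# near zero.
def sample_entropy(x, m=2, r=0.2):
    def count(mm):
        n = len(x) - mm + 1
        templates = [x[i:i + mm] for i in range(n)]
        c = 0
        for i in range(n):
            for j in range(i + 1, n):
                if max(abs(a - b) for a, b in zip(templates[i], templates[j])) < r:
                    c += 1
        return c
    B, A = count(m), count(m + 1)
    return -math.log(A / B) if A and B else float("inf")

periodic = [0.0, 1.0] * 50
rng = random.Random(0)
noise = [rng.random() for _ in range(100)]
# A periodic signal yields a lower SampEn than a random one.
print(sample_entropy(periodic, r=0.5), sample_entropy(noise, r=0.5))
```

This also illustrates the limitation noted in the abstract: SampEn reacts to periodicity, not amplitude, so a signal with regular timing but drifting amplitude can still score low.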

7.
8.
The increasing accessibility and reduced costs of sequencing has made genome analysis accessible to more and more researchers. Yet there remains a steep learning curve in the subsequent computational steps required to process raw reads into a database-deposited genome sequence. Here we describe “Genomer,” a tool to simplify the manual tasks of finishing and uploading a genome sequence to a database. Genomer can format a genome scaffold into the common files required for submission to GenBank. This software also simplifies updating a genome scaffold by allowing a human-readable YAML format file to be edited instead of large sequence files. Genomer is written as a command line tool and is an effort to make the manual process of genome scaffolding more robust and reproducible. Extensive documentation and video tutorials are available at http://next.gs.
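The idea of editing a small, human-readable scaffold description instead of large sequence files can be sketched as follows. The entry names and keys (`source`, `gap`, `reverse`) are hypothetical, chosen for illustration; Genomer's actual YAML schema may differ:

```python
# Sketch: build a scaffold sequence from a small, editable plan (the kind
# of structure a YAML scaffold file would parse into). Keys are invented
# for illustration, not Genomer's real format.
contigs = {"ctg1": "ATGCCGTA", "ctg2": "GGATTACA"}
scaffold = [
    {"source": "ctg1"},
    {"gap": 5},                             # run of N's between contigs
    {"source": "ctg2", "reverse": True},    # reverse-complemented contig
]

COMP = str.maketrans("ACGTN", "TGCAN")

def build_scaffold(plan, seqs):
    parts = []
    for entry in plan:
        if "gap" in entry:
            parts.append("N" * entry["gap"])
        else:
            s = seqs[entry["source"]]
            if entry.get("reverse"):
                s = s.translate(COMP)[::-1]
            parts.append(s)
    return "".join(parts)

print(build_scaffold(scaffold, contigs))  # → ATGCCGTANNNNNTGTAATCC
```

Reordering or flipping a contig then means editing one short line of the plan rather than manipulating multi-megabase FASTA records.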

9.
The program PHASE is widely used for Bayesian inference of haplotypes from diploid genotypes; however, manually creating PHASE input files from sequence alignments is an error-prone and time-consuming process, especially when dealing with numerous variable sites and/or individuals. Here, a web tool called SeqPHASE is presented that generates PHASE input files from FASTA sequence alignments and converts PHASE output files back into FASTA. During the production of the PHASE input file, several consistency checks are performed on the dataset and suitable command line options to be used for the actual PHASE data analysis are suggested. SeqPHASE was written in Perl and is freely accessible over the Internet at the address http://www.mnhn.fr/jfflot/seqphase.
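The conversion this tool automates amounts to finding the variable alignment columns and writing them out per individual. A simplified sketch of a PHASE-style input writer, assuming each diploid individual is represented by two aligned sequences (the real tool performs many more consistency checks, and the exact header layout here follows my reading of the PHASE input format, so treat it as an approximation):

```python
# Sketch: convert aligned allele pairs to a PHASE-style input. ind_seqs
# maps individual id -> (sequence of allele 1, sequence of allele 2).
def fasta_to_phase(ind_seqs):
    ids = sorted(ind_seqs)
    length = len(next(iter(ind_seqs.values()))[0])
    # variable (segregating) columns across all sequences
    sites = [i for i in range(length)
             if len({s[i] for pair in ind_seqs.values() for s in pair}) > 1]
    lines = [str(len(ids)),                              # number of individuals
             str(len(sites)),                            # number of loci
             "P " + " ".join(str(i + 1) for i in sites), # 1-based positions
             "S" * len(sites)]                           # all loci are SNPs
    for ind in ids:
        a, b = ind_seqs[ind]
        lines.append(ind)
        lines.append(" ".join(a[i] for i in sites))
        lines.append(" ".join(b[i] for i in sites))
    return "\n".join(lines)
```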

10.
In the developing mammalian retina, horizontal neurons undergo a dramatic reorganization of their processes shortly after they migrate to their appropriate laminar position. This is an important process because it is now understood that the apical processes are important for establishing the regular mosaic of horizontal cells in the retina and proper reorganization during lamination is required for synaptogenesis with photoreceptors and bipolar neurons. However, this process is difficult to study because the analysis of horizontal neuron anatomy is labor intensive and time-consuming. In this paper, we present a computational method for automatically tracing the three-dimensional (3-D) dendritic structure of horizontal retinal neurons in two-photon laser scanning microscope (TPLSM) imagery. Our method is based on 3-D skeletonization and is thus able to preserve the complex structure of the dendritic arbor of these cells. We demonstrate the effectiveness of our approach by comparing our tracing results against two sets of semi-automated traces over a set of 10 horizontal neurons ranging in age from P1 to P5. We observe an average agreement level of 81% between our automated trace and the manual traces. This automated method will serve as an important starting point for further refinement and optimization.

11.
'Ted' (Trace editor) is a graphical editor for sequence and trace data from automated fluorescence sequencing machines. It provides facilities for viewing sequence and trace data (in top or bottom strand orientation), for editing the base sequence, for automated or manual trimming of the head (vector) and tail (uncertain data) from the sequence, for vertical and horizontal trace scaling, for keeping a history of sequence editing, and for output of the edited sequence. Ted has been used extensively in the C. elegans genome sequencing project, both as a stand-alone program and integrated into the Staden sequence assembly package, and has greatly aided in the efficiency and accuracy of sequence editing. It runs in the X windows environment on Sun workstations and is available from the authors. Ted currently supports sequence and trace data from the ABI 373A and Pharmacia A.L.F. sequencers.

12.
Visualization tools that allow both optimization of the instrument's parameters for data acquisition and specific quality control (QC) for a given sample prior to time-consuming database searches have been scarce until recently and are currently still not freely available. To address this need, we have developed the visualization tool LogViewer, which uses diagnostic data from the RAW files of the Thermo Orbitrap and linear trap quadrupole-Fourier transform (LTQ-FT) mass spectrometers to monitor relevant metrics. To summarize and visualize the performance on our test samples, log files from RawXtract are imported and displayed. LogViewer is a visualization tool that allows a specific and fast QC for a given sample without time-consuming database searches. QC metrics displayed include: mass spectrometry (MS) ion-injection time histograms, MS ion-injection time versus retention time, MS2 ion-injection time histograms, MS2 ion-injection time versus retention time, dependent scan histograms, charge-state histograms, mass-to-charge ratio (M/Z) distributions, M/Z histograms, mass histograms, mass distribution, summary, repeat analyses, Raw MS, and Raw MS2. Systematically optimizing all metrics allowed us to increase our protein identification rates from 600 proteins to routinely determine up to 1400 proteins in any 160-min analysis of a complex mixture (e.g., yeast lysate) at a false discovery rate of <1%. Visualization tools, such as LogViewer, make QC of complex liquid chromatography (LC)-MS and LC-MS/MS data and optimization of the instrument's parameters accessible to users.
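Most of the QC metrics listed above are fixed-width histograms over per-scan values (e.g. MS ion-injection times in milliseconds). The binning step itself is trivial and can be sketched independently of any vendor file format:

```python
from collections import Counter

# Sketch: bin per-scan values (e.g. ion-injection times) into fixed-width
# bins, the summary a QC viewer would plot as a histogram.
def histogram(values, bin_width):
    return Counter(int(v // bin_width) * bin_width for v in values)

print(histogram([0.5, 1.2, 1.9, 3.0], 1))
```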

13.
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry is a powerful tool for rapidly generating protein expression data (peptide and protein profiles) from a large number of samples. However, as with any technology, it must be optimized and reproducible for one to have confidence in the results. Using a classical statistical method called the fractional factorial design of experiments, we assessed the effects of 11 different experimental factors. We also developed several metrics that reflect trace quality and reproducibility. These were used to measure the effect of each individual factor, and the interactions between factors, to determine optimal factor settings and thus ultimately produce the best possible traces. Significant improvements to output traces were seen by simultaneously altering several parameters, either in the sample preparation procedure or during the matrix preparation and application procedure. This has led to the implementation of an improved method that gives a better quality, reproducible, and robust output.

14.
SAMtools is a widely-used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.
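The external (out-of-core) sort mentioned above follows the classic pattern: sort bounded chunks in memory, spill each to disk, then k-way merge the sorted runs. A language-agnostic sketch of that strategy (SAMtools itself operates on compressed BGZF blocks in C; this is only the control flow):

```python
import heapq
import os
import pickle
import tempfile

# Sketch: external merge sort. Sort fixed-size chunks in memory, spill each
# sorted run to a temp file, then k-way merge the runs with a heap.
def external_sort(records, key, chunk_size):
    paths, chunk = [], []

    def spill():
        chunk.sort(key=key)
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(chunk, f)
        paths.append(path)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    runs = []
    for p in paths:
        with open(p, "rb") as f:
            runs.append(pickle.load(f))
        os.remove(p)
    return list(heapq.merge(*runs, key=key))
```

The in-memory chunk sort is the part the paper parallelizes with OpenMP tasks; the merge phase is what makes the sort I/O-bound for very large files.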

15.
16.
Direct Sanger sequencing of a diploid template containing a heterozygous insertion or deletion results in a difficult-to-interpret mixed trace formed by two allelic traces superimposed onto each other. Existing computational methods for deconvolution of such traces require knowledge of a reference sequence or the availability of both direct and reverse mixed sequences of the same template. We describe a simple yet accurate method, which uses dynamic programming optimization to predict superimposed allelic sequences solely from a string of letters representing peaks within an individual mixed trace. We used the method to decode 104 human traces (mean length 294 bp) containing heterozygous indels of 5 to 30 bp, with a mean of 99.1% of bases per allelic sequence reconstructed correctly and unambiguously. Simulations with artificial sequences have demonstrated that the method yields accurate reconstructions when (1) the allelic sequences forming the mixed trace are sufficiently similar, (2) the analyzed fragment is significantly longer than the indel, and (3) multiple indels, if present, are well-spaced. Because these conditions occur in most encountered DNA sequences, the method is widely applicable. It is available as a free Web application Indelligent at http://ctap.inhs.uiuc.edu/dmitriev/indel.asp.
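The structure the method exploits can be shown with a toy model: downstream of a heterozygous deletion of k bases, the trace shows both allelic bases at each position, and the second allele is simply the first shifted by k. Any position where only one base is visible then pins the same base k positions downstream, which is enough to recover k in this simplified setting (the published method is a full dynamic-programming reconstruction; this sketch only infers the shift):

```python
# Toy model of a mixed chromatogram downstream of a heterozygous indel:
# at each position the set of visible peaks is {allele1[i], allele2[i]}.
def mixed_trace(allele1, allele2):
    return [frozenset(p) for p in zip(allele1, allele2)]

# Infer the indel size k: for the true shift, every single-peak position
# must reappear k positions downstream. Returns the smallest consistent k.
def infer_shift(mixed, max_k=30):
    for k in range(1, max_k + 1):
        ok = all(len(s) > 1 or next(iter(s)) in mixed[i + k]
                 for i, s in enumerate(mixed) if i + k < len(mixed))
        if ok:
            return k
    return None
```

This also makes the abstract's conditions intuitive: if the fragment is not much longer than the indel, or indels overlap, too few constraints survive to fix the shift.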

17.
Structure prediction methods often generate a large number of models for a target sequence. Even if the correct fold for the target sequence is sampled in this dataset, it is difficult to distinguish it from other decoy structures. An attempt to solve this problem using experimental mutational sensitivity data for the CcdB protein was described previously by exploiting the correlation of residue depth with mutational sensitivity (r ~ 0.6). We now show that such a correlation extends to four other proteins with localized active sites, and for which saturation mutagenesis datasets exist. We also examine whether incorporation of predicted secondary structure information and the DOPE model quality assessment score, in addition to mutational sensitivity, improves the accuracy of model discrimination using a decoy dataset of 163 targets from CASP. Although most CASP models would have been subjected to model quality assessment prior to submission, we find that the DOPE score makes a substantial contribution to the observed improvement. We therefore also applied the approach to CcdB and four other proteins for which reliable experimental mutational data exist and observe that inclusion of experimental mutational data results in a small qualitative improvement in model discrimination relative to that seen with just the DOPE score. This is largely because of our limited ability to quantitatively predict effects of point mutations on in vivo protein activity. Further improvements in the methodology are required to facilitate improved utilization of single mutant data.
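The rank correlation underlying this approach (e.g. between residue depth and mutational sensitivity) is Spearman's ρ: the Pearson correlation computed on ranks. A minimal sketch, assuming untied data (tie handling via average ranks is omitted for brevity):

```python
# Sketch: Spearman's rank correlation for untied data — Pearson correlation
# of the rank vectors.
def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx = my = (n - 1) / 2
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```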

18.
Variability in the ecological quality assessment of reference sites was tested on small headwater streams in Ireland. Although headwater streams constitute a large portion of the river channel network, they are not routinely monitored for water quality. Various metrics were used including the Irish Q-value and the newly developed Small Streams Risk Score (SSRS), and metrics applied elsewhere in the Atlantic biogeographic region in Europe, including the Biological Monitoring Working Party score (BMWP), the Average Score per Taxon (ASPT), the Ephemeroptera, Plecoptera and Trichoptera taxa (EPT), the Belgium Biotic Index (BBI) and the Danish Stream Fauna Index (DSFI). The AQEM (version 2.5a) assessment software was used to apply some of these metrics. The spring and summer datasets are used to test the performance of biotic metrics with respect to season, and the applicability of their use to assess the ecological quality of wadeable streams. The quality status of most sites assigned by the various metrics was high using the spring invertebrate data, whereas a considerable apparent deviation in quality status occurred when the summer data were applied. Seasonal differences were noted using all the biotic indices and are attributed to the absence of pollution-sensitive groups in summer. Seasonal variability in the water quality status was particularly evident in acidic streams draining non-calcareous geologies with peaty soils that had relatively lower numbers of taxa. Some indices applied reflect a greater seasonal difference in the quality category assigned. The least amount of variability between seasons was obtained using the ASPT and the SSRS risk assessment system. Results suggest that reference status is reliably reflected in spring when more pollution-sensitive taxa were present, and that a new ecological quality assessment tool is required for application in summer when impacts may be most severe.
This highly heterogeneous freshwater habitat seems to have too few taxa present in the summer to reliably determine the ecological quality of the stream using the available indices. Handling editor: R. Bailey
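The relationship between two of the indices compared above is simple: BMWP sums a pollution-tolerance score over the scoring taxa present, and ASPT divides that sum by the number of scoring taxa, which is why ASPT is less sensitive to sampling effort and season. A sketch with a handful of indicative family scores (illustrative values; consult the published BMWP tables for real assessments):

```python
# Indicative BMWP-style tolerance scores for a few invertebrate families.
# Illustrative only — use the official BMWP score tables in practice.
SCORES = {"Heptageniidae": 10, "Perlidae": 10, "Gammaridae": 6,
          "Baetidae": 4, "Chironomidae": 2, "Oligochaeta": 1}

def bmwp_aspt(taxa):
    """BMWP = sum of scores over scoring taxa present;
    ASPT = BMWP / number of scoring taxa."""
    scoring = [SCORES[t] for t in taxa if t in SCORES]
    bmwp = sum(scoring)
    aspt = bmwp / len(scoring) if scoring else 0.0
    return bmwp, aspt

print(bmwp_aspt(["Heptageniidae", "Baetidae", "Chironomidae"]))
```

Losing one sensitive family in summer drops BMWP by its full score, while ASPT changes only through the average, consistent with the lower seasonal variability reported for ASPT.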

19.
Multimetric fish-based indices have been increasingly gaining importance in Europe, as the Water Framework Directive (WFD) requires fish fauna, and particularly its composition and abundance, to be taken into account in the assessment of the ecological quality of continental surface waters, including transitional waters. These indices are composed of several metrics, mostly related with structural and functional characteristics of fish communities, such as species richness, the role of nursery areas, or trophic web structure. Therefore, ecological quality assessments should ensure that these structural and functional characteristics of fish communities were covered by the sampling methods used. In the present work, the influence of sampling effort on several metrics of the Estuarine Fish Assessment Index (EFAI) was studied. Pseudo-random samples were generated from data of four Portuguese estuaries and bootstrap cycles were performed, in order to obtain metrics' means and standard deviations per number of hauls analysed. The number of hauls necessary for the means to level off differed with the metrics considered. Generally, for metrics on percentages (percentage of marine migrants, percentage of estuarine residents and percentage of piscivores) the curve levelled off with less than 20 hauls, both for the estuary as a whole and for different estuarine salinity zones. On the other hand, metrics on species richness required much larger samples. In order to reduce the current estimated bias of the metrics to −5%, the WFD sampling costs would have to be more than 3 times higher than they currently are. The findings in the present study are of great importance for an effective assessment of estuarine ecological quality and particularly in the context of the WFD, as the metrics studied are common to other Member State indices.  
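The bootstrap procedure described (resample n hauls with replacement, recompute the metric, repeat, and track the mean and standard deviation as n grows) can be sketched generically. The haul representation and the richness metric below are illustrative stand-ins for the EFAI metrics:

```python
import random

# Sketch: bootstrap a community metric over pseudo-random samples of
# n_hauls hauls drawn with replacement; returns the mean and standard
# deviation of the metric across replicates.
def bootstrap_metric(hauls, metric, n_hauls, n_boot=1000, seed=1):
    rng = random.Random(seed)
    vals = []
    for _ in range(n_boot):
        sample = [rng.choice(hauls) for _ in range(n_hauls)]
        vals.append(metric(sample))
    mean = sum(vals) / n_boot
    sd = (sum((v - mean) ** 2 for v in vals) / n_boot) ** 0.5
    return mean, sd

# Example metric: species richness pooled over the sampled hauls
# (each haul modelled as a set of species).
def richness(sample):
    return len(set().union(*sample))
```

Plotting the bootstrap mean against n_hauls reproduces the levelling-off curves discussed in the abstract: percentage-type metrics stabilize quickly, while richness keeps climbing until rare species are covered.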

20.