Similar articles
 20 similar articles found
1.
With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ~159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.  相似文献   

2.
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve the compression performance. The proposed framework could handle genome compression with and without reference sequences, and demonstrated performance advantages over the best existing algorithms. The method for reference-free compression led to bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, which were approximately 3.7% and 2.6% better than the state-of-the-art algorithms. Regarding performance with reference, we tested on the first Korean personal genome sequence data set, and our proposed method demonstrated a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable to that of existing algorithms. DNAcompact is freely available at https://sourceforge.net/projects/dnacompact/ for research purposes.
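
The two figures of merit quoted in the abstracts above, fold compression and bits per base, can be reproduced with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical and are not part of GRS or DNAcompact, and the 215-byte example is invented to show the unit conversion.

```python
def fold_compression(original_size: float, compressed_size: float) -> float:
    """Ratio of original to compressed size (both in the same unit)."""
    return original_size / compressed_size


def bits_per_base(compressed_bytes: int, n_bases: int) -> float:
    """Average number of bits spent per nucleotide after compression."""
    return 8 * compressed_bytes / n_bases


# Sizes in MB as reported for the Korean personal genome data set.
print(round(fold_compression(2986.8, 18.8)))  # GRS figure, about 159-fold
print(round(fold_compression(2986.8, 15.8)))  # DNAcompact figure, about 189-fold

# Hypothetical example: 1000 bases stored in 215 bytes.
print(bits_per_base(215, 1000))
```

For orientation, a plain 2-bit encoding of A, C, G and T corresponds to 2.0 bits per base, so the reference-free rates of 1.720 and 1.838 quoted above are below that baseline.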

3.

Background  

Processing raw DNA sequence data is an especially challenging task for relatively small laboratories and core facilities that produce as many as 5000 or more DNA sequences per week from multiple projects in widely differing species. To meet this challenge, we have developed the flexible, scalable, and automated sequence processing package described here.  相似文献   

4.
SUMMARY: Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variations in closely related microbial species, strains or isolates. Some SNPs confer selective advantages for microbial pathogens during infection and many others are powerful genetic markers for distinguishing closely related strains or isolates that could not be distinguished otherwise. To facilitate SNP discovery in microbial genomes, we have developed a web-based application, SNPsFinder, for genome-wide identification of SNPs. SNPsFinder takes multiple genome sequences as input to identify SNPs within homologous regions. It can also take contig sequences and sequence quality scores from ongoing sequencing projects for SNP prediction. SNPsFinder will use genome sequence annotation if available and map the predicted SNP regions to known genes or regions to assist further evaluation of the predicted SNPs for their functional significance. SNPsFinder can generate PCR primers for all predicted SNP regions according to user's input parameters to facilitate experimental validation. The results from SNPsFinder analysis are accessible through the World Wide Web. AVAILABILITY: The SNPsFinder program is available at http://snpsfinder.lanl.gov/. SUPPLEMENTARY INFORMATION: The user's manual is available at http://snpsfinder.lanl.gov/UsersManual/  相似文献   

5.
Labile memory is thought to be held in the brain as persistent neural network activity. However, it is not known how biologically relevant memory circuits are organized and operate. Labile and persistent appetitive memory in Drosophila requires output after training from the α'β' subset of mushroom body (MB) neurons and from a pair of modulatory dorsal paired medial (DPM) neurons. DPM neurons innervate the entire MB lobe region and appear to be pre- and postsynaptic to the MB, consistent with a recurrent network model. Here we identify a role after training for synaptic output from the GABAergic anterior paired lateral (APL) neurons. Blocking synaptic output from APL neurons after training disrupts labile memory but does not affect long-term memory. APL neurons contact DPM neurons most densely in the α'β' lobes, although their processes are intertwined and contact throughout all of the lobes. Furthermore, APL contacts MB neurons in the α' lobe but makes little direct contact with those in the distal α lobe. We propose that APL neurons provide widespread inhibition to stabilize and maintain synaptic specificity of a labile memory trace in a recurrent DPM and MB α'β' neuron circuit.  相似文献   

6.
MOTIVATION: Multiple sequence alignment is an important tool in computational biology. In order to solve the task of computing multiple alignments in affordable time, the most commonly used multiple alignment methods have to use heuristics. Nevertheless, the computation of optimal multiple alignments is important in its own right, and it provides a means of evaluating heuristic approaches or serves as a subprocedure of heuristic alignment methods. RESULTS: We present an algorithm that uses the divide-and-conquer alignment approach together with recent results on search space reduction to speed up the computation of multiple sequence alignments. The method is adaptive in that depending on the time one wants to spend on the alignment, a better, up to optimal alignment can be obtained. To speed up the computation in the optimal alignment step, we apply the A* algorithm, which leads to a procedure provably more efficient than previous exact algorithms. We also describe our implementation of the algorithm and present results showing the effectiveness and limitations of the procedure.

7.
One of the most challenging parts of large scale sequencing projects is the identification of functional elements encoded in a genome. Recently, studies of genomes of up to six different Saccharomyces species have demonstrated that a comparative analysis of genome sequences from closely related species is a powerful approach to identify open reading frames and other functional regions within genomes [Science 301 (2003) 71, Nature 423 (2003) 241]. Here, we present a comparison of selected sequences from Sordaria macrospora to their corresponding Neurospora crassa orthologous regions. Our analysis indicates that due to the high degree of sequence similarity and conservation of overall genomic organization, S. macrospora sequence information can be used to simplify the annotation of the N. crassa genome.  相似文献   

8.
Molecular information is crucial for species identification when facing challenging morphology‐based specimen identifications. The use of DNA barcodes partially solves this problem, but in some cases when PCR is not an option (e.g., primers are not available, problems in reaction standardization), amplification‐free approaches could be an optimal alternative. Recent advances in DNA sequencing, like the MinION device from Oxford Nanopore Technologies (ONT), make it possible to obtain genomic data with low laboratory and technical requirements, and at a relatively low cost. In this study, we explore ONT sequencing for molecular species identification from a total DNA sample obtained from a neotropical rodent and we also test the technology for complete mitochondrial genome reconstruction via genome skimming. We were able to obtain “de novo” the complete mitogenome of a specimen from the genus Melanomys (Cricetidae: Sigmodontinae) with average depth coverage of 78X using ONT‐only data and by combining multiple assembly routines. Our pipeline for automated species identification was able to identify the sample using unassembled sequence data (raw) in a reasonable computing time, which was substantially reduced when a priori information related to the organism identity was known. Our findings suggest ONT sequencing as a suitable candidate to solve species identification problems in metazoan nonmodel organisms and generate complete mtDNA datasets.

9.
10.
MOTIVATION: Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in O(γn²) memory, where γ is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences only. RESULTS: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O(αn) space, where α is the sum of the lengths of constraints and usually α < n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints. AVAILABILITY: http://genome.life.nctu.edu.tw/MUSICME.
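
The divide-and-conquer idea mentioned in this abstract can be illustrated with the classic linear-space (Hirschberg-style) split for ordinary, unconstrained global alignment. The sketch below is not the MUSICME algorithm and ignores constraints entirely; the scoring values and the function names `nw_last_row` and `split_column` are assumptions chosen for illustration. It shows how keeping only two rows of the dynamic programming table is enough to find where an optimal alignment crosses the middle of the first sequence.

```python
def nw_last_row(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> list:
    """Last row of the global alignment score table, using O(len(b)) memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def split_column(a: str, b: str) -> tuple:
    """Column of b at which an optimal alignment crosses the midpoint of a.

    Returns (column, optimal_score). A full divide-and-conquer aligner would
    recurse on (a[:mid], b[:column]) and (a[mid:], b[column:]).
    """
    mid = len(a) // 2
    left = nw_last_row(a[:mid], b)
    # Scores for aligning the second half of a against every suffix of b.
    right = nw_last_row(a[mid:][::-1], b[::-1])[::-1]
    totals = [l + r for l, r in zip(left, right)]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals[best]


column, score = split_column("GATTACA", "GCATGCU")
print(column, score)
```

The score returned by the split equals the last entry of the full two-row computation, which is what makes the recursion safe: each half can be solved independently without ever storing the full quadratic table.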

11.
Samples from diverse upland soils that oxidize atmospheric methane were characterized with regard to methane oxidation activity and the community composition of methanotrophic bacteria (MB). MB were identified on the basis of the detection and comparative sequence analysis of the pmoA gene, which encodes a subunit of particulate methane monooxygenase. MB commonly detected in soils were closely related to Methylocaldum spp., Methylosinus spp., Methylocystis spp., or the "forest sequence cluster" (USC alpha), which has previously been detected in upland soils and is related to pmoA sequences of type II MB (Alphaproteobacteria). As well, a novel group of sequences distantly related (<75% derived amino acid identity) to those of known type I MB (Gammaproteobacteria) was often detected. This novel "upland soil cluster gamma" (USC gamma) was significantly more likely to be detected in soils with pH values of greater than 6.0 than in more acidic soils. To identify active MB, four selected soils were incubated with ¹³CH₄ at low mixing ratios (<50 ppm of volume), and extracted methylated phospholipid fatty acids (PLFAs) were analyzed by gas chromatography-online combustion isotope ratio mass spectrometry. Incorporation of ¹³C into PLFAs characteristic for methanotrophic Gammaproteobacteria was observed in all soils in which USC gamma sequences were detected, suggesting that the bacteria possessing these sequences were active methanotrophs. A pattern of labeled PLFAs typical for methanotrophic Alphaproteobacteria was obtained for a sample in which only USC alpha sequences were detected. The data indicate that different MB are present and active in different soils that oxidize atmospheric methane.

12.
A growing number of solved protein structures display an elongated structural domain, denoted here as alpha-rod, composed of stacked pairs of anti-parallel alpha-helices. Alpha-rods are flexible and expose a large surface, which makes them suitable for protein interaction. Although most likely originating by tandem duplication of a two-helix unit, their detection using sequence similarity between repeats is poor. Here, we show that alpha-rod repeats can be detected using a neural network. The network detects more repeats than are identified by domain databases using multiple profiles, with a low level of false positives (<10%). We identify alpha-rod repeats in approximately 0.4% of proteins in eukaryotic genomes. We then investigate the results for all human proteins, identifying alpha-rod repeats for the first time in six protein families, including proteins STAG1-3, SERAC1, and PSMD1-2 & 5. We also characterize a short version of these repeats in eight protein families of Archaeal, Bacterial, and Fungal species. Finally, we demonstrate the utility of these predictions in directing experimental work to demarcate three alpha-rods in huntingtin, a protein mutated in Huntington's disease. Using yeast two hybrid analysis and an immunoprecipitation technique, we show that the huntingtin fragments containing alpha-rods associate with each other. This is the first definition of domains in huntingtin and the first validation of predicted interactions between fragments of huntingtin, which sets up directions toward functional characterization of this protein. An implementation of the repeat detection algorithm is available as a Web server with a simple graphical output: http://www.ogic.ca/projects/ard. This can be further visualized using BiasViz, a graphic tool for representation of multiple sequence alignments.

13.
About 63 species of Dendrobium are identified in China, making the identification of the origin of a particular Dendrobium species on the consumer market very difficult. We report an evaluation of multiple species-specific probes screened from genomic DNA for the identification of closely related Dendrobium species, based on DNA array hybridization. Fourteen species-specific probes were screened from five closely related Dendrobium species, D. aurantiacum Kerr, D. officinale Kimura et Migo, D. nobile Lindl., D. chrysotoxum Lindl. and D. fimbriatum Hook., based on the SSH-Array technology we developed. Various commercial Dendrobium samples and unrelated samples were definitively identified. The specificity and accuracy of the multiple species-specific probes for species identification were assessed by identifying various commercial Dendrobium samples (Herba Dendrobii). Hybridization patterns of these multiple probes on digested genomic DNAs of Dendrobium species indicated that there are distinct polymorphic sequence fragments in the higher eukaryotes. This is the first report on the detection and utilization of multiple species-specific probes of Dendrobium in whole genomic DNA, and these probes could be useful tools not only as a new technical platform for identifying closely related species but also for epidemiological studies on higher eukaryotes.

14.
Identifying multiple enzyme targets for metabolic engineering is very critical for redirecting cellular metabolism to achieve desirable phenotypes, e.g., overproduction of a target chemical. The challenge is to determine which enzymes and how much of these enzymes should be manipulated by adding, deleting, under-, and/or over-expressing associated genes. In this study, we report the development of a systematic multiple enzyme targeting method (SMET), to rationally design optimal strains for target chemical overproduction. The SMET method combines both elementary mode analysis and ensemble metabolic modeling to derive SMET metrics including l-values and c-values that can identify rate-limiting reaction steps and suggest which enzymes and how much of these enzymes to manipulate to enhance product yields, titers, and productivities. We illustrated, tested, and validated the SMET method by analyzing two networks, a simple network for concept demonstration and an Escherichia coli metabolic network for aromatic amino acid overproduction. The SMET method could systematically predict simultaneous multiple enzyme targets and their optimized expression levels, consistent with experimental data from the literature, without performing an iterative sequence of single-enzyme perturbation. The SMET method was much more efficient and effective than single-enzyme perturbation in terms of computation time and finding improved solutions.  相似文献   

15.
Shen HH  Huang AM  Hoheisel J  Tsai SF 《Genomics》2001,71(1):21-33
A new member of the NAP/SET gene family, named MB20, was isolated from a mouse brain cDNA library by virtue of its CAG trinucleotide repetitive sequence and a brain-specific gene expression pattern. The complementary DNA sequence predicted an open reading frame of 545 amino acids, with four copies of an 11-amino-acid direct repeat. The consensus sequence for these repeats, PKE-P--K-EE, is present in the largest subunit of murine neurofilament (NF-H). The MB20 protein sequence is homologous to nucleosome assembly proteins of several species, and its C-terminus is homologous to SET proteins. Immunoblot analysis revealed that MB20 protein is expressed in the brain. Transient transfection and immunofluorescence microscopy demonstrated that MB20 is distributed in the cytoplasm as well as in the nucleus. Deletion of the N-terminal end imparts the complete localization of MB20 protein to the nucleus. The ability of MB20 to bind histone proteins was analyzed by sucrose gradient sedimentation and by retention of histone proteins by immobilized MB20 protein. On the basis of its expression pattern, predicted sequence, and protein properties, we propose that MB20 plays a unique role in modulating nucleosome structure and gene expression during brain development.  相似文献   

16.
Liu C  Shi L  Xu X  Li H  Xing H  Liang D  Jiang K  Pang X  Song J  Chen S 《PloS one》2012,7(5):e35146
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, "DNA barcode" actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers ITS2, rbcL, matK, psbA-trnH, and CO1 was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interest, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications.

17.
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 MB), S. kudriavzevii (11.18 MB) and C. elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
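
Read depth in the sense used above is conventionally estimated as (number of reads × read length) / genome size. The sketch below turns that relation around to estimate how many reads a target depth implies. The 100 bp read length is an assumption made for illustration and is not stated in the abstract; the function names are hypothetical.

```python
import math


def read_depth(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, read_length_bp: int, depth: float) -> int:
    """Smallest number of reads that reaches the requested average depth."""
    return math.ceil(depth * genome_size_bp / read_length_bp)


# Genome size taken from the abstract; read length (100 bp) is assumed.
ecoli_bp = 4_600_000
n = reads_needed(ecoli_bp, read_length_bp=100, depth=50)
print(n)                             # number of 100 bp reads for 50X
print(read_depth(n, 100, ecoli_bp))  # depth actually achieved with n reads
```

This is an average only; real coverage is uneven across a genome, which is one reason the study above relies on real rather than simulated data sets.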

18.
The Sea URchin Fibrillar (SURF) domain is a four-cysteine module present in the amino-propeptide of the sea urchin 2alpha fibrillar collagen chain. Despite numerous international genome and expressed sequence tag projects, computer searches have so far failed to identify similar domains in other species. Here, we have characterized a new sea urchin protein of 2656 amino acids made up of a series of epidermal growth factor-like and SURF modules. From its striking similarity to the modular organization of fibropellins, we called this new protein fibrosurfin. This protein is acidic with a calculated pI of 4.12. Eleven of the 17 epidermal growth factor-like domains correspond to the consensus sequence of calcium-binding type. By Western blot and immunofluorescence analyses, this protein is not detectable during embryogenesis. In adult tissues, fibrosurfin is co-localized with the amino-propeptide of the 2alpha fibrillar collagen chain in several collagenous ligaments, i.e., test sutures, spine ligaments, peristomial membrane, and to a lesser extent, tube feet. Finally, immunogold labeling indicates that fibrosurfin is an interfibrillar component of collagenous tissues. Taken together, the data suggest that proteins possessing SURF modules are localized in the vicinity of mineralized tissues and could be responsible for the unique properties of sea urchin mutable collagenous tissues.  相似文献   

19.
Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences. SRComp has been benchmarked on four large NGS datasets, where experimental results show that it can run 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is a read sequence compression tool that is particularly valuable in certain applications where compression time is of major concern.  相似文献   
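
The two ingredients named in this abstract, lexicographic sorting of reads and Elias omega integer coding, can be sketched as follows. This is not the SRComp implementation: Python's built-in `sorted` stands in for burstsort, and the choice of which integers to encode (here, one plus the length of the prefix each read shares with its predecessor) is a hypothetical illustration, since the abstract does not specify it.

```python
def elias_omega_encode(n: int) -> str:
    """Elias omega code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias omega coding is defined for integers >= 1")
    code = "0"  # terminating bit
    while n > 1:
        group = bin(n)[2:]    # binary representation of n
        code = group + code   # prepend the group
        n = len(group) - 1    # next group encodes this length minus one
    return code


def elias_omega_decode(bits: str) -> int:
    """Decode a single Elias omega codeword from the start of `bits`."""
    n, i = 1, 0
    while bits[i] == "1":
        length = n + 1
        n = int(bits[i:i + length], 2)
        i += length
    return n


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


reads = ["ACGTT", "ACGTA", "TTGCA", "ACGAA"]
ordered = sorted(reads)  # stand-in for burstsort
# Hypothetical integer stream: shared-prefix length with the previous read, plus 1.
prefix_codes = [
    elias_omega_encode(shared_prefix_len(prev, cur) + 1)
    for prev, cur in zip(ordered, ordered[1:])
]
print(ordered)
print(prefix_codes)
```

Sorting places similar reads next to each other, so the integers describing each read relative to its neighbour tend to be small, and Elias omega assigns short codewords to small integers without needing to know an upper bound in advance.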

20.
Next-generation sequencing–based metagenomics has made it possible to identify microorganisms in characteristic habitats without the need for lengthy cultivation. Importantly, clinically relevant phenomena such as resistance to medication, virulence or interactions with the environment can already vary within species. Therefore, a major current challenge is to reconstruct individual genomes from the sequencing reads at the level of strains, and not just the level of species. However, strains of one species can differ by only a small number of variants, which makes it difficult to distinguish them. Despite considerable recent progress, related approaches have remained fragmentary so far. Here, we present StrainXpress as a comprehensive solution to the problem of strain-aware metagenome assembly from next-generation sequencing reads. In experiments, StrainXpress reconstructs strain-specific genomes from metagenomes that involve up to >1000 strains and proves to successfully deal with poorly covered strains. The amount of reconstructed strain-specific sequence exceeds that of the current state-of-the-art approaches by on average 26.75% across all data sets (first quartile: 18.51%, median: 26.60%, third quartile: 35.05%).
