Similar Articles
A total of 20 similar articles were retrieved (search time: 15 ms).
1.
The explosive growth in biological data in recent years has led to the development of new methods for identifying DNA sequences. Many algorithms have recently been developed that search DNA data for unique sequences. This paper considers the application of the Burrows-Wheeler transform (BWT) to the problem of unique DNA sequence identification. The BWT transforms a block of data into a format that is extremely well suited for compression. This paper presents a time-efficient algorithm that searches for unique DNA sequences in a set of genes. The algorithm is applicable to the identification of yeast species and other DNA sequence sets.
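As a rough illustration of the transform named above (a minimal sketch of the textbook construction, not the paper's implementation), the BWT can be computed by sorting all rotations of the input and taking the last column:

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Appends a sentinel '$' (assumed absent from s and lexicographically
    smallest), sorts all rotations, and returns the last column, which
    groups characters by their following context and so compresses well.
    """
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("GATTACA"))  # -> 'ACTGA$TA'
```

Production implementations derive the transform from a suffix array rather than materializing all rotations, which is what makes genome-scale indexing feasible.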

2.
The objectives of this study were to infer phenotypic causal networks involving gestation length (GL) and calving difficulty (CD) in 1850 primiparous Japanese Black heifers, and the birth weight (BWT), withers height (WH) and chest girth (CHG) of their full blood calves, and to compare the causal effects among them. The inductive causation (IC) algorithm was employed to search for causal links among these traits; it was applied to the posterior distribution of the residual (co)variance matrix of a multiple-trait sire-maternal grand sire (MGS) model. The IC algorithm implemented with 95% and 90% highest posterior density intervals detected only one structure, with links between GL and BWT (WH or CHG) and between BWT (WH or CHG) and CD, although their directions were not resolved. Therefore, a possible causal structure based on the networks obtained from the IC algorithm [GL→BWT (WH or CHG)→CD] was fitted using a structural equation model to infer structural coefficients between the traits. The structural coefficients of GL on BWT and of BWT on CD on the observable scale showed that an extra day of GL led to a 270-g gain in BWT, and a 1-kg increase in BWT increased the risk of dystocia by 1.1%, in this causal structure. Similarly, an increase in GL by 1 day resulted in a 2.1 (2.0)-mm growth in WH (CHG), and a 1-cm increase in WH (CHG) increased the risk of dystocia by 1.2% (0.9%). The structural equation model was also fitted to alternative causal structures, which added a directed link GL→CD to the structures described above. The inferred structural coefficients under the alternative structures were almost the same as the corresponding ones for GL→BWT (WH or CHG)→CD. However, the direct causal effect of the extra link from GL to CD was significant (P<0.05) and similar in magnitude to the indirect causal effect of GL on CD mediated by BWT (WH or CHG). This suggests that maternal genetic effects might not be removed completely from the residual variance components in the sire-MGS model, and that applying the IC algorithm to variances from such a model could detect an incorrect structure. Nonetheless, fitting the structural equation model to the causal structure provided useful information, such as the magnitude of the causal effects between the traits.
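To make the reported coefficients concrete, the fitted chain GL→BWT→CD is a recursive structural equation model; in conventional (illustrative) notation, with λ denoting structural coefficients and e residuals:

```latex
\[
\mathrm{BWT} = \lambda_{\mathrm{GL}\rightarrow\mathrm{BWT}}\,\mathrm{GL} + e_1,
\qquad
\mathrm{CD} = \lambda_{\mathrm{BWT}\rightarrow\mathrm{CD}}\,\mathrm{BWT} + e_2
\]
```

With the reported estimates, the indirect effect of GL on CD through BWT is the product of the two coefficients: 0.27 kg/day × 1.1 %/kg ≈ 0.3 % additional dystocia risk per extra day of gestation.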

3.
MOTIVATION: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings, such as the human genome, in main memory. For example, a BWT index for the human genome (about 3 billion characters) occupies only around 1 GB. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often for finding local alignments (pairs of similar substrings, with gaps allowed). Without indexing, one can use dynamic programming to find all local alignments between a text T and a pattern P in O(|T||P|) time, but this is too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. RESULTS: In this article, we show how to build a software tool called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O(|T|^0.628 |P|) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement for BLAST, as BLAST is still several times faster than BWT-SW for long patterns, and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check the accuracy of BLAST and found that BLAST only rarely misses significant alignments). AVAILABILITY: www.cs.hku.hk/~ckwong3/bwtsw CONTACT: twlam@cs.hku.hk
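The core index primitive behind BWT-SW is FM-index backward search; below is a simplified sketch of that primitive for exact matching (my own rendering, not the BWT-SW code, which additionally interleaves dynamic programming to allow mismatches and gaps):

```python
def fm_index(bwt: str):
    """Build C (count of smaller characters) and occ (prefix ranks)."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] for c in alphabet}   # occ[c][i] = count of c in bwt[:i]
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return C, occ

def backward_search(bwt, C, occ, pattern):
    """Return the suffix-array interval [lo, hi) of exact matches."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0                # pattern does not occur
    return lo, hi

b = "ACTGA$TA"                         # BWT of "GATTACA$"
C, occ = fm_index(b)
lo, hi = backward_search(b, C, occ, "TA")
print(hi - lo)                         # 1 occurrence of "TA" in "GATTACA"
```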

4.
An efficient and reliable software-based ECG data compression and transmission scheme is proposed here. The algorithm has been applied to 12-lead ECG data taken from the PTB diagnostic ECG database (PTB-DB). First, R-peaks are detected by a differentiation and squaring technique and the QRS regions are located. To achieve strictly lossless compression in the QRS regions and a tolerable lossy compression in the rest of the signal, two different compression algorithms are used. The whole compression scheme is designed so that the compressed file contains only ASCII characters. These characters are transmitted using the internet-based Short Message Service (SMS), and at the receiving end the original ECG signal is recovered by reversing the compression logic. It is observed that the proposed algorithm can reduce the file size significantly (compression ratio: 22.47) while preserving ECG signal morphology.
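A hedged sketch of the differentiate-and-square R-peak detection step described above (the threshold fraction and refractory window are illustrative tuning values, not parameters from the paper):

```python
import numpy as np

def detect_r_peaks(ecg, fs, thresh_frac=0.5, refractory=0.2):
    """Locate R peaks: differentiate to emphasize the steep QRS slopes,
    square to make the energy positive, then threshold with a
    refractory window so each beat is counted once."""
    energy = np.diff(ecg) ** 2
    threshold = thresh_frac * energy.max()
    min_gap = max(1, int(refractory * fs))
    peaks, i = [], 0
    while i < len(energy):
        if energy[i] > threshold:
            # refine to the local ECG maximum inside the window
            peaks.append(i + int(np.argmax(ecg[i:i + min_gap])))
            i += min_gap
        else:
            i += 1
    return peaks
```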

5.
1 Introduction. The electrocardiogram (ECG) offers a lot of important information for the diagnosis of heart diseases. Because abnormal ECG waveforms occur in unknown situations, they can be caught by long-time continuous monitoring. A Holter monitoring system, which can record 24 hours of ECG data, is one of the effective means of providing this function. Although large-scale IC memories have been developed and are available to store long-time ECG data, it is very difficult and troublesome to process, store or transmit such a large amount of data. In the digitized ECG data, there are…

6.
Data compression is concerned with how information is organized in data. Efficient storage means removing redundancy from the data stored in the DNA molecule. Data compression algorithms remove such redundancy and are used to understand biologically important molecules. We present a compression algorithm, "DNABIT Compress", for DNA sequences, based on a novel scheme of assigning binary bits to short segments of DNA bases to compress both repetitive and non-repetitive DNA sequences. Our proposed algorithm achieves the best compression ratio for DNA sequences, especially for larger genomes. Significantly better compression results show that the "DNABIT Compress" algorithm outperforms the other compression algorithms compared. While achieving the best compression ratios for DNA sequences (genomes), our new DNABIT Compress algorithm also significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (unique bit codes) to fragments of a DNA sequence (exact repeats, reverse repeats) is a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, where the best existing methods could not achieve a ratio below 1.72 bits/base.
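The paper's unique bit-code tables for repeat fragments are not reproduced here, but the baseline idea of assigning binary bits to bases can be sketched as plain 2-bit packing (DNABIT Compress improves on this 2 bits/base ceiling by coding repeat fragments specially):

```python
# Fixed 2-bit codes -- an illustrative baseline, not the paper's tables.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    """Pack 4 bases per byte, i.e. exactly 2 bits/base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        # a partial final group uses the low bits; real codecs also
        # store the sequence length to decode it unambiguously
        out.append(byte)
    return bytes(out)

print(len(pack("ACGT" * 250)))  # 250 bytes for 1000 bases
```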

7.
ECG data compression techniques have received extensive attention in ECG analysis. Numerous data compression algorithms for ECG signals have been proposed during the last three decades. We describe two algorithms based on the scan-along polygonal approximation algorithm (SAPA) that are suitable for multichannel ECG data reduction on a microprocessor-based system. One is a modification of SAPA (MSAPA) that adopts integer-division table lookup to speed up data reduction; the other (CSAPA) combines MSAPA with the turning-point (TP) algorithm to preserve ST-segment signals. Results show that our algorithms achieve a compression ratio of more than 5:1 with a percent rms difference (PRD) from the original signal of less than 3.5%. In addition, the maximum execution time of MSAPA for processing one data point is about 50 μs. Moreover, by employing the TP algorithm, the CSAPA algorithm retains all of the details of the ST segment, which are important in ischaemia diagnosis.
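A hedged sketch of the turning-point (TP) step that CSAPA uses to protect the ST segment (my rendering of the classic 2:1 TP rule, not the authors' code):

```python
def turning_point(samples):
    """Classic turning-point compression (2:1): from each pair
    (x1, x2) following the reference x0, keep x1 when the slope
    changes sign there (a local extremum), otherwise keep x2."""
    out = [samples[0]]
    x0 = samples[0]
    for i in range(1, len(samples) - 1, 2):
        x1, x2 = samples[i], samples[i + 1]
        kept = x1 if (x1 - x0) * (x2 - x1) < 0 else x2
        out.append(kept)
        x0 = kept
    return out

print(turning_point([0, 2, 1, 3, 5, 4, 6]))  # [0, 2, 5, 4]
```

Note how the local peak at 2 and the dip at 4 survive the 2:1 reduction, which is why TP preserves morphological detail better than plain decimation.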

8.
A new string searching algorithm is presented, aimed at searching for the occurrence of character patterns in longer character texts. The algorithm, specifically designed for nucleic acid sequence data, is essentially derived from the Boyer–Moore method (Comm. ACM, 20, 762–772, 1977). Both pattern and text data are compressed so that the natural 4-letter alphabet of nucleic acid sequences is considerably enlarged. The string search starts from the last character of the pattern and proceeds in large jumps through the text to be searched. The data compression and searching algorithm allows one to avoid searching for patterns not present in the text, as well as to inspect, for each pattern, all text characters until the exact match with the text is found. These considerations are supported by empirical evidence and comparisons with other methods.
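The authors' method adds alphabet-enlarging compression on top of the Boyer–Moore skip idea; here is a hedged sketch of the skip search itself, in the simplified Horspool form and without the compression step:

```python
def horspool(text: str, pattern: str):
    """Boyer-Moore-Horspool: compare from the pattern's last character
    and jump ahead using a bad-character shift table."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # characters absent from the pattern allow a full jump of m
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos + m <= len(text):
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool("GATTACATTACA", "TACA"))  # [3, 8]
```

Enlarging the alphabet (e.g. packing several bases per symbol) makes the shift table sparser, so the expected jumps grow, which is the speed-up the paper exploits.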

9.
This research investigated two sources of sire-specific genetic effects on the birth weight (BWT) and weaning weight (WWT) of Bruna dels Pirineus beef calves. More specifically, we focused on the influence of genes located in the non-autosomal region of the Y chromosome and on the contribution of paternal imprinting. Our analyses were performed on 8130 BWT and 1245 WWT records from 12 and 2 purebred herds, respectively, collected between 1986 and 2010. All animals included in the study were registered in the Yield Recording Scheme of the Bruna dels Pirineus breed. Both BWT and WWT were analyzed using a univariate linear animal model, and the relevance of paternal imprinting and Y chromosome-linked effects was checked by the deviance information criterion (DIC). In addition to sire-specific and direct genetic effects, our model accounted for random permanent effects (dam and herd-year-season) and three systematic sources of variation: sex of the calf (male or female), age of the dam at calving (six levels) and birth type (single or twin). Both weight traits showed remarkable effects of the Y chromosome, whereas paternal imprinting was only revealed for WWT. Note that the differences in DIC between the preferred model and the remaining ones exceeded 39 000 and 2 800 000 DIC units for BWT and WWT, respectively. It is important to highlight that the Y chromosome accounted for ∼2% and ∼6% of the total phenotypic variance for BWT and WWT, respectively, and paternal imprinting accounted for ∼13% of the phenotypic variance for WWT. These results revealed two relevant sources of sire-specific genetic variability with potential contributions to the current breeding scheme of the Bruna dels Pirineus beef cattle breed; moreover, these sire-specific effects could be included in other beef cattle breeding programs or, at least, should be considered and appropriately analyzed.

10.
High-efficiency video compression technology is of primary importance to the storage and transmission of digital medical video in modern medical communication systems. To further improve the compression performance of medical ultrasound video, two innovative technologies based on diagnostic region-of-interest (ROI) extraction using the high efficiency video coding (H.265/HEVC) standard are presented in this paper. First, an effective ROI extraction algorithm based on image textural features is proposed to strengthen the applicability of ROI detection results in the H.265/HEVC quad-tree coding structure. Second, a hierarchical coding method based on transform coefficient adjustment and a quantization parameter (QP) selection process is designed to encode ROIs and non-ROIs differently. Experimental results demonstrate that the proposed optimization strategy significantly improves coding performance, achieving an average BD-BR reduction of 13.52% and a BD-PSNR gain of 1.16 dB compared to H.265/HEVC (HM15.0). The proposed medical video coding algorithm is expected to satisfy low bit-rate compression requirements for modern medical communication systems.
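As a generic illustration of the QP-selection idea only (the paper's transform-coefficient adjustment and encoder integration are more involved than this sketch):

```python
def qp_map(roi_mask, base_qp=32, non_roi_offset=6):
    """Spend bits where diagnosis happens: give ROI coding units the
    base QP and non-ROI units a coarser QP. Both base_qp and
    non_roi_offset are illustrative values, not the paper's settings."""
    return [[base_qp if is_roi else base_qp + non_roi_offset
             for is_roi in row]
            for row in roi_mask]

print(qp_map([[True, False], [False, False]]))  # [[32, 38], [38, 38]]
```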

11.
(Co)variance components and genetic parameters for weight at birth (BWT), weaning (3WT), and 6, 9 and 12 months of age (6WT, 9WT and 12WT, respectively) and for first greasy fleece weight (GFW) of Bharat Merino sheep, maintained at the Central Sheep and Wool Research Institute, Avikanagar, Rajasthan, India, were estimated by restricted maximum likelihood, fitting six animal models with various combinations of direct and maternal effects. Data were collected over a period of 10 years (1998 to 2007). A log-likelihood ratio test was used to select the most appropriate univariate model for each trait, which was subsequently used in bivariate analyses. Heritability estimates for BWT, 3WT, 6WT, 9WT, 12WT and first GFW were 0.05 ± 0.03, 0.04 ± 0.02, 0.00, 0.03 ± 0.03, 0.09 ± 0.05 and 0.05 ± 0.03, respectively. There was no evidence of a maternal genetic effect on the traits under study. The maternal permanent environmental effect contributed 19% of the phenotypic variance for BWT, 6% to 11% from 3WT to 9WT, and 11% for first GFW. The maternal permanent environmental effect on post-3WT weights was a carryover of maternal influences during the pre-weaning period. Only a low rate of genetic progress seems possible in the flock through selection. Direct genetic correlations between body weight traits were positive and ranged from 0.36 between BWT and 6WT to 0.94 between 3WT and 6WT and between 6WT and 12WT. Genetic correlations of 3WT with 6WT, 9WT and 12WT were high and positive (0.94, 0.93 and 0.93, respectively), suggesting that genetic gain in post-3WT weights will be maintained if the selection age is reduced to 3 months. The genetic correlations of GFW with live weights were 0.01, 0.16, 0.18, 0.40 and 0.32 for BWT, 3WT, 6WT, 9WT and 12WT, respectively. Correlations of permanent environmental effects of the dam across traits were high and positive (0.45 to 0.98).
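For reference, the percentages quoted above are ratios of variance components to the total phenotypic variance; in the usual (illustrative) notation:

```latex
\[
h^2 = \frac{\sigma^2_a}{\sigma^2_p}, \qquad
c^2 = \frac{\sigma^2_{pe}}{\sigma^2_p}, \qquad
\sigma^2_p = \sigma^2_a + \sigma^2_{pe} + \sigma^2_e
\]
```

Here h² is the direct heritability (e.g. 0.05 for BWT) and c² is the maternal permanent environmental proportion (e.g. 0.19 for BWT); the exact decomposition of σ²p depends on which of the six fitted models was retained.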

12.
Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence become an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference-based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order, and then Elias omega-based integer coding to encode the sorted read sequences. SRComp has been benchmarked on four large NGS datasets, where experimental results show that it runs 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is particularly valuable in applications where compression time is a major concern.
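A hedged sketch of the Elias omega code that SRComp applies to the sorted reads (encoder only, my own rendering; burstsort and the bit-level I/O are omitted):

```python
def elias_omega(n: int) -> str:
    """Elias omega code for n >= 1: recursively prepend the binary form
    of each value, where each prefix group encodes the length of the
    next group; a final '0' terminates the code."""
    assert n >= 1
    code = "0"
    while n > 1:
        b = bin(n)[2:]       # binary representation, MSB first
        code = b + code
        n = len(b) - 1       # next, encode (length - 1)
    return code

for n in (1, 2, 3, 16, 100):
    print(n, elias_omega(n))
# 1 -> 0, 2 -> 100, 3 -> 110, 16 -> 10100100000, 100 -> 1011011001000
```

The code is short for small integers, which suits the small gaps between lexicographically adjacent sorted reads.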

13.
This article introduces an algorithm for the lossless compression of DNA files, which contain annotation text besides the nucleotide sequence. First, a grammar is specifically designed to capture the regularities of the annotation text. A reversible transformation uses the grammar rules to represent the original file equivalently as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The output size of the grammar's decision-making process is optimized by extending the states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement when compared to the general-purpose methods currently used for DNA files.

14.
The impact of NaOH as a ballast water treatment (BWT) on microbial community diversity was assessed using 16S rRNA gene-based Ion Torrent sequencing with its new 400-base chemistry. Ballast water samples from a Great Lakes ship were collected from the intake and discharge of both control and NaOH-treated (pH 12) tanks and were analyzed in duplicate. One set of duplicates was treated with the membrane-impermeable DNA cross-linking reagent propidium monoazide (PMA) prior to PCR amplification to differentiate between live and dead microorganisms. Ion Torrent sequencing generated nearly 580,000 reads for 31 bar-coded samples and revealed alterations of the microbial community structure in ballast water that had been treated with NaOH. Rarefaction analysis of the Ion Torrent sequencing data showed that BWT using NaOH significantly decreased microbial community diversity relative to the control discharge (p<0.001). UniFrac distance-based principal coordinate analysis (PCoA) plots and UPGMA tree analysis revealed that NaOH-treated ballast water microbial communities differed from both intake communities and control discharge communities. After NaOH treatment, bacteria from the genus Alishewanella became dominant, accounting for <0.5% of the total reads in intake samples but more than 50% of the reads in the treated discharge samples. The only apparent difference in microbial community structure between PMA-processed and non-PMA samples occurred in intake water samples, which exhibited a significantly higher amount of PMA-sensitive cyanobacteria/chloroplast 16S rRNA than their corresponding non-PMA total DNA samples. The community assembly obtained using Ion Torrent sequencing was comparable to that obtained from a subset of samples that were also subjected to 454 pyrosequencing. This study showed the efficacy of alkaline ballast water treatment in reducing ballast water microbial diversity and demonstrated the application of new Ion Torrent sequencing techniques to microbial community studies.

15.

Background  

With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them is suitable for compressing RNA sequences together with their secondary structures. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA, based on compression.

16.
In this paper, two novel and simple wavelet-threshold-based ECG compression algorithms, target distortion level (TDL) and target data rate (TDR), are proposed for real-time applications. The issues with using objective error measures, such as percentage root mean square difference (PRD) and root mean square error (RMSE), as quality measures in quality-controlled/guaranteed algorithms are investigated with different sets of experiments. For the proposed TDL and TDR algorithms, data rate variability and reconstructed signal quality are evaluated under different ECG signal test conditions. Experimental results show that the TDR algorithm achieves the required compression data rate to meet the demands of wired/wireless links, while the TDL algorithm does not. The compression performance is assessed in terms of the number of iterations required to achieve convergence and accuracy, reconstructed signal quality and coding delay. The reconstructed signal quality is evaluated by a correct diagnosis (CD) test through visual inspection. Three sets of ECG data from three different databases, the MIT-BIH Arrhythmia (mita) (Fs=360 Hz, 11 b/sample), the Creighton University Ventricular Tachyarrhythmia (cuvt) (Fs=250 Hz, 12 b/sample) and the MIT-BIH Supraventricular Arrhythmia (mitsva) (Fs=128 Hz, 10 b/sample), are used for this work. For each set of ECG data, the compression ratio (CR) range is defined. A CD value of 100% is achieved for CR ≤ 12, CR ≤ 8 and CR ≤ 4 for data from the mita, cuvt and mitsva databases, respectively. The experimental results demonstrate that the proposed TDR algorithm is suitable for real-time applications.
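For reference, the PRD measure discussed above is conventionally defined as follows, with x the original samples and x̂ the reconstruction:

```latex
\[
\mathrm{PRD} = \sqrt{\frac{\sum_{n=1}^{N}\bigl(x(n)-\hat{x}(n)\bigr)^{2}}
                          {\sum_{n=1}^{N} x(n)^{2}}}\times 100\,\%
\]
```

One reason PRD can mislead as a quality-control target is that some variants subtract the signal mean (or the ADC baseline offset) in the denominator, which substantially changes the reported value for the same reconstruction.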

17.
In this paper, we present a novel approach, Bio-IEDM (biomedical information extraction and data mining), that integrates text mining and predictive modeling to analyze biomolecular networks from biomedical literature databases. Our method consists of two phases. In phase 1, we discuss an efficient semisupervised learning approach to automatically extract biological relationships, such as protein-protein and protein-gene interactions, from biomedical literature databases to construct the biomolecular network. Our method automatically learns patterns based on a few user seed tuples and then extracts new tuples from the biomedical literature based on the discovered patterns. The derived biomolecular network forms a large scale-free graph. In phase 2, we present a novel clustering algorithm to analyze the biomolecular network graph and identify biologically meaningful subnetworks (communities). The clustering algorithm takes into account the characteristics of scale-free network graphs and is based on the local density of a vertex and its neighborhood functions, which can be used to find more meaningful clusters at different density levels. The experimental results indicate that our approach is very effective in extracting biological knowledge from a huge collection of biomedical literature. The integration of data mining and information extraction provides a promising direction for analyzing biomolecular networks.

18.
We present general algorithms for the compression of molecular dynamics (MD) trajectories. The standard ways to store MD trajectories, as text or as raw binary floating-point numbers, result in very large files when efficient simulation programs are used on supercomputers. Our algorithms are based on the observation that differences in atomic coordinates/velocities, in either time or space, are generally smaller than the absolute values of the coordinates/velocities. Also, it is often possible to store values at a lower precision. We apply several compression schemes to compress the resulting differences further. The most efficient algorithms developed here use a block-sorting algorithm in combination with Huffman coding. Depending on the frequency with which frames are stored in the trajectory, either space differences, time differences, or combinations of the two are usually the most efficient. We compare the efficiency of our algorithms with each other and with other algorithms from the literature for various systems: liquid argon, water, a virus capsid solvated in 15 mM aqueous NaCl, and solid magnesium oxide. We perform tests to determine how much precision is necessary to obtain accurate structural and dynamic properties, and also benchmark a parallelized implementation of the algorithms. We obtain compression ratios (compared to single-precision floating point) of 1:3.3–1:35, depending on the frequency of frame storage and the system studied.
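A hedged sketch of the central delta idea (quantize, then store inter-frame differences; the block-sorting/Huffman back end and the space-difference variants are omitted, and the precision value is illustrative):

```python
import numpy as np

def delta_encode(frames, precision=1e-3):
    """frames: (T, N, 3) float array of coordinates. Quantizing to a
    fixed grid and differencing in time yields small integers that
    entropy-code far better than raw floats."""
    q = np.round(np.asarray(frames) / precision).astype(np.int64)
    return q[0], np.diff(q, axis=0)

def delta_decode(first, deltas, precision=1e-3):
    """Invert: prefix-sum the differences, then rescale."""
    q = np.concatenate([first[None], deltas]).cumsum(axis=0)
    return q * precision
```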

19.
We compared two algorithms that are used to assess the number of forward saccades in a reading task from records of eye movements. In one algorithm, saccades are detected by analysing the velocity of eye movements. In the second algorithm, the third derivative of eye position with respect to time (jerk) is used for the detection of saccades. Both algorithms were applied to the same set of data, recorded from 24 subjects reading a German text presented on two different displays. Our subjects read the text at a mean reading speed of 258.5 words/min. Both algorithms were found to produce a similar rate of artefacts in the number of detected saccades (2.5%), provided the detection threshold (velocity or jerk) is set at an appropriate level and the same threshold level is applied to all data. In both algorithms, the rate of artefacts increases with increasing distance of the threshold from its optimum. Inter-individual variation in the rate of artefacts increases more markedly in the jerk-based algorithm. Eye blinks were identified as a major source of artefacts. A remedy is proposed by means of which the rate of artefacts can be reduced.
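A hedged sketch contrasting the two detectors, using finite differences and illustrative thresholds (not the study's implementation or parameter values):

```python
import numpy as np

def detect_saccades(position, fs, vel_thresh=30.0,
                    use_jerk=False, jerk_thresh=1e5):
    """Flag samples belonging to saccades.

    position: eye position (deg); fs: sampling rate (Hz).
    Velocity detector: |dx/dt| > vel_thresh (deg/s).
    Jerk detector: |d^3x/dt^3| > jerk_thresh (deg/s^3).
    """
    v = np.gradient(position) * fs          # velocity
    if not use_jerk:
        return np.abs(v) > vel_thresh
    a = np.gradient(v) * fs                 # acceleration
    j = np.gradient(a) * fs                 # jerk
    return np.abs(j) > jerk_thresh
```

Counting forward saccades then reduces to counting the flagged runs whose net position change is in the reading direction.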

20.
Automatic text categorization is one of the key techniques in information retrieval and data mining. Classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few achieve satisfactory efficiency. In this paper, we present a method that combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation of topics as features in the vector space model (VSM). It reduces the feature count dramatically while keeping the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure, and does so within a much shorter time-frame. Our process improves greatly upon previous work in this field and displays strong potential for a streamlined classification process across a wide range of applications.
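One way to reproduce the LDA+SVM pipeline with scikit-learn (an illustrative sketch, not the authors' code; the topic count and vocabulary size are assumptions):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# term counts -> 100 LDA topic proportions -> linear SVM on the
# reduced features; n_components=100 is an illustrative choice
clf = make_pipeline(
    CountVectorizer(max_features=20000, stop_words="english"),
    LatentDirichletAllocation(n_components=100, random_state=0),
    LinearSVC(),
)
clf.fit(train.data, train.target)
print("accuracy:", clf.score(test.data, test.target))
```

The efficiency gain comes from the SVM training on ~100 dense topic features instead of tens of thousands of sparse term features.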
