Similar articles
 Found 20 similar articles.
1.
An important topic in genomic sequence analysis is the identification of protein coding regions. In this context, several coding DNA model-independent methods, based on the occurrence of specific patterns of nucleotides at coding regions, have been proposed. Nonetheless, these methods have not been completely suitable due to their dependence on an empirically pre-defined window length required for a local analysis of a DNA region. We introduce a method, based on a modified Gabor-wavelet transform (MGWT), for the identification of protein coding regions. This novel transform is tuned to analyze periodic signal components and presents the advantage of being independent of the window length. We compared the performance of the MGWT with other methods using eukaryote datasets. The results show that the MGWT outperforms all assessed model-independent methods with respect to identification accuracy. These results indicate that the source of at least part of the identification errors produced by the previous methods is the fixed working scale. The new method not only avoids this source of errors, but also makes available a tool for detailed exploration of the nucleotide occurrence.

2.
With the rapid progress of the Human Genome Project, a large number of uncharacterized DNA sequences need to be annotated by better algorithms. Recognizing shorter coding sequences of human genes is one of the most important problems in gene recognition, and it is not yet completely solved. This paper is devoted to solving the issue using a new method. The distributions of the three stop codons, i.e., TAA, TAG and TGA, in three phases along coding, noncoding, and intergenic sequences are studied in detail. Using the obtained distributions and other coding measures, a new algorithm for the recognition of shorter coding sequences of human genes is developed. The accuracy of the algorithm is tested on a larger database of human genes. It is found that the average accuracy achieved is as high as 92.1% for sequences with a length of 192 base pairs, which is confirmed by sixfold cross-validation tests. It is hoped that by combining the present method with some existing algorithms, the accuracy of identifying human genes from unannotated sequences will be increased.
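The phase-wise stop codon statistic described above can be illustrated with a minimal sketch. The function name `stop_codon_counts_by_phase` is hypothetical, and the sketch only counts TAA, TAG and TGA in each of the three reading phases of one strand; it is not the recognition algorithm from the abstract, which also uses other coding measures.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codon_counts_by_phase(seq):
    """Return [count_phase0, count_phase1, count_phase2] of stop codons.

    Phase p reads non-overlapping triplets starting at offset p.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    counts = [0, 0, 0]
    for phase in range(3):
        # Step through complete triplets only.
        for i in range(phase, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                counts[phase] += 1
    return counts


if __name__ == "__main__":
    print(stop_codon_counts_by_phase("ATGTAAGCTGA"))
```

A true coding phase of an open reading frame would be expected to contain no in-frame stop codon before its end, which is why the per-phase counts can serve as a coding signal.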

3.
Assessment of protein coding measures.   Total citations: 23 (self-citations: 6, citations by others: 17)
A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However, there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
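As a rough illustration of what "counting oligomers" in a sample window can look like, here is a minimal sketch. The function name `oligomer_counts` and its `k` and `step` parameters are assumptions made for this example; the abstract does not specify the oligomer length or how the counts are turned into a score.

```python
from collections import Counter


def oligomer_counts(window, k=6, step=1):
    """Count the k-mers in a sequence window (hypothetical helper).

    step=1 counts every overlapping k-mer; step=3 with k=3 counts
    in-frame codons starting at the first base of the window.
    """
    window = window.upper()
    return Counter(
        window[i:i + k] for i in range(0, len(window) - k + 1, step)
    )


if __name__ == "__main__":
    print(oligomer_counts("ATGATGATG", k=3, step=1))
```

The resulting count vector is one example of the "number or vector" that a coding measure returns for a window; comparing it with counts from known exonic DNA is left out of this sketch.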

4.
We have modified and improved the GOR algorithm for protein secondary structure prediction by using the evolutionary information provided by multiple sequence alignments, adding triplet statistics, and optimizing various parameters. We have expanded the database used to include the 513 non-redundant domains collected recently by Cuff and Barton (Proteins 1999;34:508-519; Proteins 2000;40:502-511). We have introduced a variable size window that allowed us to include sequences as short as 20-30 residues. A significant improvement over the previous versions of the GOR algorithm was obtained by combining the PSI-BLAST multiple sequence alignments with the GOR method. The new algorithm will form the basis for the future GOR V release on an online prediction server. The average accuracy of the prediction of secondary structure with multiple sequence alignment and full jack-knife procedure was 73.5%. The accuracy of the prediction increases to 74.2% by limiting the prediction to 375 (of 513) sequences having at least 50 PSI-BLAST alignments. The average accuracy of the prediction of the new improved program without using multiple sequence alignments was 67.5%. This is approximately a 3% improvement over the preceding GOR IV algorithm (Garnier J, Gibrat JF, Robson B. Methods Enzymol 1996;266:540-553; Kloczkowski A, Ting K-L, Jernigan RL, Garnier J. Polymer 2002;43:441-449). We have discussed alternatives to the segment overlap (Sov) coefficient proposed by Zemla et al. (Proteins 1999;34:220-223).

5.
A large number of unclassified sequences is still found in public databases, which suggests that there is still need for new investigations in the area. In this contribution, we present a methodology based on Artificial Neural Networks for protein functional classification. A new protein coding scheme, called here Extended-Sequence Coding by Sliding Windows, is presented with the goal of overcoming some of the difficulties of the well-known method Sequence Coding by Sliding Window. The new protein coding scheme uses more than one sliding window length with a weight factor that is proportional to the window length, avoiding the ambiguity problem without ignoring the identity of small subsequences. Accuracy for Sequence Coding by Sliding Windows ranged from 60.1 to 77.7 percent for the first bacterium protein set and from 61.9 to 76.7 percent for the second one, whereas the accuracy for the proposed Extended-Sequence Coding by Sliding Windows scheme ranged from 70.7 to 97.1 percent for the first bacterium protein set and from 61.1 to 93.3 percent for the second one. Additionally, protein sequences classified inconsistently by the Artificial Neural Networks were analyzed by CD-Search, revealing that there is some disagreement in public repositories and calling attention to the relevant issue of error propagation in annotated databases due to incorrectly transferred annotations.

6.
Let A denote an alphabet consisting of n types of letters. Given a sequence S of length L with $v_i$ letters of type i on A, to describe the compositional properties and combinatorial structure of S, we propose a new complexity function of S, called the reciprocal complexity of S, as $C(S) = \prod_{i=1}^{n} \left( \frac{L}{n v_i} \right)^{v_i}$. Based on this complexity measure, an efficient algorithm is developed for classifying and analyzing simple segments of protein and nucleotide sequence databases associated with scoring schemes. The running time of the algorithm is nearly proportional to the sequence length. The program DSR corresponding to the algorithm was written in C++, associated with two parameters (window length and cutoff value) and a scoring matrix. Some examples regarding protein sequences illustrate how the method can be used to find regions. The first application of DSR is the masking of simple sequences for searching databases. Queries masked by DSR returned a manageable set of hits below the E-value cutoff score, which contained all true positive homologues. The second application is to study simple regions detected by the DSR program corresponding to known structural features of proteins. An extensive computational analysis has been made of protein sequences with known, physicochemically defined nonglobular segments. For the SWISS-PROT amino acid sequence database (Release 40.2 of 02-Nov-2001), we determine that the best parameters and the best BLOSUM matrix are, respectively, for automatic segmentation of amino acid sequences into nonglobular and globular regions by the DSR program: Window length k = 35, cutoff value b = 0.46, and the BLOSUM 62.5 matrix. The average "agreement accuracy (sensitivity)" of DSR segmentation for the SWISS-PROT database is 97.3%.
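The reciprocal complexity defined above is the product over the n letter types of (L / (n v_i)) raised to the power v_i. A minimal sketch of evaluating it follows. The function name `reciprocal_complexity` is hypothetical, the treatment of letters with zero count (their factor is taken as 1) is an assumption of this sketch, and the scoring-matrix and windowing parts of DSR are not reproduced.

```python
import math
from collections import Counter


def reciprocal_complexity(seq, alphabet):
    """Evaluate C(S) = prod_i (L / (n * v_i)) ** v_i for sequence seq.

    alphabet lists the n letter types; letters absent from seq are
    skipped, i.e. their factor is taken as 1 (an assumption here).
    Computed in log space to avoid underflow on long sequences.
    """
    L = len(seq)
    n = len(alphabet)
    counts = Counter(seq)
    log_c = 0.0
    for letter in alphabet:
        v = counts.get(letter, 0)
        if v:
            log_c += v * math.log(L / (n * v))
    return math.exp(log_c)


if __name__ == "__main__":
    print(reciprocal_complexity("ACGT", "ACGT"))  # uniform composition
    print(reciprocal_complexity("AAAA", "ACGT"))  # single repeated letter
```

Under these assumptions the logarithm of C(S) equals L times (H minus ln n), where H is the Shannon entropy of the letter frequencies, so a uniform composition gives the maximum value 1 and compositionally simple segments give values close to 0.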

7.
8.
Methods to determine periodicity in protein sequences are useful for inferring function. Fourier transformation is one approach but care is required to ensure the periodicity is genuine. Here we have shown that empirically-derived statistical tables can be used as a measure of significance. Genuine protein sequence data rather than randomly generated sequences were used as the statistical backdrop. The method has been applied to G-protein coupled receptor (GPCR) sequences, by Fourier transformation of hydrophobicity values, codon frequencies and the extent of over-representation of codon pairs; the latter being related to translational step times. Genuine periodicity was observed in the hydrophobicity whereas the apparent periodicity (as inferred from previously reported measures) in the translation step times was not validated statistically. GCR2 has recently been proposed as the plant GPCR receptor for the hormone abscisic acid. It has homology to the Lanthionine synthetase C-like family of proteins, an observation confirmed by fold recognition. Application of the Fourier transform algorithm to the GCR2 family revealed strongly predicted sevenfold periodicity in hydrophobicity, suggesting why GCR2 has been reported to be a GPCR, despite negative indications in most transmembrane prediction algorithms. The underlying multiple sequence alignment, also required for the Fourier transform analysis of periodicity, indicated that the hydrophobic regions around the 7 GXXG motifs commence near the C-terminal end of each of the 7 inner helices of the alpha-toroid and continue to the N-terminal region of the helix. The results clearly explain why GCR2 has been understandably but erroneously predicted to be a GPCR.

9.
Various conventional methods to estimate the mean and median power spectral frequencies, and amplitude of the surface electromyogram during 30-90 min, cyclic, force-varying, constant-posture contractions were cross-compared in an experimental trial. The aim was to determine the most appropriate algorithm implementations and reduce the total number of algorithms that need to be considered when monitoring time trends. Subjects produced hand-grip contractions in a repeated intermittent pattern until exhaustion. For all estimated parameters: analysis of contraction levels below 25% maximum voluntary contraction produced poor estimates due to high relative measurement noise; parameter reproducibility was best when comparisons were aligned to the actual force produced rather than the target force and when the biomechanics of the contraction were more consistent; and estimates were not greatly influenced by the rate of change of the force trajectory. For frequency parameters: estimates based on the short-time Fourier transform were similar to those based on time-varying autoregressive methods; longer duration analysis windows exhibited better repeatability; and simple frequency-domain noise filters were not effective in reducing the impact of measurement noise. For amplitude estimates: whitening reduced the variance of the amplitude estimate; and the best analysis window duration was a trade-off between bias (decreased with a short duration window) and variance (decreased with a long duration window).
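The two frequency parameters named above have standard definitions that can be sketched briefly: the mean frequency is the power-weighted average frequency, and the median frequency splits the total spectral power into two equal halves. The function names below are hypothetical, the spectrum is assumed to be already estimated (for instance from one short-time Fourier transform window), and the median is taken at the first bin where the cumulative power reaches half, without interpolation.

```python
def mean_frequency(freqs, power):
    """Power-weighted mean of the frequency bins."""
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total


def median_frequency(freqs, power):
    """First frequency bin at which cumulative power reaches half.

    A coarse, bin-resolution estimate (no interpolation between bins).
    """
    half = sum(power) / 2.0
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= half:
            return f
    return freqs[-1]


if __name__ == "__main__":
    freqs = [10.0, 20.0, 30.0, 40.0]   # Hz, illustrative values only
    power = [1.0, 1.0, 1.0, 1.0]       # flat spectrum, illustrative
    print(mean_frequency(freqs, power), median_frequency(freqs, power))
```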

10.
High mass measurement accuracy is critical for confident protein identification and characterization in proteomics research. Fourier transform ion cyclotron resonance (FTICR) mass spectrometry is a unique technique which can provide unparalleled mass accuracy and resolving power. However, the mass measurement accuracy of FTICR-MS can be affected by space charge effects. Here, we present a novel internal calibrant-free calibration method that corrects for space charge-induced frequency shifts in FTICR fragment spectra called Calibration Optimization on Fragment Ions (COFI). This new strategy utilizes the information from fixed mass differences between two neighboring peptide fragment ions (such as y(1) and y(2)) to correct the frequency shift after data collection. COFI has been successfully applied to LC-FTICR fragmentation data. Mascot MS/MS ion search data demonstrate that most of the fragments from BSA tryptic digested peptides can be identified using a much lower mass tolerance window after applying COFI to LC-FTICR-MS/MS of BSA tryptic digest. Furthermore, COFI has been used for multiplexed LC-CID-FTICR-MS which is an attractive technique because of its increased duty cycle and dynamic range. After the application of COFI to a multiplexed LC-CID-FTICR-MS of BSA tryptic digest, we achieved an average measured mass accuracy of 2.49 ppm for all the identified BSA fragments.
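Mass accuracy figures such as the one quoted above are conventionally reported as a relative error in parts per million. The sketch below shows that standard conversion only; the function name `mass_error_ppm` and the example masses are hypothetical, and nothing here reproduces the COFI calibration itself.

```python
def mass_error_ppm(measured, theoretical):
    """Relative mass error in parts per million (ppm)."""
    return (measured - theoretical) / theoretical * 1e6


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    print(mass_error_ppm(1000.001, 1000.0))
```

A narrower mass tolerance window in a database search corresponds to accepting only matches whose absolute ppm error falls below a smaller bound.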

11.
Levenshtein dissimilarity measures are used to compare sequences in application areas including coding theory, computer science and macromolecular biology. In general, they measure sequence dissimilarity by the length of a shortest weighted sequence of insertions, deletions and substitutions required to transform one sequence into another. Those Levenshtein dissimilarity measures based on insertions and deletions are analyzed by a model involving valuations on a partially ordered set. The model reveals structural relationships among poset, valuation and dissimilarity measure. As a consequence, certain Levenshtein dissimilarity measures are shown to be metrics characterized by betweenness properties and computable in terms of well-known measures of sequence similarity. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant A-4142.
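One familiar instance of a dissimilarity "computable in terms of a similarity measure" is the unit-cost insertion/deletion distance, which equals the sum of the two sequence lengths minus twice the length of their longest common subsequence. The sketch below illustrates that identity only; the function names are hypothetical, and the weighted measures and poset model of the abstract are not covered.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Minimum number of unit-cost insertions and deletions turning a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


if __name__ == "__main__":
    print(indel_distance("ACGT", "AGT"))
```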

12.
Ultrasound velocity is one of the key acoustic parameters for noninvasive diagnosis of osteoporosis. Ultrasound phase velocity can be uniquely measured from the phase of the ultrasound signal at a specified frequency. Many previous studies used the fast Fourier transform (FFT) to determine the phase velocity, which may cause errors due to the limitations of the FFT. The new phase tracking technique applied an adaptive tracking algorithm to detect the time-dependent phase and amplitude of the ultrasound signal at a specified frequency. This overcame the disadvantages of the FFT and ensured the accuracy of the ultrasound phase velocity. As a result, the new method exhibited high accuracy in the measurement of ultrasound phase velocity of two phantom blocks, with an error of less than 0.4%. A total of 41 cubic trabecular samples from sheep femoral condyles were used in the study. The phase velocity of the samples measured using the new method showed a significantly higher correlation with the bulk stiffness of the samples (r = 0.84) than the phase velocity measured using the FFT (r = 0.14). In conclusion, the new method provided an accurate measurement of the ultrasound phase velocity in bone.

13.
Frequency-domain analysis of biomolecular sequences   Total citations: 7 (self-citations: 0, citations by others: 7)
MOTIVATION: Frequency-domain analysis of biomolecular sequences is hindered by their representation as strings of characters. If numerical values are assigned to each of these characters, then the resulting numerical sequences are readily amenable to digital signal processing. RESULTS: We introduce new computational and visual tools for biomolecular sequence analysis. In particular, we provide an optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences. We also show that the phase of a properly defined Fourier transform is a powerful predictor of the reading frame of protein coding regions. Resulting color maps help in visually identifying not only the existence of protein coding areas for both DNA strands, but also the coding direction and the reading frame for each of the exons. Furthermore, we demonstrate that color spectrograms can visually provide, in the form of local 'texture', significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function.
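The "traditional Fourier analysis" that this abstract improves upon is commonly described as follows: each nucleotide is mapped to a binary indicator sequence, and the spectral power at period 3 is summed over the four indicator sequences. The sketch below illustrates that baseline under this assumption; the function name `period3_power` is hypothetical, and the optimized numerical assignment and phase-based frame prediction of the abstract are not reproduced.

```python
import cmath


def period3_power(seq):
    """Summed power at period 3 over the four nucleotide indicator sequences.

    For each base, the indicator sequence is 1 where that base occurs
    and 0 elsewhere; its Fourier coefficient at frequency 1/3 is
    accumulated directly instead of computing a full transform.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, c in enumerate(seq)
            if c == base
        )
        total += abs(coeff) ** 2
    return total


if __name__ == "__main__":
    print(period3_power("ATGATGATG"))  # strongly 3-periodic
    print(period3_power("AAAAAA"))     # no 3-periodicity
```

In practice such a measure is evaluated in a sliding window along the sequence, and a peak relative to the background is read as evidence of a coding region.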

14.
Xu S, Rao N, Chen X, Zhou B. Biotechnology Letters 2011;33(5):889-896
The accuracy of prediction methods based on power spectrum analysis depends on the threshold that is used to discriminate between protein coding and non-coding sequences in the genomes of eukaryotes. Because the structure of genes varies among different eukaryotes, it is difficult to determine the best prediction threshold for a eukaryote relying only on prior biological knowledge. To improve the accuracy of prediction methods based on power spectral analysis, we developed a novel method based on a bootstrap algorithm to infer organism-specific optimal thresholds for eukaryotes. As prior information, our method requires the input of only a few annotated protein coding regions from the organism being studied. Our results show that using the calculated optimal thresholds for our test datasets, the average prediction accuracy of our method is 81%, an increase of 19% over that obtained using the same empirical threshold P = 4 for all datasets. The proposed method is simple and convenient and easily applied to infer optimal thresholds that can be used to predict coding regions in the genomes of most organisms.

15.
Indels in the coding regions of a gene can either cause frameshifts or amino acid insertions/deletions. Frameshifting indels are indels that have a length that is not divisible by 3 and subsequently cause frameshifts. Indels that have a length divisible by 3 cause amino acid insertions/deletions or block substitutions; we call these 3n indels. The new amino acid changes resulting from 3n indels could potentially affect protein function. Therefore, we construct a SIFT Indel prediction algorithm for 3n indels which achieves 82% accuracy, 81% sensitivity, 82% specificity, 82% precision, 0.63 MCC, and 0.87 AUC by 10-fold cross-validation. We have previously published a prediction algorithm for frameshifting indels. The rules for the prediction of 3n indels are different from the rules for the prediction of frameshifting indels and reflect the biological differences of these two different types of variations. SIFT Indel was applied to human 3n indels from the 1000 Genomes Project and the Exome Sequencing Project. We found that common variants are less likely to be deleterious than rare variants. The SIFT Indel prediction algorithm for 3n indels is available at http://sift-dna.org/
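The distinction between frameshifting and 3n indels rests on a simple length test, sketched below. The function name `classify_indel` and the reference/alternate allele string representation are assumptions made for this example; the sketch does not predict whether a variant is deleterious, which is what SIFT Indel does.

```python
def classify_indel(ref, alt):
    """Classify a coding indel by its length (hypothetical helper).

    The indel length is the absolute difference between the reference
    and alternate allele lengths. Returns "3n" when the length is
    divisible by 3 and "frameshift" otherwise.
    """
    length = abs(len(ref) - len(alt))
    if length == 0:
        raise ValueError("alleles have equal length; not an indel")
    return "3n" if length % 3 == 0 else "frameshift"


if __name__ == "__main__":
    print(classify_indel("A", "ATTT"))  # 3-base insertion
    print(classify_indel("ATG", "A"))   # 2-base deletion
```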

16.
Choong MK, Yan H. Bioinformation 2008;2(7):273-278
This paper presents a new method for exon detection in DNA sequences based on multi-scale parametric spectral analysis. A forward-backward linear prediction (FBLP) algorithm with singular value decomposition (SVD), FBLP-SVD, is applied to the double-base curves (DB-curves) of a DNA sequence using variable moving window sizes to estimate the signal spectrum at multiple scales. Simulations are done on short human genes in the range of 11 bp to 2032 bp, and the results show that our proposed method outperforms the classical Fourier transform method. The multi-scale approach is shown to be more effective than using a single scale with a fixed window size. In addition, our method is flexible as it requires no training data.

17.
The miniaturization and affordability of new technology is driving a biologging revolution in wildlife ecology with use of animal‐borne data logging devices. Among many new biologging technologies, accelerometers are emerging as key tools for continuously recording animal behavior. Yet a critical, but under‐acknowledged consideration in biologging is the trade‐off between sampling rate and sampling duration, created by battery‐ (or memory‐) related sampling constraints. This is especially acute among small animals, causing most researchers to sample at high rates for very limited durations. Here, we show that high accuracy in behavioral classification is achievable when pairing low‐frequency acceleration recordings with temperature. We conducted 84 hr of direct behavioral observations on 67 free‐ranging red squirrels (200–300 g) that were fitted with accelerometers (2 g) recording tri‐axial acceleration and temperature at 1 Hz. We then used a random forest algorithm and a manually created decision tree, with variable sampling window lengths, to associate observed behavior with logger recorded acceleration and temperature. Finally, we assessed the accuracy of these different classifications using an additional 60 hr of behavioral observations, not used in the initial classification. The accuracy of the manually created decision tree classification using observational data varied from 70.6% to 91.6% depending on the complexity of the tree, with increasing accuracy as complexity decreased. Short duration behavior like running had lower accuracy than long‐duration behavior like feeding. The random forest algorithm offered similarly high overall accuracy, but the manual decision tree afforded the flexibility to create a hierarchical tree, and to adjust sampling window length for behavioral states with varying durations. Low frequency biologging of acceleration and temperature allows accurate behavioral classification of small animals over multi‐month sampling durations. 
Nevertheless, low sampling rates impose several important limitations, especially related to assessing the classification accuracy of short duration behavior.

18.
Grain yield of the maize plant depends on the sizes, shapes, and numbers of ears and the kernels they bear. An automated pipeline that can measure these components of yield from easily‐obtained digital images is needed to advance our understanding of this globally important crop. Here we present three custom algorithms designed to compute such yield components automatically from digital images acquired by a low‐cost platform. One algorithm determines the average space each kernel occupies along the cob axis using a sliding‐window Fourier transform analysis of image intensity features. A second counts individual kernels removed from ears, including those in clusters. A third measures each kernel's major and minor axis after a Bayesian analysis of contour points identifies the kernel tip. Dimensionless ear and kernel shape traits that may interrelate yield components are measured by principal components analysis of contour point sets. Increased objectivity and speed compared to typical manual methods are achieved without loss of accuracy as evidenced by high correlations with ground truth measurements and simulated data. Millimeter‐scale differences among ear, cob, and kernel traits that ranged more than 2.5‐fold across a diverse group of inbred maize lines were resolved. This system for measuring maize ear, cob, and kernel attributes is being used by multiple research groups as an automated Web service running on community high‐throughput computing and distributed data storage infrastructure. Users may create their own workflow using the source code that is staged for download on a public repository.

19.
This report describes an optimised version of a secondary structure prediction method based on local homologies, using a new data base. A 63% prediction accuracy, for three states, was obtained after elimination of the protein to be predicted and all proteins with a percentage identity greater than 22% from the data base. This corresponds to a 5% increase in accuracy on the original method (Levin et al. FEBS Lett. 205 (1986) 303-308). The flexibility of the method to the incorporation of information extraneous to the prediction was demonstrated by the prediction of the homologous proteins in the data base. Using the percentage identity with the protein to be predicted, to weight the relative importance of each protein, for all proteins with a percentage identity greater than 30%, the mean correct prediction per chain was 87%. As a result this algorithm can be used during the molecular modelling process, both to give an idea of the structural similarity between two proteins and as an aid in the determination of the best alignment. Incorporation of the result of a protein folding type assignment based on the global amino-acid composition increased the overall prediction to 66%.

20.
A new system to recognize protein coding genes in coronavirus genomes, especially suitable for SARS-CoV genomes, is proposed in this paper. Compared with some existing systems, the new program package has the merits of simplicity, high accuracy, reliability, and speed. The system ZCURVE_CoV has been run for each of the 11 newly sequenced SARS-CoV genomes. Consequently, six genomes not annotated previously have been annotated, and some problems of previous annotations in the remaining five genomes have been pointed out and discussed. In addition to the polyprotein chain ORFs 1a and 1b and the four genes coding for the major structural proteins, spike (S), small envelope (E), membrane (M), and nucleocapsid (N), respectively, ZCURVE_CoV also predicts 5-6 putative proteins between 39 and 274 amino acids in length with unknown functions. Some single nucleotide mutations within these putative coding sequences have been detected and their biological implications are discussed. A web service is provided, by which a user can obtain the annotated result immediately by pasting the SARS-CoV genome sequences into the input window on the web site (http://tubic.tju.edu.cn/sars/). The software ZCURVE_CoV can also be downloaded freely from the web address mentioned above and run on computers under Windows or Linux.
