Similar literature
Found 20 similar articles (search took 15 ms)
1.
We propose a new approach to studying protein-coding and non-coding regions in DNA sequences that makes use of two complementary statistical methods. Principal component analysis (PCA) is a graphical method for representing DNA sequences characterized by a set of quantitative parameters; it serves as an aid to intuition. Discriminant analysis (DA) is a quantitative method for classifying the DNA sequences; it provides an evaluation of the first method and leads to a decision. The value of this approach is confirmed by the fact that we recovered some results recently described in the literature. Furthermore, this general methodology has allowed us to show the existence of parameters that identify the functional domains of nucleic acid sequences without making use of the properties of the genetic code.

2.
Li X, Zeng J, Yan H. Bioinformation, 2008, 2(9): 373-378

3.
Rapid methods for characterizing biomass intended for energy use are essential. In this work, near infrared spectroscopy is used to measure the ash and char content of various types of biomass. Very strong models were developed, independently of the type of biomass, to predict ash and char content by near infrared spectroscopy and multivariate analysis. Several statistical approaches, including principal component analysis (PCA), orthogonal signal correction (OSC) treated PCA and partial least squares (PLS), and kernel PCA and PLS, were tested in order to find the best method for handling near infrared data to classify and predict these biomass characteristics. The model with the highest correlation coefficient and the lowest RMSEP was obtained with the OSC-treated kernel PLS method.

4.
We have used Fragmentation Sequencing logic to analyse the repetition structure of several large human genomic genes. The method, based on a proposed laboratory scheme for DNA sequencing, detects short sequences which are repeated near, but not necessarily adjacent to, each other (cryptically simple DNA). We find a low frequency of such repeats. There is a slight excess of such repeats in introns over exons, and a slight but significant excess in genomic DNA over random DNA, confirming that cryptically simple sequences are over-represented in the genome. The analysis suggests that Fragmentation Sequencing will be a suitable method for sequencing large mammalian genes.

5.
Complexity charts can be used to map functional domains in DNA (total citations: 4; self-citations: 0; citations by others: 4)
We measured local compositional complexity (LCC) of DNA sequences by calculating Shannon information content over mononucleotide frequencies. Eukaryotic DNA appeared to be "simpler" than bacterial DNA even at the level of short oligonucleotides. Moreover, different DNA functional domains displayed different compositional complexity in a systematic manner. In particular, the complexity of exon sequences was systematically higher than the complexity of corresponding introns. We therefore present examples of complexity charts (plots of complexity versus position in sequence) for pre-mRNA sequences from higher eukaryotes. By taking a window width of 100 nucleotides and a window step of 1 nucleotide, introns can be distinguished from exons in the majority of cases studied. Complexity charts of immunoglobulin variable regions allowed correct mapping of exons and introns in these sequences as well, a task that was impossible with commercial programs available to date.
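The complexity chart described in this abstract can be sketched as a sliding-window entropy calculation. The following is a minimal illustration, assuming Shannon entropy in bits over the base frequencies of each window; the function names `shannon_entropy` and `complexity_chart` are hypothetical, and the abstract does not state the exact normalization the authors used.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the mononucleotide frequencies in one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def complexity_chart(seq: str, width: int = 100, step: int = 1) -> list[tuple[int, float]]:
    """Return (window start, entropy) pairs along seq.

    The defaults follow the abstract (window width 100, step 1); the entropy
    definition itself is an assumption of this sketch.
    """
    seq = seq.upper()
    return [
        (start, shannon_entropy(seq[start:start + width]))
        for start in range(0, len(seq) - width + 1, step)
    ]
```

Plotting the second element of each pair against the first gives a chart of complexity versus position; a window containing all four bases in equal proportion reaches the maximum of 2 bits.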

6.
This paper describes a computer method that uses codon preference to help find protein coding regions in long DNA sequences. The method can distinguish between introns and exons and can help to detect sequencing errors.

7.
We have studied the relationship between amino acid sequence and substrate specificity in a DNA glycosylase family by characterizing experimentally the specificity of four new members of the family. We show that principal component analysis (PCA) of the sequence family correctly predicts the substrate specificity of one of the novel homologs even though conventional sequence analysis methods fail to group this homolog with other sequences of the same specificity. PCA also suggested, correctly, that another homolog characterized previously differs in its specificity from those sequences with which it clusters by conventional criteria. These results suggest that principal component analysis of sequence families can be a useful tool in annotating genome sequences when there is ambiguity concerning which subfamily a new homolog belongs to. Published 2000 Wiley-Liss, Inc.

8.
The exon structure of the collagen IV gene provides a striking example of collagen evolution and the role of introns in gene evolution. Collagen IV, a major component of basement membranes, differs from the fibrillar collagens in that it contains numerous interruptions in the triple helical Gly-X-Y repeat domain. We have characterized all 47 exons in the mouse alpha 2(IV) collagen gene and find two 36-, two 45-, and one 54-bp exons as well as one 99- and three 108-bp exons encoding the Gly-X-Y repeat sequence. All these exon sizes are also found in the fibrillar collagen genes. Strikingly, of the 24 interruption sequences present in the alpha 2-chain of mouse collagen IV, 11 are encoded at the exon/intron borders of the gene, part of one interruption sequence is encoded by an exon of its own, and the remaining interruptions are encoded within the body of exons. In such "fusion exons" the Gly-X-Y encoding domain is also derived from 36-, 45-, or 54-bp sequence elements. These data support the idea that collagen IV genes evolved from a primordial 54-bp coding unit. We furthermore interpret these data to suggest that the interruption sequences in collagen IV may have evolved from introns, presumably by inactivation of splice site signals, following which intronic sequences could have been recruited into exons. We speculate that this mechanism could provide a role for introns in gene evolution in general.

9.
We previously observed that Antarctic fish genes contain intron sequences of high A+T content (60-70% average A+T), in stark contrast with adjacent protein-coding sequences. Here, we report that this disparity in intron/exon base composition is a common feature among teleosts. We analyzed 483 teleost genomic DNA sequences, containing 2583 introns, from 80 teleost genera that populate polar, temperate, or tropical habitats. Eighty-nine percent of teleost introns display an A+T content between 50% and 84%, with a mean of 60% A+T. In contrast, only 37% of teleost exons have an A+T content greater than 50%, with a mean of 48% A+T. A comparison to homologous mammalian genes showed a striking difference; in this case, introns and exons have similar base compositions, averaging 45-47% A+T. This indicates that most teleost genes exhibit a large difference in base composition between their introns and exons. There was no correlation of teleost intron A+T content with intron length or habitat temperature range. Thus, teleost intron sequences tend to share the feature of being much higher in A+T content than neighboring exons.
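The A+T percentages quoted in this abstract are simple base-composition ratios. The sketch below shows one way such a value might be computed; the function name `at_content` is hypothetical, and ignoring ambiguous bases such as N is an assumption of this example rather than something the abstract states.

```python
def at_content(seq: str) -> float:
    """Percentage of A and T among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    total = at + seq.count("C") + seq.count("G")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    # Ambiguous symbols (for example N) are left out of the denominator.
    return 100.0 * at / total
```

Applied separately to each intron and each exon of a gene, the function gives the pair of values whose difference the abstract discusses.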

10.
11.
Genetic programming (GP) can be used to classify a given gene sequence as either constitutively or alternatively spliced. We describe the principles of GP and apply it to a well-defined data set of alternatively spliced genes. A feature matrix of sequence properties, such as nucleotide composition or exon length, was passed to the GP system "Discipulus." To test its performance we concentrated on cassette exons (SCE) and retained introns (SIR). We analyzed 27,519 constitutively spliced and 9641 cassette exons including their neighboring introns; in addition we analyzed 33,316 constitutively spliced introns compared to 2712 retained introns. We find that the classifier yields highly accurate predictions on the SIR data with a sensitivity of 92.1% and a specificity of 79.2%. Prediction accuracies on the SCE data are lower, 47.3% (sensitivity) and 70.9% (specificity), indicating that alternative splicing of introns can be better captured by sequence properties than that of exons.
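For readers less familiar with the two performance measures reported in this abstract, the sketch below gives their standard definitions in terms of confusion-matrix counts. The function names are hypothetical, and the abstract does not report the underlying counts, so none are reproduced here.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that the classifier labels positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of actual negatives that the classifier labels negative."""
    return true_neg / (true_neg + false_pos)
```

In this setting a "positive" would be an alternatively spliced element (a retained intron or a cassette exon) and a "negative" a constitutively spliced one.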

12.
Targeted, timely treatment benefits patients with acute promyelocytic leukemia (APML). This study used morphological and immature fraction-related parameters (cell population data, CPD) generated during the complete blood cell count (CBC), together with artificial neural network (ANN) predictive modeling, for early flagging of APML cases. We collected classical CBC items along with CPD from a hematology analyzer at diagnosis for 1067 study subjects with hematological neoplasms. For morphological assessment, peripheral blood films were examined. Statistical and machine learning tools, including principal component analysis (PCA), helped evaluate the predictive capacity of routine and CPD items. An ANN predictive model driven by selected CBC items was then developed to exploit the hidden trend and increase the predictive accuracy of these parameters in differentiating APML cases. We found a characteristic triad in patients with APML compared with other hematological neoplasm groups: a lower platelet count (PLT; 53.73), a decreased or normal immature platelet fraction (IPF; 4.72), and a significantly higher value (65.5) of the DNA/RNA content-related neutrophil parameter (NE-SFL). On PCA, APML showed exceptionally significant variance for PLT, IPF, and NE-SFL. After training, the ANN model based on our selected CBC items successfully distinguished the APML group from non-APML groups, with a highly significant AUC of 0.894 and a low false prediction rate (2.3 percent). In practical use, the ANN model gave acceptable results, with values of 95.7% and 97.7% for the training and testing data sets, respectively. We propose that PLT, IPF, and NE-SFL could potentially be used for early flagging of APML cases in the hematology-oncology unit.
CBC item-driven ANN modeling is a novel approach that substantially strengthens the predictive potential of CBC items, allowing clinicians to be confident in the typical trend shown by these parameters.

13.
The work reported in this paper examines the use of principal component analysis (PCA), a technique of multivariate statistics, to facilitate the extraction of meaningful diagnostic information from a data set of chromatographic traces. Two data sets mimicking archived production records were analysed using PCA. In the first, a full-factorial experimental design approach was used to generate the data. In the second, the chromatograms were generated by adjusting just one of the process variables at a time. Database mining was achieved through the generation of both gross and disjoint principal component (PC) models. PCA provided easily interpretable 2-dimensional diagnostic plots revealing clusters of chromatograms obtained under similar operating conditions. PCA methods can be used to detect and diagnose changes in process conditions; however, results show that a PCA model may require recalibration if an equipment change is made. We conclude that PCA methods may be useful for the diagnosis of subtle deviations from process specification not readily distinguishable to the operator.

14.
Liu JX, Xu Y, Zheng CH, Wang Y, Yang JY. PLoS ONE, 2012, 7(7): e38873
Conventional gene selection methods based on principal component analysis (PCA) use only the first principal component (PC) of PCA or sparse PCA to select characteristic genes. These methods in effect assume that the first PC plays a dominant role in gene selection. However, in a number of cases this assumption is not satisfied, so the conventional PCA-based methods usually provide poor selection results. To improve the performance of PCA-based gene selection, we put forward a gene selection method that weights PCs by singular values (WPCS). Because different PCs have different importance, the singular values are used as weights to represent the influence of the different PCs on gene selection. The ROC curves and AUC statistics on artificial data show that our method outperforms the state-of-the-art methods. Moreover, experimental results on real gene expression data sets show that our method can extract more characteristic genes in response to abiotic stresses than conventional gene selection methods.

15.
In this study, we examined whether evolutionarily driven differences in primary sequences correlate with, and thus predict, the genetic diversity of related marker loci, which is an important criterion for assessing the quality of any DNA marker. We adopted a new approach to quantitative symbolic DNA sequence analysis, called DNA random walk representation, to study multiallelic marker loci from Begonia × tuberhybrida Voss. We describe a significant correlation between random walk-derived digital invariants and the genetic diversity of the marker loci. Specifically, on the 3D contour plot of multivariate principal component analysis (PCA), we found a statistical correlation between the first two PCA factors and the number of alleles per marker locus. Based on that correlation, we suggest that DNA walk representation may predict allele-rich loci solely from their primary sequences, which improves the current design of new DNA germplasm identifiers.
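The abstract does not specify which walk rule the authors used. As a rough illustration of the general idea of a DNA walk, the sketch below uses a common one-dimensional purine/pyrimidine rule (step +1 for A or G, step -1 for C or T); both the rule and the name `dna_walk` are assumptions of this example, not the authors' method.

```python
from itertools import accumulate

# Assumed step rule: purines move the walker up, pyrimidines move it down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(seq: str) -> list[int]:
    """Cumulative walker position after each unambiguous base of seq."""
    steps = (STEP[base] for base in seq.upper() if base in STEP)
    return list(accumulate(steps))
```

Numerical summaries of such a trajectory (for example its final displacement or its range) are the kind of sequence-derived invariants that could then be passed to PCA.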

16.
17.
Analysis of an artificial neural network trained to classify DNA as coding or non-coding revealed compositional differences between sequence parts translated into protein and those that were not. The 5' end of human introns was found to have a base composition that was non-random to an extent matching the non-randomness in the 3' end that contains the polypyrimidine tract. The prevailing nucleotides in the initial 50 nucleotides of human introns are guanine and cytosine; the trinucleotide GGG was found to occur almost four times as frequently as it would in sequences with a uniform distribution of the nucleotides. The initial part of terminal exons and their associated terminal introns were shown to have a very special base composition deviating strongly from the normal picture in other exons and introns.
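The "almost four times as frequently" comparison in this abstract is against a uniform nucleotide distribution, under which any given trinucleotide is expected in 1/64 of all positions. A minimal sketch of such an observed-to-expected ratio follows; the function name `trinucleotide_enrichment` is hypothetical, and counting overlapping occurrences is an assumption of this example.

```python
def trinucleotide_enrichment(seq: str, tri: str = "GGG") -> float:
    """Observed count of tri divided by the count expected under uniform base usage."""
    seq = seq.upper()
    n_windows = len(seq) - 2  # number of overlapping 3-base windows
    if n_windows <= 0:
        raise ValueError("sequence must contain at least 3 bases")
    observed = sum(1 for i in range(n_windows) if seq[i:i + 3] == tri)
    expected = n_windows / 64  # 4**3 equally likely trinucleotides
    return observed / expected
```

A ratio near 1 means the trinucleotide occurs about as often as uniform base usage predicts; the abstract reports a value of almost 4 for GGG at the start of human introns.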

18.
19.
20.