Similar literature
 20 similar documents found (search time: 0 ms)
1.
A system for pattern matching applications on biosequences   (Cited 5 times in total: 0 self-citations, 5 citations by others)
ANREP is a system for finding matches to patterns composed of (i) spacing constraints called ‘spacers’, and (ii) approximate matches to ‘motifs’ that are, recursively, patterns composed of ‘atomic’ symbols. A user specifies such patterns via a declarative, free-format and strongly typed language called A that is presented here in a tutorial style through a series of progressively more complex examples. The sample patterns are for protein and DNA sequences, the application domain for which ANREP was specifically created. ANREP provides a unified framework for almost all previously proposed biosequence patterns and extends them by providing approximate matching, a feature heretofore unavailable except for the limited case of individual sequences. The performance of ANREP is discussed and an appendix gives a concise specification of syntax and semantics. A portable C software package implementing ANREP is available via anonymous remote file transfer.
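
The idea of a pattern built from approximately matched motifs separated by a spacer can be sketched as follows. This is only an illustration in Python, not the A language or the ANREP package: the function names `hamming_hits` and `match_two_motifs`, the use of Hamming distance as the approximation measure, and the example sequence are all assumptions made here.

```python
from typing import List, Tuple


def hamming_hits(seq: str, motif: str, max_mismatches: int) -> List[int]:
    """Return start positions where `motif` matches `seq` with at most
    `max_mismatches` substitutions (hypothetical helper, Hamming distance only)."""
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def match_two_motifs(
    seq: str,
    first: str,
    second: str,
    spacer: Tuple[int, int],
    max_mismatches: int = 0,
) -> List[Tuple[int, int]]:
    """Find (first_start, second_start) pairs where `second` begins between
    spacer[0] and spacer[1] symbols after the end of `first`."""
    lo, hi = spacer
    second_starts = set(hamming_hits(seq, second, max_mismatches))
    matches = []
    for first_start in hamming_hits(seq, first, max_mismatches):
        first_end = first_start + len(first)
        for gap in range(lo, hi + 1):
            if first_end + gap in second_starts:
                matches.append((first_start, first_end + gap))
    return matches


# Two motifs separated by a spacer of 3 to 5 symbols (made-up DNA sequence).
example = "AATTGACACCCCTATAATGG"
print(match_two_motifs(example, "TTGACA", "TATAAT", spacer=(3, 5)))
```

A real pattern language would allow nested motifs and richer error models (insertions and deletions); this sketch covers only the simplest two-motif, substitution-only case.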

2.
In this paper we discuss and demonstrate the importance of several factors relative to the relationship between time and evolution of biosequences. In both quantitative and qualitative measurements of the genetic distances, the compositional constraints of the nucleotide sequences play a very important role. We demonstrate that when homologous sequences significantly differ in base composition we get erratic branching order and/or wrong evaluation of the evolutionary rates. We must consider that every gene may have a different evolutionary dynamic along its sequence, generally linked to its functional constraints; this too can seriously affect its clocklike behavior. We report some cases showing how these factors can affect the quantitative measurements of the genetic distances of biosequences. Presented at the NATO Advanced Research Workshop on Genome Organization and Evolution, Spetsai, Greece, 16–22 September 1992

3.
The list of species whose complete DNA sequences have been read is growing steadily, and it is believed that comparative genomics is in its early days. Permutation patterns (groups of genes in some "close" proximity) on gene sequences of genomes across species are being studied under different models, to cope with this explosion of data. The challenge is to (intelligently and efficiently) analyze the genomes in the context of other genomes. In this paper, we present a generalized model that uses three notions: gapped permutation patterns (with gap g); genome clusters, via a quorum parameter K > 1; and possible multiplicity in the patterns. The task is to automatically discover all permutation patterns (with possible multiplicity) that occur with gap g in at least K of the given m genomes. We present an $O(\log mN_I + |\Sigma|\log|\Sigma|\,N_O)$ time algorithm, where m is the number of sequences, each defined on $\Sigma$, $N_I$ is the size of the input and $N_O$ is the size of the maximal gene clusters that appear in at least K of the m genomes.

4.
The role of pattern in biomarker discovery and clinical diagnosis is examined in its historical context. The use of MS-derived pattern is treated as a logical extension of prior applications of non-MS-derived pattern. Criticisms pertaining to specific technology platforms and analytic methodologies are considered separately from the larger issues of pattern utility and deployment in biomarker discovery. We present a hybrid strategy that marries the desirable attributes of high-information content MS pattern with the capability to obtain identity, and explore the key steps in establishing a data analysis pipeline for pattern-based biomarker discovery.

5.
Biology, chemistry and medicine are faced by tremendous challenges caused by an overwhelming amount of data and the need for rapid interpretation. Computational intelligence (CI) approaches such as artificial neural networks, fuzzy systems and evolutionary computation are being used with increasing frequency to contend with this problem, in light of noise, non-linearity and temporal dynamics in the data. Such methods can be used to develop robust models of processes either on their own or in combination with standard statistical approaches. This is especially true for database mining, where modeling is a key component of scientific understanding. This review introduces current CI methods and their application to biological problems, and concludes with a commentary about the anticipated impact of these approaches in bioinformatics.

6.
In the past few years, pattern discovery has been emerging as a generic tool of choice for tackling problems from the computational biology domain. In this presentation, after defining the problem in its generality, we review some of the algorithms that have appeared in the literature and describe several applications of pattern discovery to problems from computational biology.

7.
A reliable and precise identification of the type of tumors is crucial to the effective treatment of cancer. With the rapid development of microarray technologies, tumor clustering based on gene expression data is becoming a powerful approach to cancer class discovery. In this paper, we apply the penalized matrix decomposition (PMD) to gene expression data to extract metasamples for clustering. The extracted metasamples capture the inherent structures of samples belonging to the same class. At the same time, the PMD factors of a sample over the metasamples can in turn be used as its class indicator. Compared with conventional methods such as hierarchical clustering (HC), self-organizing maps (SOM), affinity propagation (AP) and nonnegative matrix factorization (NMF), the proposed method can identify the samples with complex classes. Moreover, the factor of PMD can be used as an index to determine the cluster number. The proposed method provides a reasonable explanation of the inconsistent classifications made by the conventional methods. In addition, it is able to discover the modules in gene expression data of conterminous developmental stages. Experiments on two representative problems show that the proposed PMD-based method is very promising for discovering biological phenotypes.

8.
Zhu Fangfang, Li Jiang, Liu Juan, Min Wenwen. BMC Genetics, 2021, 22(1): 1-10
Background

Next-generation sequencing (NGS) has profoundly changed the approach to genetic/genomic research. Particularly, the clinical utility of NGS in detecting mutations associated with disease risk has contributed to the development of effective therapeutic strategies. Recently, comprehensive analysis of somatic genetic mutations by NGS has also been used as a new approach for controlling the quality of cell substrates for manufacturing biopharmaceuticals. However, the quality evaluation of cell substrates by NGS largely depends on the limit of detection (LOD) for rare somatic mutations. The purpose of this study was to develop a simple method for evaluating the ability of whole-exome sequencing (WES) by NGS to detect mutations with low allele frequency. To estimate the LOD of WES for low-frequency somatic mutations, we repeatedly and independently performed WES of a reference genomic DNA using the same NGS platform and assay design. LOD was defined as the allele frequency with a relative standard deviation (RSD) value of 30% and was estimated by a moving average curve of the relation between RSD and allele frequency.

Results

Allele frequencies of 20 mutations in the reference material that had been pre-validated by droplet digital PCR (ddPCR) were obtained from 5, 15, 30, or 40 giga base pairs (Gbp) of sequencing data per run. There was a significant association between the allele frequencies measured by WES and those pre-validated by ddPCR, and its p-value decreased as the sequencing data size increased. By this method, the LOD of allele frequency in WES with sequencing data of 15 Gbp or more was estimated to be between 5 and 10%.

Conclusions

For properly interpreting the WES data of somatic genetic mutations, it is necessary to have a cutoff threshold of low allele frequencies. The in-house LOD estimated by the simple method shown in this study provides a rationale for setting the cutoff.
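
The LOD definition in the Background (the allele frequency at which the relative standard deviation reaches 30%, read from a moving average curve) can be sketched as below. This is a minimal illustration, not the authors' procedure: the function names, the use of the sample standard deviation, the window size, and the linear interpolation between smoothed points are all assumptions.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def relative_sd(values: Sequence[float]) -> float:
    """Relative standard deviation in percent (sample SD divided by mean)."""
    return 100.0 * stdev(values) / mean(values)


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Simple unweighted moving average over consecutive windows."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def estimate_lod(
    replicates: Sequence[Sequence[float]],
    threshold: float = 30.0,
    window: int = 3,
) -> Optional[float]:
    """Estimate the allele frequency at which the smoothed RSD crosses `threshold`.

    `replicates` holds one inner sequence per mutation: its allele frequency (%)
    as measured in each repeated sequencing run. Returns None if the smoothed
    curve never crosses the threshold.
    """
    # One (mean allele frequency, RSD) point per mutation, ordered by frequency.
    points = sorted((mean(r), relative_sd(r)) for r in replicates)
    afs = moving_average([p[0] for p in points], window)
    rsds = moving_average([p[1] for p in points], window)
    # Walk from high to low allele frequency and interpolate at the crossing.
    for i in range(len(afs) - 1, 0, -1):
        lo_af, hi_af = afs[i - 1], afs[i]
        lo_rsd, hi_rsd = rsds[i - 1], rsds[i]
        if hi_rsd <= threshold < lo_rsd:
            frac = (lo_rsd - threshold) / (lo_rsd - hi_rsd)
            return lo_af + frac * (hi_af - lo_af)
    return None


# Two made-up mutations, three runs each; window=1 disables smoothing.
print(estimate_lod([[1, 2, 3], [9, 10, 11]], window=1))
```

With real data, the choice of window and of sample versus population standard deviation would shift the estimate, so those settings should be reported alongside the LOD.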


9.
10.
11.
EEG signals are important for capturing brain disorders. They are useful for analyzing the cognitive activity of the brain and diagnosing types of seizure and potential mental health problems. The Event Related Potential can be measured through the EEG signal. However, it is always difficult to interpret due to its low amplitude and sensitivity to changes in mental activity. In this paper, we propose a novel approach to incrementally detect the pattern of this kind of EEG signal. This approach successfully summarizes the whole stream of the EEG signal by finding the correlations across the electrodes and discriminates the signals corresponding to various tasks into different patterns. It is also able to detect the transition period between different EEG signals and identify the electrodes which contribute the most to these signals. The experimental results show that the proposed method allows the significant meaning of the EEG signal to be obtained from the extracted pattern.

12.
13.
MOTIVATION: Evolutionary comparison leads to efficient functional characterisation of hypothetical proteins. Here, our goal is to map specific sequence patterns to putative functional classes. The evolutionary signal stands out most clearly in a maximally diverse set of homologues. This diversity, however, leads to a number of technical difficulties. The targeted patterns, as gleaned from structure comparisons, are too sparse for statistically significant signals of sequence similarity and accurate multiple sequence alignment. RESULTS: We address this problem by a fuzzy alignment model, which probabilistically assigns residues to structurally equivalent positions (attributes) of the proteins. We then apply multivariate analysis to the 'attributes × proteins' matrix. The dimensionality of the space is reduced using non-negative matrix factorization. The method is general, fully automatic and works without assumptions about pattern density, minimum support, explicit multiple alignments, phylogenetic trees, etc. We demonstrate the discovery of biologically meaningful patterns in an extremely diverse superfamily related to urease.

14.
MOTIVATION: Several pattern discovery methods have been proposed to detect over-represented motifs in upstream sequences of co-regulated genes, and are for example used to predict cis-acting elements from clusters of co-expressed genes. The clusters to be analyzed are often noisy, containing a mixture of co-regulated and non-co-regulated genes. We propose a method to discriminate co-regulated from non-co-regulated genes on the basis of counts of pattern occurrences in their non-coding sequences. METHODS: String-based pattern discovery is combined with discriminant analysis to classify genes on the basis of putative regulatory motifs. RESULTS: The approach is evaluated by comparing the significance of patterns detected in annotated regulons (positive control), random gene selections (negative control) and high-throughput regulons (noisy data) from the yeast Saccharomyces cerevisiae. The classification is evaluated on the annotated regulons, and the robustness and rejection power are assessed with mixtures of co-regulated and random genes.

15.

Background

Detection and quantification of cyclic alternating pattern (CAP) components have the potential to serve as a disease biomarker. Few methods exist to discriminate all the different CAP components; those that do lack adequate sensitivity and are often evaluated by accuracy (AC), which is not an appropriate measure for imbalanced datasets.

Methods

We describe a knowledge discovery in data (KDD) methodology aimed at the development of automatic CAP scoring approaches. Automatic CAP scoring was approached from two perspectives: the binary distinction between A-phases and B-phases, and the multi-class classification of the different CAP components. The most important KDD stages are: extraction of 55 features, feature ranking/transformation, and classification. Classification is performed by (i) support vector machine (SVM), (ii) k-nearest neighbors (k-NN), and (iii) discriminant analysis. We report the weighted accuracy (WAC), which accounts for class imbalance.

Results

The study includes 30 subjects from the CAP Sleep Database of Physionet. The best alternative for the discrimination of the different A-phase subtypes involved feature ranking by the minimum redundancy maximum relevance algorithm (mRMR) and classification by SVM, with a WAC of 51%. Concerning the binary discrimination between A-phases and B-phases, k-NN with mRMR ranking achieved the best WAC of 80%.

Conclusions

We describe a KDD methodology that, to the best of our knowledge, is applied to CAP scoring for the first time. In particular, the full discrimination of the three different A-phase subtypes is a new perspective, since past works tried multi-class approaches but based them on grouping of different subtypes. We also considered the weighted accuracy, in addition to simple accuracy, resulting in a more trustworthy performance assessment. Overall, better subtype sensitivities than other published approaches were achieved.
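
Why plain accuracy misleads on imbalanced data can be shown with a small sketch. The abstract does not define its weighted accuracy, so the version below assumes one common choice, the unweighted mean of per-class recall (often called balanced accuracy); the function names and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally regardless of size.

    Assumed definition: the source does not spell out its WAC formula.
    """
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / totals[c] for c in totals) / len(totals)


# Toy example: nine B-phase epochs, one A-phase epoch, classifier always says "B".
truth = ["B"] * 9 + ["A"]
preds = ["B"] * 10
print(accuracy(truth, preds), weighted_accuracy(truth, preds))
```

A classifier that ignores the minority class scores high on plain accuracy but only at chance level on the class-balanced measure, which is the failure mode the Background describes.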

16.
The study of protein folding mechanisms continues to be one of the most challenging problems in computational biology. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus various reaction coordinates, such as the fraction of native contacts, the radius of gyration, RMSD from the native structure, and so on. In this paper, we present a combinatorial pattern discovery approach toward understanding the global state changes during the folding process. This is a first step toward an unsupervised (and perhaps eventually automated) approach toward identification of global states. The approach is based on computing biclusters (or patterned clusters): each cluster is a combination of various reaction coordinates, and its signature pattern facilitates the computation of the Z-score for the cluster. For this discovery process, we present an algorithm of time complexity $O((N + nm)\log n)$, where N is the size of the output patterns and n × m is the size of the input with n time frames and m reaction coordinates. To date, this is the best time complexity for this problem. We next apply this to a beta-hairpin folding trajectory and demonstrate that this approach extracts crucial information about protein folding intermediate states and mechanism. We make three observations about the approach: (1) The method recovers states previously obtained by visually analyzing free energy surfaces. (2) It also succeeds in extracting meaningful patterns and structures that had been overlooked in previous works, which provides a better understanding of the folding mechanism of the beta-hairpin. These new patterns also interconnect various states in existing free energy surfaces versus different reaction coordinates. (3) The approach does not require calculating the free energy values, yet it offers an analysis comparable to, and sometimes better than, the methods that use free energy landscapes, thus validating the choice of reaction coordinates. (An abstract version of this work was presented at the 2005 Asia Pacific Bioinformatics Conference [1].)

17.
18.
There have been recent attempts to use the principles of combinatorial chemistry and high-throughput screening strategies for catalyst identification. With the technology available that allows the synthesis of large libraries, scientists of varied backgrounds have implemented screening efforts to identify active and selective catalysts. Within this context, several techniques have come to light in the past year: infrared thermography is used to identify optimal catalysts by monitoring the change in temperature for exothermic reactions; fluorescence and colored-dye assays, a familiar tool to biologists, are being applied to the identification of catalysts that exhibit the highest activity. Whereas none of these screening methods provides a general solution to the problem of screening large combinatorial libraries (there is likely to be no general solution), each advance represents an important intellectual and technological step forward.

19.
20.
Class and biomarker discovery continue to be among the preeminent goals in gene microarray studies of cancer. We have developed a new data mining technique, which we call Binary State Pattern Clustering (BSPC), that is specifically adapted for these purposes, with cancer and other categorical datasets. BSPC is capable of uncovering statistically significant sample subclasses and associated marker genes in a completely unsupervised manner. This is accomplished through the application of a digital paradigm, where the expression level of each potential marker gene is treated as being representative of its discrete functional state. Multiple genes that divide samples into states along the same boundaries form a kind of gene-cluster that has an associated sample-cluster. BSPC is an extremely fast deterministic algorithm that scales well to large datasets. Here we describe results of its application to three publicly available oligonucleotide microarray datasets. Using an alpha-level of 0.05, clusters reproducing many of the known sample classifications were identified along with associated biomarkers. In addition, a number of simulations were conducted using shuffled versions of each of the original datasets, noise-added datasets, as well as completely artificial datasets. The robustness of BSPC was compared to that of three other publicly available clustering methods: ISIS, CTWC and SAMBA. The simulations demonstrate BSPC's substantially greater noise tolerance and confirm the accuracy of our calculations of statistical significance.
