Similar articles
 20 similar articles found (search time: 15 ms)
1.
The number of high-dimensional datasets recording multiple aspects of a single phenomenon is increasing in many areas of science, accompanied by a need for mathematical frameworks that can compare multiple large-scale matrices with different row dimensions. The only such framework to date, the generalized singular value decomposition (GSVD), is limited to two matrices. We mathematically define a higher-order GSVD (HO GSVD) for $N \geq 2$ matrices $D_i \in \mathbb{R}^{m_i \times n}$, each with full column rank. Each matrix is exactly factored as $D_i = U_i \Sigma_i V^T$, where $V$, identical in all factorizations, is obtained from the eigensystem $SV = V\Lambda$ of the arithmetic mean $S$ of all pairwise quotients $A_i A_j^{-1}$ of the matrices $A_i = D_i^T D_i$, $i \neq j$. We prove that this decomposition extends to higher orders almost all of the mathematical properties of the GSVD. The matrix $S$ is nondefective with $V$ and $\Lambda$ real. Its eigenvalues satisfy $\lambda_k \geq 1$. Equality holds if and only if the corresponding eigenvector $v_k$ is a right basis vector of equal significance in all matrices $D_i$ and $D_j$, that is $\sigma_{i,k}/\sigma_{j,k} = 1$ for all $i$ and $j$, and the corresponding left basis vector $u_{i,k}$ is orthogonal to all other vectors in $U_i$ for all $i$. The eigenvalues $\lambda_k = 1$, therefore, define the "common HO GSVD subspace." We illustrate the HO GSVD with a comparison of genome-scale cell-cycle mRNA expression from S. pombe, S. cerevisiae and human. Unlike existing algorithms, a mapping among the genes of these disparate organisms is not required. We find that the approximately common HO GSVD subspace represents the cell-cycle mRNA expression oscillations, which are similar among the datasets. Simultaneous reconstruction in the common subspace, therefore, removes the experimental artifacts, which are dissimilar, from the datasets. In the simultaneous sequence-independent classification of the genes of the three organisms in this common subspace, genes of highly conserved sequences but significantly different cell-cycle peak times are correctly classified.
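The construction described in this abstract can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: the function name `ho_gsvd` and the use of explicit matrix inverses are assumptions made for brevity. It forms S as the mean of all pairwise quotients, takes V from its eigensystem, and recovers each U_i and Σ_i by solving against V.

```python
import numpy as np


def ho_gsvd(matrices):
    """Sketch of the HO GSVD for N >= 2 full-column-rank matrices with n columns each."""
    n_mats = len(matrices)
    n = matrices[0].shape[1]
    # A_i = D_i^T D_i is n x n and invertible when D_i has full column rank.
    grams = [d.T @ d for d in matrices]
    inverses = [np.linalg.inv(a) for a in grams]

    # S = arithmetic mean of all pairwise quotients A_i A_j^{-1}, i != j.
    s = np.zeros((n, n))
    for i in range(n_mats):
        for j in range(n_mats):
            if i != j:
                s += grams[i] @ inverses[j]
    s /= n_mats * (n_mats - 1)

    # Eigensystem S V = V Lambda. The abstract states V and Lambda are real,
    # so any imaginary part here would be numerical noise.
    eigvals, v = np.linalg.eig(s)
    eigvals, v = eigvals.real, v.real

    factors = []
    for d in matrices:
        # Solve V B_i^T = D_i^T, so that D_i = B_i V^T.
        b = np.linalg.solve(v, d.T).T
        sigma = np.linalg.norm(b, axis=0)  # sigma_{i,k} is the norm of column k of B_i
        u = b / sigma                      # columns of U_i are normalized, not orthogonal
        factors.append((u, sigma))
    return factors, v, eigvals
```

Each input is then reproduced as `u @ np.diag(sigma) @ v.T`, and the returned eigenvalues can be inspected for values close to 1, which the abstract associates with the common subspace.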

2.
In this article, we introduce three 3D graphical representations of DNA primary sequences, which we call RY-curve, MK-curve and SW-curve, based on three classifications of the DNA bases. The advantages of our representations are that (i) these 3D curves are strictly non-degenerate and there is no loss of information when transferring a DNA sequence to its mathematical representation and (ii) the coordinates of every node on these 3D curves have a clear biological implication. Two applications of these 3D curves are presented: (a) a simple formula is derived to calculate the content of the four bases (A, G, C and T) from the coordinates of nodes on the curves; and (b) a 12-component characteristic vector is constructed to compare similarity among DNA sequences from different species based on the geometrical centers of the 3D curves. As examples, we examine similarity among the coding sequences of the first exon of the beta-globin gene from eleven species and validate similarity of cDNA sequences of the beta-globin gene from eight species.

3.
We consider a novel 2-D graphical representation of DNA sequences according to the chemical structures of bases, reflecting the distribution of bases with different chemical structure, preserving information on sequential adjacency of bases, and allowing numerical characterization. The representation avoids the loss of information accompanying alternative 2-D representations in which the curve standing for DNA overlaps and intersects itself. Based on this representation we present a numerical characterization approach using the leading eigenvalues of the matrices associated with the DNA sequences. The utility of the approach is illustrated on the coding sequences of the first exon of the human beta-globin gene.

4.

Background  

Two-dimensional data colourings are an effective medium by which to represent three-dimensional data in two dimensions. Such "color-grid" representations have found increasing use in the biological sciences (e.g. microarray 'heat maps' and bioactivity data) as they are particularly suited to complex data sets and offer an alternative to the graphical representations included in traditional statistical software packages. The effectiveness of color-grids lies in their graphical design, which introduces a standard for customizable data representation. Currently, software applications capable of generating limited color-grid representations can be found only in advanced statistical packages or custom programs (e.g. microarray analysis tools), often associated with steep learning curves and requiring expert knowledge.

5.
Part A of this review describes the particular computer-assisted identification service operated by the NCTC. In Part B, the use of probability matrices is examined, discussing various methods of calculating likelihoods and the problems that arise when calculating these from probability matrices. Part C describes the alternative numerical methods of constructing identification keys and the supplementary methods of selecting best sets of characters to aid identification. Finally, in Part D, the prospects and limitations of numerical methods in bacterial identification are assessed, first with regard to methodology used and then in terms of performance and practical limitations.

6.
7.
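Based on a 3D graphical representation of DNA sequences, each DNA sequence is characterized by a 3-dimensional vector composed of the normalized leading eigenvalues of its L/L matrices. Using this approach, the similarity among 11 species is analyzed on the basis of the first exon of the β-globin gene.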

8.
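The concept of compatibility in cladistics and a method of compatibility analysis   Total citations: 6, self-citations: 0, citations by others: 6
Compatibility is a fundamental concept in cladistics. This paper gives a mathematical definition of compatibility, termed Kexue compatibility, and on that basis develops a new method of compatibility analysis. The application of the method to cladistic research is also discussed.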

9.
10.
The subunit stoichiometry of a large, multisubunit protein can be determined from the molar amino acid compositions (i amino acids) of the protein and its subunits. The number of copies of each of the subunits (1, 2, ... j) is calculated by solving all possible combinations of simultaneous equations in j unknowns; there are $i!/(j!\,(i-j)!)$ such combinations. Calculations carried out using the published amino acid compositions determined by analysis and the compositions calculated from the sequences for two proteins of known stoichiometry provided the following results: Escherichia coli aspartate transcarbamoylase (R6C6, Mr = 307.5 kDa), R = 5.6 to 6.6 and C = 5.8 to 6.3, and spinach ribulose-bisphosphate carboxylase (L8S8, Mr = 535 kDa), L = 7.3 to 9.1 and S = 5.6 to 10.6. Calculations were also carried out with the amino acid compositions of two much larger proteins, the E. coli pyruvate dehydrogenase complex, Mr = 5280 kDa, subunits E1 (99.5 kDa), E2 (66 kDa), and E3 (50.6 kDa), and the extracellular hemoglobin of Lumbricus terrestris, Mr = 3760 kDa, subunits M (17 kDa), D1 (31 kDa), D2 (37 kDa), and T (51 kDa); the results for PDHase were E1 = 20 to 24, E2 = 18 to 31, E3 = 21 to 33 and those for Lumbricus hemoglobin were M = 34 to 46, D1 = 13 to 19, D2 = 13 to 18, and T = 34 to 36. Although the sample standard deviations of the mean values are generally high, the proposed method works surprisingly well for the two smaller proteins and provides physically reasonable results for the two larger proteins.
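The combinatorial solve described in this abstract can be illustrated as follows. This is a sketch under stated assumptions: the function name `estimate_copy_numbers`, the conditioning threshold, and the composition numbers are all hypothetical (synthetic data, not taken from the paper). For every choice of j amino acids out of i, a j-by-j linear system is solved for the copy numbers, and the solutions are then summarized by their mean and sample standard deviation.

```python
from itertools import combinations

import numpy as np


def estimate_copy_numbers(subunit_comp, protein_comp):
    """Solve every j-equation subset of the i composition equations.

    subunit_comp: (i x j) array, residues of each amino acid per subunit copy.
    protein_comp: length-i array, residues of each amino acid in the whole protein.
    Returns the mean and sample standard deviation of the copy numbers.
    """
    n_aa, n_sub = subunit_comp.shape
    solutions = []
    for rows in combinations(range(n_aa), n_sub):
        block = subunit_comp[list(rows), :]
        # Assumption: skip subsets whose equations are (nearly) linearly dependent.
        if np.linalg.cond(block) > 1e8:
            continue
        solutions.append(np.linalg.solve(block, protein_comp[list(rows)]))
    solutions = np.array(solutions)
    return solutions.mean(axis=0), solutions.std(axis=0, ddof=1)


# Synthetic example: 5 amino acids, 2 subunit types, 6 copies of each subunit.
subunits = np.array([[10.0, 3.0], [4.0, 12.0], [7.0, 7.0], [2.0, 9.0], [15.0, 1.0]])
protein = subunits @ np.array([6.0, 6.0])
mean_copies, std_copies = estimate_copy_numbers(subunits, protein)
```

With noise-free synthetic input every subset returns the same answer; with measured compositions the spread across subsets is what gives ranges such as those reported in the abstract.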

11.
12.
The problem of discovering novel motifs of binding sites is important to the understanding of gene regulatory networks. Motifs are generally represented by matrices (position weight matrix (PWM) or position specific scoring matrix (PSSM)) or strings. However, these representations cannot model biological binding sites well because they fail to capture nucleotide interdependence. It has been pointed out by many researchers that the nucleotides of a DNA binding site cannot be treated independently, e.g. in the binding sites of zinc finger proteins. In this paper, a new representation called Scored Position Specific Pattern (SPSP), which is a generalization of the matrix and string representations, is introduced; it takes into consideration the dependent occurrences of neighboring nucleotides. Even though the problem of discovering the optimal motif in SPSP representation is proved to be NP-hard, we introduce a heuristic algorithm called SPSP-Finder, which can effectively find optimal motifs in most simulated cases and in some real cases for which existing popular motif finding software, such as Weeder, MEME and AlignACE, fails.

13.
Summary

Heteronuclear 2D (13C, 1H) and (15N, 1H) correlation spectra of (13C, 15N) fully enriched proteins can be acquired simultaneously with virtually no sensitivity loss or increase in artefact levels. Three pulse sequences are described, for 2D time-shared or TS-HSQC, 2D TS-HMQC and 2D TS-HSMQC spectra, respectively. Independent spectral widths can be sampled for both heteronuclei. The sequences can be greatly improved by combining them with field-gradient methods. By applying the sequences to 3D and 4D NMR spectroscopy, considerable time savings can be obtained. The method is demonstrated for the 18 kDa HU protein.

Abbreviations: HMQC, heteronuclear multiple-quantum coherence spectroscopy; HSQC, heteronuclear single-quantum coherence spectroscopy; HSMQC, heteronuclear single- and multiple-quantum coherence spectroscopy; NOESY, nuclear Overhauser enhancement spectroscopy

14.
The CBCAnalyzer (CBC = compensatory base change) is a custom-written software toolbox consisting of three parts: CTTransform, CBCDetect, and CBCTree. CTTransform reads several ct-file formats and generates a so-called "bracket-dot-bracket" format that is typically used as input for other tools such as RNAforester, RNAmovie or MARNA. The latter creates a multiple alignment based on primary sequences and secondary structures that can then be used as input for CBCDetect. CBCDetect counts CBCs in an all-against-all comparison of the aligned sequences. This is important in detecting species that are discriminated by their sexual incompatibility. The count (distance) matrix obtained by CBCDetect is used as input for CBCTree, which reconstructs a phylogram by using the BIONJ algorithm. In this note we describe the features of the toolbox as well as application examples. The toolbox provides a graphical user interface. It is written in C++ and freely available at: http://cbcanalyzer.bioapps.biozentrum.uni-wuerzburg.de.

15.
Let A denote an alphabet consisting of n types of letters. Given a sequence S of length L with $v_i$ letters of type i on A, to describe the compositional properties and combinatorial structure of S, we propose a new complexity function of S, called the reciprocal complexity of S, as $C(S) = \prod_{i=1}^{n} \left( \frac{L}{n v_i} \right)^{v_i}$. Based on this complexity measure, an efficient algorithm is developed for classifying and analyzing simple segments of protein and nucleotide sequence databases associated with scoring schemes. The running time of the algorithm is nearly proportional to the sequence length. The program DSR corresponding to the algorithm was written in C++, associated with two parameters (window length and cutoff value) and a scoring matrix. Some examples regarding protein sequences illustrate how the method can be used to find regions. The first application of DSR is the masking of simple sequences for searching databases. Queries masked by DSR returned a manageable set of hits below the E-value cutoff score, which contained all true positive homologues. The second application is to study simple regions detected by the DSR program corresponding to known structural features of proteins. An extensive computational analysis has been made of protein sequences with known, physicochemically defined nonglobular segments. For the SWISS-PROT amino acid sequence database (Release 40.2 of 02-Nov-2001), we determine the best parameters and the best BLOSUM matrix for automatic segmentation of amino acid sequences into nonglobular and globular regions by the DSR program: window length k = 35, cutoff value b = 0.46, and the BLOSUM 62.5 matrix. The average "agreement accuracy (sensitivity)" of DSR segmentation for the SWISS-PROT database is 97.3%.
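As a small illustration of the complexity function in this abstract, read here as the product over the n letter types of (L / (n v_i)) raised to the power v_i, the sketch below evaluates its natural logarithm for a single segment. The function name and the choice to work in log space (to avoid underflow on long segments) are assumptions, as is the convention that letters absent from the segment contribute a factor of 1. This is not the DSR program, and it does not model the scoring matrix, the window, or the cutoff.

```python
import math
from collections import Counter


def log_reciprocal_complexity(segment, alphabet):
    """Natural log of C(S) = prod_i (L / (n * v_i)) ** v_i.

    Assumes every letter of `segment` belongs to `alphabet`; letters with
    v_i = 0 are skipped, which is the same as multiplying by 1.
    """
    n = len(alphabet)
    length = len(segment)
    counts = Counter(segment)
    return sum(v * math.log(length / (n * v)) for v in counts.values())


uniform = log_reciprocal_complexity("ACGT", "ACGT")   # every letter equally frequent
repeat = log_reciprocal_complexity("AAAA", "ACGT")    # a single repeated letter
```

Under this reading, a perfectly uniform composition gives log C(S) = 0 (C(S) = 1), and increasingly biased compositions push the value further below zero, which is why the measure can flag simple segments.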

16.
We have analyzed 29 published substitution matrices (SMs) and five statistical protein contact potentials (CPs) for comparison. We find that popular, 'classical' SMs obtained mainly from sequence alignments of globular proteins are mostly correlated by at least a value of 0.9. The BLOSUM62 is the central element of this group. A second group includes SMs derived from alignments of remote homologs or transmembrane proteins. These matrices correlate better with classical SMs (0.8) than among themselves (0.7). A third group consists of intermediate links between SMs and CPs - matrices and potentials that exhibit mutual correlations of at least 0.8. Next, we show that SMs can be approximated with a correlation of 0.9 by expressions $c_0 + x_i x_j + y_i y_j + z_i z_j$, 1

17.
Summary

With the advent of high density restriction fragment length polymorphism (RFLP) maps, it has become possible to determine the genotype of an individual at many genetic loci simultaneously. Often, such RFLP data are expressed as long strings of numbers or letters indicating the genotype for each locus analyzed. In this form, RFLP data can be difficult to interpret or utilize without complex statistical analysis. By contrast, numerical genotype data can also be expressed in a more useful, graphical form, known as a graphical genotype, which is described in detail in this paper. Ideally, a graphical genotype portrays the parental origin and allelic composition throughout the entire genome, yet is simple to comprehend and utilize. In order to demonstrate the usefulness of this concept, graphical genotypes for individuals from backcross and F2 populations in tomato are described. The concept can also be utilized in more complex mating schemes involving two or more parents. A model that predicts the accuracy of graphical genotypes is presented for hypothetical RFLP maps of varying marker spacing. This model indicates that graphical genotypes can be more than 99% correct in describing a genome of total size 1000 cM with RFLP markers located every 10 cM. In order to facilitate the application of graphical genotypes to genetics and breeding, we have developed computer software that generates and manipulates graphical genotypes. The concept of graphical genotypes should be useful in whole genome selection for polygenic traits in plant and animal breeding programs and in the diagnosis of heterogeneously based genetic diseases in humans.

18.
The mean measure of divergence is a dissimilarity measure between groups of individuals described by dichotomous variables. It is well suited to datasets with many missing values, and it is generally used to compute distance matrices and represent phenograms. Although often used in biological anthropology and archaeozoology, this method suffers from a lack of implementation in common statistical software. A package for the R statistical software, AnthropMMD, is presented here. Offering a dynamic graphical user interface, it is the first one dedicated to Smith's mean measure of divergence. The package also provides facilities for graphical representations and the crucial step of trait selection, so that the entire analysis can be performed through the graphical user interface. Its use is demonstrated using an artificial dataset, and the impact of trait selection is discussed. Finally, AnthropMMD is compared to three other free tools available for calculating the mean measure of divergence, and is proven to be consistent with them.

19.
It is shown that the multiple alignment problem with SP-score is NP-hard for each scoring matrix in a broad class M that includes most scoring matrices actually used in biological applications. The problem remains NP-hard even if sequences can only be shifted relative to each other and no internal gaps are allowed. It is also shown that there is a scoring matrix $M_0$ such that the multiple alignment problem for $M_0$ is MAX-SNP-hard, regardless of whether or not internal gaps are allowed.
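For readers unfamiliar with the objective named in this abstract, the sum-of-pairs (SP) score of a fixed alignment adds a pairwise score over every column and every pair of rows. The sketch below only evaluates that objective for a given alignment; it does not search for an optimal one, which is the hard problem. The scoring function `toy_score` and the rule that gap-gap pairs contribute nothing are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def sp_score(alignment, score, gap="-"):
    """Sum score(a, b) over every column and every unordered pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == gap and b == gap:
                continue  # assumption: a gap aligned with a gap scores nothing
            total += score(a, b)
    return total


def toy_score(a, b, gap="-"):
    # Hypothetical scheme: match +1, mismatch -1, letter against gap -2.
    if a == gap or b == gap:
        return -2
    return 1 if a == b else -1


example = ["AC-T", "ACGT", "A-GT"]
example_score = sp_score(example, toy_score)
```

Evaluating the score is cheap (quadratic in the number of rows per column); the hardness results concern choosing where to place gaps or shifts so that this quantity is optimal.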

20.
Genetic code, Hamming distance and stochastic matrices   Total citations: 3, self-citations: 0, citations by others: 3
In this paper we use the Gray code representation of the genetic code, C = 00, U = 10, G = 11 and A = 01 (C pairs with G, A pairs with U), to generate a sequence of genetic code-based matrices. In connection with these code-based matrices, we use the Hamming distance to generate a sequence of numerical matrices. We then further investigate the properties of the numerical matrices and show that they are doubly stochastic and symmetric. We determine the frequency distributions of the Hamming distances, building blocks of the matrices, decomposition and iterations of matrices. We present an explicit decomposition formula for the genetic code-based matrix in terms of permutation matrices, which provides a hypercube representation of the genetic code. It is also observed that there is a Hamiltonian cycle in a genetic code-based hypercube.
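The Gray code assignment quoted in this abstract makes the Hamming-distance structure easy to explore. The sketch below is an illustration, not the paper's construction: the function name, the ordering of words (C, U, G, A), and the normalization by the common row sum are assumptions. It builds the matrix of pairwise Hamming distances between all words of a given length and then rescales it so that rows and columns sum to 1.

```python
from itertools import product

import numpy as np

# Gray code assignment taken from the abstract.
GRAY = {"C": "00", "U": "10", "G": "11", "A": "01"}


def hamming_matrix(word_length):
    """Pairwise Hamming distances between Gray-coded words over {C, U, G, A}."""
    words = ["".join(w) for w in product("CUGA", repeat=word_length)]
    codes = ["".join(GRAY[base] for base in word) for word in words]
    size = len(codes)
    h = np.zeros((size, size), dtype=int)
    for r, x in enumerate(codes):
        for c, y in enumerate(codes):
            h[r, c] = sum(p != q for p, q in zip(x, y))
    return words, h


words, h = hamming_matrix(1)
# Assumption: divide by the (constant) row sum to obtain a stochastic matrix.
normalized = h / h.sum(axis=1, keepdims=True)
```

For single bases the order C, U, G, A changes exactly one bit at each step and again when returning to C, which is a concrete instance of a Hamiltonian cycle on the two-bit hypercube; `hamming_matrix(3)` gives the corresponding 64-by-64 matrix for codons.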
