Similar articles
 Found 20 similar articles; search time: 671 ms
1.
Abstract. A method is proposed to estimate the frequency and the spatial heterogeneity of occurrence of the individual plant species composing a grassland community or another plant community of low height. The measure is based on the beta‐binomial distribution. The weighted average heterogeneity of all the species composing a community provides a measure of community‐level heterogeneity, which describes the spatial intricacy of the community's species composition. To illustrate the method, a sown grassland grazed by cows was analysed using 102 quadrats of 50 cm × 50 cm, each of which was divided into four small quadrats of 25 cm × 25 cm. The occurrence of every species was recorded in each small quadrat. Good fits to the beta‐binomial series were obtained for most species of the community. These results indicate that (1) each species is distributed heterogeneously with its own spatial pattern, (2) the degree of heterogeneity differs from species to species, and (3) the beta‐binomial distribution can be applied to grassland communities. In most of the observed species, spatial heterogeneity is characterized by species‐specific propagating traits: seed‐propagating plant species exhibited a low heterogeneity/random pattern, while clonal species exhibited a high heterogeneity/aggregated pattern. This measure can be applied to field surveys and to the estimation of community parameters for grassland diagnosis.

2.
We propose a new model-based approach linking word learning to the age of acquisition (AoA) of words; a new computational tool for understanding the relationships among word learning processes, psychological attributes, and word AoAs as measures of vocabulary growth. The computational model developed describes the distinct statistical relationships between three theoretical factors underpinning word learning and AoA distributions. Simply put, this model formulates how different learning processes, characterized by change in learning rate over time and/or by the number of exposures required to acquire a word, likely result in different AoA distributions depending on word type. We tested the model in three respects. The first analysis showed that the proposed model accounts for empirical AoA distributions better than a standard alternative. The second analysis demonstrated that the estimated learning parameters well predicted the psychological attributes, such as frequency and imageability, of words. The third analysis illustrated that the developmental trend predicted by our estimated learning parameters was consistent with relevant findings in the developmental literature on word learning in children. We further discuss the theoretical implications of our model-based approach.

3.
MOTIVATION: Many statistical measures have been proposed that efficiently compare biological sequences in order to infer their structures, functions and evolutionary information. They are related in spirit because all of them use information on the k-word distributions, a Markov model, or both. Motivated by the idea of adding k-word distributions to the Markov model directly, we investigated two novel statistical measures for sequence comparison, called wre.k.r and S2.k.r. RESULTS: The proposed measures were tested by similarity search, evaluation on functionally related regulatory sequences and phylogenetic analysis, which offers a systematic and quantitative experimental assessment of our measures. Moreover, we compared our results with those of alignment-based and alignment-free measures. We grouped our experiments into two sets. The first, performed via ROC (receiver operating characteristic) curve analysis, assesses the intrinsic ability of our statistical measures to search for similar sequences in a database and to discriminate functionally related regulatory sequences from unrelated sequences. The second assesses how well our statistical measures perform in phylogenetic analysis. The experimental assessment demonstrates that our similarity measures, which incorporate k-word distributions into the Markov model, are more efficient.

4.
MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.

5.
S Engen 《Biometrics》1975,31(1):201-208
A taxonomic group will frequently have a large number of species with small abundances. When a sample is drawn at random from this group, one is therefore faced with the problem that a large proportion of the species will not be discovered. A general definition of quantitative measures of "sample coverage" is proposed, and the problem of statistical inference is considered for two special cases, (1) the actual total relative abundance of those species that are represented in the sample, and (2) their relative contribution to the information index of diversity. The analysis is based on an extended version of the negative binomial species frequency model. The results are tabulated.

6.
MOTIVATION: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences. RESULTS: Our study shows that (1) for whole-sequence similarity/dissimilarity identification the window size should be as large as possible, but probably not >3000, as restricted by CPU time in practice, (2) for each measure the optimal word size increases with window size, (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis, (4) the estimate of beta based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and to speed up alignment-based database searches for similar sequences and (5) the estimate of beta is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays. AVAILABILITY: The SK-LD algorithm, the estimate of beta and the simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu
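
As a rough illustration of a word-based measure of this kind, the sketch below computes a symmetric Kullback-Leibler discrepancy between the k-word frequency distributions of two DNA sequences. It is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function names `kword_freqs` and `sk_ld`, the pseudocount, and the unnormalised form KL(p||q) + KL(q||p) are all choices made here, since the abstract gives neither the exact normalisation nor the windowing scheme.

```python
from collections import Counter
from itertools import product
from math import log


def kword_freqs(seq, k, alphabet="ACGT", pseudo=1.0):
    """Relative frequency of every possible k-word in seq.

    A pseudocount (an assumption of this sketch) keeps all
    frequencies strictly positive so the logarithms are defined.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words) + pseudo * len(words)
    return {w: (counts[w] + pseudo) / total for w in words}


def sk_ld(seq_a, seq_b, k=2):
    """Symmetric Kullback-Leibler discrepancy, KL(p||q) + KL(q||p).

    The two directed divergences sum to sum_w (p_w - q_w) * log(p_w / q_w).
    """
    p = kword_freqs(seq_a, k)
    q = kword_freqs(seq_b, k)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)


if __name__ == "__main__":
    a = "ACGTACGTACGTACGT"
    b = "AAAAACCCCCGGGGGT"
    print(sk_ld(a, a), sk_ld(a, b))
```

The value is zero for identical k-word distributions and grows as the distributions diverge; applying it window by window, as the abstract describes, would require an additional loop over windows.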

7.
A method is developed to study the periodic properties of nucleotide sequences, allowing the favoured pattern of the repeating unit, as well as the length and localization of the periodic segment, to be determined simultaneously. The degree of periodicity is evaluated by calculating the probabilities of random occurrence of the maximal deviations of the nucleotide composition in each phase, making use of the binomial formula. The nucleotide sequence of the tobacco mosaic virus (TMV) RNA responsible for recognition of the homologous protein (“assembly origin”, AO) (Zimmern & Butler, 1977) was investigated in order to find periodic regions of primary structure which might be essential in the recognition process. As a result, the most periodic segments of the AO, consisting of 31 and 17 nucleotides and corresponding to the schemes GAU or GA1, were found. However, the periodicities in these regions do not exceed those expected for random sequences. This can be considered evidence that, in addition to peculiarities of primary structure, other features such as RNA secondary or tertiary structure are essential in this interaction. For comparison, the nucleotide sequences of other fragments of TMV RNA, as well as MS2 RNA, TYMV RNA, 16S rRNA and phage fd DNA, were investigated by the same method.

8.
A probabilistic measure for alignment-free sequence comparison   Cited by: 3 (self-citations: 0; cited by others: 3)
MOTIVATION: Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULTS: The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri; and one random sequence having the same base composition as thrA from E.coli. These results were compared with those obtained from the CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). AVAILABILITY: All datasets and computer codes written in MATLAB are available upon request from the first author.

9.
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another one containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.

10.
Summary. Three measures of sequence dissimilarity have been compared on a computer-generated model system in which substitutions in random sequences were made at randomly selected sites and the replacement character was chosen at random from the set of characters different from the original occupant of the site. The three measures were the conventional mismatch count between aligned sequences (AMC=m) and two measures not requiring prior sequence alignment. The latter two measures were the squared Euclidean distance between vectors of counts of t-tuples (t=1–6) of characters in the two sequences (multiplet distribution distances or MDD=d) and counts of characters not covered by word structures of statistically significant length common to the two sequences (common long words or CLW=SIB, SIS, or SAB). Average MDD distances were found to be two times average mismatch counts in the simulated sequences for all values of t from 1 to 6 and all degrees of substitution from one per sequence to so many as to produce, effectively, random sequences. This simple relation held independently of sequence length and of sequence composition. The relation was confirmed by exact results on small model systems and by formal asymptotic results in the limit of so few substitutions that no double hits occur and in the limit of two random sequences. The coefficient of variation for MDD distances was greater than that for mismatch counts for singlets but both measures approached the same low value for sextets. Needleman-Wunsch alignment produced incorrect mismatch counts at higher degrees of substitution. The model satisfied the conditions for the derivation of the Jukes-Cantor asymptotic adjustment, but its application produced increasingly bad results with increasing degrees of substitution in accord with earlier results on model and natural sequences. This was a consequence of the sensitivity of the adjustment to error in the observations, which increases with the degree of substitution. Average CLW distances for a variety of common word structures were more or less parallel to MDD distances for appropriately long t-tuples. These results on model systems supported the validity of the two dissimilarity measures not requiring sequence alignment that was found in earlier work on natural sequences (Blaisdell 1989).
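
The relation between the two simplest measures in this summary can be checked on a toy case. The sketch below is illustrative only (the function names `tuple_counts`, `mdd` and `mismatch_count` are invented here, and it is not the original simulation code): it computes the squared Euclidean distance between t-tuple count vectors and the mismatch count between two aligned sequences of equal length.

```python
from collections import Counter


def tuple_counts(seq, t):
    """Counts of all overlapping t-tuples in seq."""
    return Counter(seq[i:i + t] for i in range(len(seq) - t + 1))


def mdd(seq_a, seq_b, t):
    """Squared Euclidean distance between the t-tuple count vectors."""
    ca, cb = tuple_counts(seq_a, t), tuple_counts(seq_b, t)
    # Counter returns 0 for tuples absent from one of the sequences.
    return sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb))


def mismatch_count(seq_a, seq_b):
    """Number of differing positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    original = "ACGTACGT"
    mutated = "ACGTTCGT"  # one substitution: A -> T at position 5
    print(mismatch_count(original, mutated), mdd(original, mutated, 1))
```

With a single substitution and singlets (t=1), one character count drops by one and another rises by one, so the squared distance is 2 for a mismatch count of 1, which matches the factor of two reported for the averages.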

11.
The classification accuracy of new diagnostic tests is based on receiver operating characteristic (ROC) curves. The area under the ROC curve (AUC) is one of the well-accepted summary measures for describing the accuracy of diagnostic tests. The AUC summary measure can vary by patient and testing characteristics. Thus, the performance of the test may differ in certain subpopulations of patients and readers. For this purpose, we propose a direct semi-parametric regression model for the non-parametric AUC measure for ordinal data while accounting for discrete and continuous covariates. The proposed method can be used to estimate the AUC value under degenerate data where certain rating categories are not observed. We discuss the non-standard asymptotic theory, which is needed because the estimating functions are based on cross-correlated random variables. Simulation studies based on different classification models showed that the proposed model worked reasonably well, with small percent bias and percent mean-squared error. The proposed method was applied to the prostate cancer study to estimate the AUC for four readers, and to the carotid vessel study with age, gender, history of previous stroke, and total number of risk factors as covariates, to estimate the accuracy of the diagnostic test in the presence of subject-level covariates.
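
For orientation, the non-parametric AUC for ordinal ratings that such a regression model targets can be written as the proportion of diseased/non-diseased pairs in which the diseased subject receives the higher rating, with ties counted as one half. The sketch below covers only this basic estimator, not the semi-parametric regression or its asymptotic theory; the function name `auc_ordinal` and the example ratings are made up for illustration.

```python
def auc_ordinal(diseased, healthy):
    """Non-parametric (Mann-Whitney) AUC for ordinal ratings.

    Each diseased/healthy pair scores 1 if the diseased rating is
    higher, 0.5 if the ratings tie, and 0 otherwise.
    """
    pairs = len(diseased) * len(healthy)
    if pairs == 0:
        raise ValueError("both groups must be non-empty")
    score = sum(
        1.0 if d > h else 0.5 if d == h else 0.0
        for d in diseased
        for h in healthy
    )
    return score / pairs


if __name__ == "__main__":
    # Hypothetical 5-point ratings; category counts may be zero
    # for some categories without breaking the estimator.
    print(auc_ordinal([3, 4, 5], [1, 2, 3]))
```

Because the estimator only compares pairs of ratings, it remains defined when some rating categories are never observed, which is the degenerate-data situation the abstract mentions.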

12.
A text can be considered as a one-dimensional array of words. The locations of each word type in this array form a fractal pattern with a certain fractal dimension. We observe that important words responsible for conveying the meaning of a text have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index quantifying the importance of the words in a given text using their fractal dimensions and then ranking them according to their importance. This index measures the difference between the fractal pattern of a word in the original text relative to a shuffled version. Because the shuffled text is meaningless (i.e., words have no importance), the difference between the original and shuffled text can be used to ascertain the degree of fractality. The degree of fractality may be used for automatic keyword detection. Words with a degree of fractality higher than a threshold value are assumed to be the retrieved keywords of the text. We measure the efficiency of our method for keyword extraction by comparing it with two other well-known methods of automatic keyword extraction.

13.
The binomial test is applied to the problem of testing a hypothesis based on a sample of independent, but non-identically distributed random variables. The basic idea is that each random variable indicates the presence of the hypothesis. Hence each random variable is transformed such that the binomial test can be used as a simple procedure.
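
One way to read this idea is sketched below. The abstract does not specify the transformation, so the sketch assumes, purely for illustration, that each observation is compared with its own null-hypothesis threshold (for example its null median), giving indicators that are Bernoulli(p) under the null even though the underlying variables are not identically distributed. The names `binom_upper_tail` and `indicator_binomial_test` are hypothetical.

```python
from math import comb


def binom_upper_tail(successes, n, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


def indicator_binomial_test(values, thresholds, p=0.5):
    """One-sided binomial test on per-observation indicators.

    Assumption of this sketch: values[i] exceeds thresholds[i] with
    probability p under the null hypothesis, independently for each i.
    """
    indicators = [v > c for v, c in zip(values, thresholds)]
    return binom_upper_tail(sum(indicators), len(indicators), p)


if __name__ == "__main__":
    # Hypothetical data: 8 of 10 observations exceed their thresholds.
    values = [5, 7, 9, 4, 6, 8, 3, 9, 1, 2]
    thresholds = [4, 6, 8, 3, 5, 7, 2, 8, 5, 5]
    print(indicator_binomial_test(values, thresholds))
```

A small p-value would indicate that more observations exceed their thresholds than the null hypothesis predicts.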

14.
MOTIVATION: A global view of the protein space is essential for functional and evolutionary analysis of proteins. In order to achieve this, a similarity network can be built using pairwise relationships among proteins. However, existing similarity networks employ a single similarity measure and therefore their utility depends highly on the quality of the selected measure. A more robust representation of the protein space can be realized if multiple sources of information are used. RESULTS: We propose a novel approach for analyzing multi-attribute similarity networks by combining random walks on graphs with Bayesian theory. A multi-attribute network is created by combining sequence and structure based similarity measures. For each attribute of the similarity network, one can compute a measure of affinity from a given protein to every other protein in the network using random walks. This process makes use of the implicit clustering information of the similarity network, and we show that it is superior to naive, local ranking methods. We then combine the computed affinities using a Bayesian framework. In particular, when we train a Bayesian model for automated classification of a novel protein, we achieve high classification accuracy and outperform single attribute networks. In addition, we demonstrate the effectiveness of our technique by comparison with a competing kernel-based information integration approach.

15.
Abstract. Statistical measures of fidelity, i.e. the concentration of species occurrences in vegetation units, are reviewed and compared. The focus is on measures suitable for categorical data which are based on observed species frequencies within a vegetation unit compared with the frequencies expected under random distribution. Particular attention is paid to Bruelheide's u value. It is shown that its original form, based on the binomial distribution, is an asymmetric measure of fidelity of a species to a vegetation unit which tends to assign comparatively high fidelity values to rare species. Here, a hypergeometric form of u is introduced which is a symmetric measure of the joint fidelity of species to a vegetation unit and vice versa. It is also shown that another form of the binomial u value may be defined which measures the asymmetric fidelity of a vegetation unit to a species. These u values are compared with the phi coefficient, chi‐square, the G statistic and Fisher's exact test. Contrary to the other measures, the phi coefficient is independent of the number of relevés in the data set, and like the hypergeometric form of u and the chi‐square it is little affected by the relative size of the vegetation unit. It is therefore particularly useful when comparing species fidelity values among differently sized data sets and vegetation units. However, unlike the other measures it does not measure any statistical significance and may produce unreliable results for small vegetation units and small data sets. The above measures, all based on the comparison of observed/expected frequencies, are compared with the categorical form of the Dufrêne‐Legendre Indicator Value Index, an index strongly underweighting the fidelity of rare species. These fidelity measures are applied to a data set of 15 989 relevés of Czech herbaceous vegetation. In a small subset of this data set which simulates a phytosociological table, we demonstrate that traditional table analysis fails to determine diagnostic species of general validity in different habitats and large areas. On the other hand, we show that fidelity calculations used in conjunction with large data sets can replace expert knowledge in the determination of generally valid diagnostic species. Averaging positive fidelity values for all species within a vegetation unit is a useful approach to measuring the quality of delimitation of the vegetation unit. We propose a new way of ordering species in synoptic species‐by‐relevé tables, using fidelity calculations.
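
The phi coefficient mentioned in this abstract can be computed directly from four counts. The sketch below uses the standard phi coefficient of association for a 2 × 2 presence/absence table; the symbol names (`n_total` for all relevés, `n_unit` for relevés in the vegetation unit, `occ_total` for all occurrences of the species, `occ_unit` for its occurrences inside the unit) are chosen here for readability and are not taken from the paper.

```python
from math import sqrt


def phi_fidelity(n_total, n_unit, occ_total, occ_unit):
    """Phi coefficient of a species' fidelity to a vegetation unit.

    n_total   -- number of relevés in the data set
    n_unit    -- number of relevés in the vegetation unit
    occ_total -- number of relevés containing the species
    occ_unit  -- number of relevés in the unit containing the species
    """
    denom = occ_total * n_unit * (n_total - occ_total) * (n_total - n_unit)
    if denom <= 0:
        raise ValueError("phi is undefined for degenerate margins")
    return (n_total * occ_unit - occ_total * n_unit) / sqrt(denom)


if __name__ == "__main__":
    # Hypothetical counts: the species occurs in 20 relevés, all inside
    # a unit of 20 relevés, out of 100 relevés in total.
    print(phi_fidelity(100, 20, 20, 20))
    # Occurrence inside the unit exactly as expected by chance.
    print(phi_fidelity(100, 20, 20, 4))
```

The value ranges from -1 to 1, with 0 corresponding to the frequency expected under random distribution; multiplying all four counts by the same factor leaves it unchanged, which reflects the independence from data set size noted in the abstract.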

16.
Abstract. Studies of citrus leafminer in a coastal orchard in NSW, Australia indicated that an increase in abundance to about one mine per flush was followed during the midseason flush by a rapid increase in population that was related to an increase in the percentage of leaves infested within flushes and the number of mines per leaf. The fits of frequency distributions and Iwao's patchiness regression indicated that populations were highly contagious initially, and as the exponent k of the negative binomial distribution increased with increasing population density, the distribution approached random. Concurrently, the coefficient of variation of mines per flush (which was strongly related to the proportion of un-infested flushes) decreased to about unity as the proportion of un-infested flushes reached zero and fell further as the number of mines per flush increased. Both enumerative and binomial sequential sampling plans were developed using a decision threshold based on 1.2 mines per flush. The binomial sampling plan was based on a closely fitting model of the functional relationship between mean density and proportion of infested flushes. Functional relationships using the parameters determined from Iwao's patchiness regression and Taylor's power law were equally satisfactory, and one based on the negative binomial model also fitted well, but the Poisson model did not. The three best fitting models indicated that a decision threshold of 1.2 mines per flush was equivalent to 50% of flushes infested. From a practical point of view, the transition from 25% infestation of flushes through 50% is so rapid that it may be prudent to take action when the 25% level is reached; otherwise, the 50% level may be passed before the crop is checked again. For valuable nursery stock, if infestation is detected in spring, it may be advisable to apply prophylactic treatment as the midseason flush starts.
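
The functional relationship between mean density and the proportion of infested flushes can be illustrated with the two textbook models named in the abstract. The sketch below is a simplification: it uses a single fixed negative binomial exponent k, whereas the study reports that k increased with density, and the value of k used in the example is arbitrary rather than estimated from the paper's data. The function names are invented here.

```python
from math import exp


def prop_infested_poisson(mean_density):
    """Proportion of units with at least one mine under a Poisson model."""
    return 1.0 - exp(-mean_density)


def prop_infested_negbin(mean_density, k):
    """Proportion of units with at least one mine under a negative
    binomial model with mean mean_density and exponent k (k > 0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 - (1.0 + mean_density / k) ** (-k)


if __name__ == "__main__":
    threshold = 1.2  # mines per flush, the decision threshold in the abstract
    print(prop_infested_poisson(threshold))
    print(prop_infested_negbin(threshold, k=0.5))  # k chosen arbitrarily
```

For the same mean density, an aggregated (small k) population leaves more flushes un-infested than a random one, which is why the Poisson model predicts a higher proportion infested than an aggregated population would show.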

17.
18.
Moderated statistical tests for assessing differences in tag abundance   Cited by: 2 (self-citations: 0; cited by others: 2)
MOTIVATION: Digital gene expression (DGE) technologies measure gene expression by counting sequence tags. They are sensitive technologies for measuring gene expression on a genomic scale, without the need for prior knowledge of the genome sequence. As the cost of sequencing DNA decreases, the number of DGE datasets is expected to grow dramatically. Various tests of differential expression have been proposed for replicated DGE data using binomial, Poisson, negative binomial or pseudo-likelihood (PL) models for the counts, but none of these are usable when the number of replicates is very small. RESULTS: We develop tests using the negative binomial distribution to model overdispersion relative to the Poisson, and use conditional weighted likelihood to moderate the level of overdispersion across genes. Not only is our strategy applicable even with the smallest number of libraries, but it also proves to be more powerful than previous strategies when more libraries are available. The methodology is equally applicable to other counting technologies, such as proteomic spectral counts. AVAILABILITY: An R package can be accessed from http://bioinf.wehi.edu.au/resources/

19.
Dai Q  Liu X  Yao Y  Zhao F 《Amino acids》2012,42(5):1867-1877
There are two crucial problems with statistical measures for sequence comparison: overlapping structures and background information of words in biological sequences. Word normalization in the improved composition vector method took these problems into account and achieved better performance in evolutionary analysis. This word normalization is desirable, but not sufficient, because it assumes that the four bases A, C, T, and G occur randomly with equal chance. This paper proposes an improved word normalization which uses a Markov model to estimate the exact k-word distribution according to the observed biological sequence and thus has the ability to adjust the background information of the k-word frequencies in biological sequences. The improved word normalization was tested with three experiments and compared with the existing word normalization. The experimental results confirm that the improved word normalization, which uses a Markov model to estimate the exact k-word distribution in biological sequences, is more efficient.

20.
Abstract

A method to quantify the dynamic schemes of vegetation based on information functions. The dynamic schemes of vegetation as proposed by Braun-Blanquet have never been quantified with the aim of giving indirect measures of the probability of transition between the types. This work presents a method that quantifies the arrows between the types by means of a redundancy measure, which is calculated by comparing the phytosociological types two by two. Redundancy, as proposed here, measures the similarity between the tables based not only on species composition but also on species co-occurrence. The method relies on the assumption that, in a succession, the higher the redundancy between types that are presumably in sequence, the higher the probability, and hence the velocity, of transition from one type to the other. An example is given with data from coastal dune grasslands of the Venice Lagoon.
