Similar literature
20 similar documents found (search took 31 ms)
1.
We develop novel methods for recognizing and cataloging conformational states of RNA, and for discovering statistical rules governing those states. We focus on the conformation of the large ribosomal subunit from Haloarcula marismortui. The two approaches described here involve torsion matching and binning. Torsion matching is a pattern-recognition code that finds structural repetitions. Binning is a classification technique based on distributional models of the data. In comparing the results of the two methods we have tested the hypothesis that the conformation of a very large complex RNA molecule can be described accurately by a limited number of discrete conformational states. We identify and eliminate extraneous and redundant information without losing accuracy. We conclude, as expected, that four of the torsion angles contain the overwhelming bulk of the structural information. That information is not significantly compromised by binning the continuous torsional information into a limited number of discrete values. The correspondence between torsion matching and binning is 99% (per residue). Binning, however, does have several advantages. In particular, we demonstrate that the conformation of a large complex RNA molecule can be represented by a small alphabet. In addition, the binning method lends itself to a natural graphical representation using trees.
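The core idea of the binning approach, mapping continuous torsion angles onto a small discrete alphabet, can be sketched as follows. This is an illustrative simplification (equal-width angular bins and an arbitrary 12-letter alphabet), not the paper's actual bin definitions:

```python
def torsion_alphabet(angles_deg, n_bins=12):
    """Bin continuous torsion angles (degrees) into a small discrete
    alphabet: split the circle into n_bins equal arcs and replace each
    angle by a letter naming its arc."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:n_bins]
    return "".join(letters[int((a % 360) / (360 / n_bins))] for a in angles_deg)

# With 12 bins of 30 degrees each:
print(torsion_alphabet([5, 65, 185, 355]))  # → ACGL
```

A conformational state of a residue then becomes a short string over this alphabet, which is what makes tree-based graphical representations natural.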

2.
Next-generation sequencing (NGS) technologies allow the sequencing of microbial communities directly from the environment without prior culturing. The output of environmental DNA sequencing consists of many reads from genomes of different unknown species, so clustering reads from the same (or similar) species (a step known as binning) is crucial. The difficulties of the binning problem are due to the following four factors: (1) the lack of reference genomes; (2) uneven abundance ratio of species; (3) short NGS reads; and (4) a large number of species (possibly more than a hundred). None of the existing binning tools can handle all four factors. No tools, including both AbundanceBin and MetaCluster 3.0, have demonstrated reasonable performance on a sample with more than 20 species. In this article, we introduce MetaCluster 4.0, an unsupervised binning algorithm that can accurately (with about 80% precision and sensitivity in all cases and at least 90% in some cases) and efficiently bin short reads with varying abundance ratios and is able to handle datasets with 100 species. The novelty of MetaCluster 4.0 stems from solving a few important problems: how to divide reads into groups by a probabilistic approach, how to estimate the 4-mer distribution of each group, how to estimate the number of species, and how to modify MetaCluster 3.0 to handle a large number of species. We show that MetaCluster 4.0 is effective for both simulated and real datasets. Supplementary Material is available at www.liebertonline.com/cmb.
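Composition-based binning of this kind rests on k-mer signatures. A minimal sketch of computing the normalized 4-mer distribution of a read (not MetaCluster's implementation, which additionally handles reverse complements and groups reads probabilistically):

```python
from collections import Counter
from itertools import product

def kmer_distribution(read, k=4):
    """Normalized k-mer frequency vector of a read; k=4 gives the
    256-dimensional 4-mer signature used for composition-based binning."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts.get(m, 0) / total for m in kmers]

v = kmer_distribution("ACGTACGTACGT")
print(len(v), round(sum(v), 6))  # → 256 1.0
```

Reads (or read groups) with similar signature vectors are then candidates for the same bin.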

3.
Allele-rich VNTR loci provide valuable information for forensic inference. Interpretation of this information is complicated by measurement error, which renders discrete alleles difficult to distinguish. Two methods have been used to circumvent this difficulty: binning methods and direct evaluation of allele frequencies, the latter achieved by modeling the data as a mixture distribution. We use this modeling approach to estimate the allele frequency distributions for two loci, D17S79 and D2S44, for black, Caucasian, and Hispanic samples from the Lifecodes and FBI data bases. The data bases are differentiated by the restriction enzyme used: PstI (Lifecodes) and HaeIII (FBI). Our results show that alleles common in one ethnic group are almost always common in all ethnic groups, and likewise for rare alleles; this pattern holds for both loci. Gene diversity, or heterozygosity, measured as one minus the sum of the squared allele frequencies, is greater for D2S44 than for D17S79 in both data bases. The average gene diversity across ethnic groups when PstI (HaeIII) is used is .918 (.918) for D17S79 and .985 (.983) for D2S44. The variance in gene diversity among ethnic groups is greater for D17S79 than for D2S44. The number of alleles, like the gene diversity, is greater for D2S44 than for D17S79. The mean numbers of alleles across ethnic groups, estimated from the PstI (HaeIII) data, are 40.25 (41.5) for D17S79 and 104 (103) for D2S44. The number of alleles is correlated with sample size. We use the estimated allele frequency distributions for each ethnic group to explore the effects of unwittingly mixing populations and thereby violating independence assumptions. We show that, even in extreme cases of mixture, the estimated genotype probabilities are good estimates of the true probabilities, contradicting recent claims. Because the binning methods currently used for forensic inference show even less differentiation among ethnic groups, we conclude that mixture has little or no impact on the use of VNTR loci for forensics.
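The gene diversity measure quoted above (one minus the sum of squared allele frequencies) is simple to compute; a sketch with hypothetical frequencies:

```python
def gene_diversity(freqs):
    """Gene diversity (expected heterozygosity): 1 minus the sum of the
    squared allele frequencies."""
    assert abs(sum(freqs) - 1.0) < 1e-9, "allele frequencies must sum to 1"
    return 1.0 - sum(f * f for f in freqs)

# Illustrative only: ten equally frequent alleles give diversity 0.9; loci
# with ~40-100 alleles, as above, approach the reported 0.92-0.98.
print(round(gene_diversity([0.1] * 10), 6))  # → 0.9
```

Note that many rare alleles raise diversity only slightly; the measure is dominated by the frequent alleles, which is consistent with the observation that common alleles are shared across ethnic groups.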

4.
Baggerly KA. Cytometry 2001, 45(2):141-150
BACKGROUND: A key problem in immunohistochemistry is assessing when two sample histograms are significantly different. One test that is commonly used for this purpose in the univariate case is the chi-squared test. Comparing multivariate distributions is qualitatively harder, as the "curse of dimensionality" means that the number of bins can grow exponentially. For the chi-squared test to be useful, data-dependent binning methods must be employed. An example of how this can be done is provided by the "probability binning" method of Roederer et al. (1,2,3). METHODS: We derive the theoretical distribution of the probability binning statistic, giving it a more rigorous foundation. We show that the null distribution is a scaled chi-square, and show how it can be related to the standard chi-squared statistic. RESULTS: A small simulation shows how the theoretical results can be used to (a) modify the probability binning statistic to make it more sensitive and (b) suggest variant statistics which, while still exploiting the data-dependent strengths of the probability binning procedure, may be easier to work with. CONCLUSIONS: The probability binning procedure effectively uses adaptive binning to locate structure in high-dimensional data. The derivation of a theoretical basis provides a more detailed interpretation of its behavior and renders the probability binning method more flexible.
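A toy illustration of the equal-probability (data-dependent) binning idea and the resulting chi-squared-style statistic. The bin-edge rule and normalization here are deliberate simplifications, not Roederer's exact procedure or the scaled null distribution derived in the paper:

```python
def probability_bin_edges(control, n_bins):
    """Data-dependent bin edges: quantile cuts of the control sample, so
    each bin holds roughly equal numbers of control events."""
    s = sorted(control)
    return [s[i * len(s) // n_bins] for i in range(1, n_bins)]

def binned_chi2(control, test, edges):
    """Chi-squared-style comparison of a test sample to the control over
    the shared, control-derived bins."""
    def counts(data):
        c = [0] * (len(edges) + 1)
        for x in data:
            c[sum(x > e for e in edges)] += 1
        return c
    obs, exp = counts(test), counts(control)
    scale = len(test) / len(control)  # rescale control to test sample size
    return sum((o - e * scale) ** 2 / (e * scale)
               for o, e in zip(obs, exp) if e > 0)

ctrl = list(range(100))
print(probability_bin_edges(ctrl, 4))  # → [25, 50, 75]
```

Because the edges come from the control data, every bin is populated, which is exactly what makes a chi-squared-type test usable in high dimensions.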

5.
In 1985, Alec Jeffreys reported the development of multilocus DNA fingerprinting by Southern blot-detection of hypervariable minisatellites or variable number of tandem repeat (VNTR) loci. This technology found immediate application to various forensic and scientific problems, including fisheries and aquaculture. By 1989, however, it was recognized by many researchers that inherent problems exist in the application of multilocus fingerprinting to large sample sizes as might occur in fisheries and aquaculture genetic studies. As such, individual VNTRs were cloned for single-locus DNA fingerprinting. Although single-locus fingerprinting ameliorates many of the problems associated with multilocus DNA fingerprinting, it suffers from the problem that electrophoretic anomalies of band migration within and between gels necessitate binning of alleles, thus underestimating genetic variability in a given population. Amplification of microsatellite loci by the polymerase chain reaction, however, solved many of the problems of Southern blot-based DNA fingerprinting. Moreover, microsatellites exhibit attributes that make them particularly suitable as genetic markers for numerous applications in aquaculture and fisheries research: (1) they are abundant in the genome; (2) they display varying levels of polymorphism; (3) alleles exhibit codominant Mendelian inheritance; (4) minute amounts of tissue are required for assay (e.g., dried scales or otoliths); (5) loci are conserved in related species; (6) they have the potential for automated assay. Recent innovations in DNA fingerprinting technology developed over the past 5 years are discussed with special emphasis on microsatellites and their application to fisheries and aquaculture, e.g., behavioural and population genetics of wild species, and selection and breeding programmes for aquaculture broodstock.

6.
On plotting species abundance distributions
1. There has been a revival of interest in species abundance distribution (SAD) models, stimulated by the claim that the log-normal distribution gave an underestimate of the observed numbers of rare species in species-rich assemblages. This led to the development of the neutral Zero Sum Multinomial distribution (ZSM) to better fit the observed data. 2. Yet plots of SADs, purportedly of the same data, showed differences in frequencies of species and of statistical fits to the ZSM and log-normal models due to the use of different binning methods. 3. We plot six different binning methods for the Barro Colorado Island (BCI) tropical tree data. The appearances of the curves are very different for the different binning methods. Consequently, the fits to different models may vary depending on the binning system used. 4. There is no agreed binning method for SAD plots. Our analysis suggests that a simple doubling of the number of individuals per species in each bin is perhaps the most practical one for illustrative purposes. Alternatively, rank-abundance plots should be used. 5. For fitting and testing models exact methods have been developed and application of these does not require binning of data. Errors are introduced unnecessarily if data are binned before testing goodness-of-fit to models.
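The "simple doubling" scheme recommended above for illustrative plots is octave binning. A minimal sketch, where bin k collects species whose abundance n satisfies 2^k <= n < 2^(k+1):

```python
import math
from collections import Counter

def octave_bins(abundances):
    """Octave (doubling) binning for species-abundance plots: bin k holds
    the species whose abundance n satisfies 2**k <= n < 2**(k+1)."""
    bins = Counter(int(math.floor(math.log2(n))) for n in abundances if n > 0)
    return [bins.get(k, 0) for k in range(max(bins) + 1)]

# Abundances 1,1,2,3,4,8 → [1,2): 2 species, [2,4): 2, [4,8): 1, [8,16): 1
print(octave_bins([1, 1, 2, 3, 4, 8]))  # → [2, 2, 1, 1]
```

Other published schemes differ in how they treat species falling exactly on bin boundaries, which is one source of the discrepant plots the authors document.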

7.
In many metabolomics studies, NMR spectra are divided into bins of fixed width. This spectral quantification technique, known as uniform binning, is used to reduce the number of variables for pattern recognition techniques and to mitigate effects from variations in peak positions; however, shifts in peaks near the boundaries can cause dramatic quantitative changes in adjacent bins due to non-overlapping boundaries. Here we describe a new Gaussian binning method that incorporates overlapping bins to minimize these effects. A Gaussian kernel weights the signal contribution relative to distance from bin center, and the overlap between bins is controlled by the kernel standard deviation. Sensitivity to peak shift was assessed for a series of test spectra where the offset frequency was incremented in 0.5 Hz steps. For a 4 Hz shift within a bin width of 24 Hz, the error for uniform binning increased by 150%, while the error for Gaussian binning increased by 50%. Further, using a urinary metabolomics data set (from a toxicity study) and principal component analysis (PCA), we showed that the information content in the quantified features was equivalent for Gaussian and uniform binning methods. The separation between groups in the PCA scores plot, measured by the J2 quality metric, is as good or better for Gaussian binning versus uniform binning. The Gaussian method is shown to be robust with regard to peak shift, while still retaining the information needed by classification and multivariate statistical techniques for NMR-metabolomics data.
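A minimal sketch of the kernel-weighting idea described above. Function and parameter names are hypothetical, and normalization details of the published method are omitted; the point is only that a peak near a boundary contributes smoothly to both adjacent bins instead of jumping between them:

```python
import math

def gaussian_bin(intensities, freqs, centers, sigma):
    """Gaussian binning sketch: each bin integrates the spectrum weighted
    by a Gaussian kernel centered on the bin; sigma controls how much
    adjacent bins overlap."""
    bins = []
    for c in centers:
        weights = [math.exp(-((f - c) ** 2) / (2 * sigma ** 2)) for f in freqs]
        bins.append(sum(w * y for w, y in zip(weights, intensities)))
    return bins
```

With sigma set small relative to the bin spacing this approaches uniform binning; widening sigma trades bin independence for robustness to peak shift.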

8.
As genotyping methods move ever closer to full automation, care must be taken to ensure that there is no equivalent rise in allele‐calling error rates. One clear source of error lies with how raw allele lengths are converted into allele classes, a process referred to as binning. Standard automated approaches usually assume collinearity between expected and measured fragment length. Unfortunately, such collinearity is often only approximate, with the consequence that alleles do not conform to a perfect 2‐, 3‐ or 4‐base‐pair periodicity. To account for these problems, we introduce a method that allows repeat units to be fractionally shorter or longer than their theoretical value. Tested on a large human data set, our algorithm performs well over a wide range of dinucleotide repeat loci. The size of the problem caused by sticking to whole numbers of bases is indicated by the fact that the effective repeat length was within 5% of the assumed length only 68.3% of the time.
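The fractional repeat-unit idea can be illustrated with a toy grid search for the effective repeat length that best explains a set of measured fragment lengths. This is a hypothetical simplification, not the published algorithm:

```python
def effective_repeat_length(lengths, nominal=2.0, span=0.1, steps=201):
    """Grid-search sketch: find the repeat-unit length u near the nominal
    value that best makes (length - shortest length) an integer multiple
    of u, i.e. that minimizes the squared distance to integer repeat
    counts."""
    base = min(lengths)
    best_err, best_u = float("inf"), nominal
    for i in range(steps):
        u = nominal - span + 2 * span * i / (steps - 1)
        err = sum(((l - base) / u - round((l - base) / u)) ** 2 for l in lengths)
        if err < best_err:
            best_err, best_u = err, u
    return best_u
```

For example, alleles measured at 100 + 1.9k bp (a dinucleotide whose effective unit migrates at 1.9 bp) are recovered with u close to 1.9 rather than the nominal 2.0, which is exactly the situation the abstract describes as common.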

9.
Genotype calling procedures vary from laboratory to laboratory for many microsatellite markers. Even within the same laboratory, application of different experimental protocols often leads to ambiguities. The impact of these ambiguities ranges from irksome to devastating. Resolving the ambiguities can increase effective sample size and preserve evidence in favor of disease-marker associations. Because different data sets may contain different numbers of alleles, merging is unfortunately not a simple process of matching alleles one to one. Merging data sets manually is difficult, time-consuming, and error-prone due to differences in genotyping hardware, binning methods, molecular weight standards, and curve fitting algorithms. Merging is particularly difficult if few or no samples occur in common, or if samples are drawn from ethnic groups with widely varying allele frequencies. It is dangerous to align alleles simply by adding a constant number of base pairs to the alleles of one of the data sets. To address these issues, we have developed a Bayesian model and a Markov chain Monte Carlo (MCMC) algorithm for sampling the posterior distribution under the model. Our computer program, MicroMerge, implements the algorithm and almost always accurately and efficiently finds the most likely correct alignment. Common allele frequencies across laboratories in the same ethnic group are the single most important cue in the model. MicroMerge computes the allelic alignments with the greatest posterior probabilities under several merging options. It also reports when data sets cannot be confidently merged. These features are emphasized in our analysis of simulated and real data.
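The "common allele frequencies are the key cue" idea can be caricatured without any Bayesian machinery: slide one laboratory's allele-frequency vector against the other's and keep the integer offset with the greatest shared mass. This toy version ignores the posterior model, fractional offsets, and differing allele counts that MicroMerge actually handles:

```python
def best_offset(freqs_a, freqs_b, max_shift=5):
    """Toy alignment by the common-allele-frequency cue: try integer
    offsets and keep the one maximizing the overlapping frequency mass
    (histogram intersection)."""
    def shared_mass(shift):
        return sum(min(freqs_a[i], freqs_b[i - shift])
                   for i in range(len(freqs_a))
                   if 0 <= i - shift < len(freqs_b))
    return max(range(-max_shift, max_shift + 1), key=shared_mass)

# Lab B's ladder is shifted one allele position relative to lab A's:
a = [0.0, 0.0, 0.5, 0.3, 0.2, 0.0, 0.0]
b = [0.0, 0.5, 0.3, 0.2, 0.0, 0.0, 0.0]
print(best_offset(a, b))  # → 1
```

When no offset gives a clearly best intersection, the honest answer is the one MicroMerge gives: report that the data sets cannot be confidently merged.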

10.
Reddy RM, Mohammed MH, Mande SS. Gene 2012, 505(2):259-265
Phylogenetic assignment of individual sequence reads to their respective taxa, referred to as 'taxonomic binning', constitutes a key step of metagenomic analysis. Existing binning methods have limitations either with respect to time or accuracy/specificity of binning. Given these limitations, development of a method that can bin vast amounts of metagenomic sequence data in a rapid, efficient and computationally inexpensive manner can profoundly influence metagenomic analysis in settings with limited computational resources. We introduce TWARIT, a hybrid binning algorithm, that employs a combination of short-read alignment and composition-based signature sorting approaches to achieve rapid binning rates without compromising on binning accuracy and specificity. TWARIT is validated with simulated and real-world metagenomes and the results demonstrate significantly lower overall binning times compared to that of existing methods. Furthermore, the binning accuracy and specificity of TWARIT are observed to be comparable or superior to them. A web server implementing the TWARIT algorithm is available at http://metagenomics.atc.tcs.com/Twarit/

11.
We have used a new method for binning minisatellite alleles (semi-automated allele aggregation) and report the extent of population diversity detectable by eleven minisatellite loci in 2,689 individuals from 19 human populations distributed widely throughout the world. Whereas population relationships are consistent with those found in other studies, our estimate of genetic differentiation (F(st)) between populations is less than 8%, which is lower than the comparable estimates of 10%-15% obtained using other sources of polymorphism data. We infer that mutational processes are involved in reducing F(st) estimates from minisatellite data because, first, the lowest F(st) estimates are found at loci showing autocorrelated frequencies among alleles of similar size and, second, F(st) declines with heterozygosity but by more than predicted assuming simple models of mutation. These conclusions are consistent with the view that minisatellites are subject to selective or mutational constraints in addition to those expected under simple step-wise mutation models.
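For reference, F(st)-style differentiation compares within-population to total heterozygosity. A minimal sketch, assuming equal population weights and no sample-size correction (real estimators such as Weir and Cockerham's add both):

```python
def fst(subpop_allele_freqs):
    """F_ST sketch: (H_T - H_S) / H_T, where H_S is the mean
    within-population expected heterozygosity and H_T the heterozygosity
    of the pooled (mean) allele frequencies."""
    n_pops = len(subpop_allele_freqs)
    n_alleles = len(subpop_allele_freqs[0])
    h_s = sum(1 - sum(f * f for f in pop) for pop in subpop_allele_freqs) / n_pops
    pooled = [sum(pop[a] for pop in subpop_allele_freqs) / n_pops
              for a in range(n_alleles)]
    h_t = 1 - sum(f * f for f in pooled)
    return (h_t - h_s) / h_t

# Two populations fixed for different alleles: complete differentiation.
print(fst([[1.0, 0.0], [0.0, 1.0]]))  # → 1.0
```

The abstract's point follows from this form: when mutation keeps within-population heterozygosity high (large H_S), the ratio is pushed down regardless of true population structure.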

12.
The interpretation of nuclear magnetic resonance (NMR) experimental results for metabolomics studies requires intensive signal processing and multivariate data analysis techniques. A key step in this process is the quantification of spectral features, which is commonly accomplished by dividing an NMR spectrum into several hundred integral regions or bins. Binning attempts to minimize effects from variations in peak positions caused by sample pH, ionic strength, and composition, while reducing the dimensionality for multivariate statistical analyses. Herein we develop a novel spectral quantification technique, dynamic adaptive binning. With this technique, bin boundaries are determined by optimizing an objective function using a dynamic programming strategy. The objective function measures the quality of a bin configuration based on the number of peaks per bin. This technique shows a significant improvement over both traditional uniform binning and other adaptive binning techniques. This improvement is quantified via synthetic validation sets by analyzing an algorithm’s ability to create bins that do not contain more than a single peak and that maximize the distance from peak to bin boundary. The validation sets are developed by characterizing the salient distributions in experimental NMR spectroscopic data. Further, dynamic adaptive binning is applied to a 1H NMR-based experiment to monitor rat urinary metabolites to empirically demonstrate improved spectral quantification.
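When peak positions are already known, the stated objective (one peak per bin, maximal peak-to-boundary distance) is met simply by placing boundaries at midpoints between consecutive peaks; the dynamic program in the paper is needed because peaks must be inferred from noisy spectra. A sketch of the idealized case:

```python
def adaptive_bin_boundaries(peak_positions):
    """Idealized adaptive binning: with known peak positions, midpoint
    boundaries give exactly one peak per bin and maximize the minimum
    distance from any peak to its bin boundaries."""
    peaks = sorted(peak_positions)
    return [(a + b) / 2 for a, b in zip(peaks, peaks[1:])]

print(adaptive_bin_boundaries([1.0, 2.0, 4.0]))  # → [1.5, 3.0]
```

Each boundary sits halfway between two peaks, so a small peak shift cannot move a peak across a boundary, which is the failure mode of uniform binning described above.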

13.
MICROSATELIGHT is a Perl/Tk pipeline with a graphical user interface that facilitates several tasks when scoring microsatellites. It implements new subroutines in R and Perl and takes advantage of features provided by previously developed freeware. MICROSATELIGHT takes raw genotype data and automates the peak identification through PeakScanner. The PeakSelect subroutine assigns peaks to different microsatellite markers according to their multiplex group, fluorochrome type, and size range. After peak selection, binning of alleles can be carried out 1) automatically through AlleloBin or 2) by manual bin definition through Binator. In both cases, several features for quality checking and further binning improvement are provided. The genotype table can then be converted into input files for several population genetics programs through CREATE. Finally, Hardy-Weinberg equilibrium tests and confidence intervals for null allele frequency can be obtained through GENEPOP. MICROSATELIGHT is the only freely available public-domain software that facilitates full multiplex microsatellite scoring, from electropherogram files to user-defined text files to be used with population genetics software. MICROSATELIGHT has been created for the Windows XP operating system and has been successfully tested under Windows 7. It is available at http://sourceforge.net/projects/microsatelight/.

14.
Successful discovery of therapeutic antibodies hinges on the identification of appropriate affinity binders targeting a diversity of molecular epitopes presented by the antigen. Antibody campaigns that yield such broad “epitope coverage” increase the likelihood of identifying candidates with the desired biological functions. Accordingly, epitope binning assays are employed in the early discovery stages to partition antibodies into epitope families or “bins” and prioritize leads for further characterization and optimization. The collaborative program described here, which used hen egg white lysozyme (HEL) as a model antigen, combined 3 key capabilities: 1) access to a diverse panel of antibodies selected from a human in vitro antibody library; 2) application of state-of-the-art high-throughput epitope binning; and 3) analysis and interpretation of the epitope binning data with reference to an exhaustive set of published antibody:HEL co-crystal structures. Binning experiments on a large merged panel of antibodies containing clones from the library and the literature revealed that the inferred epitopes for the library clones overlapped with, and extended beyond, the known structural epitopes. Our analysis revealed that nearly the entire solvent-exposed surface of HEL is antigenic, as has been proposed for protein antigens in general. The data further demonstrated that synthetic antibody repertoires provide as wide epitope coverage as those obtained from animal immunizations. The work highlights molecular insights contributed by increasingly higher-throughput binning methods and their broad utility to guide the discovery of therapeutic antibodies representing a diverse set of functional epitopes.
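A toy version of partitioning antibodies into epitope bins from pairwise competition data. Here antibodies share a bin only when their competition profiles match exactly; real binning assays tolerate measurement noise and use clustering instead:

```python
def epitope_bins(competes):
    """Toy epitope binning: group antibodies with identical competition
    profiles. `competes` maps an antibody name to the frozenset of
    antibodies it blocks (including itself)."""
    bins = {}
    for ab, profile in competes.items():
        bins.setdefault(profile, []).append(ab)
    return sorted(bins.values())

# Hypothetical data: A and B block each other (same epitope family), C is distinct.
data = {"A": frozenset({"A", "B"}),
        "B": frozenset({"A", "B"}),
        "C": frozenset({"C"})}
print(epitope_bins(data))  # → [['A', 'B'], ['C']]
```

The number of distinct bins is the "epitope coverage" proxy used to prioritize leads.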

15.
The use of Markovian models is an established way for deriving the complete distribution of the size of a population and the probability of extinction. However, computationally impractical transition matrices frequently result if this mathematical approach is applied to natural populations. Binning, or aggregating population sizes, has been used to permit a reduction in the dimensionality of matrices. Here, we present three deterministic binning methods and study the errors due to binning for a metapopulation model. Our results indicate that estimation errors of the investigated methods are not consistent and one cannot make generalizations about the quality of a method. For some compared output variables of populations studied, binning methods that caused a strong reduction in dimensionality of matrices resulted in better estimations than methods that produced a weaker reduction. The main problem with deterministic binning methods is that they do not properly take into account the stochastic population process itself. Straightforward usage of binning methods may lead to substantial errors in population-dynamical predictions.
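The shared binning step, aggregating (lumping) states of a transition matrix, can be sketched as a weighted row/column aggregation. Lumping is exact only for lumpable chains; otherwise it introduces the kind of binning error the study quantifies:

```python
def lump_transition_matrix(P, groups, weights=None):
    """Aggregate a Markov transition matrix by binning states: transitions
    out of a group are averaged (weighted by a distribution over states),
    transitions into a group are summed."""
    n, k = len(P), len(groups)
    w = weights or [1.0 / n] * n  # assumed uniform if no distribution given
    Q = [[0.0] * k for _ in range(k)]
    for gi, src in enumerate(groups):
        wsum = sum(w[i] for i in src)
        for gj, dst in enumerate(groups):
            Q[gi][gj] = sum(w[i] * P[i][j] for i in src for j in dst) / wsum
    return Q
```

The weighting choice is exactly where deterministic schemes go wrong: the appropriate weights depend on the stochastic process itself, not just the state labels.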

16.
Patterns that resemble strongly skewed size distributions are frequently observed in ecology. A typical example represents tree size distributions of stem diameters. Empirical tests of ecological theories predicting their parameters have been conducted, but the results are difficult to interpret because the statistical methods that are applied to fit such decaying size distributions vary. In addition, binning of field data as well as measurement errors might potentially bias parameter estimates. Here, we compare three different methods for parameter estimation: the common maximum likelihood estimation (MLE) and two modified types of MLE correcting for binning of observations or random measurement errors. We test whether three typical frequency distributions, namely the power-law, negative exponential and Weibull distribution can be precisely identified, and how parameter estimates are biased when observations are additionally either binned or contain measurement error. We show that uncorrected MLE already loses the ability to discern functional form and parameters at relatively small levels of uncertainties. The modified MLE methods that consider such uncertainties (either binning or measurement error) are comparatively much more robust. We conclude that it is important to reduce binning of observations, if possible, and to quantify observation accuracy in empirical studies for fitting strongly skewed size distributions. In general, modified MLE methods that correct binning or measurement errors can be applied to ensure reliable results.
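The binning-corrected MLE idea is to replace the density of an exact observation with the probability mass of its bin. A sketch for a negative exponential distribution, using a crude grid search over the rate parameter (the paper's candidate distributions and optimizer differ):

```python
import math

def binned_mle_exponential(bins, counts):
    """Binned-MLE sketch: an observation known only to lie in bin (lo, hi)
    contributes log(F(hi) - F(lo)) to the log-likelihood, not the log
    density at a point. Returns the grid rate maximizing the likelihood."""
    def loglik(lam):
        ll = 0.0
        for (lo, hi), n in zip(bins, counts):
            mass = math.exp(-lam * lo) - (0.0 if hi == float("inf")
                                          else math.exp(-lam * hi))
            ll += n * math.log(mass)
        return ll
    grid = [0.01 * i for i in range(1, 1001)]  # candidate rates 0.01..10.00
    return max(grid, key=loglik)

# Counts matching an exponential with rate 1, binned at 0, 1, 2, 3, infinity:
est = binned_mle_exponential([(0, 1), (1, 2), (2, 3), (3, float("inf"))],
                             [6321, 2325, 855, 498])
print(round(est, 2))  # close to 1.0
```

Uncorrected MLE would treat each binned value as if observed at, say, the bin midpoint, which is what biases the fit for strongly skewed distributions.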

17.
姜忠俊, 李小波. 微生物学报 (Acta Microbiologica Sinica) 2022, 62(8):2954-2968
Metagenomics can extract the complete genetic material of a microbial community directly from the environment, without the pure culturing on media that traditional methods require. This technology has given scientists an important means of studying the structure and function of microbial communities, and it has major implications for the diagnosis and treatment of disease, environmental management, and the understanding of life. The genetic material extracted from the environment is sequenced to obtain reads, which assembly tools can assemble further into contigs. Binning these contigs makes it possible to reconstruct more complete genomes from a metagenomic sample. Because binning quality directly affects downstream biological analyses, effectively binning contig sequences that mix genes from different microorganisms has become both a focus and a difficulty of metagenomics research. Machine learning methods are widely used for metagenomic contig binning and are usually divided into supervised contig classification methods and unsupervised contig clustering methods. This review surveys metagenomic contig binning methods comprehensively and analyzes classification and clustering methods in depth, identifying problems such as low classification accuracy, long binning times, and difficulty reconstructing many microbial genomes from complex datasets; it also discusses prospects for future research on contig binning. The authors suggest that semi-supervised learning, ensemble learning, and deep learning, together with more effective feature representations of the data, could improve binning performance.

18.
Chen XH, O'Dell SD, Day IN. BioTechniques 2002, 32(5):1080-2, 1084, 1086 passim
After PCR amplification, we have achieved precise sizing of trinucleotide and tetranucleotide microsatellite alleles on 96-well open-faced polyacrylamide microplate array diagonal gel electrophoresis (MADGE) gels: two tetranucleotide repeats, HUMTH01 (five alleles, 248-263 bp) and DYS390 (eight alleles, 200-228 bp), and DYS392, a trinucleotide repeat (eight alleles, 210-231 bp). A gel matrix of Duracryl, a high mechanical strength polyacrylamide derivative, and appropriate ionic conditions provide the 1.3%-1.5% band resolution required. No end-labeling of primers is needed, as the sensitive Vistra Green intercalating dye is used for the visualization of bands. Co-run markers bracketing the PCR fragments ensure accurate sizing without inter-lane variability. Electrophoresis of multiple gels in a thermostatically controlled tank allows up to 1000 samples to be run in 90 min. Gel images were analyzed using a FluorImager 595 fluorescent scanning system, and alleles were identified using Phoretix software for band migration measurement and Microsoft Excel to compute fragment sizes. Estimated sizes were interpolated precisely to achieve accurate binning. Microsatellite-MADGE represents a utilitarian method for high-throughput genotyping in cohort studies, using standard laboratory equipment.
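The bracketing-marker sizing step can be sketched as linear interpolation between co-run markers. This is a simplification: real migration-size relationships are nonlinear, which is why interpolating only between close bracketing markers is what makes the approach accurate:

```python
def interpolate_size(migration, markers):
    """Estimate a fragment's size by linear interpolation between the two
    co-run markers that bracket its measured migration distance.
    `markers` is a list of (migration_distance, known_size_bp) pairs."""
    markers = sorted(markers)
    for (m1, s1), (m2, s2) in zip(markers, markers[1:]):
        if m1 <= migration <= m2:
            return s1 + (s2 - s1) * (migration - m1) / (m2 - m1)
    raise ValueError("migration distance outside the marker range")

# Hypothetical lane: a 300-bp marker migrated 10 mm, a 200-bp marker 20 mm.
print(interpolate_size(15, [(10, 300), (20, 200)]))  # → 250.0
```

Because both markers run in the same lane as the sample, lane-to-lane migration differences cancel out, and the interpolated sizes can then be binned to integer allele lengths.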

19.
DNA typing offers a unique opportunity to identify individuals for medical and forensic purposes. Probabilistic inference regarding the chance occurrence of a match between the DNA type of an evidentiary sample and that of an accused suspect, however, requires reliable estimation of genotype and allele frequencies in the population. Although population-based data on DNA typing at several hypervariable loci are being accumulated at various laboratories, a rigorous treatment of the sample size needed for such purposes has not been made from population genetic considerations. It is shown here that the loci that are potentially most useful for forensic identification of individuals have the intrinsic property that they involve a large number of segregating alleles, and a great majority of these alleles are rare. As a consequence, because of the large number of possible genotypes at the hypervariable loci that offer the maximum potential for individualization, the sample size needed to observe all possible genotypes in a sample is large. In fact, the size is so large that even if such a huge number of individuals could be sampled, it could not be guaranteed that such a sample was drawn from a single homogeneous population. Therefore adequate estimation of genotypic probabilities must be based on allele frequencies, and the sample size needed to represent all possible alleles is far more reasonable. Further economization of sample size is possible if one wants to have representation of only the frequent alleles in the sample, so that the rare allele frequencies can be approximated by an upper bound for forensic applications.
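The "far more reasonable" sample size for representing frequent alleles follows from a simple missing-allele bound: an allele of frequency p is absent from a sample of g genes with probability (1 - p)^g. A sketch (illustrative calculation, not the paper's full treatment):

```python
import math

def genes_needed(p, confidence=0.99):
    """Smallest number of sampled genes g such that an allele of frequency
    p appears at least once with the given confidence, by solving
    (1 - p)**g <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# A 1% allele needs ~459 genes (~230 diploid individuals) for 99% confidence:
print(genes_needed(0.01))  # → 459
```

Targeting only frequent alleles (say p >= 0.01) keeps the required sample in the hundreds, whereas guaranteeing every rare allele, let alone every genotype, does not.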

20.
The number of alleles in a sample (allelic richness) is a fundamental measure of genetic diversity. However, this diversity measure has been difficult to use because large samples are expected to contain more alleles than small samples. The statistical technique of rarefaction compensates for this sampling disparity. Here I introduce a computer program that performs rarefaction on private alleles and hierarchical sampling designs.
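Rarefaction's core computation is hypergeometric: the expected number of distinct alleles in a random subsample of g genes. A minimal sketch (private alleles and hierarchical designs, the program's actual contributions, are omitted):

```python
from math import comb

def rarefied_allele_count(allele_counts, g):
    """Expected number of distinct alleles in a random subsample of g
    genes: an allele observed n times in a total sample of N genes is
    present in the subsample with probability 1 - C(N - n, g) / C(N, g)."""
    N = sum(allele_counts)
    return sum(1 - comb(N - n, g) / comb(N, g) for n in allele_counts)

# Subsampling the entire sample recovers all alleles:
print(rarefied_allele_count([5, 3, 2], 10))  # → 3.0
```

Evaluating this at a common g across samples of different sizes is what makes allelic richness comparable between them.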


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号