Similar literature
Found 20 similar documents (search time: 31 ms)
1.
We introduce a new method for identifying optimal incomplete data sets from large sequence databases based on the graph theoretic concept of alpha-quasi-bicliques. The quasi-biclique method searches large sequence databases to identify useful phylogenetic data sets with a specified amount of missing data while maintaining the necessary amount of overlap among genes and taxa. The utility of the quasi-biclique method is demonstrated on large simulated sequence databases and on a data set of green plant sequences from GenBank. The quasi-biclique method greatly increases the taxon and gene sampling in the data sets while adding only a limited amount of missing data. Furthermore, under the conditions of the simulation, data sets with a limited amount of missing data often produce topologies nearly as accurate as those built from complete data sets. The quasi-biclique method will be an effective tool for exploiting sequence databases for phylogenetic information and also may help identify critical sequences needed to build large phylogenetic data sets.
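The overlap requirement described above can be illustrated with a small check. This is only a sketch: the abstract does not define an alpha-quasi-biclique, so the code assumes one common reading, in which every selected taxon has sequences for at least a (1 - alpha) fraction of the selected genes and every selected gene is sampled in at least a (1 - alpha) fraction of the selected taxa. The function name and the dictionary layout are hypothetical.

```python
def is_alpha_quasi_biclique(has_sequence, taxa, genes, alpha):
    """Check the assumed alpha-quasi-biclique condition on a taxon/gene subset.

    has_sequence: dict mapping each taxon to the set of genes with a sequence
    (hypothetical layout, not taken from the abstract).
    """
    taxa, genes = list(taxa), list(genes)
    if not taxa or not genes:
        return False
    min_genes = (1 - alpha) * len(genes)  # genes required per taxon
    min_taxa = (1 - alpha) * len(taxa)    # taxa required per gene
    for t in taxa:
        if sum(g in has_sequence.get(t, set()) for g in genes) < min_genes:
            return False
    for g in genes:
        if sum(g in has_sequence.get(t, set()) for t in taxa) < min_taxa:
            return False
    return True
```

With alpha = 0 the check reduces to a complete data set (a biclique); raising alpha admits more taxa and genes at the cost of more missing data.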

2.
MOTIVATION: Protein-protein interactions are systematically examined using the yeast two-hybrid method. Consequently, a large amount of protein-protein interaction data is currently being accumulated. Nevertheless, little general knowledge about protein-protein interactions has been extracted from these data. We have therefore been trying to extract such knowledge from protein-protein interaction data using data mining. RESULTS: A data mining method is proposed to discover association rules related to protein-protein interactions. To evaluate the rules detected by the method, a new scoring measure for the rules is introduced. The method allowed us to detect well-known interaction rules such as "An SH3 domain binds to a proline-rich region." These results indicate that the method may detect novel knowledge about protein-protein interactions.

3.
A qualitative, quantitative, and overall quality assessment of life cycle inventory is suggested. The method is composed of five indicators which are set up at three levels of the inventory quality: flows, processes, and the system. The method allows one to assess the reliability of the method generating inventory data (justness of data, completeness of data, representativity of processes, repeatability of system definition) and at the same time to quantify the uncertainty of the data produced by that data generation method. LCA practitioners can then judge the overall inventory quality, and the acceptability of the inventory result, by comparing the quality objective with the cost necessary to improve the quality. The operation of the method was verified by applying it to the production of polyethylene bottles. The proposed method was also found applicable for the validation of data in the ISO's LCA data documentation format.

4.
Empirical Bayes models have been shown to be powerful tools for identifying differentially expressed genes from gene expression microarray data. An example is the WAME model, where a global covariance matrix accounts for array-to-array correlations as well as differing variances between arrays. However, the existing method for estimating the covariance matrix is very computationally intensive and the estimator is biased when the data contain many regulated genes. In this paper, two new methods for estimating the covariance matrix are proposed. The first method is a direct application of the EM algorithm for fitting the multivariate t-distribution of the WAME model. In the second method, a prior distribution for the log fold-change is added to the WAME model, and a discrete approximation is used for this prior. Both methods are evaluated using simulated and real data. The first method performs as well as the existing method in terms of bias and variability, but is superior in terms of computer time. For large data sets (>15 arrays), the second method also shows superior computer run time. Moreover, for simulated data with regulated genes the second method greatly reduces the bias. With the proposed methods it is possible to apply the WAME model to large data sets with reasonable computer run times. The second method shows a small bias for simulated data, but appears to have a larger bias for real data with many regulated genes.

5.
6.
Abstract: Animal locations estimated by Global Positioning System (GPS) inherently contain errors. Screening procedures used to remove large positional errors often trade data accuracy for data loss. We developed a simple screening method that identifies locations arising from unrealistic movement patterns. When applied to a large data set of moose (Alces alces) locations, our method identified virtually all known errors with minimal loss of data. Thus, our method for screening GPS data improves the quality of data sets and increases the value of such data for research and management.
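One way to picture screening by "unrealistic movement patterns" is a spike filter. The abstract gives no criteria, so the rule below is an assumption made for illustration rather than the authors' procedure. It flags a fix when the animal would have had to travel faster than a threshold both into and out of it while ending up close to where it started. The function name, the tuple layout, and the threshold are hypothetical, and timestamps are assumed to be strictly increasing.

```python
import math

def flag_spikes(fixes, max_speed):
    """Return indices of fixes that look like out-and-back spikes.

    fixes: list of (t_seconds, x_metres, y_metres), sorted by time
    (hypothetical layout). max_speed is in metres per second.
    """
    flagged = []
    for i in range(1, len(fixes) - 1):
        t0, x0, y0 = fixes[i - 1]
        t1, x1, y1 = fixes[i]
        t2, x2, y2 = fixes[i + 1]
        d_in = math.hypot(x1 - x0, y1 - y0)
        d_out = math.hypot(x2 - x1, y2 - y1)
        d_skip = math.hypot(x2 - x0, y2 - y0)  # distance if fix i is ignored
        fast_in = d_in / (t1 - t0) > max_speed
        fast_out = d_out / (t2 - t1) > max_speed
        # A spike: fast out, fast back, and little net displacement.
        if fast_in and fast_out and d_skip < min(d_in, d_out):
            flagged.append(i)
    return flagged
```

Requiring both the inbound and the outbound step to be implausible is what keeps data loss low in this sketch: a single fast step, such as a genuine long movement, is not removed.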

7.
To qualify DNA data, we have developed a statistical method for deciding whether the DNA data have acceptable accuracy in the sequencing process. The method tests the probability of sequencing errors, based on partial re-sequencing. The method was successfully applied to a yeast mitochondrial DNA which was previously sequenced (1). The analysis indicates that the entire sequence is very accurate, although we found one base change error in the ND1 gene sequence data by partial re-sampling. This method is applicable to any DNA data.

8.
DNA barcoding is the assignment of individuals to species using standardized mitochondrial sequences. Nuclear data are sometimes added to the mitochondrial data to increase power. A barcoding method for analysing mitochondrial and nuclear data is developed. It is a Bayesian method based on the coalescent model. Then this method is assessed using simulated and real data. It is found that adding nuclear data can reduce the number of ambiguous assignments. Finally, the robustness of coalescent-based barcoding to departures from model assumptions is studied using simulations. This method is found to be robust to past population size variations, to within-species population structures, and to designs that poorly sample populations within species. Supplementary Material is available online at www.liebertonline.com/cmb.

9.
Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale, thereby enabling entirely new "omics"-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that can subsequently be used as a prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign.

10.
A simple method for the spectral analysis of multispecies microfossil data through time or stratigraphic level is presented. The method is based on the Mantel correlogram, allowing any ecological similarity measure to be used. The method can therefore be applied to binary (presence-absence) data as well as raw or normalized species counts. In contrast with spectral analysis of univariate ordination scores, this approach does not explicitly discard information. The method, referred to as the Mantel periodogram, is exemplified with a data set from the literature, demonstrating several astronomically forced periodicities in microfaunal data from the Plio-Pleistocene.

11.
The finite element (FE) method when coupled with computed tomography (CT) is a powerful tool in orthopaedic biomechanics. However, substantial data is required for patient-specific modelling. Here we present a new method for generating an FE model with a minimum amount of patient data. Our method uses high order cubic Hermite basis functions for mesh generation and least-squares fits the mesh to the data set. We have tested our method on seven patient data sets obtained from CT assisted osteodensitometry of the proximal femur. Using only 12 CT slices we generated smooth and accurate meshes of the proximal femur with a geometric root mean square (RMS) error of less than 1 mm and peak errors less than 8 mm. To model the complex geometry of the pelvis we developed a hybrid method which supplements sparse patient data with data from the visible human data set. We tested this method on three patient data sets, generating FE meshes of the pelvis using only 10 CT slices with an overall RMS error less than 3 mm. Although we have peak errors of about 12 mm in these meshes, they occur relatively far from the region of interest (the acetabulum) and will have minimal effects on the performance of the model. Considering that linear meshes usually require about 70-100 pelvic CT slices (in axial mode) to generate FE models, our method brings a significant data reduction to the automatic mesh generation step. The method, which is fully automated except for a semi-automatic bone/tissue boundary extraction part, will bring the benefits of FE methods to the clinical environment with much reduced radiation risks and data requirements.

12.
13.
The relationship of least squared-error estimation to the commonly used data pre-processing method of stimulus locked signal averaging is discussed. First, a generalized squared-error estimate is derived. Second, two data pre-processing methods are introduced and shown analytically to be equivalent with respect to subsequent least squared-error estimation. The first method consists of fitting known functions directly to unaltered data while the second method fits to the same data after it has been time-averaged. A third method of less utility is also demonstrated to be equivalent. It consists of first fitting to sub-blocks of the unaltered data and then averaging the resulting estimates. Finally, a numerical example is presented. It substantiates the analytical contentions and points out practical considerations which might arise in the course of implementation of the estimation procedure.
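The equivalence of the three approaches can be checked numerically in a simple special case. The sketch below is not the abstract's numerical example. It assumes a single known template f(t) with one unknown amplitude, equally sized stimulus-locked trials, and treats each trial as one sub-block; the signal, noise level, and all names are made up for illustration.

```python
import math
import random

def fit_amplitude(f, y):
    """Least-squares amplitude a that minimises sum((y - a * f) ** 2)."""
    return sum(fi * yi for fi, yi in zip(f, y)) / sum(fi * fi for fi in f)

random.seed(0)
n_samples, n_trials = 50, 20
f = [math.sin(2 * math.pi * i / n_samples) for i in range(n_samples)]
# Simulated trials: true amplitude 2.0 plus unit-variance Gaussian noise.
trials = [[2.0 * fi + random.gauss(0, 1) for fi in f] for _ in range(n_trials)]

# Method 1: fit directly to all unaltered data (template repeated per trial).
a_raw = fit_amplitude(f * n_trials, [v for trial in trials for v in trial])

# Method 2: time-average the trials first, then fit once.
avg = [sum(trial[i] for trial in trials) / n_trials for i in range(n_samples)]
a_avg = fit_amplitude(f, avg)

# Method 3: fit each sub-block (here, each trial) and average the estimates.
a_blocks = sum(fit_amplitude(f, trial) for trial in trials) / n_trials

print(a_raw, a_avg, a_blocks)
```

In this setting the three estimates agree up to floating-point rounding because the estimator is linear in the data, so averaging before or after fitting gives the same result.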

14.
Biological data are scattered across many sources in a variety of formats, and they change continuously. Data integration has therefore become an important issue in providing researchers with dynamic access to data. In the data integration process, dynamically extracting heterogeneous data from the data sources is an essential part. A wrapper-based data extraction method can provide flexibility and extensibility to an integration system.

15.
IRBM, 2014, 35(5): 233-243
Simulation of dynamic contrast-enhanced ultrasound sequences with known perfusion characteristics and speckle characteristics that are consistent with those observed in experimental data would provide a useful tool for the evaluation of new perfusion quantification and image-processing techniques. A framework is proposed to simulate such perfusion data. It is based on the use of an example-based texture generation method. The generated texture of noise is compared to experimental data in terms of its statistical distribution and spatial correlation. Results show that the example-based method generates data that are closer to the experimental data than those obtained using a conventional parametric simulation method (33 to 80% smaller Hellinger squared distance). This fast and simple method allows simulation of dynamic contrast-enhanced ultrasound data for complex perfusion patterns, and should be useful for the validation of registration, segmentation or perfusion quantification methods.
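For reference, the squared Hellinger distance between two discrete distributions p and q is H^2 = (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2, which ranges from 0 (identical) to 1 (no overlap). The sketch below computes it from two histograms over the same bins. The abstract does not say how the distributions were estimated, so the histogram input and the function name are assumptions.

```python
import math

def hellinger_squared(hist_p, hist_q):
    """Squared Hellinger distance between two histograms on identical bins.

    Counts are normalised to probabilities first; both histograms must have
    the same number of bins and a positive total.
    """
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total counts")
    return 0.5 * sum(
        (math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
        for a, b in zip(hist_p, hist_q)
    )
```

A comparison like the one reported would compute this value once between simulated and experimental intensity histograms for each simulation method and compare the two results.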

16.
DNA microarray gene expression and microarray-based comparative genomic hybridization (aCGH) have been widely used for biomedical discovery. Because of the large number of genes and the complex nature of biological networks, various analysis methods have been proposed. One such method is "gene shaving," a procedure which identifies subsets of the genes with coherent expression patterns and large variation across samples. Since combining genomic information from multiple sources can improve classification and prediction of diseases, in this paper we propose a new method, "ICA gene shaving" (ICA, independent component analysis), for jointly analyzing gene expression and copy number data. First we used ICA to analyze joint measurements, gene expression and copy number, of a biological system and project the data onto statistically independent biological processes. Next, we used these results to identify patterns of variation in the data and then applied an iterative shaving method. We investigated the properties of our proposed method by analyzing both simulated and real data. We demonstrated the robustness of our method to noise using simulated data. Using breast cancer data, we showed that our method is superior to the Generalized Singular Value Decomposition (GSVD) gene shaving method for identifying genes associated with breast cancer.

17.
MOTIVATION: Identifying candidate genes associated with a given phenotype or trait is an important problem in biological and biomedical studies. Prioritizing genes based on the accumulated information from several data sources is of fundamental importance. Several integrative methods have been developed for the case where a set of candidate genes for the phenotype is available. However, how to prioritize genes for phenotypes when no candidates are available is still a challenging problem. RESULTS: We develop a new method for prioritizing genes associated with a phenotype by Combining Gene expression and protein Interaction data (CGI). The method is applied to yeast gene expression data sets in combination with protein interaction data sets of varying reliability. We found that our method outperforms both the intuitive prioritizing methods that use gene expression data or protein interaction data alone and a recent gene ranking algorithm, GeneRank. We then apply our method to prioritize genes for Alzheimer's disease. AVAILABILITY: The code in this paper is available upon request.

18.
A statistical method to evaluate data from the mouse lymphoma L5178Y/tk assay (MLA) using the microwell method is proposed. The proposed method is designed for data obtained from a single culture protocol instead of the duplicate culture recommended by the United Kingdom Environmental Mutagen Society (UKEMS). The proposed method consists of the following three steps: (1) apply a Dunnett-type test to identify clear negatives; (2) apply a Simpson-Margolin procedure to detect downturn data; and (3) apply a trend test to evaluate the dose-dependent increase in mutant frequency (MF). The performance of the proposed method was evaluated through a Monte Carlo study and a case study. False positive rates realized in the Monte Carlo study were comparable with those of the UKEMS method modified for a single culture protocol with the heterogeneity factors kept at 1.0. False negative rates were lower than those of the modified UKEMS method for dose response patterns with a sharp rise in the higher dose groups, whereas they were comparable for other patterns. The results of evaluating the data from an International Collaborative Study by the proposed method seem comparable with those of the UKEMS method. The proposed method enables us to evaluate data from the microwell MLA with a single culture protocol.

19.
The main bottleneck of molecular dynamics simulations is the estimation of nonbonded pairwise interactions, which often employs neighbour search algorithms to find interacting atom pairs. These methods satisfy the data locality principle poorly and therefore cannot take full advantage of modern computer architectures. In this article, we develop a new method that introduces a temporary list to reduce the sparsity of data access. This list yields a compact, sequential data structure that helps fulfil the data locality principle efficiently. We tested and compared the performance of the new method with that of the widely used reordering method. The new method, based on the linked cell list, is shown to increase computation speed by 13% and to parallelize better than the reordering method. The increase in parallel efficiency makes the new method a promising option for large-scale molecular simulations.
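For context, the baseline that such methods build on, a cell-list neighbour search, can be sketched as follows. This is not the temporary-list method of the abstract, whose details are not given. It is a plain cell list without periodic boundaries, with cells of edge length equal to the cutoff, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def neighbour_pairs(positions, cutoff):
    """Return sorted index pairs (i, j), i < j, separated by less than cutoff.

    positions: list of (x, y, z) tuples. Each atom is binned into a cubic
    cell of edge `cutoff`, so only the 27 surrounding cells need searching.
    """
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(idx)

    cutoff_sq = cutoff * cutoff
    pairs = []
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get() avoids creating empty cells while iterating.
            others = cells.get((cx + dx, cy + dy, cz + dz))
            if not others:
                continue
            for i in members:
                xi, yi, zi = positions[i]
                for j in others:
                    if i < j:  # count each pair once
                        xj, yj, zj = positions[j]
                        d_sq = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
                        if d_sq < cutoff_sq:
                            pairs.append((i, j))
    return sorted(pairs)
```

In this sketch the atoms of one cell are scattered through the `positions` list, so the inner loops jump around in memory. That scattered access is the locality problem that reordering and temporary-list approaches aim to reduce.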

20.
Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (zero-inflated). Statistical inference using such data sets is likely to be inefficient or lead to incorrect conclusions unless the data are treated carefully. In this study, we propose a new modeling method to overcome the problems caused by zero-inflated data sets that involves a regression model and a machine-learning technique. We combined a generalized linear model (GLM), which is widely used in ecology, and bootstrap aggregation (bagging), a machine-learning technique. We established distribution models of Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl), both of which are endangered and have zero-inflated distribution patterns, using our new method and a traditional GLM, and compared model performances. We also modeled four theoretical data sets that contained different ratios of presence/absence values using the new and traditional methods and compared model performances. For distribution models, our new method showed good performance compared to traditional GLMs. After bagging, area under the curve (AUC) values were almost the same as with traditional methods, but sensitivity values were higher. Additionally, our new method showed high sensitivity values compared to the traditional GLM when modeling a theoretical data set containing a large proportion of zero values. These results indicate that our new method has high predictive ability with presence data when analyzing zero-inflated data sets. Generally, predicting presence data is more difficult than predicting absence data. Our new modeling method has potential for advancing species distribution modeling.
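The combination of a GLM with bagging can be sketched in a few lines. The abstract does not describe how the authors resampled or fitted their models, so the code below shows only the generic idea: fit a binomial GLM (logistic regression) to each bootstrap resample and average the predicted presence probabilities. The single covariate, the gradient-ascent fitting, and every name and default value are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=300, lr=0.5):
    """Fit p = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)
            g0 += residual
            g1 += residual * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def bagged_predict(xs, ys, x_new, n_models=15, seed=0):
    """Average presence probability at x_new over bootstrap-fitted models."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        b0, b1 = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        total += sigmoid(b0 + b1 * x_new)
    return total / n_models
```

Averaging over resamples is the bagging step. Variants that rebalance presences and absences within each resample are also possible, but the abstract does not say which scheme was used.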
