Similar articles
20 similar articles found (search time: 31 ms)
1.
We introduce a new method for identifying optimal incomplete data sets from large sequence databases based on the graph theoretic concept of alpha-quasi-bicliques. The quasi-biclique method searches large sequence databases to identify useful phylogenetic data sets with a specified amount of missing data while maintaining the necessary amount of overlap among genes and taxa. The utility of the quasi-biclique method is demonstrated on large simulated sequence databases and on a data set of green plant sequences from GenBank. The quasi-biclique method greatly increases the taxon and gene sampling in the data sets while adding only a limited amount of missing data. Furthermore, under the conditions of the simulation, data sets with a limited amount of missing data often produce topologies nearly as accurate as those built from complete data sets. The quasi-biclique method will be an effective tool for exploiting sequence databases for phylogenetic information and also may help identify critical sequences needed to build large phylogenetic data sets.
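A rough sketch of the alpha-quasi-biclique criterion as described above; the function name and the exact row/column thresholding are illustrative assumptions, not the authors' implementation:

```python
# Sketch: a taxa x genes presence matrix qualifies as an alpha-quasi-biclique
# if every taxon and every gene is missing at most a fraction `alpha` of its
# entries (illustrative thresholding, not the paper's exact definition).

def is_alpha_quasi_biclique(matrix, alpha):
    """matrix: list of rows (taxa), each a list of 0/1 gene-presence flags."""
    n_taxa = len(matrix)
    n_genes = len(matrix[0])
    # Every taxon must have data for at least (1 - alpha) of the genes.
    for row in matrix:
        if sum(row) < (1 - alpha) * n_genes:
            return False
    # Every gene must be sampled in at least (1 - alpha) of the taxa.
    for j in range(n_genes):
        col_sum = sum(row[j] for row in matrix)
        if col_sum < (1 - alpha) * n_taxa:
            return False
    return True

presence = [
    [1, 1, 1, 0],   # taxon with one missing gene
    [1, 1, 1, 1],
    [1, 1, 0, 1],   # another taxon with one missing gene
]
print(is_alpha_quasi_biclique(presence, alpha=0.34))  # True: 34% missing tolerated
print(is_alpha_quasi_biclique(presence, alpha=0.0))   # False: complete data required
```

Raising alpha trades completeness for larger taxon and gene sampling, which is the balance the abstract describes.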

2.
This paper contends that a research ethics approach to the regulation of health data research is unhelpful in the era of population‐level research and big data because it results in a primary focus on consent (meta‐, broad, dynamic and/or specific consent). Two recent guidelines – the 2016 WMA Declaration of Taipei on ethical considerations regarding health databases and biobanks and the revised CIOMS International ethical guidelines for health‐related research involving humans – both focus on the growing reliance on health data for research. But as research ethics documents, they remain (to varying degrees) focused on consent and individual control of data use. Many current and future uses of health data make individual consent impractical, if not impossible. Many of the risks of secondary data use apply to communities and stakeholders rather than individual data subjects. Shifting from a research ethics perspective to a public health lens brings a different set of issues into view: how are the benefits and burdens of data use distributed, how can data research empower communities, who has legitimate decision‐making capacity? I propose that a public health ethics framework – based on public benefit, proportionality, equity, trust and accountability – provides more appropriate tools for assessing the ethical uses of health data. The main advantage of a public health approach for data research is that it is more likely to foster debate about power, justice and equity and to highlight the complexity of deciding when data use is in the public interest.

3.
Ecological data can be difficult to collect, and as a result, some important temporal ecological datasets contain irregularly sampled data. Since many temporal modelling techniques require regularly spaced data, one common approach is to linearly interpolate the data and build a model from the interpolated series. However, this process introduces an unquantified risk that the model is over-fitted to the interpolated (and hence more typical) instances. Using one such irregularly sampled dataset, the Lake Kasumigaura algal dataset, we compare models built on the original sample data with models built on the interpolated data, to evaluate the risk of mis-fitting introduced by interpolation.
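The interpolation step described above can be sketched with `np.interp`; the series here is invented, not the Lake Kasumigaura data:

```python
import numpy as np

# Regularize an irregularly sampled series by linear interpolation.
t_obs = np.array([0.0, 1.0, 3.5, 4.0, 7.0])     # irregular sampling times
y_obs = np.array([2.0, 3.0, 1.0, 1.5, 4.0])     # observed values

t_reg = np.arange(0.0, 7.5, 0.5)                # regular half-unit grid
y_reg = np.interp(t_reg, t_obs, y_obs)          # linearly interpolated series

# Most grid points are now interpolated rather than observed -- the
# over-fitting risk discussed above: a model fitted to the regular grid
# mostly sees "typical" (interpolated) values.
n_interpolated = sum(t not in t_obs for t in t_reg)
print(n_interpolated, "of", len(t_reg), "grid points are interpolated")  # 10 of 15
```

Comparing a model fitted to `(t_obs, y_obs)` against one fitted to `(t_reg, y_reg)`, as the abstract does, quantifies how much the interpolation distorts the fit.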

4.
Missing data problems persist in many scientific investigations. Although various strategies for analyzing missing data have been proposed, they are mainly limited to data on continuous measurements. In this paper, we focus on implementing some of the available strategies to analyze item response data. In particular, we investigate the effects of popular missing data methods on various missing data mechanisms. We examine large sample behaviors of estimators in a simulation study that evaluates and compares their performance. We use data from a quality of life study with lung cancer patients to illustrate the utility of these methods.
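A toy contrast of two of the popular strategies referred to above, complete-case deletion versus mean imputation, on invented 0/1 item responses (not the lung cancer study data):

```python
import statistics

# Invented item-response vector; None marks a missing response.
responses = [1, 0, None, 1, 1, None, 0, 1]

# Strategy 1: complete-case analysis drops the missing entries.
observed = [r for r in responses if r is not None]
complete_case_mean = statistics.fmean(observed)

# Strategy 2: mean imputation fills gaps with the observed mean.
imputed = [complete_case_mean if r is None else r for r in responses]
mean_after_imputation = statistics.fmean(imputed)

# Mean imputation reproduces the complete-case mean but shrinks the
# variance, one reason such methods behave differently across missing
# data mechanisms.
print(complete_case_mean, mean_after_imputation)
print(statistics.pvariance(imputed) < statistics.pvariance(observed))  # True
```

Under mechanisms other than missing-completely-at-random, even the means of the two strategies diverge, which is the kind of effect the simulation study above measures.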

5.
Although computers are capable of storing a huge amount of data, there is a need for more sophisticated software to assemble and organize raw data into useful information for dissemination. Therefore we developed tools that assist in gathering and categorizing data for the study of microbial diversity and systematics. The first tool retrieves data from heterogeneous data sources on the Internet. The second tool provides researchers with a polyphasic view of microbes based on phenotypic characteristics and molecular sequence data.

6.
Delimitation of species based exclusively on genetic data has been advocated despite a critical knowledge gap: how might such approaches fail because they rely on genetic data alone, and would their accuracy be improved by using multiple data types? We provide here the requisite framework for addressing these key questions. Because both phenotypic and molecular data can be analyzed in a common Bayesian framework with our program iBPP, we can compare the accuracy of taxa delimited from genetic data alone versus genetic data integrated with phenotypic data. We can also evaluate how the integration of phenotypic data might improve species delimitation when divergence occurs with gene flow and/or is selectively driven. These two realities of the speciation process are ignored by currently available genetic approaches. Our model accommodates phenotypic characters that exhibit different degrees of divergence, allowing for both neutral traits and traits under selection. We found a greater accuracy of estimated species boundaries with the integration of phenotypic and genetic data, with a strong beneficial influence of phenotypic data from traits under selection when the speciation process involves gene flow. Our results highlight the benefits of multiple data types, but also draw into question the rationale of species delimitation based exclusively on genetic data.

7.
Analysis of repeatability in spotted cDNA microarrays
We report a strategy for analysis of data quality in cDNA microarrays based on the repeatability of repeatedly spotted clones. We describe how repeatability can be used to control data quality by developing adaptive filtering criteria for microarray data containing clones spotted in multiple spots. We have applied the method on five publicly available cDNA microarray data sets and one previously unpublished data set from our own laboratory. The results demonstrate the feasibility of the approach as a foundation for data filtering, and indicate a high degree of variation in data quality, both across the data sets and between arrays within data sets.
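One plausible form of such a repeatability filter; the coefficient-of-variation threshold and the function name are illustrative assumptions, not the paper's exact adaptive criteria:

```python
import statistics

# Keep only clones whose replicate spot intensities agree: compute the
# coefficient of variation (CV) across a clone's spots and filter on it.

def filter_by_repeatability(replicate_spots, max_cv=0.2):
    """replicate_spots: dict clone_id -> list of intensities from its spots."""
    kept = {}
    for clone, values in replicate_spots.items():
        mean = statistics.fmean(values)
        cv = statistics.stdev(values) / mean
        if cv <= max_cv:
            kept[clone] = mean          # summarize replicates by their mean
    return kept

spots = {
    "cloneA": [100.0, 105.0, 98.0],    # tight replicates -> kept
    "cloneB": [100.0, 240.0, 40.0],    # poor repeatability -> filtered out
}
print(filter_by_repeatability(spots))
```

An adaptive version, in the spirit of the abstract, would set `max_cv` per array from the observed distribution of CVs rather than using a fixed constant.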

8.
Functional understanding of signaling pathways requires detailed information about the constituent molecules and their interactions. Simulations of signaling pathways therefore build upon a great deal of data from various sources. We first survey electronic data resources for cell signaling modeling and then, based on their type of data representation, classify the data sources into five broad groups. None of the data sources surveyed provide all required data in a ready-to-be-modeled fashion. We then put forward a "wish list" of desired attributes for an ideal modeling-centric database. Finally, we close with perspectives on how electronic data sources for cell signaling modeling have developed. We suggest that future directions in such data sources are largely model-driven and hinge on the interoperability of data sources.

9.
10.
Centralisation of tools for analysis of genomic data is paramount in ensuring that research is always carried out on the latest currently available data. As such, World Wide Web sites providing a range of online analyses and displays of data can play a crucial role in guaranteeing consistency of in silico work. In this respect, the protozoan parasite research community is served by several resources, either focussing on data and tools for one species or taking a broader view and providing tools for analysis of data from many species, thereby facilitating comparative studies. In this paper, we give a broad overview of the online resources available. We then focus on the GeneDB project, detailing the features and tools currently available through it. Finally, we discuss data curation and its importance in keeping genomic data 'relevant' to the research community.

11.
As nature and society change, risks inevitably change as well. This implies that, with the passage of time, some historical data become invalid for probabilistic risk analysis. In this paper, a model to acquire the valid data is suggested, based on the Mann-Kendall test for detecting abrupt change-points in time series data. Moreover, typhoon risk analysis in Guangdong Province, China is used as a case study to show how to apply the model. The valid data on typhoon intensities and related losses in the province for probabilistic risk analysis are obtained from records spanning 1984 to 2012. Compared with the results based on the set of invalid data and the set of all collected data, the risk assessed from the valid data is more reliable and better reflects the dynamics of typhoon risk.
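A minimal version of the Mann-Kendall statistic the model builds on; this is the plain trend form, whereas the paper's change-point detection uses a sequential variant of these sums, and the series here is invented:

```python
import math

# Mann-Kendall S: sum over all pairs i < j of sign(x_j - x_i).
def mann_kendall_s(series):
    s = 0
    n = len(series)
    for i in range(n):
        for j in range(i + 1, n):
            s += (series[j] > series[i]) - (series[j] < series[i])
    return s

# Standard normal approximation of the test statistic (no tie correction,
# assumed here for simplicity).
def mann_kendall_z(series):
    n = len(series)
    s = mann_kendall_s(series)
    var = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / math.sqrt(var)
    if s < 0:
        return (s + 1) / math.sqrt(var)
    return 0.0

rising = [1, 2, 3, 5, 4, 6, 7, 9, 8, 10]
print(mann_kendall_s(rising))             # 41: strongly increasing series
print(round(mann_kendall_z(rising), 2))   # 3.58: significant at usual levels
```

A change-point search in the spirit of the paper computes such statistics on progressively longer prefixes and suffixes of the series and looks for where they cross.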

12.
The collection of data on physical parameters of body segments is a preliminary critical step in studying the biomechanics of locomotion. Little data on nonhuman body segment parameters has been published. The lack of standardization of techniques for data collection and presentation has made the comparative use of these data difficult and at times impossible. This study offers an approach for collecting data on center of gravity and moments of inertia for standardized body segments. The double swing pendulum approach is proposed as a solution for difficulties previously encountered in calculating moments of inertia for body segments. A format for prompting a computer to perform these calculations is offered, and the resulting segment mass data for Lemur fulvus is presented.
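Pendulum methods of this kind build on the compound-pendulum relation; a minimal sketch with invented numbers, where `segment_inertia` is a hypothetical helper rather than the paper's computer format:

```python
import math

# A segment swung about a pivot at distance d from its center of gravity,
# with small-amplitude period T, has moment of inertia about the pivot
#   I_pivot = m * g * d * T^2 / (4 * pi^2),
# and the parallel-axis theorem gives the moment about the center of gravity.

def segment_inertia(mass_kg, pivot_to_cg_m, period_s, g=9.81):
    i_pivot = mass_kg * g * pivot_to_cg_m * period_s**2 / (4 * math.pi**2)
    i_cg = i_pivot - mass_kg * pivot_to_cg_m**2   # parallel-axis theorem
    return i_pivot, i_cg

# Invented segment: 0.5 kg, CG 12 cm from the pivot, 0.9 s period.
i_pivot, i_cg = segment_inertia(mass_kg=0.5, pivot_to_cg_m=0.12, period_s=0.9)
print(round(i_pivot, 6), round(i_cg, 6))  # 0.012077 0.004877  (kg m^2)
```

The double-swing variant described above uses two suspension points to avoid having to locate the center of gravity precisely beforehand.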

13.
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear discriminant analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable for undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called uncorrelated linear discriminant analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the generalized singular value decomposition method to handle undersampled data and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.

14.
Environmental DNA (eDNA) analysis of water samples is on the brink of becoming a standard monitoring method for aquatic species. This method has improved detection rates over conventional survey methods and thus has demonstrated effectiveness for estimation of site occupancy and species distribution. The frontier of eDNA applications, however, is to infer species density. Building upon previous studies, we present and assess a modeling approach that aims at inferring animal density from eDNA. The modeling combines eDNA and animal count data from a subset of sites to estimate species density (and associated uncertainties) at other sites where only eDNA data are available. As a proof of concept, we first perform a cross‐validation study using experimental data on carp in mesocosms. In these data, fish densities are known without error, which allows us to test the performance of the method with known data. We then evaluate the model using field data from a study on a stream salamander species to assess the potential of this method to work in natural settings, where density can never be known with absolute certainty. Two alternative distributions (Normal and Negative Binomial) to model variability in eDNA concentration data are assessed. Assessment based on the proof of concept data (carp) revealed that the Negative Binomial model provided much more accurate estimates than the model based on a Normal distribution, likely because eDNA data tend to be overdispersed. Greater imprecision was found when we applied the method to the field data, but the Negative Binomial model still provided useful density estimates. We call for further model development in this direction, as well as further research targeted at sampling design optimization. It will be important to assess these approaches on a broad range of study systems.
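The overdispersion the authors point to can be illustrated with a quick simulation: Negative Binomial data have variance mu + mu^2/k, well above the mean. The parameter values and seed are invented, not from the study:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, k = 10.0, 2.0                       # mean and NB size (dispersion) parameter
p = k / (k + mu)                        # numpy's (n, p) NB parameterization
counts = rng.negative_binomial(k, p, size=20000)

sample_mean = counts.mean()
sample_var = counts.var()
# Theoretical variance is mu + mu^2 / k = 60, six times the mean of 10,
# so a model tying variance tightly to the mean would badly underfit.
print(round(sample_mean, 2), round(sample_var, 2))
```

Fitting both a Normal and a Negative Binomial likelihood to such data and comparing held-out accuracy mirrors the cross-validation comparison in the abstract.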

15.
We evaluated the prediction of beta-turns from amino acid sequences using the residue-coupled model with an enlarged representative protein data set selected from the Protein Data Bank. Our results show that the probability values derived from a data set comprising 425 protein chains yielded an overall beta-turn prediction accuracy of 68.74%, compared with the 94.7% reported earlier on a data set of 30 proteins using the same method. However, we noted that the overall beta-turn prediction accuracy using probability values derived from the 30-protein data set drops to 40.74% when tested on the data set comprising 425 protein chains. In contrast, using probability values derived from the 425-chain data set used in this analysis, the overall beta-turn prediction accuracy yielded consistent results when tested on either the 30-protein data set used earlier (64.62%), a more recent representative data set comprising 619 protein chains (64.66%), or a jackknife data set comprising 476 representative protein chains (63.38%). We therefore recommend the use of probability values derived from the 425 representative protein chains data set reported here, which gives more realistic and consistent predictions of beta-turns from amino acid sequences.

16.
Although haplotype data can be used to analyze the function of DNA, collecting haplotype data directly requires significant effort, so genotype data are usually collected and the population haplotype inference (PHI) problem is then solved to infer haplotype data for a population. This paper investigates the PHI problem based on the pure parsimony criterion (HIPP), which seeks the minimum number of distinct haplotypes needed to infer a given genotype data set. We analyze the mathematical structure and properties of the HIPP problem, propose techniques to reduce the given genotype data into an equivalent instance of much smaller size, and analyze the relations among genotype data using a compatible graph. Based on the mathematical properties of the compatible graph, we propose a maximal clique heuristic to obtain an upper bound, and a new polynomial-sized integer linear programming formulation to obtain a lower bound, for the HIPP problem.
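The combinatorial core of the PHI problem can be sketched as follows. The per-site encoding (0 and 1 for the two homozygous states, 2 for heterozygous) is an assumption for illustration:

```python
from itertools import product

# A genotype with h heterozygous sites is resolved by 2^(h-1) unordered
# haplotype pairs; pure parsimony (HIPP) then seeks the smallest set of
# distinct haplotypes that resolves every genotype in the sample.

def compatible_pairs(genotype):
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het_sites, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(frozenset({tuple(h1), tuple(h2)}))
    return pairs

pairs = compatible_pairs((2, 0, 2, 1))
print(len(pairs))   # 2 heterozygous sites -> 2^(2-1) = 2 unordered pairs
```

This exponential blow-up per genotype is why the paper's data-reduction techniques and integer programming bounds matter: enumerating all pairs is infeasible for realistic numbers of heterozygous sites.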

17.
Missing data are commonly encountered using multilocus, fragment‐based (dominant) fingerprinting methods, such as random amplified polymorphic DNA (RAPD) or amplified fragment length polymorphism (AFLP). Data sets containing missing data have been analysed by eliminating those bands or samples with missing data, assigning values to missing data or ignoring the problem. Here, we present a method that uses random assignments of band presence–absence to the missing data, implemented by the computer program famd (available from http://homepage.univie.ac.at/philipp.maria.schlueter/famd.html), for analyses based on pairwise similarity and Shannon's index. When missing values group in a data set, sample or band elimination is likely to be the most appropriate action. However, when missing values are scattered across the data set, minimum, maximum and average similarity coefficients are a simple means of visualizing the effects of missing data on tree structure. Our approach indicates the range of values that a data set containing missing data points might generate, and forces the investigator to consider the effects of missing values on data interpretation.
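A brute-force version of the minimum/maximum similarity idea above (famd itself uses random assignments; `similarity_range` and the Jaccard choice are illustrative assumptions):

```python
from itertools import product

# Fill every missing band score (None) with 0 or 1 in all combinations and
# report the range of pairwise Jaccard similarities this produces.

def jaccard(a, b):
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 1.0

def similarity_range(a, b):
    profiles = [list(a), list(b)]
    slots = [(p, i) for p in (0, 1)
             for i, v in enumerate(profiles[p]) if v is None]
    sims = []
    for fill in product((0, 1), repeat=len(slots)):
        for (p, i), v in zip(slots, fill):
            profiles[p][i] = v
        sims.append(jaccard(profiles[0], profiles[1]))
    return min(sims), max(sims)

lo, hi = similarity_range((1, 1, 0, None), (1, 0, None, 1))
print(lo, hi)   # 0.25 to 2/3: the spread attributable to missing bands
```

A wide gap between the minimum and maximum coefficients signals, as the abstract argues, that missing values could materially change the inferred tree structure.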

18.
19.
This paper is a survey of the current state of molecular studies on bryophyte phylogeny. Molecular data have greatly contributed to developing a phylogeny and classification of bryophytes. The previous traditional systems of classification based on morphological data are being significantly revised. New data of the authors are presented on phylogeny of Hypnales pleurocarpous mosses inferred from nucleotide sequence data of the nuclear DNA internal transcribed spacers ITS1-2 and the trnL-F region of the chloroplast genome.

20.
The study of evolutionary relationships among human populations is fundamental to inferring processes that determine their structure and history. Among the different data types used to infer such relationships, molecular data, particularly nuclear and mitochondrial DNA, are preferred because of their high heritability and the low probability of changes during development. However, although the reliability of relatedness patterns based on other traits is discussed, except in unusual circumstances most prehistoric populations remain within the domain of morphological study. Therefore the primary goal of this study is to test the reliability of relatedness patterns constructed on the basis of craniometric data on a regional scale. In particular, we analyze samples from populations belonging to the Chaco, Pampa, and Patagonia regions of South America for which craniometric and molecular data are available. We compare a strongly supported relatedness pattern based on molecular data with the results obtained through landmark-based and semilandmark-based facial data. The matrices based on Euclidean distance for morphometric data and DA distances for molecular data were used to perform principal coordinates analyses and to obtain reticulograms. Finally, a principal components analysis of all individuals was performed with morphometric data. The results indicate that ordination analyses yield slightly different results depending on the morphometric data used. However, the reticulograms obtained with both landmark-based and semilandmark-based data allow the separation of the Chubut samples from the Chaco samples, with the Pampa sample in between the others; this pattern is congruent with molecular-based analyses. As a consequence, our results indicate that facial morphometric data allow the inference of the structure and history of the prehistoric populations for the studied region.
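The principal coordinates analysis used above can be sketched from scratch via Gower double-centering; the 3-sample distance matrix is invented, not the craniometric data:

```python
import numpy as np

# Classical MDS / PCoA: double-center the squared distance matrix and
# eigendecompose it; the top eigenvectors, scaled by sqrt(eigenvalue),
# give sample coordinates whose Euclidean distances approximate the input.

def pcoa(dist, n_axes=2):
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ d2 @ j                        # Gower double centering
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return vecs[:, :n_axes] * np.sqrt(np.clip(vals[:n_axes], 0, None))

dist = [[0.0, 2.0, 4.0],
        [2.0, 0.0, 4.0],
        [4.0, 4.0, 0.0]]
coords = pcoa(dist)
# For a Euclidean input matrix, the recovered coordinates reproduce the
# pairwise distances exactly.
print(np.round(np.linalg.norm(coords[0] - coords[1]), 3))  # 2.0
```

The same ordination works whether the input matrix holds Euclidean morphometric distances or DA molecular distances, which is what lets the study compare the two data types directly.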
