Similar Documents
20 similar documents found.
1.
Background

Record linkage integrates records across multiple related data sources, identifying duplicates and accounting for possible errors. Real-life applications require efficient algorithms to merge these voluminous data sources and find all records belonging to the same individuals. Our recently devised record linkage algorithms provide the best-known solutions to this challenging problem.

Method

We have developed RLT-S, a freely available web tool that implements our single-linkage clustering algorithm for record linkage. The tool requires the input data sets and a small set of configuration settings describing those files. RLT-S employs exact-match clustering, blocking on a specified attribute, and single-linkage hierarchical clustering within the resulting blocks.

Results

RLT-S is an implementation of our sequential record linkage algorithm. It outperforms the previous best-known implementations by a large margin, running at least two times faster than them on every dataset tested.

Conclusions

RLT-S implements our record linkage algorithm, which outperforms the previous best-known algorithms in this area. The website also provides instructions, submission history, feedback, publications, and other material to facilitate use of the tool.

Availability

RLT-S is integrated into http://www.rlatools.com, which currently serves this tool only. The tool is freely available and can be used without login. All data files used in this paper are stored at https://github.com/abdullah009/DataRLATools; for copies of the relevant programs, see https://github.com/abdullah009/RLATools.
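The combination of blocking, exact matching, and single-linkage clustering described above can be sketched in a few lines. This is an illustrative toy, not the RLT-S implementation: the record fields, the blocking key, and the one-character name-difference similarity rule are all invented for the example.

```python
from collections import defaultdict

def single_linkage_dedup(records, block_key, similar):
    """Exact-match blocking followed by single-linkage clustering:
    only records sharing a blocking key are compared pairwise, and any
    chain of similar pairs is merged into one cluster via union-find."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[rec[block_key]].append(i)

    for ids in blocks.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                if similar(records[ids[a]], records[ids[b]]):
                    parent[find(ids[a])] = find(ids[b])

    return [find(i) for i in range(len(records))]

# toy similarity rule: same-length names differing in at most one character
def close_names(r1, r2):
    n1, n2 = r1["name"], r2["name"]
    return len(n1) == len(n2) and sum(a != b for a, b in zip(n1, n2)) <= 1
```

Blocking is what keeps the comparison count tractable: records in different blocks are never compared, which is where the speed advantage on large files comes from.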

2.

Background

The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health datasets.

Methods

EMS is a configurable Microsoft SQL Server database program. To examine the accuracy of EMS, two public health databases were matched using National Health Service (NHS) numbers as a gold standard unique identifier. Probabilistic linkage was then performed on the same two datasets without inclusion of NHS number. Sensitivity analyses were carried out to examine the effect of varying matching process parameters.

Results

Exact matching using NHS number between two datasets (containing 5931 and 1759 records) identified 1071 matched pairs. EMS probabilistic linkage identified 1068 record pairs. The sensitivity of probabilistic linkage was calculated as 99.5% (95%CI: 98.9, 99.8), specificity 100.0% (95%CI: 99.9, 100.0), positive predictive value 99.8% (95%CI: 99.3, 100.0), and negative predictive value 99.9% (95%CI: 99.8, 100.0). Probabilistic matching was most accurate when including address variables and using the automatically generated threshold for determining links with manual review.

Conclusion

With the establishment of national electronic datasets across health and social care, EMS enables previously unanswerable research questions to be tackled with confidence in the accuracy of the linkage process. Where a small sample is matched into a very large database (such as national records of hospital attendance), the positive predictive value or sensitivity may fall below the figures reported here, depending on the prevalence of matches between the databases. Despite this possible limitation, probabilistic linkage has great potential where exact matching on a common identifier is not possible, including in low-income settings and for vulnerable groups such as homeless populations, where the absence of unique identifiers and lower data quality have historically hindered the ability to identify individuals across datasets.
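The accuracy figures above come from comparing the probabilistic links against the NHS-number gold standard; the bookkeeping is easy to reproduce. A minimal sketch (the pair sets in the test are made up for illustration):

```python
def linkage_accuracy(gold_pairs, predicted_pairs):
    """Sensitivity and positive predictive value of a predicted set of
    record pairs against a gold-standard set (order within a pair ignored)."""
    norm = lambda pairs: {tuple(sorted(p)) for p in pairs}
    gold, pred = norm(gold_pairs), norm(predicted_pairs)
    tp = len(gold & pred)   # links found by both
    fn = len(gold - pred)   # true links the matcher missed
    fp = len(pred - gold)   # spurious links
    return tp / (tp + fn), tp / (tp + fp)
```

Specificity and negative predictive value additionally require the count of true non-links, i.e. all cross-dataset pairs not in either set.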

3.
MOTIVATION: Recent studies have shown that microarray gene expression data are useful for phenotype classification of many diseases. A major problem in this classification is that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. Many approaches have been proposed for this gene selection problem. Most previous gene ranking methods select 50-200 top-ranked genes, and these genes are often highly correlated. Our goal is to select a small set of non-redundant marker genes that are most relevant for the classification task. RESULTS: To achieve this goal, we developed a novel hybrid approach that combines gene ranking and clustering analysis. We first apply feature filtering algorithms to select a set of top-ranked genes, and then apply hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram is analyzed by a sweep-line algorithm, and marker genes are selected by collapsing dense clusters. An empirical study using three public datasets shows that our approach selects relatively few marker genes while offering the same or better leave-one-out cross-validation accuracy than approaches that use top-ranked genes directly for classification. AVAILABILITY: The HykGene software is freely available at http://www.cs.dartmouth.edu/~wyh/software.htm. CONTACT: wyh@cs.dartmouth.edu SUPPLEMENTARY INFORMATION: Supplementary material is available from http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm.
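The two-stage idea, rank first and then prune redundancy among the top genes, can be illustrated with a greedy correlation filter standing in for the paper's dendrogram sweep-line step. The gene names, scores, correlation matrix, and thresholds below are invented for the sketch:

```python
def select_markers(genes, scores, corr, top_k=50, max_corr=0.9):
    """Keep the top-ranked genes, then greedily drop any gene that is
    highly correlated with a marker already selected (redundancy pruning)."""
    ranked = sorted(range(len(genes)), key=lambda i: -scores[i])[:top_k]
    markers = []
    for i in ranked:
        if all(abs(corr[i][j]) < max_corr for j in markers):
            markers.append(i)
    return [genes[i] for i in markers]
```

The point of the second stage is the same as in the paper: a ranked list alone tends to return many near-duplicate genes, while pruning yields a compact, non-redundant marker set.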

4.
MOTIVATION: The increasing use of microarray technologies is generating large amounts of data that must be processed in order to extract useful and rational fundamental patterns of gene expression. Hierarchical clustering is one method used to analyze gene expression data, but traditional hierarchical clustering algorithms suffer from several drawbacks (e.g. a fixed topology structure, and mis-clustered data that cannot be reevaluated). In this paper, we introduce a new hierarchical clustering algorithm that overcomes some of these drawbacks. RESULTS: We propose a new tree-structured self-organizing neural network, the dynamically growing self-organizing tree (DGSOT) algorithm, for hierarchical clustering. The DGSOT constructs a hierarchy from top to bottom by division. At each hierarchical level, the DGSOT optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. In addition, we propose a new cluster validation criterion, based on the geometric properties of the Voronoi partition of the dataset, to find the proper number of clusters at each hierarchical level. This criterion uses the Minimum Spanning Tree (MST) concept from graph theory and is computationally inexpensive for large datasets. A K-level up distribution (KLD) mechanism, which increases the scope of data distribution in the hierarchy construction, is used to improve clustering accuracy: it allows data mis-clustered in the early stages to be reevaluated later, increasing the accuracy of the final result. The clustering result of the DGSOT is easily displayed as a dendrogram for visualization. On a yeast cell cycle microarray expression dataset, our algorithm extracted gene expression patterns at different levels; furthermore, the biological functionality enrichment of the clusters was considerably high and the hierarchical structure of the clusters was more reasonable.
AVAILABILITY: DGSOT is available upon request from the authors.

5.
Inferring the structure of populations has many applications in genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. Parametric algorithms such as Structure are very popular, but their underlying complexity and high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches; among this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy that allows a progressive investigation of population structure. The method takes genetic data as input, clusters individuals into homogeneous sub-populations, and estimates the optimal number of such sub-populations using the gap statistic. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets from the HapMap and Pan-Asian SNP consortia. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate on simulated datasets, both in terms of individual assignments and estimation of the correct number of clusters. On the real datasets, the clusterings of SHIPS were the most consistent with the population labels or with those produced by the Admixture program.
The performance of SHIPS on SNP data, along with its relatively low computational cost and ease of use, makes this method a promising solution for inferring fine-scale genetic patterns.
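SHIPS picks the number of sub-populations with the gap statistic, which compares the within-cluster dispersion of the data against that of uniform reference samples. A toy two-dimensional version conveys the idea; the sample points, the rectangular uniform reference box, and the fixed clustering function are illustrative assumptions, not the SHIPS implementation (which operates on genetic data).

```python
import math
import random

def within_dispersion(points, labels):
    """W = sum over clusters of (sum of pairwise squared distances) / cluster size."""
    w = 0.0
    for c in set(labels):
        cl = [p for p, l in zip(points, labels) if l == c]
        w += sum((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                 for i, a in enumerate(cl) for b in cl[i + 1:]) / len(cl)
    return w

def gap_statistic(points, cluster_fn, n_ref=20, seed=1):
    """Gap = mean log W over uniform reference samples minus log W of the data;
    across candidate k, the k maximizing the gap is the estimated cluster count."""
    rng = random.Random(seed)
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    log_w = math.log(within_dispersion(points, cluster_fn(points)))
    ref = []
    for _ in range(n_ref):
        sample = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
                  for _ in points]
        ref.append(math.log(within_dispersion(sample, cluster_fn(sample))))
    return sum(ref) / n_ref - log_w
```

Real clustered data are much tighter than a uniform cloud, so the gap is large when the clustering matches the true structure.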

6.
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to group fiber tracts time-efficiently. Structural information from a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic, anatomically correct, and reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. To investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.

7.
8.
In recent years sympatry networks have been proposed as a means of performing biogeographic analysis, but their computation posed practical difficulties that limited their use. We propose a novel approach that brings well-established network analysis tools closer to the study of sympatry patterns, using both geographic and environmental data associated with the occurrence of species. Our proposed algorithm, SGraFuLo, combines fuzzy logic and numerical methods to compute the network of interest directly from point locality records, without the need for specialized tools such as geographic information systems, thereby simplifying the process for end users. By posing the problem in matrix terms, SGraFuLo achieves remarkable efficiency even for large datasets, taking advantage of well-established scientific computing algorithms. We present sympatry networks constructed from real-world data collected in Mexico and Central America and highlight the potential of our approach for analyzing the overlapping niches of species, which could have important applications even in evolutionary studies. We also present details of the design and implementation of the algorithm, as well as experiments that demonstrate its efficiency. The source code is freely released, and the datasets are also available to support the reproducibility of our results.

9.
K-ary clustering with optimal leaf ordering for gene expression data
MOTIVATION: A major challenge in gene expression analysis is effective data organization and visualization. One of the most popular tools for this task is hierarchical clustering. Hierarchical clustering allows a user to view relationships at scales ranging from single genes to large sets of genes, while at the same time providing a global view of the expression data. However, hierarchical clustering is very sensitive to noise, usually lacks a method to actually identify distinct clusters, and produces a large number of possible leaf orderings of the clustering tree. In this paper we propose a new hierarchical clustering algorithm which reduces susceptibility to noise, permits up to k siblings to be directly related, and provides a single optimal order for the resulting tree. RESULTS: We present an algorithm that efficiently constructs a k-ary tree, in which each node can have up to k children, and then optimally orders the leaves of that tree. By combining k clusters at each step our algorithm becomes more robust against noise and missing values. By optimally ordering the leaves of the resulting tree we maintain the pairwise relationships that appear in the original method, without sacrificing robustness. Our k-ary construction algorithm runs in O(n^3) regardless of k, and our ordering algorithm runs in O(4^k n^3). We present several examples showing that our k-ary clustering algorithm achieves results superior to the binary tree results in both global presentation and cluster identification. AVAILABILITY: We have implemented the above algorithms in C++ on the Linux operating system.

10.
11.
12.
Evaluation and comparison of gene clustering methods in microarray analysis
MOTIVATION: Microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of gene expression in thousands of genes. Gene clustering analysis has proven useful for discovering groups of correlated genes potentially co-regulated or associated with the disease or conditions under investigation. Many clustering methods, including hierarchical clustering, K-means, PAM, SOM, mixture model-based clustering and tight clustering, have been widely used in the literature, yet no comprehensive comparative study has been performed to evaluate their effectiveness. RESULTS: In this paper, six gene clustering methods are evaluated on simulated data from a hierarchical log-normal model with various degrees of perturbation, as well as on four real datasets. A weighted Rand index is proposed for measuring the similarity of two clustering results in the presence of scattered genes (i.e. a set of noise genes not being clustered). Performance of the methods on the real data is assessed by a predictive accuracy analysis through verified gene annotations. Our results show that tight clustering and model-based clustering consistently outperform the other clustering methods on both simulated and real data, while hierarchical clustering and SOM perform among the worst. Our analysis provides deep insight into the complicated problem of clustering gene expression profiles and serves as a practical guideline for routine microarray cluster analysis.
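The weighted Rand index proposed in the paper builds on the plain Rand index, which is the base quantity worth knowing. A minimal sketch of the unweighted version (not the authors' weighted variant, which additionally down-weights scattered genes):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs the two clusterings agree on:
    placed together in both clusterings, or apart in both."""
    idx = range(len(labels_a))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in combinations(idx, 2))
    return agree / (len(labels_a) * (len(labels_a) - 1) / 2)
```

Because it compares pair relations rather than labels, the index is invariant to relabeling of clusters, which is exactly what a clustering-similarity measure needs.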

13.
14.
15.
We aim to improve segmentation through the use of machine learning tools during region agglomeration. We propose an active learning approach for performing hierarchical agglomerative segmentation from superpixels. Our method combines multiple features at all scales of the agglomerative process, works for data with an arbitrary number of dimensions, and scales to very large datasets. We advocate the use of variation of information to measure segmentation accuracy, particularly in 3D electron microscopy (EM) images of neural tissue, and using this metric demonstrate an improvement over competing algorithms in EM and natural images.
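Variation of information, the accuracy measure advocated above, is an information-theoretic distance between two partitions: VI(A, B) = H(A) + H(B) - 2 I(A; B), which is zero exactly when the two segmentations coincide. A small self-contained version computed from label co-occurrence counts:

```python
import math
from collections import Counter

def variation_of_information(a, b):
    """VI(A, B) = H(A) + H(B) - 2 I(A; B) for two label sequences
    over the same items, using natural-log entropies."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    entropy = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    mutual = sum(c / n * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
                 for (x, y), c in pab.items())
    return entropy(pa) + entropy(pb) - 2 * mutual
```

Unlike the Rand index, VI is a true metric on partitions, which is one reason it is favored for comparing segmentations of the same image.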

16.
We present likelihood-based methods for assigning the individuals in a sample to source populations, on the basis of their genotypes at co-dominant marker loci. The source populations are assumed to be at Hardy-Weinberg and linkage equilibrium, but the allelic composition of these source populations and even the number of source populations represented in the sample are treated as uncertain. The parameter of interest is the partition of the set of sampled individuals, induced by the assignment of individuals to source populations. We present a maximum likelihood method, and then a more powerful Bayesian approach for estimating this sample partition. In general, it will not be feasible to evaluate the evidence supporting each possible partition of the sample. Furthermore, when the number of individuals in the sample is large, it may not even be feasible to evaluate the evidence supporting, individually, each of the most plausible partitions because there may be many individuals which are difficult to assign. To overcome these problems, we use low-dimensional marginals (the 'co-assignment probabilities') of the posterior distribution of the sample partition as measures of 'similarity', and then apply a hierarchical clustering algorithm to identify clusters of individuals whose assignment together is well supported by the posterior distribution. A binary tree provides a visual representation of how well the posterior distribution supports each cluster in the hierarchy. These methods are applicable to other problems where the parameter of interest is a partition of a set. Because the co-assignment probabilities are independent of the arbitrary labelling of source populations, we avoid the label-switching problem of previous Bayesian methods.

17.
We present a novel approach for feature correspondence and multiple structure discovery in computer vision. In contrast to existing methods, we exploit the fact that point-sets on the same structure usually lie close to each other, thus forming clusters in the image. Given a pair of input images, we initially extract points of interest and extract hierarchical representations by agglomerative clustering. We use the maximum weighted clique problem to find the set of corresponding clusters with maximum number of inliers representing the multiple structures at the correct scales. Our method is parameter-free and only needs two sets of points along with their tentative correspondences, thus being extremely easy to use. We demonstrate the effectiveness of our method in multiple-structure fitting experiments in both publicly available and in-house datasets. As shown in the experiments, our approach finds a higher number of structures containing fewer outliers compared to state-of-the-art methods.

18.
Suppose that many polymorphic sites have been identified and genotyped in a region showing strong linkage with a trait. A key question of interest is which site (or combination of sites) in the region influences susceptibility to the trait. We have developed a novel statistical approach to this problem, in the context of qualitative-trait mapping, in which we use linkage data to identify the polymorphic sites whose genotypes could fully explain the observed linkage to the region. The information provided by this analysis is different from that provided by tests of either linkage or association. Our approach is based on the observation that if a particular site is the only site in the region that influences the trait, then, conditional on the genotypes at that site for the affected relatives, there should be no unexplained oversharing in the region among affected individuals. We focus on the affected sib-pair study design and develop test statistics that are variations on the usual allele-sharing methods used in linkage studies. We perform hypothesis tests and derive a confidence set for the true causal polymorphic site, under the assumption that there is only one site in the region influencing the trait. Our method is appropriate under a very general model for how the site influences the trait, including epistasis with unlinked loci, correlated environmental effects within families, and gene-environment interaction. We extend our method to larger sibships and apply it to an NIDDM1 data set.

19.
Recently developed algorithms permit nonparametric linkage analysis of large, complex pedigrees with multiple inbreeding loops. We have used one such algorithm, implemented in the package SimWalk2, to reanalyze previously published genome-screen data from a Costa Rican kindred segregating for severe bipolar disorder. Our results are consistent with previous linkage findings on chromosome 18 and suggest a new locus on chromosome 5 that was not identified using traditional linkage analysis.

20.
Recent advances in massively parallel sequencing technology have created new opportunities to probe the hidden world of microbes. Taxonomy-independent clustering of the 16S rRNA gene is usually the first step in analyzing microbial communities. Dozens of algorithms have been developed in the last decade, but a comprehensive benchmark study is lacking. Here, we survey algorithms currently used by microbiologists, and compare seven representative methods in a large-scale benchmark study that addresses several issues of concern. A new experimental protocol was developed that allows different algorithms to be compared using the same platform, and several criteria were introduced to facilitate a quantitative evaluation of the clustering performance of each algorithm. We found that existing methods vary widely in their outputs, and that inappropriate use of distance levels for taxonomic assignments likely resulted in substantial overestimates of biodiversity in many studies. The benchmark study identified our recently developed ESPRIT-Tree, a fast implementation of the average linkage-based hierarchical clustering algorithm, as one of the best algorithms available in terms of computational efficiency and clustering accuracy.
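Average-linkage OTU clustering of the kind ESPRIT-Tree accelerates can be stated naively in a few lines; the naive form is quadratic-plus per merge, which is exactly why fast implementations matter at sequencing scale. The pairwise-distance matrix and the 0.03 species-level threshold below are conventional illustrations, not ESPRIT-Tree itself:

```python
def average_linkage_otus(dist, threshold):
    """Naive agglomerative average-linkage clustering: repeatedly merge the
    two clusters with the smallest average inter-cluster distance until that
    distance exceeds the threshold (e.g. 0.03 for species-level OTUs)."""
    clusters = [[i] for i in range(len(dist))]

    def avg(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, x, y = min((avg(a, b), x, y)
                      for x, a in enumerate(clusters)
                      for y, b in enumerate(clusters) if x < y)
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Each resulting cluster is one operational taxonomic unit; in 16S studies, the chosen distance threshold is what drives the biodiversity estimates the benchmark found to be inflated.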


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号