Similar literature
1.

Background

The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health datasets.

Methods

EMS is a configurable Microsoft SQL Server database program. To examine the accuracy of EMS, two public health databases were matched using National Health Service (NHS) numbers as a gold standard unique identifier. Probabilistic linkage was then performed on the same two datasets without inclusion of NHS number. Sensitivity analyses were carried out to examine the effect of varying matching process parameters.

Results

Exact matching using NHS number between two datasets (containing 5931 and 1759 records) identified 1071 matched pairs. EMS probabilistic linkage identified 1068 record pairs. The sensitivity of probabilistic linkage was calculated as 99.5% (95%CI: 98.9, 99.8), specificity 100.0% (95%CI: 99.9, 100.0), positive predictive value 99.8% (95%CI: 99.3, 100.0), and negative predictive value 99.9% (95%CI: 99.8, 100.0). Probabilistic matching was most accurate when including address variables and using the automatically generated threshold for determining links with manual review.

Conclusion

With the establishment of national electronic datasets across health and social care, EMS enables previously unanswerable research questions to be tackled with confidence in the accuracy of the linkage process. In scenarios where a small sample is being matched into a very large database (such as national records of hospital attendance), the positive predictive value or sensitivity may drop, compared to the results presented in this analysis, according to the prevalence of matches between databases. Despite this possible limitation, probabilistic linkage has great potential to be used where exact matching using a common identifier is not possible, including in low-income settings, and for vulnerable groups such as homeless populations, where the absence of unique identifiers and lower data quality have historically hindered the ability to identify individuals across datasets.
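The sensitivity and positive predictive value reported in the Results can be reproduced from pair counts. The sketch below is only an illustration: the function name `linkage_accuracy` is hypothetical, and the count of 1066 true links is an assumption inferred from the reported percentages, not a figure given in the abstract. Specificity and negative predictive value need the number of true non-matches, which the abstract does not report, so they are computed only if that count is supplied.

```python
def linkage_accuracy(tp, fp, fn, tn=None):
    """Accuracy of proposed links against a gold standard (hypothetical helper)."""
    metrics = {
        "sensitivity": tp / (tp + fn),  # true links found / all true matches
        "ppv": tp / (tp + fp),          # true links found / all proposed links
    }
    if tn is not None:
        # These two need the count of correctly rejected non-matches.
        metrics["specificity"] = tn / (tn + fp)
        metrics["npv"] = tn / (tn + fn)
    return metrics


GOLD_PAIRS = 1071  # reported: pairs matched exactly on NHS number
EMS_PAIRS = 1068   # reported: pairs identified by probabilistic linkage
TRUE_LINKS = 1066  # ASSUMPTION: inferred so that the reported percentages are reproduced

result = linkage_accuracy(
    tp=TRUE_LINKS,
    fp=EMS_PAIRS - TRUE_LINKS,
    fn=GOLD_PAIRS - TRUE_LINKS,
)
print({name: round(100 * value, 1) for name, value in result.items()})
```

Under that assumption, 5 true matches are missed and 2 proposed links are false, which agrees with the reported 99.5% sensitivity and 99.8% positive predictive value.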

2.
We investigated the feasibility of using driver’s license records to obtain height and weight data of individuals. First, we linked Washington State driver’s license records (DOL) to the state birth files to assess how well driver’s licenses can be linked to a public health database. We were able to match 78.4% of mothers and 71.7% of fathers on birth records to driver’s license records. Then we assessed the accuracy of DOL height and weight data by comparing them to heights and weights measured on control women enrolled in a cancer etiology study (CES). There is a close relation between CES and DOL heights, but not a close relation between weights. Our results suggest that driver’s license files are a good source of information for women’s heights, but are not as good for women’s weights.

3.
The power to detect linkage for likelihood and nonparametric (Haseman-Elston, affected-sib-pair, and affected-pedigree-member) methods is compared for the case of a common, dichotomous trait resulting from the segregation of two loci. Pedigree data for several two-locus epistatic and heterogeneity models have been simulated, with one of the loci linked to a marker locus. Replicate samples of 20 three-generation pedigrees (16 individuals/pedigree) were simulated and then ascertained for having at least 6 affected individuals. The power of linkage detection calculated under the correct two-locus model is only slightly higher than that under a single locus model with reduced penetrance. As expected, the nonparametric linkage methods have somewhat lower power than does the lod-score method, the difference depending on the mode of transmission of the linked locus. Thus, for many pedigree linkage studies, the lod-score method will have the best power. However, this conclusion depends on how many times the lod score will be calculated for a given marker. The Haseman-Elston method would likely be preferable to calculating lod scores under a large number of genetic models (i.e., varying both the mode of transmission and the penetrances), since such an analysis requires an increase in the critical value of the lod criterion. The power of the affected-pedigree-member method is lower than the other methods, which can be shown to be largely due to the fact that marker genotypes for unaffected individuals are not used.

4.
BACKGROUND: Although teratogen information services (TISs) obtain maternal exposure information from their callers, such services often do not know if the pregnancies were affected by a birth defect. This study attempted to improve the completeness of this information for Texas Teratogen Information Service (TTIS) callers by linking their records with the Texas Birth Defects Registry (TBDR) and Texas birth certificates (TBCs). METHODS: A total of 344 expectant mothers called TTIS with expected dates of delivery between 1 January 2000 and 31 December 2001. These pregnancies were linked with TBDR and TBC data. The percentages of pregnancies with known birth defect information both before and after the linkage were compared. RESULTS: The TTIS originally collected birth defect status information for 101 of the 344 callers (29.4%) and 0.6% of all 344 callers or 2.0% of callers with birth defect status information had a pregnancy affected by a birth defect. Linking TTIS records with TBDR and TBC data helped to raise the percentage of callers with birth defect status information from 29.4% to 71.5%. Among those callers, the percentage known to have birth defects increased from 2.0% to 4.1%. The sensitivity of TTIS follow-up calls in identifying birth defects was 50%, and the specificity was 100%. CONCLUSIONS: Linking TTIS caller records with TBDR and TBC data significantly increased both the percentage of pregnancies with birth defect status information and the percentage of pregnancies identified as affected by birth defects. Such linkage may be a good approach by which TISs can increase the completeness of their birth defect status information.

5.
A sample of pairs of twins who were born in the United States in 1919 and survived to adulthood is identified through an innovative and large-scale application of the methodology of probabilistic linkage. The social security program began in the United States in November 1936, and the file of applicants for a social security number - which was used in this study - is the closest thing in the United States to a population register. The study results are very satisfactory, and demonstrate the superiority of probabilistic linkage to exact linkage. We estimate that about 33,000 twin pairs were born in the United States in 1919 and about 19,000 survived to age 17. Since the social security number was not then, at the inception of the program, the universal identifier that it is today, the number of enumerated twin pairs is somewhat less. Nonetheless, over 16,000 twin pairs can be identified by the method of probabilistic linkage. By comparison, only about half as many can be identified by straightforward exact linkage.

6.
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. The peptide fragmentation spectra generated by these workflows exhibit characteristic fragmentation patterns that can be used to identify the peptide. In other fields, where the compounds of interest do not have the convenient linear structure of peptides, fragmentation spectra are identified by comparing new spectra with libraries of identified spectra, an approach called spectral matching. In contrast to sequence-based tandem mass spectrometry search engines used for peptides, spectral matching can make use of the intensities of fragment peaks in library spectra to assess the quality of a match. We evaluate a hidden Markov model approach (HMMatch) to spectral matching, in which many examples of a peptide's fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. We demonstrate that HMMatch has good specificity and superior sensitivity, compared to sequence database search engines such as X!Tandem. HMMatch achieves good results from relatively few training spectra, is fast to train, and can evaluate many spectra per second. A statistical significance model permits HMMatch scores to be compared with each other, and with other peptide identification tools, on a unified scale. HMMatch shows a similar degree of concordance with X!Tandem, Mascot, and NIST's MS Search, as they do with each other, suggesting that each tool can assign peptides to spectra that the others miss. Finally, we show that it is possible to extrapolate HMMatch models beyond a single peptide's training spectra to the spectra of related peptides, expanding the application of spectral matching techniques beyond the set of peptides previously observed.

7.
This paper describes the creation of a unique maternal identifier for use in the investigation of perinatal, postneonatal and child outcomes in relation to maternal characteristics. All Midwives' records of Western Australian (WA) births were routinely linked to registrations of births and deaths for infants born from 1980 to 1992 inclusive, then linked to WA hospital discharge data and to registries of birth defects and cerebral palsy to create a longitudinal health record for each infant. However, since each birth to a woman was recorded as a separate event, there was no way to identify siblings. Probabilistic record linkage, based on information about the mother, was used for this task. Logical inconsistencies within the data were used to test the validity of the linkages between birth records attributed to each mother. Information about the mother from other epidemiological studies and data abstracted from hospital case notes was also used to validate sibships. Linkage of the records of 310,255 births in WA during that period resulted in the formation of 181,133 sibships of one or more children. Pooling the results of all of the validation methods gave an error of 0.9%. Linkage identified 3678 sibships containing multiple births, and 305 sets of maternal twins. Ascertainment of twins and their siblings for an ongoing twin register, the WA Twin Child Health (WATCH) study, was a natural consequence of this process.

8.
Some methods of statistical analysis of data on DNA fingerprinting suffer serious weaknesses. Unlinked Mendelizing loci that are at linkage equilibrium in subpopulations may be statistically associated, not statistically independent, in the population as a whole if there is heterogeneity in gene frequencies between subpopulations. In the populations where DNA fingerprinting is used for forensic applications, the assumption that DNA fragments occur statistically independently for different probes, different loci, or different fragment size classes lacks supporting data so far; there is some contrary evidence. Statistical association of alleles may cause estimates based on the assumption of statistical independence to understate the true matching probabilities by many orders of magnitude. The assumptions that DNA fragments occur independently and with constant frequency within a size class appear to be contradicted by the available data on the mean and variance of the number of fragments per person. The mistaken use of the geometric mean instead of the arithmetic mean to compute the probability that every DNA fragment of a randomly chosen person is present among the DNA fragments of a specimen may substantially understate the probability of a match between blots, even if other assumptions involved in the calculations are taken as correct. The conclusion is that some astronomically small probabilities of matching by chance, which have been claimed in forensic applications of DNA fingerprinting, presently lack substantial empirical and theoretical support.
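The claim that loci independent within each subpopulation can be associated in the pooled population can be checked with a few lines of arithmetic. The sketch below is a simplified haplotype-level illustration, not the paper's analysis: the function name `pooled_association` and the allele frequencies (0.9 and 0.1 in two equally sized subpopulations) are invented for the example.

```python
def pooled_association(subpops):
    """Pool subpopulations that are each at linkage equilibrium.

    subpops: list of (weight, p_a, p_b) tuples, where p_a and p_b are the
    frequencies of allele A (locus 1) and allele B (locus 2) in that
    subpopulation. Returns (D, true joint frequency, independence estimate).
    """
    total = sum(w for w, _, _ in subpops)
    p_a = sum(w * pa for w, pa, _ in subpops) / total
    p_b = sum(w * pb for w, _, pb in subpops) / total
    # Within each subpopulation the loci are independent, so P(AB) = p_a * p_b there.
    p_ab = sum(w * pa * pb for w, pa, pb in subpops) / total
    return p_ab - p_a * p_b, p_ab, p_a * p_b


# Hypothetical frequencies chosen only to make the effect visible.
d, true_joint, independence_estimate = pooled_association(
    [(0.5, 0.9, 0.9), (0.5, 0.1, 0.1)]
)
print(d, true_joint, independence_estimate)
```

With these invented numbers the pooled joint frequency is 0.41, while multiplying the pooled allele frequencies gives 0.25, so the independence assumption understates the joint frequency even though neither subpopulation shows any association.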

9.
BACKGROUND: The Texas Birth Defects Registry (TBDR) does not access prenatal diagnostic facilities to ascertain cases. Objectives of the study were to determine how many cases may be missing from the registry as a result, and to assess the feasibility and utility of prenatal surveillance for birth defects, through a pilot test in one region of Texas. METHODS: A trained abstractor reviewed medical records of all patients with abnormal ultrasound findings during 2004 in all prenatal diagnostic facilities in Texas Health Region 11 (n = 6 facilities). When birth defects were prenatally detected, demographic and diagnostic data were abstracted. Prenatal abstractions were matched to cases in the TBDR. Those that did not match to registry cases were matched to vital records to determine where and when the pregnancy ended; delivery hospital medical records were reviewed for these cases. RESULTS: Approximately 760 patient charts were reviewed at prenatal diagnostic facilities and 365 were abstracted. Of these, 165 (45%) matched to cases in the TBDR. Delivery medical records were located and reviewed for 177 prenatal abstractions, with 170 (47%) indicating at delivery no defects monitored by the registry. Delivery records for one (0.3%) prenatal abstraction were not found by the hospital. Date and place of delivery were unknown for 22 (6%) prenatal abstractions. Only eight additional infants and fetuses (one twin pair) eligible for the registry were identified. CONCLUSIONS: For Texas Health Service Region 11, it is not necessary to conduct surveillance in prenatal diagnostic facilities, and to do so would be very labor-intensive.

10.
The long-term effects of toxic substances in man that have been discovered so far have involved gross relative risks or bizarre effects, or have been stumbled upon by chance or because of special circumstances. These facts and some recent epidemiological evidence together suggest that a systematic approach with better methods would reveal the effects of many more toxic substances, particularly in manufacturing industry. Record linkage is a powerful tool because it makes possible the correlation of indicators of exposure with indicators of the biological effect of such exposure in the same persons or in their progeny even after considerable periods of time have elapsed. A system of linked records exists in England and Wales which is at present used by research workers to follow up samples of persons defined in various ways, e.g. in respect of exposure to a suspected toxic factor. In this way hypotheses about substances causing cancer or other lethal effects can be tested. It is suggested that there are two additional ways in which record linkage techniques could be used to identify substances with long-term toxic effects: the first would be by setting up a register of women employed in industry during pregnancy and linking this register to records of the occurrence of congenital malformations and of stillbirth or death in their children; the second would be to follow samples of workers in manufacturing industry, notably those engaged in the manufacture of products from raw materials including the chemical industry, to death and to the development of cancer. Regular analyses of material from these two systems of linked records would provide the basis for a monitoring system for certain gross effects of long-term toxic substances in man. There are two principal obstacles to further progress in this field. The first is the lack of a clear statement of public policy concerning the issues of confidentiality and informed consent in the use of identifiable personal records for medical research. A settlement is needed which defines the proper limits of their use in the interests of health with safeguards to privacy. The second obstacle is a lack of resources to improve the quality, accessibility and organization of the appropriate data.

11.

Background

Record linkage integrates records across multiple related data sources, identifying duplicates and accounting for possible errors. Real-life applications require efficient algorithms to merge these voluminous data sources and find all records belonging to the same individuals. Our recently devised highly efficient record linkage algorithms provide best-known solutions to this challenging problem.

Method

We have developed RLT-S, a freely available web tool, which implements our single linkage clustering algorithm for record linkage. This tool requires input data sets and a small set of configuration settings about these files to work efficiently. RLT-S employs exact match clustering, blocking on a specified attribute and single linkage based hierarchical clustering among these blocks.

Results

RLT-S is an implementation package of our sequential record linkage algorithm. It outperforms previous best-known implementations by a large margin. The tool is at least two times faster for any dataset than the previous best-known tools.

Conclusions

The RLT-S tool implements our record linkage algorithm, which outperforms previous best-known algorithms in this area. The website also contains necessary information such as instructions, submission history, feedback, publications and some other sections to facilitate the usage of the tool.

Availability

RLT-S is integrated into http://www.rlatools.com, which is currently serving this tool only. The tool is freely available and can be used without login. All data files used in this paper have been stored in https://github.com/abdullah009/DataRLATools. For copies of the relevant programs please see https://github.com/abdullah009/RLATools.
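The three stages named in the abstract (exact match clustering, blocking on an attribute, and single linkage clustering within blocks) can be sketched as follows. This is a generic illustration under stated assumptions, not the RLT-S implementation: the record layout `(name, birth_year)`, the use of edit distance on the name, the threshold, and the function names `edit_distance` and `link_records` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def link_records(records, block_key, threshold):
    """Cluster records; returns sorted lists of record indices (hypothetical API)."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Stage 1: exact match clustering, so identical records are compared only once.
    first_seen = {}
    for i, rec in enumerate(records):
        if rec in first_seen:
            union(i, first_seen[rec])
        else:
            first_seen[rec] = i

    # Stage 2: blocking, so only records sharing the blocking attribute are compared.
    blocks = defaultdict(list)
    for i in first_seen.values():
        blocks[block_key(records[i])].append(i)

    # Stage 3: single linkage. Any pair within the threshold merges its two
    # clusters, which is equivalent to cutting the single-linkage dendrogram there.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if edit_distance(records[i][0], records[j][0]) <= threshold:
                union(i, j)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(sorted(c) for c in clusters.values())


# Invented toy records: (name, birth_year).
records = [
    ("john smith", "1980"),
    ("jon smith", "1980"),
    ("john smith", "1980"),
    ("mary jones", "1975"),
    ("john smith", "1975"),
]
clusters = link_records(records, block_key=lambda r: r[1], threshold=1)
print(clusters)
```

Blocking keeps the pairwise comparison cost down because the quadratic step runs only inside each block; the price is that records whose blocking attribute is wrong (the last record above) are never compared with their true matches.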

12.
Chen L, Storey JD. Genetics, 2006, 173(4): 2371-2381
Linkage analysis involves performing significance tests at many loci located throughout the genome. Traditional criteria for declaring a linkage statistically significant have been formulated with the goal of controlling the rate at which any single false positive occurs, called the genomewise error rate (GWER). As complex traits have become the focus of linkage analysis, it is increasingly common to expect that a number of loci are truly linked to the trait. This is especially true in mapping quantitative trait loci (QTL), where sometimes dozens of QTL may exist. Therefore, alternatives to the strict goal of preventing any single false positive have recently been explored, such as the false discovery rate (FDR) criterion. Here, we characterize some of the challenges that arise when defining relaxed significance criteria that allow for at least one false positive linkage to occur. In particular, we show that the FDR suffers from several problems when applied to linkage analysis of a single trait. We therefore conclude that the general applicability of FDR for declaring significant linkages in the analysis of a single trait is dubious. Instead, we propose a significance criterion that is more relaxed than the traditional GWER, but does not appear to suffer from the problems of the FDR. A generalized version of the GWER is proposed, called GWERk, that allows one to provide a more liberal balance between true positives and false positives at no additional cost in computation or assumptions.

13.
MOTIVATION: Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) time to match a sequence of N characters against a profile of length P < N. RESULTS: In this work, we exploit string compression techniques to speed up brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.
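The O(NP) brute-force baseline mentioned above can be sketched as a sliding window over the sequence, scoring each window against the profile column by column. This is a minimal illustration of that baseline only, not the compression-based algorithms of the paper; the function name `best_profile_match`, the dictionary representation of the scoring matrix, and the scores themselves are assumptions made for the example.

```python
def best_profile_match(sequence, profile):
    """Brute-force profile matching.

    profile: list of dicts, one per profile position, mapping a character to
    its score at that position (a scoring matrix stored column by column).
    Returns (start index of the best window, its score).
    """
    n, p = len(sequence), len(profile)
    best_pos, best_score = -1, float("-inf")
    for start in range(n - p + 1):          # N - P + 1 windows
        score = sum(                        # P lookups per window -> O(NP) overall
            profile[k].get(sequence[start + k], 0.0) for k in range(p)
        )
        if score > best_score:
            best_pos, best_score = start, score
    return best_pos, best_score


# Invented scoring matrix that favours the motif "ACG".
profile = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
position, score = best_profile_match("TTACGTA", profile)
print(position, score)
```

Every window repeats P lookups regardless of how repetitive the sequence is, which is the redundancy that run-length or LZ78 encodings of the sequence are meant to exploit.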

14.

Background

In the absence of clinical trial data, large post-marketing observational studies are essential to evaluate the safety and effectiveness of medications during pregnancy. We identified a cohort of pregnancies ending in live birth within the 2000–2007 Medicaid Analytic eXtract (MAX). Herein, we provide a blueprint to guide investigators who wish to create similar cohorts from healthcare utilization data and we describe the limitations in detail.

Methods

Among females ages 12–55, we identified pregnancies using delivery-related codes from healthcare utilization claims. We linked women with pregnancies to their offspring by state, Medicaid Case Number (family identifier) and delivery/birth dates. Then we removed inaccurate linkages and duplicate records and implemented cohort eligibility criteria (i.e., continuous and appropriate enrollment type, no private insurance, no restricted benefits) for claim information completeness.

Results

From 13,460,273 deliveries and 22,408,810 child observations, 6,107,572 pregnancies ending in live birth were available after linkage, cleaning, and removal of duplicate records. The percentage of linked deliveries varied greatly by state, from 0 to 96%. The cohort size was reduced to 1,248,875 pregnancies after requiring maternal eligibility criteria throughout pregnancy and to 1,173,280 pregnancies after further applying infant eligibility criteria. Ninety-one percent of women were dispensed at least one medication during pregnancy.

Conclusions

Mother-infant linkage is feasible and yields a large pregnancy cohort, although the size decreases with increasing eligibility requirements. MAX is a useful resource for studying medications in pregnancy and a spectrum of maternal and infant outcomes within the indigent population of women and their infants enrolled in Medicaid. It may also be used to study maternal characteristics, the impact of Medicaid policy, and healthcare utilization during pregnancy. However, careful attention to the limitations of these data is necessary to reduce biases.

15.
We report construction of a genetic linkage map of the guppy genome using 790 single nucleotide polymorphism markers, integrated from six mapping crosses. The markers define 23 linkage groups (LGs), corresponding to the known haploid number of guppy chromosomes. The map, which spans a genetic length of 899 cM, includes 276 markers linked to expressed genes (expressed sequence tag), which have been used to derive broad syntenic relationships of guppy LGs with medaka chromosomes. This combined linkage map should facilitate the advancement of genetic studies for a wide variety of complex adaptive phenotypes relevant to natural and sexual selection in this species. We have used the linkage data to predict quantitative trait loci for a set of variable male traits including size and colour pattern. Contributing loci map to the sex LG for many of these traits.

16.
Microsatellites are highly polymorphic repetitive DNA segments dispersed throughout the genome and have been widely used for genetic linkage analysis and allele loss. Instability of microsatellite sequences has been linked to deficiencies in DNA mismatch repair, and is observed in a number of different tumor types. Analysis of microsatellite instability is thought to be a useful clinical tool for cancer diagnosis. Fluorescent detection of microsatellite instability using an automated DNA sequencer holds several distinct advantages over traditional radioactive analysis and electrophoresis, allowing simultaneous analysis of a number of different markers for a large number of samples, high resolution, sensitivity, and clear interpretation of data. In this article we present an established protocol, which has been used successfully to detect microsatellite instability in DNA samples from human tumors and circulating tumor DNA in serum/plasma.

17.
N-(4-hydroxyphenyl)retinamide (fenretinide, 4-HPR) has been shown to be active toward many tumors without appreciable side effects. However, its in vitro activity is not matched by a corresponding efficacy in vivo. The main reason is that the drug's hydrophobicity hinders its bioavailability in the body fluids. Even if the drug is previously dissolved in organic solvents, such as ethanol or DMSO, the subsequent dilution in body fluids triggers its precipitation in fine aggregates characterized by very low dissolution efficiency, never reaching amounts suitable for therapeutic response. To date no intravenous formulation of 4-HPR exists on the market. The 4-HPR linkage to a hydrophilic polymer by a covalent bond easily hydrolyzable in aqueous environment is expected to increase the drug's aqueous solubility, providing the free drug after hydrolysis of the covalent bond. This may be a useful tool for the preparation of aqueous intravenous formulations of 4-HPR. For this purpose, we linked 4-HPR to polyvinylalcohol (PVA) by a carbonate bond at different drug/hydroxy vinyl monomer molar ratios. We demonstrated that conjugation increased 4-HPR aqueous solubility and strongly inhibited neuroblastoma cell proliferation. In addition, in an in vivo neuroblastoma metastatic model, we obtained a significant antitumor effect as a consequence of the improved drug bioavailability.

18.
Summary: Methodologies commonly used to detect linkage of marker loci to loci affecting quantitative traits are discussed. It is shown that variances for the quantitative trait differ among marker genotypes when using F2 or pooled backcross data if linkage exists. Hence, analyzing this type of data by single factor ANOVA or other statistical techniques that assume a common variance is inadequate. Restriction fragment length polymorphism (RFLP) markers are a powerful tool in plant breeding but cost is an important drawback; hence, a methodology is suggested to obtain the minimum number of plants in F2 populations to detect such linkage.

19.
Assigning Linkage Haplotypes from Parent and Progeny Genotypes
A. Nejati-Javaremi, C. Smith. Genetics, 1996, 142(4): 1363-1367
Given the genotypes of parents and progeny, their haplotypes over several or many linked loci can be easily assigned by listing the allele type at each locus along the haplotype known to be from each parent. Only a small number (5-10) of progeny per family is usually needed to assign the parental and progeny haplotypes. Any gaps left in the haplotypes may be filled in from the assigned haplotypes of relatives. The process is facilitated by having multiple alleles at the loci and by using more linked loci in the haplotype and with more progeny from the mating. Crossover haplotypes in the progeny can be identified by their being unique or uncommon, and the crossover point can often be detected if the locus linkage map order is known. The haplotyping method applies to outbreeding populations in plants, animals and man, as well as to traditional experimental crosses of inbred lines. The method also applies to half-sib families, whether the genotypes of the mates are known or unknown. The haplotyping procedure is already used in linkage analysis but does not seem to have been published. It should be useful in teaching and in genetic applications of haplotypes.

20.
Family studies suggest that genetic variation may influence birth weight. We have assessed linkage of birth weight in a genome-wide scan in 269 Pima Indian siblings (334 sibling pairs, 92 families). As imprinting (expression of only a single copy of a gene depending on parent-of-origin) is commonly found in genes that affect fetal growth, we used a recently described modification of standard multipoint variance-component methods of linkage analysis of quantitative traits. This technique allows for comparison of linkage models that incorporate imprinting effects (in which the strength of linkage is expressed as LOD(IMP)) and models where parent-of-origin effects are not included (LOD(EQ)). Where significant evidence of linkage was present, separate contributions of alleles derived from father (LOD(FA)) or mother (LOD(MO)) to the imprinting model were estimated. Significant evidence of linkage was found on chromosome 11 (at map position 88 cM, LOD(IMP)=3.4) with evidence for imprinting (imprinting model superior, P<0.001). In this region, birth weight was linked predominantly to paternally derived alleles (LOD(FA)=4.1, LOD(MO)=0.0). An imprinted gene on chromosome 11 may influence birth weight in the Pima population. This chromosome contains one of the two major known clusters of imprinted genes in the human genome, lending biological plausibility to our findings.
