Similar articles
 20 similar articles found.
1.
ABSTRACT: BACKGROUND: Mass spectrometry (MS) data are often generated from various biological or chemical experiments, and there may exist outlying observations, which are extreme due to technical reasons. The determination of outlying observations is important in the analysis of replicated MS data because elaborate pre-processing is essential for successful analysis with reliable results, and manual outlier detection as one of the pre-processing steps is time-consuming. The heterogeneity of variability and low replication are often obstacles to successful analysis, including outlier detection. Existing approaches, which assume constant variability, can generate many false positives (outliers) and/or false negatives (non-outliers). Thus, a more powerful and accurate approach is needed to account for the heterogeneity of variability and low replication. FINDINGS: We proposed an outlier detection algorithm using projection and quantile regression in MS data from multiple experiments. The performance of the algorithm and program was demonstrated by using both simulated and real-life data. The projection approach with linear, nonlinear, or nonparametric quantile regression was appropriate in heterogeneous high-throughput data with low replication. CONCLUSION: Various quantile regression approaches combined with projection were proposed for detecting outliers. The choice among linear, nonlinear, and nonparametric regressions is dependent on the degree of heterogeneity of the data. The proposed approach was illustrated with MS data with two or more replicates.

2.
3.
Many biological processes are periodic, for example cell cycle expression, circadian rhythms and calcium oscillations. However, measured time series from these processes are commonly short and noisy, and finding frequencies in such data can be challenging. Here we present BaSAR, Bayesian Spectrum Analysis in R, a package for extracting frequency information from time series data. The software uses advanced techniques of Bayesian inference that are well suited for handling typical biological data. The core functions are designed for detecting a single key frequency, without the need for data pre-processing such as detrending. The package is freely available at CRAN - The Comprehensive R Archive Network: http://cran.r-project.org/web/packages/BaSAR.

4.

Aim

Species distribution data play a pivotal role in the study of ecology, evolution, biogeography and biodiversity conservation. Although large amounts of location data are available and accessible from public databases, data quality remains problematic. Of the potential sources of error, positional errors are critical for spatial applications, particularly where these errors place observations beyond the environmental or geographical range of species. These outliers need to be identified, checked and removed to improve data quality and minimize the impact on subsequent analyses. Manually checking all species records within large multispecies datasets is prohibitively costly. This work investigates algorithms that may assist in the efficient vetting of outliers in such large datasets.

Location

We used real, spatially explicit environmental data derived from the western part of Victoria, Australia, and simulated species distributions within this same region.

Methods

By adapting species distribution modelling (SDM), we developed a pseudo‐SDM approach for detecting outliers in species distribution data, which was implemented with random forest (RF) and support vector machine (SVM) resulting in two new methods: RF_pdSDM and SVM_pdSDM. Using virtual species, we compared eight existing multivariate outlier detection methods with these two new methods under various conditions.

Results

The two new methods based on the pseudo‐SDM approach had higher true skill statistic (TSS) values than other approaches, with TSS values always exceeding 0. More than 70% of the true outliers in datasets for species with a low and intermediate prevalence can be identified by checking 10% of the data points with the highest outlier scores.

Main conclusions

Pseudo‐SDM‐based methods were more effective than other outlier detection methods. However, this outlier detection procedure can only be considered as a screening tool, and putative outliers must be examined by experts to determine whether they are actual errors or important records within an inherently biased set of data.
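
The true skill statistic reported in the Results above is conventionally defined as sensitivity plus specificity minus one. The following is a minimal sketch of that definition, not the authors' evaluation code; the function name and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1 (ranges from -1 to 1).

    tp/fn: true outliers that were flagged / missed.
    tn/fp: valid records that were kept / wrongly flagged.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical counts: 50 true outliers among 1,000 records.
tss_example = true_skill_statistic(tp=35, fn=15, tn=900, fp=50)

# A detector that flags records at random has an expected TSS of 0,
# which is why "TSS always exceeding 0" means better than chance.
tss_random = true_skill_statistic(tp=5, fn=5, tn=45, fp=45)
```

A score of 0 corresponds to chance-level detection, so positive values indicate that a method separates outliers from valid records better than random flagging.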

5.
Local adaptation is considered a paradigm in studies of salmonid fish populations. Yet, little is known about the geographical scale of local adaptation. Is adaptive divergence primarily evident at the scale of regions or individual populations? Also, many salmonid populations are subject to spawning intrusion by farmed conspecifics that experience selection regimes fundamentally different from wild populations. This prompts the question of whether adaptive differences between wild populations and hatchery strains are more pronounced than those between different wild populations. We addressed these issues by analyzing variation at 74 microsatellite loci (including anonymous and expressed sequence tag- and quantitative trait locus-linked markers) in 15 anadromous wild brown trout (Salmo trutta L.) populations, representing five geographical regions, along with two lake populations and two hatchery strains used for stocking some of the populations. FST-based outlier tests revealed more outlier loci between different geographical regions separated by 522±228 km (mean±s.d.) than between populations within regions separated by 117±79 km (mean±s.d.). A significant association between geographical distance and number of outliers between regions was evident. There was no evidence for more outliers in comparisons involving hatchery trout, but the loci under putative selection generally were not the same as those found to be outliers between wild populations. Our study supports the notion of local adaptation being increasingly important at the scale of regions as compared with individual populations, and suggests that loci involved in adaptation to captive environments are not necessarily the same as those involved in adaptive divergence among wild populations.

6.
A PCR procedure has been developed for routine analysis of viable Salmonella spp. in feed samples. The objective was to develop a simple PCR-compatible enrichment procedure to enable DNA amplification without any sample pretreatment such as DNA extraction or cell lysis. PCR inhibition by 14 different feed samples and natural background flora was circumvented by the use of the DNA polymerase Tth. This DNA polymerase was found to exhibit a high level of resistance to PCR inhibitors present in these feed samples compared to DyNAzyme II, FastStart Taq, Platinum Taq, Pwo, rTth, Taq, and Tfl. The specificity of the Tth assay was confirmed by testing 101 Salmonella and 43 non-Salmonella strains isolated from feed and food samples. A sample preparation method based on culture enrichment in buffered peptone water and DNA amplification with Tth DNA polymerase was developed. The probability of detecting small numbers of salmonellae in feed, in the presence of natural background flora, was accurately determined and found to follow a logistic regression model. From this model, the probability of detecting 1 CFU per 25 g of feed in artificially contaminated soy samples was calculated and found to be 0.81. The PCR protocol was evaluated on 155 naturally contaminated feed samples and compared to an established culture-based method, NMKL-71. Eight percent of the samples were positive by PCR, compared with 3% with the conventional method. The reasons for the differences in sensitivity are discussed. Use of this method in the routine analysis of animal feed samples would improve safety in the food chain.

7.
Yang M, Wyckoff GJ 《Genetica》2011,139(5):639-648
The neutral theory of molecular evolution (Kimura 1985) is the basis for most current statistical tests for detecting selection, mainly using polymorphism data within species, divergence data between species, and/or genomic structures like linkage disequilibrium (Wang et al. 2006). In most cases informative tests can only be constructed with ample variations within these parameters and many common tests are difficult to formulate when identity-by-descent is not clear, for example in gene families or repetitive elements. With the current progress being made toward whole-genome sequencing and re-sequencing efforts, as well as protein sequencing via tandem mass spectrometry where genomic sequencing is lacking, we felt it was necessary to re-visit possible methods for rapid screening and detection of evolutionary outliers. These outliers might be of interest for other research, such as candidate gene association studies or genome annotations, drug- and disease-target searches, and functional studies. We focused on methods that would work on both protein and nucleotide data, could be used on large gene or protein domain families, and could be generated quickly in order for “first pass” annotation of large scale data. For these reasons, we chose properties of trees generated routinely in molecular phylogenetic studies: genetic distance, tree shape and balance, and internal node statistics (Heard 1992). Our current research looking at protein domain family data and phylogenetic trees from PFAM (Finn et al. 2008) suggests this approach towards detecting evolutionary outliers is feasible, but additional work will be necessary to determine the parameters that suggest either positive or negative selection is occurring in specific gene families. 
This is particularly true when other factors such as rapid duplication and deletion of genes containing these domains are taking place, and we suggest phylogenetic statistics may be useful in combination with existing methodologies for detailed studies of gene family data.

8.
Ordinary least squares (OLS) regression has been widely used to analyze patient-level data in cost-effectiveness analysis (CEA). However, the estimates, inference and decision making in the economic evaluation based on OLS estimation may be biased by the presence of outliers. Instead, robust estimation can remain unaffected and provide results that are resistant to outliers. The objective of this study is to explore the impact of outliers on net-benefit regression (NBR) in CEA using OLS and to propose a potential solution by using robust estimations, i.e. Huber M-estimation, Hampel M-estimation, Tukey's bisquare M-estimation, MM-estimation and least trimmed squares estimation. Simulations under different outlier-generating scenarios and an empirical example were used to obtain the regression estimates of NBR by OLS and five robust estimations. Empirical size and empirical power of both OLS and robust estimations were then compared in the context of hypothesis testing. Simulations showed that the five robust approaches compared with OLS estimation led to lower empirical sizes and achieved higher empirical powers in testing cost-effectiveness. Using a real example of antiplatelet therapy, the estimated incremental net-benefit by OLS estimation was lower than those by robust approaches because of outliers in cost data. Robust estimations demonstrated higher probability of cost-effectiveness compared to OLS estimation. The presence of outliers can bias the results of NBR and its interpretations. It is recommended that the use of robust estimation in NBR can be an appropriate method to avoid such biased decision making.
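
To illustrate how one of the robust estimators named above differs from OLS, the sketch below fits a straight line by Huber M-estimation using iteratively reweighted least squares. It is a simplified illustration under stated assumptions, not the study's implementation: the tuning constant 1.345, the scale estimate (median absolute residual divided by 0.6745), the function names and the net-benefit values are all assumptions made for this example.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_line_fit(x, y, k=1.345, n_iter=50, tol=1e-10):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    intercept, slope = weighted_line_fit(x, y, [1.0] * len(x))  # OLS start
    for _ in range(n_iter):
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust residual scale (assumed form: median |residual| / 0.6745).
        scale = median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 inside the threshold, shrinking as 1/|r| outside.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        new_intercept, new_slope = weighted_line_fit(x, y, w)
        change = abs(new_intercept - intercept) + abs(new_slope - slope)
        intercept, slope = new_intercept, new_slope
        if change < tol:
            break
    return intercept, slope


# Hypothetical net benefits: group 0 = control, group 1 = treatment.
# The last treated patient has an extreme cost, giving a net benefit of -100.
group = [0] * 8 + [1] * 8
net_benefit = [10, 11, 9, 10, 12, 8, 10, 11, 15, 14, 16, 15, 13, 17, 15, -100]

_, ols_slope = weighted_line_fit(group, net_benefit, [1.0] * len(group))
_, huber_slope = huber_line_fit(group, net_benefit)
```

In a net-benefit regression on a treatment indicator, the slope is the incremental net benefit. With these made-up numbers a single extreme cost makes the OLS slope negative, while the Huber fit down-weights that observation and stays positive, which mirrors the direction of bias the abstract describes.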

9.
10.
11.
The energy content of finishing diets offered to feedlot cattle may vary across countries. We assumed that the lower the energy content of the finishing diet, the shorter the adaptation period to high-concentrate diets can be without negatively impacting rumen health while still improving feedlot performance. This study was designed to determine the effects of adaptation periods of 6, 9, 14 and 21 days on feedlot performance, feeding behaviour, blood gas profile, carcass characteristics and rumen morphometrics of Nellore cattle. The experiment was designed as a completely randomised block, replicated 6 times, in which 96 20-month-old yearling Nellore bulls (391.1 ± 30.9 kg) were fed in 24 pens (4 animals/pen) according to the adaptation period adopted: 6, 9, 14 or 21 days. The adaptation diets contained 70%, 75% and 80.5% concentrate, and the finishing diet contained 86% concentrate. After adaptation, one animal per pen was slaughtered (n = 24) for rumen morphometric evaluations and the remaining 72 animals were harvested after 88 days on feed. Orthogonal contrasts were used to assess linear, quadratic and cubic relationships between days of adaptation and the dependent variable. Overall, as days of adaptation increased, final BW (P = 0.06), average daily gain (ADG) (P = 0.07), hot carcass weight (P = 0.04) and gain to feed ratio (G : F) (P = 0.07) were affected quadratically, in which yearling bulls adapted by 14 days presented greater final BW, ADG, hot carcass weight and improved G : F. No significant (P > 0.10) days of adaptation effect was observed for any of feeding behaviour variables. As days of adaptation increased, the absorptive surface area of the rumen was affected cubically, where yearling bulls adapted by 14 days presented greater absorptive surface area (P = 0.03). 
Thus, Nellore yearling bulls should be adapted by 14 days because it led to improved feedlot performance and greater development of rumen epithelium without increasing rumenitis scores.

12.
13.

Background

The removal of outliers to acquire a significant result is a questionable research practice that appears to be commonly used in psychology. In this study, we investigated whether the removal of outliers in psychology papers is related to weaker evidence (against the null hypothesis of no effect), a higher prevalence of reporting errors, and smaller sample sizes in these papers compared to papers in the same journals that did not report the exclusion of outliers from the analyses.

Methods and Findings

We retrieved a total of 2667 statistical results of null hypothesis significance tests from 153 articles in main psychology journals, and compared results from articles in which outliers were removed (N = 92) with results from articles that reported no exclusion of outliers (N = 61). We preregistered our hypotheses and methods and analyzed the data at the level of articles. Results show no significant difference between the two types of articles in median p value, sample sizes, or prevalence of all reporting errors, large reporting errors, and reporting errors that concerned the statistical significance. However, we did find a discrepancy between the reported degrees of freedom of t tests and the reported sample size in 41% of articles that did not report removal of any data values. This suggests common failure to report data exclusions (or missingness) in psychological articles.

Conclusions

We failed to find that the removal of outliers from the analysis in psychological articles was related to weaker evidence (against the null hypothesis of no effect), sample size, or the prevalence of errors. However, our control sample might be contaminated due to nondisclosure of excluded values in articles that did not report exclusion of outliers. Results therefore highlight the importance of more transparent reporting of statistical analyses.
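
The discrepancy between reported degrees of freedom and reported sample size can be illustrated with a simple consistency check. This is a hypothetical sketch, not the authors' preregistered procedure: it assumes a classical t test, where the degrees of freedom are N - 1 for a one-sample or paired test and N - 2 for an independent-samples test, and it does not cover Welch-corrected tests, whose degrees of freedom are usually fractional.

```python
def df_consistent_with_n(reported_df, reported_n):
    """True if df equals N - 1 (one-sample/paired) or N - 2 (independent samples)."""
    return reported_df in (reported_n - 1, reported_n - 2)


# Hypothetical reports: an article states N = 40 participants.
check_ok = df_consistent_with_n(reported_df=38, reported_n=40)       # t(38)
check_suspect = df_consistent_with_n(reported_df=35, reported_n=40)  # t(35)
```

A result such as t(35) with a stated N of 40 would be flagged, since it suggests that some observations were excluded or missing without being reported.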

14.
15.
This paper presents 3 years of data (2009–2011) on the occurrence of two mycotoxins, aflatoxin B1 (AFB1) and zearalenone (ZEA), in samples of feedstuff for dairy cows (n = 963), ewes (n = 42), and goats (n = 131) produced in Portugal. AFB1 was found in 15 samples of cow feed (1.6 %), 3 samples of ewe feed (2.3 %) and in 2 samples of goat feed (4.8 %). All but two samples contained AFB1 at levels below the European Union maximum level (5 μg/kg). Nearly half (45 %) of the samples were contaminated with ZEA, but its levels were relatively low, at 5–136.9 μg/kg, well below the European Union guidance value (500 μg/kg).

16.
Most of the drugs in use against Plasmodium falciparum share similar modes of action and, consequently, there is a need to identify alternative potential drug targets. Here, we focus on the apicoplast, a malarial plastid-like organelle of algal source which evolved through secondary endosymbiosis. We undertake a systematic in silico target-based identification approach for detecting drugs already approved for clinical use in humans that may be able to interfere with the P. falciparum apicoplast. The P. falciparum genome database GeneDB was used to compile a list of ≈600 proteins containing apicoplast signal peptides. Each of these proteins was treated as a potential drug target and its predicted sequence was used to interrogate three different freely available databases (Therapeutic Target Database, DrugBank and STITCH3.1) that provide synoptic data on drugs and their primary or putative drug targets. We were able to identify several drugs that are expected to interact with forty-seven (47) peptides predicted to be involved in the biology of the P. falciparum apicoplast. Fifteen (15) of these putative targets are predicted to have affinity to drugs that are already approved for clinical use but have never been evaluated against malaria parasites. We suggest that some of these drugs should be experimentally tested and/or serve as leads for engineering new antimalarials.

17.
Biswas Bipasa, Lai Yinglei 《BMC genomics》2019,20(2):35-47
Background

The next generation sequencing technology allows us to obtain a large amount of short DNA sequence (DNA-seq) reads at a genome-wide level. DNA-seq data have been increasingly collected during the recent years. Count-type data analysis is a widely used approach for DNA-seq data. However, the related data pre-processing is based on the moving window method, in which a window size needs to be defined in order to obtain count-type data. Furthermore, useful information can be reduced after data pre-processing for count-type data.

Results

In this study, we propose to analyze DNA-seq data based on the related distance-type measure. Distances are measured in base pairs (bps) between two adjacent alignments of short reads mapped to a reference genome. Our experimental data based simulation study confirms the advantages of the distance-type measure approach in both detection power and detection accuracy. Furthermore, we propose artificial censoring for the distance data so that distances larger than a given value are considered potential outliers. Our purpose is to simplify the pre-processing of DNA-seq data. Statistically, we consider a mixture of right censored geometric distributions to model the distance data. Additionally, to reduce the GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). Estimation of the model can be achieved by the Newton-Raphson algorithm as well as the Expectation-Maximization (E-M) algorithm. We have conducted simulations to evaluate the performance of our approach. Based on the rank based inverse normal transformation of distance data, we can obtain the related z-values for a follow-up analysis. For an illustration, an application to the DNA-seq data from a pair of normal and tumor cell lines is presented with a change-point analysis of z-values to detect DNA copy number alterations.

Conclusion

Our distance-type measure approach is novel. It does not require either a fixed or a sliding window procedure for generating count-type data. Its advantages have been demonstrated by our simulation studies and its practical usefulness has been illustrated by an experimental data application.
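
The pre-processing steps described in this abstract (distances between adjacent alignments, artificial censoring, and a rank-based inverse normal transformation) can be sketched as follows. This is only an illustration under assumptions, not the authors' code: the function names, the example positions, the censoring limit and the Blom offset c = 0.375 in the transformation are all choices made for this sketch, and clipping a distance to the limit is used here as a simple stand-in for treating it as right-censored. The mixture model itself is not shown.

```python
from statistics import NormalDist


def adjacent_distances(positions):
    """Distances in bp between adjacent alignment start positions."""
    s = sorted(positions)
    return [b - a for a, b in zip(s, s[1:])]


def censor(distances, limit):
    """Record any distance above `limit` as the limit itself."""
    return [min(d, limit) for d in distances]


def rank_inverse_normal(values, c=0.375):
    """Rank-based inverse normal transform; ties receive their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Hypothetical alignment start positions on one chromosome.
positions = [100, 130, 135, 400, 410, 2000]
distances = adjacent_distances(positions)
censored = censor(distances, limit=1000)
z_values = rank_inverse_normal(censored)
```

Because the transformation depends only on ranks, the resulting z-values are symmetric around zero regardless of how skewed the distance data are, which is what makes them convenient input for a follow-up change-point analysis.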


18.

Objective

To determine the value of contourlet textural features obtained from solitary pulmonary nodules in two dimensional CT images used in diagnoses of lung cancer.

Materials and Methods

A total of 6,299 CT images were acquired from 336 patients, with 1,454 benign pulmonary nodule images from 84 patients (50 male, 34 female) and 4,845 malignant from 252 patients (150 male, 102 female). Further to this, nineteen patient information categories, which included seven demographic parameters and twelve morphological features, were also collected. A contourlet was used to extract fourteen types of textural features. These were then used to establish three support vector machine models. One comprised a database constructed of nineteen collected patient information categories, another included contourlet textural features and the third one contained both sets of information. Ten-fold cross-validation was used to evaluate the diagnosis results for the three databases, with sensitivity, specificity, accuracy, the area under the curve (AUC), precision, Youden index, and F-measure used as the assessment criteria. In addition, the synthetic minority over-sampling technique (SMOTE) was used to preprocess the unbalanced data.

Results

Using a database containing textural features and patient information, sensitivity, specificity, accuracy, AUC, precision, Youden index, and F-measure were: 0.95, 0.71, 0.89, 0.89, 0.92, 0.66, and 0.93 respectively. These results were higher than results derived using the database without textural features (0.82, 0.47, 0.74, 0.67, 0.84, 0.29, and 0.83 respectively) as well as the database comprising only textural features (0.81, 0.64, 0.67, 0.72, 0.88, 0.44, and 0.85 respectively). Using the SMOTE as a pre-processing procedure, a new balanced database was generated, comprising 5,816 benign ROIs and 5,815 malignant ROIs, and accuracy was 0.93.

Conclusion

Our results indicate that combining contourlet textural features of solitary pulmonary nodules in CT images with patient profile information could potentially improve the diagnosis of lung cancer.
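
Two of the assessment criteria reported above are functions of the others: the Youden index is sensitivity plus specificity minus one, and the F-measure is the harmonic mean of precision and sensitivity (recall). The sketch below applies these standard definitions to the rounded values reported in the Results. It is a consistency illustration only, and the function names are assumptions of this example.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def f_measure(precision, recall):
    """F-measure: harmonic mean of precision and recall (sensitivity)."""
    return 2.0 * precision * recall / (precision + recall)


# Reported values for the database with textural features and patient information.
j_combined = youden_index(sensitivity=0.95, specificity=0.71)
f_combined = f_measure(precision=0.92, recall=0.95)

# Reported values for the database without textural features.
j_patient_only = youden_index(sensitivity=0.82, specificity=0.47)
f_patient_only = f_measure(precision=0.84, recall=0.82)
```

Rounded to two decimals, these reproduce the reported Youden index and F-measure for both databases (0.66 and 0.93; 0.29 and 0.83). Small differences can arise for other rows because the inputs themselves are already rounded.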

19.
Awassi is a multi-purpose sheep breed. Awassi lambs being finished are usually offered an 18% crude protein (CP) diet. The growth rate of Awassi lambs is lower than other meat breeds. Therefore, this high content of dietary CP is questionable. The objective of this study was to estimate the optimum CP level for finishing Awassi lambs. Fifty male Awassi lambs (23.0±1.2 kg) were fed five high concentrate isocaloric diets (10 lambs per diet) that contained 10, 12, 14, 16, and 18% CP as totally mixed diets for 9 weeks using a completely randomized design. Lambs were fed twice daily, and feed offered and feed refusals were recorded for each feeding. Individual lamb intakes were calculated using daily feed offered and feed refused averaged over the interval of the experiment. Digestibility estimates were measured by total fecal collection. Lambs fed diets that contained 10, 12, and 14% CP gained less weight than those fed the 16 and 18% CP diets (P<0.05). Dry matter and CP intakes increased (P<0.05) with increasing levels of dietary CP. No difference (P>0.10) was observed in feed-to-gain ratio between diets except for the diet that contained 10% CP (P<0.05) which had a lower ratio. Organic matter and CP digestibility were lowest in lambs fed the 10% CP diet. Results suggest that the optimum CP concentration is 16% and that any increase above this level will not result in any improvement in production.

20.
A complex mixture of diverse oligosaccharides related to the carbohydrates in glycoconjugates involved in various biological events is found in animal milk/colostrum and has been a challenging target for separation and structural studies. In the current study, we isolated oligosaccharides having high molecular masses (MW ∼ 3800) from the milk samples of bearded and hooded seals and analyzed their structures by off-line normal-phase-high-performance liquid chromatography-matrix-assisted laser desorption/ionization-time-of-flight (NP-HPLC-MALDI-TOF) mass spectrometry (MS) in combination with sequential exoglycosidase digestion. Initially, a mixture of oligosaccharides from the seal milk was reductively aminated with 2-aminobenzoic acid and analyzed by a combination of HPLC and MALDI-TOF MS. From MS data, these oligosaccharides contained different numbers of lactosamine units attached to the nonreducing lactose (Galβ1-4Glc) and fucose residue. The isolated oligosaccharides were sequentially digested with exoglycosidases and characterized by MALDI-TOF MS. The data revealed that oligosaccharides from both seal species were composed of lacto-N-neohexaose (LNnH, Galβ1-4GlcNAcβ1-6[Galβ1-4GlcNAcβ1-3]Galβ1-4Glc) as the common core structure, and most of them contained Fucα1-2 residues at the nonreducing ends. Furthermore, the oligosaccharides from both samples contained multibranched oligosaccharides having two Galβ1-4GlcNAc (N-acetyllactosamine, LacNAc) residues on the Galβ1-4GlcNAcβ1-3 branch or both branches of LNnH. Elongation of the chains was observed at 3-OH positions of Gal residues, but most of the internal Gal residues were also substituted with an N-acetyllactosamine at the 6-OH position.
