Similar Articles
20 similar articles found
1.
OpenMP, a typical shared memory programming paradigm, has been extensively applied in the high performance computing community due to the popularity of multicore architectures in recent years. The most significant feature of the OpenMP 3.0 specification is the introduction of task constructs to express parallelism at a much finer level of detail. This feature, however, has posed new challenges for performance monitoring and analysis. In particular, task creation is separated from task execution, rendering traditional monitoring methods ineffective. This paper presents a mechanism based on interposition for monitoring task-based OpenMP programs and proposes two demonstration graphs for performance analysis. The results of two experiments on the BOTS benchmarks are discussed to evaluate the overhead of the monitoring mechanism and to verify the effectiveness of the demonstration graphs.
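The key difficulty this abstract points to is that an OpenMP task is created on one thread but may execute later on another. A minimal sketch of that separation, with a hypothetical monitor() hook standing in for the paper's interposition layer (compile with e.g. g++ -fopenmp):

```cpp
// Sketch only: monitor() is a stand-in for an interposition layer that
// would intercept the OpenMP runtime's task entry points.
#include <cstdio>
#include <omp.h>

static void monitor(const char* event, int id) {
    // A real monitor would log this to a trace buffer; we just print it.
    std::printf("[t=%.6f thread=%d] %s task %d\n",
                omp_get_wtime(), omp_get_thread_num(), event, id);
}

int main() {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < 4; ++i) {
        monitor("created", i);       // creation happens on the single thread...
        #pragma omp task firstprivate(i)
        monitor("executed", i);      // ...execution may happen on any thread
    }
    return 0;
}
```

Run with several threads, this typically reports "created" and "executed" for the same task id on different threads, which is why monitoring that observes only the creating thread falls short.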

2.
We investigate the performance increase provided by the Intel® Xeon Phi™ coprocessor in multiple replica molecular dynamics applications using a novel parallelisation scheme. The benefits of the proposed parallelisation scheme are demonstrated by glycine in water, a system of significant interest in the crystallisation simulation community. The molecular dynamics (MD) engine consists of initially serial LAMMPS and NAMD subroutines, and is subsequently modified and parallelised using a heterogeneous programming model, where each MPI rank is paired with a unique Intel® Xeon Phi™ coprocessor and CPU socket. The MD engine is parallelised using an OpenMP atom domain decomposition algorithm on the Intel® Xeon Phi™ coprocessor and OpenMP task parallelism on the host CPU socket. Using nodes with two Intel® Xeon Phi™ coprocessors, one per socket, we demonstrate that a factor of five reduction in the required computational resources is achieved per replica with the coprocessor, when compared against employing the standard spatial domain decomposition algorithm with no accelerator. Furthermore, the proposed parallelisation scheme achieves ideal weak scaling with respect to the number of employed MPI ranks (replicas). The Intel® Xeon Phi™ coprocessor not only allows us to increase the performance output per socket by a factor of five, when compared against no accelerators, but also significantly reduces the parallelisation complexity necessary to achieve this performance, as the Intel® Xeon Phi™ coprocessor operates using the simple OpenMP programming model.
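Atom decomposition, unlike spatial decomposition, assigns each thread a fixed slice of atoms regardless of where they sit in space. A hedged sketch of the pattern with a toy pairwise force (illustrative only, not the modified LAMMPS/NAMD engine; compile with -fopenmp):

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

// Atom decomposition: each thread owns a static block of atoms i and
// computes all of atom i's interactions, so no cross-thread force
// reduction is needed (at the cost of evaluating every pair twice).
void compute_forces(const std::vector<Vec3>& pos, std::vector<Vec3>& force) {
    const int n = static_cast<int>(pos.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        Vec3 f{0, 0, 0};
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double dz = pos[j].z - pos[i].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double w = 1.0 / (r2 * r2);        // toy inverse-r^4 interaction
            f.x += w * dx; f.y += w * dy; f.z += w * dz;
        }
        force[i] = f;
    }
}

int main() {
    std::vector<Vec3> pos(512), force(512);
    for (int i = 0; i < 512; ++i) pos[i] = {0.01 * (i + 1), 0.0, 0.0};
    compute_forces(pos, force);
    return 0;
}
```

The pattern trades redundant pair computation for independence between threads, which suits the coprocessor's many lightweight hardware threads.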

3.
We present a framework to design efficient and portable HPF applications that exploit a mixture of task and data parallelism. In the proposed framework, data parallelism is restricted to HPF modules, and task parallelism is achieved by the concurrent execution of several data-parallel modules cooperating through COLTHPF, a coordination layer implemented on top of PVM. COLTHPF can be used independently of the HPF compilation system employed, and it allows instances of cooperating HPF tasks to be created either statically or at run-time. We claim that COLTHPF can be exploited by means of a simple skeleton-based coordination language and associated compiler to easily express mixed data and task parallel applications runnable on either multicomputers or clusters of workstations. We used a physics application as a test case of our approach for mixing task and data parallelism, and we present the results of several experiments conducted on a cluster of Linux SMPs.

4.
SAMtools is a widely-used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.
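The task-parallelism technique named here can be sketched with a divide-and-conquer sort in which each partition becomes an OpenMP task. This is an illustration of the pattern, not the actual SAMtools patch (sorting BAM records additionally involves decompression and comparison on alignment coordinates); compile with -fopenmp:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Task-parallel quicksort: each partition becomes an independent OpenMP
// task, and the runtime load-balances tasks across cores.
static void qsort_tasks(std::vector<long>& a, int lo, int hi) {
    if (hi - lo < 10000) {                 // cutoff: small runs sort serially
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
    }
    long pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi - 1;
    while (i <= j) {
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    #pragma omp task shared(a)
    qsort_tasks(a, lo, j + 1);
    #pragma omp task shared(a)
    qsort_tasks(a, i, hi);
    #pragma omp taskwait
}

void parallel_sort(std::vector<long>& a) {
    #pragma omp parallel
    #pragma omp single
    qsort_tasks(a, 0, static_cast<int>(a.size()));
}

int main() {
    std::vector<long> a(1 << 20);
    std::mt19937_64 rng(1);
    for (auto& x : a) x = static_cast<long>(rng());
    parallel_sort(a);
    return std::is_sorted(a.begin(), a.end()) ? 0 : 1;
}
```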

5.
A new implementation of molecular dynamics simulation is presented. We employed policy-based design to achieve static polymorphism within our simulation programs. This technique provides flexibility and extensibility without introducing additional if-statement branching during simulation program development. We show that the policy-based implementation avoids degrading computational performance. We used a fine-grained domain decomposition scheme to parallelise the simulation program. The smaller decomposition size reduces the total amount of communication between processing cores and affords good scalability for the parallel calculation of short-range forces. The calculation of long-range interactions limits the overall scalability; for enhanced performance at high levels of parallelism, the calculation methods for long-range interactions should be improved.
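Policy-based design composes a simulation from template parameters so each call site is resolved at compile time rather than through run-time branches. A minimal sketch with hypothetical policy names (VelocityVerlet and PeriodicBox are ours, not the paper's):

```cpp
// Minimal policy-based design sketch: the integrator and boundary handling
// are template policies, so dispatch is static and no per-step if/else on
// the simulation variant is needed.
#include <cstdio>

struct VelocityVerlet {
    static void step(double& x, double& v, double f, double dt) {
        v += f * dt;          // toy kick...
        x += v * dt;          // ...and drift
    }
};

struct PeriodicBox {
    static void apply(double& x, double box) {
        while (x >= box) x -= box;
        while (x < 0.0)  x += box;
    }
};

template <class Integrator, class Boundary>
struct Simulation {
    double x = 0.5, v = 1.0, box = 1.0;
    void run(int steps, double dt) {
        for (int s = 0; s < steps; ++s) {
            Integrator::step(x, v, /*f=*/0.0, dt);
            Boundary::apply(x, box);   // chosen at compile time, inlinable
        }
    }
};

int main() {
    Simulation<VelocityVerlet, PeriodicBox> sim;
    sim.run(100, 0.01);
    std::printf("x = %f\n", sim.x);
    return 0;
}
```

Swapping in a different boundary or integrator is a one-line type change, and the compiler can inline everything, which is how this style avoids the performance degradation that virtual dispatch or branching could introduce.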

6.
We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80% over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50% performance improvement over the Linux load balancer and a 25% reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.
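A user-space balancer like Juggle ultimately rests on being able to re-pin threads without kernel changes. A Linux-only sketch of that primitive (the decentralized balancing policy itself is omitted; link with -pthread):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Re-pin the calling thread onto a single target core.
static bool migrate_self_to(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    // A real balancer would periodically sample per-core load (e.g. from
    // /proc/stat) and migrate threads away from oversubscribed cores.
    if (migrate_self_to(0))
        std::printf("now running on cpu %d\n", sched_getcpu());
    return 0;
}
```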

7.
New computational approaches for analysis of cis-regulatory networks
The investigation and modeling of gene regulatory networks requires computational tools specific to the task. We present several locally developed software tools that have been used in support of our ongoing research into the embryogenesis of the sea urchin. These tools are especially well suited to iterative refinement of models through experimental and computational investigation. They include: BioArray, a macroarray spot processing program; SUGAR, a system to display and correlate large-BAC sequence analyses; SeqComp and FamilyRelations, programs for comparative sequence analysis; and NetBuilder, an environment for creating and analyzing models of gene networks. We also present an overview of the process used to build our model of the Strongylocentrotus purpuratus endomesoderm gene network. Several of the tools discussed in this paper are still in active development and some are available as open source.

8.
Recently, graphics processing units (GPUs) have become increasingly popular for high performance computing applications. Although GPUs provide high peak performance, exploiting their full potential in application programs remains a challenging task for the programmer. When launching a parallel kernel of an application on the GPU, the programmer needs to carefully select the number of blocks (the grid size) and the number of threads per block (the block size). These values determine the degree of SIMD parallelism and multithreading, and greatly influence performance. With a huge range of possible combinations, choosing the right grid size and block size is not straightforward. In this paper, we propose a mathematical model for tuning the grid size and the block size based on GPU architecture parameters. Using our model we first calculate a small set of candidate grid size and block size values, then search for the optimal values among the candidates through experiments. Our approach significantly reduces the search space compared with the exhaustive-search approaches of previous research, and can therefore be practically applied to real applications.
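The prune-then-benchmark idea can be sketched as follows; the architecture parameters and pruning rules below are simplified assumptions for illustration, not the paper's actual model:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical architecture parameters (values are illustrative).
struct GpuParams {
    int warp_size          = 32;
    int max_threads_per_sm = 2048;
    int max_blocks_per_sm  = 32;
    int num_sms            = 16;
};

// Keep only (block, grid) pairs that can fully occupy the SMs; these few
// candidates are then benchmarked instead of the full search space.
std::vector<std::pair<int, int>> candidates(const GpuParams& g, long total_threads) {
    std::vector<std::pair<int, int>> out;
    for (int block = g.warp_size; block <= 1024; block += g.warp_size) {
        if (g.max_threads_per_sm / block > g.max_blocks_per_sm)
            continue;                       // block too small to fill an SM
        int grid = static_cast<int>((total_threads + block - 1) / block);
        if (grid >= g.num_sms)              // enough blocks to keep SMs busy
            out.push_back({block, grid});
    }
    return out;
}

int main() {
    for (const auto& [block, grid] : candidates(GpuParams{}, 1 << 20))
        std::printf("block=%4d grid=%6d\n", block, grid);
    return 0;
}
```

Even this crude filter cuts the warp-multiple block sizes down to a handful of candidates worth timing, which is the essence of avoiding exhaustive search.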

9.
10.
The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.
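The coordination idea can be illustrated with a demand-driven queue: workers of different speeds pull tasks as they finish, so a faster device naturally takes a larger share of the work. The "devices" here are simulated threads, an assumption made purely for illustration:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> next{0};
    const int total = 64;

    // Each worker pulls the next task index until the queue is drained;
    // the faster "device" ends up processing more tasks automatically.
    auto worker = [&](const char* kind, int cost_ms) {
        int done = 0;
        while (next.fetch_add(1) < total) {
            std::this_thread::sleep_for(std::chrono::milliseconds(cost_ms));
            ++done;
        }
        std::printf("%s processed %d tasks\n", kind, done);
    };

    std::thread gpu(worker, "gpu", 2);   // simulated fast accelerator
    std::thread cpu(worker, "cpu", 8);   // simulated slower CPU
    gpu.join();
    cpu.join();
    return 0;
}
```

This demand-driven style avoids deciding a CPU/GPU split up front, which is what lets a run-time system use every processing unit productively.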

11.
Task partitioning is the decomposition of a task into two or more sub-tasks that can be tackled separately. Task partitioning can be observed in many species of social insects, as it is often an advantageous way of organizing the work of a group of individuals. Potential advantages of task partitioning are, among others: reduction of interference between workers, exploitation of individuals' skills and specializations, energy efficiency, and higher parallelism. Even though swarms of robots can benefit from task partitioning in the same way as social insects do, only a few works in swarm robotics are dedicated to this subject. In this paper, we study the case in which a swarm of robots has to tackle a task that can be partitioned into a sequence of two sub-tasks. We propose a method that allows the individual robots in the swarm to decide whether to partition the given task or not. The method is self-organized, relies on the experience of each individual, and does not require explicit communication between robots. We evaluate the method in simulation experiments, using foraging as a testbed. We study cases in which task partitioning is preferable and cases in which it is not. We show that the proposed method leads to good performance of the swarm in both cases, by employing task partitioning only when it is advantageous. We also show that the swarm is able to react to changes in the environmental conditions by adapting its behavior on-line. Scalability experiments show that the proposed method performs well across all the tested group sizes.
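One way such a self-organized decision could work is for each robot to keep private cost estimates for the two strategies and usually pick the cheaper one. This is a hypothetical rule sketched for illustration; the paper defines its own experience-based method:

```cpp
#include <random>

// Hypothetical decision rule (illustration only): keep exponential moving
// averages of the observed cost of each strategy and usually exploit the
// cheaper one, occasionally exploring the other.
class PartitionDecider {
    double cost_part = 0.0, cost_whole = 0.0;
    double alpha = 0.1;        // smoothing of the cost estimates
    double explore = 0.05;     // probability of choosing at random
    std::mt19937 rng{42};
public:
    bool choose_partitioned() {
        if (std::bernoulli_distribution(explore)(rng))
            return std::bernoulli_distribution(0.5)(rng);   // explore
        return cost_part < cost_whole;                      // exploit
    }
    void report(bool partitioned, double observed_cost) {
        double& c = partitioned ? cost_part : cost_whole;
        c = (1.0 - alpha) * c + alpha * observed_cost;
    }
};

int main() {
    PartitionDecider d;
    for (int trial = 0; trial < 100; ++trial) {
        bool part = d.choose_partitioned();
        d.report(part, part ? 1.0 : 2.0);   // pretend partitioning is cheaper
    }
    return 0;
}
```

Because each robot updates only from its own observed costs, no explicit communication is needed, and the moving average lets the decision track environmental change on-line.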

12.
Interventions that aim to help farmers change on-farm practices recommend that advisors communicate effectively with farmers, work collaboratively to set goals and provide farmers with resources that are applicable to the farm context. We developed an intervention that aimed to help farmers modify and use a standard operating procedure (SOP) for colostrum management; failure of passive transfer of immunoglobulins is common on dairy farms and SOPs for colostrum management are increasingly required by farm animal welfare assurance programs. We used Realistic Evaluation to evaluate whether, how and why our intervention to help farmers modify and use SOPs for colostrum management facilitated change, and to provide recommendations based on our approach that can improve the design and implementation of future interventions. We used a multiple case study on five farms over 8 months, collecting data through interviews, participant observation, document analysis and field notes. We identified three mechanisms that influenced whether participants modified and used their SOP. The purpose mechanism distinguished between participants who thought the aim of the SOP was for farm staff to learn and understand how to complete a task versus those who thought that the SOP was only useful for compliance with assurance programs. The utility mechanism distinguished between participants who thought that the SOP would be helpful for daily use on their farm, versus those who did not. The physical text mechanism distinguished between participants who used the templates we provided to modify and use their SOP, versus those who did not. A key contextual factor on all farms was participant belief of having capable and engaged staff on their farm; modification and use of the SOP did not occur unless this was the case. To facilitate change, intervention developers should actively participate in the intervention to develop an understanding of farmer needs, understand the purpose behind different goals set by farmers and integrate tools, advice and resource demonstrations when possible. We conclude that Realistic Evaluation is a useful framework for evaluating how contexts and mechanisms generate outcomes on farms, and to understand how, and in which contexts, complex interventions facilitate change. We suggest that this approach can improve the success of interventions and help direct the approaches used on different farms.

13.
Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as the Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations, using a well-established 2D cardiac action potential model as a specific case study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved a substantial speedup over the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation was faster still on GPUs, outperforming both the sequential implementation and a parallelized OpenMP implementation. An OpenMP implementation on the Intel MIC coprocessor likewise provided speedups with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to parallelizing 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media.
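The "few pragmas" approach can be sketched on a generic 2D explicit stencil update standing in for one step of the cardiac model (a simplification of the actual model; compile with an OpenACC compiler such as nvc++ -acc, otherwise the pragma is ignored and the loop runs serially):

```cpp
#include <utility>
#include <vector>

// One explicit diffusion step over a 2D grid; the single OpenACC pragma
// offloads the doubly nested loop to the accelerator.
void diffuse_step(const std::vector<double>& u, std::vector<double>& v,
                  int nx, int ny, double d) {
    const double* in = u.data();
    double* out = v.data();
    #pragma acc parallel loop collapse(2) copyin(in[0:nx*ny]) copy(out[0:nx*ny])
    for (int i = 1; i < ny - 1; ++i)
        for (int j = 1; j < nx - 1; ++j) {
            int k = i * nx + j;
            out[k] = in[k] + d * (in[k - 1] + in[k + 1]         // 5-point
                                + in[k - nx] + in[k + nx]       // Laplacian
                                - 4.0 * in[k]);
        }
}

int main() {
    const int nx = 256, ny = 256;
    std::vector<double> u(nx * ny, 0.0), v(nx * ny, 0.0);
    u[(ny / 2) * nx + nx / 2] = 1.0;            // point stimulus
    for (int step = 0; step < 100; ++step) {
        diffuse_step(u, v, nx, ny, 0.2);
        std::swap(u, v);                         // ping-pong buffers
    }
    return 0;
}
```

A real action potential model adds reaction terms for the membrane kinetics, but the offload structure stays the same, which is why so few pragmas are needed.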

14.
The present study investigated both the direct and delayed effects of a 50 Hz, 100 μT magnetic field on human performance. Eighty subjects completed a visual duration discrimination task, half being exposed to the field and the other half sham exposed. The delayed effects of this field were also examined in a recognition memory task that followed immediately upon completion of the discrimination task. Unlike our earlier studies, we were unable to find any effects of the field on reaction time and accuracy in the visual discrimination task. However, the field had a delayed effect on memory, producing a decrement in recognition accuracy. We conclude that after many years of experimentation, finding a set of magnetic field parameters and human performance measures that reliably yield magnetic field effects is proving elusive. Yet the large number of significant findings suggests that further research is warranted.

15.
16.
Nowadays, remote sensing technologies produce huge amounts of satellite images that can be helpful to monitor geographical areas over time. A satellite image time series (SITS) usually contains spatio-temporal phenomena that are complex and difficult to understand. Conceiving new data mining tools for SITS analysis is challenging since we need to simultaneously manage the spatial and the temporal dimensions. In this work, we propose a new clustering framework specifically designed for SITS data. Our method firstly detects spatio-temporal entities, then it characterizes their evolutions by means of a graph-based representation, and finally it produces clusters of spatio-temporal entities sharing similar temporal behaviors. Unlike previous approaches, which mainly work at pixel-level, our framework exploits a purely object-based representation to perform the clustering task. Object-based analysis involves a segmentation step where segments (objects) are extracted from an image and constitute the elements of analysis. We experimentally validate our method on two real world SITS datasets by comparing it with standard techniques employed in remote sensing analysis. We also use a qualitative analysis to highlight the interpretability of the results obtained.

17.
List scheduling algorithms are known to be efficient when the application to be executed can be described statically as a Directed Acyclic Graph (DAG) of tasks. Even when the entire DAG is known beforehand, obtaining an optimal schedule on a parallel machine is an NP-hard problem. Moreover, many programming tools propose the use of scheduling techniques based on list strategies. This paper presents an analysis of scheduling algorithms for multithreaded programs in a dynamic scenario where threads are created and destroyed during execution. We introduce an algorithm to convert DAGs, describing applications as tasks, into Directed Cyclic Graphs (DCGs) describing the same application designed in a multithreaded programming interface. Our algorithm covers case studies described in previous works, successfully mapping from the abstract level of graphs to the application environment. These mappings preserve the guarantees offered by the abstract model, providing efficient scheduling of dynamic programs that follow the intended multithread model. We conclude the paper by presenting performance results obtained with list schedulers in dynamic multithreaded environments, and we compare these results with the best scheduling we could obtain with similar static task schedulers.
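A minimal list-scheduling sketch over a static DAG on p identical processors, using plain task cost as the priority. Real list schedulers use richer priorities (e.g. upward rank), and this simplification ignores precedence-induced start delays across processors:

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Task { int cost; std::vector<int> succ; int indeg = 0; };

void list_schedule(std::vector<Task> g, int p) {
    // Ready list ordered by cost; tasks enter it when all predecessors
    // have been scheduled.
    auto cmp = [&](int a, int b) { return g[a].cost < g[b].cost; };
    std::priority_queue<int, std::vector<int>, decltype(cmp)> ready(cmp);
    for (size_t i = 0; i < g.size(); ++i)
        for (int s : g[i].succ) ++g[s].indeg;
    for (size_t i = 0; i < g.size(); ++i)
        if (g[i].indeg == 0) ready.push(static_cast<int>(i));

    std::vector<int> free_at(p, 0);
    while (!ready.empty()) {
        int t = ready.top(); ready.pop();
        int proc = 0;                        // earliest-free processor wins
        for (int q = 1; q < p; ++q)
            if (free_at[q] < free_at[proc]) proc = q;
        std::printf("task %d on proc %d at t=%d\n", t, proc, free_at[proc]);
        free_at[proc] += g[t].cost;
        for (int s : g[t].succ)
            if (--g[s].indeg == 0) ready.push(s);
    }
}

int main() {
    // Diamond DAG: 0 -> {1, 2} -> 3
    std::vector<Task> g(4);
    g[0] = {2, {1, 2}}; g[1] = {4, {3}}; g[2] = {3, {3}}; g[3] = {1, {}};
    list_schedule(g, 2);
    return 0;
}
```

The dynamic-thread setting the paper targets breaks the assumption that this whole graph is available up front, which is what motivates the DAG-to-DCG conversion.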

18.
Tech M, Merkl R. In Silico Biology, 2003, 3(4): 441-451
The performance of gene-prediction tools varies considerably when evaluated with respect to sensitivity and specificity or the capability to identify the correct start codon. We set out to validate tools for gene prediction and to implement a metatool named YACOP, which combines existing tools and achieves a higher performance. YACOP parses and combines the output of the three gene-prediction systems Critica, Glimmer and ZCURVE. It outperforms each of the programs tested, combining high sensitivity and specificity with a larger number of correctly predicted gene starts. The performance of YACOP and the gene-finding programs was tested by comparing their output with a carefully selected set of annotated genomes. We found that the problem of identifying genes in prokaryotic genomes by means of computational analysis is solved satisfactorily. In contrast, the correct localization of the start codon still appears to be a problem: in all cases under test, at least 7.8% and up to 32.3% of the positions given in the annotations differed from the loci predicted by the programs tested. YACOP can be downloaded from http://www.g2l.bio.uni-goettingen.de.
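A hedged sketch of the metatool idea: treat predictions that share a stop position as calls for the same gene, require agreement from at least two programs, and take the majority-vote start. This is a simplification for illustration, not YACOP's actual combination rules:

```cpp
#include <cstdio>
#include <map>
#include <vector>

struct Gene { long start, stop; };

// Combine per-tool gene calls: genes sharing a stop codon are treated as
// the same gene with competing start positions.
std::vector<Gene> combine(const std::vector<std::vector<Gene>>& predictions) {
    std::map<long, std::map<long, int>> votes;   // stop -> start -> count
    for (const auto& tool : predictions)
        for (const Gene& gene : tool)
            ++votes[gene.stop][gene.start];

    std::vector<Gene> consensus;
    for (const auto& [stop, starts] : votes) {
        int total = 0, best_votes = 0;
        long best_start = -1;
        for (const auto& [start, n] : starts) {
            total += n;
            if (n > best_votes) { best_votes = n; best_start = start; }
        }
        if (total >= 2)                          // at least two tools agree
            consensus.push_back({best_start, stop});
    }
    return consensus;
}

int main() {
    std::vector<std::vector<Gene>> preds = {
        {{100, 400}}, {{130, 400}}, {{100, 400}, {900, 1200}}};
    for (const Gene& g : combine(preds))
        std::printf("gene %ld..%ld\n", g.start, g.stop);   // 100..400 only
    return 0;
}
```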

19.
The amazing revolution in computer hardware performance and cost reduction has yet to be carried over to computer software. In fact, application software today is often more expensive and less reliable than the hardware. New enhancements in software development techniques, such as object oriented programming and interactive graphics based user interface design, finally may be having a significant impact on the time-to-market and reliability of these application programs. We discuss our experiences using one such set of software development tools available on the NeXT workstation and describe the effort required to port our MidasPlus molecular modeling package to the NeXT workstation.

20.
Expanding digital data sources, including social media, online news articles and blogs, provide an opportunity to understand better the context and intensity of human-nature interactions, such as wildlife exploitation. However, online searches encompassing large taxonomic groups can generate vast datasets, which can be overwhelming to filter for relevant content without the use of automated tools. The variety of machine learning models available to researchers, and the need for manually labelled training data with an even balance of labels, can make applying these tools challenging. Here, we implement and evaluate a hierarchical text classification pipeline which brings together three binary classification tasks with increasingly specific relevancy criteria. Crucially, the hierarchical approach facilitates the filtering and structuring of a large dataset, of which relevant sources make up a small proportion. Using this pipeline, we also investigate how the accuracy with which text classifiers identify relevant and irrelevant texts is influenced by the use of different models, training datasets, and the classification task. To evaluate our methods, we collected data from Facebook, Twitter, Google and Bing search engines, with the aim of identifying sources documenting the hunting and persecution of bats (Chiroptera). Overall, the ‘state-of-the-art’ transformer-based models were able to identify relevant texts with an average accuracy of 90%, with some classifiers achieving accuracy of >95%. Whilst this demonstrates that application of more advanced models can lead to improved accuracy, comparable performance was achieved by simpler models when applied to longer documents and less ambiguous classification tasks. Hence, the benefits from using more computationally expensive models are dependent on the classification context. We also found that stratification of training data, according to the presence of key search terms, improved classification accuracy for less frequent topics within datasets, and therefore improves the applicability of classifiers to future data collection. Overall, whilst our findings reinforce the usefulness of automated tools for facilitating online analyses in conservation and ecology, they also highlight that the effectiveness and appropriateness of such tools is determined by the nature and volume of data collected, the complexity of the classification task, and the computational resources available to researchers.
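The pipeline's control flow reduces to chaining binary filters with increasingly specific criteria, each applied only to the texts the previous stage kept. A sketch with keyword stubs standing in for the trained models (the stub rules and sample texts are our assumptions, not the study's classifiers):

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

using Classifier = std::function<bool(const std::string&)>;

// Apply each binary stage in turn; only survivors reach the next stage,
// so later (more specific) classifiers see a much smaller dataset.
std::vector<std::string> hierarchical_filter(std::vector<std::string> texts,
                                             const std::vector<Classifier>& stages) {
    for (const auto& stage : stages) {
        std::vector<std::string> kept;
        for (const auto& t : texts)
            if (stage(t)) kept.push_back(t);
        texts = std::move(kept);
    }
    return texts;
}

int main() {
    auto contains = [](const char* w) {
        return Classifier{[w](const std::string& t) {
            return t.find(w) != std::string::npos; }};
    };
    // Stage 1: about bats at all; stage 2: mentions hunting;
    // stage 3: documents an actual event (crude keyword proxies).
    std::vector<Classifier> stages = {
        contains("bat"), contains("hunt"), contains("killed")};
    auto relevant = hierarchical_filter(
        {"bats hunted and killed for bushmeat", "baseball bat sale"}, stages);
    std::printf("%zu relevant text(s)\n", relevant.size());   // prints 1
    return 0;
}
```

Structuring the task this way is what keeps a dataset dominated by irrelevant material manageable: the cheap, permissive filter runs on everything, while the most specific judgment runs only on a small remainder.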
