期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performance Evaluation of TCP over Optical Channels and Heterogeneous Networks

Xu Jianxuan Miguel A. Labrador Mohsen Guizani 《Cluster computing》2004,7(3):225-238

Next generation optical networks will soon provide users the capability to request and obtain end-to-end all optical 10 Gbps channels on demand. Individual users will use these channels to exchange large amounts of data and support applications for scientific collaborative work. These new applications, which expect steady transfer rates in the order of Gbps, will very likely use either TCP or a new transport layer protocol as the end-to-end communication protocol. In this paper, we investigate the performance of TCP and newer TCP versions over High Bandwidth Delay Product Channels (HBDPC), such as the on demand optical channels described above. In addition, we investigate the performance of these new TCP versions over wireless networks and according to old issues such as fairness. This is particularly important to make adoption decisions. Using simulations, we show that (1) the window-based mechanism of current TCP implementations is not suitable to achieve high link utilization and (2) congestion control mechanisms, such as the one utilized by TCP Vegas and Westwood are more appropriate and provide better performance. We also show that new TCP proposals, although perform better than current TCP versions, they still perform worse than TCP Vegas. In addition, we found that even though these newer versions improve TCP's performance over their original counterparts in HBDPC, they still have performance problems in wireless networks and present worse fairness problems than their old counterparts. We conclude that all these versions are still based on TCP's AIMD strategy or similar and therefore continue to be fairly blind in the way they increase and decrease their transmission rates. TCP will not be able to utilize the foreseen optical infrastructure adequately and support future applications if not redesigned to scale. 相似文献

2.

Accelerating comparative genomics using parallel computing

Janaki C Joshi RR 《In silico biology》2003,3(4):429-440

In the past decade there has been an increase in the number of completely sequenced genomes due to the race of multibillion-dollar genome-sequencing projects. The enormous biological sequence data thus flooding into the sequence databases necessitates the development of efficient tools for comparative genome sequence analysis. The information deduced by such analysis has various applications viz. structural and functional annotation of novel genes and proteins, finding gene order in the genome, gene fusion studies, constructing metabolic pathways etc. Such study also proves invaluable for pharmaceutical industries, such as in silico drug target identification and new drug discovery. There are various sequence analysis tools available for mining such useful information of which FASTA and Smith-Waterman algorithms are widely used. However, analyzing large datasets of genome sequences using the above codes seems to be impractical on uniprocessor machines. Hence there is a need for improving the performance of the above popular sequence analysis tools on parallel cluster computers. Performance of the Smith-Waterman (SSEARCH) and FASTA programs were studied on PARAM 10000, a parallel cluster of workstations designed and developed in-house. FASTA and SSEARCH programs, which are available from the University of Virginia, were ported on PARAM and were optimized. In this era of high performance computing, where the paradigm is shifting from conventional supercomputers to the cost-effective general-purpose cluster of workstations and PCs, this study finds extreme relevance. Good performance of sequence analysis tools on a cluster of workstations was demonstrated, which is important for accelerating identification of novel genes and drug targets by screening large databases. 相似文献

3.

Speculative Defragmentation – Leading Gigabit Ethernet to True Zero-Copy Communication

Christian Kurmann Felix Rauch Thomas M. Stricker 《Cluster computing》2001,4(1):7-18

Clusters of Personal Computers (CoPs) offer excellent compute performance at a low price. Workstations with Gigabit to the Desktop can give workers access to a new game of multimedia applications. Networking PCs with their modest memory subsystem performance requires either extensive hardware acceleration for protocol processing or alternatively, a highly optimized software system to reach the full Gigabit/sec speeds in applications. So far this could not be achieved, since correctly defragmenting packets of the various communication protocols in hardware remains an extremely complex task and prevented a clean zero-copy solution in software. We propose and implement a defragmenting driver based on the same speculation techniques that are common to improve processor performance with instruction level parallelism. With a speculative implementation we are able to eliminate the last copy of a TCP/IP stack even on simple, existing Ethernet NIC hardware. We integrated our network interface driver into the Linux TCP/IP protocol stack and added the well known page remapping and fast buffer strategies to reach an overall zero-copy solution. An evaluation with measurement data indicates three trends: (1) for Gigabit Ethernet the CPU load of communication can be reduced processing significantly, (2) speculation will succeed in most cases, and (3) the performance for burst transfers can be improved by a factor of 1.5–2 over the standard communication software in Linux 2.2. Finally we can suggest simple hardware improvements to increase the speculation success rates based on our implementation. 相似文献

4.

ACS: An adaptive communication system for heterogeneous wide-area ATM clusters

Sung-Yong Park Salim Hariri 《Cluster computing》1999,2(3):229-246

This paper presents an architecture, implementation, and performance evaluation of an adaptive message-passing system for a heterogeneous wide-area ATM cluster that we call the Adaptive Communication System (ACS). ACS uses multithreading to provide efficient techniques for overlapping computation and communication in wide-area computing. By separating control and data activities, ACS eliminates unnecessary control transfers over the data path. This optimizes the data path and improves the performance. ACS supports several different flow control algorithms, error control algorithms, and multicasting algorithms. Furthermore, ACS allows programmers to select at runtime the suitable communication schemes per-connection basis to meet the requirements of a given application. ACS provides three application communication interfaces: Socket Communication Interface (SCI), ATM Communication Interface (ACI), and High Performance Interface (HPI) to support various classes of applications. The SCI is provided mainly for applications that must be portable to many different computing platforms. The ACI provides services that are compatible with ATM connection oriented services where each connection can be configured to meet the Quality of Service (QOS) requirements of that connection. This allows programmers to fully utilize the benefits of the ATM network. The HPI supports applications that demand low-latency and high-throughput communication services. In this interface, ACS uses read/write trap routines to reduce latency and data transfer time, and to avoid using traditional communication protocols. We analyze and compare the performance of ACS with those of other message-passing systems such as p4, PVM, and MPI in terms of point-to-point, multicasting, and application performance. The benchmarking results show that ACS outperforms other message-passing systems and provides flexible communication services for various classes of applications. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

5.

Performance Analysis of a Myrinet-Based Cluster

Teddy Surya Gunawan Wentong Cai 《Cluster computing》2003,6(4):299-313

In recent years, there has been a growing interest in the cluster system as an accepted form of supercomputing, due to its high performance at an affordable cost. This paper attempts to elaborate performance analysis of Myrinet-based cluster. The communication performance and effect of background load on parallel applications were analyzed. For point-to-point communication, it was found that an extension to the Hockney's model was required to estimate the performance. The proposed model suggested that there should be two ranges to be used for the performance metrics to cope with the cache effect. Moreover, based on the extension of the point-to-point communication model, the Xu and Hwang's model for collective communication performance was also extended. Results showed that our models can make better estimation of the communication performance than the previous models. Finally, the interference of other user processes to the cluster system is evaluated by using synthetic background load generation programs. 相似文献

6.

Geographical information system parallelization for spatial big data processing: a review 总被引：1，自引：0，他引：1

Lingjun Zhao Lajiao Chen Rajiv Ranjan Kim-Kwang Raymond Choo Jijun He 《Cluster computing》2016,19(1):139-152

With the increasing interest in large-scale, high-resolution and real-time geographic information system (GIS) applications and spatial big data processing, traditional GIS is not efficient enough to handle the required loads due to limited computational capabilities.Various attempts have been made to adopt high performance computation techniques from different applications, such as designs of advanced architectures, strategies of data partition and direct parallelization method of spatial analysis algorithm, to address such challenges. This paper surveys the current state of parallel GIS with respect to parallel GIS architectures, parallel processing strategies, and relevant topics. We present the general evolution of the GIS architecture which includes main two parallel GIS architectures based on high performance computing cluster and Hadoop cluster. Then we summarize the current spatial data partition strategies, key methods to realize parallel GIS in the view of data decomposition and progress of the special parallel GIS algorithms. We use the parallel processing of GRASS as a case study. We also identify key problems and future potential research directions of parallel GIS. 相似文献

7.

Efficiency of Parallel Direct Optimization

Daniel A. Janies Ward C. Wheeler 《Cladistics : the international journal of the Willi Hennig Society》2001,17(1):S71-S82

Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. 相似文献

8.

A Load Balancing Tool for Distributed Parallel Loops 总被引：1，自引：0，他引：1

Ricolindo?L.?Cari?o Email author Ioana?Banicescu 《Cluster computing》2005,8(4):313-321

Large scale applications typically contain parallel loops with many iterates. The iterates of a parallel loop may have variable execution times which translate into performance degradation of an application due to load imbalance. This paper describes a tool for load balancing parallel loops on distributed-memory systems. The tool assumes that the data for a parallel loop to be executed is already partitioned among the participating processors. The tool utilizes the MPI library for interprocessor coordination, and determines processor workloads by loop scheduling techniques. The tool was designed independent of any application; hence, it must be supplied with a routine that encapsulates the computations for a chunk of loop iterates, as well as the routines to transfer data and results between processors. Performance evaluation on a Linux cluster indicates that the tool reduces the cost of executing a simulated irregular loop without load balancing by up to 81%. The tool is useful for parallelizing sequential applications with parallel loops, or as an alternate load balancing routine for existing parallel applications. 相似文献

9.

Selective expression of the proprotein convertases furin, pc5, and pc7 in proliferating vascular smooth muscle cells of the rat aorta in vitro.

P Stawowy J Marcinkiewicz K Graf N Seidah M Chrétien E Fleck M Marcinkiewicz 《The journal of histochemistry and cytochemistry》2001,49(3):323-332

The aim of this study was to investigate whether transformation of quiescent vascular smooth muscle cells (VSMCs) into proliferating secretory cells is accompanied by an expression of processing enzymes that activate de novo-synthesized growth factors. Three enzymes belonging to the family of the kexin/subtilisin-like mammalian proprotein convertases (PCs), furin, PC5, and PC7, were found to be upregulated after balloon denudation in vivo. To determine their importance in these cell processes, we investigated their gene regulation using a short-term organ culture system. After incubation of rat aorta for 4 and 24 hr in serum-free medium, we demonstrated a significant induction of VSMC proliferation. The affected subset of VSMCs, positive for alpha-smooth muscle actin, also expressed proliferating cell nuclear antigen (PCNA). Our results revealed a parallel upregulation of furin, PC5, and PC7 in PCNA-immunolabeled cells. As a substrate model for comparison with PCs we used nerve growth factor (NGF). NGF is known to be activated by PCs. As shown by Northern blotting analysis, NGF mRNA concentration was significantly increased in cultured explants. NGF was released into the culture medium. In conclusion, both PCs and NGF are coordinately modulated on induction of VSMC proliferation. 相似文献

10.

Performance evaluation of a remote memory system with commodity hardware for large-memory data processing

Hyuck Han Hyungsoo Jung Sooyong Kang Heon Y. Yeom 《Cluster computing》2011,14(4):325-344

The explosion of data and transactions demands a creative approach for data processing in a variety of applications. Research on remote memory systems (RMSs), so as to exploit the superior characteristics of dynamic random access memory (DRAM), has been performed for many decades, and today’s information explosion galvanizes researchers into shedding new light on the technology. Prior studies have mainly focused on architectural suggestions for such systems, highlighting different design rationale. These studies have shown that choosing the appropriate applications to run on an RMS is important in fully utilizing the advantages of remote memory. This article provides an extensive performance evaluation for various types of data processing applications so as to address the efficacy of an RMS by means of a prototype RMS with reliability functionality. The prototype RMS used is a practical kernel-level RMS that renders large memory data processing feasible. The abstract concept of remote memory was materialized by borrowing unused local memory in commodity PCs via a high speed network capable of Remote Direct Memory Access (RDMA) operations. The prototype RMS uses remote memory without any part of its computation power coming from remote computers. Our experimental results suggest that an RMS can be practical in supporting the rigorous demands of commercial in memory database systems that have high data access locality. Our evaluation also convinces us of the possibility that a reliable RMS can satisfy both the high degree of reliability and efficiency for large memory data processing applications whose data access pattern has high locality. 相似文献

11.

A Software Suite for High-Performance Communications on Clusters of SMPs

P. Geoffray C. Pham B. Tourancheau 《Cluster computing》2002,5(4):353-363

A cluster, by opposition to a parallel computer, is a set of separate workstations interconnected by a high-speed network. The performances one can get on a cluster heavily depend on the performances of the lowest communication layers. In this paper, we address the special case where the cluster contains multi-processor machines. These shared-memory multi-processors desktop machines (SMPs) with 2 or 4 processors are now becoming very popular and present a high performance/price ratio. We present a software suite for achieving high-performance communications on a Myrinet-based cluster: BIP, BIP-SMP and MPI-BIP. The software suite supports single-processor (Intel PC and Digital Alpha) and multi-processor machines, as well as any combination of the two architectures. 相似文献

12.

Key Message Approach to Optimize Communication of Parallel Applications on Clusters

Ming Zhu Wentong Cai Bu-Sung Lee 《Cluster computing》2003,6(3):253-265

Over the past few years, cluster/distributed computing has been gaining popularity. The proliferation of the cluster/distributed computing is due to the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications on cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient and portable Key Message (KM) approach to support parallel computing on cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach has better improvement on communication of a parallel application when network background load increases or the computation to communication ratio of the application decreases. 相似文献

13.

Parallel FFT on ATM‐based networks of workstations

Suresh Chalasani Parameswaran Ramanathan 《Cluster computing》1998,1(1):13-26

We consider parallel computing on a network of workstations using a connection-oriented protocol (e.g., Asynchronous Transfer Mode) for data communication. In a connection-oriented protocol, a virtual circuit of guaranteed bandwidth is established for each pair of communicating workstations. Since all virtual circuits do not have the same guaranteed bandwidth, a parallel application must deal with the unequal bandwidths between workstations. Since most works in the design of parallel algorithms assume equal bandwidths on all the communication links, they often do not perform well when executed on networks of workstations using connection-oriented protocols. In this paper, we first evaluate the performance degradation caused by unequal bandwidths on the execution of conventional parallel algorithms such as the fast Fourier transform and bitonic sort. We then present a strategy based on dynamic redistribution of data points to reduce the bottlenecks caused by unequal bandwidths. We also extend this strategy to deal with processor heterogeneity. Using analysis and simulation we show that there is a considerable reduction in the runtime if the proposed redistribution strategy is adopted. The basic idea presented in this paper can also be used to improve the runtimes of other parallel applications in connection-oriented environments. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

14.

High Performance Network of PC Cluster Maestro

Koichi Wada Shinichi Yamagiwa Munehiro Fukuda 《Cluster computing》2002,5(1):33-42

This paper presents a design, an architecture, and performance evaluation of high-performance network of PC cluster, called Maestro. Most networks of recent clusters have been organized based on WAN or LAN technology, due to their market availability. However, communication protocols and functions of such conventional networks are not optimal for parallel computing, which requires low latency and high bandwidth communication. In this paper, we propose two optimizations for high-performance communication: (1) transferring in burst as many packets as the receiving buffer accepts at once, and (2) having each hardware component pass one data unit to another in a pipelined manner. We have developed a network interface and a switch, which are composed of dedicated hardware modules to realize these optimizations. An implementatin of the message passing library developed on Maestro cluster is also described. Performance evaluation shows that the proposed optimizations can extract the potential performance of the physical layer efficiently and improve the performance in communication. 相似文献

15.

Genetic and Nongenetic Variation Revealed for the Principal Components of Human Gene Expression

Anita Goldinger Anjali K. Henders Allan F. McRae Nicholas G. Martin Greg Gibson Grant W. Montgomery Peter M. Visscher Joseph E. Powell 《Genetics》2013,195(3):1117-1128

相似文献

16.

Nephele streaming: stream processing under QoS constraints at scale

Björn Lohrmann Daniel Warneke Odej Kao 《Cluster computing》2014,17(1):61-78

The ability to process large numbers of continuous data streams in a near-real-time fashion has become a crucial prerequisite for many scientific and industrial use cases in recent years. While the individual data streams are usually trivial to process, their aggregated data volumes easily exceed the scalability of traditional stream processing systems. At the same time, massively-parallel data processing systems like MapReduce or Dryad currently enjoy a tremendous popularity for data-intensive applications and have proven to scale to large numbers of nodes. Many of these systems also provide streaming capabilities. However, unlike traditional stream processors, these systems have disregarded QoS requirements of prospective stream processing applications so far. In this paper we address this gap. First, we analyze common design principles of today’s parallel data processing frameworks and identify those principles that provide degrees of freedom in trading off the QoS goals latency and throughput. Second, we propose a highly distributed scheme which allows these frameworks to detect violations of user-defined QoS constraints and optimize the job execution without manual interaction. As a proof of concept, we implemented our approach for our massively-parallel data processing framework Nephele and evaluated its effectiveness through a comparison with Hadoop Online. For an example streaming application from the multimedia domain running on a cluster of 200 nodes, our approach improves the processing latency by a factor of at least 13 while preserving high data throughput when needed. 相似文献

17.

MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware

Lommen A Kools HJ 《Metabolomics : Official journal of the Metabolomic Society》2012,8(4):719-726

A new, multi-threaded version of the GC-MS and LC-MS data processing software, metAlign, has been developed which is able to utilize multiple cores on one PC. This new version was tested using three different multi-core PCs with different operating systems. The performance of noise reduction, baseline correction and peak-picking was 8-19 fold faster compared to the previous version on a single core machine from 2008. The alignment was 5-10 fold faster. Factors influencing the performance enhancement are discussed. Our observations show that performance scales with the increase in processor core numbers we currently see in consumer PC hardware development. 相似文献

18.

A System for Distributed Storage and Analysis of Genome Information

Chernyi A. A. Trushkin K. A. Bokovoy V. A. Yanovskii A. K. Tverdokhlebov N. V. Joutchkov A. V. Lysov Yu. P. 《Molecular Biology》2004,38(1):89-93

A distributed computing system is developed to search and analyze genetic databases using parallel computing technologies. Queries are processed by a local network PC cluster. A universal task and data exchange format is developed for effective query processing. A multilevel hierarchic task batching procedure is elaborated to generate multiple subtasks and distribute them over cluster units under dynamic priority levels and with dynamic distribution of replicated source data subbases. Primary source data preparation and generation of annotation word indices are used to significantly reduce query processing time. 相似文献

19.

System of distributed storage and analysis of genomic information

Chernyĭ AA Trushkin KA Bokovoĭ VA Ianovskiĭ AK Tverdokhlebov NV Zhukov AV Lysov IuP 《Molekuliarnaia biologiia》2004,38(1):104-109

A distributed computing system is developed to search and analyze genetic databases using parallel computing technologies. Queries are processed by a local network PC cluster. A universal task and data exchange format is developed for effective query processing. A multilevel hierarchic task batching procedure is elaborated to generate multiple subtasks and distribute them over cluster units under dynamic priority levels and with dynamic distribution of replicated source data subbases. Primary source data preparation and generation of annotation word indices are used to significantly reduce query processing time. 相似文献

20.

Optimizing MPI collective communication by orthogonal structures

Matthias Kühnemann Thomas Rauber Gudula Rünger 《Cluster computing》2006,9(3):257-279

MPI collective communication operations to distribute or gather data are used for many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. But also on a Cray T3E a significant performance improvement can be obtained by a careful selection of the processor structure. The use of these optimized communication operations can reduce the execution time of data parallel implementations of complex application programs significantly without requiring any other change of the computation and communication structure. We present runtime functions for the modeling of two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs. 相似文献