Talks and Guests


  1. Martti Forsell, VTT, Finland, 17.6-29.6.2019

  2. Nikita Koval, IST Austria, 9.1.-4.3.2019


  1. Trevor Brown, IST Austria, 14.6.2018

    Getting to the root of concurrent search trees

    Time and Place: June 14th, 10:00 am, Hörsaal EI6 (old EI), Gusshausstrasse 27, 6th floor

    Many systems rely on optimistic concurrent search trees for multi-core scalability. In principle, optimistic trees have a simple performance story: searches are read-only and run in parallel, with writes to shared memory occurring only when modifying the data structure. However, in practice, obtaining the full performance benefits of optimistic search trees is not so simple.

    In this talk, I will first briefly describe the major developments in concurrent search trees, providing a high level overview of the most popular implementations. Then, I will describe some unexpected performance differences between BSTs with similar tree structure and search implementations, which can be traced to subtle performance-degrading interactions of BSTs with systems software and hardware subsystems. This talk should be of interest to those curious about concurrent search trees, systems and experimentation.


  1. Martti Forsell, VTT, Finland, 29.5-28.6.2017

    The Thick Control Flow Model and Architecural Support for It

    Time and Place: June 22, 10:00 am, Favoritenstr. 16, 3rd floor

    The recently invented thick control flow (TCF) model packs together an unbounded number of fibers, thread-like computational entities, flowing through the same control path. This promises to simplify parallel programming by partially eliminating looping and artificial thread arithmetics. In this presentation we ntroduce the TCF model and outline an architecture for supporting efficiently executing programs written for the model. The outlined architecture features scalable latency hiding via replication of instructions, radical synchronization cost reduction via a wave-based synchronization mechanism, and improved low-level parallelism exploitation via chaining of functional units. Replication of instructions is supported by a dynamic multithreading-like mechanism, which saves the fiber-wise data into special replicated register blocks. The model and architecture facilitate programmers with compact, unbounded notation of fibers and groups of them together with strong synchronous shared memory algorithmics. According to early evaluations, the model inspires strong but simple parallel programming language design and the architecture is able to efficiently handle workloads featuring computational elements with the same control flow, independently of the number of elements. In its turn, this pays out as improved performance and lower power consumption due to elimination of redundant parts of computation and machinery.


  1. Karl Fürlinger, Ludwig-Maximilians Universität München

    DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorithms

    Time and Place: October 18, 10:00 am, Favoritenstr. 16, 3rd floor

    DASH is a realization of the partitioned global address space (PGAS) programming model in the form of a C++ template library. DASH offers distributed datastructures with flexible data distribution schemes and implements a PGAS model by relying on a one-sided communication substrate which is accessed through an intermediate runtime layer called DART (the DASH runtime). DASH can be used within a shared memory node as well as between nodes and it provides an iterator-based interface that is similar to the data containers of the C++ Standard Template Library (STL). To support the development of applications that exploit a hierarchical organization, either on the algorithmic or on the hardware level, DASH features the notion of teams that are arranged in a hierarchy. Based on a team hierarchy, the DASH data structures support locality iterators as a generalization of the conventional local/global distinction found in many PGAS approaches.

  2. Roger Kowalewski, Ludwig-Maximilians Universität München

    Debugging Synchronization Errors in MPI-3 One-Sided Applications

    Time and Place: October 18, 11:00 am, Favoritenstr. 16, 3rd floor

    MPI Remote Memory Access (RMA) allows one process to specify all communication parameters for both the sending and receiving side by providing support for asynchronous reads and updates of distributed shared data. While such an interface can be highly efficient, proper synchronization of possibly conflicting accesses to shared data is a challenging task and error-prone. We present a tool that supports developers in finding latent synchronization errors. It exploits the semantic flexibility given by the MPI standard and applies various heuristics to dynamically schedule pessimistic executions. An experimental evaluation reveals that this approach forces a manifestation of latent synchronization bugs which would otherwise likely go unnoticed.

  3. Alfons Laarman, TU Wien

    Title: A DFS-based Parallel Algorithm for SCC Decomposition

    Time and Place: October 11, 11:00 am, Favoritenstr. 16, 3rd floor

    Reif’s complexity result is often cited to demonstrate that algorithms based on DFS are (probably) inherently sequential. However, for many applications the result does not apply directly, since Reif considered lexicographical DFS whereas many applications of DFS merely require a non-lexicographical search order. In other words, the freedom in ordering adjacent vertices could yield efficient parallelization. Indeed, it is unknown whether non-lex DFS is efficiently parallelizable. With a novel algorithm for parallel SCC decomposition, we demonstrate that in practice such DFS-based solutions are feasible. In particular, we show that a DFS solution outperforms existing fixpoint approaches. Contrary to the latter approaches, our algorithm also is on-the-fly and with a fixed number of processors has a better runtime complexity.

  4. Louis-Claude Canon, University of Franche-Comté, France

    Title: Controlling and Assessing Correlations of Cost Matrices in Heterogeneous Scheduling

    Time and Place: March 4, 10:00 am, Favoritenstr. 16, 3rd floor

    This talk considers the problem of allocating independent tasks to unrelated machines such as to minimize the maximum completion time. Testing heuristics for this problem requires the generation of cost matrices that specify the execution time of each task on each machine. Numerous studies showed that the task and machine heterogeneities belong to the properties impacting heuristics performance the most. This study focuses on orthogonal properties, the average correlations between each pair of rows and each pair of columns, which is a proximity measure with uniform instances. Cost matrices generated with a novel generation method show the effect of these correlations on the performance of several heuristics from the literature. In particular, EFT performance depends on whether the tasks are more correlated than the machines and HLPT performs the best when both correlations are close to one.

  5. Henning Meyerhenke, Karlsruhe Institute of Technology, Germany, 25.1.2016-28.1.2016

    Title: Scalable algorithms for analyzing large networks with NetworKit

    Time and Place: 26.1.2016, 11:00 a.m., Favoritenstr. 16, 3rd floor

    We present NetworKit, an open-source software toolkit for efficient graph algorithms and the analysis of massive complex networks. Complex networks are equally attractive and challenging targets for data mining. Novel algorithmic solutions, including parallelization, are required to handle data sets containing billions of connections. NetworKit is a hybrid combining the performance of kernels written in C++ with a convenient Python frontend. The package targets shared-memory platforms with OpenMP support. Two selected algorithm classes from NetworKit will be presented in some detail, those for community detection and for dynamic betweenness centrality approximation. In comparison with related software, we propose NetworKit as a package geared towards large networks and satisfying three important criteria: High performance, interactive workflows, and integration into an ecosystem of tested tools for data analysis and scientific computation.

    Based on joint work with Elisabetta Bergamini, Alex Sazonovs, and Christian L. Staudt

  6. Louis-Claude Canon, University of Franche-Comté, France, 15.2.2016 - 27.3.2016


  1. Michael Axtmann, Karlsruhe Institute of Technology, Karlsruhe, 23.11.2015-27.11.2015, talk 23.11.2015 at 13:00:

    Practical Massively Parallel Sorting

    Previous parallel sorting algorithms do not scale to the largest available machines, since they either have prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viable compromise and overcome this gap both in theory and practice. The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort. In particular, our sample sort variant turns out to be very scalable both in theory and practice where it scales up to 2^15 MPI processes with outstanding performance in particular for medium sized inputs. Some tools we develop may be of independent interest – a simple, practical, and flexible sorting algorithm for very small inputs, a near linear time optimal algorithm for solving a constrained bin packing problem, and an algorithm for data delivery, that guarantees a small number of message startups on each processor.

  2. Christian Schulz, 1.9.2015 - 29.2.2016

  3. François Tessier, Inria, Bordeaux, Friday 31.7.2015, at 12:00(!):

    Placement techniques to improve data locality in parallel applications

    With hardware topologies becoming incessantly complex and growing needs in term of computing power for the parallel applications, the issue of data locality becomes central: how to reduce the distance between a processing entity and data to which it needs to access? Application placement is one of the levers to address this problem. In this talk, we present the TreeMatch algorithm and its application for static mapping, that is to say at the lauchtime of the application, and the dynamic placement. For this second approach, we propose awareness of data locality within a load balancing algorithm. These two different approaches are validated by experiments both on benchmarking codes and on real applications. Finally, we present recent work on data locality in parallel I/O targeting the election of aggregators in charge of gathering the pieces of data and writing them to disk.

  4. Martin Schatz, University of Texas at Austin/Sandia Natl. Labs; Tuesday 7.4 - Friday 10.4.2015

    On Tuesday, 7.4.2015, 14:00:

    Formalizing Distributed Tensor Computations

    Recently, data models have become more complex, requiring multi-dimensional representations and multilinear algebra to meaningfully express and compute with such data. Tensors are often the structures used for such representations, and tensor operations are often used for the necessary computation. In fields such as computational chemistry, these datasets become too large to compute locally on a single processing node, and thus, require leveraging distributed-memory systems to perform the necessary computation. Through decades of research, insights in the area of dense linear algebra, a special case of multilinear algebra, have led to the development of libraries that can achieve high performance on classes of distributed-memory architectures. Due to the added complexity of tensor operations, stemming from the arbitrary number of dimensions the tensors can represent, it becomes daunting to reason about factors such as data distribution, communication overhead, and algorithmic choice when creating high-performance distributed-memory libraries for tensor operations. In this work, we define a notation that generalizes key insights from research in dense linear algebra, formalizing arbitrary-dimensional tensor data distributed on arbitrary-dimensional processing meshes and redistribution operations in terms of well-studied collective communications. Additionally, we define a systematic process for the derivation of a class of high-performance algorithms for the tensor contraction operation on distributed-memory architectures, as well as show how the defined notation can express communication optimizations domain-experts utilize to improve performance of their applications.

    Martin Schatz is a 5th year Ph.D student supervised by Tamara G. Kolda of Sandia National Laboratories: Livermore, and Robert van de Geijn of the University of Texas at Austin. He is currently a Sandia National Laboratories Graduate Fellowship awardee. My background is in Computer Science and Chemistry (Bacherlor of Science in both).

  5. Christian Schulz, Monday-Friday, 16.-20.2.2015

    On Monday, 16.2.2015, 14:00:

    Partitioning for Complex Networks

    Processing large complex networks like social networks or web graphs has recently attracted considerable interest. To do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners originally developed for more regular mesh-like networks do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable for both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsest graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very big. As an example, our algorithm partitions a web graph with 3.3G edges in 16 seconds using 512 cores of a high-performance cluster while producing a high quality partition -- none of the competing systems can handle this graph on our system.

  6. Jörg Keller, FernUniversität Hagen, Monday-Wednesday, 19.-21.1.2015

    On Tuesday, 20.1.2015, 14:00:

    Heuristics for energy-efficient and fault-tolerant scheduling of taskgraphs

    Deadline scheduling of taskgraphs is a well-known optimization problem that has been researched in depth for decades. In the last decade, the advent of frequency scaling in processors has extended the problem to scheduling for a certain deadline with minimum energy consumption. We present a stepwise heuristic for this problem that we evaluate with the help of a benchmark suite of small taskgraphs for which optimal solutions can be computed with linear programming. We extend the problem to include fault-tolerance, ie. to overcome failure of processing elements by task duplication. We present several tradeoffs between performance degradation and energy investment in the fault-case, that allow the user to adapt the schedule to his preferences. The fault-tolerance strategies are evaluated with the same benchmark suite as before.


  1. Jost Berthold, Københavns Universitet, Danmark (University of Copenhagen, Denmark), Thursday, 4.12.2015, 15:00

    Parallel Functional Programming
    (a biased overview focussed on Haskell)

    With the advent of manycore and GPGPUs, parallel and concurrent programming have undergone a veritable renaissance, and have become common among application programmers at all levels. Today, even common smartphones operate on at least four cores. At the same time, our society is increasingly knowledge-based and depends on parallel processing to realise their computational demand. Efficient large-scale data processing become more and more important for modern knowledge-based industries like banks and social networks.

    However, identifying, allocating and controlling parallelism in massively parallel hardware is challenging, and programmers need to "think parallel" from the start in order to effectively exploit modern multicores and networked architectures. Functional languages - such as Haskell with parallelism extensions - are commonly considered as convenient tools to reason about and prototype parallel programs without getting lost in implementation details.

    In this research talk, we will provide an overview of functional concepts for parallelism, with a special focus on parallel Haskell dialects, and then discuss a selection of common parallelisation patterns and skeletons, and their implementations in the parallel Haskell dialect Eden. We argue that the functional programming paradigm is a good match to the requirement of "parallel thinking", and that parallel Haskell provides good support for both skeleton usage and lower-level implementation. The functional paradigm enables programmers to analyse and design applications for parallel computing from the start, and a rich type system and libraries provide excellent support for prototyping and exploring algorithm designs in a structured mathematical way.

    Biography: Dr.rer.nat. Jost Berthold is Assistant Professor (Adjunkt) at DIKU, the Department of Computer Science at the University of Copenhagen. His core expertise lies in functional programming techniques for parallel computing, and covers all aspects: programming models, language and library design, implementation techniques, tool support, and research into patterns of parallelism (skeletons). The focus of Dr.Berthold's practical work is parallel Haskell dialects; his conceptual work on parallel patterns and skeletons links up to different application areas. After post-doctoral positions at Microsoft Research Cambridge, in St.Andrews and at DIKU, Dr. Berthold now works at the core of the HIPERFIT research centre, applying advanced programming language technology (DSLs and parallel functional programming) to problems from financial mathematics.

  2. Elias Wimmer, Wilfried Gansterer, Universität Wien, Tuesday 28.10.2014, 14:00

    Towards implementing randomized reduction algorithms with MPI

    Randomized communication patterns are not very common in HPC codes. MPI point-to-point communication tends to enforce fixed communication patterns, as both sender and receiver must participate in order to communicate. With MPI one-sided communication, implementation of randomized algorithms is a lot easier, but sustained performance is - in our experience - rather poor. In this talk we discuss strategies for implementing a randomized iterative reduction algorithm with MPI. Implementations based on one-sided communication are compared to implementations based on point-to-point communication. Moreover, use cases for new features in MPI 3 are discussed.

    We are looking forward to discussing various open questions related to our implementations

  3. Francisco D. Igual, Universidad Complutense de Madrid, Friday 30.5.2014, 10:00:

    The New Role of the DSP on the HPC arena

    As the High Performance Computing industry looks to mobile processors for power savings opportunities in Exascale systems, a related but overlooked segment of the communications market may prove to be a better hunting ground for such devices. Communications has seen explosive growth from the first 3G data communications and now the race to 4G and video delivery continues to push wireless infrastructure forward as exponential rates. Wireless OEMs have been pushing for lower power consumption in base stations and the connectivity networks for over a decade and the resultant devices have evolved from simple DSPs to powerful multicore technologies with excellent power/compute ratios.

    In this talk we will review how TI’s multicore devices have evolved into potential HPC compute engines and how they differ from existing HPC processors.  We will look at this in the context how the technologies developed for the wireless infrastructure market apply to HPC and where this may lead in the future. We will focus on recent efforts towards efficient ports of state-of-the-art dense linear algebra libraries to single-core, multi-core and multi-DSP architectures, and will propose open research lines and possibilities for future DSP generations and applications.

  4. Matthias Hochsteger, Analysis and Scientific Computing, TU Wien, Thursday 22.5.2014, 11:00:

    A Memory-Efficient Reduce-Scatter Algorithm and application in DG-FEM on CUDA GPUs

    In Discontinuous Galerkin Finite Element Methods (DG-FEM) most of the computational effort is spent on the evaluation of polynomials and its inverse operation, polynomial interpolation. This problems translate to matrix-vector multiplications with a constant matrix. Using a hierarchical, orthogonal polynomial basis it is possible to calculate the matrix entries instead of loading them, which reduces the gap between processor and memory speeds. This approach leads to the reduction of values across threads on the GPU. For this purpose, a pipelined reduce-scatter algorithm with constant memory requirements and optimal bandwidth usage was developed for CUDA GPUs.

  5. Jyrki Katajainen, University of Copenhagen (DIKU), April 2014

  6. Vanessa End, T-Systems, Germany. Tuesday 11.3, 13:00:

    The Global Address Space Programming Interface and a New Allreduce Algorithm

    The Global Address Space Programming Interface (GASPI) is an API for the partitioned global address space which aims at high scalability, asynchronous communication and failure tolerance with one-sided communication patterns. One of the means for achieving this are the split-phase collectives and the timeout mechanism, through which the user is capable of overlapping communication and computation in a very application specific manner. For these split-phase collectives we ideally have an underlying algorithm, which has a small number of communication rounds and at the same time involves all processes in every step. We have put our emphasis on the allreduce operation and have developed a new algorithm based on the n-way dissemination algorithm, which addresses exactly these two issues.

  7. Manfred Mücke, Sustainable Computing Research. Thursday, 6.3.2014, 13:15:

    NeuroPhire - Analysing and Optimising Millisecond-Scale Molecular Dynamics Simulation on Intel Xeon Phi Architectures

    The aim of the NeuroPhire project is characterising and increasing the performance of the molecular dynamics simulation code GROMACS when executing models relevant to proteine-mebrane interaction on Xeon Phi architectures. Specifically, it aims at accelerating respective simulations to achieve biological simulation times in the order of milliseconds within acceptable power budgets. NeuroPhire has won access to the BEACON cluster at the National Institute for Computational Sciences in Oak Ridge (11,520 accelerator cores, was #1 on the Nov 2012 Green500).

    This talk will discuss

    • experiments suitable to evaluate the use of the GROMACS PME autotune feature
    • experiments suitable to evaluate the use of the GROMACS domain load balancing feature (-dlb)
    • experiments suitable to evaluate the FFTw performance portability across different GROMACS configurations (cores, domains, PME nodes)
    • possible methods to effectively tailor FFTw library performance to the requirements of GROMACS when executing specific workloads/models.

    Project Background: Molecular dynamics (MD) simulation is a key technology to understanding of biological processes (via in-silico experiments). Protein-membrane interaction is central to many regulation schemes in the human body. Respective simulation models comprise in the order of tens of thousands of atoms and require biological simulation times in the order of milliseconds. Current computer infrastructure does not lend itself well to these models as efficient scaling beyond several thousand cores has not been demonstrated yet. One of the dominant MD codes in use today is GROMACS. Recent work has identified both communication and FFT as key bottlenecks to efficient execution of GROMACS. The Intel Xeon Phi seems well suited to improve efficiency of current MD Proteine-mebrane simulation models due to its improved floating-point SIMD units and higher on-chip integration (60 cores). Through an open project proposal contest, Sustainable Computing Research has gained access to the BEACON cluster at National Institute for Computational Sciences in Oak Ridge ( Beacon features 768 conventional cores and 11,520 accelerator cores that provide over 210 TFLOP/s of combined computational performance, 12 TB of system memory, 1.5 TB of coprocessor memory, and over 73 TB of SSD storage, in aggregate.

  8. Christoph Kessler, Linköpings Universitet, Thursday, 23. January, 10:15:

    Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks

    We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or malleable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data driven way. A stream of data flows through the tasks and intermediate results are forwarded on-chip to other tasks. In this presentation we introduce Crown Scheduling, a novel technique for the combined optimization of resource allocation, mapping and discrete voltage/frequency scaling for malleable streaming task sets in order to optimize energy efficiency given a throughput constraint. We present optimal off-line algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). Our energy model considers both static idle power and dynamic power consumption of the processor cores. Our experimental evaluation of the ILP models for a generic manycore architecture shows that at least for small and medium sized task sets even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds. - We conclude with a short outlook to the new EU FP7 project EXCESS (Execution Models for Energy-Efficient Computing Systems).

    Acknowledgements: This is joint work with Nicolas Melot (Linköping University), Patrick Eitschberger and Jörg Keller (FernUniv. in Hagen, Germany). Partly funded by VR, SeRC, and CUGS. Based on our recent paper with the same title at Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS-2013), Sep. 2013, Karlsruhe, Germany.

  9. Bryan Marker, University of Texas at Austin, Thursday, 23. January, 14:15:

    Automating the Actions of Dense Linear Algebra Experts

    After decades of studying dense linear algebra (DLA) algorithms and their implementations on various hardware architectures, experts still spend a lot of time porting the same algorithms to new architectures. Entire libraries are developed by experts re-applying their knowledge repeatedly. This is a waste of their unique and rare knowledge and time. We explore how to encode expert knowledge of the DLA core algorithms and software components in the style of Design by Transformation (DxT) to automatically generate high-performance code for distributed-memory systems. We compare our performance results to that of hand-written code, our results demonstrate knowledge reuse, and we explore how proper abstractions in domain-specific languages and domain representations are essential to this kind of work. We show results using the distributed-memory Elemental library and explain what it takes to retarget our knowledge base to generate shared-memory code with the BLIS API.


  1. Jyrki Katajainen, University of Copenhagen (DIKU), Friday, 23. August, 2013, 14:00:

    Weak heaps and friends: recent results

    A weak heap is a variant of a binary heap where for each node the heap ordering is enforced only for one of its two children. In 1993, Dutton showed that this data structure yields a simple worst-case-efficient sorting algorithm. In this paper we review the recent improvements made to the basic data structure that push the envelope even further. In addition, we look at several other applications of weak heaps. This encompasses the creation of a sorting index and the use of a weak heap as a tournament tree leading to even more efficient sorting with respect to the number of element comparisons performed. Constant-factor-optimal adaptive sorting will be available by extending the weak-heap data structure in two directions. Various improvements suggest that weak heaps can be used as an intermediate step in an efficient construction of binary heaps. For graph search and network optimization, a worst-case-efficient priority queue based on weak heaps is known to be provably better than a Fibonacci heap.

  2. Professor Veljko Milutinovic, IEEE Fellow, Thursday 2nd of May, 2013; 10:00:

    SuperComputing for BigData: DataFlow vs ControlFlow (the Programming Paradigm and the Ongoing Research)

    DataFlow computers, compared to ControlFlow computers, offer speedups of 20 to 200, and power reductions of about 20. However, the programming paradigm is different. This talk explains the paradigm, using Maxeler as an example (Maxeler is 20% owned by JPMorgan), and sheds light on the ongoing research in the field.

  3. Daniel Cederman, Chalmers, 20.-26.4.2013 (Post-PEPPHER)

  4. Arnaud Legrand, CNRS, LIG Grenoble, Thursday 4.4-Friday 5.4

    On Thursday, 4.4, 16:00:

    Fair Scheduling of Bags-of-Tasks Applications Using Distributed Lagrangian Optimization

    Large scale distributed systems typically comprise hundreds to millions of entities (applications, users, companies, universities) that have only a partial view of resources (computers, communication links). How to fairly and efficiently share such resources between entities in a distributed way has thus become a critical question. Although not all applications are suitable for execution on large scale distributed computing platform, ideal are the Bags-of-Tasks (BoT) applications. Hence a large fraction of jobs in workloads imposed on Grids is made of sequential applications submitted in the form of BoTs. Up until now, mainly simple mechanisms have been used to ensure a fair sharing of resources among these applications. Although these mechanisms have proved to be efficient for CPU-bound applications, they are known to be ineffective in the presence of network-bound applications. A possible answer resorts to Lagrangian optimization and distributed gradient descent. Under certain conditions, the resource sharing problem can be formulated as a global optimization problem, which can be solved by a distributed self-stabilizing supply and demand algorithm. In the last decade, this technique has been applied to design network protocols (variants of TCP, multi-path network protocols, wireless network protocols) and even distributed algorithms for smart grids. In this talk, I will explain how to use this technique for fairly scheduling concurrent BoT applications with arbitrary communication-to-computation ratio on a Grid. Yet, application heterogeneity raises severe convergence and stability issues that did not appear in previous contexts and need to be addressed by non-trivial modifications. The effectiveness of our proposal is assessed through an extensive set of complex and realistic simulations.

  5. Mikhail Panshenskov, Tuesday, 29.1, 14:00:

    Quality of independent tasks scheduling

    Nowadays the problem of effective scheduling becomes highly relevant. Numerous heuristics (like Min-Min, XSufferage, genetic algorithms etc.) provide relatively good solution for a NP-hard problem of scheduling. Although most of the published heuristics have two issues. Firstly, many of such algorithms do not consider inter-processor communication which becomes highly important in modern computing. Secondly, many scheduling heuristics do not guarantee the quality of scheduling whatsoever. The author will present the lower bound quality estimate for independent tasks scheduling in a distributed system with communication. As a special case for cloud computing the author will demonstrate the lower bound quality estimate for scheduling after horizontal scaling. In order to demonstrate applicability the author contributed to and used an open source project "Adaptive Parallel COmputations framework" (APCO). The project supports flexible scheduling algorithms and models including job stealing model. The author will describe few of its features.


  1. Jörg Keller, FernUniversität Hagen, Wednesday-Thursday, 12.-13.12.2012

    On Wednesday, 12.12, 13:30:

    Energy-efficient Use of Multiprocessors -- Is it more than Load-balancing?

    Wie investigate the energy-efficiency of multiprocessors when executing collections of tasks where all tasks are running simultaneously. Application scenarios for such collections are streaming applications on MPSoCs, e.g. in image or audio processing. We first treat the energy minimization in processing cores with consideration of static and dynamic power, and extend our study to include the on-chip network and the memory controllers. We present preliminary experimental results from the Intel SCC to support our hypotheses.

  2. Joachim Schöberl, Thursday 6.12, 10:00

    Parallel Solution Algorithms for Finite Element Methods

  3. Josef Weinbub, Karl Rupp, Institute for Microelectronics, TU Wien, Tuesday 28.8.2012, ViennaCL and ViennaX

    Karl Rupp:

    The High-Level Linear Algebra Library ViennaCL and Its Applications

    In order to provide simple access to the vast computing resources in graphics processing units (GPUs) for general purpose scientific computing, the open source linear algebra library ViennaCL is presented. ViennaCL is written in C++ and used like existing CPU-based linear algebra libraries, thus it can be integrated into existing code easily. Moreover, the generic implementations of algorithms such as iterative solvers allow for code reuse beyond device and library boundaries.

    Further Information:

    Josef Weinbub:

    ViennaX: A Parallel Plugin Execution Framework for Scientific Computing

    We present the free open source plugin execution framework ViennaX for modularizing and parallelizing scientific simulations. In general, functionality is abstracted by the notion of a task, which is implemented as a plugin. The plugin system facilitates the utilization of both, already available functionality as well as new implementations. Each task can define arbitrary data dependencies which are used by ViennaX to build a task graph. The framework supports the execution of this dependence graph based on the Message Passing Interface (MPI) in either a serial or a parallel fashion. The applied modular approach allows for defining highly flexible simulations, as plugins can be easily exchanged. An overview of the library is provided, as well as representative application examples from different fields of scientific computing. The exchange of the linear solver of a Finite Element simulation from the field of mechanical engineering is shown. The iterative nature of the Jacobi algorithm is used to depict the loop execution feature. Finally, a parallel unstructured mesh adaptation is discussed to assess the distributed, MPI-based plugin execution capabilities of ViennaX.

    Further Information:

  4. Ulrich Meyer, University of Frankfurt, 9.-13.7.2012

    On Tuesday, 10.7.2012, Parallel Computing Group, Favoritenstrasse 16, 11 O'clock

    I/O-efficient hierarchical diameter approximation

    Computing diameters of huge graphs is a key challenge in complex network analysis. As long as the graphs fit into main memory, diameters can be efficiently approximated (and frequently even exactly determined) using heuristics that apply a limited number of BFS traversals. If the input graphs have to be kept and processed on external storage, however, even a single BFS run may cause an unacceptable amount of time-consuming I/O-operations.

    In [SWAT08] we proposed the first parameterized diameter approximation algorithm with fewer I/Os than that required for exact BFS traversal. In recent ongoing work we derive hierarchical extensions of this randomized approach and experimentally compare their trade-offs between actually achieved running times and approximation ratios. It turns out that the hierarchical approach is frequently capable of producing surprisingly good diameter approximations in shorter time than BFS. We also provide theoretical and practical insights into worst-case input classes.

    Joint work with Deepak Ajwani and David Veith.

  5. Daniel Cederman, Chalmers, 17.6-1.7.2012 (PEPPHER)

  6. Sascha Hunold, University of Heidelberg, 10.-11.5.2012

  7. Robert van de Geijn, University of Texas at Austin, 9-11.5.2012


    Robert van de Geijn is a Professor of Computer Science and member of the Institute for Computating Engineering and Sciences at UT-Austin. He received his Ph.D. in Applied Mathematics from the University of Maryland. His interests are in linear algebra libraries, scientific computing, parallel computing, and formal derivation of programs. His FLAME project pursues how fundamental techniques from computer science support high-performance linear algebra libraries. He has written more than a hundred refereed articles and several books on this subject.

    On Thursday, 10.5.2012, Seminarroom 121, 13 O'clock:

    Design by Transformation - Application to Dense Linear Algebra Libraries

    The FLAME project has yielded modern alternatives to LAPACK and related effort. An attractive feature of this work is the complete vertical integration of the entire software stack, starting with low level kernels that support the BLAS and finishing with a new distributed memory library, Elemental. In between are layers that target a single core, multicore, and multiGPU architectures. What this now enables is a new approach where libraries are viewed not as instantiations in code but instead as a repository of algorithms, knowledge about those algorithm, and knowledge about target architectures. Representations in code are then mechanically generated by a tool that performs optimizations for a given architecture by applying high-level transformations much like a human expert would. We discuss how this has been used to mechanically generate tens of thousands of different distributed memory implementations given a single sequential algorithm. By attaching cost functions to the component operations, a highly optimized implementation is chosen by the tool. The chosen optimization invariably matches or exceeds the performance of implementations by human experts. We call the underlying approach Design by Transformation (DxT).

    This work is in collaboration with Bryan Marker, Don Batory, Jack Poulson, and Andy Terrell.

    On Friday, 11.5, Parallel Computing Group, Favoritenstrasse 16, 10-13:

    MiniTutorial: Collective Communication and Parallel Matrix Computation - Animated and in 3D

    Distributed memory dense linear algebra libraries (and many other applications) cast most if not all communication in terms of collectives (broadcast, allgather, reduce, etc.). For this reason, it is important for a practitioner to understand efficient algorithms for collective communications and their costs. Yet more than 30 years after the introduction of distributed memory architectures most still think that the Minimum Spanning Tree broadcast is the most efficient way of implementing the broadcast operation. In this minitutorial, we teach the fundamentals of which, we believe, all practitioners should be aware. We show in full animation how collective communications can be composed and how this inherently facilitates a systematic treatment of distributed memory dense linear algebra computations. In the end, the attendee will see parallel matrix multiplication in 3D.

    Lecture 1: Derivation of Dense Linear Algebra Algorithms

    We discuss how a broad class of dense linear algebra algorithms can be systematically derived using Goal Oriented Programming techniques already advocated by Dijkstra and others in the early 1970s. The approach, which we call the FLAME methodology, is highly practical: it yields, from specification of an operation, families of high-performance algorithms via an eight step process that fills out a "worksheet". The algorithm that maps best to a specific architecture can then be chosen. More than a thousand routines that are part of the libflame library (a modern alternative to LAPACK) and Elemental library (a modern alternative to ScaLAPACK) have been written by first deriving the algorithms in this fashion.

    Lecture 2: From Algorithms to Code

    The FLAME methodology derives correct algorithms. The problem is that "bugs" can be easily introduced when translating an algorithm into code. As part of the FLAME project, we have created Application Programming Interfaces (APIs) that allow the code to closely resemble the algorithm. In a matter of minutes a correct algorithm can then be translated into a correct, high-performance implementation. We demonstrate this by implementing a few common dense linear algebra algorithms into MATLAB and C code. It is these APIs that are used for the implementation of both libflame (a modern alternative to LAPACK) and Elemental (a modern alternative to ScaLAPACK). (A participant who missed Lecture 1 will have no problems understanding Lecture 2.)

    Lecture 3: Collective Communication

    Distributed memory dense linear algebra libraries (and many other applications) cast most if not all communication in terms of collective communications (broadcast, allgather, reduce, etc.). For this reason, it is important for a practitioner to understand efficient algorithms for collective communications and their costs. Yet more than 30 years after the introduction of distributed memory architectures most still think that the Minimum Spanning Tree broadcast is the most efficient way of implementing the broadcast operation. In this lecture, we teach the fundamentals of which, we believe, all practitioners should be aware.

    Lecture 4: Distributed Memory Dense Linear Algebra Principles

    In this lecture we show how dense linear algebra algorithms and collective communication combine to yield high-performance scalable dense linear algebra algorithms.

  8. Georg Struth, University of Sheffield, 19.4.2012

  9. Christian Siebert, RWTH Aachen, 26.3-30.4.2012

  10. Martti Forsell, VTT, Finland, 6-8.3.2012