Talks and Guests
2019
Nikita Koval, IST Austria, 9.1.–4.3.2019
2018

Trevor Brown, IST Austria, 14.6.2018
Getting to the root of concurrent search trees
Time and Place: June 14th, 10:00 am, Hörsaal EI6 (old EI), Gusshausstrasse 27, 6th floor
Many systems rely on optimistic concurrent search trees for multicore scalability. In principle, optimistic trees have a simple performance story: searches are read-only and run in parallel, with writes to shared memory occurring only when modifying the data structure. However, in practice, obtaining the full performance benefits of optimistic search trees is not so simple.
In this talk, I will first briefly describe the major developments in concurrent search trees, providing a high-level overview of the most popular implementations. Then, I will describe some unexpected performance differences between BSTs with similar tree structures and search implementations, which can be traced to subtle performance-degrading interactions of BSTs with systems software and hardware subsystems. This talk should be of interest to those curious about concurrent search trees, systems, and experimentation.
2017

Martti Forsell, VTT, Finland, 29.5.–28.6.2017
The Thick Control Flow Model and Architectural Support for It
Time and Place: June 22, 10:00 am, Favoritenstr. 16, 3rd floor
The recently invented thick control flow (TCF) model packs together an unbounded number of fibers, thread-like computational entities, flowing through the same control path. This promises to simplify parallel programming by partially eliminating looping and artificial thread arithmetic. In this presentation we introduce the TCF model and outline an architecture for efficiently executing programs written for the model. The outlined architecture features scalable latency hiding via replication of instructions, radical synchronization cost reduction via a wave-based synchronization mechanism, and improved low-level parallelism exploitation via chaining of functional units. Replication of instructions is supported by a dynamic multithreading-like mechanism, which saves the fiber-wise data into special replicated register blocks. The model and architecture provide programmers with a compact, unbounded notation for fibers and groups of fibers, together with strong synchronous shared-memory algorithmics. According to early evaluations, the model inspires strong but simple parallel programming language design, and the architecture is able to efficiently handle workloads featuring computational elements with the same control flow, independently of the number of elements. In turn, this pays off as improved performance and lower power consumption due to the elimination of redundant parts of computation and machinery.
2016
Karl Fürlinger, Ludwig-Maximilians-Universität München
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorithms
Time and Place: October 18, 10:00 am, Favoritenstr. 16, 3rd floor
DASH is a realization of the partitioned global address space (PGAS) programming model in the form of a C++ template library. DASH offers distributed data structures with flexible data distribution schemes and implements a PGAS model by relying on a one-sided communication substrate which is accessed through an intermediate runtime layer called DART (the DASH runtime). DASH can be used within a shared-memory node as well as between nodes, and it provides an iterator-based interface that is similar to the data containers of the C++ Standard Template Library (STL). To support the development of applications that exploit a hierarchical organization, either on the algorithmic or on the hardware level, DASH features the notion of teams that are arranged in a hierarchy. Based on a team hierarchy, the DASH data structures support locality iterators as a generalization of the conventional local/global distinction found in many PGAS approaches.
Roger Kowalewski, Ludwig-Maximilians-Universität München
Debugging Synchronization Errors in MPI-3 One-Sided Applications
Time and Place: October 18, 11:00 am, Favoritenstr. 16, 3rd floor
MPI Remote Memory Access (RMA) allows one process to specify all communication parameters for both the sending and the receiving side by providing support for asynchronous reads and updates of distributed shared data. While such an interface can be highly efficient, proper synchronization of possibly conflicting accesses to shared data is a challenging and error-prone task. We present a tool that supports developers in finding latent synchronization errors. It exploits the semantic flexibility given by the MPI standard and applies various heuristics to dynamically schedule pessimistic executions. An experimental evaluation reveals that this approach forces a manifestation of latent synchronization bugs which would otherwise likely go unnoticed.
Alfons Laarman, TU Wien
Title: A DFS-based Parallel Algorithm for SCC Decomposition
Time and Place: October 11, 11:00 am, Favoritenstr. 16, 3rd floor
Reif’s complexity result is often cited to demonstrate that algorithms based on DFS are (probably) inherently sequential. However, for many applications the result does not apply directly, since Reif considered lexicographical DFS, whereas many applications of DFS merely require a non-lexicographical search order. In other words, the freedom in ordering adjacent vertices could yield efficient parallelization. Indeed, it is unknown whether non-lexicographical DFS is efficiently parallelizable. With a novel algorithm for parallel SCC decomposition, we demonstrate that in practice such DFS-based solutions are feasible. In particular, we show that a DFS solution outperforms existing fixpoint approaches. Contrary to the latter approaches, our algorithm is also on-the-fly and, with a fixed number of processors, has a better runtime complexity.
Louis-Claude Canon, University of Franche-Comté, France
Title: Controlling and Assessing Correlations of Cost Matrices in Heterogeneous Scheduling
Time and Place: March 4, 10:00 am, Favoritenstr. 16, 3rd floor
This talk considers the problem of allocating independent tasks to unrelated machines so as to minimize the maximum completion time. Testing heuristics for this problem requires the generation of cost matrices that specify the execution time of each task on each machine. Numerous studies have shown that task and machine heterogeneity are among the properties impacting heuristic performance the most. This study focuses on orthogonal properties: the average correlations between each pair of rows and each pair of columns, which serve as a proximity measure with respect to uniform instances. Cost matrices generated with a novel generation method show the effect of these correlations on the performance of several heuristics from the literature. In particular, EFT performance depends on whether the tasks are more correlated than the machines, and HLPT performs best when both correlations are close to one.
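As an illustration of the correlation properties the abstract refers to, the following sketch computes the average Pearson correlation over all pairs of rows (tasks) and all pairs of columns (machines) of a cost matrix. The function names and the row/column convention are illustrative assumptions, not taken from the talk, and constant rows or columns are assumed absent.

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)        # assumes neither sequence is constant

def avg_row_correlation(costs):
    """Average Pearson correlation over all pairs of rows (tasks)."""
    pairs = [(i, j) for i in range(len(costs)) for j in range(i + 1, len(costs))]
    return sum(pearson(costs[i], costs[j]) for i, j in pairs) / len(pairs)

def avg_col_correlation(costs):
    """Average Pearson correlation over all pairs of columns (machines)."""
    return avg_row_correlation(list(zip(*costs)))
```

A cost matrix whose rows are scalar multiples of each other yields row and column correlations of 1, the regime in which, per the abstract, HLPT performs best.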

Henning Meyerhenke, Karlsruhe Institute of Technology, Germany, 25.1.2016–28.1.2016
Title: Scalable algorithms for analyzing large networks with NetworKit
Time and Place: 26.1.2016, 11:00 a.m., Favoritenstr. 16, 3rd floor
We present NetworKit, an open-source software toolkit for efficient graph algorithms and the analysis of massive complex networks. Complex networks are equally attractive and challenging targets for data mining. Novel algorithmic solutions, including parallelization, are required to handle data sets containing billions of connections. NetworKit is a hybrid combining the performance of kernels written in C++ with a convenient Python frontend. The package targets shared-memory platforms with OpenMP support. Two selected algorithm classes from NetworKit will be presented in some detail: those for community detection and for dynamic betweenness centrality approximation. In comparison with related software, we propose NetworKit as a package geared towards large networks and satisfying three important criteria: high performance, interactive workflows, and integration into an ecosystem of tested tools for data analysis and scientific computation.
Based on joint work with Elisabetta Bergamini, Alex Sazonovs, and Christian L. Staudt.
Louis-Claude Canon, University of Franche-Comté, France, 15.2.2016–27.3.2016
2015

Michael Axtmann, Karlsruhe Institute of Technology, Karlsruhe, 23.11.2015–27.11.2015, talk 23.11.2015 at 13:00:
Practical Massively Parallel Sorting
Previous parallel sorting algorithms do not scale to the largest available machines, since they have either prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viable compromise and overcome this gap both in theory and in practice. The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort. In particular, our sample sort variant turns out to be very scalable both in theory and practice, where it scales up to 2^15 MPI processes with outstanding performance, in particular for medium-sized inputs. Some tools we develop may be of independent interest: a simple, practical, and flexible sorting algorithm for very small inputs; a near-linear-time optimal algorithm for solving a constrained bin packing problem; and an algorithm for data delivery that guarantees a small number of message startups on each processor.
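To make the sample sort scheme concrete, here is a minimal single-process sketch with oversampling: sample, pick splitters, partition, sort buckets. It is a hedged illustration of the general idea only, not the multi-level distributed algorithm of the talk, and the parameter names are invented.

```python
import random
from bisect import bisect_right

def sample_sort(data, k=4, oversampling=8, small=32):
    """Recursive single-process sketch of sample sort with oversampling.
    k is the number of buckets; splitters come from a sorted random sample."""
    if len(data) <= small:
        return sorted(data)                  # base-case sorter for very small inputs
    # Draw a random sample, sort it, and take k - 1 equally spaced splitters.
    sample = sorted(random.choices(data, k=oversampling * k))
    splitters = [sample[oversampling * i] for i in range(1, k)]
    # Route every element to its bucket by binary search over the splitters.
    buckets = [[] for _ in range(k)]
    for x in data:
        buckets[bisect_right(splitters, x)].append(x)
    if max(len(b) for b in buckets) == len(data):
        return sorted(data)                  # degenerate sample (e.g. many duplicates)
    # Each bucket could now be sorted on its own processor; here we recurse.
    return [y for b in buckets for y in sample_sort(b, k, oversampling, small)]
```

In the distributed setting each bucket corresponds to a processor group, and the oversampling factor controls how well the buckets are balanced.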

Christian Schulz, 1.9.2015–29.2.2016

François Tessier, Inria, Bordeaux, Friday 31.7.2015, at 12:00(!):
Placement techniques to improve data locality in parallel applications
With hardware topologies becoming ever more complex and the computing-power needs of parallel applications growing, the issue of data locality becomes central: how to reduce the distance between a processing entity and the data it needs to access? Application placement is one of the levers to address this problem. In this talk, we present the TreeMatch algorithm and its application to static mapping, that is, placement at application launch time, as well as to dynamic placement. For this second approach, we add data-locality awareness to a load-balancing algorithm. These two approaches are validated by experiments both on benchmark codes and on real applications. Finally, we present recent work on data locality in parallel I/O, targeting the selection of aggregators in charge of gathering the pieces of data and writing them to disk.

Martin Schatz, University of Texas at Austin / Sandia National Labs; Tuesday 7.4.–Friday 10.4.2015
On Tuesday, 7.4.2015, 14:00:
Formalizing Distributed Tensor Computations
Recently, data models have become more complex, requiring multidimensional representations and multilinear algebra to meaningfully express and compute with such data. Tensors are often the structures used for such representations, and tensor operations are often used for the necessary computation. In fields such as computational chemistry, these datasets become too large to compute on a single processing node and thus require leveraging distributed-memory systems to perform the necessary computation. Through decades of research, insights in the area of dense linear algebra, a special case of multilinear algebra, have led to the development of libraries that achieve high performance on classes of distributed-memory architectures. Due to the added complexity of tensor operations, stemming from the arbitrary number of dimensions tensors can represent, it becomes daunting to reason about factors such as data distribution, communication overhead, and algorithmic choice when creating high-performance distributed-memory libraries for tensor operations. In this work, we define a notation that generalizes key insights from research in dense linear algebra, formalizing arbitrary-dimensional tensor data distributed on arbitrary-dimensional processing meshes, and redistribution operations in terms of well-studied collective communications. Additionally, we define a systematic process for deriving a class of high-performance algorithms for the tensor contraction operation on distributed-memory architectures, and we show how the defined notation can express communication optimizations that domain experts utilize to improve the performance of their applications.
Martin Schatz is a fifth-year Ph.D. student supervised by Tamara G. Kolda of Sandia National Laboratories, Livermore, and Robert van de Geijn of the University of Texas at Austin. He is currently a Sandia National Laboratories Graduate Fellowship awardee. His background is in Computer Science and Chemistry (Bachelor of Science in both).

Christian Schulz, Monday–Friday, 16.–20.2.2015
On Monday, 16.2.2015, 14:00:
Partitioning for Complex Networks
Processing large complex networks like social networks or web graphs has recently attracted considerable interest. To do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners, originally developed for more regular mesh-like networks, do not work well for these networks. This work addresses the problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable to both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsest graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very big. As an example, our algorithm partitions a web graph with 3.3G edges in 16 seconds using 512 cores of a high-performance cluster while producing a high-quality partition; none of the competing systems can handle this graph on our system.
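A toy version of the size-constrained label propagation building block can be sketched as follows: every node repeatedly adopts the most frequent label among its neighbours, but only if the target block has room. The sequential sweep, the tie-breaking, and the max_size parameter are simplifying assumptions for illustration, not the parallel multilevel algorithm of the talk.

```python
from collections import Counter

def label_propagation(adj, max_size, rounds=10):
    """Size-constrained label propagation sketch.
    adj: {node: [neighbors]}. Returns a {node: label} block assignment."""
    label = {v: v for v in adj}            # every node starts as its own block
    size = Counter(label.values())
    for _ in range(rounds):
        changed = False
        for v in adj:
            counts = Counter(label[u] for u in adj[v])
            for cand, _ in counts.most_common():
                if cand == label[v]:
                    break                  # already in its best block
                if size[cand] < max_size:  # the size constraint
                    size[label[v]] -= 1
                    size[cand] += 1
                    label[v] = cand
                    changed = True
                    break
        if not changed:
            break                          # converged
    return label
```

On two triangles joined by a single edge and max_size 3, the sketch recovers the two triangles as blocks; in the multilevel setting the resulting blocks would become the vertices of the coarser graph.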

Jörg Keller, FernUniversität Hagen, Monday–Wednesday, 19.–21.1.2015
On Tuesday, 20.1.2015, 14:00:
Heuristics for energy-efficient and fault-tolerant scheduling of task graphs
Deadline scheduling of task graphs is a well-known optimization problem that has been researched in depth for decades. In the last decade, the advent of frequency scaling in processors has extended the problem to scheduling for a certain deadline with minimum energy consumption. We present a stepwise heuristic for this problem that we evaluate with the help of a benchmark suite of small task graphs for which optimal solutions can be computed with linear programming. We extend the problem to include fault tolerance, i.e., to overcome failure of processing elements by task duplication. We present several trade-offs between performance degradation and energy investment in the fault case that allow users to adapt the schedule to their preferences. The fault-tolerance strategies are evaluated with the same benchmark suite as before.
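The deadline/energy trade-off that frequency scaling creates can be illustrated with a toy single-task model: running w cycles at frequency f takes w/f time, while power has a static part and a dynamic part growing roughly cubically in f. The cubic model and the constants p_static and c are illustrative assumptions; the heuristic from the talk is not reproduced here.

```python
def energy(w, f, p_static=0.5, c=1.0):
    """Energy for w cycles at frequency f: (P_static + c*f^3) * (w / f)."""
    return (p_static + c * f ** 3) * (w / f)

def best_level(w, deadline, levels, p_static=0.5, c=1.0):
    """Pick the discrete frequency level that meets the deadline at least energy.
    Low frequencies waste static energy (long runtime); high frequencies waste
    dynamic energy, so the cheapest feasible level is not always the lowest."""
    feasible = [f for f in levels if w / f <= deadline]
    if not feasible:
        raise ValueError("deadline cannot be met even at the highest level")
    return min(feasible, key=lambda f: energy(w, f, p_static, c))
```

With levels [0.5, 1.0, 1.5, 2.0] and one unit of work, a loose deadline of 2.0 makes the lowest level cheapest, while a tight deadline of 0.8 forces level 1.5, the cheapest of the two feasible levels.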
2014
Jost Berthold, Københavns Universitet, Danmark (University of Copenhagen, Denmark), Thursday, 4.12.2014, 15:00
Parallel Functional Programming
(a biased overview focused on Haskell)

With the advent of manycore processors and GPGPUs, parallel and concurrent programming have undergone a veritable renaissance and have become common among application programmers at all levels. Today, even common smartphones operate on at least four cores. At the same time, our society is increasingly knowledge-based and depends on parallel processing to meet its computational demands. Efficient large-scale data processing becomes more and more important for modern knowledge-based industries like banks and social networks.
However, identifying, allocating and controlling parallelism in massively parallel hardware is challenging, and programmers need to "think parallel" from the start in order to effectively exploit modern multicores and networked architectures. Functional languages, such as Haskell with its parallelism extensions, are commonly considered convenient tools to reason about and prototype parallel programs without getting lost in implementation details.
In this research talk, we will provide an overview of functional concepts for parallelism, with a special focus on parallel Haskell dialects, and then discuss a selection of common parallelisation patterns and skeletons and their implementations in the parallel Haskell dialect Eden. We argue that the functional programming paradigm is a good match for the requirement of "parallel thinking", and that parallel Haskell provides good support for both skeleton usage and lower-level implementation. The functional paradigm enables programmers to analyse and design applications for parallel computing from the start, and a rich type system and libraries provide excellent support for prototyping and exploring algorithm designs in a structured mathematical way.
Biography: Dr. rer. nat. Jost Berthold is Assistant Professor (Adjunkt) at DIKU, the Department of Computer Science at the University of Copenhagen. His core expertise lies in functional programming techniques for parallel computing and covers all aspects: programming models, language and library design, implementation techniques, tool support, and research into patterns of parallelism (skeletons). The focus of Dr. Berthold's practical work is parallel Haskell dialects; his conceptual work on parallel patterns and skeletons links up to different application areas. After postdoctoral positions at Microsoft Research Cambridge, in St Andrews, and at DIKU, Dr. Berthold now works at the core of the HIPERFIT research centre, applying advanced programming language technology (DSLs and parallel functional programming) to problems from financial mathematics.

Elias Wimmer, Wilfried Gansterer, Universität Wien, Tuesday 28.10.2014, 14:00
Towards implementing randomized reduction algorithms with MPI
Randomized communication patterns are not very common in HPC codes. MPI point-to-point communication tends to enforce fixed communication patterns, as both sender and receiver must participate in order to communicate. With MPI one-sided communication, implementation of randomized algorithms is a lot easier, but sustained performance is, in our experience, rather poor. In this talk we discuss strategies for implementing a randomized iterative reduction algorithm with MPI. Implementations based on one-sided communication are compared to implementations based on point-to-point communication. Moreover, use cases for new features in MPI-3 are discussed.
We are looking forward to discussing various open questions related to our implementations.

Francisco D. Igual, Universidad Complutense de Madrid, Friday 30.5.2014, 10:00:
The New Role of the DSP in the HPC Arena
As the High Performance Computing industry looks to mobile processors for power-saving opportunities in exascale systems, a related but overlooked segment of the communications market may prove to be a better hunting ground for such devices. Communications has seen explosive growth since the first 3G data communications, and now the race to 4G and video delivery continues to push wireless infrastructure forward at exponential rates. Wireless OEMs have been pushing for lower power consumption in base stations and connectivity networks for over a decade, and the resulting devices have evolved from simple DSPs to powerful multicore technologies with excellent power/compute ratios.
In this talk we will review how TI’s multicore devices have evolved into potential HPC compute engines and how they differ from existing HPC processors. We will look at this in the context of how the technologies developed for the wireless infrastructure market apply to HPC, and where this may lead in the future. We will focus on recent efforts towards efficient ports of state-of-the-art dense linear algebra libraries to single-core, multicore and multi-DSP architectures, and will propose open research lines and possibilities for future DSP generations and applications.

Matthias Hochsteger, Analysis and Scientific Computing, TU Wien, Thursday 22.5.2014, 11:00:
A Memory-Efficient Reduce-Scatter Algorithm and Its Application in DGFEM on CUDA GPUs
In Discontinuous Galerkin Finite Element Methods (DGFEM), most of the computational effort is spent on the evaluation of polynomials and its inverse operation, polynomial interpolation. These problems translate to matrix-vector multiplications with a constant matrix. Using a hierarchical, orthogonal polynomial basis, it is possible to calculate the matrix entries instead of loading them, which reduces the gap between processor and memory speeds. This approach leads to the reduction of values across threads on the GPU. For this purpose, a pipelined reduce-scatter algorithm with constant memory requirements and optimal bandwidth usage was developed for CUDA GPUs.
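The reduce-scatter collective itself can be illustrated with the classic ring algorithm, simulated here among p "processes": after p - 1 steps, process i holds chunk i of the elementwise sum of all vectors. This is a generic textbook variant for intuition only, not the constant-memory pipelined CUDA algorithm of the talk.

```python
def ring_reduce_scatter(vectors):
    """Simulate a ring reduce-scatter among p 'processes'.
    vectors: p equal-length lists whose length is a multiple of p.
    Returns, per process, its chunk of the elementwise sum."""
    p = len(vectors)
    chunk = len(vectors[0]) // p
    buf = [list(v) for v in vectors]          # working buffer per process

    def block(i):
        return slice(i * chunk, (i + 1) * chunk)

    for s in range(p - 1):
        # Each process i sends its partial block (i - s - 1) mod p to its ring
        # successor; all sends of a step happen "simultaneously".
        sent = [buf[i][block((i - s - 1) % p)] for i in range(p)]
        for i in range(p):
            incoming = sent[(i - 1) % p]      # block (i - s - 2) mod p, partially reduced
            start = block((i - s - 2) % p).start
            for k, x in enumerate(incoming):
                buf[i][start + k] += x        # accumulate into the local partial sum
    return [buf[i][block(i)] for i in range(p)]
```

Each block travels once around the ring and gathers one contribution per hop, so the algorithm needs only constant extra memory per process, the property that also motivates the GPU variant above.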

Jyrki Katajainen, University of Copenhagen (DIKU), April 2014

Vanessa End, T-Systems, Germany. Tuesday 11.3.2014, 13:00:
The Global Address Space Programming Interface and a New Allreduce Algorithm
The Global Address Space Programming Interface (GASPI) is an API for the partitioned global address space which aims at high scalability, asynchronous communication, and failure tolerance with one-sided communication patterns. Among the means for achieving this are the split-phase collectives and the timeout mechanism, through which the user is able to overlap communication and computation in a very application-specific manner. For these split-phase collectives we ideally want an underlying algorithm which has a small number of communication rounds and at the same time involves all processes in every step. We have put our emphasis on the allreduce operation and have developed a new algorithm based on the n-way dissemination algorithm, which addresses exactly these two issues.

Manfred Mücke, Sustainable Computing Research. Thursday, 6.3.2014, 13:15:
NeuroPhire: Analysing and Optimising Millisecond-Scale Molecular Dynamics Simulation on Intel Xeon Phi Architectures
The aim of the NeuroPhire project is to characterise and increase the performance of the molecular dynamics simulation code GROMACS when executing models relevant to protein-membrane interaction on Xeon Phi architectures. Specifically, it aims at accelerating the respective simulations to achieve biological simulation times on the order of milliseconds within acceptable power budgets. NeuroPhire has won access to the BEACON cluster at the National Institute for Computational Sciences in Oak Ridge (11,520 accelerator cores; #1 on the November 2012 Green500).
This talk will discuss
 experiments suitable to evaluate the use of the GROMACS PME autotune feature
 experiments suitable to evaluate the use of the GROMACS domain load balancing feature (dlb)
 experiments suitable to evaluate the FFTW performance portability across different GROMACS configurations (cores, domains, PME nodes)
 possible methods to effectively tailor FFTW library performance to the requirements of GROMACS when executing specific workloads/models.
Project Background: Molecular dynamics (MD) simulation is a key technology for understanding biological processes (via in-silico experiments). Protein-membrane interaction is central to many regulation schemes in the human body. The respective simulation models comprise on the order of tens of thousands of atoms and require biological simulation times on the order of milliseconds. Current computer infrastructure does not lend itself well to these models, as efficient scaling beyond several thousand cores has not been demonstrated yet. One of the dominant MD codes in use today is GROMACS. Recent work has identified both communication and FFT as key bottlenecks to efficient execution of GROMACS. The Intel Xeon Phi seems well suited to improve the efficiency of current MD protein-membrane simulation models due to its improved floating-point SIMD units and higher on-chip integration (60 cores). Through an open project proposal contest, Sustainable Computing Research has gained access to the BEACON cluster at the National Institute for Computational Sciences in Oak Ridge (https://www.nics.tennessee.edu/beacon). Beacon features 768 conventional cores and 11,520 accelerator cores that provide over 210 TFLOP/s of combined computational performance, 12 TB of system memory, 1.5 TB of coprocessor memory, and over 73 TB of SSD storage, in aggregate.

Christoph Kessler, Linköpings Universitet, Thursday, 23.1.2014, 10:15:
Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks
We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or malleable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data-driven way. A stream of data flows through the tasks and intermediate results are forwarded on-chip to other tasks. In this presentation we introduce crown scheduling, a novel technique for the combined optimization of resource allocation, mapping, and discrete voltage/frequency scaling for malleable streaming task sets in order to optimize energy efficiency given a throughput constraint. We present optimal offline algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). Our energy model considers both the static idle power and the dynamic power consumption of the processor cores. Our experimental evaluation of the ILP models for a generic manycore architecture shows that, at least for small and medium-sized task sets, even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds. We conclude with a short outlook on the new EU FP7 project EXCESS (Execution Models for Energy-Efficient Computing Systems).
Acknowledgements: This is joint work with Nicolas Melot (Linköping University), Patrick Eitschberger and Jörg Keller (FernUniversität in Hagen, Germany). Partly funded by VR, SeRC, and CUGS. Based on our recent paper with the same title at the Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2013), Sep. 2013, Karlsruhe, Germany.

Bryan Marker, University of Texas at Austin, Thursday, 23.1.2014, 14:15:
Automating the Actions of Dense Linear Algebra Experts
After decades of studying dense linear algebra (DLA) algorithms and their implementations on various hardware architectures, experts still spend a lot of time porting the same algorithms to new architectures. Entire libraries are developed by experts reapplying their knowledge repeatedly. This is a waste of their unique and rare knowledge and time. We explore how to encode expert knowledge of the DLA core algorithms and software components in the style of Design by Transformation (DxT) to automatically generate high-performance code for distributed-memory systems. We compare our performance results to those of hand-written code, our results demonstrate knowledge reuse, and we explore how proper abstractions in domain-specific languages and domain representations are essential to this kind of work. We show results using the distributed-memory Elemental library and explain what it takes to retarget our knowledge base to generate shared-memory code with the BLIS API.
2013

Jyrki Katajainen, University of Copenhagen (DIKU), Friday, 23.8.2013, 14:00:
Weak heaps and friends: recent results
A weak heap is a variant of a binary heap where, for each node, the heap ordering is enforced only for one of its two children. In 1993, Dutton showed that this data structure yields a simple worst-case-efficient sorting algorithm. In this talk we review the recent improvements made to the basic data structure that push the envelope even further. In addition, we look at several other applications of weak heaps. This encompasses the creation of a sorting index and the use of a weak heap as a tournament tree, leading to even more efficient sorting with respect to the number of element comparisons performed. Constant-factor-optimal adaptive sorting becomes available by extending the weak-heap data structure in two directions. Various improvements suggest that weak heaps can be used as an intermediate step in an efficient construction of binary heaps. For graph search and network optimization, a worst-case-efficient priority queue based on weak heaps is known to be provably better than a Fibonacci heap.
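The basic data structure can be sketched in a few lines. The following is an illustrative rendering of the textbook presentation of Dutton's WeakHeapSort (max-heap orientation, with an explicit reverse-bit array), not the refined variants surveyed in the talk.

```python
def weak_heap_sort(a):
    """Dutton-style WeakHeapSort (max-heap orientation).
    r[j] is the reverse bit of node j: the children of node i sit at
    2*i + r[i] (left) and 2*i + 1 - r[i] (right); the heap order is
    only maintained towards each node's right subtree."""
    a = list(a)
    n = len(a)
    if n < 2:
        return a
    r = [0] * n

    def join(i, j):
        # One comparison: restore a[i] >= maximum of j's subtree.
        if a[j] > a[i]:
            a[i], a[j] = a[j], a[i]
            r[j] ^= 1

    def d_ancestor(j):
        # Climb while j is a left child; return the distinguished ancestor.
        while (j & 1) == r[j >> 1]:
            j >>= 1
        return j >> 1

    for j in range(n - 1, 0, -1):        # construction: exactly n - 1 comparisons
        join(d_ancestor(j), j)
    for m in range(n - 1, 1, -1):
        a[0], a[m] = a[m], a[0]          # move the current maximum into place
        x = 1                            # descend the left spine of node 1 ...
        while 2 * x + r[x] < m:
            x = 2 * x + r[x]
        while x > 0:                     # ... and merge upwards into the root
            join(0, x)
            x >>= 1
    a[0], a[1] = a[1], a[0]
    return a
```

Since each join costs a single element comparison, the comparison count stays close to the information-theoretic bound, which is the starting point for the talk's tournament-tree refinements.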

Professor Veljko Milutinovic, IEEE Fellow, Thursday, 2.5.2013, 10:00:
SuperComputing for BigData: DataFlow vs ControlFlow (the Programming Paradigm and the Ongoing Research)
DataFlow computers, compared to ControlFlow computers, offer speedups of 20 to 200, and power reductions of about 20. However, the programming paradigm is different. This talk explains the paradigm, using Maxeler as an example (Maxeler is 20% owned by JPMorgan), and sheds light on the ongoing research in the field.

Daniel Cederman, Chalmers, 20.–26.4.2013 (Post-PEPPHER)

Arnaud Legrand, CNRS, LIG Grenoble, Thursday 4.4.–Friday 5.4.2013
On Thursday, 4.4.2013, 16:00:
Fair Scheduling of Bags-of-Tasks Applications Using Distributed Lagrangian Optimization
Large-scale distributed systems typically comprise hundreds to millions of entities (applications, users, companies, universities) that have only a partial view of the resources (computers, communication links). How to fairly and efficiently share such resources between entities in a distributed way has thus become a critical question. Although not all applications are suitable for execution on large-scale distributed computing platforms, Bags-of-Tasks (BoT) applications are ideal candidates. Hence, a large fraction of the jobs in workloads imposed on grids consists of sequential applications submitted in the form of BoTs. Up until now, mainly simple mechanisms have been used to ensure a fair sharing of resources among these applications. Although these mechanisms have proved to be efficient for CPU-bound applications, they are known to be ineffective in the presence of network-bound applications. A possible answer resorts to Lagrangian optimization and distributed gradient descent. Under certain conditions, the resource-sharing problem can be formulated as a global optimization problem, which can be solved by a distributed self-stabilizing supply-and-demand algorithm. In the last decade, this technique has been applied to design network protocols (variants of TCP, multipath network protocols, wireless network protocols) and even distributed algorithms for smart grids. In this talk, I will explain how to use this technique for fairly scheduling concurrent BoT applications with arbitrary communication-to-computation ratios on a grid. Yet application heterogeneity raises severe convergence and stability issues that did not appear in previous contexts and that need to be addressed by non-trivial modifications. The effectiveness of our proposal is assessed through an extensive set of complex and realistic simulations.
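The supply-and-demand idea can be illustrated for a single shared resource with logarithmic utilities: each entity's demand follows from maximizing its utility minus the price of its allocation, and a dual gradient step adjusts the price by the excess demand. The step size, iteration count, and single-resource setting are illustrative assumptions, far simpler than the multi-resource BoT scheduling problem of the talk.

```python
def proportional_share(weights, capacity, step=0.01, iters=20000):
    """Supply-and-demand sketch for one shared resource of size `capacity`.
    Entity i maximizes w_i * log(x_i) - price * x_i, so its demand is
    w_i / price; the price follows a gradient step on the dual problem.
    Converges to the proportionally fair allocation x_i = w_i * C / sum(w)."""
    price = 1.0
    for _ in range(iters):
        demand = [w / price for w in weights]
        price += step * (sum(demand) - capacity)  # raise price on excess demand
        price = max(price, 1e-9)                  # keep the price positive
    return [w / price for w in weights]
```

Note that each entity only needs the current price to compute its own demand, which is what makes the scheme distributable; the price itself converges to sum(weights) / capacity.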

Mikhail Panshenskov, Tuesday, 29.1.2013, 14:00:
Quality of Independent Task Scheduling
Nowadays the problem of effective scheduling has become highly relevant. Numerous heuristics (such as Min-Min, XSufferage, and genetic algorithms) provide relatively good solutions to the NP-hard problem of scheduling. However, most of the published heuristics have two issues. Firstly, many such algorithms do not consider interprocessor communication, which has become highly important in modern computing. Secondly, many scheduling heuristics do not guarantee the quality of the schedule whatsoever. The author will present a lower-bound quality estimate for independent task scheduling in a distributed system with communication. As a special case for cloud computing, the author will demonstrate the lower-bound quality estimate for scheduling after horizontal scaling. To demonstrate applicability, the author contributed to and used the open-source project "Adaptive Parallel COmputations framework" (APCO). The project supports flexible scheduling algorithms and models, including a job-stealing model. The author will describe a few of its features.
2012

Jörg Keller, FernUniversität Hagen, Wednesday-Thursday, 12.-13.12.2012
On Wednesday, 12.12, 13:30:
Energy-efficient Use of Multiprocessors - Is it more than Load-balancing?
We investigate the energy efficiency of multiprocessors when executing collections of tasks where all tasks are running simultaneously. Application scenarios for such collections are streaming applications on MPSoCs, e.g. in image or audio processing. We first treat energy minimization in processing cores, taking both static and dynamic power into account, and extend our study to include the on-chip network and the memory controllers. We present preliminary experimental results from the Intel SCC to support our hypotheses.
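The static-vs-dynamic power trade-off mentioned above can be sketched with a common (assumed, not the speaker's) model: power P(f) = p_static + c*f^3, and a task of `work` cycles running for work/f seconds at frequency f. Minimizing total energy then yields a frequency below which slowing down wastes static energy, so energy efficiency is more than load balancing alone.

```python
# Back-of-the-envelope DVFS energy model; constants are arbitrary units.

def energy(f, work, p_static, c):
    # E(f) = (static power + dynamic power) * execution time
    return (p_static + c * f**3) * (work / f)

def best_frequency(p_static, c):
    # dE/df = 0 for E(f) = p_static*work/f + c*work*f**2 gives
    # f* = (p_static / (2*c)) ** (1/3); below f*, static energy dominates.
    return (p_static / (2 * c)) ** (1.0 / 3.0)

f_opt = best_frequency(p_static=0.5, c=1.0)
```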

Joachim Schöberl, Thursday 6.12, 10:00
Parallel Solution Algorithms for Finite Element Methods

Josef Weinbub, Karl Rupp, Institute for Microelectronics, TU Wien, Tuesday 28.8.2012, ViennaCL and ViennaX
Karl Rupp:
The High-Level Linear Algebra Library ViennaCL and Its Applications
In order to provide simple access to the vast computing resources of graphics processing units (GPUs) for general-purpose scientific computing, the open-source linear algebra library ViennaCL is presented. ViennaCL is written in C++ and is used like existing CPU-based linear algebra libraries, so it can easily be integrated into existing code. Moreover, the generic implementations of algorithms such as iterative solvers allow for code reuse beyond device and library boundaries.
Further Information: http://viennacl.sourceforge.net/
Josef Weinbub:
ViennaX: A Parallel Plugin Execution Framework for Scientific Computing
We present the free open-source plugin execution framework ViennaX for modularizing and parallelizing scientific simulations. In general, functionality is abstracted by the notion of a task, which is implemented as a plugin. The plugin system facilitates the utilization of both already available functionality and new implementations. Each task can define arbitrary data dependencies, which ViennaX uses to build a task graph. The framework supports the execution of this dependence graph, based on the Message Passing Interface (MPI), in either a serial or a parallel fashion. The modular approach allows for defining highly flexible simulations, as plugins can easily be exchanged. An overview of the library is provided, as well as representative application examples from different fields of scientific computing. The exchange of the linear solver of a finite element simulation from the field of mechanical engineering is shown. The iterative nature of the Jacobi algorithm is used to depict the loop-execution feature. Finally, a parallel unstructured mesh adaptation is discussed to assess the distributed, MPI-based plugin execution capabilities of ViennaX.
Further Information: http://viennax.sourceforge.net/
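The dependence-driven execution described above can be sketched in miniature: tasks declare the data they produce and consume, a task graph is derived, and tasks run in a dependency-respecting order. The task and data names below are hypothetical, and the plugin/MPI machinery is omitted.

```python
# Minimal task-graph scheduler (Kahn-style topological order) illustrating
# the dependence-graph idea; not ViennaX's actual API.
from collections import deque

def run_order(produces, consumes):
    """produces[t] / consumes[t]: sets of data items written / read by task t."""
    producer = {d: t for t, ds in produces.items() for d in ds}
    deps = {t: {producer[d] for d in ds} for t, ds in consumes.items()}
    indeg = {t: len(ds) for t, ds in deps.items()}
    ready = deque(t for t, n in indeg.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for u, ds in deps.items():
            if t in ds:
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)
    return order

# hypothetical FEM-style pipeline: mesh generation -> assembly -> solve
produces = {"mesh": {"grid"}, "assemble": {"matrix"}, "solve": {"solution"}}
consumes = {"mesh": set(), "assemble": {"grid"}, "solve": {"matrix"}}
order = run_order(produces, consumes)
```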

Ulrich Meyer, University of Frankfurt, 9.-13.7.2012
On Tuesday, 10.7.2012, Parallel Computing Group, Favoritenstrasse 16, 11:00
I/O-efficient hierarchical diameter approximation
Computing diameters of huge graphs is a key challenge in complex network analysis. As long as the graphs fit into main memory, diameters can be efficiently approximated (and frequently even exactly determined) using heuristics that apply a limited number of BFS traversals. If the input graphs have to be kept and processed on external storage, however, even a single BFS run may cause an unacceptable amount of time-consuming I/O operations.
In [SWAT08] we proposed the first parameterized diameter-approximation algorithm that requires fewer I/Os than exact BFS traversal. In recent ongoing work we derive hierarchical extensions of this randomized approach and experimentally compare their trade-offs between actually achieved running times and approximation ratios. It turns out that the hierarchical approach is frequently capable of producing surprisingly good diameter approximations in shorter time than BFS. We also provide theoretical and practical insights into worst-case input classes.
Joint work with Deepak Ajwani and David Veith.
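A classic in-memory instance of the BFS-based approximation the abstract builds on (the talk's contribution is making such traversals I/O-efficient) is the double sweep: run BFS from an arbitrary vertex, then again from the farthest vertex found; the second eccentricity is a lower bound on the diameter.

```python
# Double-sweep diameter lower bound via two BFS traversals (illustrative
# in-memory version; the external-memory issues are the talk's subject).
from collections import deque

def bfs_ecc(adj, src):
    """BFS from src; returns a farthest vertex and its distance."""
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    far = max(dist, key=dist.get)
    return far, dist[far]

def double_sweep(adj, start):
    far, _ = bfs_ecc(adj, start)   # sweep 1: locate a peripheral vertex
    _, lower = bfs_ecc(adj, far)   # sweep 2: bound the diameter from below
    return lower

# path graph 0-1-2-3-4: true diameter 4, found even starting mid-path
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}
```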

Daniel Cederman, Chalmers, 17.6.-1.7.2012 (PEPPHER)

Sascha Hunold, University of Heidelberg, 10.-11.5.2012

Robert van de Geijn, University of Texas at Austin, 9.-11.5.2012
Biography
Robert van de Geijn is a Professor of Computer Science and a member of the Institute for Computational Engineering and Sciences at UT Austin. He received his Ph.D. in Applied Mathematics from the University of Maryland. His interests are in linear algebra libraries, scientific computing, parallel computing, and the formal derivation of programs. His FLAME project investigates how fundamental techniques from computer science support high-performance linear algebra libraries. He has written more than a hundred refereed articles and several books on this subject.
On Thursday, 10.5.2012, Seminarroom 121, 13:00:
Design by Transformation - Application to Dense Linear Algebra Libraries
The FLAME project has yielded modern alternatives to LAPACK and related efforts. An attractive feature of this work is the complete vertical integration of the entire software stack, starting with low-level kernels that support the BLAS and finishing with a new distributed-memory library, Elemental. In between are layers that target single-core, multicore, and multi-GPU architectures. This now enables a new approach in which libraries are viewed not as instantiations in code but instead as a repository of algorithms, knowledge about those algorithms, and knowledge about target architectures. Representations in code are then mechanically generated by a tool that performs optimizations for a given architecture by applying high-level transformations much like a human expert would. We discuss how this has been used to mechanically generate tens of thousands of different distributed-memory implementations from a single sequential algorithm. By attaching cost functions to the component operations, a highly optimized implementation is chosen by the tool. The chosen implementation invariably matches or exceeds the performance of implementations by human experts. We call the underlying approach Design by Transformation (DxT).
This work is in collaboration with Bryan Marker, Don Batory, Jack Poulson, and Andy Terrell.
On Friday, 11.5, Parallel Computing Group, Favoritenstrasse 16, 10:00-13:00:
Mini-Tutorial: Collective Communication and Parallel Matrix Computation - Animated and in 3D
Distributed-memory dense linear algebra libraries (and many other applications) cast most if not all communication in terms of collectives (broadcast, allgather, reduce, etc.). For this reason, it is important for a practitioner to understand efficient algorithms for collective communication and their costs. Yet, more than 30 years after the introduction of distributed-memory architectures, many still think that the minimum-spanning-tree broadcast is the most efficient way of implementing the broadcast operation. In this mini-tutorial, we teach the fundamentals of which, we believe, all practitioners should be aware. We show in full animation how collective communications can be composed and how this inherently facilitates a systematic treatment of distributed-memory dense linear algebra computations. In the end, the attendee will see parallel matrix multiplication in 3D.
Lecture 1: Derivation of Dense Linear Algebra Algorithms
We discuss how a broad class of dense linear algebra algorithms can be systematically derived using the goal-oriented programming techniques already advocated by Dijkstra and others in the early 1970s. The approach, which we call the FLAME methodology, is highly practical: from the specification of an operation, it yields families of high-performance algorithms via an eight-step process that fills out a "worksheet". The algorithm that maps best to a specific architecture can then be chosen. More than a thousand routines that are part of the libflame library (a modern alternative to LAPACK) and the Elemental library (a modern alternative to ScaLAPACK) have been written by first deriving the algorithms in this fashion.
Lecture 2: From Algorithms to Code
The FLAME methodology derives correct algorithms. The problem is that "bugs" can easily be introduced when translating an algorithm into code. As part of the FLAME project, we have created Application Programming Interfaces (APIs) that allow the code to closely resemble the algorithm. In a matter of minutes, a correct algorithm can then be translated into a correct, high-performance implementation. We demonstrate this by implementing a few common dense linear algebra algorithms in MATLAB and C. It is these APIs that are used for the implementation of both libflame (a modern alternative to LAPACK) and Elemental (a modern alternative to ScaLAPACK). (A participant who missed Lecture 1 will have no problems understanding Lecture 2.)
Lecture 3: Collective Communication
Distributed-memory dense linear algebra libraries (and many other applications) cast most if not all communication in terms of collective communications (broadcast, allgather, reduce, etc.). For this reason, it is important for a practitioner to understand efficient algorithms for collective communication and their costs. Yet, more than 30 years after the introduction of distributed-memory architectures, many still think that the minimum-spanning-tree broadcast is the most efficient way of implementing the broadcast operation. In this lecture, we teach the fundamentals of which, we believe, all practitioners should be aware.
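The point about the minimum-spanning-tree broadcast can be made concrete with the usual alpha-beta cost model (alpha = per-message latency, beta = per-byte cost). The cost formulas below are standard textbook estimates, and the machine constants are made up for illustration: for long vectors, a scatter followed by a ring allgather moves less data per link than the MST broadcast.

```python
# Alpha-beta cost-model comparison of two broadcast algorithms on p
# processes sending n bytes (illustrative constants, standard formulas).
from math import ceil, log2

def mst_broadcast(p, n, alpha, beta):
    # log p rounds, each forwarding the full message
    return ceil(log2(p)) * (alpha + n * beta)

def scatter_allgather_broadcast(p, n, alpha, beta):
    scatter = ceil(log2(p)) * alpha + (p - 1) / p * n * beta
    allgather = (p - 1) * alpha + (p - 1) / p * n * beta   # ring allgather
    return scatter + allgather

# long message on many processes: the bandwidth term dominates and the
# scatter/allgather variant wins despite its larger latency term
p, n, alpha, beta = 64, 1e8, 1e-5, 1e-9
```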
Lecture 4: Distributed Memory Dense Linear Algebra Principles
In this lecture we show how dense linear algebra algorithms and collective communication combine to yield high-performance, scalable dense linear algebra algorithms.

Georg Struth, University of Sheffield, 19.4.2012

Christian Siebert, RWTH Aachen, 26.3.-30.4.2012

Martti Forsell, VTT, Finland, 6.-8.3.2012