Generating high-performance multiplatform finite element solvers from high-level descriptions

Florian Rathgeber, Graham Markall, Nicolas Loriant, David Ham, Paul Kelly, Carlo Bertolli

Imperial College London

Lawrence Mitchell

University of Edinburgh

Mike Giles, Gihan Mudalige

University of Oxford

Istvan Reguly

Pázmány Péter Catholic University, Hungary

FEM is a versatile tool for science and engineering

Simulation of the 1993 Hokkaido-Nansei-Oki tsunami

The simulation was carried out with the Fluidity multi-phase CFD code, solving the non-hydrostatic Navier-Stokes equations with a free surface and a wetting-and-drying algorithm (courtesy of Simon Funke).

The challenge

How do we get performance portability for the finite element method without sacrificing generality?

The strategy

Get the abstractions right

... to isolate numerical methods from their mapping to hardware

Start at the top, work your way down

... as the greatest opportunities are at the highest abstraction level

Harness the power of DSLs

... for generative rather than transformative optimisations

The tools

Embedded domain-specific languages

... capture and efficiently express characteristics of the application/problem domain

Active libraries

... encapsulate specialist performance expertise and deliver domain-specific optimisations

In combination, they turn high-level problem specifications into efficient platform-specific implementations: performance portability without sacrificing generality.

The big picture

Higher level abstraction

From the equation to the finite element implementation

FFC takes equations in UFL

Helmholtz equation
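
As a concrete sketch (not reproduced from the poster), the weak form of the Helmholtz equation in UFL; the element choice and the names v, u, f, lmbda are illustrative:

    # Weak form of the Helmholtz equation in UFL 1.x (illustrative sketch)
    from ufl import (FiniteElement, TestFunction, TrialFunction,
                     Coefficient, triangle, dot, grad, dx)

    E = FiniteElement("Lagrange", triangle, 1)  # P1 on triangles
    v = TestFunction(E)
    u = TrialFunction(E)
    f = Coefficient(E)   # right-hand side
    lmbda = 1            # Helmholtz parameter

    # Find u such that a(v, u) = L(v) for all test functions v
    a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
    L = v * f * dx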

... and generates local assembly kernels

Helmholtz OP2 kernel
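
A sketch of the shape of such a kernel as PyOP2 wraps it: a C string compiled at runtime. Real FFC output is longer and the signature here is an assumption; with PyOP2's iteration spaces the kernel computes one entry (j, k) of the 3x3 element matrix per invocation:

    # Illustrative sketch only; not actual FFC output
    from pyop2 import op2

    op2.init(backend="sequential")  # a backend must be selected first

    helmholtz = op2.Kernel("""
    void helmholtz(double A[1][1], double *x[2], int j, int k) {
      /* Jacobian of the affine map from the reference triangle */
      double J_00 = x[1][0] - x[0][0], J_01 = x[2][0] - x[0][0];
      double J_10 = x[1][1] - x[0][1], J_11 = x[2][1] - x[0][1];
      double detJ = J_00 * J_11 - J_01 * J_10;
      /* ... quadrature loop accumulating the (j, k) entry of
         grad(v).grad(u) - lambda*v*u into A[0][0], scaled by detJ ... */
    }""", "helmholtz")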

Lower level abstraction

From the finite element implementation to its efficient parallel execution

OP2 – an active library for unstructured mesh computations

Abstractions for unstructured grids

Mesh computations as parallel loops

Multiple hardware backends via source-to-source translation
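
Illustrated with PyOP2, the Python incarnation of the same abstractions: sets of mesh entities, maps between them, data defined on sets, and a kernel applied to every set element in a parallel loop. All names and data below are invented, and the API follows early PyOP2 and may differ in later versions:

    # Illustrative sketch: two triangles on four nodes, computing
    # cell midpoints from vertex coordinates (early-PyOP2 API)
    import numpy as np
    from pyop2 import op2

    op2.init(backend="sequential")  # or openmp, cuda, opencl

    nodes = op2.Set(4, "nodes")              # mesh entities as sets
    cells = op2.Set(2, "cells")
    cell_node = op2.Map(cells, nodes, 3,     # connectivity as a map
                        np.array([0, 1, 2, 2, 3, 0]), "cell_node")
    coords = op2.Dat(nodes, 2,               # data defined on a set
                     np.array([(0., 0.), (1., 0.), (1., 1.), (0., 1.)]),
                     np.float64, "coords")
    midpoints = op2.Dat(cells, 2, np.zeros((2, 2)), np.float64, "midpoints")

    midpoint = op2.Kernel("""
    void midpoint(double p[2], double *x[2]) {
      p[0] = (x[0][0] + x[1][0] + x[2][0]) / 3.0;
      p[1] = (x[0][1] + x[1][1] + x[2][1]) / 3.0;
    }""", "midpoint")

    # A mesh computation expressed as a parallel loop over the cells;
    # the same loop runs unchanged on any initialised backend
    op2.par_loop(midpoint, cells,
                 midpoints(op2.IdentityMap, op2.WRITE),
                 coords(cell_node, op2.READ))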

OP2 for finite element computations

Finite element local assembly

... means computing the same kernel for every mesh entity (cell, facet)

OP2 abstracts away data marshalling and parallel execution

Global assembly and linear algebra operations

... implemented as a thin wrapper on top of backend-specific linear algebra packages:
PETSc on the CPU, Cusp on the GPU

Finite element assembly and solve in PyOP2
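
A sketch of what this looks like, continuing the invented declarations and the helmholtz kernel from the sketches above (early-PyOP2 API; the argument order of op2.solve is an assumption):

    # Global assembly and solve (illustrative sketch)
    sparsity = op2.Sparsity((cell_node, cell_node), 1, "sparsity")
    mat = op2.Mat(sparsity, np.float64, "mat")
    b = op2.Dat(nodes, 1, np.zeros(4), np.float64, "b")
    x = op2.Dat(nodes, 1, np.zeros(4), np.float64, "x")

    # The local kernel runs once per (cell, j, k) and its contributions
    # are scattered into the global matrix through the pair of maps
    op2.par_loop(helmholtz, cells(3, 3),
                 mat((cell_node[op2.i[0]], cell_node[op2.i[1]]), op2.INC),
                 coords(cell_node, op2.READ))

    # (a second par_loop, not shown, would assemble the right-hand side b)

    # Delegates to PETSc on the CPU or Cusp on the GPU
    op2.solve(mat, b, x)  # solve mat * x = b; argument order is an assumption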

UFL equations in Fluidity

For each UFL equation in each time step, Fluidity hands the form to the toolchain: FFC generates the local assembly kernels and PyOP2 executes them over the mesh on the chosen backend.

The Fluidity / UFL / PyOP2 toolchain

Preliminary performance results

Experimental setup

Solver: CG with Jacobi preconditioning, using PETSc 3.1 (PyOP2) and PETSc 3.2 (DOLFIN)

CPU: single core of an Intel Xeon E5650 Westmere (HT off), 48GB RAM

Mesh: 2D unit square meshed with triangles (200 to 204,800 elements)

DOLFIN: revision 6906, tensor representation, C++ optimisations on, form compiler optimisations off

Resources

All the code mentioned is open source and available online. Try it!

OP2 library

https://github.com/OP2/OP2-Common

PyOP2

https://github.com/OP2/PyOP2

FFC

https://code.launchpad.net/~mapdes/ffc/pyop2

This talk

https://kynan.github.com/multicore-challenge-iii
