PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes

Florian Rathgeber1, Graham Markall1, Lawrence Mitchell3, Nicolas Loriant1, David Ham1,2, Carlo Bertolli1, Paul Kelly1

1 Department of Computing

2 Grantham Institute for Climate Change

Imperial College London

3 EPCC, University of Edinburgh

Computational science is hard

Radical paradigm shifts in CSE due to many-core computing

Generative (meta)programming is the solution

The challenge

How do we get performance-portable finite element solvers that are efficient, generic and easy to use in the hands of domain scientists?

The strategy

Get the abstractions right

... to isolate numerical methods from their mapping to hardware

Start at the top, work your way down

... and make decisions at the highest abstraction level possible

Harness the power of DSLs

... for generative, instead of transformative optimisations

The tools

Embedded domain-specific languages

... capture and efficiently express characteristics of the application/problem domain

Runtime code generation

... encapsulates specialist expertise to deliver problem- and platform-specific optimisations

Just-in-time (JIT) compilation

... makes problem-specific generated code transparently available to the application at runtime
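
The common pattern behind all three tools can be shown in a few lines: specialise a C kernel at runtime from a problem parameter, compile it, and load it back into the running program. A minimal, self-contained sketch in Python (not PyOP2's actual code generator; the kernel, file handling and the assumption of a `cc` compiler on the path are invented for illustration):

    import array
    import ctypes
    import os
    import subprocess
    import tempfile

    def jit_scale_kernel(factor):
        """Generate, compile and load a C kernel specialised to `factor`."""
        src = """
        void scale(double *x, int n) {
            for (int i = 0; i < n; ++i)
                x[i] *= %.17g;  /* constant baked in at code-generation time */
        }
        """ % factor
        tmpdir = tempfile.mkdtemp()
        c_file = os.path.join(tmpdir, "kernel.c")
        so_file = os.path.join(tmpdir, "kernel.so")
        with open(c_file, "w") as f:
            f.write(src)
        subprocess.check_call(["cc", "-O2", "-shared", "-fPIC",
                               "-o", so_file, c_file])
        lib = ctypes.CDLL(so_file)
        lib.scale.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
        return lib.scale

    # The generated code is transparently available to the application:
    data = array.array("d", [1.0, 2.0, 3.0])
    buf = (ctypes.c_double * len(data)).from_buffer(data)
    jit_scale_kernel(2.5)(buf, len(data))
    print(list(data))  # [2.5, 5.0, 7.5]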

Tool chain overview

An expert for each layer

Higher level abstraction

From the equation to the finite element implementation

FFC1 takes UFL2 equations

The weak form of the Helmholtz equation
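
In standard notation (sign and parameter conventions assumed here): find u in a function space V such that

    \int_\Omega \nabla v \cdot \nabla u - \lambda v u \,\mathrm{d}X
        = \int_\Omega v f \,\mathrm{d}X \qquad \forall v \in V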

The Helmholtz equation is expressed in UFL as follows:
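
A sketch using 2012-era UFL (the element choice, coefficient names and the value of lmbda are assumptions for illustration, not taken from the talk):

    # Helmholtz weak form in UFL; P1 elements on triangles are assumed.
    from ufl import (Coefficient, FiniteElement, TestFunction,
                     TrialFunction, dot, dx, grad, triangle)

    E = FiniteElement("Lagrange", triangle, 1)
    v = TestFunction(E)
    u = TrialFunction(E)
    f = Coefficient(E)
    lmbda = 1

    a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx  # bilinear form
    L = v * f * dx                                    # linear form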

1 FFC is the FEniCS Form Compiler, 2 UFL is the Unified Form Language from the FEniCS project

... and generates local assembly kernels

Helmholtz OP2 kernel
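
The generated code is a plain C function executed once per element, wrapped in a PyOP2 Kernel object. A hand-written sketch of its shape (the real FFC output also tabulates basis functions and quadrature weights; the signature follows the 2012-era PyOP2 demos and is illustrative):

    from pyop2 import op2
    op2.init(backend="sequential")

    # Hypothetical local-assembly kernel: computes one entry of the
    # 3x3 element matrix for a P1 triangle, addressed by (j, k).
    helmholtz = op2.Kernel("""
    void helmholtz(double A[1][1], double *x[2], int j, int k) {
        /* quadrature loop accumulating
           grad(phi_j) . grad(phi_k) - lambda * phi_j * phi_k into A */
    }
    """, "helmholtz")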

Lower level abstraction

From the finite element implementation to its efficient parallel execution

PyOP2 – a high-level framework for unstructured mesh computations

Abstractions for unstructured meshes

Parallel computations on mesh entities in PyOP2

Mesh computations as parallel loops

Multiple hardware backends via runtime code generation
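
In PyOP2 terms, Sets describe mesh entities, Maps between Sets encode the topology, Dats hold data on Sets, and par_loop applies a kernel to every entity of a Set. A minimal sketch against the 2012-era API (sizes, data and the kernel are invented; exact signatures may differ in later PyOP2 versions):

    import numpy as np
    from pyop2 import op2
    op2.init(backend="sequential")

    # Topology: 3 edges over 4 vertices; each edge maps to 2 vertices.
    vertices = op2.Set(4, "vertices")
    edges = op2.Set(3, "edges")
    edge2vertex = op2.Map(edges, vertices, 2, [0, 1, 1, 2, 2, 3],
                          "edge2vertex")

    # Data: one double per vertex, initially zero.
    u = op2.Dat(vertices, 1, np.zeros(4), np.float64, "u")

    # Kernel run once per edge; arguments arrive via the indexed map.
    touch = op2.Kernel("""
    void touch(double *v0, double *v1) { *v0 += 1.0; *v1 += 1.0; }
    """, "touch")

    op2.par_loop(touch, edges,
                 u(edge2vertex[0], op2.INC),
                 u(edge2vertex[1], op2.INC))

Which backend executes the loop (sequential, OpenMP, CUDA, OpenCL) is a runtime choice; the parallel loop itself is backend-agnostic.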

PyOP2 for FE computations

Finite element local assembly

... means computing the same kernel for every mesh entity (cell, facet): a perfect match for the PyOP2 abstraction

PyOP2 abstracts away data marshalling and parallel execution

Global assembly and linear algebra

... implemented as a thin wrapper on top of backend-specific linear algebra packages: PETSc4py on the CPU, Cusp on the GPU

Finite element assembly and solve in PyOP2
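
A condensed sketch of the whole sequence in the 2012-era API, reusing the helmholtz kernel sketched above and an assumed rhs kernel (mesh, names and signatures are from memory and illustrative):

    import numpy as np
    from pyop2 import op2
    op2.init(backend="sequential")

    # A one-triangle "mesh": 1 element over 3 nodes (illustrative).
    nodes = op2.Set(3, "nodes")
    elements = op2.Set(1, "elements")
    elem_node = op2.Map(elements, nodes, 3, [0, 1, 2], "elem_node")

    coords = op2.Dat(nodes, 2,
                     np.array([[0., 0.], [1., 0.], [0., 1.]]),
                     np.float64, "coords")
    f = op2.Dat(nodes, 1, np.ones(3), np.float64, "f")
    b = op2.Dat(nodes, 1, np.zeros(3), np.float64, "b")
    x = op2.Dat(nodes, 1, np.zeros(3), np.float64, "x")

    # Matrix over the sparsity induced by the element-to-node map.
    sparsity = op2.Sparsity((elem_node, elem_node), 1, "sparsity")
    mat = op2.Mat(sparsity, np.float64, "mat")

    # Assembly: `helmholtz` and `rhs` stand for FFC-generated kernels;
    # op2.i[] indexes the local iteration space of the element matrix.
    op2.par_loop(helmholtz, elements(3, 3),
                 mat((elem_node[op2.i[0]], elem_node[op2.i[1]]), op2.INC),
                 coords(elem_node, op2.READ))
    op2.par_loop(rhs, elements,
                 b(elem_node, op2.INC),
                 coords(elem_node, op2.READ),
                 f(elem_node, op2.READ))

    # Solve delegates to the backend linear algebra: PETSc via
    # petsc4py on the CPU, Cusp on the GPU.
    op2.solve(mat, b, x)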

Interfacing PyOP2 to Fluidity

Fluidity

Backward-facing step

Interfacing PyOP2

UFL equations in Fluidity

For each UFL equation in each time step:

Fluidity-UFL-PyOP2 toolchain

Preliminary benchmarks

Measure total time to solution for 100 time steps of an advection-diffusion test case; matrix/vector re-assembled every time step.

Solver

CG with Jacobi preconditioning using PETSc 3.3 (PyOP2), 3.2 (DOLFIN)

Host machine

2x Intel Xeon E5650 Westmere 6-core (HT off), 48GB RAM

GPU

NVIDIA GeForce GTX680 (Kepler)

Mesh

2D unit square meshed with triangles (200 to 204,800 elements)

DOLFIN

Revision 7122, Tensor representation, CPP optimisations on

Conclusions & future work

Conclusions

Future work

Resources

PyOP2

https://github.com/OP2/PyOP2

FFC

https://code.launchpad.net/~mapdes/ffc/pyop2

Fluidity

https://code.launchpad.net/~fluidity-core/fluidity/pyop2

Benchmarks

https://github.com/OP2/PyOP2_benchmarks

This talk

https://kynan.github.com/wolfhpc2012
