Generating high-performance multiplatform finite element solvers from high-level descriptions

Florian Rathgeber, Graham Markall, Nicolas Loriant, David Ham, Paul Kelly, Carlo Bertolli

Imperial College London

Lawrence Mitchell

University of Edinburgh

Mike Giles, Gihan Mudalige

University of Oxford

Istvan Reguly

Pazmany Peter Catholic University, Hungary

FEM is a versatile tool for science and engineering

Simulation of the 1993 Hokkaido-Nansei-Oki tsunami

The simulation was carried out with the Fluidity multi-phase CFD code, solving the non-hydrostatic Navier-Stokes equations with a free surface and a wetting and drying algorithm (courtesy of Simon Funke).

The challenge

How do we get performance portability for the finite element method without sacrificing generality?

The strategy

Get the abstractions right

... to isolate numerical methods from their mapping to hardware

Start at the top, work your way down

... as the greatest opportunities are at the highest abstraction level

Harness the power of DSLs

... for generative rather than transformative optimisations

The tools

Embedded domain-specific languages

... capture and efficiently express characteristics of the application/problem domain

Active libraries

... encapsulate specialist performance expertise and deliver domain-specific optimisations

In combination, they

raise the level of abstraction and incorporate domain-specific knowledge
decouple problem domains from their efficient implementation on different hardware
capture design spaces and open optimisation spaces
enable reuse of code generation and optimisation expertise and tool chains

The big picture

Higher level abstraction

From the equation to the finite element implementation

FFC takes equations in UFL

Helmholtz equation

f = state.scalar_fields["Tracer"]
v = TestFunction(f)
u = TrialFunction(f)
lmbda = 1
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
L = v*f*dx
solve(a == L, f)
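Written out as a weak form, the UFL above reads as follows. The symbols $\Omega$ (the mesh domain) and $V$ (the function space of the tracer field) are introduced here for illustration and do not appear on the slide: find $u \in V$ such that $a(v, u) = L(v)$ for all $v \in V$, with

```latex
a(v, u) = \int_\Omega \left( \nabla v \cdot \nabla u - \lambda\, v\, u \right) \mathrm{d}x,
\qquad
L(v) = \int_\Omega v\, f \, \mathrm{d}x,
\qquad
\lambda = 1 .
```

The sign of the $\lambda\, v\, u$ term matches the `- lmbda * v * u` term in the UFL expression for `a`.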

... and generates local assembly kernels

Helmholtz OP2 kernel

void kernel(double A[1][1], double *x[2], int j, int k)
{
  // Kij - Entries of the inverse Jacobian (det is the Jacobian determinant)
  // FE0 - Shape functions
  // Dij - Shape function derivatives
  // W3  - Quadrature weights
  for (unsigned int ip = 0; ip < 3; ip++) {
    A[0][0] += (FE0[ip][j] * FE0[ip][k] * (-1.0)
      + (((K00 * D10[ip][j] + K10 * D01[ip][j]))
        *((K00 * D10[ip][k] + K10 * D01[ip][k]))
       + ((K01 * D10[ip][j] + K11 * D01[ip][j]))
        *((K01 * D10[ip][k] + K11 * D01[ip][k])) )) * W3[ip] * det;
  }
}
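To make the arithmetic in the kernel above concrete, here is a minimal pure-Python sketch that computes the full 3x3 local Helmholtz matrix for one triangle. It is not FFC output. It assumes linear (P1) shape functions and a three-point edge-midpoint quadrature rule; the slide only shows that three quadrature points are used, so the specific rule is an assumption. The function name `helmholtz_element_matrix` is hypothetical.

```python
def helmholtz_element_matrix(coords, lmbda=1.0):
    """Local matrix for (grad v . grad u - lmbda * v * u) on one P1 triangle.

    coords: three (x, y) vertex tuples. Assumed element: linear triangle.
    """
    (x0, y0), (x1, y1), (x2, y2) = coords

    # Jacobian of the map from the reference triangle to the physical one
    J00, J01 = x1 - x0, x2 - x0
    J10, J11 = y1 - y0, y2 - y0
    det = J00 * J11 - J01 * J10

    # Inverse Jacobian entries (the Kij of the kernel)
    K00, K01 = J11 / det, -J01 / det
    K10, K11 = -J10 / det, J00 / det

    # Reference derivatives of the P1 shape functions 1 - xi - eta, xi, eta
    d_xi = [-1.0, 1.0, 0.0]    # plays the role of D10
    d_eta = [-1.0, 0.0, 1.0]   # plays the role of D01

    # Assumed quadrature: edge midpoints, weight 1/6 each (exact for degree 2)
    points = [(0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
    weights = [1.0 / 6.0] * 3

    A = [[0.0] * 3 for _ in range(3)]
    for (xi, eta), w in zip(points, weights):
        phi = [1.0 - xi - eta, xi, eta]  # shape function values (FE0)
        for j in range(3):
            gxj = K00 * d_xi[j] + K10 * d_eta[j]
            gyj = K01 * d_xi[j] + K11 * d_eta[j]
            for k in range(3):
                gxk = K00 * d_xi[k] + K10 * d_eta[k]
                gyk = K01 * d_xi[k] + K11 * d_eta[k]
                # abs(det) keeps the result independent of vertex orientation
                A[j][k] += (gxj * gxk + gyj * gyk
                            - lmbda * phi[j] * phi[k]) * w * abs(det)
    return A


# Reference triangle with vertices (0,0), (1,0), (0,1)
A_ref = helmholtz_element_matrix([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

The kernel on the slide computes a single entry `A[0][0]` for a given pair `(j, k)`; the sketch simply loops over all pairs. Because the gradients of the shape functions sum to zero, the stiffness part contributes nothing to the sum of all entries, which therefore equals minus `lmbda` times the triangle area.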

Lower level abstraction

From the finite element implementation to its efficient parallel execution

OP2 – an active library for unstructured mesh computations

Abstractions for unstructured grids

Sets of entities (e.g. nodes, edges, faces)
Mappings between sets (e.g. from edges to nodes)
Datasets holding data on a set (i.e. fields in finite-element terms)
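The three abstractions above can be sketched in a few lines of plain Python. This is an illustration of the data model only, not the OP2 or PyOP2 API: the class names `Set`, `Map` and `Dat` and their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Set:
    """A set of mesh entities, identified only by its size."""
    name: str
    size: int


@dataclass
class Map:
    """A fixed-arity mapping from each entity of `source` to entities of `target`."""
    source: Set
    target: Set
    arity: int
    values: List[List[int]]

    def __post_init__(self):
        assert len(self.values) == self.source.size
        for row in self.values:
            assert len(row) == self.arity
            assert all(0 <= i < self.target.size for i in row)


@dataclass
class Dat:
    """Data attached to every entity of a set, `dim` values per entity."""
    dataset: Set
    dim: int
    data: List[List[float]]

    def __post_init__(self):
        assert len(self.data) == self.dataset.size
        assert all(len(entry) == self.dim for entry in self.data)


# Two triangles sharing an edge: 4 nodes, 2 elements
nodes = Set("nodes", 4)
elements = Set("elements", 2)
elem_node = Map(elements, nodes, 3, [[0, 1, 2], [1, 3, 2]])
coords = Dat(nodes, 2, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Looking up `coords.data[n]` for each `n` in `elem_node.values[e]` gives the vertex coordinates of element `e`, which is the single level of indirection that parallel loops rely on.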

Mesh computations as parallel loops

execute a kernel for all members of one set in arbitrary order
datasets accessed through at most one level of indirection
access descriptors specify which data is passed to the kernel and how it is addressed
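The parallel-loop contract above can be illustrated with a sequential pure-Python sketch. It is not the OP2 implementation: the function `par_loop`, the `(data, mapping, access)` argument tuples and the constants `READ` and `INC` are simplified, hypothetical stand-ins. The loop gathers data through one level of indirection, runs the kernel on the staged values, and scatters increments back.

```python
READ, INC = "READ", "INC"


def par_loop(kernel, set_size, *args):
    """Run `kernel` once per set member.

    Each arg is (data, mapping, access); mapping=None means direct access.
    The contract allows any iteration order; this sketch runs sequentially.
    """
    for e in range(set_size):
        staged = []
        for data, mapping, access in args:
            idx = [e] if mapping is None else mapping[e]
            if access == READ:
                staged.append([data[i] for i in idx])   # gather
            else:
                staged.append([0.0] * len(idx))          # zeroed increment buffer
        kernel(*staged)
        for (data, mapping, access), buf in zip(args, staged):
            if access == INC:
                idx = [e] if mapping is None else mapping[e]
                for i, v in zip(idx, buf):
                    data[i] += v                         # scatter-add


def edge_length_to_nodes(x, acc):
    """Kernel: give each end node half of the edge length."""
    length = abs(x[1] - x[0])
    acc[0] += 0.5 * length
    acc[1] += 0.5 * length


# Three edges on a 1D line of four nodes
edge_node = [[0, 1], [1, 2], [2, 3]]
x = [0.0, 1.0, 3.0, 6.0]
node_weight = [0.0] * 4

par_loop(edge_length_to_nodes, len(edge_node),
         (x, edge_node, READ),
         (node_weight, edge_node, INC))
```

The kernel never sees the mapping or the global arrays, only the staged local values. That separation is what lets a backend choose its own gather, scatter and scheduling strategy without changing the kernel.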

Multiple hardware backends via source-to-source translation

partitioning/colouring for efficient scheduling and execution on different hardware
currently supports CUDA and OpenMP, each combined with MPI; OpenCL and AVX support is planned
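Colouring exists to avoid write conflicts: two elements that share a node must not increment that node's data at the same time. A minimal greedy sketch of the idea follows. It is a generic textbook greedy colouring, not the scheme OP2 actually uses, and the function name `colour_elements` is hypothetical.

```python
def colour_elements(elem_node):
    """Assign each element the smallest colour not used by any element
    that shares a node with it (greedy, in input order)."""
    node_colours = {}   # node -> colours of elements already touching it
    colours = []
    for nodes in elem_node:
        used = set()
        for n in nodes:
            used |= node_colours.get(n, set())
        c = 0
        while c in used:
            c += 1
        colours.append(c)
        for n in nodes:
            node_colours.setdefault(n, set()).add(c)
    return colours


# Three triangles around node 2, plus one disconnected triangle
elem_node = [[0, 1, 2], [1, 3, 2], [3, 4, 2], [5, 6, 7]]
colours = colour_elements(elem_node)
```

Elements of the same colour touch disjoint node sets, so they can be processed concurrently without atomics; colours are then executed one after another.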

OP2 for finite element computations

Finite element local assembly

... means computing the same kernel for every mesh entity (cell, facet)

OP2 abstracts away data marshaling and parallel execution

controls whether/how/when a matrix is assembled
OP2 has the choice: assemble a sparse (CSR) matrix, or keep the local assembly matrices (local matrix approach, LMA)
local assembly kernel is translated for and efficiently executed on the target architecture
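The choice between a global CSR matrix and the local matrix approach can be sketched in pure Python: both routes give the same matrix-vector product, but LMA never forms the global matrix. This is an illustration of the two strategies only; the function names are hypothetical and none of this is OP2 code.

```python
def assemble_csr(local_mats, elem_node, n_nodes):
    """Add local matrices into a global matrix and return it in CSR form."""
    rows = [dict() for _ in range(n_nodes)]
    for Ae, nodes in zip(local_mats, elem_node):
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                rows[i][j] = rows[i].get(j, 0.0) + Ae[a][b]
    indptr, indices, data = [0], [], []
    for row in rows:
        for j in sorted(row):
            indices.append(j)
            data.append(row[j])
        indptr.append(len(indices))
    return indptr, indices, data


def csr_matvec(indptr, indices, data, x):
    """y = A x with A stored in CSR form."""
    return [sum(data[p] * x[indices[p]] for p in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]


def lma_matvec(local_mats, elem_node, x, n_nodes):
    """y = A x computed element by element from the stored local matrices."""
    y = [0.0] * n_nodes
    for Ae, nodes in zip(local_mats, elem_node):
        xe = [x[j] for j in nodes]                       # gather
        for a, i in enumerate(nodes):
            y[i] += sum(Ae[a][b] * xe[b] for b in range(len(nodes)))  # scatter-add
    return y


# 1D example: three line elements on four nodes, identical local matrices
elem_node = [[0, 1], [1, 2], [2, 3]]
local_mats = [[[1.0, -1.0], [-1.0, 1.0]] for _ in elem_node]
x = [0.0, 1.0, 4.0, 9.0]

indptr, indices, data = assemble_csr(local_mats, elem_node, 4)
y_csr = csr_matvec(indptr, indices, data, x)
y_lma = lma_matvec(local_mats, elem_node, x, 4)
```

The CSR route pays for the insertion into the global sparsity pattern once and then has a compact matrix-vector product; the LMA route skips that insertion at the cost of redundant storage and work on shared nodes. Which is faster depends on the hardware, which is why leaving the decision to the library is attractive.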

Global assembly and linear algebra operations

... implemented as a thin wrapper on top of backend-specific linear algebra packages:
PETSc on the CPU, Cusp on the GPU

Finite element assembly and solve in PyOP2

def solve(A, x, b):
    # Generate kernels for matrix and rhs assembly
    mat_code = ffc_interface.compile_form(A, "mat")
    rhs_code = ffc_interface.compile_form(b, "rhs")
    mat_kernel = op2.Kernel(mat_code, "mat_cell_integral_0_0")
    rhs_kernel = op2.Kernel(rhs_code, "rhs_cell_integral_0_0")

    # misc setup (skipped)

    # Construct OP2 matrix to assemble into
    sparsity = op2.Sparsity((elem_node, elem_node), sparsity_dim)
    mat = op2.Mat(sparsity, numpy.float64)
    f = op2.Dat(nodes, 1, f_vals, numpy.float64)

    # Assemble and solve
    op2.par_loop(mat_kernel, elements(3,3),
                 mat((elem_node[op2.i[0]], elem_node[op2.i[1]]), op2.INC),
                 coords(elem_node, op2.READ))
    op2.par_loop(rhs_kernel, elements(3),
                 b(elem_node[op2.i[0]], op2.INC),
                 coords(elem_node, op2.READ),
                 f(elem_node, op2.READ))
    op2.solve(mat, b, x)