name: inverse
layout: true
class: center, middle, inverse

---

# Domain-specific languages: the key to computational efficiency and programmer productivity

## Imperial College London ACM Student Chapter seminar series

### **Florian Rathgeber**, Department of Computing, Imperial College London

### February 7 2014

Slides: http://kynan.github.io/ACM2014

---
layout: false

.left-column[
## Domain-Specific Languages (DSLs)

Languages specialised to a particular application domain

In contrast to general-purpose (programming) languages
]

.right-column[
### Markup languages

* XML dialects: HTML, SVG
* Graphs, graphics: GraphViz, GrGen, PGF
* Text: Markdown, rST, Org, MediaWiki
* [remark](http://github.com/gnab/remark) used for this slide deck

### Modelling languages

* Linear algebra: MATLAB, Mathematica, Maple
* Symbolic maths: Mathematica, Maxima
* Hardware description: Verilog, VHDL
* Database queries: SQL, SparQL
* Spreadsheets: Excel / LibreOffice formulas

### Embedded DSLs

* Domain-specific language embedded in a host language (C++, Python, Ruby)
* DSL user can mix, match and extend DSL code with the host language
* DSL writer can use the lexing, parsing and syntax tools of the host language

Boundaries are not sharp: LaTeX and PostScript are Turing-complete!
]

---
class: center, middle

# Motivating example: markup languages

---

.left-column[
## This slide deck is written in [remark](http://github.com/gnab/remark)

A simple, in-browser, markdown-driven slideshow tool.

### or

A domain-specific language for writing browser-based slide decks
]

.right-column[
remark source for this slide deck:

```markdown
name: inverse
layout: true
class: center, middle, inverse

---

# Domain-specific languages: the key to computational efficiency and programmer productivity
## Imperial College London ACM Student Chapter seminar series
### **Florian Rathgeber**, Department of Computing, Imperial College London
### February 7 2014

Slides: http://kynan.github.io/ACM2014

---
layout: false

.left-column[
## Domain-Specific Languages (DSLs)

Languages specialised to a particular application domain

In contrast to general-purpose (programming) languages
]

.right-column[
...
]

---
```
]

---

## remark generated html for the title slide

```html
<!-- reconstructed sketch of the generated markup; attributes abridged -->
<div class="remark-slide-container remark-visible">
  <div class="remark-slide-scaler">
    <div class="remark-slide">
      <div class="remark-slide-content center middle inverse">
        <h1>Domain-specific languages: the key to computational efficiency and programmer productivity</h1>
        <h2>Imperial College London ACM Student Chapter seminar series</h2>
        <h3><strong>Florian Rathgeber</strong>, Department of Computing, Imperial College London</h3>
        <h3>February 7 2014</h3>
        <p>Slides: <a href="http://kynan.github.io/ACM2014">http://kynan.github.io/ACM2014</a></p>
        <div class="remark-slide-number">1 / 25</div>
      </div>
    </div>
  </div>
</div>
```

---
class: inverse, center, middle

# Domain-specific languages in computational science

---
class: center, middle

# Image processing

---

## Halide (1/2)

A domain-specific language for image processing and computational photography embedded in C++: http://halide-lang.org

* Write high-performance image processing code on modern machines
* Compiler targets: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL

Schedule for a 3x3 box filter defined as a series of two 3x1 passes:

```cpp
Func blur_3x3(Func input) {
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
  blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32)
        .vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}
```

A `Func` object represents a pipeline stage, a "computed image": a pure function defining the value of each pixel.

---

## Halide (2/2)

Single-stage imaging pipeline that outputs a grayscale diagonal gradient:

```cpp
#include <Halide.h>
#include <stdio.h>
int main(int argc, char **argv) {
  Halide::Func gradient;
  Halide::Var x, y;

  Halide::Expr e = x + y;
  gradient(x, y) = e;

  Halide::Image<int32_t> output = gradient.realize(800, 600);

  for (int j = 0; j < output.height(); j++) {
    for (int i = 0; i < output.width(); i++) {
      if (output(i, j) != i + j) {
        printf("Something went wrong!\n"
               "Pixel %d, %d was supposed to be %d, but instead it's %d\n",
               i, j, i+j, output(i, j));
        return -1;
      }
    }
  }
  return 0;
}
```

---
class: center, middle

# Stencil languages

---

.left-column[
## Pochoir

Compiler and runtime system for implementing stencil computations on multicore processors

Domain-specific language embedded in C++
]

.right-column[
### Stencils

* Define the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t.
* A stencil computation computes the stencil for each grid point over many time steps.

.center[
19-point stencil for the Lattice-Boltzmann Method (LBM)
]
]
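To see what a stencil computation involves before handing it to Pochoir, here is a naive NumPy sketch of the 2D heat update shown on the next slide (illustrative only: grid size, time steps and the coefficients `CX`, `CY` are arbitrary, and boundaries are periodic, mirroring the `mod` boundary function):

```python
import numpy as np

def heat_naive(u0, timesteps, CX=0.1, CY=0.1):
    # Apply the 5-point heat stencil to every grid point, once per time
    # step, with periodic (wrap-around) boundaries. Pochoir generates
    # cache-efficient parallel code for exactly this kind of loop nest.
    u = u0.copy()
    for _ in range(timesteps):
        u = (u
             + CX * (np.roll(u, -1, 0) - 2 * u + np.roll(u, 1, 0))
             + CY * (np.roll(u, -1, 1) - 2 * u + np.roll(u, 1, 1)))
    return u

u_final = heat_naive(np.random.rand(64, 64), timesteps=100)
```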
---

## 2D heat equation coded in Pochoir

```c
// X, Y (grid size), T (time steps) and the coefficients CX, CY
// are constants defined elsewhere.
#define mod(r,m) ((r)%(m) + (((r)<0) ? (m) : 0))

Pochoir_Boundary_2D(heat_bv, a, t, x, y)
    return a.get(t, mod(x, a.size(1)), mod(y, a.size(0)));
Pochoir_Boundary_End

int main(void) {
  Pochoir_Shape_2D five_pt_2D[] = {{1,0,0}, {0,1,0}, {0,-1,0}, {0,-1,-1}, {0,0,-1}, {0,0,1}};
  Pochoir_2D heat(five_pt_2D);
  Pochoir_Array_2D(double) u(X, Y);
  u.Register_Boundary(heat_bv);
  heat.Register_Array(u);

  Pochoir_Kernel_2D(heat_fn, t, x, y)
    u(t+1, x, y) = CX * (u(t, x+1, y) - 2 * u(t, x, y) + u(t, x-1, y))
                 + CY * (u(t, x, y+1) - 2 * u(t, x, y) + u(t, x, y-1))
                 + u(t, x, y);
  Pochoir_Kernel_End

  for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y)
      u(0, x, y) = rand();

  heat.Run(T, heat_fn);

  for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y)
      cout << u(T, x, y);

  return 0;
}
```

---
class: center, middle

# Unstructured meshes / graphs

---

## Structure of scientific computations on unstructured meshes

* Independent *local operations* for each element of the mesh, described as a *kernel*.
* *Reductions* aggregating results from local operations to produce the final result (see the plain-Python sketch below).
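A plain-Python caricature of this pattern (hypothetical names, sequential for clarity): a per-cell kernel computes a triangle's area, and a reduction increments the area lumped at each of its vertices.

```python
import numpy as np

def cell_area(coords):
    # Local operation (the "kernel"): acts on one cell's data only
    (x0, y0), (x1, y1), (x2, y2) = coords
    return 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def lumped_areas(vertex_coords, cell2vertex):
    # Reduction: per-cell results are accumulated on shared vertices
    areas = np.zeros(len(vertex_coords))
    for cell in cell2vertex:          # cell = indices of the 3 vertices
        areas[cell] += cell_area(vertex_coords[cell]) / 3.0
    return areas
```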
## PyOP2

A domain-specific language embedded in Python for parallel computations on unstructured meshes or graphs.

### Unstructured meshes

Meshes are described by *sets* and *maps* between them and can have data associated with them:

---

## PyOP2 Data Model

### Mesh topology

* Sets – cells, vertices, etc.
* Maps – connectivity between entities in different sets

### Data

* Dats – defined on sets (hold pressure, temperature, etc.)

### Kernels

* Executed in parallel on a set through a parallel loop
* Read / write / increment data accessed via maps

### Linear algebra

* Matrix sparsities defined by mappings
* Matrix data on sparsities
* Kernels compute a local matrix – PyOP2 handles global assembly

---

.left-column[
## PyOP2 Kernels and Parallel Loops

Performance portability for *any* unstructured mesh computations
]

.right-column[
### Parallel loop syntax

```python
op2.par_loop(kernel, iteration_set,
             kernel_arg1(access_mode, mapping[index]),
             ...,
             kernel_argN(access_mode, mapping[index]))
```

### PyOP2 program for computing the midpoint of a triangle

```python
from pyop2 import op2
op2.init()

vertices = op2.Set(num_vertices)
cells = op2.Set(num_cells)

cell2vertex = op2.Map(cells, vertices, 3, [...])

coordinates = op2.Dat(vertices ** 2, [...], dtype=float)
midpoints = op2.Dat(cells ** 2, dtype=float)

midpoint = op2.Kernel("""
void midpoint(double p[2], double *coords[2]) {
  p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0;
  p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0;
}""", "midpoint")

op2.par_loop(midpoint, cells,
             midpoints(op2.WRITE),
             coordinates(op2.READ, cell2vertex))
```

Future work: define kernels as Python functions (a speculative sketch follows)
]
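A speculative sketch of what that could look like (not an existing PyOP2 API; `par_loop_py` is an invented sequential stand-in for `op2.par_loop`):

```python
def midpoint(p, coords):
    # The same kernel as above, as a plain Python function; a future
    # PyOP2 could translate it to C at runtime instead of requiring a
    # C string.
    p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0
    p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0

def par_loop_py(kernel, midpoints, coordinates, cell2vertex):
    # Invented sequential driver: applies the kernel to each cell's
    # staged data, as the generated wrapper code does in C
    for n, cell in enumerate(cell2vertex):
        kernel(midpoints[n], coordinates[cell])
```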
---

## PyOP2 Architecture

* Parallel scheduling: partitioning, staging and coloring
* Runtime code generation and JIT compilation

---

## Generated sequential code calling the midpoint kernel

```c
void wrap_midpoint__(PyObject *_start, PyObject *_end,
                     PyObject *_arg0_0,
                     PyObject *_arg1_0, PyObject *_arg1_0_map0_0) {
  int start = (int)PyInt_AsLong(_start);
  int end = (int)PyInt_AsLong(_end);
  double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
  double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
  int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
  double *arg1_0_vec[3];
  for ( int n = start; n < end; n++ ) {
    // Stage pointers to the coordinates of cell n's three vertices
    arg1_0_vec[0] = arg1_0 + arg1_0_map0_0[n * 3 + 0] * 2;
    arg1_0_vec[1] = arg1_0 + arg1_0_map0_0[n * 3 + 1] * 2;
    arg1_0_vec[2] = arg1_0 + arg1_0_map0_0[n * 3 + 2] * 2;
    midpoint(arg0_0 + n * 2, arg1_0_vec);
  }
}
```

## Generated OpenMP code calling the midpoint kernel

(see next slide)

---

```c
void wrap_midpoint__(PyObject* _boffset, PyObject* _nblocks,
                     PyObject* _blkmap, PyObject* _offset, PyObject* _nelems,
                     PyObject *_arg0_0,
                     PyObject *_arg1_0, PyObject *_arg1_0_map0_0) {
  int boffset = (int)PyInt_AsLong(_boffset);
  int nblocks = (int)PyInt_AsLong(_nblocks);
  int* blkmap = (int *)(((PyArrayObject *)_blkmap)->data);
  int* offset = (int *)(((PyArrayObject *)_offset)->data);
  int* nelems = (int *)(((PyArrayObject *)_nelems)->data);
  double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
  double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
  int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
  double *arg1_0_vec[32][3];  // one staging buffer per thread
  #ifdef _OPENMP
  int nthread = omp_get_max_threads();
  #else
  int nthread = 1;
  #endif
  #pragma omp parallel shared(boffset, nblocks, nelems, blkmap)
  {
    int tid = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int __b = boffset; __b < boffset + nblocks; __b++) {
      int bid = blkmap[__b];
      int nelem = nelems[bid];
      int efirst = offset[bid];
      for (int n = efirst; n < efirst + nelem; n++) {
        arg1_0_vec[tid][0] = arg1_0 + arg1_0_map0_0[n * 3 + 0] * 2;
        arg1_0_vec[tid][1] = arg1_0 + arg1_0_map0_0[n * 3 + 1] * 2;
        arg1_0_vec[tid][2] = arg1_0 + arg1_0_map0_0[n * 3 + 2] * 2;
        midpoint(arg0_0 + n * 2, arg1_0_vec[tid]);
      }
    }
  }
}
```

---

## PyOP2 Partitioning, Staging & Coloring

### Key optimizations performed by PyOP2 runtime core

* *Partitioning* for on-chip memory (shared memory / cache)
* *Coloring* to avoid data races on updates to the same memory location

### Example

Parallel computation executing a kernel over the edges of the mesh:

```python
# Sets of nodes and edges
nodes = op2.Set(N)  # N = number of nodes
edges = op2.Set(M)  # M = number of edges

# Mapping from edges to nodes
edge_to_node_map = op2.Map(edges, nodes, 2, ...)

# Data defined on nodes
u = op2.Dat(nodes, ...)

# Kernel executing over set of edges, computing on nodal data
op2.par_loop(kernel, edges,
             u(op2.INC, edge_to_node_map))
```

---
background-image:url(images/partitioning.svg)
---
background-image:url(images/staging.svg)
---
background-image:url(images/coloring.svg)
---
class: center, middle

# Finite-element computations

---
background-image:url(images/fem.svg)

---

.left-column[
## UFL: High-level definition of finite-element forms

UFL is the [Unified Form Language](https://bitbucket.org/fenics-project/ufl) from the [FEniCS project](http://fenicsproject.org).
]

.right-column[
### The weak form of the Helmholtz equation

`$$\int_\Omega \nabla v \cdot \nabla u - \lambda v u ~dV = \int_\Omega v f ~dV$$`

### And its (almost) literal translation to Python with UFL

UFL: embedded domain-specific language (eDSL) for weak forms of partial differential equations (PDEs)

```python
e = FiniteElement('CG', 'triangle', 1)

v = TestFunction(e)
u = TrialFunction(e)
f = Coefficient(e)

lmbda = 1
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
L = v * f * dx
```
]
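An aside on where this weak form comes from (assuming boundary conditions under which the surface term vanishes, e.g. homogeneous Neumann): multiply the strong form

`$$-\nabla^2 u - \lambda u = f$$`

by a test function v, integrate over the domain and integrate the second-order term by parts:

`$$\int_\Omega -\nabla^2 u \, v ~dV = \int_\Omega \nabla v \cdot \nabla u ~dV - \oint_{\partial\Omega} v \, \nabla u \cdot n ~dS$$`

Dropping the boundary integral and carrying the λ and f terms along yields exactly the weak form above.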
---

.left-column[
## Helmholtz local assembly kernel generated by FFC

The [FEniCS Form Compiler](https://bitbucket.org/fenics-project/ffc) FFC compiles UFL forms to low-level code.
]

.right-column[
### Helmholtz equation

`$$\int_\Omega \nabla v \cdot \nabla u - \lambda v u ~dV = \int_\Omega v f ~dV$$`

### UFL expression

```python
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
```

### Generated C code

```c
// A - local tensor to assemble
// x - local coordinates
// j, k - 2D indices into the local assembly matrix
void kernel(double A[1][1], double *x[2],
            int j, int k) {
  // FE0 - Shape functions
  // Dij - Shape function derivatives
  // Kij - Jacobian inverse / determinant
  // W3  - Quadrature weights
  // det - Jacobian determinant
  for (unsigned int ip = 0; ip < 3; ip++) {
    A[0][0] += (FE0[ip][j] * FE0[ip][k] * (-1.0)
      + (((K00 * D10[ip][j] + K10 * D01[ip][j]))
        *((K00 * D10[ip][k] + K10 * D01[ip][k]))
      + ((K01 * D10[ip][j] + K11 * D01[ip][j]))
        *((K01 * D10[ip][k] + K11 * D01[ip][k])))) * W3[ip] * det;
  }
}
```
]

---

## The Firedrake/PyOP2 tool chain

---

## Two-layered abstraction: Separation of concerns

---

## Driving Finite-element Computations in Firedrake

Solving the Helmholtz equation in Python using Firedrake:

```python
from firedrake import *

# Read a mesh and define a function space
mesh = Mesh('filename')
V = FunctionSpace(mesh, "Lagrange", 1)

# Define the forcing function for the right-hand side
def force(X):
    lmbda = 1
    n = 8
    return - (lmbda + 2*(n**2)*pi**2) * sin(X[0]*pi*n) * sin(X[1]*pi*n)

f = Expression(force)

# Set up the finite-element weak forms
u = TrialFunction(V)
v = TestFunction(V)

lmbda = 1
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
L = v * f * dx

# Solve the resulting finite-element equation
p = Function(V)
solve(a == L, p)
```

---

## Finite element assembly and solve in Firedrake

What goes on behind the scenes of the `solve` call (simplified example!):

```python
import numpy as np
from pyop2 import op2, ffc_interface

def solve(equation, x):
    # Invoke FFC to generate kernels for matrix and rhs assembly
    lhs = ffc_interface.compile_form(equation.lhs, "lhs")[0]
    rhs = ffc_interface.compile_form(equation.rhs, "rhs")[0]

    # Omitted: extract coordinates (coords), connectivity (elem_node)
    # and coefficients (field f)

    # Construct OP2 matrix to assemble into
    sparsity = op2.Sparsity((elem_node, elem_node), sparsity_dim)
    mat = op2.Mat(sparsity, np.float64)
    b = op2.Dat(nodes, np.zeros(nodes.size))

    # Assemble lhs, rhs and solve linear system
    op2.par_loop(lhs, elements,
                 mat(op2.INC, (elem_node[op2.i[0]], elem_node[op2.i[1]])),
                 coords(op2.READ, elem_node))

    op2.par_loop(rhs, elements,
                 b(op2.INC, elem_node[op2.i[0]]),
                 coords(op2.READ, elem_node),
                 f(op2.READ, elem_node))

    # Solve the assembled sparse linear system of equations
    op2.solve(mat, x, b)
```

---
class: center, middle

# Recap

---

.left-column[
## Computational Science is hard

Unless you break it down with the right abstractions

Many-core hardware has brought a paradigm shift to CSE; scientific software needs to keep up
]

.right-column[
## The Solution

### High-level structure

* Goal: produce high-level interfaces to numerical computing
* PyOP2: a high-level interface to unstructured-mesh-based methods
  *Efficiently execute kernels over an unstructured grid in parallel*
* Firedrake: a performance-portable finite-element computation framework
  *Drive FE computations from a high-level problem specification*

### Low-level operations

* Separating the low-level implementation from the high-level problem specification
* Generate platform-specific implementations from a common source instead of hand-coding them
* Runtime code generation and JIT compilation open space for compiler-driven optimizations and performance portability
]

---

## Thank you!
Contact: Florian Rathgeber, [@frathgeber](https://twitter.com/frathgeber), f.rathgeber@imperial.ac.uk

### Resources

* **Halide** http://halide-lang.org
  * *[Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](http://people.csail.mit.edu/jrk/halide-pldi13.pdf)*
    Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. PLDI 2013
* **Pochoir** http://groups.csail.mit.edu/sct/wiki/index.php?title=The_Pochoir_Project
  * *[The Pochoir Stencil Compiler](http://people.csail.mit.edu/yuantang/pochoir_spaa11.pdf)*
    Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, Charles E. Leiserson. SPAA '11, 2011
* **PyOP2** https://github.com/OP2/PyOP2
  * *[PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes](http://dx.doi.org/10.1109/SC.Companion.2012.134)*
    Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicolas Loriant, David A. Ham, Carlo Bertolli, Paul H. J. Kelly. WOLFHPC 2012
* **Firedrake** https://github.com/firedrakeproject/firedrake
  * *[Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS](http://link.springer.com/chapter/10.1007/978-3-642-38750-0_21)*
    Graham R. Markall, Florian Rathgeber, Lawrence Mitchell, Nicolas Loriant, Carlo Bertolli, David A. Ham, Paul H. J. Kelly. ISC 2013
* **UFL** https://bitbucket.org/mapdes/ufl
* **FFC** https://bitbucket.org/mapdes/ffc

**This talk** is available at http://kynan.github.io/ACM2014

Slides created with [remark](http://remarkjs.com)