name: inverse
layout: true
class: center, middle, inverse

---

# Domain-specific languages: the key to computational efficiency and programmer productivity

## Imperial College London ACM Student Chapter seminar series

### **Florian Rathgeber**, Department of Computing, Imperial College London

### February 7 2014

Slides: http://kynan.github.io/ACM2014

---
layout: false

.left-column[
## Domain-Specific Languages (DSLs)

Languages specialised to a particular application domain

In contrast to general-purpose (programming) languages
]

.right-column[
### Markup languages

* XML dialects: HTML, SVG
* Graphs, graphics: GraphViz, GrGen, PGF
* Text: Markdown, rST, Org, MediaWiki
* [remark](http://github.com/gnab/remark) used for this slide deck

### Modelling languages

* Linear algebra: MATLAB, Mathematica, Maple
* Symbolic maths: Mathematica, Maxima
* Hardware description: Verilog, VHDL
* Database queries: SQL, SparQL
* Spreadsheets: Excel / LibreOffice formulas

### Embedded DSLs

* Domain-specific language embedded in a host language (C++, Python, Ruby)
* DSL user can mix, match and extend DSL code with the host language
* DSL writer can use the lexing, parsing and syntax tools of the host language

Boundaries are not sharp: LaTeX and PostScript are Turing-complete!
]

---
class: center, middle

# Motivating example: markup languages

---

.left-column[
## This slide deck is written in [remark](http://github.com/gnab/remark)

A simple, in-browser, markdown-driven slideshow tool.

### or

A domain-specific language for writing browser-based slide decks
]

.right-column[
remark source for this slide deck:

```markdown
name: inverse
layout: true
class: center, middle, inverse

---

# Domain-specific languages: the key to computational efficiency and programmer productivity
## Imperial College London ACM Student Chapter seminar series
### **Florian Rathgeber**, Department of Computing, Imperial College London
### February 7 2014

Slides: http://kynan.github.io/ACM2014

---
layout: false

.left-column[
## Domain-Specific Languages (DSLs)

Languages specialised to a particular application domain

In contrast to general-purpose (programming) languages
]

.right-column[
...
]

---
```
]

---

## remark generated html for the title slide

```html
<!-- reconstructed sketch of the generated markup; attributes abridged -->
<div class="remark-slide-container remark-visible">
  <div class="remark-slide-scaler">
    <div class="remark-slide">
      <div class="remark-slide-content center middle inverse">
        <h1>Domain-specific languages: the key to computational efficiency and programmer productivity</h1>
        <h2>Imperial College London ACM Student Chapter seminar series</h2>
        <h3><strong>Florian Rathgeber</strong>, Department of Computing, Imperial College London</h3>
        <h3>February 7 2014</h3>
        <p>Slides: <a href="http://kynan.github.io/ACM2014">http://kynan.github.io/ACM2014</a></p>
        <div class="remark-slide-number">1 / 25</div>
      </div>
    </div>
  </div>
</div>
```

---
class: inverse, center, middle

# Domain-specific languages in computational science

---
class: center, middle

# Image processing

---

## Halide (1/2)

A domain-specific language for image processing and computational photography embedded in C++: http://halide-lang.org

* Write high-performance image processing code on modern machines
* Compiler targets: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL

Schedule for a 3x3 box filter defined as a series of two 3x1 passes:

```cpp
Func blur_3x3(Func input) {
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
  blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32)
        .vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}
```

A `Func` object represents a pipeline stage, a "computed image": a pure function defining the value of each pixel.

---

## Halide (2/2)

Single-stage imaging pipeline that outputs a grayscale diagonal gradient:

```cpp
#include <Halide.h>
#include <stdio.h>
int main(int argc, char **argv) {
  Halide::Func gradient;
  Halide::Var x, y;

  Halide::Expr e = x + y;
  gradient(x, y) = e;

  Halide::Image<int32_t> output = gradient.realize(800, 600);

  for (int j = 0; j < output.height(); j++) {
    for (int i = 0; i < output.width(); i++) {
      if (output(i, j) != i + j) {
        printf("Something went wrong!\n"
               "Pixel %d, %d was supposed to be %d, but instead it's %d\n",
               i, j, i+j, output(i, j));
        return -1;
      }
    }
  }
  return 0;
}
```

---
class: center, middle

# Stencil languages

---

.left-column[
## Pochoir

Compiler and runtime system for implementing stencil computations on multicore processors

Domain-specific language embedded in C++
]

.right-column[
### Stencils

* Define the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t.
* A stencil computation computes the stencil for each grid point over many time steps.

.center[
19-point stencil for the Lattice-Boltzmann Method (LBM)
]
]
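To see what a stencil computation involves before handing it to Pochoir, here is a naive NumPy sketch of the 2D heat update shown on the next slide (illustrative only: grid size, time steps and the coefficients `CX`, `CY` are arbitrary, and boundaries are periodic, mirroring the `mod` boundary function):

```python
import numpy as np

def heat_naive(u0, timesteps, CX=0.1, CY=0.1):
    # Apply the 5-point heat stencil to every grid point, once per time
    # step, with periodic (wrap-around) boundaries. Pochoir generates
    # cache-efficient parallel code for exactly this kind of loop nest.
    u = u0.copy()
    for _ in range(timesteps):
        u = (u
             + CX * (np.roll(u, -1, 0) - 2 * u + np.roll(u, 1, 0))
             + CY * (np.roll(u, -1, 1) - 2 * u + np.roll(u, 1, 1)))
    return u

u_final = heat_naive(np.random.rand(64, 64), timesteps=100)
```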
---

## 2D heat equation coded in Pochoir

```c
// X, Y (grid size), T (time steps) and the coefficients CX, CY
// are constants defined elsewhere.
#define mod(r,m) ((r)%(m) + (((r)<0) ? (m) : 0))

Pochoir_Boundary_2D(heat_bv, a, t, x, y)
    return a.get(t, mod(x, a.size(1)), mod(y, a.size(0)));
Pochoir_Boundary_End

int main(void) {
  Pochoir_Shape_2D five_pt_2D[] = {{1,0,0}, {0,1,0}, {0,-1,0}, {0,-1,-1}, {0,0,-1}, {0,0,1}};
  Pochoir_2D heat(five_pt_2D);
  Pochoir_Array_2D(double) u(X, Y);
  u.Register_Boundary(heat_bv);
  heat.Register_Array(u);

  Pochoir_Kernel_2D(heat_fn, t, x, y)
    u(t+1, x, y) = CX * (u(t, x+1, y) - 2 * u(t, x, y) + u(t, x-1, y))
                 + CY * (u(t, x, y+1) - 2 * u(t, x, y) + u(t, x, y-1))
                 + u(t, x, y);
  Pochoir_Kernel_End

  for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y)
      u(0, x, y) = rand();

  heat.Run(T, heat_fn);

  for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y)
      cout << u(T, x, y);

  return 0;
}
```

---
class: center, middle

# Unstructured meshes / graphs

---

## Structure of scientific computations on unstructured meshes

* Independent *local operations* for each element of the mesh, described as a *kernel*.
* *Reductions* aggregating results from local operations to produce the final result (see the plain-Python sketch below).
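A plain-Python caricature of this pattern (hypothetical names, sequential for clarity): a per-cell kernel computes a triangle's area, and a reduction increments the area lumped at each of its vertices.

```python
import numpy as np

def cell_area(coords):
    # Local operation (the "kernel"): acts on one cell's data only
    (x0, y0), (x1, y1), (x2, y2) = coords
    return 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def lumped_areas(vertex_coords, cell2vertex):
    # Reduction: per-cell results are accumulated on shared vertices
    areas = np.zeros(len(vertex_coords))
    for cell in cell2vertex:          # cell = indices of the 3 vertices
        areas[cell] += cell_area(vertex_coords[cell]) / 3.0
    return areas
```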
## PyOP2

A domain-specific language embedded in Python for parallel computations on unstructured meshes or graphs.

### Unstructured meshes

Meshes are described by *sets* and *maps* between them and can have data associated with them:

---

## PyOP2 Data Model

### Mesh topology

* Sets – cells, vertices, etc.
* Maps – connectivity between entities in different sets

### Data

* Dats – defined on sets (hold pressure, temperature, etc.)

### Kernels

* Executed in parallel on a set through a parallel loop
* Read / write / increment data accessed via maps

### Linear algebra

* Matrix sparsities defined by mappings
* Matrix data on sparsities
* Kernels compute a local matrix – PyOP2 handles global assembly

---

.left-column[
## PyOP2 Kernels and Parallel Loops

Performance portability for *any* unstructured mesh computations
]

.right-column[
### Parallel loop syntax

```python
op2.par_loop(kernel, iteration_set,
             kernel_arg1(access_mode, mapping[index]),
             ...,
             kernel_argN(access_mode, mapping[index]))
```

### PyOP2 program for computing the midpoint of a triangle

```python
from pyop2 import op2
op2.init()

vertices = op2.Set(num_vertices)
cells = op2.Set(num_cells)

cell2vertex = op2.Map(cells, vertices, 3, [...])

coordinates = op2.Dat(vertices ** 2, [...], dtype=float)
midpoints = op2.Dat(cells ** 2, dtype=float)

midpoint = op2.Kernel("""
void midpoint(double p[2], double *coords[2]) {
  p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0;
  p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0;
}""", "midpoint")

op2.par_loop(midpoint, cells,
             midpoints(op2.WRITE),
             coordinates(op2.READ, cell2vertex))
```

Future work: define kernels as Python functions (a speculative sketch follows)
]
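A speculative sketch of what that could look like (not an existing PyOP2 API; `par_loop_py` is an invented sequential stand-in for `op2.par_loop`):

```python
def midpoint(p, coords):
    # The same kernel as above, as a plain Python function; a future
    # PyOP2 could translate it to C at runtime instead of requiring a
    # C string.
    p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0
    p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0

def par_loop_py(kernel, midpoints, coordinates, cell2vertex):
    # Invented sequential driver: applies the kernel to each cell's
    # staged data, as the generated wrapper code does in C
    for n, cell in enumerate(cell2vertex):
        kernel(midpoints[n], coordinates[cell])
```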
---

## PyOP2 Architecture

* Parallel scheduling: partitioning, staging and coloring
* Runtime code generation and JIT compilation

---

## Generated sequential code calling the midpoint kernel

```c
void wrap_midpoint__(PyObject *_start, PyObject *_end,
                     PyObject *_arg0_0,
                     PyObject *_arg1_0, PyObject *_arg1_0_map0_0) {
  int start = (int)PyInt_AsLong(_start);
  int end = (int)PyInt_AsLong(_end);
  double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
  double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
  int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
  double *arg1_0_vec[3];
  for ( int n = start; n < end; n++ ) {
    // Stage pointers to the coordinates of cell n's three vertices
    arg1_0_vec[0] = arg1_0 + arg1_0_map0_0[n * 3 + 0] * 2;
    arg1_0_vec[1] = arg1_0 + arg1_0_map0_0[n * 3 + 1] * 2;
    arg1_0_vec[2] = arg1_0 + arg1_0_map0_0[n * 3 + 2] * 2;
    midpoint(arg0_0 + n * 2, arg1_0_vec);
  }
}
```

## Generated OpenMP code calling the midpoint kernel

(see next slide)

---

```c
void wrap_midpoint__(PyObject* _boffset, PyObject* _nblocks,
                     PyObject* _blkmap, PyObject* _offset, PyObject* _nelems,
                     PyObject *_arg0_0,
                     PyObject *_arg1_0, PyObject *_arg1_0_map0_0) {
  int boffset = (int)PyInt_AsLong(_boffset);
  int nblocks = (int)PyInt_AsLong(_nblocks);
  int* blkmap = (int *)(((PyArrayObject *)_blkmap)->data);
  int* offset = (int *)(((PyArrayObject *)_offset)->data);
  int* nelems = (int *)(((PyArrayObject *)_nelems)->data);
  double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
  double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
  int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
  double *arg1_0_vec[32][3];  // one staging buffer per thread
  #ifdef _OPENMP
  int nthread = omp_get_max_threads();
  #else
  int nthread = 1;
  #endif
  #pragma omp parallel shared(boffset, nblocks, nelems, blkmap)
  {
    int tid = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int __b = boffset; __b < boffset + nblocks; __b++) {
      int bid = blkmap[__b];
      int nelem = nelems[bid];
      int efirst = offset[bid];
      for (int n = efirst; n < efirst + nelem; n++) {
        arg1_0_vec[tid][0] = arg1_0 + arg1_0_map0_0[n * 3 + 0] * 2;
        arg1_0_vec[tid][1] = arg1_0 + arg1_0_map0_0[n * 3 + 1] * 2;
        arg1_0_vec[tid][2] = arg1_0 + arg1_0_map0_0[n * 3 + 2] * 2;
        midpoint(arg0_0 + n * 2, arg1_0_vec[tid]);
      }
    }
  }
}
```

---

## PyOP2 Partitioning, Staging & Coloring

### Key optimizations performed by PyOP2 runtime core

* *Partitioning* for on-chip memory (shared memory / cache)
* *Coloring* to avoid data races on updates to the same memory location

### Example

Parallel computation executing a kernel over the edges of the mesh:

```python
# Sets of nodes and edges
nodes = op2.Set(N)  # N = number of nodes
edges = op2.Set(M)  # M = number of edges

# Mapping from edges to nodes
edge_to_node_map = op2.Map(edges, nodes, 2, ...)

# Data defined on nodes
u = op2.Dat(nodes, ...)

# Kernel executing over set of edges, computing on nodal data
op2.par_loop(kernel, edges,
             u(op2.INC, edge_to_node_map))
```

---
background-image:url(images/partitioning.svg)
---
background-image:url(images/staging.svg)
---
background-image:url(images/coloring.svg)
---
class: center, middle

# Finite-element computations

---
background-image:url(images/fem.svg)

---

.left-column[
## UFL: High-level definition of finite-element forms

UFL is the [Unified Form Language](https://bitbucket.org/fenics-project/ufl) from the [FEniCS project](http://fenicsproject.org).
]

.right-column[
### The weak form of the Helmholtz equation

`$$\int_\Omega \nabla v \cdot \nabla u - \lambda v u ~dV = \int_\Omega v f ~dV$$`

### And its (almost) literal translation to Python with UFL

UFL: embedded domain-specific language (eDSL) for weak forms of partial differential equations (PDEs)

```python
e = FiniteElement('CG', 'triangle', 1)

v = TestFunction(e)
u = TrialFunction(e)
f = Coefficient(e)

lmbda = 1
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
L = v * f * dx
```
]
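An aside on where this weak form comes from (assuming boundary conditions under which the surface term vanishes, e.g. homogeneous Neumann): multiply the strong form

`$$-\nabla^2 u - \lambda u = f$$`

by a test function v, integrate over the domain and integrate the second-order term by parts:

`$$\int_\Omega -\nabla^2 u \, v ~dV = \int_\Omega \nabla v \cdot \nabla u ~dV - \oint_{\partial\Omega} v \, \nabla u \cdot n ~dS$$`

Dropping the boundary integral and carrying the λ and f terms along yields exactly the weak form above.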
---

.left-column[
## Helmholtz local assembly kernel generated by FFC

The [FEniCS Form Compiler](https://bitbucket.org/fenics-project/ffc) FFC compiles UFL forms to low-level code.
]

.right-column[
### Helmholtz equation

`$$\int_\Omega \nabla v \cdot \nabla u - \lambda v u ~dV = \int_\Omega v f ~dV$$`

### UFL expression

```python
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
```

### Generated C code

```c
// A - local tensor to assemble
// x - local coordinates
// j, k - 2D indices into the local assembly matrix
void kernel(double A[1][1], double *x[2],
            int j, int k) {
  // FE0 - Shape functions
  // Dij - Shape function derivatives
  // Kij - Jacobian inverse / determinant
  // W3  - Quadrature weights
  // det - Jacobian determinant
  for (unsigned int ip = 0; ip < 3; ip++) {
    A[0][0] += (FE0[ip][j] * FE0[ip][k] * (-1.0)
      + (((K00 * D10[ip][j] + K10 * D01[ip][j]))
        *((K00 * D10[ip][k] + K10 * D01[ip][k]))
      + ((K01 * D10[ip][j] + K11 * D01[ip][j]))
        *((K01 * D10[ip][k] + K11 * D01[ip][k])))) * W3[ip] * det;
  }
}
```
]

---

## The Firedrake/PyOP2 tool chain

---

## Two-layered abstraction: Separation of concerns

---

## Driving Finite-element Computations in Firedrake

Solving the Helmholtz equation in Python using Firedrake:

```python
from firedrake import *

# Read a mesh and define a function space
mesh = Mesh('filename')
V = FunctionSpace(mesh, "Lagrange", 1)

# Define the forcing function for the right-hand side
def force(X):
    lmbda = 1
    n = 8
    return - (lmbda + 2*(n**2)*pi**2) * sin(X[0]*pi*n) * sin(X[1]*pi*n)

f = Expression(force)

# Set up the finite-element weak forms
u = TrialFunction(V)
v = TestFunction(V)

lmbda = 1
a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
L = v * f * dx

# Solve the resulting finite-element equation
p = Function(V)
solve(a == L, p)
```

---

## Finite element assembly and solve in Firedrake

What goes on behind the scenes of the `solve` call (simplified example!):

```python
import numpy as np
from pyop2 import op2, ffc_interface

def solve(equation, x):
    # Invoke FFC to generate kernels for matrix and rhs assembly
    lhs = ffc_interface.compile_form(equation.lhs, "lhs")[0]
    rhs = ffc_interface.compile_form(equation.rhs, "rhs")[0]

    # Omitted: extract coordinates (coords), connectivity (elem_node)
    # and coefficients (field f)

    # Construct OP2 matrix to assemble into
    sparsity = op2.Sparsity((elem_node, elem_node), sparsity_dim)
    mat = op2.Mat(sparsity, np.float64)
    b = op2.Dat(nodes, np.zeros(nodes.size))

    # Assemble lhs, rhs and solve linear system
    op2.par_loop(lhs, elements,
                 mat(op2.INC, (elem_node[op2.i[0]], elem_node[op2.i[1]])),
                 coords(op2.READ, elem_node))

    op2.par_loop(rhs, elements,
                 b(op2.INC, elem_node[op2.i[0]]),
                 coords(op2.READ, elem_node),
                 f(op2.READ, elem_node))

    # Solve the assembled sparse linear system of equations
    op2.solve(mat, x, b)
```

---
class: center, middle

# Recap

---

.left-column[
## Computational Science is hard

Unless you break it down with the right abstractions

Many-core hardware has brought a paradigm shift to CSE; scientific software needs to keep up
]

.right-column[
## The Solution

### High-level structure

* Goal: produce high-level interfaces to numerical computing
* PyOP2: a high-level interface to unstructured-mesh-based methods
  *Efficiently execute kernels over an unstructured grid in parallel*
* Firedrake: a performance-portable finite-element computation framework
  *Drive FE computations from a high-level problem specification*

### Low-level operations

* Separating the low-level implementation from the high-level problem specification
* Generate platform-specific implementations from a common source instead of hand-coding them
* Runtime code generation and JIT compilation open space for compiler-driven optimizations and performance portability
]

---

## Thank you!
Contact: Florian Rathgeber, [@frathgeber](https://twitter.com/frathgeber), f.rathgeber@imperial.ac.uk

### Resources

* **Halide** http://halide-lang.org
  * *[Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](http://people.csail.mit.edu/jrk/halide-pldi13.pdf)*
    Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. PLDI 2013
* **Pochoir** http://groups.csail.mit.edu/sct/wiki/index.php?title=The_Pochoir_Project
  * *[The Pochoir Stencil Compiler](http://people.csail.mit.edu/yuantang/pochoir_spaa11.pdf)*
    Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, Charles E. Leiserson. SPAA '11, 2011
* **PyOP2** https://github.com/OP2/PyOP2
  * *[PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes](http://dx.doi.org/10.1109/SC.Companion.2012.134)*
    Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicolas Loriant, David A. Ham, Carlo Bertolli, Paul H. J. Kelly. WOLFHPC 2012
* **Firedrake** https://github.com/firedrakeproject/firedrake
  * *[Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS](http://link.springer.com/chapter/10.1007/978-3-642-38750-0_21)*
    Graham R. Markall, Florian Rathgeber, Lawrence Mitchell, Nicolas Loriant, Carlo Bertolli, David A. Ham, Paul H. J. Kelly. ISC 2013
* **UFL** https://bitbucket.org/mapdes/ufl
* **FFC** https://bitbucket.org/mapdes/ffc

**This talk** is available at http://kynan.github.io/ACM2014

Slides created with [remark](http://remarkjs.com)