2024-06-13
The other day I was looking into how to write NumPy gufuncs (generalized universal functions) in C, because I wanted a fast gufunc for a computation of mine. Here is what I learned.
The user guide and API documentation are pretty clear, so I won't repeat everything you need to make a (g)ufunc from C. Instead, I will tell you what isn't covered there, which I learned by reading some of the NumPy source code.
To implement a ufunc in C, be it a generalized one or not, you need to write a C function with the following signature:
static void double_matmul(char **args, const npy_intp *dimensions, const npy_intp *steps, void *data)
Here I named it double_matmul because I will do matrix multiplication with dtype double as an example. The signature is explained in the documentation. Essentially, args gives you pointers to the data held by both the incoming ndarrays and the outgoing ones; dimensions gives you the sizes of all your loops; steps tells you how far away in memory are the array cells from each other; and the argument data is unused for our purposes.
Now, the full code to the function double_matmul is this:
static void double_matmul(char **args, const npy_intp *dimensions,
                          const npy_intp *steps, void *data)
{
    /* outer loop counter and dimension */
    npy_intp i_outer;
    npy_intp d_outer = dimensions[0];
    /* matrix dimensions */
    npy_intp dm = dimensions[1], dn = dimensions[2], dp = dimensions[3];
    /* input and output data pointers */
    char *i1p = args[0], *i2p = args[1], *op = args[2];
    /* outer strides */
    npy_intp i1s = steps[0], i2s = steps[1], os = steps[2];
    /* inner strides */
    npy_intp i1s_m = steps[3], i1s_n = steps[4],
             i2s_n = steps[5], i2s_p = steps[6],
             os_m = steps[7], os_p = steps[8];
    for (i_outer = 0; i_outer < d_outer; i_outer++)
    {
        double_matmul_inner(i1p, i2p, op, dm, dn, dp,
                            i1s_m, i1s_n, i2s_n, i2s_p, os_m, os_p);
        i1p += i1s;
        i2p += i2s;
        op += os;
    }
}
Let's go over it step by step.
First of all, the signature of our gufunc is (m?,n),(n,p?)->(m?,p?); see here for what this means. So we can label our core dimensions m, n and p, and use those as suffixes in variable names to clarify what we mean.
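As an aside, this signature string is also what ties everything together when the gufunc is registered with NumPy: it goes into the last argument of PyUFunc_FromFuncAndDataAndSignature(). Here is a minimal sketch of that call, assuming the loop function double_matmul above; the usual extension module boilerplate (includes, import_array()/import_umath(), adding the resulting object to the module) is omitted, and names like func_data and matmul_gufunc are just placeholders:

/* Registration sketch, not complete module code.  The funcs/data/types
 * arrays follow the ufunc C API; the last argument is the gufunc
 * signature discussed above. */
static PyUFuncGenericFunction funcs[] = {&double_matmul};
static void *func_data[] = {NULL};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE, NPY_DOUBLE}; /* two inputs, one output */

/* inside the module initialization function: */
PyObject *matmul_gufunc = PyUFunc_FromFuncAndDataAndSignature(
    funcs, func_data, types,
    1,                   /* number of registered loops (dtype combinations) */
    2, 1,                /* nin, nout */
    PyUFunc_None,        /* identity */
    "double_matmul",     /* name */
    "matrix multiplication for doubles (sketch)",  /* docstring */
    0,                   /* unused */
    "(m?,n),(n,p?)->(m?,p?)");  /* the gufunc signature */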
Every ufunc implementation in C has an outer loop, which abstracts away all broadcasting over dimensions you don't care about, and gufuncs in particular have inner loops over their core dimensions. The first lines of code declare a counter and define a size for the outer loop.
The second thing we do is obtain the sizes of our core dimensions m, n and p, conveniently named dm, dn and dp.
Next we define the pointers to the data: in this case there are two input arrays, pointed to by i1p and i2p, and one output array, pointed to by op.
Then we obtain the outer strides, which tell us how far apart in memory the different input and output matrices are; these are used in the outer loop. We name them "input 1 stride" i1s, "input 2 stride" i2s and "output stride" os.
The inner strides, which tell us how far apart in memory the different elements inside each matrix are, are named according to their respective input/output and the dimension label. So, for example, since the second input matrix has shape (n,p), we can label its inner strides i2s_n and i2s_p. The same is done for the first input matrix and for the output matrix.
Finally, we implement a loop. But only one! This is the outer loop, which takes care of all broadcasting. Inside it we do two things: we call the double_matmul_inner() function, which performs the actual computation for each combination of input matrices, and we increment the data pointers by the outer strides.
What have we learned so far? A very convenient naming scheme for variables that can be used in any ufunc implementation in C.
Next I show you double_matmul_inner():
static void double_matmul_inner(char *i1p, char *i2p, char *op,
                                npy_intp dm, npy_intp dn, npy_intp dp,
                                npy_intp i1s_m, npy_intp i1s_n, npy_intp i2s_n, npy_intp i2s_p,
                                npy_intp os_m, npy_intp os_p)
{
    npy_intp m, n, p;                   /* loop counters in the respective core dimensions */
    npy_intp i1b_n, i2b_n, i2b_p, ob_p; /* amount to backtrack pointers */
    double tmp;
    i1b_n = i1s_n * dn;
    i2b_n = i2s_n * dn;
    i2b_p = i2s_p * dp;
    ob_p = os_p * dp;
    for (m = 0; m < dm; m++)
    {
        for (p = 0; p < dp; p++)
        {
            /* BEGIN actual computation */
            tmp = 0.0;
            for (n = 0; n < dn; n++)
            {
                tmp += *(double *)i1p * *(double *)i2p;
                i1p += i1s_n;
                i2p += i2s_n;
            }
            *((double *)op) = tmp;
            /* END actual computation */
            i1p -= i1b_n;
            i2p -= i2b_n;
            i2p += i2s_p;
            op += os_p;
        }
        i2p -= i2b_p;
        op -= ob_p;
        i1p += i1s_m;
        op += os_m;
    }
}
Here again there is some naming wisdom to be learned: we labeled our core dimensions m, n and p, so why not just use the same names for the counter variables?
Next come the backtracking amounts. These are best understood by going directly to the loop code.
Look at the innermost loop, the n one. According to the matrix multiplication formula, \[ C_{m p} = \sum_{n=1}^{d_n} A_{m n} B_{n p} \Leftrightarrow C = AB, \] we should sum the products \( A_{m n} B_{n p} \) over \( n \) to obtain one element of the output matrix. This is exactly what is being done in the first line inside this loop, where we add to the temporary double tmp the product of the doubles pointed to by i1p and i2p. We next increment the pointers by the necessary strides, labeled by n.
What happens after the end of this innermost loop? Now i1p and i2p each point one step past the last matrix element of a certain sequence (a row of the first input, a column of the second). We want to get i2p to the next column, so that we can perform the n loop again, this time to obtain a different matrix element for the output. But the stride i2s_p tells us how much to jump from the start of a column to the start of the next column, as does the stride os_p. So in order to do this, we first have to backtrack our pointer i2p, undoing all the increments that we did inside the n loop. Since we want to reuse the row from the first input in the next iteration, i1p has to be backtracked, too.
Now you understand the definitions (and names) of i1b_n and i2b_n: they tell us how much to backtrack the i1p and i2p pointers to undo the increments performed inside the n loop.
The same has to be done after one complete p loop. We have incremented, in the p loop, the pointers i2p and op, so now we need to backtrack them by i2b_p and ob_p, before performing further pointer increments.
This technique of backtracking pointers is much better than what I initially had in mind when I started writing my gufunc. What I had thought of doing was to assign multiple pointers to each matrix, update only one kind of pointer in each loop, and later use the other pointers to figure out the address to jump to after a completed loop. This would also work, but it is more complex, as the sketch below suggests. Thanks to whatever NumPy developer wrote the matmul function for the wisdom.
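For contrast, here is a rough sketch of that original idea, with one dedicated pointer per loop level that gets reset from the level above instead of backtracked. Variable names like i1p_m and i2p_p are hypothetical; this is not code I actually wrote, just an illustration of the approach:

/* Sketch of the "one pointer per loop level" alternative.  Each loop only
 * advances its own pointers and re-derives the inner ones at every
 * iteration, so no backtracking is needed. */
static void double_matmul_inner_alt(char *i1p, char *i2p, char *op,
                                    npy_intp dm, npy_intp dn, npy_intp dp,
                                    npy_intp i1s_m, npy_intp i1s_n, npy_intp i2s_n, npy_intp i2s_p,
                                    npy_intp os_m, npy_intp os_p)
{
    npy_intp m, n, p;
    char *i1p_m = i1p, *op_m = op;            /* per-row pointers, advanced in the m loop */
    for (m = 0; m < dm; m++)
    {
        char *i2p_p = i2p, *op_p = op_m;      /* per-column pointers, reset for every m */
        for (p = 0; p < dp; p++)
        {
            char *i1p_n = i1p_m, *i2p_n = i2p_p;  /* per-element pointers, reset for every p */
            double tmp = 0.0;
            for (n = 0; n < dn; n++)
            {
                tmp += *(double *)i1p_n * *(double *)i2p_n;
                i1p_n += i1s_n;
                i2p_n += i2s_n;
            }
            *(double *)op_p = tmp;
            i2p_p += i2s_p;
            op_p += os_p;
        }
        i1p_m += i1s_m;
        op_m += os_m;
    }
}

It gives the same result, but it juggles more pointer variables, which is the extra complexity I meant.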