
Thinking in code

The “two-language problem” is often discussed in data science and machine learning – especially by advocates of the Julia programming language. Essentially, it refers to the situation that algorithms are often prototyped in higher-level dynamic languages such as R or Python, yet, for reasons of efficiency, the final implementation is then ported to Fortran or, more commonly nowadays, C++. Thus, several versions of the same algorithm have to be implemented in different programming languages, which are either convenient for prototyping or sufficiently constrained to support an optimizing compiler.

Julia could actually solve another version of the “two-language problem” here, namely that modern ML toolboxes have become yet another programming language in their own right: first, by supporting only a restricted subset of their host language and, second, by introducing special control structures that enable better static optimizations.

Here, I want to argue that machine learning suffers from the “two-language problem” already at a more fundamental level, namely the impedance mismatch between formal mathematical notation – used when thinking about and deriving model equations – and the actual programming language used for implementation. While modern toolboxes such as PyTorch and TensorFlow readily support many matrix and tensor operations, they are still verbose compared to mathematical notation. Furthermore, they commonly require expressing control flow in addition to the desired computation, thereby hiding the actual intent of the code. Accordingly, they are rarely used as the formal language for developing and thinking about models.

What if the formal notation and programming language were just the same?

APL/J

In addition to the executability and universality emphasized in the introduction, a good notation should embody characteristics familiar to any user of mathematical notation:

• Ease of expressing constructs arising in problems.

• Suggestivity.

• Ability to subordinate detail.

• Economy.

• Amenability to formal proofs.

Kenneth E. Iverson

With this goal in mind, Kenneth Iverson (Iverson 1980) created an unambiguous and terse alternative to mathematical notation itself and used it as the basis for different programming languages. In particular, APL has also become known as Iverson notation and was taught in introductory mathematics courses as well as used by researchers as a means of communicating formal ideas. I had a serious look at APL after seeing this video showing how Conway’s Game of Life can be implemented in a single line of APL!

APL is infamous for its special single-character glyphs used to denote most of its operators. Furthermore, its semantics appears somewhat inconsistent and confusing at times. Thus, here I will use J, a more recent and open-source reincarnation of APL. In contrast to APL, J relies exclusively on ASCII characters – making it easier to type, but not necessarily to read (arguably, ASCII characters lose much of the beauty of the original notation). Most importantly, the semantics has been unified, making it substantially simpler and more consistent.

As an array programming language, the basic data structures of J are multi-dimensional arrays – also known as tensors. In turn, most operators automatically extend across array arguments in a systematic fashion, e.g.

   1 + 2
3
   1 2 3 + 4 5 6  NB. + can operate on vectors
5 7 9

Here, and in the following, user input is shown indented whereas the output of the J interpreter is aligned left. Comments in J are marked with NB. and extend to the end of the line. Most operators have a different meaning when used with one argument (monadic) or two arguments (dyadic). Furthermore, expressions should be read from right to left and all operators have the same precedence (APL/J intentionally breaks with mathematical tradition where it is slightly inconsistent; being more regular, it is actually easier to learn and often requires fewer parentheses). More precisely, operators associate to the right and accordingly 2 * 3 + 4 is interpreted as \(2 \cdot (3 + 4)\).
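
To see the evaluation order in action, here is a small check (added for illustration, using only the operators introduced above):

   2 * 3 + 4  NB. the right argument of * is the whole expression 3 + 4
14
   (2 * 3) + 4  NB. parentheses recover the school-book reading
10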

Every array has a rank, i.e., its number of dimensions/axes, and a shape, i.e., the vector listing the number of elements along each axis. The monadic operators $ (shape) and # (tally, i.e., the number of items) can be used to query the shape and, in combination, the rank of an array as follows:

   i. 2 3 4  NB. creates array of shape 2 3 4 containing 0, 1, ...
 0  1  2  3
 4  5  6  7
 8  9 10 11

12 13 14 15
16 17 18 19
20 21 22 23
   $ i. 2 3 4  NB. the shape of i. 2 3 4
2 3 4
   # $ i. 2 3 4  NB. rank is number of elements in shape
3

When a dyadic operator is applied to arrays of different ranks, a simple rule is used for broadcasting: the smaller array is extended over the larger one, whenever its shape is a prefix of the shape of the larger array, e.g.,

   1 2 + i. 2 3  NB. shape of 2 vector matches first axis of 2 3 matrix
1 2 3
5 6 7
   1 2 3 + i. 2 3  NB. length error as shape is not a prefix
|length error
|   1 2 3    +i.2 3
   1 + i. 2 3  NB. scalar has empty shape and can always be broadcast
1 2 3
4 5 6
   (i. 2 3) + i. 2 3 4
 0  1  2  3
 5  6  7  8
10 11 12 13

15 16 17 18
20 21 22 23
25 26 27 28

Verb rank

In particular, the rank of an operator – called a verb in J – is an interesting and very useful concept (not just every array/value has a rank, but also every verb; these are separate concepts). Consider a shape 2 3 4 array A of rank 3. Depending on need, this can be considered as a 2 3 4 array of cells (scalars of rank 0), as a 2 3 array of 4-vectors, as a 2-vector of 3 4 matrices, or as a single cell holding the whole 2 3 4 array. To illustrate this, let’s apply the verb < at different ranks.

The verb < encloses or boxes a value when used monadically (unsurprisingly, < is less than when used dyadically). A boxed value is considered a scalar and can be unboxed with >. In contrast to APL, J is more consistent and strict in distinguishing boxed and unboxed values. In J, boxed values are displayed with a box drawn around their contents, and arrays can only contain values of different types when these are boxed (a box is similar to a reference type, but always explicit, e.g., < 1 2 is similar to Ref([1, 2]) in Julia).

   A =. i. 2 3 4  NB. assign 2 3 4 tensor of indices to A
   <"0 A  NB. box at rank 0 => 2 3 4 array of boxed cells
┌──┬──┬──┬──┐
│0 │1 │2 │3 │
├──┼──┼──┼──┤
│4 │5 │6 │7 │
├──┼──┼──┼──┤
│8 │9 │10│11│
└──┴──┴──┴──┘

┌──┬──┬──┬──┐
│12│13│14│15│
├──┼──┼──┼──┤
│16│17│18│19│
├──┼──┼──┼──┤
│20│21│22│23│
└──┴──┴──┴──┘
   <"1 A  NB. box at rank 1 => 2 3 matrix of boxed 4 vectors
┌───────────┬───────────┬───────────┐
│0 1 2 3    │4 5 6 7    │8 9 10 11  │
├───────────┼───────────┼───────────┤
│12 13 14 15│16 17 18 19│20 21 22 23│
└───────────┴───────────┴───────────┘
   <"2 A  NB. 2 vector of boxed 3 4 matrices
┌─────────┬───────────┐
│0 1  2  3│12 13 14 15│
│4 5  6  7│16 17 18 19│
│8 9 10 11│20 21 22 23│
└─────────┴───────────┘
   <"3 A  NB. single box containing 2 3 4 array
┌───────────┐
│ 0  1  2  3│
│ 4  5  6  7│
│ 8  9 10 11│
│           │
│12 13 14 15│
│16 17 18 19│
│20 21 22 23│
└───────────┘

Thus, the rank denotes at which level the verb is applied, i.e., rank 0 means that it acts on cells, rank 1 on vectors and so on. The final result is then obtained by moving into the tensor until the desired rank is reached, applying the verb there and collecting all results into a tensor. In J, the prefix of the shape that had to be traversed until the desired rank of the verb was reached is referred to as its frame, i.e., when applying <"2 to a 2 3 4 array, the frame has shape 2 and < is applied to each 3 4 matrix of rank 2. Thus, the result is a 2-vector of boxed 3 4 arrays.
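
The frame can be made visible directly by asking for the shape of the result (a small check added here, using the array A from above):

   $ <"2 A  NB. the shape of the result is the frame
2
   $ <"1 A
2 3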

The notion of verb rank unifies many operations on arrays. As another example consider the verb +/ – actually the verb + modified by the adverb / – which sums an array along its first axis. Then,

   $ +/"2 i. 2 3 4 5
2 3 5

i.e., by applying +/ at rank 2 we have summed out the second-to-last axis, leading to a result array of shape 2 3 5 where 2 3 is the shape of the frame and 5 is the shape of each individual result!
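
For comparison, here is +/ at its default rank and at rank 1 on a small matrix (an added illustration):

   +/ i. 2 3    NB. sum along the first axis
3 5 7
   +/"1 i. 2 3  NB. sum within each rank-1 cell, i.e., each row
3 12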

The J vocabulary provides an overview of all verbs, adverbs, etc. in J, including the ones introduced so far.

Machine learning algorithms

K-means

As a warm-up exercise with this notation, let’s implement the basic K-means algorithm (Bishop 2006) given by the two update equations

\[ r_{nk} = \left\{ \begin{array}{cl} 1 & \mbox{ if } k = \mathrm{argmin}_j ||\mathbf{x}_n - \mathbf{\mu}_j||^2 \\ 0 & \mbox{ otherwise} \end{array} \right. \]

matching the input vectors \(\mathbf{x}_n\) to the current cluster centers \(\mathbf{\mu}_k\) and then computing a new set of cluster centers

\[ \mathbf{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}} \; . \]

First, define a function to compute the squared \(L_2\) distance between two vectors

   dist2 =: dyad : '+/ *: x - y'  NB. left and right arguments are x and y

• dyadic verbs =. and =: are local and global assignment
• monadic verb *: is square
• dyadic adverb / is table, e.g., outer product
• monadic adverb / is insert, i.e., reduce or fold right

and apply it at rank 1 to the n d data matrix and the k d matrix of means in a tabular fashion, i.e., as an outer product, to obtain the n k pairwise distances:

   $ X =. i. 7 4  NB. show only shape of example arrays to save space
7 4
   $ mu =. i. 3 4
3 4
   $ d =. X dist2"1/ mu
7 3
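
As a quick numeric check (not part of the original derivation), dist2 on two small vectors gives the expected squared distance:

   1 2 3 dist2 4 5 6  NB. (1-4)^2 + (2-5)^2 + (3-6)^2
27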

The matrix r is then obtained by comparing the row-wise minima with the actual values

• dyadic verb = is equality, returning a boolean/bit, i.e., 0 or 1
• dyadic verb <. is minimum
• dyadic verb % is division

   $ r =. d = <./"1 d
7 3
   r
1 0 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1

and the update equation for the means can be stated as

Compare (+/ r */"1 X) % (+/ r) to the mathematical expression \[ \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}} \] Which one is more readable? Remember how long you had to study math to use it fluently!

   (+/ r */"1 X) % (+/ r)
 0  1  2  3
 4  5  6  7
16 17 18 19
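
As a cross-check (added here), the denominator +/ r simply counts the points assigned to each cluster:

   +/ r  NB. number of points per cluster
1 1 5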

Putting everything together, the K-means algorithm is obtained by iterating its update step to convergence:

   KMeansStep =: dyad define
r =. d = <./"1 d =. x dist2"1/ y  NB. pairwise distances assigned to d inline
(+/ r */"1 x) % (+/ r)
)
   X KMeansStep ^: _ mu  NB. iterate ^: to infinity _
 2  3  4  5
10 11 12 13
20 21 22 23

Exercises: Why does broadcasting work in d = <./"1 d? What is the shape of r */"1 X? Explain the difference between a +/"1 b and a +"1/ b.

Neural networks

As another example, consider a dense feed-forward layer, i.e.,

\[ f(\mathbf{x}) = \mathrm{relu}\left( \mathbf{W} \mathbf{x} + \mathbf{b} \right) \; . \]

In J, this can be easily implemented via

   relu =: monad : '0 >. y'
   dense =: adverb define
'W b' =. x  NB. unpack parameters which conveniently unboxes if needed
u b + W +/ .* y
)

and can be used as follows:

Parameters are boxed and collected in a vector. Here, a named tuple or object as in Python or Julia would be slightly more convenient and readable.

   ] W =. ? 2 3 $ 0  NB. use ] to show assigned value
0.0688955  0.27686 0.271178
 0.797844 0.357451 0.817211
   b =. _2 2
   ] params =. (<W), <b
┌───────────────────────────┬────┐
│0.0688955  0.27686 0.271178│_2 2│
│ 0.797844 0.357451 0.817211│    │
└───────────────────────────┴────┘
   params relu dense 1 2 3
0 5.96438
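
A quick sanity check of relu alone (added for illustration):

   relu _2 0 3.5
0 0 3.5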

• dyadic phrase +/ . * is dot product
• monadic verb ? is roll, i.e., a random draw
• monadic verb ] is same, i.e., identity
• dyadic verb $ is reshape
• dyadic verb , is append

Now, applying a dense layer at a (right) rank of 1, we can easily compute it over a whole batch of data (in general, a dyadic verb can be applied at different ranks for the left and right argument; here, rank 1 works for both arguments):

   $ batch =. ? 8 3 $ 0
8 3
   $ params relu dense"1 batch
8 2

This is not just cool, but also very convenient when defining more complicated models, e.g., transformers (Vaswani et al. 2017) with additional dimensions for token position and multi-head attention.

Transformers

Assuming that tokenization and embedding have already been taken care of, a sentence can be represented as a T D matrix, where T indexes token position and D the embedding dimensions.

The attention head first computes query, key and value vectors \(q, k\) and \(v\) from each word in the input sentence:

   rand =: monad : '(2 * ? y $ 0) - 1'
   T =. 5
   D =. 4
   $ sentence =. rand (T, D)
5 4
   Q =. 3  NB. dimensionality of latent vectors
   theta =. <"2 rand (3, D, Q)  NB. pack parameter matrices
   'Wq Wk Wv' =. theta  NB. unpack parameters
   q =. sentence +/ . * Wq
   k =. sentence +/ . * Wk
   v =. sentence +/ . * Wv
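
For orientation, the projections map each of the T tokens to a Q-dimensional vector (a shape check added here):

   $ q
5 3
   $ v
5 3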

Now, each query \(q_t\) is paired with each key \(k_{t'}\) and their scalar product is computed

   att =. q (+/ . *)"1/ k
   att =. (%: Q) %~ att  NB. divide by sqrt of Q

and then normalized along the last axis, i.e., \(k\)’s position in the sentence, via the softmax function:

   softmax =: monad : '(+/ z) %~ z =. ^ y'
   $ att =. softmax"1 (%: Q) %~ q (+/ . *)"1/ k
5 5
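
A small sanity check of softmax (added here): applied to a constant vector it yields the uniform distribution.

   softmax 0 0 0 0
0.25 0.25 0.25 0.25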

Finally, the values are weighted by the attention:

   $ z =. att +/ . * v  NB. alternatively +/"2 att *"1 _ v
5 3

Putting everything together, a single attention head works as follows:

• monadic verb %: is sqrt
• dyadic adverb ~ flips the left and right argument of a verb
• dyadic verb { is from, selecting elements by index

   head =: dyad define
'Wq Wk Wv' =. x  NB. unpack parameters
q =. y +/ . * Wq
k =. y +/ . * Wk
v =. y +/ . * Wv
att =. softmax"1 (%: _1 { $ q) %~ q (+/ . *)"1/ k
att +/ . * v
)
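
Applied to the example parameters and sentence from above, a single head again yields one Q-dimensional vector per token (a shape check added for illustration):

   $ theta head sentence
5 3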

Now, a multi-head can simply be obtained by applying the verb for a single head at the correct rank across multiple parameter vectors:

   $ mh =. (theta ,: - each theta) head"1 _ sentence  NB. dual head example
2 5 3
   $ (,"2) 0 |: mh  NB. move head axis to end and ravel last two axes
5 6

• dyadic verb ,: is laminate (similar to vstack)
• dyadic verb |: is rearrange axes

Thus, using this trick we can define a multi-head function which applies all heads and then projects the concatenated results down to a lower dimension by multiplying with a (H*Q) D matrix \(W_o\):

   multihead =: dyad define
'Wo thetas' =. x
res =. (,"2) 0 |: thetas head"1 _ y
res +/ . * Wo
)

Finally, we define a layer normalization verb:

   layernorm =: dyad define
'a s' =. x
mu =. (+/ y) % # y  NB. mean of y
var =. +/ *: y - mu  NB. sum of squared deviations
a + s * (y - mu) % %: var  NB. shift a and scale s applied to the normalized values
)
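
As defined above, layernorm centers its argument and divides by the square root of the summed squared deviations; with shift 0 and scale 1 this gives, e.g. (a small check added here):

   0 1 layernorm 1 2 3
_0.707107 0 0.707107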

Next, we define a simple two-layer network:

   mlp =: dyad define  NB. simple 2-layer network
'layer1 layer2' =. x
layer2 ] dense (layer1 relu dense y)  NB. identity ] as activation of the output layer
)

With these building blocks, we can put everything together into an attention layer:

   attlayer =: dyad define
'mh ln1 dn ln2' =. x  NB. unpack layer parameters
hidden =. ln1 layernorm"_ 1 y + mh multihead y  NB. rank 1 to apply layernorm
ln2 layernorm"_ 1 hidden + dn mlp"_ 1 hidden    NB. and MLP at each position
)

Exercises: Add positional encoding and causal masking to the transformer model.

Let’s put together some example parameters

   mh_params =. (< rand (8*Q), D), < <"2 rand (8, 3, D, Q)  NB. Wo and 8 boxed head parameter sets
   ln_params =. 0 1  NB. shift and scale
   mlp_params =. (< (< rand (7, D)), < rand 7), < (< rand (D, 7)), < rand D  NB. two dense layers
   theta =. (< mh_params), (< ln_params), (< mlp_params), < ln_params

and, just because it is so easy, apply it to a batch of several sentences:

   $ theta attlayer"_ 2 rand (16, T, D)
16 5 4

Compare this code to other expositions, e.g., illustrating transformers with large colorful images, or to implementations, e.g., tf.keras.layers.MultiHeadAttention programmatically computing Einsum equations. Papers, too, are sometimes less precise, e.g., applying softmax and layer normalization functions without stating the exact tensor dimensions along which they act, so that additional code review is needed to ensure that the stated equations are correctly implemented.

Final notes

Finally, overemphasis of efficiency leads to an unfortunate circularity in design: for reasons of efficiency early programming languages reflected the characteristics of the early computers, and each generation of computers reflects the needs of the programming languages of the preceding generation.

Kenneth E. Iverson

Besides providing for a concise implementation of machine learning models, the APL/J notation can also be used to think about an implementation, e.g., in order to derive a simpler or more efficient version. Reconsider, for example, the expression +/ r */"1 x used in the definition of K-means. Thinking about its action on the axes of r (an N K matrix) and x (an N D matrix), i.e., switching to an explicit index notation, we see that r */"1 x is an N K D tensor with elements \((\verb|r */"1 x|)_{nkd} = r_{nk} x_{nd}\) and thus,

\[ (\verb|+/ r */"1 x|)_{kd} = \sum_n r_{nk} x_{nd} \; . \]

Accordingly, using matrix multiplication, i.e., the generalized inner product +/ . * in J, and transposition |:, it can also be written as (|: r) +/ . * x.
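
Using the example arrays r and X from the K-means session above, the identity can be verified directly (a check I added; dyadic -: is match, testing for equality):

   (+/ r */"1 X) -: (|: r) +/ . * X
1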

I had not actually seen that identity before deriving it in this fashion! Compared to its corresponding mathematical expression (a well-known identity between matrix multiplication and the sum of outer products between the columns of \(\mathbf{A}\) and rows of \(\mathbf{B}\)), i.e.,

\[ \mathbf{A} \mathbf{B} = \sum_k \mathbf{a}^{\mathrm{col}}_k \otimes \mathbf{b}^{\mathrm{row}}_k \; , \]

it is (arguably) even simpler in J, nicely illustrating its power not just for implementing algorithms, but also as a technical notation amenable to formal manipulations (other examples can be found in the works of Ken Iverson or on the Dyalog APL website, e.g., golfing the famous APL code for the Game of Life).

In the end, shorter code not only enables faster exploration and prototyping of algorithms. Reasoning about code at a higher level of abstraction also opens up possibilities for optimization via equational reasoning. The only other language that I have seen being used in this fashion is Haskell. There, fusion laws such as the functor law fmap f . fmap g == fmap (f . g), or the works of Richard Bird (Bird 2006), e.g., deriving an efficient Sudoku solver from a brute-force specification, come to mind.

All in all, APL/J has some interesting tricks up its sleeve – also for modern machine learning – and certainly is a fresh blast from the past. For inspiration and contemplation, the full transformer code is repeated here …

   relu =: monad : '0 >. y'
   dense =: adverb define
'W b' =. x
u b + W +/ .* y
)
   softmax =: monad : '(+/ z) %~ z =. ^ y'
   head =: dyad define
'Wq Wk Wv' =. x  NB. unpack parameters
q =. y +/ . * Wq
k =. y +/ . * Wk
v =. y +/ . * Wv
att =. softmax"1 (%: _1 { $ q) %~ q (+/ . *)"1/ k
att +/ . * v
)
   multihead =: dyad define
'Wo thetas' =. x  
res =. (,"2) 0 |: thetas head"1 _ y
res +/ . * Wo
)
   layernorm =: dyad define
'a s' =. x
mu =. (+/ y) % # y
var =. +/ *: y - mu
a + s * (y - mu) % %: var
)
   mlp =: dyad define
'layer1 layer2' =. x
layer2 ] dense (layer1 relu dense y)
)
   attlayer =: dyad define
'mh ln1 dn ln2' =. x  NB. unpack layer parameters
hidden =. ln1 layernorm"_ 1 y + mh multihead y  NB. apply layernorm and MLP at each position
ln2 layernorm"_ 1 hidden + dn mlp"_ 1 hidden
)
Bird, Richard S. 2006. “FUNCTIONAL PEARL: A Program to Solve Sudoku.” Journal of Functional Programming 16: 671–79. https://doi.org/10.1017/S0956796806006058.
Bishop, Christopher. 2006. Pattern Recognition and Machine Learning. Springer. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/.
Iverson, Kenneth E. 1980. “Notation as a Tool of Thought.” Commun. ACM 23 (8): 444–65. https://doi.org/10.1145/358896.358899.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.