The “two-language problem” is often discussed in data science and machine learning – especially by advocates of the Julia programming language. Essentially, it refers to the situation that algorithms are often prototyped in high-level dynamic languages such as R or Python, yet, for reasons of efficiency, the final implementation is then ported to Fortran or, more commonly nowadays, C++. Thus, several versions of the same algorithm have to be implemented in different programming languages which are either convenient for prototyping or sufficiently constrained to support an optimizing compiler.
Julia could actually solve another version of the “two-language problem” here, namely that modern ML toolboxes have become yet another programming language in their own right: first, by supporting only a restricted subset of their host language and, second, by introducing special control structures that enable better static optimizations.
Here, I want to argue that machine learning suffers from the “two-language problem” already at a more fundamental level, namely the impedance mismatch between formal mathematical notation – used when thinking about and deriving model equations – and the actual programming language used for implementation. While modern toolboxes such as PyTorch and TensorFlow readily support many matrix and tensor operations, they are still verbose compared to mathematical notation. Furthermore, they commonly require expressing control flow in addition to the desired computation, thereby hiding the actual intent of the code. Accordingly, they are rarely used as the formal language for developing and thinking about models.
What if the formal notation and programming language were just the same?
In addition to the executability and universality emphasized in the introduction, a good notation should embody characteristics familiar to any user of mathematical notation:
• Ease of expressing constructs arising in problems.
• Suggestivity.
• Ability to subordinate detail.
• Economy.
• Amenability to formal proofs.
With this goal in mind, Kenneth Iverson (Iverson 1980) created an unambiguous and terse alternative to mathematical notation itself and used it as the basis for different programming languages. In particular, APL has also become known as Iverson notation and was taught in introductory mathematics courses as well as used by researchers as a means of communicating formal ideas. I had a serious look at APL after seeing this video showing how Conway’s Game of Life can be implemented in a single line of APL!
APL is infamous for the special single-character glyphs used to denote most of its operators. Furthermore, its semantics appears somewhat inconsistent and confusing at times. Thus, here I will use J, a more recent and open-source reincarnation of APL. In contrast to APL, J relies exclusively on ASCII characters – making it easier to type, but not necessarily to read (arguably, the ASCII characters lose much of the beauty of the original notation). Most importantly, its semantics has been unified, making it substantially simpler and more consistent.
As an array programming language, the basic data structures of J are multi-dimensional arrays – also known as tensors. In turn, most operators automatically extend across array arguments in a systematic fashion, e.g.
1 + 2
3
1 2 3 + 4 5 6 NB. + can operate on vectors
5 7 9
Here, and in the following, user input is shown indented whereas the output of the J interpreter is aligned left. Comments in J are marked with NB. and extend to the end of a line. Most operators have a different meaning when used with one (monadic) or two (dyadic) arguments. Furthermore, expressions should be read from right to left and all operators have the same precedence (APL/J intentionally breaks with mathematical traditions when these are slightly inconsistent; being more regular, it is actually easier to learn and often requires fewer parentheses). More precisely, operators associate to the right and accordingly 2 * 3 + 4 is interpreted as \(2 \cdot (3 + 4)\).
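For example, parentheses are needed to recover the usual precedence of multiplication over addition:
2 * 3 + 4 NB. evaluated right to left as 2 * (3 + 4)
14
(2 * 3) + 4 NB. parentheses force the conventional grouping
10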
Every array has a rank, i.e., its number of dimensions/axes, and a shape, i.e., the vector listing the number of elements in each dimension. The monadic operators $ (shape) and # (tally) can be used to query the shape and rank of an array as follows:
i. 2 3 4 NB. creates array of shape 2 3 4 containing 0, 1, ...
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
$ i. 2 3 4 NB. the shape of i. 2 3 4
2 3 4
# $ i. 2 3 4 NB. rank is number of elements in shape
3
When a dyadic operator is applied to arrays of different ranks, a simple rule is used for broadcasting: the smaller array is extended over the larger one, whenever its shape is a prefix of the shape of the larger array, e.g.,
1 2 + i. 2 3 NB. shape of the 2 vector matches the first axis of the 2 3 matrix
1 2 3
5 6 7
1 2 3 + i. 2 3 NB. length error as shape is not a prefix
|length error
| 1 2 3 +i.2 3
1 + i. 2 3 NB. scalar has empty shape and can always be broadcasted
1 2 3
4 5 6
(i. 2 3) + i. 2 3 4 NB. each cell of the 2 3 array is added to the corresponding 4 vector
0 1 2 3
5 6 7 8
10 11 12 13
15 16 17 18
20 21 22 23
25 26 27 28
In particular, the rank of an operator – called verb in J – is an interesting and very useful concept (not just every array/value has a rank, but also every verb; these are separate concepts). Consider a shape 2 3 4 array A of rank 3. Depending on need, this can be considered as a 2 3 4 array of cells (scalars of rank 0), as a 2 3 array of 4-vectors, as a 2 vector of 3 4 matrices, or as a single cell holding the whole 2 3 4 array. To illustrate this, let’s apply the verb < at different ranks. The verb < encloses or boxes a value when used monadically (unsurprisingly, < is less than when used dyadically). A boxed value is considered a scalar and can be unboxed with > . In contrast to APL, J is more consistent and strict in distinguishing boxed and unboxed values. In J, boxed values are shown with a box drawn around their value, and arrays can only contain values of different types when these are boxed (a box is similar to a reference type, but always explicit, e.g., < 1 2 is similar to Ref([1, 2]) in Julia).
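For example, boxing turns a vector into a scalar of rank 0 and > recovers the original value:
< 1 2 3 NB. box a vector
┌─────┐
│1 2 3│
└─────┘
# $ < 1 2 3 NB. a boxed value has rank 0
0
> < 1 2 3 NB. unboxing recovers the vector
1 2 3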
A =. i. 2 3 4 NB. assign 2 3 4 tensor of indices to A
<"0 A NB. box at rank 0 => 2 3 4 array of boxed cells
┌──┬──┬──┬──┐
│0 │1 │2 │3 │
├──┼──┼──┼──┤
│4 │5 │6 │7 │
├──┼──┼──┼──┤
│8 │9 │10│11│
└──┴──┴──┴──┘
┌──┬──┬──┬──┐
│12│13│14│15│
├──┼──┼──┼──┤
│16│17│18│19│
├──┼──┼──┼──┤
│20│21│22│23│
└──┴──┴──┴──┘
<"1 A NB. box at rank 1 => 2 3 matrix of boxed 4 vectors
┌───────────┬───────────┬───────────┐
│0 1 2 3    │4 5 6 7    │8 9 10 11  │
├───────────┼───────────┼───────────┤
│12 13 14 15│16 17 18 19│20 21 22 23│
└───────────┴───────────┴───────────┘
<"2 A NB. 2 vector of boxed 3 4 matrices
┌─────────┬───────────┐
│0 1  2  3│12 13 14 15│
│4 5  6  7│16 17 18 19│
│8 9 10 11│20 21 22 23│
└─────────┴───────────┘
<"3 A NB. single box containing 2 3 4 array
┌───────────┐
│ 0  1  2  3│
│ 4  5  6  7│
│ 8  9 10 11│
│           │
│12 13 14 15│
│16 17 18 19│
│20 21 22 23│
└───────────┘
Thus, the rank denotes at which level the verb is applied, i.e., rank 0 means that it acts on cells, rank 1 on vectors and so on. The final result is then obtained by moving into the tensor until the desired rank is reached, applying the verb there and collecting all results into a tensor. In J, the prefix of the shape that needed to be traversed until the desired rank of the verb was reached is referred to as its frame, i.e., when applying <"2 to a 2 3 4 array, the frame has shape 2 and < is applied to each 3 4 matrix of rank 2. Thus, the result is a 2 vector of boxed 3 4 arrays.
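The same mechanism applies to any verb. Querying the shape at rank 2, for instance, applies $ to each 3 4 cell and assembles the results over the frame of shape 2:
$"2 i. 2 3 4 NB. shape of each rank 2 cell, collected over the frame
3 4
3 4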
The notion of verb rank unifies many operations on arrays. As another example consider the verb +/ – actually the verb + modified by the adverb / – which sums an array along its first axis. Then,
$ +/"2 i. 2 3 4 5
2 3 5
i.e., by applying +/ at rank 2 we have summed out the second to last axis, leading to a result array of shape 2 3 5 where 2 3 was the shape of the frame and 5 corresponds to the shape of the result!
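On a smaller array the effect of the rank is easy to check by hand:
+/ i. 2 3 NB. sum along the first axis, i.e., column sums
3 5 7
+/"1 i. 2 3 NB. applied at rank 1, each row is summed instead
3 12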
The J vocabulary provides an overview of all verbs, adverbs, etc. in J, including the ones introduced so far.
As a warm-up exercise with this notation, let’s implement the basic K-means algorithm (Bishop 2006) given by the two update equations
\[ r_{nk} = \left\{ \begin{array}{cl} 1 & \mbox{ if } k = \mathrm{argmin}_j ||\mathbf{x}_n - \mathbf{\mu}_j||^2 \\ 0 & \mbox{ otherwise} \end{array} \right. \]
matching the input vectors \(\mathbf{x}_n\) to the current cluster centers \(\mathbf{\mu}_k\) and then computing a new set of cluster centers
\[ \mathbf{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}} \; . \]
First, define a function to compute the squared \(L_2\) distance between two vectors
dist2 =: dyad : '+/ *: x - y' NB. left and right arguments are x and y
New vocabulary:
• =. and =: are local and global assignment
• monadic verb *: is square
• dyadic adverb / is table, e.g., outer product
• monadic adverb / is insert, i.e., reduce or fold from the right
and apply it at rank 1 to an n d data matrix and k d means in a tabular fashion, i.e., as an outer product, to obtain the n k pairwise distances:
$ X =. i. 7 4 NB. show only shape of example arrays to save space
7 4
$ mu =. i. 3 4
3 4
$ d =. X dist2"1/ mu
7 3
The matrix r is then obtained by comparing the row-wise minima with the actual values.
New vocabulary:
• dyadic verb = is equality, returning a boolean/bit, i.e., 0 or 1
• dyadic verb <. is minimum
• dyadic verb % is division
$ r =. d = <./"1 d
7 3
r
1 0 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
and the update equation for the means can be stated as
Compare (+/ r */"1 X) % (+/ r) to the mathematical expression \[ \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}} \] Which one is more readable? Remember how long you had to study math to use it fluently!
(+/ r */"1 X) % (+/ r)
0 1 2 3
4 5 6 7
16 17 18 19
Putting everything together, the K-means algorithm is obtained by iterating its update step to convergence:
KMeansStep =: dyad define
r =. d = <./"1 d =. x dist2"1/ y NB. pairwise distances assigned to d inline
(+/ r */"1 x) % (+/ r)
)
X KMeansStep ^: _ mu NB. the power conjunction ^: with power _ (infinity) iterates to convergence
2 3 4 5
10 11 12 13
20 21 22 23
Exercises: Why does broadcasting work in d = <./"1 d? What is the shape of r */"1 X? Explain the difference between a +/"1 b and a +"1/ b.
As another example, consider a dense feed-forward layer, i.e.,
\[ f(\mathbf{x}) = \mathrm{relu}\left( \mathbf{W} \mathbf{x} + \mathbf{b} \right) \; . \]
In J, this can be easily implemented via
relu =: monad : '0 >. y'
dense =: adverb define
'W b' =. x NB. unpack parameters which conveniently unboxes if needed
u b + W +/ .* y
)
and can be used as follows:
Parameters are boxed and collected in a vector. Here, a named tuple or object as in Python or Julia would be slightly more convenient and readable.
] W =. ? 2 3 $ 0 NB. use ] to show assigned value
0.0688955 0.27686 0.271178
0.797844 0.357451 0.817211
b =. _2 2
] params =. (<W), <b
┌───────────────────────────┬────┐
│0.0688955 0.27686 0.271178│_2 2│
│ 0.797844 0.357451 0.817211│ │
└───────────────────────────┴────┘
params relu dense 1 2 3
0 5.96438
New vocabulary:
• dyadic phrase +/ . * is dot product
• monadic verb ? is roll, i.e., random draw
• monadic verb ] is same, i.e., identity
• dyadic verb $ is reshape
• dyadic verb , is append
Now, applying a dense layer at a (right) rank of 1, we can easily compute it over a whole batch of data (in general, a dyadic verb can be applied at different ranks for the left and right argument; here, rank 1 works for both arguments):
$ batch =. ? 8 3 $ 0
8 3
$ params relu dense"1 batch
8 2
This is not just cool, but also very convenient when defining more complicated models, e.g., transformers (Vaswani et al. 2017) with additional dimensions for token position and multi-head attention.
Assuming that tokenization and embedding have already been taken care of, a sentence can be represented as a T D matrix where T indexes token position and D embedding dimensions.
The attention head first computes several vectors \(q, k\) and \(v\) from each word in the input sentence:
rand =: monad : '(2 * ? y $ 0) - 1'
T =. 5
D =. 4
$ sentence =. rand (T, D)
5 4
Q =. 3 NB. dimensionality of latent vectors
theta =. <"2 rand (3, D, Q) NB. pack parameter matrices
'Wq Wk Wv' =. theta NB. unpack parameters
q =. sentence +/ . * Wq
k =. sentence +/ . * Wk
v =. sentence +/ . * Wv
Now, each \(q_t\) is paired with each \(k_{t'}\) and their scalar product is computed
att =. q (+/ . *)"1/ k
att =. (%: Q) %~ att NB. divide by sqrt of Q
and then normalized along the last axis, i.e., \(k\)’s position in the sentence, via the softmax function:
softmax =: monad : '(+/ z) %~ z =. ^ y'
$ att =. softmax"1 (%: Q) %~ q (+/ . *)"1/ k
5 5
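As a quick sanity check, applying softmax at rank 1 indeed normalizes each row of the attention matrix:
+/"1 att NB. each row sums to one
1 1 1 1 1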
Finally, the values are weighted by the attention:
$ z =. att +/ . * v NB. alternatively +/"2 att *"1 _ v
5 3
Putting everything together, a single attention head works as follows:
New vocabulary:
• monadic verb %: is sqrt
• dyadic adverb ~ flips the left and right argument of a verb
• dyadic verb { is from, selecting elements by index
head =: dyad define
'Wq Wk Wv' =. x NB. unpack parameters
q =. y +/ . * Wq
k =. y +/ . * Wk
v =. y +/ . * Wv
att =. softmax"1 (%: _1 { $ q) %~ q (+/ . *)"1/ k
att +/ . * v
)
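Applied to the example sentence from above, a single head maps the T D input to a T Q representation:
$ theta head sentence NB. 5 4 sentence in, 5 3 head output
5 3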
Now, a multi-head can simply be obtained by applying the verb for a single head at the correct rank across multiple parameter vectors:
$ mh =. (theta ,: - each theta) head"1 _ sentence NB. dual head example
2 5 3
$ (,"2) 0 |: mh NB. move head axis to end and ravel last two axes
5 6
New vocabulary:
• dyadic verb ,: is laminate (similar to vstack)
• dyadic verb |: is rearrange axes
Thus, using this trick we can define a multi-head function which applies all heads and then projects the concatenated results down to a lower dimension by multiplying with a (H*Q) D matrix \(W_o\):
multihead =: dyad define
'Wo thetas' =. x
res =. (,"2) 0 |: thetas head"1 _ y
res +/ . * Wo
)
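As a shape check, here is a hypothetical dual-head configuration reusing theta from above together with a random (2*Q) D projection matrix; the result has the same shape as the input sentence:
$ ((< rand (2*Q), D), < theta ,: - each theta) multihead sentence NB. hypothetical dual-head parameters
5 4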
Finally, defining a layer normalization verb
layernorm =: dyad define
'a s' =. x
mu =. (+/ y) % # y
var =. +/ *: y - mu NB. sum of squared deviations (note: not divided by # y)
a + s * (y - mu) % %: var
)
and a simple two-layer network
mlp =: dyad define NB. simple 2-layer network
'layer1 layer2' =. x
layer2 ] dense (layer1 relu dense y)
)
we can put everything together into an attention layer:
attlayer =: dyad define
'mh ln1 dn ln2' =. x NB. unpack layer parameters
hidden =. ln1 layernorm"_ 1 y + mh multihead y NB. rank 1 to apply layernorm
ln2 layernorm"_ 1 hidden + dn mlp"_ 1 hidden NB. and MLP at each position
)
Exercises: Add positional encoding and causal masking to the transformer model.
Let’s put together some example parameters
mh_params =. (< rand (8*Q), D), < <"2 rand (8, 3, D, Q)
ln_params =. 0 1
mlp_params =. (< (< rand (7, D)), < rand 7), < (< rand (D, 7)), < rand D
theta =. (< mh_params), (< ln_params), (< mlp_params), < ln_params
and, just because it is so easy, apply it to a batch of several sentences:
$ theta attlayer"_ 2 rand (16, T, D)
16 5 4
Compare this code to other expositions, e.g., illustrating transformers with large colorful images, or implementations, e.g., tf.keras.layers.MultiHeadAttention programmatically computing Einsum equations. Also, papers are sometimes less precise, e.g., applying softmax and layer normalization functions without stating the exact tensor dimension along which these act, so that additional code reviews are needed to ensure that the stated equations are correctly implemented.
Finally, overemphasis of efficiency leads to an unfortunate circularity in design: for reasons of efficiency early programming languages reflected the characteristics of the early computers, and each generation of computers reflects the needs of the programming languages of the preceding generation.
Besides providing for a concise implementation of machine learning models, the APL/J notation can also be used to think about an implementation, e.g., in order to derive a simpler or more efficient version. Reconsider for example the expression +/ r */"1 x used in the definition of K-means. Thinking about its action on the axes of r (an N K matrix) and x (an N D matrix), i.e., switching to an explicit index notation, we see that r */"1 x is an N K D tensor with elements \((\verb|r */"1 x|)_{nkd} = r_{nk} x_{nd}\) and thus,
\[ (\verb|+/ r */"1 x|)_{kd} = \sum_n r_{nk} x_{nd} \; . \]
Accordingly, using matrix multiplication, i.e., the generalized inner product +/ . * in J, and transposition |:, it can also be represented as (|: r) +/ . * x.
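A quick numerical check with the r and X from the K-means example above confirms the identity (the dyadic verb -: is match, testing whether its arguments are identical):
(+/ r */"1 X) -: (|: r) +/ . * X NB. both expressions yield the same K D matrix
1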
I had not actually seen that identity before deriving it in this fashion! Compared to its corresponding mathematical expression (a well-known identity between matrix multiplication and the sum of outer products between the columns of \(\mathbf{A}\) and rows of \(\mathbf{B}\)), i.e.,
\[ \mathbf{A} \mathbf{B} = \sum_k \mathbf{a}^{\mathrm{col}}_k \otimes \mathbf{b}^{\mathrm{row}}_k \; , \]
it is (arguably) even simpler in J, nicely illustrating its power not just for implementing algorithms, but also as a technical notation amenable to formal manipulations (other examples can be found in the works of Ken Iverson or on the Dyalog APL website, e.g., golfing the famous APL code for the Game of Life).
In the end, shorter code not just enables faster exploration and prototyping of algorithms. Reasoning about code on a higher level of abstraction also opens up possibilities for optimization via equational reasoning. The only other language that I have seen being used in this fashion is Haskell. Here, stream fusion as exemplified in the functor law fmap f . fmap g == fmap (f . g) or the works of Richard Bird (Bird 2006), e.g., deriving an efficient Sudoku solver from a brute-force specification, come to mind.
All in all, APL/J has some interesting tricks up its sleeve – also for modern machine learning – and is certainly a fresh blast from the past. For inspiration and contemplation the full transformer code is repeated here …
relu =: monad : '0 >. y'
dense =: adverb define
'W b' =. x
u b + W +/ .* y
)
softmax =: monad : '(+/ z) %~ z =. ^ y'
head =: dyad define
'Wq Wk Wv' =. x NB. unpack parameters
q =. y +/ . * Wq
k =. y +/ . * Wk
v =. y +/ . * Wv
att =. softmax"1 (%: _1 { $ q) %~ q (+/ . *)"1/ k
att +/ . * v
)
multihead =: dyad define
'Wo thetas' =. x
res =. (,"2) 0 |: thetas head"1 _ y
res +/ . * Wo
)
layernorm =: dyad define
'a s' =. x
mu =. (+/ y) % # y
var =. +/ *: y - mu
a + s * (y - mu) % %: var
)
mlp =: dyad define
'layer1 layer2' =. x
layer2 ] dense (layer1 relu dense y)
)
attlayer =: dyad define
'mh ln1 dn ln2' =. x NB. unpack layer parameters
hidden =. ln1 layernorm"_ 1 y + mh multihead y NB. apply layernorm and MLP at each position
ln2 layernorm"_ 1 hidden + dn mlp"_ 1 hidden
)