To write efficient code, you need to know what makes code slow.
We already saw how understanding type inference and specialization in Julia gives us ways to fix coding issues.
In this part we will see how understanding the computer's memory reveals some more ways to avoid slow code and performance pitfalls.
There is a disconnect between the theoretical world of computing and the actual world of high performance computing. The reason is that hardware actually matters: the way the hardware is put together means we have to code differently to match how the hardware works. So we need a mental model of memory.
Let's Google "CPU cache"! We can also look at a memory hierarchy diagram from the University of Washington.
This hierarchy really matters because it means there is a natural ordering that we should follow in our calculations. When optimizing code, using things that are already in a closer cache helps the code run faster, because the data does not have to be brought to the CPU from a far-away location. When something needs to be pulled directly from main memory this is known as a cache miss, and it is the most costly. For example, linear algebra libraries are very efficient because they block their computations so that one block fits into the L1 cache, the next block fits into the L2 cache, and so on; these block sizes are different for every chip, and the libraries are written to be CPU dependent in order to get the maximum possible performance. So the main thing to take away is that this hierarchy of memory makes a difference. Things on your computer might be close in memory or far in memory, and this memory latency is one of the most important factors for your computational speed.
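To make the blocking idea concrete, here is a minimal sketch (my own, not from the original notes): a naive triple loop versus a cache-blocked triple loop for matrix multiplication. The block size 64 is an arbitrary illustrative choice; real libraries tune it per CPU.

function matmul_naive!(C, A, B)
    n = size(A, 1)
    fill!(C, 0.0)
    for j in 1:n, k in 1:n, i in 1:n
        C[i, j] += A[i, k] * B[k, j]
    end
    C
end

function matmul_blocked!(C, A, B; bs = 64)
    n = size(A, 1)
    fill!(C, 0.0)
    ## work on bs-by-bs tiles so the active pieces of A, B and C stay in cache
    for jj in 1:bs:n, kk in 1:bs:n, ii in 1:bs:n
        for j in jj:min(jj+bs-1, n), k in kk:min(kk+bs-1, n), i in ii:min(ii+bs-1, n)
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    C
end

The blocked version reloads each element of A and B from main memory far less often, which is exactly the trick the tuned BLAS libraries use.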
In the following we will see several times that it really matters where information gets stored.
Many algorithms in numerical linear algebra are designed to minimize cache misses. Because of this hierarchy, many modern CPUs try to guess what you will want next in your cache. When dealing with arrays, the CPU will speculate ahead and grab what is known as a cache line, i.e. the next chunk of the array. Thus, your algorithms will be faster if you iterate along the values that it is grabbing, which are the next values in the contiguous order of the stored array. When an array has only one dimension, its elements can be stored one after the other in a contiguous block of memory. For two-dimensional (or higher) arrays, there are two common conventions: row major and column major.
In the Row Major convention, the linear array of memory is formed by stacking the rows one after another. In other words, we store from the beginning of the array the elements of the first row, followed by the elements of the second row, and so on.
In the Column Major, the column vectors are put one after another.
Arrays in C and Python/numpy are stored row-ordered. Julia, on the other hand, chooses the latter strategy, storing arrays column-ordered, similar to MATLAB and Fortran. (This rule generalizes to higher-dimensional arrays as well: it is always the first dimension whose elements are stored next to each other in memory.)
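As a quick illustration (my own small example, not from the original text), vec shows the order in which Julia lays out a matrix in memory:

M = [1 2 3; 4 5 6]  ## a 2x3 matrix
vec(M)              ## column-major order: [1, 4, 2, 5, 3, 6]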
In Julia we have two kinds of indices for this: Cartesian and linear. So for example we have
#code chunk 1
AA = rand(2,3)
AA[1]
AA[2]
AA[3]
So the formula is AA[i, j] == AA[i + (j-1)*m], where m is the number of rows. Since the number of rows of AA is 2, we can check
#code chunk 2
AA[2 + (2-1)*2] == AA[2,2]
AA[2 + (3-1)*2] == AA[2,3]
AA[1 + (2-1)*2] == AA[1,2]
Here is a direct way of going from Cartesian to linear, or linear to Cartesian, indices in Julia:
## going from Cartesian index to linear index
A = rand(3, 5)
lin_I = LinearIndices(A)
cart_I = CartesianIndices(A)
lin_I[CartesianIndex(2, 3)] ## given a Cartesian index I can find a linear index
cart_I[8]                   ## and given a linear index I can find a Cartesian index
So now let's see whether this column-major ordering makes any difference or not.
We consider the problem of taking two 100x100 random matrices and producing a third matrix formed by element-wise addition. We can do this simply by looping through the elements of the matrices, and we can loop either row-wise or column-wise. The two functions below show both: in the first one, the outer loop is over rows and the inner loop is over columns (so we walk along each row first, across all its columns, then go to the next row, and so on). In the second function it is the opposite: we walk down each column first, across all its rows, then go to the next column, and so on. Note that in both functions we pre-allocate C.
#code chunk 3
A = rand(100,100); B = rand(100,100); C = rand(100,100);

function inner_rows!(C,A,B)
    for i in 1:100, j in 1:100
        C[i,j] = A[i,j] + B[i,j]
    end
end

function inner_cols!(C,A,B)
    for j in 1:100, i in 1:100
        C[i,j] = A[i,j] + B[i,j]
    end
end

using BenchmarkTools
@btime inner_rows!(C,A,B)
@btime inner_cols!(C,A,B)
So we run and benchmark both, and notice that the column-wise version is much faster. Respecting the memory layout is crucial.
The stack is a fixed piece of memory that lives right next to the core. It has a fixed size, and the location of everything on the stack is known beforehand, so you can instantly grab anything on it without having to search around; this is fast. The heap is a dynamically allocated region of memory associated with a program. The downside is that its size changes, which means the things stored in it can also have changing sizes, so you can have objects whose memory length is not known in advance.
Again, what is wrong with heap allocations? Everything stored there is accessed through pointers (some kind of reference): whenever you need a value, you follow the pointer, and if type information is needed it has to be checked every time you ask for it. The stack, on the other hand, is like one straight line of memory where things sit side by side, and it is close to the core.
Many things are automatically placed on the stack when you run a program, so there is not much free space, but you can also use some of that free space yourself, and we will see some examples later.
One place where you actually see this in action is when you allocate an array. An array is the most basic heap-allocated object, and it shows up all the time. Why does it live on the heap? Because an array can be an object of any size, and that size is not known at compile time. Recall that Julia uses all of its type information at compile time; here, when I create this 100 by 100 array, we see in the REPL that its actual type is an Array of Float64s, but at compile time the compiler has no idea what the length of the array will be. This is what we call a dynamically sized array. Most languages use dynamically sized arrays because they are really convenient: you can change their size, resize them, and so on. But the key thing is that this has a cost. All arrays live on the heap, and that means whenever we create new arrays we create new pointers to things on the heap, and so on...
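A tiny illustration of this point (my own, not from the notes): a Float64 has a fixed size known to the compiler, while an array's length is only known at run time, so the array lives on the heap and its size is not part of its type.

isbits(1.0)          ## true:  fixed-size value, can live on the stack / in registers
isbits([1.0, 2.0])   ## false: heap-allocated, accessed through a pointer
sizeof(1.0)          ## 8 bytes, known at compile time
a = rand(100, 100)
typeof(a)            ## Matrix{Float64}; the 100x100 size is not in the type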
Okay, so let's see this in action. But first, a quick digression on the basic array functions in Julia:
eltype(A): the type of the elements contained in A
length(A): the number of elements in A
ndims(A): the number of dimensions of A
size(A): a tuple containing the dimensions of A
size(A,n): the size of A along dimension n
axes(A): a tuple containing the valid indices of A
axes(A,n): a range expressing the valid indices along dimension n
eachindex(A): an efficient iterator for visiting each position in A

A = rand(2, 3)
for i in eachindex(A)
    println(i^2) ## A[i]
end
For construction and initialization we can use
Array{T}(undef, dims...): an uninitialized dense Array
zeros(T, dims...): an Array of all zeros
ones(T, dims...): an Array of all ones
trues(dims...): a BitArray with all values true
falses(dims...): a BitArray with all values false
reshape(A, dims...): an array containing the same data as A, but with different dimensions
copy(A): copy A
deepcopy(A): copy A, recursively copying its elements
similar(A, T, dims...): an uninitialized array of the same type as A (dense, sparse, etc.), but with the specified element type and dimensions. The second and third arguments are both optional, defaulting to the element type and dimensions of A if omitted.
reinterpret(T, A): an array with the same binary data as A, but with element type T
rand(T, dims...): an Array with random, iid and uniformly distributed values in the half-open interval [0, 1)
randn(T, dims...): an Array with random, iid and standard normally distributed values
Matrix{T}(I, m, n): m-by-n identity matrix. Requires using LinearAlgebra for I.
range(start, stop=stop, length=n): range of n linearly spaced elements from start to stop
fill!(A, x): fill the array A with the value x
fill(x, dims...): an Array filled with the value x

Now back to memory allocations. So let's see some code
#code chunk 3
function inner_alloc!(C,A,B)
    for j in 1:100, i in 1:100
        val = [A[i,j] + B[i,j]] ## note we put it inside the array
        C[i,j] = val[1]
    end
end
@btime inner_alloc!(C,A,B)
In the code above we are doing the same column-wise loop, but note that we also put this sum of elements inside an array, and then put that value into C[i,j]. Let's see how fast this runs... Ohhh!! 100 x 100 = 10000 allocations!!!!
So the key here is that when we take A plus B element-wise we get a scalar, and then we put that scalar inside an array, and this array is a dynamically sized object. Although we know it is a one-by-one object (a length-one vector), that is not part of the type information, so at compile time the compiler cannot specialize directly on its size.
So the question is: how much does it matter when we change this to use a scalar rather than creating an array? Let's change the code a bit by replacing the val array with a val scalar to store the temporary values...
#code chunk 4
function inner_noalloc!(C,A,B)
    for j in 1:100, i in 1:100
        val = A[i,j] + B[i,j] ## this is on the stack
        C[i,j] = val[1]
    end
end
@btime inner_noalloc!(C,A,B)
Huge difference! A scalar is something whose size you know at compile time: if I take something like 1.0, I know it's a Float64, which tells me it's a 64-bit object.
Note, the only difference is that in the previous version we created an array and in this one we do not. When we do this we see that, even though this version still does the addition, it does not take anywhere near as much time. Now, very important: one of the things that can hurt your performance the most is many small heap allocations.
We already discussed mutable objects in general. Note that they are (in general) heap allocated. In some programming languages you can even change the number of fields of an object, so any code further down the line can suddenly add more fields. Sounds great, but the problem is that if you can always add more fields, then that object does not have a single size; its size can depend on which code runs after it, and the only place to store such an object is the heap, because you don't know how to pre-size it. This can be really problematic for performance.
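As a rough illustration of the same point with structs (my own sketch, not from the original notes): an immutable struct with concrete fields has a fixed size fully known from its type, while a mutable struct is (in general) heap-allocated and handled through a pointer.

struct PointI          ## immutable: size is fully known from the type
    x::Float64
    y::Float64
end

mutable struct PointM  ## mutable: (in general) lives on the heap
    x::Float64
    y::Float64
end

isbitstype(PointI)   ## true
isbitstype(PointM)   ## false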
Even with heap allocations, other languages sometimes do well, because there are additional code optimizations that keep the number of heap-allocated objects to a minimum, and coding styles that avoid emphasizing heap allocation.
Now, occasionally we can also try to put things on the stack that are not just scalars. There is a nice package called StaticArrays.jl; it provides statically-sized arrays, and these arrays are stack-allocated:
#code chunk 5
using StaticArrays
sv = SVector{3,Float64}(1.0,2.0,3.0)
typeof(sv)
See, we know the exact size and type of this object, so we know it needs 64*3 bits to store it, and when a function holding this kind of SVector is compiled, it can put the SVector on the stack.
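A quick check of that claim (my own addition):

sizeof(SVector{3,Float64})     ## 24 bytes = 3 * 64 bits
isbitstype(SVector{3,Float64}) ## true, so it can be stack-allocated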
#code chunk 6
function static_inner_alloc!(C,A,B)
    for j in 1:100, i in 1:100
        val = @SVector [A[i,j] + B[i,j]]
        C[i,j] = val[1]
    end
end
@btime static_inner_alloc!(C,A,B)
where we can check what the macro does via
#code chunk 7
@macroexpand @SVector [A[i,j] + B[i,j]]
The upside of static arrays, or statically allocated variables in general, is that they are fast and essentially don't count as allocations at all. The downside is that you only have a little space on your stack, so if you try to put too much stuff there you can actually make things a lot slower (why??). Statically allocated variables are great for small variables, but you don't want to over-abuse the stack, so be careful.
Many times you do need to write into an array, maybe because the object is very large and has to live on the heap, but heap allocations are slow. So how can you write into an array without performing a heap allocation?
The answer is mutation. Mutation means changing the values of an already existing array (sometimes I have referred to this as pre-allocation; we already saw this!). In that case, no free memory has to be found to put the array in (and no memory has to be freed by the garbage collector).
#code chunk 8
function inner_noalloc!(C,A,B) ## it takes C as an argument!
    for j in 1:100, i in 1:100
        val = A[i,j] + B[i,j]
        C[i,j] = val[1]
    end
end
@btime inner_noalloc!(C,A,B)
This is the same function as before. What it essentially does is C = A + B, but instead of putting the result into a new matrix it just overwrites C, which was passed as the first argument to the function. In Julia, functions which mutate their first argument are conventionally marked with a !. Note that the previous functions also had the !. The non-mutating equivalent function would be (and it takes a bit longer!)
#code chunk 9
function inner_alloc(A,B)
    C = similar(A)
    for j in 1:100, i in 1:100
        val = A[i,j] + B[i,j]
        C[i,j] = val[1]
    end
    C
end
@btime inner_alloc(A,B)
This is what you usually do when you write C = A + B in numpy (for my Python-expert colleagues and friends!).
In this second version we have one heap-allocated object. We are not creating a lot of them; remember that before, when we were allocating inside the loop, we allocated 10000 things and it was extremely slow. This is better, but even here we have a bit of a slowdown, because we are still allocating something, and allocating an array takes some time.
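Just to get a feel for that cost (my own quick check, not from the notes), we can time the allocation by itself:

@btime similar($A);   ## allocating an uninitialized 100x100 matrix already costs time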
So if you want to do a lot of operations fast, you want to reuse your arrays as much as possible and avoid creating new ones. Even when you are working with very large matrices, you want to decrease the number of memory allocations as much as possible, or at least make them happen as few times as possible, and put them before your loops whenever you can. That also means that if you are going to have loops within loops within loops, you should place all of the allocations at the very beginning, before all of your loops.
Now let's say we want to calculate 10 * (A + B), and we do it in a similar way to the previous calculations:
#code chunk 10
function f(A,B)
    sum([A .+ B for k in 1:10]); # ten 100-by-100 matrices
end
@btime f(A,B)

function g(A,B)
    C = zero(A)
    for k in 1:10
        C .+= A .+ B; ## elementwise version of the plus!
    end
    C
end
@btime g(A,B)
In Julia, the . operator basically means: turn the following operation into an element-wise one. This is also called broadcasting. So A .* B means do element-wise multiplication between A and B. Another way to write it is
map((a,b) -> a*b, A, B)
or a lower-level equivalent representation would be
for i in 1:length(A)
    C[i] = A[i] * B[i] ## note we did this in single indexing
end
So here we are indexing the arrays with a single index, and with this we do not have to worry about inner-column versus inner-row ordering, because we are already following the order in which the data is stored, which is column-wise for Julia. So if we want to write generic code like this that works for arrays or matrices irrespective of dimension, the dot operation will generate this single-index code under the hood.
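A small sketch of such generic code (my own example): with eachindex the same loop works for vectors, matrices, or higher-dimensional arrays, and always visits elements in their storage (column-major) order.

function elmul!(C, A, B)
    for i in eachindex(C, A, B)
        C[i] = A[i] * B[i]
    end
    C
end
elmul!(similar(A), A, B)   ## works whatever the number of dimensions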
So how would you compute A .* B .* C? How would numpy do it? It would interpret it as (A .* B) .* C, that is
#code chunk 11
function dotstar(A, B, C)
    tmp = similar(A)
    for i in 1:length(A)
        tmp[i] = A[i] * B[i]
    end
    tmp2 = similar(C)
    for i in 1:length(C)
        tmp2[i] = tmp[i] * C[i]
    end
    tmp2
end
@btime dotstar(A,B,C)
Now, is this going to be the optimal way? No, we still have heap allocations (two temporary arrays). What if we do it as a single operation, like below:
#code chunk 12
function dotstar2(A, B, C)
    tmp = similar(A)
    for i in 1:length(A)
        tmp[i] = A[i] * B[i] * C[i]
    end
    tmp
end
@btime dotstar2(A,B,C)
So we can see it got faster and we cut out some allocations, since we got rid of one of the temporary arrays. This is called loop fusion: we had two loops, but since we are doing element-wise operations in both of them, they can be a single loop.
This is why we don't want .* to be an ordinary function; rather, the compiler should have a mechanism that creates new functions based on all of the things you are trying to broadcast together.
How does the first one look in map form?
map((tmp, c) -> tmp*c, map((a,b) -> a*b, A, B), C)
is what would happen if we had .* as a function. But Julia's JIT compiler generates fast code for A .* B .* C, where the dot operations are fused together; this has the map representation
map((a,b,c) -> a*b*c, A, B, C);
So this fusion requires generating a new function at compile time. For example, if we want to compute A .+ B .+ C .* sin.(A), Julia's compiler is going to collect all of these dots and compile a single function that fuses all of these loops together.
#code chunk 13
function unfused(A,B,C)
    tmp = A .* B
    tmp .* C
end
@btime unfused(A,B,C);
The timing is pretty close to dotstar, the two-loop version we wrote by hand. And the fused version should be close to the one-loop version dotstar2:
#code chunk 14
fused(A,B,C) = A .* B .* C
@btime fused(A,B,C);
You can see that the allocation counts of the corresponding loop versions match, but these .* implementations are still a little faster than the hand-written loops; some additional things are done to make the loops faster, but we will get to that.
But we still have some allocations, so let's also see how we can do a mutating version. Let's do the looped version first.
#code chunk 15
function dotstar3!(tmp, A, B, C)
    #tmp = similar(A) # don't need the allocation anymore
    for i in 1:length(A)
        tmp[i] = A[i] * B[i] * C[i]
    end
    tmp
end
D = similar(A)
@btime dotstar3!(D, A,B,C)
Of course you cannot get memory for free, so the user has to define a temporary variable; but if they happen to have this variable lying around in their back pocket, they can call your function and it will have absolutely no allocations.
So dotstar2 needed 19 microseconds and now we are down to 15. This is the fastest we have gotten so far by writing the loop by hand.
Now let's use the dot format. Here we use .= :
#code chunk 16
function realdotstar3!(tmp, A, B, C)
    tmp .= A .* B .* C
    tmp
end
@btime realdotstar3!(D, A,B,C)
With mutation you have to be careful: you must actually be modifying the exact same variable that you pass in. How do you know? Let's do this:
#code chunk 18
D = zeros(100,100)
and make realdotstar3! return nothing. This time it takes about 9 microseconds and also has no allocations.
#code chunk 19
function realdotstar3!(tmp, A, B, C)
    tmp .= A .* B .* C
    nothing
end
realdotstar3!(D, A,B,C)
Is D still going to be all zeros?
#code chunk 20
D
So it has changed the matrix that you put in. That is why the bang ! is important: it warns the user that, when calling a mutating function, the values they pass in are going to change.
This is what we call a non-allocating function with no return.
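A quick way to confirm this (my own addition, not in the original notes):

@allocated realdotstar3!(D, A, B, C)   ## should report 0 bytes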
So you can put dots on everything. Try to figure out what the Julia compiler is doing under the hood for
A .+ B .* C .* sin.(cos.(A))
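As a hint (my own sketch, following the earlier map representations): all of these dots fuse into a single element-wise kernel, roughly

map((a, b, c) -> a + b * c * sin(cos(a)), A, B, C)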
This is one of the things that really causes the speed difference between optimized code in something like C++ or Fortran and Python/numpy or R. In those languages, someone has written down a function like .* that lowers into C code under the hood, equivalent to
function .*(A,B)   ## schematic: what the library-provided element-wise kernel does
    tmp = similar(A)
    for i in 1:length(A)
        tmp[i] = A[i] * B[i]
    end
    tmp
end
and then the unfused looping always happens, which looks like .*(.*(A,B), C), so it is not possible to fuse.
Now, for those who are familiar with vectorized matrix operations in numpy or R, let's see how the vectorized and non-vectorized (looped) versions compare:
#code chunk 20
function vectorized!(tmp, A, B, C)
    tmp .= A .* B .* C
    nothing
end

function nonvectorized!(tmp, A, B, C)
    for i in 1:length(A)
        tmp[i] = A[i] * B[i] * C[i]
    end
    nothing
end

@btime vectorized!(D, A, B, C);
@btime nonvectorized!(D, A, B, C);
It looks like there is some difference in the timings; we will see where that comes from.