Optimizing code and understanding memory

Hierarchical structure of Computer Memory

There is a disconnect between the theoretical world and the actual world of high-performance computing. The reason is that hardware actually matters: the way the hardware is put together means we have to understand how it works in order to write fast code. So we need a mental model of memory.

Let's look up "CPU cache" (for example, the memory-hierarchy diagrams from the University of Washington course notes). The typical picture is a hierarchy: registers, then the L1, L2, and L3 caches, then main memory (RAM), then disk, where each level is larger but slower than the one before it.

This hierarchical format really matters because it means there is a natural ordering that we should follow whenever we do our calculations. For optimizing code, using things that are already in a closer cache helps the code run faster, because the data does not have to be brought in from a far-away location to the CPU. When something needs to be pulled directly from main memory, this is known as a cache miss, and it is the most costly. For example, one reason linear algebra libraries are so efficient is that they block their computations so that one block fits into the L1 cache, the next block fits into the L2 cache, and so on; these block sizes differ from chip to chip, and the libraries are tuned per CPU to get the maximum possible performance. The main thing to take away is that this hierarchy of memory makes a difference: the things on your computer might be close in memory or far in memory, and this memory latency is one of the most important factors for your computational speed.
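To see the hierarchy in action, here is a minimal sketch (my own example, not one of the code chunks below): summing the same data contiguously versus jumping through memory with a large stride. The total number of additions is identical, but the strided version keeps missing the cache, so it runs noticeably slower.

using BenchmarkTools

x = rand(10^7)

function sum_contiguous(x)
  s = 0.0
  for i in 1:length(x)
    s += x[i] ## walks memory in order: cache and prefetcher friendly
  end
  s
end

function sum_strided(x, stride_len)
  s = 0.0
  for offset in 1:stride_len, i in offset:stride_len:length(x)
    s += x[i] ## same additions, but each access jumps far away in memory
  end
  s
end

@btime sum_contiguous($x)
@btime sum_strided($x, 4096) ## every access lands on a different cache line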

In what follows we will see several times that where information gets stored really matters.

Row/Column-Major

Many algorithms in numerical linear algebra are designed to minimize cache misses. Because of this hierarchy, many modern CPUs try to guess what you will want next in your cache. When dealing with arrays, the CPU will speculate ahead and grab what is known as a cache line, i.e. the next chunk of the array. Thus, your algorithms will be faster if you iterate along the values it is grabbing, which are the next values in the contiguous order of the stored array. When an array has only one dimension, its elements can be stored one after the other in a contiguous block of memory. For two-dimensional (or higher) arrays, there are two common conventions: row major and column major.

Arrays in C and Python's NumPy are stored in row-major order. Julia, on the other hand, chooses the latter strategy, storing arrays in column-major order, just like MATLAB and Fortran. (This rule generalizes to higher-dimensional arrays as well: in Julia the first index is the one that is contiguous in memory.)
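A quick way to check the column-major claim (a tiny sketch of mine): flattening a small matrix with vec walks down the columns, not across the rows.

M = [1 2 3;
     4 5 6]
vec(M) ## returns [1, 4, 2, 5, 3, 6]: the columns are contiguous in memory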

In Julia we have two kinds of indices for this: Cartesian and linear. For example:

AA = rand(2,3)
AA[1] ## == AA[1,1]: linear indexing follows the storage (column-major) order
AA[2] ## == AA[2,1]
AA[3] ## == AA[1,2]

So the formula is A[i, j] == A[i + (j-1)*numrows]. Since the number of rows of AA is 2, we can check:

#code chunk 2
AA[2 + (2-1)*2  ] == AA[2,2]
AA[2 + (3-1)*2  ] == AA[2,3]
AA[1 + (2-1)*2  ] == AA[1,2]

Here is a direct way to go from Cartesian to linear, or linear to Cartesian, indices in Julia:

## going from Cartesian index to linear index and back

A = rand(3, 5)

lin_I = LinearIndices(A)
cart_I = CartesianIndices(A) 

lin_I[CartesianIndex(2, 3)] ## given a Cartesian index, find the linear index
cart_I[8] ## given a linear index, find the Cartesian index


So now let's see whether this column-major ordering makes any difference or not.

We consider the problem of taking two 100x100 random matrices and forming a third matrix by element-wise addition. We can do it simply by looping through the elements of the matrices, but we can loop either row-wise or column-wise.

The two functions below show both. In the first one, the outer loop is over rows and the inner loop is over columns (so we walk across each row first, then move to the next row). In the second function it is the opposite: we walk down each column first, then move to the next column. Note that in both functions we pre-allocate C.

#code chunk 3
A = rand(100,100);
B = rand(100,100);
C = rand(100,100);

function inner_rows!(C,A,B)
  for i in 1:100, j in 1:100 ## i (row) is the outer loop: the inner j sweep strides across memory
    C[i,j] = A[i,j] + B[i,j]
  end
end

function inner_cols!(C,A,B)
  for j in 1:100, i in 1:100 ## j (column) is the outer loop: the inner i sweep walks memory contiguously
    C[i,j] = A[i,j] + B[i,j]
  end
end

using BenchmarkTools
@btime  inner_rows!(C,A,B)
@btime inner_cols!(C,A,B)

We run both, benchmark them, and notice that the column-wise version is much faster. Exploiting the memory layout is crucial.
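A related trick (my sketch, not one of the numbered chunks): eachindex iterates in the array's natural memory order, so you get the cache-friendly traversal without hard-coding the loop nesting at all.

function inner_eachindex!(C,A,B)
  for i in eachindex(C, A, B) ## linear indices in storage (column-major) order
    C[i] = A[i] + B[i]
  end
end
@btime inner_eachindex!(C,A,B) ## comparable to inner_cols!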

Stack and Heap

The stack is a fixed-size piece of memory that lives right next to the core. Because its size is fixed and the location of everything on it is known beforehand, anything there can be grabbed instantly without searching around, which makes it fast. The heap, on the other hand, is a dynamically allocated region of memory associated with the program. The downside is that it has a changing size, which means the things stored there can have changing sizes, so you can have objects whose memory length is not known in advance.

So what is wrong with heap allocations? Everything on the heap is accessed through a pointer (a reference), and whenever you need a value you have to follow that pointer; if type information is needed, it has to be checked every time you ask for it. The stack, by contrast, is one straight line of memory where things sit side by side, close to the core.

Many things are automatically placed on the stack when your program runs, so there is not much free space there, but you can make use of some of it yourself, as we will see later.
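As a quick sanity check (a small sketch of mine, reusing BenchmarkTools from above): a function whose result is a plain scalar never touches the heap, while wrapping the very same result in an array does.

g_scalar(x) = x * 2.0 + 1.0 ## pure stack work: Float64 in, Float64 out
g_array(x)  = [x * 2.0 + 1.0] ## the one-element Array must live on the heap

@btime g_scalar(3.0) ## 0 allocations
@btime g_array(3.0) ## 1 allocation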

One place where you actually see this in action is when you allocate an array. The array is the most basic thing that shows up all the time and lives on the heap. Why does it live on the heap? Because an array can be an object of any size, and that size is not known at compile time. Recall that Julia uses all of its type information at compile time; when I create a 100 by 100 array, the REPL shows that its type is an Array of Float64s, but the type says nothing about the length, so at compile time the compiler has no idea how long the array will be. This is what we call a dynamically sized array. Most languages use dynamically sized arrays because they are really convenient: you can change their size, resize them, and so on. But the key thing is that this has a cost: all arrays live on the heap, and that means whenever we create new arrays we are creating new pointers to things on the heap, and so on.

Okay, so let's see this in action, reusing the small AA matrix from before:

for i in eachindex(AA)
  println(i, " => ", AA[i]) ## i is a linear index, in the stored (column-major) order
end


For construction and initialization of arrays we can use the standard constructors.
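For example (a short sketch of mine listing the common ones):

B1 = zeros(2, 3) ## 2x3 matrix of 0.0
B2 = fill(7.0, 2, 3) ## 2x3 matrix filled with 7.0
B3 = Array{Float64}(undef, 2, 3) ## uninitialized: fast, but contains garbage values
B4 = similar(AA) ## same size and element type as AA, also uninitialized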

Now back to memory allocations. Let's look at some code:

#code chunk 3
function inner_alloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = [A[i,j] + B[i,j]] ## note we put it inside the array
    C[i,j] = val[1]
  end
end
@btime inner_alloc!(C,A,B)

In the code above we are doing the same column-wise loop, but note what we also do: we put the sum of the elements inside an array before storing that value into C. Let's see how fast this runs... 100 x 100 = 10000 allocations!

So the key here is that when we compute A[i,j] + B[i,j] we get a scalar, and then we put that scalar inside an array, and that array is a dynamically sized object. Although we know it is a one-by-one object (a length-one vector), that is not part of the type information, so at compile time the compiler cannot specialize directly on it.

So the question is: how much does it matter if we use a scalar rather than creating an array? Let's change the code a bit by replacing the val array with a val scalar to store the temporary value...

#code chunk 4
function inner_noalloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j] ## val is now a scalar, which lives on the stack
    C[i,j] = val[1]
  end
end
@btime inner_noalloc!(C,A,B)

Huge difference! A scalar is something whose size is known at compile time: if I have something like 1.0, I know it is a Float64, which tells me it is a 64-bit object.
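A quick way to see this (my sketch; sizeof and isbitstype are standard Base functions):

sizeof(Float64) ## 8 bytes: the size is known from the type alone
isbitstype(Float64) ## true: plain bits, so it can live on the stack
isbitstype(Vector{Float64}) ## false: the length is not in the type, so it lives on the heap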

Note that the only difference is that in the previous version we created an array and in this one we do not; once we make that change, even though this version is still doing the same additions, it takes nowhere near as much time. Very important: one of the things that can hurt your performance the most is small heap allocations.

We discussed mutable objects in general, right? Note that they are always heap-allocated. In some programming languages you can even change the number of fields of an object, so any code further down the line can suddenly add more fields. That sounds great, but the problem is that if you can always add more fields, then the object does not have a single size: its size can depend on which code runs after it, and the only place you can store such an object is the heap, because you don't know how to pre-size it. This can be really problematic for performance.
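As a small illustration (my sketch, not from the notes): an immutable struct with concrete fields has a fixed size known to the compiler, while a mutable struct gets its own heap-allocated identity.

struct PointI ## immutable, fixed 16-byte layout: can stay on the stack
  x::Float64
  y::Float64
end

mutable struct PointM ## mutable: gets a stable address, typically heap-allocated
  x::Float64
  y::Float64
end

shift_i(p) = PointI(p.x + 1, p.y + 1)
shift_m(p) = PointM(p.x + 1, p.y + 1)

@btime shift_i($(PointI(1.0, 2.0))) ## 0 allocations
@btime shift_m($(PointM(1.0, 2.0))) ## 1 allocation for the new PointM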

Even with heap allocations, other languages sometimes do well, because there are additional code optimizations that keep the number of heap-allocated things to a minimum, and because of coding styles that do not emphasize heap allocation.

Now, occasionally we can also put things on the stack that are not just scalars. There is a nice package called StaticArrays.jl which provides statically sized arrays, and these arrays are stack-allocated:

#code chunk 5
using StaticArrays
sv = SVector{3,Float64}(1.0,2.0,3.0)
typeof(sv)

See, we know the exact size and type of this object, so we know it needs 64*3 bits of storage, and when a function holding this kind of SVector is compiled, it can put the SVector on the stack.

#code chunk 6
function static_inner_alloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = @SVector [A[i,j] + B[i,j]] 
    C[i,j] = val[1]
  end
end
@btime static_inner_alloc!(C,A,B)

To see what the macro generates under the hood:

#code chunk 7
@macroexpand @SVector [A[i,j] + B[i,j]]

The upside of static arrays (and statically allocated variables in general) is that they are fast and essentially don't count as allocations at all. The downside is that you only have a little space on your stack, so if you try to put too much stuff there you can actually make things a lot slower (large stack objects have to be copied around and can overflow the limited stack space). Statically allocated variables are great for small objects, but don't over-abuse the stack; be careful.
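For example (a sketch of mine, reusing BenchmarkTools from above), compare a small fixed-size vector stored as an ordinary Vector versus as an SVector:

using StaticArrays, BenchmarkTools

v  = [1.0, 2.0, 3.0] ## ordinary Vector: heap-allocated
sv = SVector{3}(1.0, 2.0, 3.0) ## SVector: the length 3 is part of the type

twice(x) = x .+ x ## broadcasting works for both

@btime twice($v) ## allocates a new Vector for the result
@btime twice($sv) ## 0 allocations: the result SVector stays on the stack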

Mutation to Avoid Heap Allocations

Many times you do need to write into an array, maybe because you have a very large object to work with, but heap allocations are slow. So how can you write into an array without performing a heap allocation?

The answer is mutation. Mutation means changing the values of an already existing array (I have sometimes referred to this as pre-allocation; we already saw it above). In that case, no free memory has to be found to put the array in (and no memory has to be freed by the garbage collector).

#code chunk 8
function inner_noalloc!(C,A,B) ## it takes C as an argument!
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
end
@btime inner_noalloc!(C,A,B)

This is the same function. What it essentially does is C = A + B, but instead of putting the result into a new matrix it overwrites C, which was passed as the first argument to the function. In Julia, functions which mutate their first argument are conventionally marked with a !. Note that the previous functions also had the !. The non-mutating equivalent would be (and it takes a bit more time!):

#code chunk 9
function inner_alloc(A,B)
  C = similar(A) ## the output array is allocated inside the function
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
  C
end
@btime inner_alloc(A,B)

This is what effectively happens when you write C = A + B in NumPy (for my Python expert colleagues and friends!).

In the second one we have one heap-allocated object. We are not creating a lot of them; remember that before, when we were allocating inside the loop, we allocated 10000 things and it was extremely slow. This is better now, but even here we have a bit of a slowdown, because we are still allocating something, and allocating an array takes time.

So if you want to do a lot of operations fast, you want to reuse your arrays as much as possible; you do not want to keep creating new arrays.

Even when you are working with very large matrices, you still want to decrease the number of memory allocations as much as possible, or at least make them happen as few times as possible. Try to put them before loops whenever you can; if you are going to have loops within loops within loops, put all of the allocations at the very beginning, before all of your loops.
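For example (my sketch, not one of the chunks above), hoisting the temporary buffer out of the loop removes the per-iteration allocation:

function many_sums_alloc(A, B, n)
  s = 0.0
  for k in 1:n
    C = A .+ B ## allocates a fresh 100x100 matrix on every iteration
    s += C[1, 1]
  end
  s
end

function many_sums_prealloc(A, B, n)
  s = 0.0
  C = similar(A) ## one allocation, before the loop
  for k in 1:n
    C .= A .+ B ## reuse the same buffer every iteration
    s += C[1, 1]
  end
  s
end

@btime many_sums_alloc($A, $B, 100) ## about 100 allocations
@btime many_sums_prealloc($A, $B, 100) ## 1 allocation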

Julia's Broadcasting Mechanism

Now let's say we want to calculate the sum of ten copies of A .+ B (that is, 10 * (A + B)), in a way similar to the previous calculations:

#code chunk 10
function f(A,B)
  sum([A .+ B for k in 1:10]); # 10 100 by 100 matrices 
end
@btime f(A,B)

function g(A,B)
  C = zero(A) ## must start from zeros; similar(A) would give uninitialized memory
  for k in 1:10
    C .+= A .+ B ## element-wise (broadcasted) addition, accumulated in place
  end
  C
end
@btime g(A,B)

In Julia, the . operator basically means "turn the following operation into an element-wise one"; this is called broadcasting.

So A .* B means element-wise multiplication of A and B. Another way to write it is

map((a,b) -> a*b, A, B)

or an even lower-level equivalent representation would be

for i in 1:length(A)
  C[i] = A[i] * B[i] ## note: a single linear index, following the storage order
end

So here we are indexing the arrays with a single index, and we do not have to worry about inner-column versus inner-row loops because we simply follow the order in which the array is stored, which is column-wise for Julia. If we want to write generic code like this that works for arrays of any dimension, the dot operation will generate this single-index code under the hood.

So how would you compute A .* B .* C? How would NumPy do it? It would interpret it as (A .* B) .* C, that is:

#code chunk 11
function dotstar(A, B, C)
  tmp = similar(A)
  for i in 1:length(A)
    tmp[i] = A[i] * B[i]
  end
  tmp2 = similar(C)
  for i in 1:length(C)
    tmp2[i] = tmp[i] * C[i]
  end
end
@btime dotstar(A,B,C)

Now, is this the most optimal way? No! We still have the extra heap allocation. What if we do it as a single operation, like below:

#code chunk 12
function dotstar2(A, B, C)
  tmp = similar(A)
  for i in 1:length(A)
    tmp[i] = A[i] * B[i] * C[i]
  end
end
@btime dotstar2(A,B,C)

So we can see it got faster, and we cut out an allocation since we got rid of one of the temporary arrays. This is called loop fusion: we had two loops, but since we are doing element-wise operations in both of them, they can be fused into a single loop.

This is why we don't want a fixed function for .*; instead, the compiler should create new functions based on all of the operations you are chaining together with dots.

How does the first one look in map form?

map((tmp, c) -> tmp * c, map((a, b) -> a * b, A, B), C)

is what would happen if we had .* as an ordinary function. But Julia's JIT compiler can generate fast code for A .* B .* C where the dot operations are fused together, which has the map representation

map((a,b,c) -> a*b*c, A, B, C);

So this fusion requires generating a new function at compile time. For example, if we want to compute A .+ B .+ C .* sin.(A), Julia's compiler is going to collect all of these dots and compile a single function that fuses all of these loops together.

#code chunk 13
function unfused(A,B,C)
  tmp = A .+ B
  tmp .+ C
end
@btime unfused(A,B,C);

The timing is pretty close to dotstar, the two-loop version we wrote by hand (these functions use .+ instead of .*, but for Float64 the cost per element is the same). And the fused version should be close to the one-loop version dotstar2:

fused(A,B,C) = A .+ B .+ C
@btime fused(A,B,C);

You can see that the allocation counts of the corresponding loop versions match, but these broadcast implementations are still a little faster than the hand-written loops; some additional things are done to make the loops faster, and we will get to that.

But we still have some allocations, so let's also see how we can write a mutating version. Let's do the looped version first.

#code chunk 15
function dotstar3!(tmp, A, B, C)
  #tmp = similar(A) # don't need the allocation anymore
  for i in 1:length(A)
    tmp[i] = A[i] * B[i] * C[i]
  end
  tmp;
end

D = similar(A)
@btime dotstar3!(D, A,B,C)

Of course the memory is not free: the user has to define the temporary variable themselves. But if they happen to have such a buffer lying around in their back pocket, they can call your function and it will perform absolutely no allocations.

So dotstar2 needed 19 microseconds and now we are down to 15. This is the fastest we have gotten so far, by writing the loop by hand.

So how do we do this with the dot format? Here we use .= :

#code chunk 16
function realdotstar3!(tmp, A, B, C)
  tmp .= A .* B .* C ## .= broadcasts into the existing array: no new allocation
  tmp
end

@btime realdotstar3!(D, A,B,C)

With mutation you have to be careful: the function really is writing into the exact same variable that you passed in. How do you know? Let's check.

#code chunk 18
D = zeros(100,100)

And make realdotstar3! return nothing. This time it takes 9 microseconds and there are no allocations:

#code chunk 19
function realdotstar3!(tmp, A, B, C)
  tmp .= A .* B .* C
  nothing ## return nothing so the caller cannot accidentally rely on the buffer
end
@btime realdotstar3!(D, A,B,C)

Is D still going to be zeros?

#code chunk 20
D

It is not: the function changed the matrix that you put in. That is why the bang ! is important; it warns the user that calling a mutating function is going to change the values of its arguments.

This is what we call a non-allocating function with no return value.
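One classic pitfall here (a small sketch of mine, not part of the original code) is the difference between .= and = inside such a function: only .= actually writes into the caller's array.

function fill_inplace!(tmp, A, B)
  tmp .= A .* B ## mutates the caller's array; the caller sees the result
  nothing
end

function fill_rebind(tmp, A, B)
  tmp = A .* B ## allocates a new array and rebinds the local name only
  nothing ## the caller's array is left untouched
end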

So you can put dots on everything. Try to figure out what the Julia compiler does under the hood for this expression:

A .+ B .* C .* sin.(cos.(A))

This is one of the things that really causes the speed difference between optimized code in something like C++ or Fortran and code in Python/NumPy or R. The way those languages work is that someone has written down a function like .* that lowers into C code under the hood, equivalent to:

# (pseudocode: pretend .* were an ordinary function you could define)
function .*(A, B)
  tmp = similar(A)
  for i in 1:length(A)
    tmp[i] = A[i] * B[i]
  end
  tmp
end

And then the unfused looping always happens, as in .*(.*(A, B), C), so it is not possible to fuse the loops.

Now, for those who are familiar with vectorized matrix operations in NumPy or R, let's see how the vectorized and non-vectorized versions compare:

#code chunk 20
function vectorized!(tmp, A, B, C)
  tmp .= A .* B .* C
  nothing
end

function nonvectorized!(tmp, A, B, C)
  for i in 1:length(A)
    tmp[i] = A[i] * B[i] * C[i]
  end
  nothing
end

@btime vectorized!(D, A, B, C)
@btime nonvectorized!(D, A, B, C)

It looks like there is some difference in timing; we will see where that comes from.


Note on Vectorization and Speed

In articles on MATLAB, Python, R, etc., this is where you will be told to vectorize your code. Notice from above that there is no fundamental performance difference between writing loops and using vectorized broadcasts. This is not abnormal! The reason you are told to vectorize code in these other languages is that they have a high per-operation overhead (which will be discussed further down). This means that every call, like +, is costly in these languages. To get around this issue and make the language usable, someone wrote and compiled the C/Fortran loop that does the broadcasted form (see NumPy's GitHub repo). Thus the MATLAB/Python/R equivalents of A .+ B call a single C function, which avoids the per-call cost and is therefore faster.

But this is not an intrinsic property of vectorization. Vectorization isn't "fast" in these languages; it is just close to the correct speed, because during the vectorized call the time is spent inside a fast language, so it looks fast. The reason vectorization is recommended is that looping is slow in these languages.

Because looping isn't slow in Julia (or C, C++, Fortran, etc.), loops and vectorization generally have the same speed. So use the one that works best for your code without a care about performance.

(As a small side effect, these high-level languages tend to allocate a lot of temporary variables, since the individual C kernels are written for specific numbers of inputs and thus don't naturally fuse. Julia's broadcast mechanism just generates and JIT compiles Julia functions on the fly, so it can accommodate the combinatorial explosion of possible fused operations by compiling only the combinations that a specific piece of code actually needs.)


So Julia does not have the problem with looping over arrays or matrices that these other languages have; internally, broadcast is just a loop anyway. So what is causing the difference in speed between the handwritten loop and the ./broadcast operators in Julia?

Bounds checking is costly

#code chunk 22
A[10001] ## BoundsError: A has only 100*100 = 10000 elements, and Julia checks every access
#code chunk 24
function nonvectorized!(tmp, A, B, C)
  ## check the sizes once up front, then tell Julia to skip per-element bounds checks
  @boundscheck length(tmp) == length(A) == length(B) == length(C) || throw(DimensionMismatch("arrays must have equal length"))
  @inbounds for i in 1:length(A)
    tmp[i] = A[i] * B[i] * C[i]
  end
  nothing
end
@btime vectorized!(D, A, B, C)
@btime nonvectorized!(D, A, B, C)

See, now it is even faster than the version using the dot operators.

There is no slowdown from looping in Julia if you turn off the bounds checks; that is how all the vectorized code is written internally.

Heap Allocation from slicing

When we grab

#code chunk 25
A[50,50]

does this value go on the heap? No, the output does not go on the heap when you are inside a function, because the compiler can statically know that what comes out is 64 bits. How does it actually know that?

We'll discuss this in a little bit, but if the compiler can figure out what kind of number comes out of your array, it knows the result is 64 bits and it can build the function with a slot already reserved to hold that output, so no heap allocation is needed here.
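As a quick check (my sketch, using the same A as above), scalar indexing inside a function really does produce zero heap allocations:

get_one(A) = A[50, 50] ## returns a plain Float64: size known from the element type
@btime get_one($A) ## 0 allocations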

Now what about

#code chunk 26
A[1:5, 1:5]

This one does have to be allocated:

#code chunk 27
@btime A[1:5,1:5]

since it creates a new array. The way to get around this kind of allocation is to use what is known as a view:

#code chunk 28
@btime @view A[1:5,1:5] 
# equivalent to view(A, 1:5, 1:5)

Let's put these inside functions to see the differences. Benchmarks should always be run inside a function.

Because Julia only JIT compiles function calls, if I put everything into a function the compiler will run its optimizations on the full function, whereas if I call these expressions directly in the REPL (directly in interactive form) it is going to compile each intermediate little piece on its own. What that tells you is that if you want to write fast Julia code you should put things into functions, and you should always benchmark inside functions, because that is the only time the compiler has complete information about what will happen throughout your entire algorithm.

#code chunk 29
function ff7(A) 
  A[1:5, 1:5]
end

function ff8(A)
  @view A[1:5,1:5] 
end
@btime ff7(A)#89.024 ns (1 allocation: 288 bytes)
@btime ff8(A)#28.932 ns (1 allocation: 64 bytes)

They are both allocating something, but what the view allocates is much, much smaller. Why is it allocating so much less? Because the view is really just allocating a small object that points to the existing piece of memory.

Let's do it for 50 by 50:

#code chunk 30
function ff7(A) 
  A[1:50, 1:50]
end

function ff8(A)
  @view A[1:50,1:50] 
end
@btime ff7(A)# 3.568 μs (2 allocations: 19.64 KiB)
@btime ff8(A)# 29.267 ns (1 allocation: 64 bytes)

When I take the 1:5, 1:5 slice, it does one allocation: it creates one array, which takes up 288 bytes. When I change this to 50 by 50, the allocation grows to 19.64 KiB, around 19,000 bytes. What happens with the view? The smaller view allocates one thing of 64 bytes, and when I do the bigger view it still allocates just 64 bytes. Why is that the case? Because no actual array memory was created: those 64 bytes are the view object itself, essentially a pointer to the existing piece of memory plus the information it needs to know the size of the slice. That means you can create views as large as you want and you are not actually creating new arrays in memory.

This idea of objects that look like they create arrays but do not actually create arrays shows up in many different places. Another case where it shows up is ranges:

#code chunk 31
function ff9()
  1:50000 ## builds a UnitRange: just the two endpoints, not a 50000-element array
end
@btime ff9()
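As a small extra illustration (my sketch, not part of the original chunk), compare keeping the range lazy with materializing it via collect:

r = 1:50000
sizeof(r) ## a UnitRange stores only its two endpoints: 16 bytes
length(r), r[100] ## yet it behaves like a read-only vector
@btime collect(1:50000) ## collect materializes it into a real Vector: a large allocation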

Optimizing Memory Use Summary