Spending Your Cache Wisely

30/08/2011

Back in the day, computers were much simpler things. They had a CPU and they had some memory. Getting data to/from memory took a fairly predictable amount of time, and when optimising you’d just try to make sure that memory speed wasn’t the bottleneck that caused your application to slow down. Today, however, things have changed, primarily because CPUs have gained speed at a much faster rate than memory has. Where a CPU might get 10x faster over a few years, the memory it’s paired with might only get 3x faster, making memory an increasingly common culprit for slowing games down…

The hardware solution to this is the use of caches – a small amount of super fast (and pricey) memory that sits in between the CPU and main RAM. A cache is made up of a set of small cache lines (usually 128 bytes), each of which at any point in time represents some 128 byte block of main memory. Cache lines can be flushed back to RAM (writing back any changes to the memory they represent), and read from RAM (refreshing an existing cache line or loading a new one). It’s the hardware’s responsibility to keep the memory and the cache in sync, and make sure the bits of memory you want to fiddle with are in the cache.
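To make that “line granularity” concrete: two addresses live in the same cache line if they fall in the same aligned 128 byte block. Here’s a quick illustrative helper – the constant and the CacheLineIndex function are just for demonstration (there’s no such API), and real line sizes vary by platform:

#include <stdint.h>
#include <stdio.h>

// Illustrative only - assumes the 128 byte line size discussed above.
const uintptr_t CacheLineSize = 128;

// Cache lines are aligned to their size, so dividing an address by the
// line size tells you which 128 byte block (and hence which line) it's in.
uintptr_t CacheLineIndex(const void* address)
{
       return (uintptr_t)address / CacheLineSize;
}

int main()
{
       char buffer[256];
       // Neighbouring bytes usually share a line; bytes 200 apart never can.
       printf("%d\n", (int)(CacheLineIndex(&buffer[0]) == CacheLineIndex(&buffer[10])));
       return 0;
}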

Nowadays when you read/write memory you’re really reading/writing what is in the cache. This is exceedingly quick (generally on the same scale as the cost of a few instructions). However, if you attempt to read/write data that isn’t in the cache, the CPU must stall, evict a cache line that’s not been used for a while (flushing it back to memory), and replace it with the line containing the memory you want to read/write. It’s not uncommon for the cost of accessing memory that isn’t in the cache to be 30x the cost of your average instruction!
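Don’t take my word for the cost – it’s easy to feel it yourself. Here’s a rough (and very hand-wavy) test of my own: it sums the same 16MB grid twice, once walking memory in order and once jumping a whole row between reads, so nearly every read in the second pass is a cache miss. Exact numbers vary hugely by hardware, and an aggressive optimiser can outsmart it, but the gap is usually dramatic:

#include <stdio.h>
#include <time.h>

const int GridSize = 2048; // 2048 * 2048 ints = 16MB, far bigger than any cache
static int GGrid[GridSize][GridSize];

int SumRowMajor()
{
       int total = 0;
       for(int y = 0; y < GridSize; y++)
              for(int x = 0; x < GridSize; x++)
                     total += GGrid[y][x]; // adjacent ints - mostly cache hits
       return total;
}

int SumColumnMajor()
{
       int total = 0;
       for(int x = 0; x < GridSize; x++)
              for(int y = 0; y < GridSize; y++)
                     total += GGrid[y][x]; // reads 8KB apart - a miss nearly every time
       return total;
}

int main()
{
       clock_t t0 = clock();
       int a = SumRowMajor();
       clock_t t1 = clock();
       int b = SumColumnMajor();
       clock_t t2 = clock();
       printf("row major: %ld ticks, column major: %ld ticks (sums: %d %d)\n",
              (long)(t1 - t0), (long)(t2 - t1), a, b);
       return 0;
}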

So let’s jump straight into some code…


#include <string.h>

struct Person
{
       char Name[58];
       int FavouritePrimeNumber;
       Person() : FavouritePrimeNumber(0) { Name[0] = 0; }
};
int NumPeople = 0;
Person GPeople[1000];
Person* AddPerson(const char* name, int favourite_prime_number)
{
       strcpy(GPeople[NumPeople].Name, name);
       GPeople[NumPeople].FavouritePrimeNumber = favourite_prime_number;
       NumPeople++;
       return &GPeople[NumPeople - 1]; // return the entry we just filled in
}
 

This type of code should be fairly familiar to any games programmer. We’ve got a nice big buffer to store people in (in this case their name and favourite prime number). The AddPerson function simply fills in a new entry and increments the NumPeople counter. I’ve conveniently made the Person structure come out at 64 bytes (58 bytes of name plus a 4 byte int is 62, which the compiler pads up to the next multiple of 4), so we can fit 2 people in a 128 byte cache line.
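Since that padding is down to the compiler, it’s worth checking the size rather than trusting the arithmetic. A quick compile-time check (static_assert if your compiler speaks C++11, or the old typedef trick below otherwise) will catch it if a platform pads differently:

// The cache maths that follows assumes sizeof(Person) == 64.
// C++11: static_assert(sizeof(Person) == 64, "Person is not 64 bytes");
// Pre-C++11 trick - the array gets an illegal negative size, and the
// build fails, if the assumption is wrong:
typedef char PersonIs64Bytes[sizeof(Person) == 64 ? 1 : -1];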
Now suppose our game wants to inspect a property of the people – what if we want to know the biggest favourite prime number…

int GetBiggestFavouritePrimeNumber()
{
       int biggest = 0;
       for(int i = 0; i < NumPeople; i++)
              if(GPeople[i].FavouritePrimeNumber > biggest)
                     biggest = GPeople[i].FavouritePrimeNumber;
       return biggest;
}
 

That all works fine. We iterate over all the added people and track the largest chosen number. However, this article is about memory access, so let’s look at how much data has to be read from memory. FavouritePrimeNumber is a 4 byte integer, so you might be forgiven for assuming that the total memory read is:
Total read (in bytes) = NumPeople * sizeof(int) = NumPeople * 4


Wrong! Think back to how a cache works. On a modern platform there is no such thing as reading a few bytes – if the data you want isn’t in the cache, you need to load the entire cache line. In hardware, the loop will actually do something like:
1. Read GPeople[0].FavouritePrimeNumber – load the whole 128 byte cache line containing GPeople[0] (costs about 30 instructions worth of time)
2. Read GPeople[1].FavouritePrimeNumber – it’ll already be in the cache as it’s in the same line as GPeople[0] (pretty fast)
3. Read GPeople[2].FavouritePrimeNumber – load the whole 128 byte cache line containing GPeople[2] (costs about 30 instructions worth of time)
4. And so on – one new cache line load for every two people…
When we want to read that 4 byte property of a person, we actually have to pull the entire person from memory into the cache first. As a result, from the hardware’s point of view GetBiggestFavouritePrimeNumber has to read 64 bytes per person just to get at each 4 byte integer!
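Written as a formula in the same terms as before (2 people per 128 byte line):
Total read (in bytes) = (NumPeople / 2) * 128 = NumPeople * 64

That’s 16x more traffic than the naive estimate suggested.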
So how do we fix this? Well, it all depends on what you want to do regularly with your data. If in my game I very regularly needed to get the biggest favourite prime number from a large set of people, then I might re-organise the data as follows:

struct Person
{
       char Name[58];
       Person() { Name[0] = 0; }
};
int NumPeople = 0;
int GFavouritePrimeNumbers[1000];
Person GPeople[1000];
Person* AddPerson(const char* name, int favourite_prime_number)
{
       strcpy(GPeople[NumPeople].Name, name);
       GFavouritePrimeNumbers[NumPeople] = favourite_prime_number;
       NumPeople++;
       return &GPeople[NumPeople - 1]; // return the entry we just filled in
}
int GetBiggestFavouritePrimeNumber()
{
       int biggest = 0;
       for(int i = 0; i < NumPeople; i++)
              if(GFavouritePrimeNumbers[i] > biggest)
                     biggest = GFavouritePrimeNumbers[i];
       return biggest;
}

I’ve updated the code so the favourite prime numbers are stored in a separate array that I can iterate over nice and quickly (the technical name for this layout is SOA – structure-of-arrays – as opposed to AOS – array-of-structures). An int is only 4 bytes, which means GetBiggestFavouritePrimeNumber now only has to load a new cache line every 32nd person. Given memory was almost certainly the bottleneck here, my code now runs at roughly 16x its original speed!
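To see where that 16x comes from, count the cache line loads in each version:
AOS (before): NumPeople / 2 line loads – 2 people of 64 bytes per 128 byte line
SOA (after): NumPeople / 32 line loads – 32 ints of 4 bytes per 128 byte line
Speedup in line loads = 32 / 2 = 16x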

The moral of the story - avoid loading data into the cache that you don’t need. Or to put it another way, spend your cache wisely :)

