Wednesday, November 11, 2009

Per CPU data structures and statistics - Improving performance

To improve performance, developers typically define statistics on a per-core basis. Developers also tend to define a module's internal structures on a per-core basis wherever possible. Based on the core on which processing is happening, the appropriate "per core" memory block is used to read and update the data.

In the case of statistics, counters are typically updated during event processing (for example, packet processing). Since a 'per core' statistics counter is updated, there is no need for atomic add/increment operations. Reading the statistics is a rare occurrence, typically initiated by some management function. In this case, one must read across the multiple per-core blocks and add them up to produce comprehensive data. While the counters of multiple blocks are being added, there is a chance that a counter is being updated concurrently. Due to this, there could be some error in the result, but it is normally acceptable.

One example of maintaining internal structures of a module on a per-core basis is 'free lists'. Another example is 'timers'. In both of these examples, a set of blocks is arranged on a per-core basis, for example as linked lists: typically an array of linked lists (or any other data structure) is defined. Whenever the module requires a free block, it gets a free node from the linked list specific to the core it is running on. When the module frees the block, the freed node is returned to that core-specific linked list. Thus, no lock operation is required while allocating and freeing nodes.

Let me take a scenario to explain this better. Let us assume a module contains core-specific free lists and one global active list. The module gets a 'control block' node from a free list and puts it in the active list upon an 'add' operation. As part of a 'delete' operation, this control block gets freed. Let us also assume that upon an event (for example, a packet), which happens very frequently, the control block is first found in the active list and then several statistics are updated during event processing. To improve performance, these counters are kept on a 'per core' basis. Based on the core processing the event, the appropriate core-specific counters are incremented.

/* Per-core statistics counters */
struct modStatistics {
    int Counter1;
    int Counter2;
};

/* Module control block, linked into the active list */
typedef struct modCb_s {
    struct modCb_s *pNext;
    /* .... some module specific state variables ... */
} modCb_t;

/* One statistics block per CPU */
struct modStatistics statsArray[NR_CPUS];

modCb_t *pActiveListHead;   /* Global active list; not important for our discussion */

Free lists:

struct freeQ_s {
    int Count;          /* Number of free nodes on this core's list */
    modCb_t *pHead;
    modCb_t *pTail;
};

/* One free list per CPU */
struct freeQ_s freeLists[NR_CPUS];
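
To make the lock-free allocation concrete, here is a minimal sketch of per-core alloc and free routines over these lists. The helper names modAllocCb()/modFreeCb() are my own for illustration, and the smp_processor_id() lookup assumes the code runs with preemption disabled (or in a non-preemptible context) so the task cannot migrate to another core mid-operation.

modCb_t *modAllocCb(void)
{
    /* Use the free list of the core we are executing on; no locks
     * or atomics are needed since only this core touches this list. */
    struct freeQ_s *q = &freeLists[smp_processor_id()];
    modCb_t *pCb = q->pHead;

    if (pCb != NULL) {
        q->pHead = pCb->pNext;
        if (q->pHead == NULL)
            q->pTail = NULL;
        q->Count--;
    }
    return pCb;
}

void modFreeCb(modCb_t *pCb)
{
    /* Return the node to the executing core's free list. */
    struct freeQ_s *q = &freeLists[smp_processor_id()];

    pCb->pNext = NULL;
    if (q->pTail != NULL)
        q->pTail->pNext = pCb;
    else
        q->pHead = pCb;
    q->pTail = pCb;
    q->Count++;
}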

Though this 'per core' mechanism gives good performance by avoiding atomic operations and locks, it can have cache-related performance implications.

In the above example, let us take the statistics counters. Typically, they get incremented using statsArray[current_cpu].Counter1++. This operation involves:
  • reading the counter value,
  • incrementing it,
  • storing back the new value.
Since all the counters sit next to each other in memory, some of them are bound to fall into the same cache line. As you know, the caching system in CPUs works in units of cache lines: cores typically fetch a whole cache line worth of data from memory upon a read operation. When one core updates its counter, it invalidates the cache line containing that counter in the other cores' caches. So, when another core tries to increment a counter belonging to its own statistics block, it may end up fetching the block from memory again due to the cache line invalidation by the first core, even though the first core did not update any memory location belonging to the second core. This effect is commonly known as 'false sharing'.
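
A small user-space sketch makes the layout concrete; the 128-byte cache line size is an assumption for illustration:

#include <stdio.h>

#define NR_CPUS 32
#define CACHE_LINE_SIZE 128   /* assumed cache line size */

struct modStatistics {
    int Counter1;
    int Counter2;
};

struct modStatistics statsArray[NR_CPUS];

int main(void)
{
    /* On a typical system where int is 4 bytes, each block is 8 bytes,
     * so 16 cores' counter blocks share one 128-byte cache line and
     * keep invalidating it for each other. */
    printf("block size      : %zu bytes\n", sizeof(struct modStatistics));
    printf("blocks per line : %zu\n",
           (size_t)CACHE_LINE_SIZE / sizeof(struct modStatistics));
    return 0;
}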

Similarly, whenever the pHead and pTail pointers in the above example get updated, other cores might end up reading from DDR even though only core-specific memory is being updated as per the module logic. As you know, fetching data from DDR is much more expensive than reading it from the caches.

Note that the L1 cache, and in some processors the L2 cache, is provided on a per-core basis. This problem may not exist with shared caches, but a shared cache inherently has more latency than a dedicated cache. Moreover, the L1 cache in all processor families I know of is dedicated. Some processors (which I think is very good) with a three-level cache hierarchy typically have even the L2 dedicated on a per-core basis. Whenever you have dedicated caches, the above cache invalidation problem can occur, resulting in performance degradation.

To avoid this degradation, programmers typically make each core-specific block cache-line aligned. Though this looks good, it has memory implications. Based on the cache line size, you can waste up to 'cache line size - 1' bytes for each core. If there are more cores, say 32 (which is very common in today's world), it can result in 32 * 127 bytes (assuming a 128-byte cache line), which is almost 4K. Of course, this is the worst-case scenario, but it is still a waste of memory for every 'per core' instance.
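
For reference, here is how the alignment is typically expressed with GCC attributes; the 128-byte line size is again an assumption:

#define CACHE_LINE_SIZE 128   /* assumed cache line size */

/* aligned(CACHE_LINE_SIZE) pads each array element out to a full
 * cache line, so no two cores ever write to the same line. The cost
 * is (CACHE_LINE_SIZE - 8) wasted bytes per core in this example. */
struct modStatistics {
    int Counter1;
    int Counter2;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct modStatistics statsArray[NR_CPUS];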

In addition, each module needs to declare the array size to hold all possible cores. That is, the array needs to be big enough to accommodate CPUs hot-plugged at run time. This wastes memory if the deployment does not have any hot-pluggable CPUs.

I have also heard that in some processor families the core IDs are not contiguous, and hence the CPU-specific arrays need to be as big as the largest core ID value.

Linux's 'per cpu' infrastructure takes care of the above issues and yet provides the benefits of per-core blocks. A module can define its per-CPU blocks using the following facilities provided by Linux.

  • DEFINE_PER_CPU(struct modStatistics, statsArray): This is in place of 'struct modStatistics statsArray[NR_CPUS]'.
  • DEFINE_PER_CPU(struct freeQ_s, freeLists): This is in place of 'struct freeQ_s freeLists[NR_CPUS]'.
Access functions:
  • __raw_get_cpu_var(statsArray) or __get_cpu_var(statsArray) to get hold of the address of the executing CPU's statsArray block.
  • per_cpu(statsArray, cpuid) to get hold of the address of the statsArray block of CPU 'cpuid'.
Using the above, the statistics can be incremented as __get_cpu_var(statsArray).Counter1++. To compute the comprehensive statistics, one could use the following code block.

totalcount1 = 0;
/* for_each_possible_cpu() iterates over every CPU that can ever
 * exist in the system, including CPUs hot-plugged later. */
for_each_possible_cpu(core)
{
    totalcount1 += per_cpu(statsArray, core).Counter1;
}
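
One caveat: on a preemptible kernel the task could migrate to another CPU between fetching the per-CPU address and storing the new value. The kernel's get_cpu_var()/put_cpu_var() pair guards against this by disabling preemption around the access:

/* get_cpu_var() disables preemption so the task cannot migrate to
 * another CPU in the middle of the read-modify-write;
 * put_cpu_var() re-enables preemption. */
get_cpu_var(statsArray).Counter1++;
put_cpu_var(statsArray);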

Let us analyze what is happening under the hood.

Linux keeps all the per-CPU blocks defined using DEFINE_PER_CPU across all modules in one memory region. This memory region is duplicated for every CPU in the system. When a CPU is hot-plugged at a later time, the region is duplicated again for it.

At link time, there is only one copy of these per-CPU blocks. During init, Linux duplicates this linker-produced block as many times as there are CPUs.

The __get_cpu_var(), per_cpu(), and __raw_get_cpu_var() macros, based on the CPU ID, get hold of the base of the memory region corresponding to that CPU and add the offset of the per-CPU variable to arrive at the CPU-specific memory block.
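
Conceptually, the computation looks like the following sketch; my_per_cpu() is a made-up name for illustration, not the kernel's actual macro, though __per_cpu_offset[] is a real kernel symbol:

/* Each CPU's copy of the per-CPU region lives at a fixed offset
 * from the linker-time copy; __per_cpu_offset[cpu] records that
 * offset, so the CPU-specific address is just the variable's
 * link-time address plus the offset. */
#define my_per_cpu(var, cpu) \
    (*(typeof(var) *)((char *)&(var) + __per_cpu_offset[cpu]))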

Since Linux groups all modules' per-CPU instances together for each CPU, updates to the variables in one CPU's block do not invalidate cache lines used by other cores. And since all modules' core-specific blocks are packed together per CPU, there is no need to cache-align each block individually, thus saving memory.

Some links on this topic that I referred to are given here and here, and of course the Linux source code.

I hope it is useful. Happy multicore programming.

