Saturday, July 30, 2011

Key-Value Store using Memcached - Microservers & Embedded Multicore processors

What is a Microserver?

The term is explained in multiple ways in the industry.  Intel defines it this way: "We define it as any server with a large number of nodes, usually with a single socket or multiple low-power processors and shared infrastructure."  The Serverwatch article "Microservers - small footprint, powerful punch" characterizes it as follows: "The difference between microservers and server blades is very similar to the difference between a netbook and a full-fledged laptop: A netbook is compact, consumes little power and is suitable for a limited range of light duties, while a laptop is bigger and more power hungry but can generally handle anything that is thrown at it."

My understanding is that these are like normal server chassis with multiple blades, except that the blades (some call them sleds in the context of Microservers) have low-power Multicore processors such as the Atom processor from Intel.  Another difference is that there is no control card in the Microserver chassis to control the sleds.  The sleds run independently of each other, connected by a 10G switch fabric.

Why Microservers?

According to the Serverwatch article, Microservers are intended to help companies become more energy efficient.  Not all workloads require the high processing power offered by high-end server blades.  Simple web servers that serve pages, file servers, memcached applications and Hadoop-style applications don't require high single-thread processing power.  They can be serviced using multiple Microserver sleds or multiple low-power cores on each sled.

One might argue that high-end blades can be virtualized into multiple virtual machines to handle simple workloads.  Even though that is one option, some believe that the costs associated with virtualization software are high.  With similar energy savings between virtualization and Microservers, some deployments seem to favor the Microserver approach because it avoids the cost and maintenance headaches associated with virtualization.  Since each Microserver sled can be used to host one particular application (similar to a virtual machine), some see less interruption in services when there is a hardware failure.  With the Microserver approach, only the faulty sled needs to be replaced, and during that time only the application running on that sled is affected.  When a blade is replaced, the services of all applications running in the different virtual machines on that blade may be interrupted.

Embedded Multicore Processors:  Non-x86 based Multicore processors are increasingly being considered for Microservers, mainly for the following reasons:
  • Low power.
  • Performance is achieved using many cores.
  • Integration of peripherals:  security co-processors, regex pattern matching acceleration, compression acceleration and, mainly, multiple built-in 10G/1G Ethernet MACs.
  • 64-bit processors.
  • JVM is increasingly available on Embedded Multicore processors too.
  • SIMD acceleration as in x86 processors.
It appears that non-x86 based Multicore processors are being considered for some workloads more than others.  Some of the workloads that are more suitable for Embedded Multicore processors are:
  • Key-Value stores such as memcached.
  • Hadoop based distributed processing
  • Front End Web Servers.
This article from Facebook, "Many-core-key-value store", mainly talks about memcached and how non-x86 Multicore processors are playing a role.  Though the article concentrates on a performance comparison between x86 processors and Tilera Multicore processors, I believe its argument for the advantage of Embedded Multicore processors over x86 is equally valid for other Embedded Multicore processors.

What is Memcached:

Excerpt from

"Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
Memcached is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages."

Some internal details of memcached:
  • Memcached not only listens for the requests coming from applications running on the same system, but also services the requests coming from remote systems via UDP and TCP.
  • Memcached maintains the data in a hash table.
  • It provides operations to insert, delete and query the entries in/from the hash table.
  • It appears that STORE operations are normally done over TCP connections and GET operations are normally done over UDP.
  • It does not use the hard disk to store the hash table and associated data.  It uses main memory, so there is no disk I/O bottleneck.
There are a large number of memcached users.   Please visit for more information. It appears that Facebook and Amazon each have thousands of servers running the memcached application alone.
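
The insert, delete and query operations listed above can be pictured with a small sketch. This is only a toy, assuming a single process and no networking, expiry or eviction; the class and method names (`TinyKeyValueStore`, `store`, `get`, `delete`) are made up for illustration and are not memcached's actual API.

```python
from typing import Dict, Optional


class TinyKeyValueStore:
    """Toy in-memory key-value store (hypothetical names, not memcached's API)."""

    def __init__(self) -> None:
        # A Python dict is a hash table that lives entirely in main memory,
        # so no disk access is involved in any operation below.
        self._table: Dict[str, bytes] = {}

    def store(self, key: str, value: bytes) -> None:
        """Insert a new entry or overwrite an existing one."""
        self._table[key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Return the value stored under key, or None on a miss."""
        return self._table.get(key)

    def delete(self, key: str) -> bool:
        """Remove key; return True if it was present."""
        return self._table.pop(key, None) is not None


cache = TinyKeyValueStore()
cache.store("user:42", b"alice")   # insert
hit = cache.get("user:42")         # query that hits
removed = cache.delete("user:42")  # delete
miss = cache.get("user:42")        # query that misses after the delete
```

A real memcached server puts a network protocol, memory management and eviction around this core idea; the sketch only shows the hash table operations themselves.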

Some key performance excerpts from the Facebook article:
  • Two-socket Intel Xeon L5520 running at 2.27 GHz (total of 8 cores):  200,000 GET transactions/sec.  Power consumption: 140 watts.
  • Tilera 64-core processor running at 800 MHz:  335,000 GET transactions/sec.  Power consumption: 90 watts.
It appears that some changes were made to Memcached to achieve the above performance on Tilera because of its 32-bit architecture.
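
To make the comparison easier to read, the two data points can be turned into GET transactions per second per watt. The sketch below uses only the four numbers quoted above; the function name `gets_per_watt` is my own, and the result is simple arithmetic rather than a figure taken from the Facebook article.

```python
def gets_per_watt(gets_per_second: float, watts: float) -> float:
    """Throughput normalized by power draw."""
    return gets_per_second / watts


# Figures as quoted in the excerpts above.
xeon = gets_per_watt(200_000, 140)    # about 1,429 GETs/sec per watt
tilera = gets_per_watt(335_000, 90)   # about 3,722 GETs/sec per watt

# How many times more work per watt the Tilera setup delivers.
advantage = tilera / xeon             # about 2.6

print(f"Xeon:   {xeon:8.0f} GETs/sec/W")
print(f"Tilera: {tilera:8.0f} GETs/sec/W")
print(f"Tilera advantage: {advantage:.1f}x")
```

By this measure the Tilera configuration delivers roughly 2.6 times the GET throughput per watt, assuming the quoted power figures were measured in a comparable way for both systems.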

My view is that Embedded Multicore processors from other vendors will do even better. I believe that one should look for the following features in Multicore processors to enable Memcached appliances.

  • Large number of cores/threads.
  • Fast hash generation (SIMD Engine would help).
  • Traffic distribution across multiple threads.
  • User-space bare-metal environment for UDP-based GET transactions, while TCP-based transactions continue to go through the traditional TCP/IP stack in the operating system.
  • Stashing and Prefetching facilities in hardware to utilize the cache effectively.
  • TCP termination offload functionality in hardware:  TCP/UDP GRO/TSO, transport-layer checksum verification/generation, IP fragmentation/reassembly, etc.
  • Software:  RCU library to avoid contention, hugetlbfs to reduce TLB misses, etc.
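
Two of the items above, fast hash generation and traffic distribution across threads, fit together: if the key of each request is hashed and the hash picks the worker, a given key always lands on the same core. Below is a minimal sketch of that idea, assuming an 8-worker system. FNV-1a is used only because it is short and well known; it is my choice for illustration, not the hash used by memcached or by the Facebook article, and `pick_worker` and `NUM_WORKERS` are hypothetical names.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result within 32 bits
    return h


def pick_worker(key: str, num_workers: int) -> int:
    """Map a key to a worker index; the same key always maps to the same worker."""
    return fnv1a_32(key.encode("utf-8")) % num_workers


NUM_WORKERS = 8  # assumed core/thread count, for illustration only
keys = ["user:1", "user:2", "session:abc", "page:/home"]
assignment = {key: pick_worker(key, NUM_WORKERS) for key in keys}

for key, worker in assignment.items():
    print(f"{key!r} -> worker {worker}")
```

Because each key is owned by one worker, workers rarely touch the same hash table entries, which reduces contention. On the processors discussed here, this classification step would ideally be done by the hardware before the packet reaches a core.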

If Hadoop and Hadoop-based applications are to work on these types of appliances, one should also look for JVM support on the processors.