Sunday, December 19, 2010

User space Packet processing applications - Execution Engine

If you plan to port your data plane network processing application from Linux kernel space to user space, the first question is how to do so with minimal changes to your software. The Execution Engine is the first piece of infrastructure to think about.

Many kernel based networking applications don't create their own threads. They work in the context of threads that are already present in the kernel. For example, packet processing applications such as firewalls, IDS/IPS and IPsec VPNs run in the context of the kernel TCP/IP stack. This is mainly done for performance reasons. Additional threads for these applications would result in extra context switches. They would also turn the processing into a pipeline, with packets handed over from one execution context to another. Pipelining requires queues, and the enqueue and dequeue operations cost core cycles. It also leads to flow control issues when one thread's processing takes longer than the others'.

Essentially, the Linux kernel itself provides execution contexts, and networking packet processing applications work within them. The Linux TCP/IP stack itself runs in softirq context. SoftIRQ processing in a normal kernel runs from both IRQ context and the softirqd threads; I would say 90% of the time SoftIRQ processing happens in IRQ context. In a PREEMPT_RT patched kernel, network IRQs are mapped to IRQ threads. In any case, the context in which network packet processing applications run is unknown to the applications. Since Linux kernel execution contexts are per core, there are fewer shared data structures and hence fewer locking requirements. The kernel and underlying hardware also provide mechanisms to balance traffic across execution contexts at flow granularity. Where the hardware does not provide any load balancing functionality, IRQs are dedicated to different execution contexts. If there are 4 Ethernet devices and 2 cores (hence two execution contexts), the four receive interrupts of the Ethernet controllers are assigned equally to the two execution contexts. If the traffic from all four Ethernet devices is the same or similar, then both cores are used effectively.
If the Execution Engine in user space packet processing applications is made similar to the kernel execution contexts, application porting becomes simpler. The Execution Engine (EE) can be considered part of the infrastructure that enables DP processing in user space. EE design should consider the following.
  • There could be multiple data plane processing applications in user space. Each DP daemon may be assigned to run on a fixed set of cores - a core mask may be provided at startup time.
  • If a DP daemon is not associated with any core mask, it should assume that it may be run by all cores. That is, it should behave as if the core mask has all core bits set.
  • A set of cores may be dedicated to the daemon. That is, those cores don't do anything else other than the DP processing of that daemon. This facility is typically used on Multicore processors that provide hardware poll. Recent generations of Multicore processors can deliver incoming events and acceleration results through a single portal (also called a station or work group). Since the core is dedicated, no software polling is required; hardware polling can be used if the underlying hardware supports it and the core(s) are dedicated to the process. A sketch of thread creation over a core mask follows this list.
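
As a concrete illustration of the core mask handling above, here is a minimal C sketch that creates one pinned EE thread per core in the mask. It is only a sketch under assumptions: ee_start_threads, ee_thread_main and MAX_CORES are hypothetical names of mine, and it uses the Linux-specific pthread_attr_setaffinity_np().

```c
#define _GNU_SOURCE          /* for pthread_attr_setaffinity_np() */
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_CORES 64         /* assumes a 64-bit core mask */

static void *ee_thread_main(void *arg)
{
    long core = (long)arg;
    /* The per-core EE loop (software or hardware poll) would run here. */
    printf("EE thread pinned to core %ld\n", core);
    return NULL;
}

int ee_start_threads(uint64_t core_mask)
{
    /* No mask given: assume all online cores, i.e. all bits set. */
    if (core_mask == 0) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        core_mask = (n >= MAX_CORES) ? ~0ULL : ((1ULL << n) - 1);
    }

    for (long core = 0; core < MAX_CORES; core++) {
        if (!(core_mask & (1ULL << core)))
            continue;

        pthread_attr_t attr;
        cpu_set_t cpus;
        pthread_t tid;

        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);
        pthread_attr_init(&attr);
        /* Pin the thread to one core to mirror a per-core kernel context. */
        pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
        if (pthread_create(&tid, &attr, ee_thread_main, (void *)core) != 0)
            return -1;
        pthread_attr_destroy(&attr);
    }
    return 0;
}
```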
It appears that setting the number of threads in the process equal to the number of cores assigned to it provides the best performance. It also gives the closest match to kernel execution contexts. With the above background, I believe the EE needs the following capabilities:
  • Provide the capability to assign the core mask.
  • Provide the capability to indicate whether the cores are dedicated or merely assigned.
  • If no core mask is provided, it should read the number of cores in the system and assume that all of them are in the core mask.
  • Provide the capability to use software poll or hardware poll. Hardware poll should be validated and accepted only if the underlying hardware supports it and only if the cores are dedicated. Hardware polling has performance advantages as it requires no interrupt generation or interrupt processing, but the disadvantage is that the core cannot be used for anything else. One should weigh the options based on the application's performance requirements (see the configuration sketch after this list).
  • The API it exposes to its applications should be the same whether the execution engine uses software poll (such as epoll()) or hardware poll.
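
To make that concrete, here is a hedged sketch of what the EE configuration and its validation might look like. struct ee_config, the EE_POLL_* values and hw_poll_supported() are illustrative names I am assuming, not an established API.

```c
#include <stdbool.h>
#include <stdint.h>

enum ee_poll_type { EE_POLL_SOFTWARE, EE_POLL_HARDWARE };

struct ee_config {
    uint64_t          core_mask;  /* 0 means "assume all cores"          */
    bool              dedicated;  /* cores do only this daemon's DP work */
    enum ee_poll_type poll_type;
};

/* Assumed helper: a real EE would probe the SoC for a hardware portal. */
static bool hw_poll_supported(void)
{
    return false;
}

int ee_validate_config(const struct ee_config *cfg)
{
    /* Hardware poll is accepted only if the hardware supports it and
     * the cores are dedicated to this daemon; otherwise reject it. */
    if (cfg->poll_type == EE_POLL_HARDWARE &&
        (!cfg->dedicated || !hw_poll_supported()))
        return -1;
    return 0;
}
```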
These capabilities are typically provided through command line parameters or a configuration file. The EE is expected to create as many threads as there are cores in the core mask. Each thread should provide the following functionality:
  • Software timer functionality - the EE should provide the following:
    • Creation and deletion of timer blocks
    • Starting, stopping and restarting timers in each timer block.
    • Each application can create one or more timer blocks and use a large number of timers in each timer block.
    • As in the kernel, the EE should provide cascaded timer wheels for each timer block (see the timer API sketch after this list).
  • A facility for applications to register/deregister event receivers and deliver events:
    • API function (EEGetPollType()) to return the type of poll - software or hardware. EE applications would use this function to decide whether to use file descriptors (such as UIO and other application oriented kernel drivers) for software poll or to use hardware facilities for hardware poll.
    • Register Event Receiver: EE applications use this function to register the FD, the READ/WRITE/ERROR interest, the associated callback function pointer and a callback argument.
    • Deregister Event Receiver: EE applications call this to deregister an event receiver that was registered using the 'Register' function.
    • Variations of the above API will need to be provided by the EE if it is configured with hardware poll. Since each hardware device has its own way of representing this, there may be as many API sets as device types. Some part of each EE application has hardware specific initialization code and calls the right set of API functions.
    • Note that one thread handles multiple devices (multiple file descriptors in the case of software poll). Every time epoll() returns, the callback functions of the ready FDs need to be called. These functions, provided by the EE applications, are expected to pick up packets in the case of Ethernet controllers, acceleration results in the case of acceleration devices, or other kinds of events from other device types. From the UIO discussion, if the applications use UIO based interrupts to wake up the thread, it is expected that all events are read from the device to reduce the number of wakeups (UIO coalescing capability). Some EE applications might read a lot of events, and for each event they call their own application function, which can be heavy too. Due to this, if multiple FDs are ready, one EE application may take a very long time before it returns to the EE. That results in unfair assignment of the EE thread to the other FDs that are also ready, which might even cause packet drops or increased jitter if high priority traffic is pending on other devices. To ensure fairness, EE applications are expected to process only a 'quota' of events before returning to the EE. 'Quota' is a tunable parameter and can be different for different types of devices. The EE is expected to call back the same application after it runs through all other ready file descriptors. Until all ready EE applications indicate that they have nothing left to process, the EE should not call epoll() again. For the EE to know whether to call the application callbacks again, there should be a small protocol: each EE application indicates, when returning from the callback, whether it processed all events, and the EE uses this to decide whether to call the application again before going back to epoll(). Note that epoll() is an expensive call, so it is better if all pending events are processed in a fair fashion before epoll() is called again. In a hardware poll based configuration this facility is not required, as polling is not expensive and Multicore SoCs implementing a single portal for all events have fairness capabilities built in. Since the callback function definition is the same for both software and hardware poll based systems, the quota parameters exist but are not used by hardware poll based systems. Sketches of the timer and event receiver APIs follow.
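The timer-block functionality above might be exposed through an API along these lines. This is only a hypothetical header: every name and signature here is my assumption, and a real implementation would back each block with cascaded timer wheels, as the kernel does.

```c
#include <stdint.h>

typedef void (*ee_timer_cb)(void *arg);

struct ee_timer_block;   /* opaque: one set of cascaded wheels per block */
struct ee_timer;         /* opaque: one timer within a block             */

struct ee_timer_block *ee_timer_block_create(uint32_t max_timers);
void ee_timer_block_delete(struct ee_timer_block *blk);

struct ee_timer *ee_timer_start(struct ee_timer_block *blk,
                                uint64_t timeout_ms,
                                ee_timer_cb cb, void *arg);
int ee_timer_stop(struct ee_timer *t);
int ee_timer_restart(struct ee_timer *t, uint64_t timeout_ms);

/* Hooks for the EE loop: milliseconds until the earliest expiry in the
 * block (used as the epoll() timeout), and expiry processing. */
uint64_t ee_timer_block_next_expiry(const struct ee_timer_block *blk);
void ee_timer_block_run_expired(struct ee_timer_block *blk);
```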
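
Similarly, the event receiver registration and the quota protocol could look roughly like this. Again a sketch: apart from EEGetPollType(), which is named above, all identifiers are assumptions of mine.

```c
#include <stdbool.h>
#include <stdint.h>

enum ee_event_mask { EE_EV_READ = 1, EE_EV_WRITE = 2, EE_EV_ERROR = 4 };

/* Callback contract: process at most 'quota' events, then return true
 * if events are still pending so the EE calls back before re-polling.
 * Hardware poll based systems ignore the quota parameter. */
typedef bool (*ee_event_cb)(int fd, uint32_t events,
                            uint32_t quota, void *arg);

enum ee_poll_type { EE_POLL_SOFTWARE, EE_POLL_HARDWARE };
enum ee_poll_type EEGetPollType(void);

int ee_register_event_receiver(int fd, uint32_t event_mask,
                               uint32_t quota, ee_event_cb cb, void *arg);
int ee_deregister_event_receiver(int fd);
```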
The EE should initialize itself before creating threads. Once the threads are created, it should load the shared libraries of its applications one by one. For each EE application library, it is expected to get hold of the address of the 'init' symbol and call the init() function, which initializes that module. Each EE packet processing thread is then expected to call another EE application function; let us call this symbol 'EEAppContextInit()'. EEAppContextInit() is expected to do the real per-context initialization, such as opening UIO and other character device drivers and registering with the software poll() system.
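
A minimal sketch of that loading sequence, using dlopen()/dlsym() and the symbol names described above; ee_load_app and the typedefs are hypothetical names.

```c
#include <dlfcn.h>
#include <stdio.h>

typedef int  (*ee_app_init_fn)(void);
typedef int  (*ee_app_ctx_init_fn)(void);  /* called once per EE thread */
typedef void (*ee_app_finish_fn)(void);    /* called at shutdown        */

int ee_load_app(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    /* Resolve and run the module-level init() first. */
    ee_app_init_fn init = (ee_app_init_fn)dlsym(handle, "init");
    if (init == NULL || init() != 0) {
        dlclose(handle);
        return -1;
    }

    /* Each EE thread later resolves EEAppContextInit() the same way and
     * calls it to open UIO/character devices and register its FDs;
     * EEAppFinish() is resolved likewise for graceful shutdown. */
    ee_app_ctx_init_fn ctx_init =
        (ee_app_ctx_init_fn)dlsym(handle, "EEAppContextInit");
    (void)ctx_init;
    return 0;
}
```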

The EE also needs to call the 'EEAppFinish()' function when the EE is killed. EEAppFinish() does whatever graceful shutdown is required for its module.

Each thread, if it uses software poll, calls epoll() on all the FDs registered so far. Polling happens in a while() loop. epoll() takes a timeout argument, which should be the nearest timer expiry across all software timer blocks. When epoll() returns, the thread should call the software timer library to process any expired timers. In the case of hardware poll, the hardware specific poll function would be used instead.
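Putting the pieces together, the per-thread software-poll loop could look like the sketch below. It builds on the hypothetical timer and event receiver APIs from the earlier sketches and assumes that registration stored a receiver pointer in epoll_event.data.ptr.

```c
#include <sys/epoll.h>
#include <stdbool.h>
#include <stdint.h>

#define EE_MAX_EVENTS 64

/* Assumed from the earlier sketches. */
typedef bool (*ee_event_cb)(int fd, uint32_t events,
                            uint32_t quota, void *arg);
struct ee_receiver { int fd; uint32_t quota; ee_event_cb cb; void *arg; };
struct ee_timer_block;
extern uint64_t ee_timer_block_next_expiry(const struct ee_timer_block *blk);
extern void ee_timer_block_run_expired(struct ee_timer_block *blk);

void ee_thread_loop(int epfd, struct ee_timer_block *blk)
{
    struct epoll_event evs[EE_MAX_EVENTS];
    bool more[EE_MAX_EVENTS];

    for (;;) {
        /* Sleep no longer than the earliest pending software timer. */
        int timeout_ms = (int)ee_timer_block_next_expiry(blk);
        int n = epoll_wait(epfd, evs, EE_MAX_EVENTS, timeout_ms);

        ee_timer_block_run_expired(blk);
        if (n <= 0)          /* timeout or EINTR: just poll again */
            continue;

        /* Round-robin over ready FDs: each callback handles at most its
         * quota, so one busy device cannot starve the others, and we go
         * back to the expensive epoll_wait() only when every receiver
         * reports that it has nothing left to process. */
        for (int i = 0; i < n; i++)
            more[i] = true;
        bool pending = true;
        while (pending) {
            pending = false;
            for (int i = 0; i < n; i++) {
                if (!more[i])
                    continue;
                struct ee_receiver *r = evs[i].data.ptr;
                more[i] = r->cb(r->fd, evs[i].events, r->quota, r->arg);
                pending = pending || more[i];
            }
        }
    }
}
```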

In addition to the above functions, the EE typically needs to emulate other capabilities that the Linux kernel provides for its applications, such as a memory pool library, a packet descriptor buffer library, and mutual exclusion facilities using futexes and user space RCU.
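
As one example of such a facility, a memory pool can be as simple as a fixed-size buffer pool over a free list. The sketch below is single-threaded for brevity (a real EE pool would be per-core or futex-protected), and all names are illustrative.

```c
#include <stdlib.h>
#include <stddef.h>

struct ee_pool_node { struct ee_pool_node *next; };

struct ee_pool {
    struct ee_pool_node *free_list;
    void *base;
};

struct ee_pool *ee_pool_create(size_t obj_size, size_t count)
{
    if (obj_size < sizeof(struct ee_pool_node))
        obj_size = sizeof(struct ee_pool_node);

    struct ee_pool *p = malloc(sizeof(*p));
    if (!p) return NULL;
    p->base = malloc(obj_size * count);
    if (!p->base) { free(p); return NULL; }

    /* Thread every object onto the free list. */
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        struct ee_pool_node *n =
            (struct ee_pool_node *)((char *)p->base + i * obj_size);
        n->next = p->free_list;
        p->free_list = n;
    }
    return p;
}

void *ee_pool_alloc(struct ee_pool *p)
{
    struct ee_pool_node *n = p->free_list;
    if (n) p->free_list = n->next;
    return n;
}

void ee_pool_free(struct ee_pool *p, void *obj)
{
    struct ee_pool_node *n = obj;
    n->next = p->free_list;
    p->free_list = n;
}
```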

With the above capabilities, the EE can jump-start application development. This kind of infrastructure requires changes in only a very few places in the applications.

Hope it helps.
