Monday, September 29, 2008

What does Virtualization of Hardware IO devices and HW Accelerators mean?

Virtualization is becoming popular in server markets.  Multiple under-utilized physical servers are consolidated into one without losing the binary compatibility of operating systems and applications.

Hypervisors from VMware, along with Xen and KVM on Linux, are enabling this trend.  These hypervisors give the impression that there are multiple CPUs and multiple IO devices, which allows different operating systems and their applications to run without any changes on a single physical server.  Virtualization uses terms such as host and guest.  The host is the main operating system where the hypervisor runs.  Guests are virtual machines.  Multiple guests can be installed on a host.  As in the physical world, where each physical machine can run a different operating system and set of applications, guests can also be installed with different operating systems and associated applications.  No change is required in the operating systems or applications.  That is the real beauty.  You buy an operating system CD/DVD from Microsoft or Ubuntu and follow the same installation steps as on a physical machine to create a guest machine on the hypervisor.

The hypervisor virtualizes almost everything on the hardware - the CPU, IO devices such as Ethernet controllers, keyboard and mouse, and accelerators such as cryptography, pattern matching and compression accelerators.  The hypervisor creates multiple instances from one physical device, with one or more instances assigned to each guest.  It exposes these software instances to guests as pseudo physical devices, so existing drivers in the guest operating systems work with no additional software changes.

In the current generation of hypervisors, software drivers in the host deal with the hardware.  Guests don't interact with hardware directly.  The host hypervisor software virtualizes each device internally by creating multiple instances, and guest drivers deal with these instances.  Guests think they are talking to hardware directly, but they actually reach the hardware via the host operating system, by connecting to the virtual instances created by the host driver.

Direct IO/Accelerator connectivity
The traditional way of virtualizing hardware devices requires guests to go through the hypervisor.  This adds an extra copy of the data and additional context switching overhead.  To reduce the performance impact of this indirection through the hypervisor, both CPU and hardware device vendors are working on direct IO connectivity.  Intel and AMD seem to be enhancing their CPUs to allow guest operating systems to connect directly to hardware devices.

Intel and AMD x86 processors seem to be providing a feature called IOMMU.  Hardware IO devices traditionally work only with physical memory.  The IOMMU feature allows IO devices to take virtual memory addresses for buffers and commands.  Guests, and even user space processes in host operating systems such as Linux, deal with virtual address space.  CPUs translate virtual addresses to physical addresses dynamically using MMU translation tables.  The IOMMU is expected to do a similar translation for IO devices, so IO devices can be given buffers in the virtual address space of guests or user space processes.  Before reading or writing data at a virtual address, the IO device works with the IOMMU to translate it into a physical address and then performs the read/write operation on that physical address.

To avoid hypervisor intervention, another feature is also required: interrupt delivery directly from IO devices to the guests.  Interrupts are typically used by IO and accelerator devices to signal new input or the completion of a command issued earlier by the CPU.  CPU vendors are also providing this feature, in which PICs (Programmable Interrupt Controllers) are virtualized.

These two CPU features allow direct connectivity of IO devices.  Hypervisors have also been doing one more job: creating multiple instances of IO and accelerator devices.  To avoid hypervisor intervention, the instantiation of devices needs to be taken care of by the devices themselves.  Unfortunately, this can't be done at one central place, the CPU.  It needs to be taken care of by each individual IO/accelerator device.

Instantiation of IO and accelerator devices within the hardware needs to satisfy the same virtualization requirements that hypervisors satisfy today.  Some of them are given below.

Isolation :   Guests are independent of each other.  Isolation should exist similar to that between physical servers.
  • Failure isolation:  In the physical world, failure of a physical server or of an IO device within it does not affect the operation of another physical server.  Similarly, a guest failure should not affect the operation of other guests.  Since IO/accelerator devices are a common resource among the guests, the device must provide isolation such that if one instance of the IO device fails, the operation of other instances is not affected.  Any fault introduced by a guest, deliberately or unintentionally, should affect only the instance it owns and not others.  The fault should be corrected by resetting that instance, without resetting all instances or the entire device.
  • Performance isolation:  When applications run on different physical servers, all devices in a physical server are available exclusively to its operating system and applications.  In a shared environment, where multiple guests or user space processes work with the same IO/accelerator devices, the device needs to ensure that one guest does not hog the entire accelerator or IO device bandwidth.  IO/accelerator devices are expected to have some sort of scheduling to share the device bandwidth across instances.  One method is to schedule commands to accelerators using round-robin or weighted round-robin schedulers.  But this may not be sufficient in accelerator devices.  For example, a 2048 bit RSA sign operation takes roughly 4 times the crypto accelerator bandwidth of a 1024 bit RSA sign operation.  Consider a scenario where one guest is sending 2048 bit RSA sign operations to its instance of the acceleration device and another guest is using the accelerator for 1024 bit RSA sign operations.  If the device uses the round-robin method to schedule requests across instances, the guest sending 2048 bit operations takes more crypto accelerator bandwidth than the other guest.  This may be considered unfair.  It is also possible for a guest to deliberately send compute-heavy operations to deny crypto accelerator bandwidth to other guests, creating a denial of service condition.  Hardware schedulers in devices are expected to take into account the processing power used by each instance.
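The difference between plain round-robin and a scheduler that accounts for processing cost can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the class and function names are hypothetical, the cost units (4 for a 2048 bit sign, 1 for a 1024 bit sign) come from the rough ratio quoted above, and a real device would implement something like this in hardware.

```python
from collections import deque


class CostAwareScheduler:
    """Serves the instance that has consumed the least weighted cost so far."""

    def __init__(self, weights):
        self.weights = dict(weights)                      # instance -> share weight
        self.queues = {inst: deque() for inst in self.weights}
        self.used = {inst: 0.0 for inst in self.weights}  # weighted cost consumed

    def submit(self, instance, cost):
        self.queues[instance].append(cost)

    def next_command(self):
        ready = [inst for inst, q in self.queues.items() if q]
        if not ready:
            return None
        inst = min(ready, key=lambda i: self.used[i])
        cost = self.queues[inst].popleft()
        self.used[inst] += cost / self.weights[inst]
        return inst, cost


def cost_aware_share(budget):
    """Accelerator time each guest gets when cost is taken into account."""
    sched = CostAwareScheduler({"guest_a": 1, "guest_b": 1})
    for _ in range(10):
        sched.submit("guest_a", 4)   # assumed cost of a 2048 bit sign
    for _ in range(40):
        sched.submit("guest_b", 1)   # assumed cost of a 1024 bit sign
    consumed = {"guest_a": 0, "guest_b": 0}
    total = 0
    while total < budget:
        picked = sched.next_command()
        if picked is None:
            break
        inst, cost = picked
        consumed[inst] += cost
        total += cost
    return consumed


def round_robin_share(budget):
    """Accelerator time each guest gets with one command per guest per turn."""
    costs = {"guest_a": 4, "guest_b": 1}
    consumed = {inst: 0 for inst in costs}
    total = 0
    while total < budget:
        for inst, cost in costs.items():
            consumed[inst] += cost
            total += cost
    return consumed
```

With a budget of 40 cost units, round_robin_share gives guest_a 32 units and guest_b only 8, while cost_aware_share splits the same budget 20/20.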
Access Control:

As discussed, devices are expected to provide multiple instances.  The hypervisor would be expected to assign instances to guests.  Guests (typically in kernel space) can in turn assign their device instances to their user space processes.  User space processes in turn assign the device resources to different tenants.  In the public data center server market, tenant services are created using guests - each guest, or set of guests, corresponds to one tenant, and tenants don't share a guest.  In network services (load balancers, security devices), a host or guest can process traffic from multiple tenants.  In some systems each tenant is isolated to a user space process within a host or a guest, and in some systems multiple tenants might even share a user space process.  There are also hybrid systems where, based on the service level agreement with each customer, data center operators either create a VM (guest) for the tenant's traffic, create a user space process within a guest for it, or share a user space process with other similar kinds of tenants.  Of course, isolation granularity differs across these three types of tenant installations.  A VM per tenant provides the best isolation.  A dedicated user space process provides less isolation than a VM, but better isolation than a shared user space process.  With a shared user space process, all tenants' traffic is affected if the process dies.  But one thing that should be ensured in every case is that the hardware IO/accelerator device resources are not hogged by one tenant's traffic.

IO/accelerator hardware devices need to provide multiple instances to support multiple tenants.  Typically, each tenant is associated with one virtual instance of a peripheral device.  Assigning a virtual instance to a given combination of partition, user space process and tenant is the job of a trusted party.  The hypervisor is typically under the control of the network operator, and it is given the task of assigning virtual instances to guests.  The kernel space component of a host or guest internally assigns the device instances it owns to its user space daemons and to tenants within those daemons.

Ownership assignment as described above is one part of access control.  The other part is to ensure that guests and user space processes can see and access only their assigned resources, and not the resources of others.  If guests or user space processes are allowed even to look at instances they do not own, that is considered a security hole.

Another important aspect is that multiple guests or user space processes can be instantiated from the same image multiple times.  This is done for reasons such as load sharing (active-active).  Even though the same binary image is used to bring up multiple guest or user space instances, each of them should use different peripheral virtual instances.  Since it is the same binary image, this becomes relatively easy if the hypervisor exposes device virtual instances at the same virtual address in every guest.

Most peripheral devices can be memory mapped into the CPU address space.  Once the configuration space is mapped, accessing the configuration space of a peripheral can be done in the same way as accessing DDR.  Peripherals that support virtualization typically divide the configuration space into a global configuration space and configuration spaces for the individual virtual instances.  If a peripheral supports X virtual instances and each virtual instance configuration space is M bytes, then there would be (X * M) + G bytes of configuration space, where G is the size of the global configuration space of that peripheral device.  The global configuration space is normally meant for initializing the entire device.  Instance specific configuration space is meant for run time operations, such as sending commands and getting responses in the case of accelerator devices, and sending and receiving packets in the case of Ethernet controllers.  The global configuration space is normally controlled only by the hypervisor, whereas each virtual instance configuration space is accessible to the guest it is assigned to.
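As a small worked example of the (X * M) + G layout, the Python sketch below computes the total size and the byte window of one virtual instance.  The function names are hypothetical, and placing the global space first with the instance spaces packed contiguously after it is an assumption made only for illustration; real peripherals may lay this out differently.

```python
def total_config_space(num_instances, instance_size, global_size):
    """Total bytes of configuration space: (X * M) + G."""
    return num_instances * instance_size + global_size


def instance_window(index, num_instances, instance_size, global_size):
    """Start and end offset (end exclusive) of one instance's configuration space.

    Assumes the global space sits at offset 0 and the instances follow it.
    """
    if not 0 <= index < num_instances:
        raise IndexError("no such virtual instance")
    start = global_size + index * instance_size
    return start, start + instance_size


# Example: 8 instances of 4 Kbytes each plus a 4 Kbyte global space.
size = total_config_space(num_instances=8, instance_size=4096, global_size=4096)
window = instance_window(index=3, num_instances=8, instance_size=4096, global_size=4096)
```

Here size works out to 36864 bytes, and instance 3 occupies offsets 16384 up to 20480.  Making each instance window a multiple of the page size is what lets the hypervisor map exactly one window into a guest.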

To satisfy the access control requirements, the virtual instances assigned to each guest should start at the same address in that guest's address space.  Typically, TLBs are used for this purpose, as TLBs translate addresses and also enforce access restrictions.  A TLB entry takes a source address, which is the input to the translation, a destination address, which is the start of the translated range, and a size.  Since hypervisors are normally used to assign sets of virtual instance resources to guests, the hypervisor can take on the job of not only keeping track of free virtual instances and assigning them to guests, but also creating a page table entry (or multiple PTEs if the assigned virtual instances are not contiguous) to map the guest's address space to the physical space where the virtual instance configuration is mapped in the CPU address space.  What a guest treats as physical address space is actually virtual address space that the hypervisor set up for that guest.  When a guest kernel assigns virtual instances to its user space processes, it does the same thing: it creates page table entries that map the virtual instance space into user space virtual memory.  Once the PTEs are created, they are used by the system at run time in hierarchical fashion to fill the TLBs - a PTE lookup in the guest followed by a PTE lookup in the host.
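The hierarchical lookup can be illustrated with a toy model in Python.  Page tables are plain dictionaries from page number to page number, and all names and page numbers are made up for the example; this is a sketch of the idea, not of any particular MMU.

```python
PAGE_SIZE = 4096


def translate(page_table, address):
    """One translation stage: look up the page, keep the offset."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in page_table:
        # No PTE means no access: this is how the restriction is enforced.
        raise LookupError("address is not mapped")
    return page_table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(guest_table, host_table, guest_virtual):
    """PTE lookup in the guest followed by PTE lookup in the host."""
    guest_physical = translate(guest_table, guest_virtual)
    return translate(host_table, guest_physical)


# Both guests run the same image, so both see the device instance at the same
# guest virtual page (0x40) and the same guest "physical" page (0x100).
guest_table = {0x40: 0x100}
# The hypervisor maps that page to a different instance window for each guest.
host_table_guest1 = {0x100: 0x8001}
host_table_guest2 = {0x100: 0x8002}

addr1 = guest_virtual_to_host_physical(guest_table, host_table_guest1, 0x40010)
addr2 = guest_virtual_to_host_physical(guest_table, host_table_guest2, 0x40010)
```

The same guest address 0x40010 lands at 0x8001010 for the first guest and 0x8002010 for the second, so identical images end up on different virtual instances without knowing it.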

The section above mainly discusses how cores access the peripheral devices.  There is another aspect, where peripheral devices need to access the cores' memory.  As discussed earlier, CPUs with IOMMU capability (nested IOMMU capability) can give virtual addresses for buffers to the peripheral devices.  Virtual address spaces overlap with each other across guests and user space processes.  Due to this, different virtual instances of a device can see the same addresses for buffers.  To get the physical address for a virtual address, devices need to consult the IOMMU.  The IOMMU would require the guest ID, the user space process ID and the virtual address to return the associated physical address.  Hence hypervisors, while assigning virtual instances of devices to guests, also need to associate the guest ID with the virtual instance and inform the device through the global configuration space.  Similarly, the host or guest kernel, while assigning a virtual instance to user space, would associate the user space process ID with the virtual instance.  The kernel is expected either to set the user space process ID on the virtual instance using the instance specific configuration space, or to ask the hypervisor to set it using the global configuration space.
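A minimal Python sketch of why the IOMMU needs the guest ID and process ID along with the virtual address.  The Iommu and VirtualInstance classes are hypothetical stand-ins, and the IDs and page numbers are invented for the example.

```python
PAGE_SIZE = 4096


class Iommu:
    """Toy IOMMU holding one page table per (guest ID, process ID) pair."""

    def __init__(self):
        self.tables = {}

    def map_page(self, guest_id, process_id, virtual_page, physical_page):
        table = self.tables.setdefault((guest_id, process_id), {})
        table[virtual_page] = physical_page

    def translate(self, guest_id, process_id, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        table = self.tables.get((guest_id, process_id), {})
        if page not in table:
            raise LookupError("no mapping for this guest/process/address")
        return table[page] * PAGE_SIZE + offset


class VirtualInstance:
    """Device instance tagged with the IDs set by the hypervisor and kernel."""

    def __init__(self, iommu, guest_id, process_id):
        self.iommu = iommu
        self.guest_id = guest_id
        self.process_id = process_id

    def physical_address(self, virtual_address):
        return self.iommu.translate(self.guest_id, self.process_id, virtual_address)


iommu = Iommu()
# Two guests use the same virtual page for a buffer, backed by different memory.
iommu.map_page(guest_id=1, process_id=10, virtual_page=0x40, physical_page=0x900)
iommu.map_page(guest_id=2, process_id=10, virtual_page=0x40, physical_page=0xA00)
instance_a = VirtualInstance(iommu, guest_id=1, process_id=10)
instance_b = VirtualInstance(iommu, guest_id=2, process_id=10)
```

Both instances are handed the buffer address 0x40020, yet instance_a resolves it to 0x900020 and instance_b to 0xA00020, because each lookup carries the IDs attached to the instance.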

Since the IOMMU translations are used by devices, the buffers given to peripheral devices must never get swapped out.  Hence it is very important that applications lock the buffers in physical memory (operating systems provide facilities to lock memory in DDR) before providing them to peripheral devices.


Multicore processors and Multicore SoCs (processors + peripherals) are increasingly being used in network equipment.  Network equipment vendors selecting Multicore SoCs for virtualization would look for the following features in their next designs.
  •  Features expected in cores:
    • IOMMU Capability which looks at the PTE tables created by operating systems, similar to 'Hardware Page Table Walk' support in cores.
    • Nested IOMMU capability, as being discussed by Intel and AMD on the cores side, to allow hierarchical PTE tables so that peripheral devices can access buffers with virtual memory addresses in user space processes of guest operating systems.
    • Virtualization of PIC.
  • Features expected in peripheral devices
    • Ability to provide multiple virtual instances.
    • Ability to provide good isolation among the virtual instances - reset of a virtual instance should not affect the operation of other instances.  Performance isolation through scheduling mechanisms that consider the peripheral device bandwidth.  Peripherals are also expected to provide a facility for software to assign device bandwidth to virtual instances.
    • Ability for peripheral devices to take virtual memory addresses for buffers, and to work with the IOMMU to translate those virtual addresses.

Sunday, September 28, 2008

Look Aside Accelerators Versus In-Core Acceleration

The majority of Multicore processor vendors have implemented many acceleration functions as look-aside accelerators.  Some Multicore vendors such as Cavium and Intel implemented some functions in-core and some as look-aside accelerators.  Cryptography is one acceleration function which Cavium and Intel implemented in-core, while others have provided it as a look-aside accelerator.  Other acceleration functions such as compression/decompression and regular expression search are provided as look-aside accelerators by most of them, so I will not be discussing them here.  I will concentrate mostly on crypto acceleration.

How does software use them?

Software uses accelerators in two fashions - synchronously and asynchronously.  In synchronous usage, the software thread issues the request to the accelerator and waits for the result.  That is, it uses the accelerator like a software C function: by the time the function returns, the result is available.  In asynchronous usage, the software thread issues the request and goes off to do something else.  Once the result is ready, the thread picks it up and does the rest of the processing needed on it.  The result can be indicated to the thread in several ways.
  • If the thread is polling for events, then the result is read via polling.  Many Multicore processors provide a facility for software to listen for external events using one HW interface.  Incoming packets from Ethernet controllers, results from look-aside accelerators and other external events are delivered through this common interface.  Run-to-completion programs wait for an event on this common interface and take action based on the type of event received.
  • If the thread is not polling the HW interface, then events are notified to the thread via interrupts.
The point is that asynchronous usage of look-aside crypto acceleration allows the thread to do something else while the acceleration function is doing its job.
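The polling variant can be sketched as a toy run-to-completion loop in Python.  Everything here is a simulation with hypothetical names: the accelerator finishes each request a fixed number of loop iterations later, "crypto" is just a string reversal, and packets and accelerator results arrive on one common event queue.

```python
from collections import deque


class LookAsideAccelerator:
    """Toy accelerator that completes each request a fixed number of ticks later."""

    def __init__(self, events, latency=3):
        self.events = events      # common event queue shared with packet input
        self.latency = latency
        self.pending = deque()    # entries are [ticks_left, request_id, data]

    def submit(self, request_id, data):
        self.pending.append([self.latency, request_id, data])

    def tick(self):
        for job in self.pending:
            job[0] -= 1
        while self.pending and self.pending[0][0] <= 0:
            _, request_id, data = self.pending.popleft()
            # Stand-in for real crypto: reverse the payload.
            self.events.append(("crypto_done", request_id, data[::-1]))


def run_to_completion(packets):
    """Poll one common event queue and act on the type of each event."""
    events = deque(("packet", i, p) for i, p in enumerate(packets))
    accel = LookAsideAccelerator(events)
    log = []
    while events or accel.pending:
        if events:
            kind, ident, payload = events.popleft()
            if kind == "packet":
                accel.submit(ident, payload)   # issue the request, do not wait
                log.append(("submitted", ident))
            else:
                log.append(("finished", ident, payload))
        accel.tick()
    return log


log = run_to_completion(["abc", "de"])
```

In the resulting log the second packet is submitted before the first result comes back, which is exactly the overlap that synchronous usage cannot give.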

Look-aside accelerators can be used synchronously and asynchronously.  In-core accelerators are always used synchronously.

Which applications use the accelerators synchronously, and which asynchronously?

If the performance were the same, software would prefer to use any accelerator synchronously.  But we all know that asynchronous usage gives the best performance for per-packet processing applications.  Before going further, let us look at some of the applications that use cryptography.

Applications such as IPsec, MACSec and SRTP would use crypto accelerators asynchronously, as these applications work on a per-packet basis and are simple to change to take advantage of look-aside accelerators in asynchronous fashion.

SSL and DTLS based applications, in my view, would always use accelerators in synchronous fashion.  SSL and DTLS are libraries, not applications by themselves.  Applications such as HTTP, SMTP and POP3 servers and proxies use SSL internally.  To use accelerators in asynchronous fashion, changes are required not only in the SSL/DTLS library, but major modifications are also required in the applications themselves.  When the applications are developed in high level languages, making these changes becomes nearly impossible.

What are the advantages of look-aside accelerators?

Look-aside accelerators give software applications the option of using the accelerators in asynchronous fashion.  Any algorithm which takes a significant number of cycles is better used in look-aside fashion, and many crypto algorithms fall into this category.  When used asynchronously, the core is not idling while waiting for the result.  The core can do something else, thereby improving overall system throughput.  As described above, per-packet processing applications such as IPsec, MACSec, PDCP and SRTP work very well in this mode.  Many software applications which were using software crypto algorithms are being changed, or have already been changed, by their developers to take advantage of the asynchronous way of using look-aside accelerators.

Another big advantage is in the processing of high priority traffic in the poll mode model, where the core spins waiting for incoming events such as packets.  If the application uses the crypto accelerator in synchronous mode, the core does nothing else for a long time, on the order of tens of microseconds.  For a 1500 byte packet, it might take around 10 microseconds to encrypt/decrypt the data, and the core does nothing else during those 10 microseconds.  For a jumbo frame of 9K bytes, the core may be doing nothing for 60 microseconds.  If any high priority traffic such as PTP (Precision Time Protocol) arrives during this time, it does not get processed for up to 60 microseconds.  If crypto is used in asynchronous fashion, this high priority traffic is processed promptly, as the core does not babysit the crypto operation.  It also reduces jitter: irrespective of the size of the packet being encrypted/decrypted, high priority traffic is processed within the same time.
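The arithmetic behind these numbers can be written out in Python.  The throughput figure is derived from the rough estimate above (1500 bytes in about 10 microseconds) and scaled linearly, which is an assumption; the submit overhead of 1 microsecond in asynchronous mode is a made-up placeholder, not a measured value.

```python
# Derived from the estimate of about 10 microseconds for a 1500 byte packet.
BYTES_PER_MICROSECOND = 1500 / 10


def crypto_time_us(packet_bytes):
    """Estimated encrypt/decrypt time, assuming it scales linearly with size."""
    return packet_bytes / BYTES_PER_MICROSECOND


def worst_case_wait_us(packet_bytes, synchronous, submit_overhead_us=1.0):
    """Longest time a high priority packet waits behind one crypto operation.

    submit_overhead_us is a hypothetical cost of issuing an asynchronous
    request; it does not depend on the packet size.
    """
    if synchronous:
        return crypto_time_us(packet_bytes)
    return submit_overhead_us
```

Under these assumptions a 9000 byte jumbo frame blocks the core for 60 microseconds in synchronous mode, while in asynchronous mode the wait stays at the fixed submit overhead for any packet size, which is where the jitter improvement comes from.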

Having said that, SSL/TLS based applications use an SSL library (for example, OpenSSL).  Since the SSL library works in synchronous fashion, look-aside accelerators will be used synchronously.  High priority traffic is not an issue here, as SSL based applications typically run in user space in Linux and use software threads.  Even if one thread is waiting for a result from the crypto accelerator, other software threads are scheduled by the operating system, so the core is well utilized and high priority traffic can be processed by other software threads.  But many times the software thread waiting for the result has to wait in a tight loop, watching for a change in the value of a state variable that indicates the readiness of the result.  In those cases, other software threads may not be scheduled very well, and Multicore processors with hardware SMT (Simultaneous Multi-Threading) would work well.

What are the advantages of in-core Crypto?

In-core crypto is normally faster than look-aside crypto when used in synchronous fashion, so it is good for SSL kinds of applications.  But it is not good for per-packet processing applications.  In-core crypto has other advantages.  Since data is sent to the in-core accelerators via core registers (these are normally big registers, 128 bit or 256 bit), they work with virtual memory and hence are very suitable for user space applications.  Another advantage of in-core acceleration is that it can be used in virtual machines without worrying about whether the host operating system exposes drivers efficiently.  Since these are instructions like any other core instructions, they work without any additional effort.  Since in-core crypto is used just like software crypto, software applications can be ported to it without major changes to the SW architecture.

In-core crypto has some disadvantages too.  If the cores are divided across multiple applications and some applications don't require crypto acceleration, the in-core crypto accelerators of those cores sit unused and the resulting system throughput would be lower.  As indicated above, in-core crypto acceleration delays high priority work, as it can only be used in synchronous fashion.  For per-packet processing applications, the performance of in-core crypto would be lower than that of look-aside crypto used in asynchronous fashion.

My take:

Since Multicore processors are expected to be used in many scenarios, acceleration functions need to be designed in such a way that they can be used for both per-packet processing applications and stream based applications (such as SSL).  If look-aside crypto acceleration performance is made similar to in-core acceleration in synchronous mode, then look-aside acceleration is the preferable choice.  As I described in my earlier post, if look-aside accelerators are enhanced with a v-to-p kind of broker hardware module which understands the virtual address space, then I believe it is possible to bring the performance of synchronous look-aside acceleration close to that of in-core acceleration.

Saturday, September 27, 2008

Are Multicore processors Linux friendly - Part 2

As described in my earlier post, software architects of network equipment are moving their applications to, or developing them in, user space.  I believe Multicore vendors should keep this in mind while designing accelerators and Ethernet controllers.  Unfortunately, the current crop of Multicore devices is not well designed for user space based applications.

Let me first list some things that differ in user space programs compared to kernel level programming and bare-metal programming.

  • Virtual Memory :  Kernel level programs and Bare-metal programs work with physical memory.  User space programs in Linux work with virtual memory.  Virtual memory to physical memory mappings are created by the operating system on per process basis.  
    • Some information on virtual space and how it works:  Each process's virtual address space runs from 0 to 4 Gbytes in 32 bit operating systems.  When the process executes, the core, with the help of the MMU, gets the physical address for each virtual address in order to fetch instructions and to read or write data in memory.  The core maintains a cache of virtual to physical mappings in the Translation Lookaside Buffer (TLB).  Many processor types have 64 TLB entries per core.  When there is no matching TLB entry for a virtual address (a TLB miss), the operating system (or the hardware page table walker, on cores that have one) goes through the mapping table created for the process to look for a match.  If there is a match and a physical page is present, the mapping is added to one of the TLB entries.  If there is no physical page, a page fault occurs, which may require reading the data from secondary storage.  Since the operating system can schedule many processes on the same core, the TLB gets flushed upon a context switch and gets filled with the mappings of the new process as it executes.
  • Operating system scheduling :  In bare-metal programming, the application developer has full control over the cores and the logic that gets executed on each core.  User processes may get scheduled on different cores, and the same core might be used to run multiple processes.  Since operating systems give each process a time slice, a user program can be preempted at any time by the operating system.
  • Restart capability:  Bare-metal and kernel level programs get initialized when the system comes up.  If there is any issue in the program, the whole system gets restarted.  User space programs can be shut down and restarted gracefully, and restarted if they crash.  That is, user space programs should be capable of reinitializing themselves when they are started, even if the OS and the system as a whole have not been restarted.
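The TLB behaviour described in the virtual memory item above (hit, miss with refill from the process page table, page fault, flush on context switch) can be modelled in a few lines of Python.  The Tlb class is hypothetical, and evicting the least recently used entry when the TLB is full is an assumption; real cores use various replacement policies.

```python
from collections import OrderedDict


class Tlb:
    """Toy TLB caching virtual page -> physical page mappings."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, page_table, virtual_page):
        if virtual_page in self.entries:
            self.hits += 1
            self.entries.move_to_end(virtual_page)
            return self.entries[virtual_page]
        self.misses += 1
        if virtual_page not in page_table:
            # No physical page behind this address: page fault.
            raise LookupError("page fault")
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # assumed policy: evict LRU entry
        self.entries[virtual_page] = page_table[virtual_page]
        return self.entries[virtual_page]

    def flush(self):
        """Called on a process context switch."""
        self.entries.clear()


process_a = {0: 100, 1: 101}   # page table of one process
tlb = Tlb()
tlb.lookup(process_a, 0)       # miss, refilled from the page table
tlb.lookup(process_a, 0)       # hit
tlb.lookup(process_a, 1)       # miss
tlb.flush()                    # context switch to another process and back
tlb.lookup(process_a, 0)       # miss again, the entry was flushed
```

After this sequence the model has counted one hit and three misses, which shows why a process that was just switched back in pays for refilling its TLB entries.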

What are the challenges user programs have while programming with Multicore processors?

Virtual Memory:

Many Multicore processors are blind to virtual memory.  They only understand physical memory.  Due to that, application software needs to do extra work to get the best out of accelerators.  Some of the techniques followed by software are:
  • Use physical addresses for buffers that are sent to the HW accelerators and Ethernet controllers:  This is done by software allocating some memory in kernel space, which reserves the physical memory, and mapping it into user space using mmap().  Though it looks simple, it is not practical for all applications.
    • Applications need to be changed to use this memory mapped area for allocating buffers.  Memory allocation in some applications might go through multiple layers of software.  For example, some software might be written in high level languages such as Java, Perl or Python.  Mapping the allocation routines of these programs to the memory mapped area could be nearly impossible, or could require major software development.
    • Applications allocate memory for many reasons, and might call the same allocation function for all of them.  To take advantage of the memory mapped space, either the application needs to provide new memory allocation routines, or all allocations have to be satisfied from the mapped area.  The first case requires software changes, which could be significant if applications have multiple layers on top of the basic allocation library routines.  The second case may run into problems satisfying allocations.  Note that kernel space is limited and the amount of memory that can be mapped is not infinite.
  • Implement hardware drivers in kernel space and copy the data from virtual memory to physical memory and vice versa using the copy_from_user and copy_to_user routines.  This method obviously has a performance problem - memory copy overhead.  It also requires a driver in the kernel, which is not preferred by many software developers.  The preference would be to memory map the hardware and use it directly from user space software.
  • Use virtual space for all buffers.  Convert virtual addresses to physical addresses and provide the physical addresses to the HW accelerators.  Though this is better, it also has performance issues - locking the memory and getting the physical pages is not inexpensive.  First, a user space equivalent of get_user_pages() needs to go through the process specific page table to get the physical pages for the virtual pages.  Second, all physical pages need to be locked using the mlock() function, which is not very expensive but still takes a good number of CPU cycles.  On top of that, the result of get_user_pages() is a set of physical pages which may not be contiguous.  If the accelerators don't support scatter-gather buffers, this requires flattening the data, which is again very expensive.
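To make the non-contiguity problem in the last technique concrete, here is a Python sketch that turns one virtually contiguous buffer into a scatter-gather list of physical segments.  The function name and the dictionary page table are hypothetical; in a real system the page list would come from the kernel, not from a Python dict.

```python
PAGE_SIZE = 4096


def scatter_gather_list(page_table, virtual_address, length):
    """Split a virtual buffer into (physical address, length) segments.

    Neighbouring pieces are merged when they happen to be physically
    contiguous; otherwise each page boundary starts a new segment.
    """
    segments = []
    while length > 0:
        page, offset = divmod(virtual_address, PAGE_SIZE)
        chunk = min(PAGE_SIZE - offset, length)
        physical = page_table[page] * PAGE_SIZE + offset
        if segments and segments[-1][0] + segments[-1][1] == physical:
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            segments.append((physical, chunk))
        virtual_address += chunk
        length -= chunk
    return segments


# Virtual pages 0 and 1 are physically adjacent (7 and 8), page 2 is not.
page_table = {0: 7, 1: 8, 2: 3}
segments = scatter_gather_list(page_table, virtual_address=4000, length=5000)
```

The 5000 byte buffer ends up as two segments, (32672, 4192) and (12288, 808).  An accelerator without scatter-gather support could not consume this list directly, and software would have to copy the data into one contiguous buffer first.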
I expect that at least future versions of Multicore processors will be able to understand virtual memory, so that software does not have to do anything special.  I expect the Multicore processor to take virtual addresses for both input and output buffers, in addition to acceleration specific input, convert the virtual addresses to physical addresses, and get the accelerator function executed using the accelerator engines.  There is a possibility that the virtual to physical mapping might change while the V-to-P conversion or the acceleration algorithm is running.  To avoid problems from this, the V-to-P conversion module should do two things.

A. Input side:

1.  Copy the input data to internal RAM of the accelerator.
2.  While copying, if it finds that there is no TLB entry, it returns an error to the software.
3.  Software accesses the virtual address, which fills the TLB entry, and then issues the command again to continue from where it left off.

B. Returning the result:

1.  Copy the output from internal RAM to the virtual memory using TLB.
2.  If it finds there is no mapping, it returns to the software if the software is using the accelerator synchronously.  If the software is not waiting, it generates an interrupt, which wakes the software to issue the command to read the result.
3.  Software accesses the virtual address, which creates the TLB entry, and issues the command again.  The v-to-p conversion module resumes writing from the place it left off.
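The stop-and-resume handshake in the steps above can be sketched as follows in Python.  This models only the progress of the copy, not the data itself; the function names are hypothetical, the TLB is a plain dictionary, and "software touching the page" is modelled by copying the mapping from the page table into that dictionary.

```python
PAGE_SIZE = 4096


def copy_input(tlb, virtual_address, length, done=0):
    """Model of the v-to-p copy: returns (bytes copied so far, missing page).

    The copy resumes at offset done and stops at the first page that has
    no TLB entry; the missing page is None when the copy is complete.
    """
    while done < length:
        page, offset = divmod(virtual_address + done, PAGE_SIZE)
        if page not in tlb:
            return done, page                 # error returned to software
        done += min(PAGE_SIZE - offset, length - done)
    return done, None


def issue_with_retry(tlb, page_table, virtual_address, length):
    """Software side: touch the missing page, then reissue the command."""
    done, missing = copy_input(tlb, virtual_address, length)
    retries = 0
    while missing is not None:
        tlb[missing] = page_table[missing]    # access refills the TLB entry
        retries += 1
        done, missing = copy_input(tlb, virtual_address, length, done)
    return done, retries


page_table = {0: 10, 1: 11, 2: 12}
tlb = {0: 10}                                 # only the first page is cached
result = issue_with_retry(tlb, page_table, virtual_address=100, length=9000)
```

In this run the copy stops twice, once at each uncached page, and completes all 9000 bytes on the third attempt without starting over.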

Most of the time, the TLB entry for the input buffer will already be there.  It is possible that a TLB entry has been lost by the time the accelerator finishes its job.  Even so, this scheme avoids quite a bit of the processing that software has to do as indicated above.

Since the v-to-p hardware module needs access to the TLB, it needs to be part of the core.  So the commands to issue a request to the accelerator and to read the result should be more like instructions.

Note that the TLB gets overwritten every time there is a process context switch.  While doing the memory copy operation, specifically for the output buffer, the v-to-p module is expected to always check the process ID for which the TLB is currently valid against the process ID it received as part of the command.

The v-to-p module is also expected to do the memory copy from DDR to internal SRAM or high speed RAM for input data, and from SRAM to DDR for output data.  This copy is expected to be very fast and not add to the latency.  Hence the v-to-p module is expected to work with the core caches for coherency and performance reasons.

Operating System Scheduling:

Same core may be used by the Operating system to run multiple independent processes.  All these processes may be required to use accelerators by memory mapping them onto the user space. Since these are independent applications, there should not be any expectation that these processes would use accelerators in cooperative fashion.

Current Multicore processors do not allow multiple user processes running on a core to access the hardware accelerators independently. Because of this, software puts drivers in kernel space to access the hardware. Each user process talks to the driver, which in turn services the requests and returns results to the appropriate user process. Again, this has performance issues resulting from copying buffers from user space to kernel space and vice versa.

Limitation of Multicore processors today stems from two things:

A.  Multiple virtual instances can't be created in the acceleration engines.

B.  The number of interface points to the HW accelerator is limited to one per core.

If those two limitations are mitigated, then multiple user processes can use hardware accelerators by directly mapping them into user space. 


A user process can be restarted at any time, either after a graceful shutdown or after a crash. When there is a crash, some buffers may still be pending in the accelerator device. When a user process crashes or is shut down gracefully, Linux frees the physical pages associated with the process, and those pages can then be used by anybody else. If the accelerator keeps working on a physical page thinking that it is still owned by the user process that issued the command, it may write data that corrupts some other process.

I believe that if the solution specified in the 'Virtual memory' section above is followed, there is no issue, as accelerators work on internal SRAM. Since the v-to-p module always checks the TLB while writing into memory, it should not corrupt any memory.

I hope my ramblings are making sense.

Hardware Traffic Management Functionality - What do system designers need to look for?

There are many chip vendors coming out with inbuilt traffic management solutions, mainly on traffic shaping and scheduling.   I happened to review some of them as part of my job at Intoto.

Traffic Management in hardware is typically the last step in egress packet processing. Packets scheduled by traffic management go on the wire. That is, once a packet is submitted by software to the Hardware Traffic Management, it is not seen by the software again.

In theory, anything that is done in hardware is good, as it saves precious CPU cycles for something else. In practice, the hardware traffic management feature set is limited and may not be useful in Enterprise markets. As I understand it, these HW traffic management solutions are designed for particular market segments such as Metro Ethernet.

If you are designing network equipment, you may like to look for the following functionality in hardware traffic management (HTM).

Traffic Classification: Many HTMs don't support classification in hardware. They expect classification to be done by software running on the cores, which then enqueues the packet to the right hardware queue. HTMs typically do only the shaping and scheduling portion of the Traffic Management function on the queues. I can understand that there are multiple ways packets can be classified, and hence leaving it to software provides good flexibility for system designers. As I understand it, though, the number of cycles software takes to do the classification is the same as or more than scheduling and shaping put together. At the least, I would expect HTMs to do some simple classification based on L2 and L3 header fields, thereby offloading the classification task from the cores.

Traffic Shaping and Scheduling: Traffic shaping is the basic functionality of HTMs, so it is supported well. The token bucket algorithm is the common algorithm used by Traffic Managers to do traffic shaping. Some systems require dual rate traffic shaping (Committed Information Rate and Excess Information Rate), so system designers may need to look for a 'Dual Rate' feature. In addition, one needs to know how the EIR is used by the HTM. At least in the systems I am familiar with, EIR should be treated like CIR, but EIR shaping is expected to be done only when the CIR requirements of all queues are met. If there is more bandwidth available after meeting the CIR requirements of the queues, then the EIRs of the queues need to be considered. If the EIRs of all the queues are met and there is still bandwidth available to send more packets, then round-robin or some other queue selection mechanism can be adopted for scheduling the traffic. One should look for these features to ensure the link is not under-utilized.
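A minimal sketch of the CIR-first, then EIR, then beyond-EIR selection order described above. It assumes one token bucket per rate per queue, packets represented only by their size in bytes, and a per-queue `beyond_eir` flag; the names `ShapedQueue` and `pick_next` are made up for the sketch, and no particular HTM is claimed to work this way.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ShapedQueue:
    name: str
    cir_bps: float              # committed rate, bits per second (0 disables)
    eir_bps: float              # excess rate, bits per second (0 disables)
    burst_bytes: float = 3000.0 # assumed bucket depth for both buckets
    cir_tokens: float = 0.0     # bytes currently available in the CIR bucket
    eir_tokens: float = 0.0     # bytes currently available in the EIR bucket
    beyond_eir: bool = True     # may this queue send once both are exhausted?
    packets: deque = field(default_factory=deque)  # packet sizes in bytes

    def refill(self, elapsed_s):
        """Add tokens for elapsed_s seconds, capped at the bucket depth."""
        self.cir_tokens = min(self.burst_bytes,
                              self.cir_tokens + self.cir_bps / 8 * elapsed_s)
        self.eir_tokens = min(self.burst_bytes,
                              self.eir_tokens + self.eir_bps / 8 * elapsed_s)


def pick_next(queues):
    """Dequeue one packet and return (queue name, band), or None if idle."""
    backlog = [q for q in queues if q.packets]
    # Serve every queue's CIR before any queue's EIR.
    for band in ("cir", "eir"):
        for q in backlog:
            size = q.packets[0]
            tokens = getattr(q, band + "_tokens")
            if tokens >= size:
                setattr(q, band + "_tokens", tokens - size)
                q.packets.popleft()
                return q.name, band
    # Spare bandwidth: first willing queue (a real scheduler would rotate).
    for q in backlog:
        if q.beyond_eir:
            q.packets.popleft()
            return q.name, "beyond"
    return None
```

Setting `eir_bps` to 0 gives a CIR-only queue, and clearing `beyond_eir` keeps a queue out of the final pass, so the link is only left idle when no queue is allowed to use the spare bandwidth.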

Another feature to look for in HTMs is the flexibility to enable only CIR, only EIR, or both on a queue, along with a flag indicating whether the queue should participate in scheduling beyond EIR.

From a scheduling perspective, different systems require different scheduling algorithms. Systems require scheduling across both strict priority queues and non-strict priority queues. For non-strict priority queues, a scheduling algorithm applies. Common scheduling algorithms expected are: DRR, CRR, WFQ, RR, WRR.
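As one example from that list, here is a small sketch of DRR (Deficit Round Robin) over non-strict priority queues. Packets are represented by their size in bytes, and the function name `drr` and its arguments are chosen for the sketch only.

```python
from collections import deque


def drr(queues, quanta, rounds):
    """Run Deficit Round Robin for a number of rounds.

    queues: dict of queue name -> deque of packet sizes in bytes
    quanta: dict of queue name -> bytes of credit added per round
    Returns the list of (queue name, packet size) in transmit order.
    """
    deficit = {name: 0 for name in queues}
    sent = []
    for _ in range(rounds):
        for name, q in queues.items():
            if not q:
                deficit[name] = 0  # idle queues do not accumulate credit
                continue
            deficit[name] += quanta[name]
            # Send as many head packets as the accumulated credit covers.
            while q and q[0] <= deficit[name]:
                size = q.popleft()
                deficit[name] -= size
                sent.append((name, size))
            if not q:
                deficit[name] = 0
    return sent
```

A queue with a larger quantum gets a proportionally larger share of bytes, which is why DRR is often used where byte-fair weighting matters more than packet-fair weighting.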

Traffic Marking: Marking is one important feature of Traffic Management functionality. Marking a packet lets the upstream router make allow/deny decisions if it observes congestion. Different markings need to be applied based on the classification criteria and on the rate band the packet used (within CIR, between CIR and EIR, or beyond EIR). Marking based on classification criteria is normally expected to be done by software if classification is done in software. But marking based on the shaping rate needs to be done by the HTM, as software does not get hold of the packets after traffic management. Typically the marking is limited to the DSCP value of the IP header or the CoS field of the 802.1Q header. I see some HTM systems expecting the software to point to the DSCP and CoS locations along with the packet, so that they can place the right values in those locations.

So, the features to look for on the marking side are: the ability for the HTM to mark packets, and the ability for software to configure marking values (DSCP, CoS or both) on a per queue basis based on the shaping band used to schedule the packet (CIR, EIR or beyond EIR).
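A per-queue marking table keyed by shaping band could look like the sketch below. The table values (AF11/AF12/AF13 for DSCP and arbitrary CoS values) and the names `MARKING` and `mark` are assumptions for illustration. The bit positions are the standard ones: DSCP is the upper six bits of the IPv4 ToS byte, and CoS is the upper three bits of the 802.1Q TCI. The sketch does not update the IP header checksum, which a real implementation has to do after changing the ToS byte.

```python
# Hypothetical marking configuration for one queue, per shaping band.
MARKING = {
    "cir":    {"dscp": 10, "cos": 3},  # AF11
    "eir":    {"dscp": 12, "cos": 2},  # AF12
    "beyond": {"dscp": 14, "cos": 1},  # AF13
}


def mark(tos, tci, band, table=MARKING):
    """Return (new ToS byte, new 802.1Q TCI) for a packet sent in `band`.

    tos: current IPv4 ToS byte (DSCP in bits 7..2, ECN in bits 1..0)
    tci: current 16-bit 802.1Q TCI (CoS in bits 15..13)
    """
    entry = table[band]
    new_tos = (entry["dscp"] << 2) | (tos & 0x03)   # keep the ECN bits
    new_tci = (entry["cos"] << 13) | (tci & 0x1FFF)  # keep DEI and VLAN ID
    return new_tos, new_tci
```

Because only the HTM knows which band a packet was finally scheduled in, this lookup has to happen in hardware at dequeue time; software's part is limited to filling in the table.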

Congestion Management: Shaping and scheduling always lead to queue management. Queue management is required to limit the queue size and to ensure that packet latency does not go up, as in some cases it is better to drop packets than to send them late. Different traffic types require different congestion management. Typical congestion management algorithms expected are: Tail Drop, RED (Random Early Detection), WRED (Weighted Random Early Detection), and head of queue drop. In addition, the queue size, in terms of the number of packets a queue can hold, is expected to be configurable. When there is congestion, there will be packet drops. How the packets are dropped and how the drops are reported to software can affect performance. Software needs to know which packets were dropped from the queues so that it can free them. To reduce the number of interrupts going to software for dropped packets, the HTM is expected to implement interrupt coalescing. It is also expected to maintain a list of the packets that were dropped, so that software can read the whole batch in one go when the interrupt occurs.
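Two pieces of this can be sketched briefly: the RED/WRED drop probability curve, and a drop list that raises one interrupt per batch rather than one per packet. The names `wred_drop_probability` and `DropLog` are hypothetical, the thresholds are in packets, and count-based coalescing is just one possible policy (real hardware would typically add a timer as well).

```python
def wred_drop_probability(avg_depth, min_th, max_th, max_p):
    """Drop probability for one WRED profile.

    Below min_th nothing is dropped, at or above max_th everything is,
    and in between the probability rises linearly up to max_p.
    WRED uses a separate (min_th, max_th, max_p) profile per traffic class.
    """
    if avg_depth < min_th:
        return 0.0
    if avg_depth >= max_th:
        return 1.0
    return max_p * (avg_depth - min_th) / (max_th - min_th)


class DropLog:
    """Collects dropped buffers and coalesces the interrupts for them."""

    def __init__(self, coalesce_count):
        self.coalesce_count = coalesce_count
        self.pending = []    # buffers software still has to free
        self.interrupts = 0  # interrupts raised so far

    def record(self, buffer_id):
        """Called by the (modelled) hardware when it drops a packet."""
        self.pending.append(buffer_id)
        if len(self.pending) == self.coalesce_count:
            self.interrupts += 1  # one interrupt for the whole batch

    def drain(self):
        """Called by software in the interrupt handler: take the batch."""
        batch, self.pending = self.pending, []
        return batch
```

With a coalescing count of 4, eight drops cost two interrupts instead of eight, and each `drain` call hands software the full batch of buffers to free.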

Hierarchical Shaping and Scheduling: This feature is critical for many deployments. Shaping parameters are normally configured at the physical port or logical port level based on the effective bandwidth. On the port, there could be multiple subscribers (for example, server farms of different customers in a DC, different divisions in an Enterprise, subscribers of a Metro Ethernet provider, etc.), with each subscriber having its own shaping (CIR, EIR). Different traffic flows within each subscriber might also have shaping. For example, MEF does not rule out shaping traffic based on a set of DSCP values, beyond shaping at the port and VLAN level. In the Enterprise, shaping might need to be done based on IP addresses or transport protocol services. Scheduling is always associated with each shaper. That is, when the physical/logical port level shaper finds some bandwidth to send traffic, it invokes the scheduler. In the above example, the port/logical port level scheduler tries to get hold of a packet from one of the subscribers. If the subscriber is itself another QoS instance (having its own shaping and scheduling), the selected subscriber's scheduler is called to get hold of the packet. The subscriber scheduler might in turn call another internal scheduler to select among the queues having traffic. Since one scheduler calls another scheduler, it is called hierarchical. Typically, 8 hierarchical levels are expected to be supported. As a system designer, you need to check for this feature and ensure that the number of levels supported by the HTM suits your requirement.

It is also necessary to ensure that hierarchical shaping and scheduling does not involve queues at each level. If it did, the performance of the HTM would be bad. It is okay for the HTM to expect software to put the packet in the innermost level. Note that not all queues are necessarily in the innermost level. A first or intermediate level QoS instance might contain either a further QoS instance or a queue itself. If the scheduler selects a QoS instance, the next inner level scheduler is called; otherwise, it takes the packet from the selected queue. Classification in software is expected to put the packets in the appropriate queues as per the scheduler levels.
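The "scheduler calls scheduler" structure can be sketched as a tree in which packets live only in leaf queues and each QoS instance merely chooses a child. Shaping is left out to keep the sketch short, plain round robin stands in for whatever algorithm each level really uses, and the class names `Leaf` and `QosInstance` are invented for the example.

```python
from collections import deque


class Leaf:
    """An actual queue; the only place packets are stored."""

    def __init__(self, name, packets=()):
        self.name = name
        self.packets = deque(packets)

    def has_traffic(self):
        return bool(self.packets)

    def dequeue(self):
        return self.packets.popleft()


class QosInstance:
    """A scheduling level whose children are queues or further QoS instances."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)
        self.next_index = 0  # round-robin position at this level

    def has_traffic(self):
        return any(child.has_traffic() for child in self.children)

    def dequeue(self):
        count = len(self.children)
        for step in range(count):
            index = (self.next_index + step) % count
            child = self.children[index]
            if child.has_traffic():
                self.next_index = (index + 1) % count
                # A QoS instance recurses; a leaf returns its head packet.
                return child.dequeue()
        raise IndexError(f"{self.name}: no traffic")
```

For instance, a port holding a subscriber instance (with its own voice and data queues) next to a plain management queue mixes both cases: the port scheduler either recurses into the subscriber or takes a packet straight from the management queue.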

Support for Multiple Ports: Some Enterprise Edge devices have multiple interfaces (multiple WAN interfaces), and each interface might require its own QoS traffic treatment. As a system designer, one thing to look for in an HTM is how many ports/logical ports can be configured with QoS. Logical ports are also required, as some systems use an inbuilt switch to expose multiple physical interfaces from one 10G interface connected to the CPU board. VLANs are used internally to communicate between the 10G interface and the switch. For all practical purposes, this scenario should be treated as if there were multiple interfaces on the CPU card itself.

Support for LAG: The LAG feature aggregates multiple links together, with each link having its own shaping parameters. As a system designer, you may like to ensure that traffic marked for a link (port or logical port) by the software LAG or hardware LAG is scheduled on that same link. Also, you may like to ensure that the schedule operation is invoked by the shaper of the appropriate link. That is, the HTM should not have a shaper on the LAG instance; it should implement the shaper on each link.

At no time should the HTM drop packets because of LAG operations. It is okay for some mis-ordering to happen in some LAG operations, but no packet should be dropped. Two LAG operations, rebalancing and adding/deleting a link, should not give rise to packet drops from the HTM queues. One may like to ensure this.