Monday, September 29, 2008

What does Virtualization of Hardware IO devices and HW Accelerators mean?

Virtualization is becoming popular in server markets.  Multiple under-utilized physical servers are consolidated into one without losing the binary compatibility of operating systems and applications.

Hypervisors from VMware, and open source hypervisors such as Xen and KVM on Linux, are enabling this trend.  A hypervisor gives the impression that there are multiple CPUs and multiple IO devices, which allows different operating systems and their applications to run unchanged on a single physical server.  Virtualization uses the terms host and guest.  The host is the main operating system where the hypervisor runs.  Guests are the virtual machines; multiple guests can be installed on a host.  Just as each physical machine can run a different operating system and set of applications, each guest can be installed with a different operating system and its applications, and no change is required in the operating systems or applications.  That is the real beauty:  you buy an operating system CD/DVD from Microsoft or Ubuntu and follow the same installation steps as on a physical machine to create a guest.

The hypervisor virtualizes almost everything on the hardware - the CPU, IO devices such as Ethernet controllers, keyboards and mice, and accelerators such as cryptography accelerators, pattern matching accelerators, compression accelerators etc.  The hypervisor creates multiple instances from one physical device, with one or more instances assigned to guests.  The hypervisor exposes these software instances as pseudo physical devices to the guests, so existing drivers in the guest operating systems work with no additional software changes.

The current generation of hypervisors deal with the hardware using software drivers; guests don't interact with hardware directly.  The host hypervisor software internally virtualizes the hardware by creating multiple instances, and guest drivers deal with these instances.  Guests think they are talking to hardware directly, but they actually reach the hardware via the host operating system by connecting to the virtual instances created by the host driver.

Direct IO/Accelerator connectivity
The traditional way of virtualizing hardware devices requires guests to go through the hypervisor.  Due to this, there is an additional copy of data and also additional context switching overhead.  To reduce the performance impact of this indirection through hypervisors, direct IO connectivity is being pursued by both CPU and hardware device vendors.  Intel and AMD appear to be enhancing their CPUs to allow direct connectivity to hardware devices from guest operating systems.

Intel/AMD x86 processors seem to be providing a feature called IOMMU in their CPUs.  Hardware IO devices traditionally work only with physical memory.  The IOMMU feature allows IO devices to take virtual memory addresses for buffers and commands.  Guests, and even user space processes in host operating systems such as Linux, deal with virtual address space.  CPUs translate virtual addresses to physical addresses dynamically using MMU translation tables; the IOMMU is expected to do a similar translation for IO devices.  IO devices can be given buffers in the virtual address space of guests or user space processes.  Before reading or writing data at those virtual addresses, the IO device works with the IOMMU to translate them to physical addresses and then performs the read/write operation on the physical address.
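To make the idea concrete, below is a minimal conceptual sketch, in C, of the kind of page-table walk an IOMMU performs when a device presents a virtual buffer address.  It is deliberately simplified to two levels; real IOMMUs (Intel VT-d, AMD-Vi) use deeper multi-level tables and per-device context entries, and the structure and constants here are my own assumptions, not any vendor's format.

#include <stdint.h>
#include <stdbool.h>

/* Simplified two-level translation table, one per guest or user space
 * process.  For this sketch, table memory is assumed to be directly
 * addressable by the walker. */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define ENTRIES      512                /* entries per table level */
#define PTE_PRESENT  0x1ull

typedef uint64_t pte_t;

struct io_page_table {
    pte_t *l1;                          /* top-level table */
};

/* Translate a device-visible virtual address into a physical address.
 * Returns false if the mapping is absent; a real IOMMU would raise a
 * translation fault to the hypervisor in that case. */
static bool iommu_translate(const struct io_page_table *tbl,
                            uint64_t vaddr, uint64_t *paddr)
{
    uint64_t l1_idx = (vaddr >> (PAGE_SHIFT + 9)) & (ENTRIES - 1);
    uint64_t l2_idx = (vaddr >> PAGE_SHIFT) & (ENTRIES - 1);

    pte_t l1e = tbl->l1[l1_idx];
    if (!(l1e & PTE_PRESENT))
        return false;

    pte_t *l2 = (pte_t *)(uintptr_t)(l1e & ~(uint64_t)(PAGE_SIZE - 1));
    pte_t l2e = l2[l2_idx];
    if (!(l2e & PTE_PRESENT))
        return false;

    *paddr = (l2e & ~(uint64_t)(PAGE_SIZE - 1)) | (vaddr & (PAGE_SIZE - 1));
    return true;
}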

To avoid hypervisor intervention, another feature is also required:  delivery of interrupts to the guests directly from IO devices.  Interrupts are typically used by IO and accelerator devices to signal new input or completion of a command issued earlier by the CPU.  CPU vendors are also providing this feature by virtualizing the PIC (Programmable Interrupt Controller).

These two CPU features allow direct connectivity of IO devices.  Hypervisors were also doing one more job before:  creating multiple instances of IO and accelerator devices.  To avoid hypervisor intervention, the instantiation of devices needs to be taken care of by the devices themselves.  Unfortunately, this can't be done at a central place, the CPU.  It needs to be handled by each individual IO/accelerator device.

Instantiation of IO & accelerator devices within the hardware needs to satisfy the same virtualization requirements that hypervisors satisfy today.  Some of them are given below.

Isolation:   Each guest is independent of the others.  Isolation should exist just as it does between physical servers.
  • Failure isolation:  In the physical world, failure of a physical server or of the IO devices within it does not affect the operation of another physical server.  Similarly, a guest failure should not affect the operation of other guests.  Since IO/accelerator devices are a resource shared among the guests, the device must provide isolation such that if one instance of the device fails, it does not affect the operation of other instances.  Any fault introduced by a guest, deliberately or unintentionally, should only affect the instance it owns, not others.  A fault should be correctable by resetting that instance alone, without resetting all instances or the entire device.
  • Performance isolation:  When applications run on different physical servers, all devices in a physical server are available exclusively to that server's operating system and applications.  In a shared environment where multiple guests or user space processes work with the same IO/accelerator device, the device needs to ensure that one guest does not hog the entire accelerator or IO device bandwidth.  IO/accelerator devices are expected to have some sort of scheduling to share the device bandwidth.  One method is to schedule commands to accelerators using round-robin or weighted round-robin schedulers, but this may not be sufficient for accelerator devices.  For example, a 2048 bit RSA sign operation takes roughly 4 times the crypto accelerator bandwidth of a 1024 bit RSA sign operation.  Consider a scenario where one guest sends 2048 bit RSA sign operations to its instance of the acceleration device and another guest uses its instance for 1024 bit RSA sign operations.  If the device schedules requests across instances in simple round-robin fashion, the guest sending 2048 bit operations takes more crypto accelerator bandwidth than the other guest, which may be considered unfair.  It is also possible that a guest deliberately sends compute-heavy operations to deny crypto accelerator bandwidth to other guests, creating a denial of service condition.  Device hardware schedulers are therefore expected to take into account the processing power consumed by each instance; a simplified sketch of such a cost-aware scheduler follows this list.
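The sketch below, in C, shows one way such cost-aware scheduling could work:  a deficit round-robin style scheduler where each virtual instance is charged by the estimated compute cost of a request (e.g. an RSA-2048 sign charged about 4x an RSA-1024 sign).  The cost numbers, quantum, and structure names are illustrative assumptions, not taken from any real device.

#include <stdint.h>
#include <stddef.h>

#define NUM_INSTANCES  4
#define QUANTUM        4000            /* cost credits added per round */

struct request {
    struct request *next;
    uint32_t cost;                     /* e.g. RSA-1024 = 1000, RSA-2048 = 4000 */
};

struct vinstance {
    struct request *queue;             /* pending requests, FIFO */
    int32_t         credits;           /* deficit counter */
};

static struct vinstance inst[NUM_INSTANCES];

/* Hand a request to the accelerator engine (stub for this sketch). */
static void dispatch(int id, struct request *req) { (void)id; (void)req; }

/* One scheduling round:  every busy instance receives QUANTUM credits and
 * may dispatch requests while it has both work and credits, so a guest
 * issuing heavy operations cannot exceed its share of device bandwidth. */
static void schedule_round(void)
{
    for (int i = 0; i < NUM_INSTANCES; i++) {
        struct vinstance *v = &inst[i];
        if (v->queue == NULL) {
            v->credits = 0;            /* idle instances don't accumulate credit */
            continue;
        }
        v->credits += QUANTUM;
        while (v->queue && v->credits >= (int32_t)v->queue->cost) {
            struct request *req = v->queue;
            v->queue = req->next;
            v->credits -= (int32_t)req->cost;
            dispatch(i, req);
        }
    }
}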
Access Control:

As discussed, devices are expected to provide multiple instances.  The hypervisor is expected to assign instances to guests.  Guests (typically in kernel space) can in turn assign their device instances to their user space processes, and user space processes can in turn assign device resources to different tenants.  In the public data center server market, tenant services are created using guests - each tenant corresponds to one guest or a set of guests, and tenants don't share a guest.  In network services (load balancers, security devices), a host or guest can process traffic from multiple tenants.  In some systems each tenant is isolated to a user space process within a host or a guest, and in some systems multiple tenants might even share a user space process.  And there are hybrid systems:  based on the service level agreement with their customers, data center operators either create a VM (guest) for a tenant's traffic, create a user space process within a guest for processing a tenant's traffic, or share a user space process among similar kinds of tenants.  Of course, the isolation granularity differs across these three types of tenant installations.  A VM per tenant provides the best isolation; a dedicated user space process provides better isolation than a shared user space process; and with a shared user space process, all tenant traffic is affected if the process dies.  But one thing that must be ensured is that the hardware IO/accelerator device resources are not hogged by one tenant's traffic.

IO/Accelerator hardware devices need to provide multiple instances to support multiple tenants.  Typically, each tenant is associated with one virtual instance of the peripheral device.  Associating a partition, user space process, and tenant combination with a virtual instance is the job of a trusted party.  The hypervisor is typically under the control of the network operator and is assigned the task of assigning virtual instances to guests.  The kernel space component of hosts or guests internally assigns its owned device instances to its user space daemons, and to tenants within those user space daemons.

Ownership assignment as described above is one part of access control.  The other part of access control is to ensure that guests and user space processes can only see and access their assigned resources, and not other instances' resources.  If guests or user space processes are allowed even to look at instances not owned by them, it is considered a security hole.

Another important aspect is that multiple guests or user space processes can be instantiated from the same image multiple times.  This is done for reasons such as load sharing (active-active).  Even though the same binary image is used to bring up multiple guest/user-space instances, each instance must not use the same peripheral virtual instance.  Since the same binary image is used, this becomes relatively easy if device virtual instances are exposed at the same virtual address by the hypervisor.

Most peripheral devices are memory mappable by the CPU.  Once the configuration space is mapped, accessing the peripheral's configuration space can be done in the same way as accessing DDR.  Peripherals supporting virtualization typically divide the configuration space into a global configuration space and configuration space for multiple virtual instances.  If a peripheral supports X virtual instances and each virtual instance's configuration space is M bytes, then there are (X * M) + G bytes of configuration space, where G is the size of the global configuration space of that peripheral device.  The global configuration space is normally meant to initialize the entire device.  Instance-specific configuration space is meant for run time operations such as sending commands and receiving responses in the case of accelerator devices, and sending & receiving packets in the case of Ethernet controllers.  The global configuration space is normally only allowed to be controlled by the hypervisor, whereas each virtual instance's configuration space is controlled by the guest it is assigned to.
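As an illustration of this (X * M) + G layout, here is one possible way such a configuration space could be described in C.  The register names, sizes, and ordering are hypothetical; an actual device defines its own map in its data sheet.

#include <stdint.h>

#define NUM_VINSTANCES   16            /* X */
#define VINSTANCE_SIZE   0x1000        /* M: 4 KB per virtual instance */

struct vinstance_regs {                /* per-instance space, guest-accessible */
    volatile uint64_t cmd_ring_base;   /* command/response ring for this instance */
    volatile uint32_t cmd_ring_head;
    volatile uint32_t cmd_ring_tail;
    volatile uint32_t irq_status;
    volatile uint32_t reset;           /* resets only this instance */
    uint8_t           pad[VINSTANCE_SIZE - 24];
};

struct global_regs {                   /* G bytes, hypervisor-only */
    volatile uint32_t device_enable;
    volatile uint32_t num_instances;
    volatile uint32_t instance_owner[NUM_VINSTANCES]; /* guest ID per instance */
    uint8_t           pad[0x1000 - 8 - 4 * NUM_VINSTANCES];
};

struct device_config_space {
    struct global_regs    global;                      /* offset 0 */
    struct vinstance_regs instance[NUM_VINSTANCES];    /* offset G onwards */
};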

To satisfy the access control requirements, the virtual instances assigned to each guest should start at the same address.  Typically, TLBs are used for this purpose, since TLBs both translate addresses and enforce access restrictions.  A TLB entry takes a source address (the input to the translation), a destination address (the start of the translated range), and a size.  Since hypervisors are normally used to assign a set of virtual instance resources to guests, the hypervisor can take on the job of not only keeping track of free virtual instances and assigning them to guests, but also creating the page table entry (or multiple PTEs, in case the assigned virtual instances are not contiguous) to map the guest's address space to the physical space where the virtual instances' configuration is mapped in the CPU address space.  The hypervisor's virtual address for a given guest is treated as physical space by that guest.  When the guest kernel assigns virtual instances to its user space processes, it does the same thing:  it creates page table entries that map the virtual instance space into the user space virtual memory.  Once the PTEs are created, the TLBs are used dynamically by the system at run time in hierarchical fashion - a PTE lookup in the guest followed by a PTE lookup in the host.
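The last step of that chain, mapping an owned instance into a user space process, might look roughly like the sketch below.  The device node, the driver behaviour behind it, and the register offset are all assumptions for illustration; the point is that every process maps offset 0 and the kernel driver backs it with whichever physical instance that process owns, so identical binaries get identical virtual layouts.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define VINSTANCE_SIZE 0x1000

int main(void)
{
    int fd = open("/dev/accel_vinst", O_RDWR);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* The (assumed) driver's mmap handler remaps offset 0 to the physical
     * address of the instance owned by this process and rejects attempts
     * to map beyond that instance, enforcing access control. */
    volatile uint32_t *regs = mmap(NULL, VINSTANCE_SIZE,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Touch an instance-local register (offset is illustrative only). */
    regs[0] = 0x1;                               /* e.g. enable the instance */

    munmap((void *)regs, VINSTANCE_SIZE);
    close(fd);
    return 0;
}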

The description in the section above mainly discusses how cores access the peripheral devices.  There is another aspect:  peripheral devices need to access the cores' memory.  As discussed earlier, CPUs having IOMMU capability (nested IOMMU capability) can give virtual addresses for buffers to the peripheral devices.  Virtual address spaces overlap with each other across guests and user space processes, so virtual instances of devices would see the same addresses for buffers.  To get the physical address for a virtual address, devices need to consult the IOMMU.  The IOMMU requires the guest ID, user space process ID, and virtual address to return the associated physical address.  Hence hypervisors, while assigning virtual instances of devices to guests, also need to associate the guest ID with the virtual instance and inform the device through the global configuration space.  Similarly, the host or guest kernel, while assigning a virtual instance to user space, associates the additional user space process ID with the virtual instance.  The kernel is expected to assign the user space process ID to the virtual instance through the virtual-instance-specific configuration space, or it could ask the hypervisor to assign it through the global configuration space.
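The two assignment steps could be as simple as the sketch below.  The register names (instance_owner, pasid) are invented for illustration; real devices express this binding through their own configuration space or through standard mechanisms such as PCIe PASID.

#include <stdint.h>

struct accel_global_regs {
    volatile uint32_t instance_owner[16];  /* guest ID per virtual instance */
};

struct accel_vinst_regs {
    volatile uint32_t pasid;               /* user space process ID for IOMMU lookups */
};

/* Hypervisor:  bind virtual instance 'idx' to a guest so the IOMMU can pick
 * that guest's translation tables when the device issues DMA for it. */
static void hv_assign_instance(struct accel_global_regs *g, int idx,
                               uint32_t guest_id)
{
    g->instance_owner[idx] = guest_id;
}

/* Guest kernel:  record the user space process ID on the instance it hands
 * to that process, so device DMA is translated in that process's address space. */
static void guest_assign_pasid(struct accel_vinst_regs *vi, uint32_t process_id)
{
    vi->pasid = process_id;
}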

Since IOMMUs are accessed by devices, it is necessary that the buffers given to peripheral devices never get swapped out.  Hence it is very important that applications lock the buffers in physical memory (operating systems provide facilities to lock memory in DDR) before providing them to peripheral devices.
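On Linux, the locking facility is mlock().  A minimal sketch:  pin the buffer before handing its address to the device, and unpin it only after the device is done.  How the address is actually passed to the device is driver-specific and omitted here.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (64 * 1024)

int main(void)
{
    void *buf = malloc(BUF_SIZE);
    if (!buf) return 1;

    if (mlock(buf, BUF_SIZE) != 0) {       /* lock the pages into RAM */
        perror("mlock");
        free(buf);
        return 1;
    }

    memset(buf, 0, BUF_SIZE);              /* buffer is now safe to hand to the device */
    /* ... submit buf to the accelerator's command ring here ... */

    munlock(buf, BUF_SIZE);
    free(buf);
    return 0;
}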

Summary

Multicore processors and multicore SoCs (processors + peripherals) are increasingly being used in network equipment.  Network equipment vendors, while selecting multicore SoCs for virtualization, would look for the following features in their next designs.
  •  Features expected in cores:
    • IOMMU capability that looks at the PTE tables created by operating systems, similar to 'Hardware Page Table Walk' support in cores.
    • Nested IOMMU capability, as being discussed by Intel and AMD on the core side, to allow hierarchical PTE tables so that peripheral devices can access buffers by virtual memory address in user space processes of guest operating systems.
    • Virtualization of the PIC.
  • Features expected in peripheral devices
    • Ability to provide multiple virtual instances.
    • Ability to provide good isolation among the virtual instances - reset of one virtual instance should not affect the operation of other instances, and performance isolation should be provided by scheduling mechanisms that account for the peripheral device bandwidth.  Peripherals are also expected to provide a facility for software to assign device bandwidth to virtual instances.
    • Ability for peripheral devices to take virtual memory addresses for buffers, and to communicate with the IOMMU to work with those virtual addresses.
