Random technical bits and thoughts: PCIe Endpoint developer considerations

See my earlier post, "PCIe End Point Developer Techniques", for some of the considerations a developer should keep in mind. This post covers some further expectations of an Endpoint.

Interrupt Coalescing:

An Endpoint interrupts the host whenever it puts data in the receive descriptor ring(s). It also interrupts the host whenever it transmits a packet out of the transmit descriptor ring(s). Interrupt coalescing is normally used to reduce the number of interruptions to the host. When the host configures interrupt coalescing on the Endpoint, the Endpoint interrupts the host only after filling a configured number of descriptors, or after a configurable amount of time has passed since the previous interrupt. The time parameter is required to ensure that the host gets the interrupt even if the configured number of descriptors is never filled. This parameter should be chosen carefully, because it bounds the latency that coalescing adds to packets. The host can also configure interrupt coalescing to reduce the number of interrupts raised when descriptors are removed from the descriptor rings.
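The count-or-timeout rule above can be sketched in C as follows. This is only an illustration, not the behaviour of any particular device: the structure, its field names and both helper functions are hypothetical, and the timestamp is assumed to come from some free-running microsecond counter on the Endpoint.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-interrupt coalescing state; all names are illustrative. */
struct coalesce_state {
    uint32_t max_descs;      /* host-configured descriptor threshold */
    uint32_t max_usecs;      /* host-configured timeout in microseconds */
    uint32_t pending_descs;  /* descriptors filled since the last interrupt */
    uint64_t last_irq_usecs; /* time at which the last interrupt was raised */
};

/* Record one newly filled descriptor. */
static void coalesce_note_desc(struct coalesce_state *s)
{
    s->pending_descs++;
}

/* Return true if the Endpoint should raise the interrupt at time now_usecs.
 * With nothing pending there is nothing to report, so no interrupt is raised. */
static bool coalesce_should_fire(struct coalesce_state *s, uint64_t now_usecs)
{
    if (s->pending_descs == 0)
        return false;

    if (s->pending_descs >= s->max_descs ||
        now_usecs - s->last_irq_usecs >= s->max_usecs) {
        s->pending_descs = 0;
        s->last_irq_usecs = now_usecs;
        return true;
    }
    return false;
}
```

The timeout here is measured from the previous interrupt, as described above. A design that measures it from the first pending descriptor instead would be an equally reasonable variation.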

As an Endpoint developer, always make sure to let the host configure the interrupt coalescing parameters for every interrupt that the Endpoint generates.

Now that processors are used in Endpoints, similar functionality is required in the reverse direction. The Endpoint should be able to indicate the interrupt numbers it assigns to the host. In addition, the Endpoint can expose interrupt coalescing parameters in configuration registers for each interrupt. The host reads them and uses this information to reduce the number of interrupts it raises on the Endpoint. Since these are registers, the host can also write different configuration values.

As an Endpoint developer, make sure to provide this configuration to the host. If you are developing the host driver, make sure to use the interrupt coalescing functionality.

Multiple Receive Descriptor Rings and Multiple Transmit Descriptor Rings:

In intelligent NICs, multiple receive descriptor rings are used to pass different kinds of traffic. For example, packets related to configuration and management are given the highest priority, above all data traffic. Within data traffic, voice and video traffic is normally given higher priority. Different descriptor rings are used to pass traffic of different priorities. The Endpoint, upon receiving packets from the wire, classifies them and puts them in the appropriate rings. The host driver is expected to select the descriptor ring to dequeue a packet from, and it uses a scheduler across the rings to make that selection. To enable scheduling, rings are given weights. Some important rings may also be placed in strict priority. The scheduler in the host is expected to select the strict priority rings first if they have packets. If there are no packets in the strict priority rings, a ring is selected from the weighted ring set for dequeuing the packet.
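A minimal sketch of such a scheduler is shown below. It assumes a simple credit-based weighted round robin for the weighted set, which is one common choice rather than something prescribed here. The structure, its fields and pick_ring() are hypothetical names, and every weighted ring is assumed to have a weight of at least 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scheduler view of one descriptor ring. */
struct ring_sched {
    int      strict;   /* non-zero: strict-priority ring */
    uint32_t weight;   /* packets allowed per round (weighted rings only) */
    uint32_t credit;   /* packets still allowed in the current round */
    uint32_t pending;  /* packets waiting in the ring */
};

/* Choose the ring to dequeue one packet from and account for that packet.
 * Returns the ring index, or -1 if every ring is empty. */
static int pick_ring(struct ring_sched r[], size_t n)
{
    /* 1. Strict-priority rings always win, lowest index first. */
    for (size_t i = 0; i < n; i++) {
        if (r[i].strict && r[i].pending > 0) {
            r[i].pending--;
            return (int)i;
        }
    }

    /* 2. Weighted rings: serve a ring that still has credit. If none
     *    does, start a new round by refilling credits and look once more. */
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < n; i++) {
            if (!r[i].strict && r[i].pending > 0 && r[i].credit > 0) {
                r[i].credit--;
                r[i].pending--;
                return (int)i;
            }
        }
        for (size_t i = 0; i < n; i++)
            r[i].credit = r[i].weight;
    }
    return -1;
}
```

With weights 2 and 1 on two busy rings, this serves two packets from the first ring for every one from the second, while any packet arriving on a strict-priority ring is taken ahead of both.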

Similarly, the host uses multiple transmit descriptor rings, for several reasons. One reason is the same as for receive descriptor rings: multiple rings can be used to send traffic of different priorities. Endpoint software is expected to schedule across the rings to select one and then dequeue a packet from it for transmission on the wire. This scheduling can be similar to the one described above.

A set of descriptor rings meant for different kinds of traffic can be termed a 'Descriptor Ring Group'.

As an Endpoint developer, one should provide this kind of flexibility so that the host can process high priority traffic ahead of low priority traffic.

Multiple Descriptor Ring Groups:

In multicore processor environments, avoiding locks is very important for performance. The Endpoint is expected to provide facilities such that the host never needs to take a lock while dequeuing packets from receive descriptor rings or while enqueuing packets onto transmit descriptor rings. That is where multiple descriptor groups are required. If the host has 4 cores, then at least 4 receive and transmit groups are needed, so that each group can be assigned to a different core and no lock is needed. To ensure that the right core is woken up on an interrupt, each group also needs its own interrupt, affined to the appropriate core. That is, when a core is interrupted, it knows exactly which group of descriptor rings to look at. Since a group is accessed by only one core, no lock is required.

Now that Endpoints are also being implemented on processors, it becomes important to reduce locks on the Endpoint as well. This makes the problem a little more complicated. Assume that the host has a 4-core processor and the Endpoint has an 8-core processor. As discussed above, 4 groups are enough for the host's 4 cores. But then 8 cores in the Endpoint would be updating only 4 groups. Since more than one core would be manipulating each group, a lock would be required in the Endpoint. To avoid locks on both sides, the number of groups needs to be at least the maximum of the number of cores in the host and in the Endpoint. The groups are then divided across the cores on each side. Even though this avoids locks on both sides, traffic distribution can be a problem if the groups cannot be divided equally across the cores, because some cores end up more heavily utilized than others. Take an example where the host has 4 cores and the Endpoint has 7 cores: to make the distribution of groups across cores equal on each side, the number of groups should be the LCM (least common multiple) of 4 and 7, that is, 28 groups. On the host side, each core is assigned 7 groups, and on the Endpoint side, each core is assigned 4 groups.
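The group count in the 4-core and 7-core example is the least common multiple of the two core counts, which is the smallest number that divides evenly on both sides. A small sketch of the arithmetic follows. The function names are hypothetical, and the modulo mapping from group to core is just one possible assignment, chosen here for illustration.

```c
/* Greatest common divisor by Euclid's algorithm. */
static unsigned cores_gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest number of groups that divides evenly across both sides:
 * the least common multiple of the two core counts (both must be > 0). */
static unsigned group_count(unsigned host_cores, unsigned ep_cores)
{
    return host_cores / cores_gcd(host_cores, ep_cores) * ep_cores;
}

/* Assumed mapping: group g belongs to core (g mod cores) on each side,
 * so every group has exactly one owner on the host and one on the Endpoint. */
static unsigned group_owner(unsigned group, unsigned cores)
{
    return group % cores;
}
```

For 4 host cores and 7 Endpoint cores, group_count() gives 28. For 4 and 8 cores it gives 8, which matches the "maximum of the two" rule because 8 is already a multiple of 4.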

If one core has more than one group, there are two challenges. One is interrupt assignment and the other is distribution of traffic across the groups. Even though PCIe supports MSI and MSI-X, there are always practical limits on the number of interrupts that can be used, so each group may not be able to have its own interrupt. Since the purpose of the interrupt is to wake a core, one interrupt per core is good enough.

The producer core is expected to distribute the traffic equally among its groups; round robin on a per-packet basis is one acceptable method. Once the group is chosen, a descriptor ring within the group is selected for the packet, based on the classification criteria described in the previous section. The consumer core is expected to use a similar round-robin mechanism to select the group to dequeue from. Once the group is selected, another scheduler, as described in the previous section, selects the descriptor ring to dequeue the packet from.
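The producer side of this two-level selection might look like the sketch below. All names are hypothetical, the sizes are arbitrary, and the rule that maps a traffic class directly to a ring index stands in for whatever classification the Endpoint really performs.

```c
#include <stddef.h>

#define MAX_GROUPS_PER_CORE 8   /* arbitrary limit for this sketch */
#define RINGS_PER_GROUP     3   /* e.g. management, voice/video, other data */

/* Hypothetical per-core context: the groups this core owns and a cursor. */
struct core_ctx {
    size_t groups[MAX_GROUPS_PER_CORE]; /* ids of the groups owned by this core */
    size_t ngroups;                     /* number of valid entries, at least 1 */
    size_t next;                        /* round-robin cursor into groups[] */
};

/* Where a packet should be placed: which group, and which ring inside it. */
struct slot {
    size_t group;
    size_t ring;
};

/* Pick the next group in round-robin order, then pick the ring in that
 * group from the packet's traffic class (classes beyond the last ring
 * share the last ring). No lock is taken: the context is per core. */
static struct slot place_packet(struct core_ctx *c, size_t traffic_class)
{
    struct slot s;

    s.group = c->groups[c->next];
    c->next = (c->next + 1) % c->ngroups;
    s.ring = traffic_class < RINGS_PER_GROUP ? traffic_class
                                             : RINGS_PER_GROUP - 1;
    return s;
}
```

The consumer core can walk its own groups with the same kind of cursor and then apply the ring scheduler inside the selected group.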

As an Endpoint developer, one should make sure to support the group concept so that locks can be avoided.

Command Descriptor rings:

So far we have talked about packet transmit and receive descriptor rings and groups. Command rings are also required, so that the host can pass commands to the Endpoint and the Endpoint can respond. There are many cases where command rings are required. For example, an Ethernet Endpoint might need to get information from the host such as:

MAC addresses
Multicast Addresses
Local IP addresses
PHY configuration.
Any other offload specific information.

Unlike packet descriptor rings, a response is expected here. Hence the descriptor should let the host provide both a command buffer and a response buffer. The Endpoint is expected to act on the command and put the response in the response buffer.
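One possible shape for such a command descriptor is sketched below. The layout, the opcode values and both helper functions are hypothetical and only illustrate the idea of pairing a command buffer with a response buffer. A real design would also need to settle endianness, alignment and ownership handoff.

```c
#include <stdint.h>

/* Hypothetical opcodes matching the examples listed above. */
enum cmd_opcode { CMD_SET_MAC = 1, CMD_ADD_MCAST = 2, CMD_SET_IP = 3, CMD_SET_PHY = 4 };
enum cmd_status { CMD_PENDING = 0, CMD_DONE = 1, CMD_ERROR = 2 };

/* Hypothetical command descriptor: the host supplies both buffers. */
struct cmd_desc {
    uint64_t cmd_addr;   /* bus address of the command buffer */
    uint64_t resp_addr;  /* bus address of the response buffer */
    uint32_t cmd_len;    /* bytes of command data */
    uint32_t resp_max;   /* capacity of the response buffer in bytes */
    uint32_t resp_len;   /* bytes actually written, set by the Endpoint */
    uint16_t opcode;     /* one of enum cmd_opcode */
    uint16_t status;     /* one of enum cmd_status, set by the Endpoint */
};

/* Host side: fill in a descriptor before handing it to the Endpoint. */
static void cmd_post(struct cmd_desc *d, uint16_t opcode,
                     uint64_t cmd_addr, uint32_t cmd_len,
                     uint64_t resp_addr, uint32_t resp_max)
{
    d->cmd_addr = cmd_addr;
    d->resp_addr = resp_addr;
    d->cmd_len = cmd_len;
    d->resp_max = resp_max;
    d->resp_len = 0;
    d->opcode = opcode;
    d->status = CMD_PENDING;
}

/* Endpoint side: mark the command finished after writing resp_len bytes.
 * A response that would not fit in the host's buffer is reported as an error. */
static void cmd_complete(struct cmd_desc *d, uint32_t resp_len)
{
    if (resp_len > d->resp_max) {
        d->resp_len = 0;
        d->status = CMD_ERROR;
        return;
    }
    d->resp_len = resp_len;
    d->status = CMD_DONE;
}
```

The host can poll the status field or wait for an interrupt on the command ring to learn that the response is ready.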

As an Endpoint developer, one should provide a facility for the host to send commands and receive responses.

Saturday, March 6, 2010
