Sunday, October 31, 2010

Multicore Networking applications - Mitigating the Performance bottlenecks

I gave this talk at the 2010 Multicore Expo in San Jose.  The presentation slides were concise, and I voiced most of the details during the talk itself.  Many people have since asked me to provide the details in written form, so I am doing that in this post.  I hope it gives enough detail on 'New techniques to improve software performance with increasing number of cores'.

Before going further, I would like to differentiate two kinds of applications - Packet processing applications and Stream processing applications.

Packet processing applications, in my definition, are the ones which take a packet, work on it, and send out the same packet, possibly with some minor modifications.  In packet processing applications, there is a one-to-one correspondence between input and output packets, with very few exceptions.  One exception is IP reassembly or fragmentation; another is when the packet is dropped by the application.  Example applications in this category are IP forwarding, L2 bridging, Firewall/NAT, IPsec, and even some portions of IDS/IPS.

Stream processing applications are the ones which may take packets or a stream of data, work on the data, and send out the data or different packets.  Most TCP socket based proxy applications come under this category.  Examples: HTTP proxy, SMTP proxy, FTP proxy, etc.

This post tries to help programmers debug their software and find the performance bottlenecks in Multicore networking applications.

Always Ensure Flow/Session Parallelization

Ensure that only one core processes a given session at any time.  If multiple packets from the same session are processed by more than one core simultaneously, mutual exclusion would be required on the session variables, which is very expensive.  Multicore SoCs actually help you do flow parallelization in packet processing applications.  Many Multicore SoCs can parse fields from the packets, calculate a hash on software-defined fields, and distribute the packets across multiple queues based on the hash value.  They then allow software threads to dequeue the packets from these queues.  These SoCs also provide a provision to stop further dequeues from a queue until the thread holding it gives up control explicitly.  This ensures that a given flow is processed by only one software thread at any time.
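
Where the SoC does not do this for you, the same distribution can be approximated in software.  Below is a minimal sketch; the field names, the FNV-1a hash, and the queue count are all illustrative assumptions, not any particular SoC's API:

```c
#include <stdint.h>

#define NUM_QUEUES 8   /* assumed: one queue per worker thread */

/* Hypothetical 5-tuple; field names are illustrative. */
struct flow_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* FNV-1a-style mixing over the tuple fields; hardware typically does
 * the equivalent over software-selected header fields. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
    uint32_t h = 2166136261u;
    uint32_t w[3] = { t->saddr, t->daddr,
                      ((uint32_t)t->sport << 16) | t->dport };
    for (int i = 0; i < 3; i++) { h ^= w[i]; h *= 16777619u; }
    h ^= t->proto; h *= 16777619u;
    return h;
}

/* All packets of one flow land in the same queue, so only one thread
 * ever touches a given flow's session at a time - no mutex needed. */
static unsigned select_queue(const struct flow_tuple *t)
{
    return flow_hash(t) % NUM_QUEUES;
}
```

Because the hash is a pure function of the tuple, every packet of a flow maps to the same queue without any shared state between cores.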

Many Multicore SoCs can also bind queues to software threads and each software thread to a core.  If the number of flows is small, there is a good chance the cache is already warm with the contexts from previous packets, which reduces data movement from DDR.  Also, many Multicore SoCs can stash the context into the cache as part of the dequeue operation, which reduces the cache thrashing issue even if the queues are not bound to cores.

Flow parallelization not only eliminates the need for mutexes, it also ensures that there is no packet mis-ordering within flows.

Many stateful packet processing applications require not only flow parallelization but also session parallelization.  A session typically consists of two flows - client-to-server traffic and server-to-client traffic.  Packets from both flows may arrive at the device at the same time and be processed by two separate software threads.  Stateful applications share many state variables across these two flows, so mutual exclusion would be required if both packets were allowed to be processed at the same time.  Session parallelization as described here eliminates that need.  Unlike flow parallelization, session parallelization is not available in many Multicore SoCs for cases where the tuple values differ between the two flows, and hence needs to be done in software.  Packet tuples differ when NAT is applied.  Note that many Multicore SoCs enqueue the packets of both flows to the same queue if there is no NAT: they are intelligent enough to generate the same hash value even though the tuple positions are swapped, that is, the source IP of one flow is the destination IP of the reverse flow, and the same holds for the destination IP, source port, and destination port.
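
The no-NAT trick mentioned above - generating the same hash for both directions - can be sketched in software by canonically ordering the two endpoints before hashing.  This is an illustrative sketch, not any vendor's implementation; with NAT the tuples genuinely differ, so software there has to steer the reverse flow to the same worker explicitly:

```c
#include <stdint.h>

/* Order the two endpoints canonically before hashing so that
 * client->server and server->client packets produce the same value. */
static uint32_t session_hash(uint32_t ip_a, uint16_t port_a,
                             uint32_t ip_b, uint16_t port_b,
                             uint8_t proto)
{
    uint32_t lo_ip, hi_ip;
    uint16_t lo_pt, hi_pt;

    if (ip_a < ip_b || (ip_a == ip_b && port_a <= port_b)) {
        lo_ip = ip_a; lo_pt = port_a; hi_ip = ip_b; hi_pt = port_b;
    } else {
        lo_ip = ip_b; lo_pt = port_b; hi_ip = ip_a; hi_pt = port_a;
    }

    uint32_t h = 2166136261u;               /* FNV-1a style mixing */
    uint32_t w[3] = { lo_ip, hi_ip, ((uint32_t)lo_pt << 16) | hi_pt };
    for (int i = 0; i < 3; i++) { h ^= w[i]; h *= 16777619u; }
    return (h ^ proto) * 16777619u;
}
```

Both directions of a session therefore select the same queue, so one thread owns the whole session.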

Stream processing modules such as proxies need to ensure that both the client-side and server-side sockets of a session are processed by the same software thread, so that no mutual exclusion is required to protect the sanctity of the state variables.  Stream processing modules typically create many worker threads.  A master thread terminates the client-side connections and hands over each connection descriptor to one of the less loaded worker threads.  The worker thread is expected to create a new connection to the server and do the rest of the application processing.  Worker threads typically implement an FSM for processing multiple sessions.  More often than not, the number of worker threads equals the number of cores dedicated to that application.  In cases where threads need to block for some operations, such as waiting for accelerator results, more threads, in multiples of the number of cores, would be created to take full advantage of the accelerators.
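
A minimal sketch of the master thread's worker selection; the structure names and the load metric are assumptions, and a real proxy would also need a wakeup mechanism (a pipe, eventfd, or lock-free ring) to actually pass the descriptor:

```c
#define NUM_WORKERS 4   /* assumed: one worker per dedicated core */

/* Illustrative per-worker state; a real implementation would add a
 * descriptor queue and a wakeup channel per worker. */
struct worker {
    int active_sessions;
};

static struct worker workers[NUM_WORKERS];

/* Master thread: after accept(), pick the least loaded worker and
 * hand the connection descriptor to it.  Both the client-side and the
 * server-side socket of the proxied session then live on that one
 * worker, so its state variables need no mutex. */
static int pick_worker(void)
{
    int best = 0;
    for (int i = 1; i < NUM_WORKERS; i++)
        if (workers[i].active_sessions < workers[best].active_sessions)
            best = i;
    return best;
}
```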

Eliminate the Mutual Exclusion Operation while Searching for Session/Flow Context

This technique also ensures that there are no mutual exclusion operations in the packet path.  Every networking application does some search operation on its data structures to figure out what operations and actions to perform on the packet/data.  For each incoming packet, a search is done to get hold of the session/flow context, and further packet processing happens based on the state variables in the session.  For example, IP routing searches the routing table to find the destination port, PMTU, and other information needed for operations such as fragmentation, TTL decrement, and packet transmit.  Similarly, firewall/IPsec packet processing applications maintain their sessions in easy-to-search data structures such as RB trees and hash lists.  Since sessions are created and removed dynamically, these structures must be protected during add/delete/search operations.  Mutual exclusion using spinlocks, futexes, or up/down semaphores is one way to do this.  RCU (Read-Copy-Update) is another method, and it eliminates the mutex operation during search.  RCU operation is described in earlier posts; please check those for details.  The RCU lock/unlock operations in many operating systems are very cheap.  Note that mutex operations are still required for add/delete even with RCU.
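
The reader side of this idea can be sketched with C11 atomics: the writer fully initializes a session and then publishes it with a release store, so readers can traverse the bucket without any lock.  Note this sketch shows only the publish/lookup half; real RCU (Linux kernel RCU, or liburcu in user space) additionally defers freeing removed nodes until all current readers are done, which is omitted here:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct session {
    uint32_t key;
    int      state;
    struct session *_Atomic next;   /* readers traverse without locks */
};

static struct session *_Atomic bucket_head;

/* Lock-free lookup: the search path takes no mutex at all. */
static struct session *session_lookup(uint32_t key)
{
    struct session *s =
        atomic_load_explicit(&bucket_head, memory_order_acquire);
    while (s) {
        if (s->key == key)
            return s;
        s = atomic_load_explicit(&s->next, memory_order_acquire);
    }
    return NULL;
}

/* Writer side: fully initialize the node, then publish it with a
 * release store so readers never see a half-built session.  A mutex
 * is still needed among writers for add/delete (omitted here), just
 * as the post says. */
static void session_add(struct session *s)
{
    atomic_store_explicit(&s->next,
        atomic_load_explicit(&bucket_head, memory_order_relaxed),
        memory_order_relaxed);
    atomic_store_explicit(&bucket_head, s, memory_order_release);
}
```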

Eliminate Reference Counting 

One of the other bottlenecks in Multicore programming is the need to keep a session safe from deletion while it is being used by other software threads.  Traditionally this is achieved through 'reference counting'.  Reference counting is used in two cases - during packet processing, and when a neighbor module stores a reference.

In the first case, the reference count of the session context is incremented as part of the session lookup operation.  During packet processing, the session is referred to many times to read state variable values and to set new ones.  If the session is deleted meanwhile, it must not be freed until the current thread is done with its operation; otherwise, the session memory could be freed, allocated to somebody else, and corrupted during packet processing.  To ensure the session is not given away, the reference count is checked as part of the delete operation.  If it is not zero, the session is marked for deletion but not freed until the reference count drops to zero.  If it is zero, there is no reference to the session and it gets freed.
Since RCU postpones the delete operation until all other threads complete their current processing cycle, reference counting becomes redundant.  Eliminating the reference count not only improves performance, it also reduces maintenance complexity.  Note that reference counting requires atomic operations on the count variable, and atomic operations are not inexpensive.
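
For contrast, the pattern that RCU makes redundant looks roughly like this - note the atomic read-modify-write paid on every lookup and every release (the names are illustrative, and a production version would also need care around the delete-flag check):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Traditional refcounted session: every packet pays two atomics. */
struct fw_session {
    atomic_int refcnt;
    bool       marked_for_delete;
    /* ... state variables ... */
};

static void session_hold(struct fw_session *s)
{
    atomic_fetch_add(&s->refcnt, 1);   /* done inside every lookup */
}

/* Returns true when the caller dropped the last reference and the
 * session was already marked deleted, i.e. it is now safe to free. */
static bool session_release(struct fw_session *s)
{
    return atomic_fetch_sub(&s->refcnt, 1) == 1 && s->marked_for_delete;
}
```

With RCU, both atomics disappear from the packet path: the grace period guarantees no reader still holds the session when it is finally freed.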

The second use case of reference counting is when neighbor modules store a reference (pointer) to a session in their own session contexts.  By eliminating the storage of the pointer, this reference count usage can be eliminated.  An earlier post explains how this can be done.

Linux user space programs can also take advantage of RCU.  See an earlier post for more details.

Use the Cache Effectively

Once the matching session is found for an incoming packet or data, the processing function uses many variables in the session.  If these variables sit together in a cache line, a cache fill caused by access to one variable makes all the other variables in that line available in the cache; access to those variables will not go to DDR.  All variables may not fit in one cache line, though.  In those cases, group the related variables together to reduce trips to DDR.
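
A hypothetical session layout illustrating the idea - the per-packet fields packed into one 64-byte line, rarely used fields pushed out of it.  The 64-byte line size and the GCC-style attributes are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hot fields fit in one 64-byte cache line; cold fields start on the
 * next line so they never share a line with per-packet state. */
struct nat_session {
    struct {                       /* hot: read/written per packet */
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint32_t state;
        uint32_t flags;
        uint64_t last_seen;
        uint64_t pkt_count;
    } hot __attribute__((aligned(64)));

    struct {                       /* cold: setup/teardown/CLI only */
        char     iface_name[32];
        uint64_t created_at;
    } cold __attribute__((aligned(64)));
};
```

One miss on the hot line pulls in everything the fast path needs; updating `pkt_count` never invalidates the line holding `iface_name` on another core.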

To use the instruction cache effectively, always annotate your code with likely/unlikely compiler directives.  Compilers will try to arrange the likely() code together.
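
A small sketch of the pattern, using the usual GCC/Clang macro spellings; the packet structure and the length threshold are illustrative:

```c
/* GCC/Clang branch annotations: the compiler moves unlikely() paths
 * out of the hot instruction stream. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define MIN_HDR_LEN 20   /* illustrative minimum header size */

struct pkt { unsigned len; };

static int drop_count;

/* Error handling is the rare case, so it is marked unlikely(); the
 * fall-through forward path stays dense in the instruction cache. */
static int process_pkt(struct pkt *p)
{
    if (unlikely(p->len < MIN_HDR_LEN)) {
        drop_count++;
        return -1;
    }
    return 0;   /* common case: forward the packet */
}
```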

Reduce Cache Thrashing due to Statistics variables

Almost all networking applications update statistics variables.  Some are global and some are session-context specific.  There are two types of statistics variables - increment variables and add variables.  Increment variables are typically used to maintain packet counts; add variables are used to maintain byte counts.  Updating these variables requires reading the current value and then applying the add or increment.  If these variables are updated by multiple threads (with each thread running on a specific core), then every time a variable is updated, the cached copy of that variable in the other cores is no longer valid.  When one of the other cores needs to do the same operation, it must first fetch the current value from DDR and then apply the operation.  In the worst case, where packets go in round-robin fashion to different software threads (and hence cores), the cache thrashing due to statistics variables would be very high, and this reduces performance dramatically.

Always use 'per core/thread statistics counters'  whenever possible.  Please see this post for more details. 
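
A minimal sketch of per-core counters, assuming a 64-byte cache line and GCC-style alignment attributes:

```c
#include <stdint.h>

#define MAX_CORES 16

/* One statistics block per core, each padded to its own cache line so
 * cores never invalidate each other's copies.  Plain (non-atomic)
 * increments are safe because each core touches only its own slot. */
struct pkt_stats {
    uint64_t pkts;
    uint64_t bytes;
} __attribute__((aligned(64)));

static struct pkt_stats per_core_stats[MAX_CORES];

static void stats_update(unsigned core, unsigned pkt_len)
{
    per_core_stats[core].pkts++;
    per_core_stats[core].bytes += pkt_len;
}

/* Readers (e.g. a CLI 'show stats') pay the aggregation cost once,
 * instead of the fast path paying cache-thrash cost per packet. */
static uint64_t stats_total_pkts(void)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        total += per_core_stats[i].pkts;
    return total;
}
```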

Some Multicore SoCs provide a special feature which eliminates even the need for per-core statistics maintenance.  These SoCs let software allocate a memory block for statistics and then fire update operations and forget about them.  Firing an operation involves the operation type (increment, decrement, add X, subtract Y, etc.) and a memory address (32-bit or 64-bit).  The SoC performs the operation internally without cache thrashing.  I strongly suggest using this feature if it is available in your SoC.

Use LRO/GRO facilities

Many networking applications' performance depends more on the number of packets processed than on the number of bytes processed.  Examples: IP forwarding, Firewall/NAT, and IPsec with hardware acceleration.  So reducing the number of packets processed becomes key to improving performance.

LRO/GRO facilities, provided by the operating system's Ethernet drivers or by Multicore SoCs, reduce the number of TCP packets when multiple packets from the same TCP flow are pending to be processed.  Since TCP is a byte-oriented stream protocol, it does not matter whether processing happens on packet boundaries.  Please see an earlier post for more information on the LRO feature in Linux.  If it is supported by your operating system or Multicore SoC, always make use of it.

Process Multiple Packets together

Each packet processing module does a set of operations on the packets/data - such as search, process, and packet out.  If a packet goes through multiple modules, many C functions get called, and each invocation has its own overhead: pushing variables onto the stack, initializing local variables, and so on.  Bunching multiple packets of the same flow together reduces the search/packet-out overhead and the overhead of the C function calls.
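
A sketch of amortizing the search across a burst; the flow identifier and the lookup function are illustrative stand-ins for a real session table:

```c
struct pkt { unsigned flow_id; };

static unsigned lookups;   /* counts searches, to show the amortization */

/* Stand-in for the real session table search. */
static void session_lookup(unsigned flow_id)
{
    (void)flow_id;
    lookups++;
}

/* Process a burst: the search is done once per run of same-flow
 * packets instead of once per packet, and the whole burst shares one
 * function-call round trip through the module. */
static void process_burst(struct pkt *pkts, int n)
{
    unsigned cur = (unsigned)-1;   /* sentinel: no flow looked up yet */
    for (int i = 0; i < n; i++) {
        if (pkts[i].flow_id != cur) {       /* new flow in the burst */
            session_lookup(pkts[i].flow_id);
            cur = pkts[i].flow_id;
        }
        /* per-packet processing on the already-found session ... */
    }
}
```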

Some Multicore SoCs can coalesce packets on a per-queue basis, governed by two coalescing parameters - a packet threshold and a time threshold.  The queue does not let the target thread dequeue until one of the conditions is reached: either the number of packets in the queue exceeds the packet threshold, or no packet has been dequeued for the duration of the time threshold.  If this facility is available, ensure that your software dequeues multiple packets together and processes them together.

At times, there is no one-to-one correspondence between queues and sessions.  In that case, one might argue that the search overhead can't be reduced, as there is no guarantee that the packets in the same queue belong to the same session.  Though that is correct, there can still be some improvement from cache warming if more than one packet in the bunch belongs to the same session.

As a software developer, strive for one-to-one correspondence between queues and sessions.  This can be done easily among the modules running in software.  Some Multicore SoCs provide queues not only for accessing hardware blocks, but also for inter-module communication.  Software can take advantage of this to create a one-to-one mapping between queues and the destination module's sessions.

It is true that when packets are read from the Ethernet controllers, there is no way to ensure that a queue holds packets of only one session, as queue selection happens based on the hash value of packet fields; two different sessions may fall into the same queue.  In those cases, as mentioned above, you might not see improvement in the search functionality, but you would still see improvement from the reduced number of C function invocations.

Many Multicore SoCs can also take multiple packets together for acceleration and for transmission.  This reduces the number of invocations of the acceleration functions and of the transmit path.  If this facility is available in your Multicore SoC, take advantage of it.

Eliminate usage of software queues

Some Multicore applications need to send packets/data/control-data to other modules.  If multiple threads send data to a queue, mutual exclusion is needed to protect the queue data structures.

Many Multicore SoCs provide queues for software usage.  These eliminate the need for software queues, and hence the mutual exclusion problem, thereby improving performance.  Some Multicore SoCs also allow grouping multiple queues into a queue group, which lets the sending and receiving applications enqueue priority items and dequeue based on priority.  These queues can be used even across processes or virtual machines, as long as shared memory is used for the items that get enqueued and dequeued.  Some Multicore SoCs go a step further and provide a 'copy' feature which avoids shared memory, thereby providing good isolation: the SoC copies the items from the source process into internally managed memory, and copies them into the destination process's memory as part of the dequeue operation.

Always use this feature if it is available in your Multicore SoC.

Eliminate the usage of Software Free pools 

Networking applications use free pools of memory blocks for memory management.  These free pools are used to allocate and free session contexts, buffers, etc.  Many software threads require these facilities at different times.  Software typically maintains the memory pools on a per-core basis to avoid a mutual exclusion operation on every allocation.  Since there is a possibility of asymmetric usage of the pools by different threads, at times memory allocation can fail even though there are free memory blocks in other threads' pools.  To avoid this, software has to do complex operations in these scenarios, moving memory blocks from one pool to another through global queues.  Many Multicore SoCs provide 'free pool' functionality in hardware: allocation and free can be done by any thread at any time without mutual exclusion.  Use this facility whenever it is available.  It saves some core cycles, and more than that, it provides efficient usage of the memory blocks.
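
For reference, the software free list that hardware pools replace looks roughly like this (a single-threaded sketch with illustrative names; a per-core version would keep one list per thread and inherit the stranding problem described above):

```c
#include <stddef.h>

#define POOL_BLOCKS 4   /* illustrative pool size */

/* A software free list; hardware free pools replace exactly this,
 * letting any thread allocate/free without a lock. */
struct block {
    struct block *next;
    unsigned char data[256];
};

static struct block  blocks[POOL_BLOCKS];
static struct block *free_list;

static void pool_init(void)
{
    free_list = NULL;
    for (int i = 0; i < POOL_BLOCKS; i++) {
        blocks[i].next = free_list;
        free_list = &blocks[i];
    }
}

/* In a per-core scheme these two functions need no lock, but blocks
 * stranded in one core's list can starve another core's allocations;
 * a hardware pool has neither the lock nor the stranding problem. */
static struct block *pool_alloc(void)
{
    struct block *b = free_list;
    if (b)
        free_list = b->next;
    return b;
}

static void pool_free(struct block *b)
{
    b->next = free_list;
    free_list = b;
}
```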

Use Multicore SoC acceleration features to improve performance

There are many acceleration features available in Multicore SoCs.  Try to take advantage of them.  I classify acceleration functions in Multicore SoCs into three buckets - Ingress in-flow acceleration, In-flight acceleration, and Egress in-flow acceleration.

Ingress In-flow acceleration:  Acceleration functions that Multicore SoCs perform in hardware on the packets before they are handed over to software are called Ingress in-flow accelerations.  Some of the features I am aware of in Multicore SoCs are:
  • Parsing of packet fields:  Some Multicore SoCs parse the headers and make the fields available to software along with the packet, so software that needs those fields can skip its own parsing.  These SoCs also let software choose which fields are made available along with the packet, and provide facilities to create parsers for extracting fields from proprietary or non-pre-defined headers.  Try to take advantage of this feature.
  • Distribution of packets across threads:  This is a basic feature required in Multicore environments.  Packets need to be distributed to different software threads.  Many Multicore SoCs also ensure that packets belonging to one flow go to one software thread at any time, so packets will not get mis-ordered within a flow.  As described above, hardware places the packets across multiple queues; queue selection is based on a hash calculated over a set of software-programmable fields.  As a software developer, take advantage of this feature rather than implementing the distribution in software.
  • Packet integrity checks & processing offloads:  Many Multicore SoCs do quite a few integrity checks on the packet, as listed below.  Ensure that your software doesn't repeat them, to save core cycles.
    • IP Checksum verification.
    • TCP, UDP checksum verification.
    • Ensuring that the headers are present in full.
    • Ensuring that the size of the packet is not less than the size indicated in the headers.
    • Checking for invalid field values.
    • IPsec inbound processing.
    • Reassembly of fragments
    • LRO/GRO as described above.
    • Packet coalescing as described above.
    • Many more.
  • Policing:  This feature can police the traffic and reduce the amount of traffic seen by software.  If your software requires policing of particular traffic to stop the cores from getting overwhelmed, use this feature rather than doing it in the lowest layers of software.
  • Congestion management:  This feature ensures that the number of buffers used by the hardware won't grow unbounded.  Without it, cores may not find buffers to send out packets if all buffers are used up by the receiving hardware.  This situation typically happens when the core is doing a lot of processing, and hence dequeuing slowly, while many more packets are coming in.  Many Multicore SoCs can also generate pause frames in case of congestion.
Egress In-flow acceleration:  Acceleration functions done in hardware on packets after software hands them over for transmission are called Egress in-flow acceleration functions.  Some of them are given below.  If these are available, take advantage of them in your software, as they can save a significant number of core cycles.
  • Shaping and Scheduling :  High priority packets are sent out within the shaped bandwidth.  Many Multicore SoCs provide facilities to program the effective bandwidth. These SoCs shape the traffic with this bandwidth. Packets which are queued to it by software would be scheduled based on the priority of the packets.  Some SoCs even provide multiple scheduling algorithms and provide facility for software to choose the algorithm on per physical or logical port.  Some SoCs even provide hierarchical scheduling and shaping.  Take advantage of this in your software if you require shaping and scheduling of the traffic.
  • Checksum Generation for IP and TCP/UDP transport packets :  Checksum generation, especially for locally generated TCP and UDP packets is very expensive.   Use the facilities provided by hardware.  
  • IPsec outbound processing:  Some Multicore SoCs provide this functionality in hardware.  If you require IPsec processing, use this facility to save a large number of cycles per packet.
  • TCP Segmentation and IP Fragmentation :  Some Multicore SoCs provide this functionality.  TCP segmentation performs well for local generated packets. Use this functionality to get best out of your Multicore.
In-flight Acceleration:  Acceleration functions provided by hardware that can be used during packet processing are called In-flight acceleration functions.  Crypto, crypto with protocol offload, pattern matching, and XML acceleration are some of the functions in this category.  Here the packet/data is handed over to the hardware acceleration function by software, and software reads the results later when they are ready.  Take advantage of these features wherever they are available.  Some Multicore SoCs differentiate themselves by doing a lot more in the acceleration functions; for example, some offload the protocol along with the crypto - IPsec ESP, SSL record layer, SRTP, and MACsec offloads all go beyond plain crypto offload.

Many times people ask me how to use the acceleration functions.  I detailed this a long time back in earlier posts; please see the details there.

Software Directed Ingress In-flow accelerations:

As described before, ingress in-flow acceleration is applied before packets are given to software.  Packets received on the integrated Ethernet controllers go through this acceleration.  But many times this acceleration is needed from software too.  Take the example of IPsec, SSL, or any tunneling protocol: once software processes these packets, that is, once it gets hold of the inner packets, software would like ingress in-flow acceleration to be applied on the inner packets for distribution across cores and for other acceleration functions.  To facilitate these scenarios, some Multicore SoCs provide the concept of an 'offline port', which allows software to reserve offline ports and send traffic through them for ingress in-flow acceleration.  Some software features that can take advantage of this are:
  • Tunneled traffic, as described above, to let the inner packets go through ingress in-flow acceleration.
  • IP reassembled traffic:  Once the fragments are reassembled, the packet has all 5 tuples, which can be used to distribute the traffic through the offline port.
  • L2 encapsulated packets, such as IP packets from PPP, FR, etc.
  • Ethernet controllers on PCI and traffic from wireless interfaces:  Here the traffic might need to be read by software first, as ingress in-flow acceleration might not be implemented for these interfaces.  After getting hold of the packets, software can direct them to the in-flow acceleration functions through offline ports.
Use Multicore core features wherever they are available

Multicore SoCs from different vendors have different core architectures.  Some Multicore SoCs are based on Power PC cores, some on MIPS cores, and Intel Multicore is based on x86 processors.  Multicore SoC vendors provide different features to improve the performance of Multicore applications.  Whenever they are available, software should make use of them to get the best performance out of the cores.  Some of the features I am aware of are listed below.

Single Instruction, Multiple Data (SIMD) instructions

Multicore SoCs from Freescale and Intel have this block in their cores.  It allows software to do a given operation on multiple data elements at once; this kind of parallelism is called 'data level parallelism'.  The 'add' operation in typical cores works on 32-bit or at most 64-bit operands.  The current generation of SIMD units do this operation on 128-bit operands, and also provide the flexibility to do multiple 16-bit or 32-bit add operations on different parts of the data simultaneously.  SIMD greatly helps in operations that involve arithmetic, bit, copy, or compare operations on large amounts of data.  Any operation done in a loop can potentially be accelerated using SIMD.  In the networking world, SIMD is helpful in the following cases:
  • Memory compare, copy,  clear operations.
  • String compare, copy, tokenization and other string operations.
  • WFQ scheduling of QoS, where multiple queues need to be checked to figure out which queues need to be scheduled based on sequence number property of queues.  If the sequence numbers are arranged in array form, then SIMD can be used very effectively.
  • Crypto operations.
  • Big Number arithmetic which is useful in RSA, DSA and DH operations.
  • XML Parsing and schema validations.
  • Search algorithms -  Accelerating compare operation to find matching entry from collision elements in a hash list.
  • Checksum verification and generation:  In some cases, ingress and egress in-flow accelerations can't be used to verify and generate the checksums.  One example is TCP and UDP packets that come in an IPsec tunnel: since the packets are encrypted, the ingress and egress accelerators cannot verify or generate checksums on the inner packets.  Packets that get encapsulated in tunnels similarly cannot take advantage of ingress & egress in-flow accelerations.  The checksum verification and generation must then be done by the cores in software, and SIMD helps tremendously in those cases.
  • CRC verification and generation:  These algorithms are not expensive enough to justify in-flight acceleration, yet not inexpensive for the core to do.  SIMD helps here, as it requires no architecture changes to the software and still gives much better performance than cores without SIMD.
Normally, SIMD-capable cores give at least 50% performance improvement for typical workloads.  So, as a software developer, figure out which parts of your code can be improved using SIMD and modify them to improve the performance of your application.
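
As an illustration, here is a sketch of data-level parallelism using SSE2 intrinsics, assuming an x86 core (Power and MIPS cores have AltiVec/MIPS-SIMD equivalents): four 32-bit additions per instruction instead of one, the kind of loop that shows up in statistics aggregation or checksum folding.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sum an array of 32-bit words four lanes at a time.  Unaligned loads
 * are used so the caller need not guarantee 16-byte alignment. */
static uint32_t simd_sum32(const uint32_t *p, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc,
              _mm_loadu_si128((const __m128i *)(p + i)));

    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++)   /* scalar tail for leftover elements */
        sum += p[i];
    return sum;
}
```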

Speculative Hardware Data Prefetching & Software Directed Prefetching

This feature fetches the next cache line's worth of data beyond the current memory access, in the hope that software will use it.  Many core technologies provide control to enable and disable this at run time.  Software can take advantage of it while doing memory copy, set, and compare operations.  Any data arranged linearly in memory (such as arrays) can get a good performance boost from this feature.  Note that if this feature is not used selectively and carefully, it can even degrade performance.  Be careful in using it.

Many cores also provide a special instruction to warm the cache given a memory address.  Software developers know what processing comes next (the next module), and many times the next module's session context is also known.  In those cases, software can be written so that the next module's session is prefetched while packet processing happens in the current module.  When the next module gets control of the packet, its session context is already in the cache, which avoids fetching it from DDR serially.  My experience is that software directed prefetching gives very good results.  It also ensures that performance does not go down even with a large number of sessions.
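
A sketch of cross-module software directed prefetching; the module names and the cached next-context pointer are illustrative assumptions, and `__builtin_prefetch` is the GCC/Clang spelling of the cache-warming instruction:

```c
struct pkt;   /* opaque packet type for the sketch */

struct ipsec_sa {
    int seq;
    /* ... SA state ... */
};

struct fw_session {
    int state;
    struct ipsec_sa *next_ctx;   /* cached pointer to next module's context */
};

static void firewall_process(struct fw_session *fw, struct pkt *p)
{
    /* Start pulling the next module's context into cache now;
     * __builtin_prefetch is only a hint and never faults. */
    if (fw->next_ctx)
        __builtin_prefetch(fw->next_ctx, 1 /* for write */, 3 /* keep */);

    fw->state++;   /* firewall work overlaps the prefetch latency */
    (void)p;
}

static void ipsec_process(struct fw_session *fw, struct pkt *p)
{
    fw->next_ctx->seq++;   /* SA is (hopefully) already in cache */
    (void)p;
}
```

The prefetch overlaps with the current module's work, so by the time the next module runs, the DDR access has already completed in the background.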

Some Multicore SoCs provide support for cache warming on incoming packets.  As part of making a packet ready for software, these SoCs warm the cache with part of the packet content, the annotation data containing the parsed fields, and software-issued context data.  When software dequeues the packet, most of the information the receiving module needs to process it is already in the cache, avoiding on-demand DDR access.  Software can program its context on a per-queue basis.  Note that this feature helps the first module that receives the packet.  That is actually good enough, as this module can prefetch the next module's context while the packet is being processed.  As long as each module does this, there is no performance degradation even at high capacity.

As described before, hardware queues may not have one-to-one correspondence with the receiving module's session contexts; a queue might hold packets for multiple sessions.  Many times, software maintains the sessions in a hash table with a large number of hash buckets, with the collision sessions arranged in a linked list or RB tree.  Software can ensure that there are as many queues as hash buckets and program the first collision element into each queue.  If the matching context is not the programmed one, you might not get the full benefit of cache warming by the hardware.  But if there are four collision elements and the traffic across them is even, cache warming still comes in handy 25% of the time.  Some software developers might even store the collision elements in an array and program the array into the queue.

Software directed prefetch works very well as long as there is one-to-one correspondence between the current module's session context and the next module's.  In that case, the current module's session context can cache a reference to the next module's session context and use it for the prefetch operation.  This scheme also works fine if the next module's context is a superset of multiple current module contexts.  But it does not work well when the next module's context is finer grained - for example, an IPsec SA carrying packets from multiple firewall/NAT sessions.  In this case, the 'Software Directed Ingress In-flow acceleration' method can be used to direct the hardware to send the packet to the next module.  This method not only provides cache warming, but also distributes the processing across multiple cores.

Hardware Page Table walk:

Some cores provide nested hardware page table walk to find the physical address for a given virtual address.  This is really useful for user space applications on Linux-like operating systems.  The hardware page table walk feature is expected to be taken care of by operating system vendors, but unfortunately many OS vendors do not take advantage of it.  As a software developer, if your Multicore SoC provides this feature, don't forget to ask your OS vendor to take advantage of it.  This ensures that your performance does not drop when you move your application from a bare-metal environment (where the TLB entries are fixed and no page walk is required) to Linux user space.

I hope it helps.

Sunday, October 10, 2010

Fastpath IPsec implementations - Developer integration tips on Inbound policy check

The basic purpose of IPsec fast path implementations is to reduce the IPsec processing load on the main processing cores.  Since most IPsec processing is the same across different kinds of packets, offloading this processing makes sense.

There are companies today who provide fast path implementations - either as a software component or as an add-on card, such as a PCIe card that goes into a slot of the main processing unit, for example an x86-based motherboard.

Software-based fast path implementations are becoming quite popular in Multicore environments.  The fast path runs on some cores and the rest of the cores are used for other applications.

Ipsec fast path implementations typically work as follows:

  • The fast path typically owns the Ethernet and other L2 ports. That is, all packets come to the fast path plane first.
  • If there is enough state information to process a packet, the fast path acts on it without involving the normal path running on the main cores; the packet might even get transmitted out directly.  If the packet requires some application processing that is not present in the fast path, it is handed over to the normal path.  In the case of the IPsec fast path, decrypted packets are given to the normal path in the inbound direction; in the outbound direction, IPsec processing is done before the packet is sent out.
The basic purpose of the fast path is to save CPU cycles so that the main cores can do other processing.
Note that fast path implementations from different vendors are not all created equal.
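The dispatch decision the bullets above describe might be skeletonized as follows. The types and flag names are made up for illustration; a real fast path keys off SA and session table lookups, not precomputed flags.

```c
enum verdict {
    FP_TRANSMIT,         /* fully handled in the fast path */
    FP_TO_NORMAL_PATH    /* punt to the normal path on the main cores */
};

struct pkt {
    int has_sa_state;         /* fast path holds enough state (e.g. SA) */
    int needs_app_processing; /* e.g. decrypted inbound packet needing apps */
};

static enum verdict fastpath_rx(const struct pkt *p)
{
    if (!p->has_sa_state)
        return FP_TO_NORMAL_PATH;  /* no state: normal path must handle it */
    if (p->needs_app_processing)
        return FP_TO_NORMAL_PATH;  /* processed, but apps live elsewhere */
    return FP_TRANSMIT;            /* e.g. outbound encap done; send it out */
}
```

The interesting design point is the middle case: for inbound IPsec the fast path can do all the decapsulation work and still have to hand the inner packet up, which is exactly where the inbound policy check discussed below belongs.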

In this post, I specifically would like to concentrate on the 'Inbound Policy Check'.  Some fast path implementations skip this check, and their vendors typically justify it as a performance optimization.  Some people believe it can be skipped without any security implications too.  Unfortunately, that is not true.

What is inbound policy check?

The inbound policy check ensures that a decapsulated IPsec packet used the SA that was formed for this traffic.  It also ensures that the inbound policy rules allow this traffic to come through.

What are the issues if the inbound policy check is not done?

I can think of two issues - a DoS attack, and allowing traffic that is supposed to be denied (an access control violation).

DoS attack:

Let us assume that a corporate gateway has two tunnels to two different partners - Partner1 and Partner2. Without the inbound policy check, it is possible for Partner1 to interfere with the sessions/traffic between the corporate gateway and Partner2. That is, Partner1 can mount a denial-of-service attack on Partner2 traffic.  Even though I have taken the example of partners, this kind of attack is possible among IPsec remote users too.

Let us assume this scenario:

SGW:  Security Gateway of a corporation - It is protecting network
PSGW1:  Partner 1 Security Gateway - Its LAN is
PSGW2:  Partner 2 Security Gateway. Its LAN is

There are two security tunnels from SGW - One to PSGW1 and another to PSGW2. Let us call them Tunnel1 and Tunnel2 respectively.

Tunnel1 is negotiated to secure traffic between 10.1.10/24 and  Tunnel2 is negotiated to secure traffic between 10.1.10/24 and   Let us also assume that the Tunnel1 SPI at the SGW is SPI1 and the Tunnel2 SPI at the SGW is SPI2.

It is expected that any packet coming from PSGW1 has SPI1 in its ESP header, with the inner IP packet's SIP address being one of the addresses in and its DIP address being one of  Similarly, any packet coming from PSGW2 is expected to have SPI2 in its ESP header, with the inner IP packet's SIP address being one of the addresses in and its DIP address being one of 

Now to the attack scenario:

If the PSGW1 network sends inner packets whose IP addresses are other than and , the SGW is expected to drop those packets.  The SGW can drop this traffic only if it does the inbound policy check.  If PSGW1 is allowed to send arbitrary inner packets, then PSGW1 and its network can misuse this by sending inner packets with IPs of the PSGW2 LAN and the SGW LAN.  Since it sends the traffic on the right SA using its own SPI, SGW IPsec packet processing happens smoothly.  If no other check is done, this traffic can reach the SGW LAN.   Based on the type of traffic, different attacks are possible.  Some of them are:
  • If an attacker at PSGW1 guesses the TCP ports of some long-lived sessions between the PSGW2 network and the SGW network, it can send RST packets or ICMP error messages to terminate those connections.
  • An attacker at PSGW1 can send an ICMP Echo message to an SGW LAN multicast IP address with the SIP set to a PSGW2 LAN machine.  Replies from all machines in the SGW LAN go to the PSGW2 victim machine and overwhelm it.
If the SGW checks the inbound policy on the inner IP packet after IPsec decapsulation, it would find that the SA associated with the matching inbound policy is not the same as the SA used to decapsulate the packet.  Whenever there is such a mismatch, the packet is expected to be dropped.  With this check, no malicious traffic would reach the SGW LAN in the above scenario.  Also, by logging these events, the administrator can find the misbehaving peer security gateway and take appropriate out-of-band action.
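A minimal sketch of the check itself follows, assuming a simple subnet-based inbound policy list; the structures and the match logic are illustrative, not from RFC 4301's full selector model.

```c
#include <stddef.h>

struct sa {
    unsigned spi;              /* illustrative; real SAs carry much more */
};

struct policy {
    unsigned sip_net, sip_mask;   /* inner source subnet  */
    unsigned dip_net, dip_mask;   /* inner destination subnet */
    struct sa *sa;                /* SA negotiated for this policy */
    struct policy *next;
};

/* Return nonzero only if the inner packet matches a policy AND that
 * policy is bound to the very SA that decapsulated the packet.  An SA
 * mismatch is exactly the PSGW1-spoofing-PSGW2 case in the text. */
static int inbound_policy_check(const struct policy *spd,
                                unsigned sip, unsigned dip,
                                const struct sa *decap_sa)
{
    for (const struct policy *p = spd; p; p = p->next) {
        if ((sip & p->sip_mask) == p->sip_net &&
            (dip & p->dip_mask) == p->dip_net)
            return p->sa == decap_sa;  /* mismatch => drop (and log) */
    }
    return 0;                          /* no matching policy: deny */
}
```

With this in place, a packet that decrypts fine on Partner1's SA but carries Partner2's inner addresses matches Partner2's policy, fails the SA comparison, and is dropped before it can reach the SGW LAN.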

Access Control Violation:

This is one more problem that can arise if the inbound policy check is not done.
Many IPsec normal path implementations provide a facility for administrators to add multiple rules with different actions to the security policy database (SPD).   Rules normally have 5-tuple selectors, with ranges/subnets/exact IP addresses for source and destination and ranges/exact values for UDP/TCP ports.  The action can be one of 'Bypass', 'Discard' and 'Apply'.   Rules are arranged in an ordered list.  During packet processing, a rule search is done and stops at the first match; the action specified on the matching rule is taken.  If the action is 'Bypass', the packet is forwarded without any IPsec processing.  'Discard' indicates that the packet is to be dropped.  'Apply' indicates that IPsec processing is to be applied.  Normally administrators configure the rules with respect to outbound traffic.  Inbound policy rules are created automatically by the system from the outbound policy rules by reversing the selectors - that is, SIP becomes DIP and vice versa, and similarly SP becomes DP and vice versa.
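The first-match-wins search over the ordered rule list can be sketched as follows. The structures, the subnet-only selectors, and the default-discard choice for unmatched traffic are my assumptions for brevity; real SPDs also match port ranges.

```c
enum action { ACT_BYPASS, ACT_DISCARD, ACT_APPLY };

struct rule {
    unsigned sip_net, sip_mask;
    unsigned dip_net, dip_mask;
    int proto;                 /* -1 matches any protocol */
    enum action act;
};

/* Ordered search: the FIRST matching rule decides, so a specific
 * 'Discard UDP' rule placed above a broad 'Apply all' rule wins
 * for UDP traffic - exactly the two-rule example in the text. */
static enum action spd_lookup(const struct rule *r, int n,
                              unsigned sip, unsigned dip, int proto)
{
    for (int i = 0; i < n; i++) {
        if ((sip & r[i].sip_mask) == r[i].sip_net &&
            (dip & r[i].dip_mask) == r[i].dip_net &&
            (r[i].proto == -1 || r[i].proto == proto))
            return r[i].act;
    }
    return ACT_DISCARD;        /* default-deny for unmatched traffic */
}
```

Running the inner packet of every decapsulated IPsec packet through this same inbound list (plus the SA comparison) is what closes the access-control hole described below.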

Now let us look at the possible access policy violation with following example:

Let us take these two policy rules in outbound list:

Rule 1:  SIP:  DIP:  Protocol: UDP   Action: Discard
Rule 2:  SIP:  DIP:  Protocol: All   Action: Apply

Inbound policy rule list would look like this:

Rule 1:  SIP:  DIP:  Protocol: UDP   Action: Discard
Rule 2:  SIP:  DIP:  Protocol: All   Action: Apply

The administrator creates the rules in the above fashion to discard any UDP traffic between the networks, but allow everything else by securing the traffic.

Assume that above policy rules are created in SGW1.

When a TCP packet is sent between the SGW1 LAN and the SGW2 LAN, the second rule gets matched and an SA is created to allow traffic 10.1.10/24 to/from for all protocols.  If SGW2, either through misconfiguration or intentionally, sends UDP traffic in the SGW1-SGW2 tunnel, then SGW1 is expected to drop the packets even if it successfully decrypts and decapsulates them.  This can happen only if SGW1 does the inbound policy check on the inner IP packet.

If SGW1 does not do any inbound policy check, the UDP traffic would be passed to the network, thereby violating the access rules configured by the administrator.

I hope I have given good reasoning on why the inbound policy check is required.  Some fast path implementations don't do this.   So, as a development integration engineer, please ensure that not only your implementation, but also the fast path implementation, does all the checks that are required.


Fragmentation before IPsec Encapsulation - Redside fragmentation and more use cases

I am finding more and more benefits of doing 'red side' fragmentation in the IPsec world.

One use case: with red side fragmentation, the switches/routers between the security gateways of a tunnel never see fragmented packets.  This avoids the cases where some service providers' routers give lower priority to fragmented packets.

Second use case: when the majority of the traffic goes over IPsec tunnels, LAG can't distribute the traffic across ports, since the resulting traffic has the same 5-tuple information.  As described in the post, multiple IPsec tunnels normally get created with forceful NAT-T, so all packets coming out of the IPsec engine are expected to carry 5-tuple information.  If fragmentation is done after encapsulation, the LAG would see some packets without a 5-tuple, which results in uneven distribution.  Hence red side fragmentation is done to ensure that the LAG sees a 5-tuple on every packet.

Third use case:  Avoid mis-ordering of the packets:

Traffic contains a mix of big and small packets.  Big packets may get fragmented after IPsec encapsulation if the resulting size exceeds the MTU of the outgoing interface.  Small packets may not get fragmented even after encapsulation.

The gateway receiving the IPsec packets is expected to process them in order.  Due to fragmented packets, this may not happen.  Say the gateway receives the 1st fragment of the 1st packet, then the 2nd (unfragmented) packet, and then the 2nd and final fragment of the 1st packet, in that order.  It is expected to process the packets in the same order.  But since the 1st packet waits for its 2nd fragment, the 2nd packet gets processed first.  Gateways don't hold up full packets, as they cannot know whether, or when, the remaining fragment of the 1st packet will arrive.

So, this leads to packet mis-order.

This can be avoided if there are no fragments. Solution: red side fragmentation.