Sunday, January 24, 2010

PCIe End Point Developer techniques

Some UTM functions such as Intrusion Prevention and Anti-Malware consume significant CPU time. Equipment vendors are looking for ways to offload some of that processing from the main CPU to NIC cards; that is how intelligent NIC cards were born. Some vendors created their own ASICs or FPGAs to do offload functions at ingress and egress time. Increasingly, vendors are showing interest in using Multicore processors in offload cards. The Multicore approach has several advantages over the ASIC/FPGA approach: offload engines can be written in typical programming languages, time-to-market is faster, and the image can be upgraded in the field and is simpler to maintain. Offload cards attach to the main CPU over PCIe and in essence replace traditional PCIe NIC cards in security equipment devices.

With PCIe cards increasingly becoming processor based, network programmers can now program the offload cards using traditional programming languages such as 'C'. An intelligent NIC card has an Ethernet port attached to it for external connectivity, and it communicates with the host CPU via the PCIe interface. Upon receiving packets from the Ethernet port, the card does some operations and sends the resulting packets to the CPU via PCIe. When packets are given to the card via PCIe, it can likewise do some operations before sending them out on the Ethernet port. Basically, the PCIe card as a whole appears as one NIC interface to the host even though it performs offload functions. In this post I will not cover the typical offload functions that can be done in offload cards; I will try to do that in later posts. Here I would like to concentrate on efficient packet transfer over PCIe.

A brief introduction to PCIe:

Intelligent NIC cards use PCIe End Point functionality to talk to the host CPU, while the host CPU uses PCIe in RC (Root Complex) mode. As you all know, PCIe devices are arranged as a tree. The root of the tree is the RC (host) and the leaves are the PCIe devices (End Points). Between the root and the leaves there can be PCIe switches, which are used to extend the number of end point devices visible to the host. The host uses the services provided by the end points. Examples of end point devices are Ethernet controllers, SCSI devices, video devices, etc.

Every end point and root complex processor has a PCIe controller hardware block to talk to the PCIe bus. End points can only talk to the RC; they cannot talk to each other. If they need to talk to each other, the RC acts as a relay. The RC can talk to many end points at any time, and only one RC is possible in a PCIe tree. In a typical PC system, the x86 is the RC and all other PCIe devices are End Points.

The RC and end points share memory-mapped IO, or memory itself, with the other party. End points indicate the regions of memory or memory-mapped IO to the RC via the configuration-space BAR registers; a type 0 end point has up to six BARs. Each BAR encodes, in some form, both a memory address and a size. The host maps these memory regions into its system address space. On x86 hosts, the BIOS does the PCIe device enumeration and, as part of it, maps the device addresses into the system address space. Hosts typically set aside a certain portion of the system address space to map memory exposed by PCIe devices. Many host processors today have at least 36 address lines, so they can address up to 64 GB; some of that space goes to DDR, and part of the remainder is reserved for mapping PCIe devices. Hosts also normally want to expose their own memory to PCIe devices, but there is no standard defining how a host communicates to an end point the memory regions it wants to share. End points typically assign some part of their DDR memory as command-and-response memory; this memory is part of the space exposed to the RC host via one of the BAR registers. One of the commands the end point exposes could let host drivers share host memory with the end point. The end point then maps the RC-exposed memory into its own reserved address space (the end point processor's PCIe address space).

PCIe controllers support multiple transaction types. The most important ones are Memory Read, Memory Write, Configuration Read and Configuration Write. There are others, such as Messages and Memory Read/Write with Lock, and some more transaction types belonging to PCIe's internal layers (the transaction layer), such as Completions.

When a processor (either host or end point) reads or writes data from/into the PCIe address space, the PCIe controller takes hold of the operation and creates a PCIe read/write transaction toward the other side (the destination) with the other side's memory address. The PCIe transaction contains the destination address, the transaction type and, for a write operation, the data to be written. For a read operation, there is an additional completion message that comes back from the other side with the data. All this happens automatically: as far as the processor is concerned, it is reading and writing some address (in this case a PCIe address) the same way it reads and writes normal memory; internally, a PCIe transaction is created. PCIe controllers normally have something called 'address translation registers'. For each PCIe address-space range, the corresponding memory address on the other side is stored, normally when the other side hands over the memory ranges to be mapped. That is how the PCIe controller knows the memory address to put in read/write transactions.

PCIe defines two transaction models, 'Posted' and 'Non-Posted'. Posted transactions are fired and forgotten: the processor does not care whether the transaction completed. Write transactions are typically posted. Read transactions are by definition non-posted, since the data must come back from the other side's memory. Because a read is a synchronous operation, the processor simply waits for the result, which makes non-posted transactions expensive. As I understand it, a read transaction even on PCIe Gen2 hardware takes around 1 to 2 microseconds. Hence PCIe read transactions for fewer than 16 bytes should be avoided; for larger transfers, where the latency is amortized over the data moved, PCIe read transactions are acceptable.

So, any programming interface exposed by the PCIe card must ensure that the host processor never does PCIe read transactions for data movement (such as packet transfer).

Descriptor rings:

This post gives one technique by which both host and end point avoid PCIe read transactions. A descriptor-ring approach is used to send and receive packets and commands between hosts and end points. Normally, PCI devices design the descriptor rings such that the host never needs a PCIe read transaction either to receive packets or to send them. As PCIe devices are increasingly implemented on processors for flexible offload, PCIe read transactions should be avoided in the end point implementation as well.

A descriptor ring consists of a set of descriptors. Each descriptor contains some control and status bits and pointers to buffers holding the packets or commands/results.
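As a concrete illustration, a descriptor might look like the struct below. This is a hypothetical layout; the field names and widths are mine, not from any particular controller's datasheet.

```c
#include <stdint.h>

/* Hypothetical descriptor layout -- fields and widths are
 * illustrative, not taken from any specific controller. */
struct pkt_desc {
    uint64_t buf_addr;   /* DMA/PCIe address of the packet buffer    */
    uint16_t buf_len;    /* number of valid bytes in the buffer      */
    uint16_t flags;      /* control bits: end-of-packet, IRQ, ...    */
    uint32_t status;     /* completion status written by the device  */
};
```

Real controllers pack such fields into fixed-width words so that one descriptor fits a small number of bus transfers.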

For every descriptor ring there is one producer and one consumer. In the case of an Ethernet-based PCIe device, on the transmit side the host is the producer and the Ethernet controller end point is the consumer; for received packets, the Ethernet controller is the producer and the host is the consumer. That is, at least two descriptor rings are used in an Ethernet controller end point: a transmit descriptor ring and a receive descriptor ring. In addition to these two rings, there may be command rings to send commands to and receive responses from the end point.

Extending the Ethernet example, the Ethernet driver in the host typically initializes both packet-transfer rings (receive and transmit). The driver normally populates the receive ring's descriptors with buffers to receive packets into. When the Ethernet controller receives a packet from the wire, it puts it in a buffer from the receive descriptor ring and raises an interrupt on the host. On the transmit side, the host puts the packet buffers in the descriptors and informs the end point; in the case of processor-based end points, the host likewise raises an interrupt on that processor.

A descriptor ring consists of descriptors arranged in a ring with a Producer Index and a Consumer Index. The Producer Index (PI) points to the next descriptor that can be written, and the Consumer Index (CI) points to the next descriptor the consumer is expected to read. Since there must be a way to distinguish a full ring from an empty one, one descriptor is normally left unused: if writing the descriptor pointed to by PI would make PI equal to CI, that descriptor is not used by the producer until CI moves up.
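The empty/full convention above can be written as two small predicates. This is a sketch; `RING_SIZE` is an illustrative value.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256   /* illustrative; usually a power of two */

/* Empty: PI == CI.  "Full": advancing PI would land on CI, so one
 * descriptor is deliberately left unused to tell the states apart. */
static bool ring_empty(uint32_t pi, uint32_t ci)
{
    return pi == ci;
}

static bool ring_full(uint32_t pi, uint32_t ci)
{
    return ((pi + 1) % RING_SIZE) == ci;
}
```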

To avoid PCIe read transactions, the descriptor ring buffer has to be part of the consumer processor's memory space. PI and CI exist in both address spaces, but only the consumer's copies are exposed to the producer; the producer's local PI and CI need not be exposed to the consumer. Here is the flow:

Initialization time:

  • The consumer's descriptor ring is part of the memory it exposes to the producer.
  • The consumer's descriptor ring PI and CI are also part of the memory it exposes to the producer.
  • Initially, both PI and CI are set to 0.
Producer flow:

The producer has a buffer to write.

if ((PI + 1) % ringSize != CI)
  • Write the buffer pointer into the descriptor pointed to by PI.
  • Write the status and control bits accordingly.
  • Increment the local PI: PI = (PI + 1) % ringSize;
  • Write the new PI into the other side's PI.
  • Generate an interrupt (an MSI interrupt is also a write transaction); of course, this needs to honor any coalescing criteria.

Consumer flow:

if (PI != CI)
  • Read the descriptor indexed by the local CI and process it.
  • Update the local CI: CI = (CI + 1) % ringSize;
  • Write the new CI into the other side's CI, that is, update the producer's CI value.
  • Generate an interrupt (if needed) to indicate to the producer that the descriptor has been consumed.
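The two flows above can be sketched in C. This is a single-address-space model for illustration only: the `remote_pi`/`remote_ci` pointers stand in for the PCIe-mapped copies on the other side, and a plain store through them models the posted PCIe write transaction (interrupt generation and memory barriers are omitted).

```c
#include <stdint.h>

#define RING_SIZE 256

struct desc {
    uint64_t buf;        /* PCIe/DMA address of the packet buffer */
    uint16_t len;
    uint16_t flags;
};

/* Ring state as seen by one side.  remote_pi/remote_ci model the
 * PCIe-mapped index copies living on the other side. */
struct ring {
    struct desc d[RING_SIZE];
    uint32_t pi, ci;               /* local copies               */
    volatile uint32_t *remote_pi;  /* consumer's PI (PCIe-mapped) */
    volatile uint32_t *remote_ci;  /* producer's CI (PCIe-mapped) */
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int ring_produce(struct ring *r, uint64_t buf, uint16_t len)
{
    uint32_t next = (r->pi + 1) % RING_SIZE;
    if (next == r->ci)             /* one slot always left empty */
        return -1;
    r->d[r->pi].buf = buf;
    r->d[r->pi].len = len;
    r->pi = next;
    *r->remote_pi = next;          /* posted write; then raise MSI */
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int ring_consume(struct ring *r, struct desc *out)
{
    if (r->pi == r->ci)
        return -1;
    *out = r->d[r->ci];
    r->ci = (r->ci + 1) % RING_SIZE;
    *r->remote_ci = r->ci;         /* posted write back to producer */
    return 0;
}
```

Note that each side only ever writes into the other side's memory (`*remote_pi`, `*remote_ci`); every read is of local state, which is exactly the property the prose above demands.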

As you can see above, neither the producer nor the consumer ever reads the other side's memory; any read in the flows above is on local memory.

Taking the Ethernet controller as an example, the receive descriptor ring is defined in host memory and the transmit descriptor ring is defined in end point memory.

Note that the packet buffers themselves are always expected to be in host memory, to ensure that no changes are required in the host TCP/IP stack or host applications. Due to this, the end point must issue PCIe read transactions on the packet buffers using the addresses found in the descriptors. To reduce the read latency across multiple packets, it is good to fetch buffer pointers from multiple descriptors and kick off multiple DMA channels (if the device has them). Since packet contents are typically hundreds of bytes and multiple packets can be read at the same time, the read latency is amortized to a small per-packet cost. Read transactions on transmit packets are unavoidable for Ethernet controllers, but descriptor reads must never result in PCIe read transactions.

Another point to note: by the time the interrupt reaches the consumer, the consumer's PI must already be updated. Otherwise there is a race where the interrupt arrives first and nothing is found to read. If the PI were updated after the peer saw the interrupt, some filled descriptors would never be read until another interrupt was generated by a new descriptor update; that race is dangerous because the next descriptor update might happen a long time later. Since a PCIe MSI interrupt is itself a write transaction, there is no race as long as the PI update is issued before the interrupt.

Saturday, January 16, 2010

IPv6 Firewall - ICMP Error Message Handling (Tips for developers)

ICMP handling in an IPv4 firewall is not critical: even if ICMP messages are discarded by the firewall, many IPv4 functions continue to work. In IPv6, ICMP messages that are not handled properly by the firewall can break IPv6 functions or reduce IPv6 performance. For example:

  • IPv6 ICMP Echo Request and Response messages can't be filtered blindly: they are required for the proper functioning of Teredo tunneling (refer to RFC 4380).
  • Some ICMP types such as 'Neighbor Solicitation', 'Neighbor Advertisement', 'Router Solicitation' and 'Router Advertisement' are used for duplicate address detection, stateless IP address assignment and router discovery. These must be allowed by an L2 firewall, or by a router firewall for its local traffic.
ICMP Error messages are very important for smooth IPv6 communication; firewalls should not blindly drop these packets:
  • Unlike IPv4, IPv6 expects fragmentation and reassembly to be done by the end nodes. An intermediate router is expected to generate an ICMP error message of Type 2 (Packet Too Big), carrying the right PMTU value, when the MTU of the outbound interface is less than the packet size. If these messages don't reach the transmitting end node, it has no Path MTU information and keeps sending packets at the original MTU, which the intermediate router drops.
  • Destination Unreachable messages (Type 1) are also very important: they let applications on end nodes fall back to alternative mechanisms. If a firewall drops these messages, applications on end nodes may take much longer to switch to those alternatives.
  • The Time Exceeded error message (Type 3) with Code 0 (hop limit exceeded in transit) is needed for tracing the routers on the path. Code 1 indicates the end node did not receive all fragments within the reassembly timeout; it is mainly for debugging, but very much needed.
  • Parameter Problem error messages (Type 4) with Code 1 (Unrecognized Next Header) and Code 2 (Unrecognized IPv6 Option) tell the transmitting end node which extension headers and options to avoid in further communication.
Unlike some ICMP messages, all ICMP error messages work with unicast addresses. Another important thing to observe is that ICMP error messages can be generated by any router in between, and most of the time they are not authenticated, which means any attacker can send them. Firewalls play a very important role in filtering the bad messages and reducing the flood of ICMP error messages reaching the trusted boundary.

If a firewall lets an attacker's ICMP error messages through, the receiving host may act on them, resulting in poor performance or sometimes a complete halt of communication. For example, by advertising a very small MTU in Packet Too Big messages, an attacker can force the end host into excessive fragmentation, reducing the performance of both the end system and the network. By sending bogus Destination Unreachable or Parameter Problem messages, an attacker can make end hosts choose alternative paths that may be expensive, back off for some time, or even fail the currently active connection.

So what should firewalls do to mitigate attacks from malicious ICMP error messages while still allowing genuine ones? There is no solution that works all the time, but by following some guidelines the efficacy of an attacker can be reduced significantly. Some of the techniques are similar to those in IPv4 firewalls:
  • Check the destination IP of the outer IP packet of the ICMP error message; it should be the same as the source IP of the original (inner) IP packet.
  • Make the firewall stateful. Stateful firewalls typically create a session for each UDP/TCP and other connection. An ICMP error message generated by an intermediate router or the destination node is expected to contain the IP header of the original packet plus 8 bytes of the transport header. With this information, the firewall can match the message to the original session; if a session exists, there is good confidence that the message is not spoofed, since an attacker would find it difficult to guess all 5 tuples of the original connection, especially the source port, which changes for each connection.
  • For TCP connections, an additional check can be made. The 8 bytes of transport header embedded in the ICMP error message include the sequence number field; if the message was generated by a genuine router, it was based on a real TCP packet of the connection. The firewall can compare this sequence number with the TCP state variables it maintains in the session. If the sequence number in the partial TCP header falls within some range of the sequence numbers the session is tracking, the firewall can believe the message is genuine, as it would be difficult for an attacker to guess the sequence number.
  • An ICMP error message that arrives while the session's flow is idle can also be considered suspicious and dropped. The firewall can maintain a timestamp in each flow, updated whenever a packet is sent out. If the ICMP error message is received within X seconds of the stored timestamp, it can be considered genuine; if it arrives later than that, it can be dropped.
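The TCP sequence-number check above can be sketched as follows. The variable names (`snd_una`, `snd_nxt`) and the acceptance window are illustrative assumptions; a real firewall would use whatever TCP state its session tracks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the sequence-number sanity check on an embedded TCP
 * header.  Assumes the session tracks the unacknowledged and next
 * sequence numbers; names and window choice are illustrative. */
static bool icmp_seq_plausible(uint32_t seq_in_icmp,
                               uint32_t snd_una, uint32_t snd_nxt)
{
    /* Serial-number arithmetic handles 32-bit wraparound:
     * accept seq in the half-open range [snd_una, snd_nxt). */
    return (int32_t)(seq_in_icmp - snd_una) >= 0 &&
           (int32_t)(snd_nxt - seq_in_icmp) > 0;
}
```

The casts to `int32_t` make the comparison wrap-safe, the same trick TCP itself uses for sequence comparisons.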
Though a stateful firewall can detect many spoofed ICMP error messages, the administrator is still expected to have control over which ICMP error message types (and codes) are allowed into trusted networks. This control is unlikely to be needed on a per-IP-address or per-network basis, but it is expected per pair of zones.

The data model can be the following:

    • allowAll: String(8), RW. Takes the value "yes" or "no". Indicates whether to allow all ICMP error messages when there is a session match; the following table lists the ICMP error messages allowed when this value is "no".
    • minPMTUValueToAccept: RW, Unsigned Int. The minimum PMTU value; the firewall drops a Type 2 ICMP error message if its PMTU value is less than this configured value.
    • {i}.ipv6firewall.icmpErrorMessageFilter.allowList.{i}: Table whose rows can be added, modified and deleted by administrators.
      • ruleID: RW, Unsigned Int. Cannot be modified once created; identifies the rule.
      • description: RW, String(32), optional. Description of the record.
      • fromZone: String(32), RW. One of the Zone IDs; takes the value of ZoneName from the internetGatewayDevice.securityDomains.VirtualInstance.{i}.Zone.{i} table.
      • toZone: String(32), RW. One of the Zone IDs; takes the value of ZoneName from the internetGatewayDevice.securityDomains.VirtualInstance.{i}.Zone.{i} table.
      • icmpMessageType: RW, Unsigned Int. ICMP error message type.
      • icmpMessageCodeAll: RW, String(8), "yes" or "no". "yes" indicates that all codes are allowed within the ICMP message type given in 'icmpMessageType'.
      • icmpMessageCode: RW, Unsigned Int. ICMP error message code; a code is unique within an icmpMessageType. Valid only if 'icmpMessageCodeAll' is "no".
It is strongly recommended to create a hash table based on ICMP message type for faster lookup.
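Since the ICMPv6 type field is only 8 bits wide, a direct-indexed array is the degenerate (and fastest) form of that type-keyed lookup. The sketch below is mine, not part of the data model above, and it models only per-type/per-code allow flags (the "allow all codes" case would simply set the full mask).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative per-type filter: one allow flag per ICMPv6 type and
 * one bit per code (codes 0..31 cover all codes discussed here). */
struct icmp_err_filter {
    bool     allow_type[256];
    uint32_t code_mask[256];
};

static void filter_allow(struct icmp_err_filter *f,
                         uint8_t type, uint8_t code)
{
    f->allow_type[type] = true;
    if (code < 32)
        f->code_mask[type] |= 1u << code;
}

static bool filter_check(const struct icmp_err_filter *f,
                         uint8_t type, uint8_t code)
{
    return f->allow_type[type] &&
           code < 32 && (f->code_mask[type] & (1u << code));
}
```

Lookup is O(1) with no hashing at all, which is why a type-indexed table is usually preferable to a general hash table here.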

"Recommendation for filtering ICMPv6 messages in Firewall - RFC 4890" recommends that the firewall allow some ICMP Error message types always.  To conform to this standard, it is suggested that the firewall creates rules in above table as factory defaults.  Those are:
  • Type 1 (Destination Unreachable), all codes
  • Type 2 (Packet Too Big)
  • Type 3 (Time Exceeded): Code 0 (hop limit exceeded in transit) and Code 1 (fragment reassembly time exceeded)
  • Type 4 (Parameter Problem): Code 0 (Erroneous Header Field), Code 1 (Unrecognized Next Header), Code 2 (Unrecognized IPv6 Option)
This article talked about the handling of ICMP error messages by firewalls. In my next article, I will briefly describe normal ICMP message handling by firewalls.

Tuesday, January 12, 2010

Hash Tables versus Balanced Binary trees - Hybrid model

Hash tables are most commonly used in network infrastructure components.  Firewall/NAT sessions, IPsec SAs and routing cache entries are some of the examples where hash tables are used to store them.

In many packet-processing applications there are three operations: session creation, session deletion and session search. Session creation and deletion performance affects the session rate, whereas session search impacts throughput: the search operation typically happens per packet, while creation/deletion happens once per connection. Both metrics, connection rate and throughput, are important, and any data structure selected should serve both, as well as perform well in a Multicore environment. The fields used to search for a session depend on the application: the packet 5-tuple for firewall/NAT/TCP/UDP; SIP, DIP and DSCP for a forwarding application; DIP and SPI for IPsec.

Many implementations use hash tables to arrange the sessions. Hash algorithms vary from simple rotate-and-XOR operations to Bob Jenkins' hash and CRC-32/64. A few bits of the algorithm's output are used to index into the hash table. Collision entries are normally arranged as a doubly linked list; the search operation walks the collision entries and compares all the fields that comprise the key for an exact match.
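A minimal sketch of such a rotate-and-XOR 5-tuple hash is below. It is illustrative only; as discussed next, production code should prefer a keyed hash (e.g. Jenkins' hash with a random seed) precisely because simple schemes like this invite collision attacks.

```c
#include <stdint.h>

struct tuple5 {
    uint32_t sip, dip;      /* source/destination IP   */
    uint16_t sport, dport;  /* source/destination port */
    uint8_t  proto;
};

#define HASH_BITS 18        /* 256K buckets for ~1M sessions */

static uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

/* Rotate-and-XOR hash over the 5-tuple; low HASH_BITS bits
 * select the bucket. */
static uint32_t hash_tuple5(const struct tuple5 *t)
{
    uint32_t h = t->sip;
    h = rotl32(h, 7) ^ t->dip;
    h = rotl32(h, 7) ^ (((uint32_t)t->sport << 16) | t->dport);
    h = rotl32(h, 7) ^ t->proto;
    return h & ((1u << HASH_BITS) - 1);
}
```

Note the `HASH_BITS` choice mirrors the 1/4-of-max-sessions bucket sizing rule discussed below: 2^18 = 256K buckets for 1M sessions.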

The problems with hash table implementations have been known for some time. Some of the issues are:
  • The number of collision entries impacts search performance. If the hash algorithm is bad, or the sessions created at some instant happen to be similar, one hash bucket can accumulate many entries, resulting in poor performance.
  • Hash table attacks are not uncommon. There are cases where attackers have broken the hash algorithm and sent packets crafted so that the device under attack creates a large number of collision elements in one bucket. This leads to a CPU-exhaustion attack whenever the attacker sends packets whose hash falls into a bucket with a large number of entries.
  • Some hash implementations try to address this by resizing the hash table dynamically; even that can lead to CPU exhaustion.
Hash tables also have advantages:
  • Add/delete operations are as simple as adding or removing a node from a linked list, so session-rate performance is typically O(1).
  • Search tends to be good in the normal case. The number of buckets is typically chosen as 1/4 of the maximum number of sessions; if a device is to support 1M sessions, the hash table size is typically 256K buckets. In the ideal case there would be no more than 4 collision entries per bucket; in practice it varies from 0 to 16 or so for good hash algorithms.
  • Since hash tables use linked lists, RCU mechanisms can be used. With RCU, read/search operations take no locks, so there is no lock overhead on Multicore processors.
Balanced binary trees (RB trees or AVL trees) are also increasingly used in place of hash tables. They provide consistent performance in both best and worst case scenarios. Every time a session is added or deleted, the tree is rebalanced. Search completes in O(log N), where N is the number of sessions in the tree; with 1M sessions, only around 20 nodes are visited. Addition and deletion also happen in O(log N). There is some performance cost for add and delete due to rebalancing, but no algorithmic attack is possible.

Balanced binary trees have one more disadvantage: they are not Multicore friendly. RCU can't be used for the search operation, so read spinlocks are needed, which costs cycles on every search. My observation is that a read lock takes around 200 CPU cycles even when there is no contention with writers. For packet-processing applications such as IP forwarding, firewall/NAT and IPsec VPN, where the rest of the packet processing might complete within 1000 cycles or so, 200 cycles adds 20% overhead and may not be acceptable.

A hybrid of hash tables and balanced binary trees may be the right approach. The idea is simple:
  • Use hash tables to store the sessions.
  • By default, each hash bucket is a linked list.
  • If the number of entries in a bucket's list grows beyond X, rearrange them into an RB tree for that bucket. When the number of entries drops below Y and stays there for some time, the bucket can revert to a linked list.
This method gets the best of both worlds. Under normal workloads, performance is as good as a hash table; under attack or in fringe cases, the degradation is bounded by the RB tree.
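The per-bucket decision logic can be sketched as below. The threshold values, field names, and return convention are all placeholders; the actual list and RB-tree code is omitted.

```c
#include <stdint.h>
#include <stddef.h>

#define LIST_TO_TREE_THRESHOLD 8   /* X: convert to a tree above this */
#define TREE_TO_LIST_THRESHOLD 4   /* Y: convert back below this      */

/* Hypothetical per-bucket header for the hybrid scheme. */
struct bucket {
    uint32_t count;     /* entries currently in the bucket        */
    uint8_t  is_tree;   /* 0: linked list, 1: RB tree             */
    void    *root;      /* list head or RB-tree root, per is_tree */
};

/* Decide whether the bucket should change representation:
 * +1 = rebuild as tree, -1 = rebuild as list, 0 = leave alone.
 * A real implementation would also wait out the grace period
 * before converting back, to avoid flapping between forms. */
static int bucket_should_convert(const struct bucket *b)
{
    if (!b->is_tree && b->count > LIST_TO_TREE_THRESHOLD)
        return 1;
    if (b->is_tree && b->count < TREE_TO_LIST_THRESHOLD)
        return -1;
    return 0;
}
```

Keeping X comfortably above Y (hysteresis) is what prevents a bucket sitting near the threshold from being rebuilt on every insert and delete.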


Monday, January 4, 2010

Data Redundancy Elimination across files

HTTP, CIFS and other application-level proxies typically apply DRE within the scope of the file being requested. That is, if the content changes under the same file name, these proxies handle the delta efficiently. The client proxy sends, as part of the request, signature information for the file it already has; the server proxy identifies the delta from those signatures and the new file it gets from the origin server. The delta information is then sent to the client proxy, which in turn reconstructs the new content from the old file and the delta. That is, the scope of the delta information is limited to a given file.

It is my observation that when enterprise users modify documents or presentation files, they tend to make a new copy and give it a new file name; that is, they version their changes by keeping different files. One might argue that this is a bad way of keeping versions, but it happens more frequently than one might imagine. In those cases, file-based delta information does not work. One may argue that these instances are a minority, but they are not insignificant.

A good WAN optimization device should be able to do DRE on blocks across files. I am not sure whether WAN optimization vendors are doing this already, but as an end user, you should look for this feature.

Sunday, January 3, 2010

Network infrastructure application validation Engineers - Things to ask yourself

One of the challenges many intelligent network infrastructure device vendors face is ensuring that their product withstands attacks on the device's protocol-processing logic, and that the system remains robust even when the device is used beyond its marketed capabilities. The validation (Quality Assurance) phase is the phase of software engineering that must ensure device robustness, and QA engineers play an important role in product development. Test engineers, in my view, contribute to the success of a product more than development engineers do. Test engineers need to be agile and adaptive, and to understand the product's features and deployment scenarios; only then can they contribute fully to validating the product. It is very easy for test engineers to lose these qualities and become dumb black-box testers, running predefined functional tests again and again without adding value by creating new test cases for attacks, scale and robustness.

Don't get me wrong: running the predefined tests on every release is good for ensuring there is no regression. But test engineers need to go beyond that, be creative, think like attackers, and think of multiple kinds of deployments. Creating test cases is a continuous process.

I have written one blog entry on "advice to network equipment testers".  I hope that was useful.

What do test engineers need to know from their counterparts in development and product management? What questions should you ask yourself as a test engineer?

  1. Do I know the purpose of the feature I am testing?
  2. What are the deployment scenarios where this feature is applicable?
  3. What configuration parameters does this feature expose to the end user?
  4. How do I map these configuration parameters to the deployment scenarios?
  5. Is this feature associated with any protocol processing?
  6. If there is any protocol processing, do I know the protocol?
    1. Go through the standards document if it is a standard protocol.
    2. Go through your internal protocol documentation if it is a proprietary protocol.
    3. Go through the messages, fields and possible values of the protocols.
    4. Find out if there are any tools in the public domain related to this feature.
    5. Think like an attacker and note down the messages and fields for possible negative test cases.
  7. What are the hardcoded parameters for this feature?
    1. How many configuration records can it take?
    2. How many runtime blocks can it create?
    3. Think of deployments where the traffic can't be predicted; these help in creating test cases for the robustness of the feature.
  8. If there is standards-based protocol processing, do I need to ensure it interoperates with other products in the market?
    1. What are the different ways this feature is used? Interoperability testing should cover them.
  9. What is the expected performance of this feature in low-load and high-load conditions?
  10. What other traffic typically runs when this feature is enabled?
    1. Find out all kinds of traffic possible in the deployment scenarios and ensure this feature works alongside other features.
  11. Can the test cases be automated?
    1. If so, automate them, so that next time you don't have to spend time on manual testing.
These questions are nothing new. But if you don't know the answers to them, you are not doing a good job for yourself or your organization.

Saturday, January 2, 2010

epoll - Small tutorial

For a long time, programmers have used the 'select' and 'poll' calls to wait for events from multiple sockets, file descriptors, etc. Though they served their purpose well, they are no longer sufficient from a scalability and performance perspective. The 'select' mechanism expects all fds to be represented in three different bit masks (one each for read events, write events and exceptions). The 'select' call waits until some event happens; once it returns, the user program must go through all descriptors (basically, walk the bit masks) to identify which descriptors actually have events ready. Normally only a few descriptors are ready, so going through them bit by bit really wastes CPU cycles. Also, user programs typically need to store some data per fd and have it passed back when there is an event on that fd; 'select' does not provide that facility.

The 'poll' mechanism is certainly a big improvement over 'select'. It does not use bit masks, so there is no limit on the number of descriptors to wait on, and along with each fd, user programs can provide a user-data argument. But it has the same performance problem as 'select': every time 'poll' returns, the program must figure out which descriptors have events ready by scanning all the descriptors passed to the call. The scalability problem is taken care of, but the performance problem remains.

The 'epoll' mechanism solves both the scalability and performance problems. It has calls to add and remove descriptors by descriptor value, not by bit position. It also lets user programs associate their own data with a descriptor; this data is returned along with the descriptor when 'epoll_wait' reports it ready. More importantly, epoll_wait returns only the array of descriptors that have events ready. Because of this, user programs need not go through the complete 'fd' list to figure out the ready descriptors.

Usage is also simple: create an epoll instance using epoll_create; add, delete and modify descriptors and the events they are interested in using epoll_ctl; and call epoll_wait to wait for events to occur. epoll_wait outputs the descriptors that are ready with some events. The user program can walk through these ready descriptors and do whatever it needs to do.

It provides following API functions.
  • int epoll_create(int size):   This function is expected to be called by the user thread only once.  The 'size' parameter was a hint for older Linux kernels to allocate memory internally.  Since Linux 2.6.8 it is ignored, but it must still be greater than zero.
  • int epoll_ctl(int epfd,  int op, int fd, struct epoll_event *event) :   This function can be used to add a new descriptor along with its interested events,  delete an existing descriptor from the epoll instance, or modify the interested-events information for an existing descriptor.  User programs can do this dynamically. 
  • int epoll_wait(int epfd,  struct epoll_event *events,  int maxevents,  int timeout) :  This function is used by user threads to wait for events on the descriptors which were added to the epoll instance identified by epfd.   This call waits until there is some ready event on any descriptor or until the timeout expires.  'events' holds all ready descriptors when this call returns. 
  • Use typical 'close' call on epfd to close the epoll instance. This is typically done when the thread is exiting.

Friday, January 1, 2010

nginx - Hashing scheme

The ngx_http_upstream module uses the ngx_hash module to look up the function pointers specific to HTTP headers and take the appropriate action.   Initially, when I went through ngx_hash.c, I was confused by the differences between this and the way I have used hash lists before.  I have used hash lists for dynamic insertion and deletion in addition to search.  To support all these operations, I am used to developing hash lists as a set of buckets, with each bucket pointing to one of several kinds of linked lists or binary trees. Any operation starts by computing the hash value (a typical hash algorithm in the networking world is the Bob Jenkins algorithm) and getting hold of the head of the list/tree, then performing the operation: for addition, a node is created and added to the list/tree; for deletion, the node is removed from the bucket's list/tree; for search, a pointer to the matching node is returned.

The nginx hash, though the same in principle as typical hash lists, has significant differences.  It appears that 'ngx hash' is not meant for applications that add and remove elements dynamically.  It is specifically designed to hold a set of 'init time' elements arranged in a hash list.  The main purpose seems to be speeding up the search of elements added once at startup. The 'ngx_hash' utility module mainly provides two functions:  ngx_hash_init() and ngx_hash_find().  I did not find any hash destroy function. It may be that the hash never needs to be destroyed at run time; as part of process exit, Linux reclaims all of the process's memory anyway.

Digging a little deeper into the hash list implementation:  it allocates memory for a hash table of type 'ngx_hash_t', which holds the 'buckets' and the size of the hash table.   'buckets' is a double pointer.  It first allocates memory for the bucket array, that is, memory to hold pointers (the first dimension of a two-dimensional array).  The size of this array equals the hash size. The second dimension needs to point to all collision elements. As I said before,  all elements that go into the hash list are known while creating the hash list itself.  No dynamic addition or deletion is possible here.  So it allocates memory to hold all the elements across hash lists at one time, then stores the right offsets into the appropriate places in the bucket array. Each element is of type 'ngx_hash_elt_t', which holds the 'name' (key), the length of the key and a pointer to application-specific value information.  ngx_hash_init() is the relatively complicated function: it determines the hash list size needed for the bucket array and then figures out the amount of memory required to hold the hash list elements. ngx_hash_find() is relatively simple: it finds the right bucket from the hash value and searches the collision elements, which are laid out as an array.

The ngx_hash module also has some logic to search for wildcard-based strings, where the wildcard is present only at the head or tail of the string.  I will do some writeup on that at a later time.

IPS - Need for C based rules & Distribution

Rule-based attack detection (signature-based detection) is one of the popular methods used by network IPS devices.   Rules are typically written in a proprietary format, usually looking for patterns in different parts of the packet or in processed protocol data.  When a new vulnerability disclosure is made,  IPS vendors typically create a new rule and update their central distribution servers.  IPS devices are typically configured to download rules from the central servers on a periodic basis. New rules help IPS devices detect new attacks.  Using their interpreters, IPS devices convert the rules into a native format and start analyzing the traffic with the converted rules.  Note that these rules are not executables.  That is, no image change is made when the rules are downloaded.

Many new attacks are complex and sophisticated, and typical rule-based detection is not possible in these cases.   In the last few years,  the number of attacks that can't be detected using signatures has been increasing.  It appears that protection signatures could not be developed for more than 10% of the new attacks disclosed last year. That is a big number.  Some IPS devices detect these attacks by changing the firmware image.  But a firmware image can't be pushed without testing it thoroughly, which may take a few weeks to months.  Until then, customers have no protection.  That is certainly not acceptable.

IPS vendors need to adopt mechanisms similar to those of AV vendors. That is,  protections developed in the C language would need to be deployed to IPS devices through the signature distribution mechanism.  Since these protections are written in C,  any problem with their logic can bring down the whole system.  So it is important that the IPS device be able to run C-based protections in a manner where, when the system comes back up, it recognizes bad C-based protections and disables them automatically.  The bring-up time of the IPS in these scenarios should be as small as possible.  This is one of the reasons I believe IPS should run in Linux user space rather than in kernel space or in some real-time operating system:  a process can be brought back up easily and quickly, and it is easier to recognize bad protections and disable them on restart when the IPS is implemented as a user process.

It is also important that the current state not be affected when C-based protection rules are downloaded.  Many IPS vendors implement these C-based protections as Linux shared objects (.so). When the rules are downloaded,  a signal can be sent to the IPS process to load them.

In summary,  it is a given that some attacks can only be detected by C-based logic. As an IT person, ensure that your IPS can download C-based rules using the normal signature distribution mechanisms.