Wednesday, December 30, 2009

Java Script Analysis - IPS Device Requirement?

Number of unique attacks on client applications are increasing and in recent past the number of attacks on client applications exceeded the attacks on server applications.  Javascript based attacks is one big chunk of attacks in client side category(it is also called target initiated attacks).  Almost all exploitations of client applications are possible only when the user is attracted to visit some malicious site hosting malicious content  or when user opens bad attachment in received email.    Typically,  whenever user downloads content or opens an attachment in email, appropriate application will be opened to render the data.  Attackers construct the data such a way that it exploits vulnerability in the application.  Vulnerability exploitation can lead to corrupting the system,  getting control of the system and to install troajns, botnets etc.. 

ActiveX application exploitation is one popular method.  ActiveX applications are controlled via javascripts in the pages of the site user is visiting.  Some of these applications don't sanitize the input to their functions.  Attackers take advantage of this and serve the pages with javascript which sends the bad input the functions. I would categorize the type of attacks into following:
  •  Buffer overflow attacks:  Sending long argument to the Activex functions via java script. This can result into taking control of machine by attacker.
  •  Remote Code Execution :  Executing commands via command line tools. Some activex programs provide functions which can execute programs given to them via arguments.
  • Overwriting some files:  Using some vulnerable activex programs, attackers can overwrite any file content.  Typically it is used by attackers to overwrite some system programs with their own executable.
  • Setting and Getting Registry values.
  • Information disclosure :  Reading any arbitrary file contents via activex functions. 
There are other javascript attacks that are possible on the browser itself.  As you all know, browser is complicated applications. It has several modules/functions built into it. Some of these modules can be controlled/accessible via javascript.   Any vulnerability in these modules can be exploited via javascript.   Some of the attacks of this nature are:
  •  DOM Manipulation of HTML page by embedded java script:  Some DOM Manipulations result into browser crash while they are rendered or during cleanup or garbage collection process.
    • Removal of child elements.
    • Modification of child elements
    • Assigning bad values to the elements.
    • Creating circular references across parent and children.
IPS Challenges:

Simple Pattern based detection does not work well to detect these kinds of attacks.
  • Java script is a language by itself.  Patterns can be hidden very easily either through encoders or by different ways of programming logic to generate string (or argument to the ActiveX function). 
  • Simple pattern based protection systems stop (alert) the traffic whenever CLSID and vulnerable method are present.  That is, they don't check for the exploiting argument to the method. This could result into false positives. 
  • Simple pattern based protection systems assume that CLSID and vulnerable methods are present within same packet or together separated by few bytes such as 2K/4K etc..  This can be used by attackers to put these two strings far apart by introducing lot of dummy java script logic or DOM elements. This is one example and there are many ways to evade the detection by IPS devices. Another example is different scripts in the same HTML page exploiting the vulnerability. That is, one java script in beginning of HTML page storing some data in DOM tree and second Java script using this data to pass the bad argument to the vulnerable method.
What IPS Devices need to do?

One way I can think of is adoption of well known sandbox model to detect bad javascripts with less false positives and negatives. Sandbox mechanism typically involves executing javascript in a safe environment.  Internal flow would be some thing like this:
  • Using transparent proxies or WCCP or other mechanisms, collect the HTML data.
  • Get hold of all Java scripts in the page.  If the page is referring to JS file via include primitive, download those java script files too. If the script is referred in the HTTP Referer method, get that too.
  • Concatenate all java scripts into one single entity (file/buffer) and pass it onto sandbox. It is also required to concatenate javascript in IPS rules (which catches the exploit) before passing onto the sandbox.
  • Ensure the sandbox is created in virtual environment (Linux containers is fastest)
  • Before running javascript, create the browser environment.
  • If the javascript that was concatenated from IPS rule is executed and catches the condition that was put in the javascript,  then it can be considered as hit and IPS can take action based on the severity of the rule. Actions include alerting the administrator or stopping the HTML page going to the Client.
Yes. This could be CPU intensive, but it is required beast in my view. Offloading this kind of scanning from the end client would benefit battery powered devices such as Smart phones,  Netbooks etc.. Even in desktop/laptop world, it provides benefit of detecting these at one central place and provided better control for administartors.  With advent of Multicore processors, this kind of processing can be done at central place.

What are security considerations the IPS devices need to worry about?

Malware developers are smart and they continue to evade the detection by intermediate security devices and even try to attack these protection systems.
  • Always use Sandbox mechanism:  If javascript that is being analysed is doing some mischief on system resources, only sandbox system would be affected, but not the rest of protection system.
  • Time the amount of time the analysis is taking and if it exceeds some predefined time kill the analyzing process.   Attackers can develop javascript that goes in continuous loop and can take awayall CPU cycles of protection system.
  • Override some Javascript functions to get better control and get out in case of any issue found.
Detection and protection of internal systems is CPU intensive, but necessary evil in the IPS systems. Many IPS devices in the market don't do good job of detecting these attacks. I believe that in coming years, many IPS devices would have intelligent way of detecting javascript attacks.  Sandbox mechanism is one method I believe would be popular.

nginx - Arrays

Ngnix web server provides 'array' utility.  This utility provides set of API functions to create array and push elements.  There is no API function to remove elements in the array. Since it uses pool functions internally to allocate memory, I am guessing that these elements will also be destroyed as part of the pool cleanup. 

Following structure represents the array:

struct ngx_array_s {
    void        *elts;
    ngx_uint_t   nelts;
    size_t       size;
    ngx_uint_t   nalloc;
    ngx_pool_t  *pool;

'elts':  Read this as 'elements'. It contains user data elements. Each  user data element size is stored in 'size'. It expects that all user data elements are of same size.
'netls':  Read this as 'number of elements'. It represents the valid number of user data elements pointed by 'elts'.
'size':   Size of user data element.
'nalloc':  Read this as 'number of allocated elements'.  Represents the number of elements that can be stored in 'elts' in total. 
'pool':  Pool that is being used to allocate both ngx_array_t as well as memory to store elements.

API functions :
  • ngx_array_t *ngx_array_create(ngx_pool_t *p, ngx_uint_t n, size_t size) :  This function is used to create the array and allocate memory required to store 'n' elements of size 'size' from the pool 'p'. It returns pointer to ngx_array_t.  User application is expected to store this pointer in its control structures and use this in further functions. 
  • ngx_int ngx_array_init(ngx_array_t *array, ngx_pool_t *pool, ngx_uint_t n, size_t size) : This function is used when application allocated ngx_array_t by itself (by declaring a member of this type in its control structures). This function does every thing as ngx_array_create other than the allocating ngx_array_t.  It returns NGX_OK if every thing is okay.
  • void *ngx_array_push(ngx_array_t *a):  This function gives the pointer to the free 'element'.  Application is expected to typecast its structure to the returned pointer and use it. If there is no space in the 'elts', then it allocates contiguous memory from the pool, copies old data and returns pointer to the free element.  Applications should not be storing the pointer to the 'elts' it its control structures as 'elts' can change with new elements. It is suggested to get 'elts' every time array elements are required for its processing.
  • void *ngx_array_push_n(ngx_array_t *a, ngx_uint_t n):  This function is same as ngx_array_push, except that it maks room for 'n' elements in contiguous memory. It returns pointer to the first free element.

WAN Optimization Case Studies (Public information)

Updated on 01/08/2010

1. Sony Erricsson case study of WAN Optimization:
  • Vendor/product:  Cisco WAAS on Windows Server 2008
  • Savings:  24000 per branch office.
  • Use cases described :  
    • Speed of project life cycle application management application increased by 20 times.
    • Video file transfer performance improved by 400%
    • Time taken to transfer  data worth 12GBytes reduced to 1.5 hours from 12 hours.
2. Studley Inc., a real estate services firm based in New York City case study of WAN optimization:,289142,sid200_gci1378242,00.html
  • Vendor : Riverbed WAN Accelerator
  • Use cases described:
    • Share Point application acceleration:  Loading performance improved from 1 to 2 minutes to 4 to 5 seconds.

Pool based allocations - nginx

Nginx has one core module called "Pool based allocation".   This "pool based allocation" method is in addition to direct heap functions via wrappers such as ngx_alloc(), ngx_calloc and ngx_free. API functions defined by this module has "ngx_p" in the beginning of each function.

Pool based allocation routines are implemented in ngx_palloc.c and ngx_palloc.h.

It appears that these API functions are meant to reduce memory fragmentation that is otherwise a possibility with direct heap based allocations.  Any module or function requiring temporary buffers, then these  'pool based allocation' API functions can be used. Since there is no protection (mutex) primitives in the functions defined in ngx_palloc.c and there is no freeing of buffers until the pool is destroyed, I am guessing that these API functions are mainly would be used by modules for local allocations within one connection processing.  If there is any data required across connections which are to be handled by multiple threads, then these API functions are not suitable to allocate buffer to store that data.
Pool module also provides facility to register callback functions when the pool is being destroyed. Pool module calls the registered functions for application to do any cleanup.

Pool utility module avoids going to heap for every memory allocation required by application modules. It allocates bigger size memory chunks from heap and provides required smaller memory blocks from the big chunks to the application modules through its API. 

Usage is simple:
  • Create pool using "ngx_pool_t *ngx_create_pool(size_t size, ngx_log_t *log)".   'size' parameter given to this API function  really a good hint for pool module to allocate memory at the pool creation time. Allocations from this pool using ngx_palloc() function is given from this memory.   As you can see from the code,  if the initially allocated memory chunk is not sufficient to satisfy the application allocations,  pool module has facility to allocate more memory chunks on demand basis.  Pool utility module restricts the amount of memory it can provide to the applications from the preallocated memory chunks. Typically, it is restricted to PAGE SIZE.    But it provided one escape route.  If  ngx_palloc is called with size more than the PAGE SIZE of processor architecture, then it bypasses the pool chunks and goes to the heap directly. These allocations are maintained in separate linked list within pool  in 'large' list.   When ngx_pfree is called, if the block is in the 'large' list it frees the memory to heap. If it does not belong to 'large' list, then it keeps quiet.  Since large list is maintained for each large block allocation and is used by ngx_pfree, I feel there could be impact on performance if there are more large block allocations by the application.  Application module better off calling heap wrapper functions directly.  
    • This function also allocates a meta information of type ngx_pool_t and returns pointer to this as handle for other API functions.
  • Pool can be destroyed using " void ngx_destroy_pool(ngx_pool_t *pool)".  It will free all memory chunks it allocated from heap and then frees the pool meta information block. This function, before destroying the buffers, it calls application registered cleanup functions.
  • Pool can be reset to its initial state using "void ngx_reset_pool(ngx_pool_t *pool)".  If you want to reuse the pool, this function can be used.  It is similar to destroying the pool and creating a new pool. But using this function is faster than recreating the pool.  It preserves the memory chunks the pool module gets from the heap.  Note that  it frees the memory that is maintained in 'large' list.
  • Memory from the pool can be allocated using "void *ngx_palloc(ngx_pool_t *pool, size_t size)" and "void * ngx_pnalloc(ngx_pool_t *pool, size_t size)" :   Function takes pool handle and 'size' of memory required.  These functions checks whether the 'size' requested is less than the PAGE SIZE. If so, it allocates memory from preallocated chunks. If not, it goes to heap and gets the memory and then puts it in 'large' list.  ngx_palloc function is same as ngx_pnalloc, except that the memory it gives is aligned to "NGX_ALIGNMENT" which is defined in nginx with value "sizeof(unsigned long)". Some internal details:  This API function if memory is not found in the preallocated chunks, then it allocates one more big chunk from the heap and provides requested memory from this big chunk.  These chunks are arranged in linked list fashion in the pool handle.  This linked list is used in "Destroy Pool" function to free these  chunks to the heap.
  • Buffers allocated using this module can be freed using "ngx_int_t ngx_pfree(ngx_pool_t *pool, void *p)".   This function does not do any thing  if the buffer is allocated from the pool chunks. If memory is allocated from heap i.e if memory pointer is in 'large' list,  then it frees the memory to the heap.
  • If buffer allocated needs to be initialized to zero, then "void *ngx_pcalloc(ngx_pool_t *pool, size_t size)" can be called. This function is same as ngx_palloc except that it does one additional operation of initializing the content to all zeros.
  • Yet times application may require its functions to be called when pool is being destroyed for some buffer allocations.  Pool module provides this function " ngx_pool_cleanup_t * ngx_pool_cleanup_add(ngx_pool_t *p, size_t size)" for applications to get buffer and register cleanup functions for that buffer. This function internally calls ngx_palloc().   In addition, this function allocates some memory to store the cleanup handler and attaches the cleanup  handler to the POOL cleanup list.

Huge-tlb-fs to improve Linux user space program performance

As I indicated  before, many networking vendors are moving their heavy packet processing applications to user space from kernel space for maintainability of the code, richness of libraries available in user space and GPL concerns of running applications in Kernel space. Typical concern one has in moving their applications to user space is performance.  This write up is one of the techniques that can be followed in reducing performance degradation in moving programs from Kernel space to user space.

Linux kernel space text, data segments are generally fixed and contiguous. Hence TLBs required are less and TLBs are assigned at initialization time.  Unlike kernel programs,  user space applications are divided across multiple processes. Each process runs in virtual memory.  TLBs are used to translate virtual address to physical address of memory.

Number of TLBs in processors are limited.  Hence TLBs are used as cache by Linux.  When there is no TLB entry corresponding to virtual address that is being accessed,  processor frees up one of the TLB entries, searches through the page tables for the mapping between virtual address space and physical address space  and adds this mapping to the freed TLB entry.   Programs with sparse memory address access will have more TLB misses and due to this performance can suffer.  In addition, if page tables are accessed more often,  then the number of times L1 and L2 cache filled up with page tables is also becomes high, there by programs usage of cache can go down, decreasing the performance of applications. This kind of problem is not there in Kernel space programs because of one time mapping of TLB entries for kernel text and data segments. Reduction of TLB misses is the key to improve performance. "hugetlbfs" facility allows TLB entries to map to bigger chunks of memory i.e huge pages.

This facility can be used to map text segment and also to do application memory management.   Linux documentation on this topic and some examples can be found here at:

Above documentation and usage is mainly limited to applications allocating huge page from the kernel space. In addition to applications allocating big  page, it is also necessary that the text segment of the program also makes use of this facility to decrease TLB misses.  Some work was done on this, but it is not yet available in the kernel main line. But patch is available for developers to try this.  Explanation on this technique is provided here  

Patch to the kernel,binutils and Glibc is provided here

Red black tree library - nginx

See wikipedia link on rb tree description.  Red black tree provides improvement over binary trees that makes it faster to search and also faster to insert/delete nodes.  It applies four principles as described in wiki link to balance the tree after every addition and deletion.  These principles make it search performance predictable.  These rebalancing principles are not expensive and hence insert and delete operations are also faster.

Nginx web server uses rb trees to cache SSL sessions across threads, to store the open file cache entries and for many other purposes.

Nginx rbtree implementation does not include all  binary tree operations. It leaves the binary tree insertion to the applications.  It does rebalancing of the tree and also it provides delete operation API function. It lets the application to do the 'search' on binary tree.  Note that search operation on rbtree is same as search operation on binary tree.

Though implementation of 'rbtree' implementation looks complicated (mainly if you have not gone through the algorithm before),   its usage by application is simple.  Applications need to define the root and put the node in the structure elements which it wants to put in the rbtree.

Following type definition of the rbtree root is defined by 'rbtree' implementation:

struct ngx_rbtree_s {
    ngx_rbtree_node_t     *root;
    ngx_rbtree_node_t     *sentinel;
    ngx_rbtree_insert_pt   insert;

root :  This is root of the binary tree.
setinel:  This node is created to empty node for leaves. See wikilink to understand the usage of this. This is really dummy node memory.
insert:  This is function pointer to application function which does insertion into the binary tree.

struct ngx_rbtree_node_s {
    ngx_rbtree_key_t       key;
    ngx_rbtree_node_t     *left;
    ngx_rbtree_node_t     *right;
    ngx_rbtree_node_t     *parent;
    u_char                 color;
    u_char                 data;

ngx_rbtree_node_s is expected to be defined in each element that application needs to make it part of rbtree.

Key:  This is the key.  Applications typically put the hash value of the actual application key in here.  This is used to traverse to left and right based on key value. If two nodes have same key value, then the entire application key is used to make decision.   I guess this is done to improve performance.  Application key can be long.  Integer based comparisons are faster.

left, right, parent:  Are used to maintain the node in the tree.  left and right point to the children and parent point to the parent node.

color:  Color of the node - Red or black.

data:  Start of application data. It is not used by rbtree.

API functions:
  •  ngx_rbtree_init(tree, s, i) : This macro initialize the rbtree instance.  'tree' is pointer to the 'rbtree' instance of type ngx_rbtree_t.   's' is pointer to sentinel node and 'i' is pointer to the insert function. This is the first macro that needs to be called by application to initialize the rbtree instance.
  • void ngx_rbtree_insert(ngx_thread_volatile ngx_rbtree_t *tree,     ngx_rbtree_node_t *node) : This function is called by application to put a new node in the rbtree. This function internally calls the insert function provided to ngx_rbtree_init before to insert the node as per application key in the binary tree. Then ngx_rbtree_insert function continues to do the rebalancing of the tree.
  • void ngx_rbtree_delete(ngx_thread_volatile ngx_rbtree_t *tree,     ngx_rbtree_node_t *node):  This function is called to remove 'node' from the rbtree 'tree'.  It removes the node from the binary tree and reblances it as per rbtree principles.

OSPF v3 IPsec requirements - Notes

RFC 4552 standard provides enough details about IPsec requirements.  Some additional details are given here.

OSPFv3 is mainly meant for IPv6 link state propagation.  Unlike OSPF for Ipv4,  OSPFv3 for IPv6 assumes that the authentication, integrity and optionally confidentiality is provided by inbuilt IPsec layer in IPv6 stack. What capabilities are expected to be provided by IPsec layer to provide authentication and confidentiality functionality to the OSPF?  Before we go further,  please refer to to OSPFv3 RFC specification to understand OSPF concepts. I also found this reading very useful to get faster understanding of OSPF. Here I try to give some quick introduction to OSPF concepts that are relevant for this discussion.

Basic purpose of OSPF is to create routing table in a given network node (typically routers) based on link state information it receives from the adjacent routers.  A router can have multiple physical or logical (VLAN) interfaces connected to different network segments.   OSPF routing protocol runs on the configured interfaces.  As many OSPF Hello protocol instances are run as number of configured interfaces.    A router can be in multiple areas.  A given interface is always in one area.  Router collects link state information across all interfaces of the area and runs Dijkstra algorithm to figure out the shortest paths to all known destinations. If the router is configured to be in multiple areas, then it is called Area Border Router (ABR).  In OSPF world, areas are created administratively to reduce the number of link state updates in the network.  If all routers in the organization is in one area, then amount of link state information exchanged can be very high and amount of time and memory required to run Dijkstra algorithms also can be high.  By creating multiple areas, number of link state updates when there is change in link information is reduced to its area.

OSPF has a concept of Area 0. It is called backbone network.  Every area is expected to have one ABR with the Area 0 (backbone network).  If an area does not have ABR with area0, then it can't communicate with other areas for which it does not have ABR.  By having ABR with area0,  continuity across multiple areas can be maintained.

In some cases, it may not be possible to have direct connectivity with area0.  When the area does not have physical connectivity with area0, then  it is expected that administrators create a virtual link to the area 0 via transit area.  Transit area is the area which is connected to the area in question and the area0.  Virtual link can be thought of as a point to point link.  It creates an p-to-p interface.  More often multiple virtual links are created for redundancy purposes.  There is one standard  which proposed some changes to the OSPF protocol to discover the discontinuity with the area 0 and creates virtual links dynamically by figuring out the transit areas.  But this draft was not widely implemented and hence the virtual links are continue to be configured manually.  Virtual links are normally created with the router IDs which are not IP addresses of the each end of virtual link.  Some vendors provide virtual link configuration with peer IP address.  Every time there is a change in the configuration on the peer, it is required to change the IP address.  To avoid administration mistakes, it is typically left to the OSPF protocol to determine the IP address from router LSAs in transit arewa.   The IP addresses are discovered dynamically  via router LSA in transit area.  OSPF then activates the virtual link. From then on OSPF uses this virtual link as point to point link with the area0.  What it means is that the IP addresses of IP header used by OSPF packets on virtual link are not known to the administrator at configuration time.

Let us come back to OSPFv3 and IPsec based authentication.

It appears that CISCO is supporting both interface level authentication and area level authentication. Juniper seems to be supporting interface level authentication for sure.   In interface level authentication, all neighbor routers would need to have same SA information. Different interfaces  (network segments) of same area can have different SAs.   In case of area level authentication, all routers and interfaces in that area would need to have same SA information.  It is administratively simple to have one set of SA parameters (SPI and keys) for area.  But there may be other security reasons (such as different administrators for different network segments) to have separate authentication/encryption for each network segment.  In case of ABRs,  routers belong to multiple areas.  If area level authentication is used,  there would be as many SA authentication parameters as number of areas.   See this link to understand configuration on CISCO and this link to understand Juniper configuration. 

Note that OSPF uses both multicast and unicast addresses to send its messages to the adjacent routers. Whenever any new OSPF router is added to the network or removed from the network,  it is not expected by administrator to do any configuration changes on current OSPF routers ( Except for virtual links I guess).  One must ensure this even  when OSPFv3 being secured using IPsec. That is, IPsec configuration should not be expected to be changed on the current OSPF routers when new OSPF router is being added. I would think that administrators will not configure neighbor IP addresses in the selectors of SPD policy records.  SPD policy IPv6 selectors would have OSPF IP protocol,  source IP as ANY and Destination IP as ANY.

As we discussed there can be multiple SA requirement either due to having authentication requirement for every network segment or due to multiple areas.   So, one should have many SAs (SPD policy records) with same selector set.  If there are multiple SPD records having the same selectors set,  first record is always will be matched which is not desirable.  That is where,  interface level Security policy database come in handy.  For each interface, there is one SPD.  When  OSPFv3 IPsec is configured on interface, a policy record with  selector having ANY to ANY with OSPF protocol is created in interface specific Security Policy Database.  We also discussed that yet times the SPD records are not interface specific, but area specific where area level authentication is desired.   To take care of this,  one option is to duplicate SPD records in all interfaces corresponding to that area.  This will have administrative overhead and memory overhead.  It is better if IPsec supports area based SPD.  As a matter of fact,  some IPsec stack are capable of doing this, though the capability is not specific to OSPFv3.

IPsec stacks do support multiple SPDs for providing IPsec interfaces for route based VPN.  Since interface IDs are dynamic in nature,  the SPD policy database granularity is provided on fixed domain basis.  Interface ID to domain mapping is configured or programmed at run time.  Multiple interface IDs can be mapped to one domain. But one interface ID can't existi in multiple domains.   If interface level based SPD is required, then the interface to domain mapping is 1:1.  This feature can be mapped to OSPFv3 based authentication easily.  If interface level authentication is chosen,  then there is 1:1 mapping between interface and domain.  If area level authentication is chosen, then area is considered as domain.   OSPFv3 module at run time can generate the mapping of interface ID to the domain or administrator can configure interface names to the domain.

We discussed about virtual links. Virtual links are also interfaces and hence SPD can be configured on per interface basis.   But there is one problem though as described in section 11 of RFC43552.  See the descriptoon about rule 4 and 5.  To protect against clear OSPF packets going to the OSPF daemon through via normal interfaces or any other interface, it is necessary to program the virtual link IP addresses in all interface specific SPD records. That is the reason you require IP addresses for virtual links.  This is not required in case of other SPD records as IPsec packet processing module knows which inbound SPD to look at once the IPsec decapsulation is done.

OSPF requirements for IPsec are :
  • Common Requirements
    •  Transport code.
    • AH Support is MUST as ESP is not implemented by some popular routing vendors.
    • Manual Key Management support  :  Please see RFC4552 on why manual key management is required. Basic need is due to Multicast support.  Since multicast key distribution protocols are not popular,  Manual key management is expected to be supported. 
    • Disable Anti-Replay check.
    • Configuration to replace keying material where administrators will not be able to change the configuration of all routers at the same time. See RFC 4552 Section 10.1 'Rekeying procedures'.
      •  IPsec module should be able to take KeyRolloverInterval. Typically it is in hours/days.
      • Once the new keys are configured at different times but within KeyRolloverInterval,  without any manual intervention, IPsec should be able to bring new SAs and delete Old SAs.
  • OSPF - Internal Router :  In this case, all interfaces belong to only one area. 
    • If area level authentication is good enough:  Then global SPD is good enough.
    • If interface level authentication is required, then it requires to have interface level SPD.
  • OSPF - Area Boundary Router requirements.
    • If interface level authentication is configured for all interfaces,  interface level SPD capability is required.
    • If area level authentication is configured and if there are multiple interfaces to each area, then domain level SPD capability is required.
    • If Virtual links are required, then IPsec stack should have ability to create policy records associated with the virtual link addresses on both ends in all interface/domain specific SPDs.
What changes are required in OSPF implementation?  If virtual link capability is not required, then there is no change required in OSPF implementations as long as administrator ensure to configure  interface level/area level authentication and configure the interfaces domain mapping.   When virtual links are configured, then OSPFv3 module may require to inform IPsec module with IP addresses of virtual link so that IPsec module can create the policies in all interface/area (domain) specific SPDs. By the way, this is also optional as OSPFv3 anyway is going to drop the packets if they don't come on virtual link.  So, in essence there is no real requirement to change the OSPF implementation to provide IPsec authentication.

Tuesday, December 29, 2009

nginx - Module Documentation - Some notes

I was going through the  nginx module development guide.  While I was going through the example module code, I did not understand how the module registers itself to the nginx infrastructure.  For example,  the upstream module has following structure initializations.

static ngx_command_t  ngx_http_upstream_commands[] = {

    { ngx_string("upstream"),
      NULL },

    { ngx_string("server"),
      NULL },


static ngx_http_module_t  ngx_http_upstream_module_ctx = {
    ngx_http_upstream_add_variables,       /* preconfiguration */
    NULL,                                  /* postconfiguration */

    ngx_http_upstream_create_main_conf,    /* create main configuration */
    ngx_http_upstream_init_main_conf,      /* init main configuration */

    NULL,                                  /* create server configuration */
    NULL,                                  /* merge server configuration */

    NULL,                                  /* create location configuration */
    NULL                                   /* merge location configuration */

ngx_module_t  ngx_http_upstream_module = {
    &ngx_http_upstream_module_ctx,         /* module context */
    ngx_http_upstream_commands,            /* module directives */
    NGX_HTTP_MODULE,                       /* module type */
    NULL,                                  /* init master */
    NULL,                                  /* init module */
    NULL,                                  /* init process */
    NULL,                                  /* init thread */
    NULL,                                  /* exit thread */
    NULL,                                  /* exit process */
    NULL,                                  /* exit master */

ngx_http_upstream_commands and ngx_http_upstream_module_ctx are referred in ngx_http_upstream_module.   So far so good.  I expected ngx_http_upstream_module to be referred in some other structure or I expected this to be registered with infrastructure via some API function.  I did not find it being referred anywhere.

Finally I figured out that ./configure shell script with the help of  /auto/modules shell script  is creating a C file called ngx_module.c.    See this shell script bit in /auto/modules file.

cat << END                                    > $NGX_MODULES_C




for mod in $modules
    echo "extern ngx_module_t  $mod;"         >> $NGX_MODULES_C

echo                                          >> $NGX_MODULES_C
echo 'ngx_module_t *ngx_modules[] = {'        >> $NGX_MODULES_C

for mod in $modules
    echo "    &$mod,"                         >> $NGX_MODULES_C

cat << END                                    >> $NGX_MODULES_C


Above shell script is creating C file with array of all module structure variables.  When I executed ./configure, content of the file looked like this. Since this file is big, I have taken few lines out of this file to illustrate the output.

extern ngx_module_t  ngx_core_module;
extern ngx_module_t  ngx_errlog_module;
extern ngx_module_t  ngx_conf_module;
extern ngx_module_t  ngx_events_module;
extern ngx_module_t  ngx_event_core_module;
extern ngx_module_t  ngx_select_module;
extern ngx_module_t  ngx_poll_module;
extern ngx_module_t  ngx_http_module;
extern ngx_module_t  ngx_http_core_module;
extern ngx_module_t  ngx_http_upstream_module;
extern ngx_module_t  ngx_http_static_module;
extern ngx_module_t  ngx_http_autoindex_module;

ngx_module_t *ngx_modules[] = {


Core module seems to be using ngx_modules array to get hold of different function pointers of modules to set the configuration and other operations.

It appears from the source code that there is no dynamic addition of modules.  ./configure script is only method available of module registration.    I feel this is one drawback nginx has with respect to Apache in my view.  Ofcourse, nginx has several advantages such as:
  •  Even driven model:  That is one thread can handle multiple connections simultaneously.
  •  Multiple threads to take advantage of multiple processors.
  •  Usage of 'epoll' mechanism. 
Anyway, I thought I would write this notes for people who also may wonder while going through the documentation  on how module gets registered with nginx infrastructure.  Hope it helps.

Monday, December 28, 2009

librsync usage

Please see the RSYNC algorithm here at  blog on rsync Librsync is library that can be linked with any application.  It has all the features that required for de-duplication.   Any de-duplication functionality requires that new file content is generated from the old content and delta information.  It requires following features.

  • Generation of signatures by the entity that is expected to receive new file. Signatures are nothing but weak (Adler32 )and strong checksums (MD5) of all non-overlapping blocks of the file it has.
  • Entity that is supposed to send the new file should have facility to take the signature file, new file it has and generate delta file.  Delta file consisting of set of instructions.  Instructions include COPY a block from old file (delta file only  has reference to block),   Use the content from the delta file for certain blocks if old file does not have this content 
  • Ability to  merge the delta file with old content file to generate new file.  This is typically done by the entity which receives the delta file.
Librsync library provides all these capabilities.  If you look at the whole.c file of librsync library, you find following capabilities:
  •  rs_sig_file:  This function can be used to generate signature file.  This function is called by the entity that has old content file.
  • rs_loadsig_file:  This function can be used to load the signature file contents in memory based hash list.  This function is called by the entity that has new content file.
  • rs_delta_file :  This function can be used to create delta file from signatures it loaded before and new content file it has.  This function is called by the entity that has new content file.
  • rs_patch_file:  This function can be used to generate new content file from the delta file and the existing content file from which signature file was created before.   This function is called by the receiver of the content.
 'librsync' library maintains all the state information local to the a control block called 'job'.   Due to this, librsync can be used simultaneously to do multiple operations at the same time.  Each operation has its own 'job' and hence it does not have any impact on other operations.  Hence it is suitable for proxy applications where one 'user process' processes multiple connections at the same time.

Sunday, December 27, 2009

Memory Barriers - Where are they required?

Memory barriers are also called memory fences. There are two types of memory fences - Read and write.
Memory barriers are required in Multicore processing environments.

When a programmer develops the code,  he expects the code to run sequentially.  Modern CPUs have multiple execution engines inside them  and they try to parallelize the serial code by itself, thereby executing the code out-of-order.  Also, while CPU is waiting for the data to be ready from the DDR,  it also try to execute the next instruction.  If there is any dependency in the code that is executed already before the old code, then the new code gets executed again.  That is, cores know the dependency of the code in a single thread and hence can revert or cancel the out-of-order code execution, thereby, maintaining the in-sequence order of the code.

When there are multiple cores,  this out-of-sequence code execution might give result to errors.  Wikipedia has got very good example and I am pasting that piece of code and explanation here:

Initially, memory locations x and f both hold the value 0. The program running on processor #1 loops while the value of f is zero, then it prints the value of x. The program running on processor #2 stores the value 42 into x and then stores the value 1 into f. Pseudo-code for the two program fragments is shown below. The steps of the program correspond to individual processor instructions.
Processor #1:
while f == 0
 // Memory fence required here
 print x;

Processor #2:
 x = 42;
 // Memory fence required here
 f = 1;
One might expect the print statement to always print the number "42"; however, if processor #2's store operations are executed out-of-order, it is possible for f to be updated before x, and the print statement might therefore print "0". Similarly, processor #1's load operations may be executed out-of-order and it is possible for x to be read before f is checked, and again the print statement might therefore print an unexpected value. For most programs neither of these situation are acceptable. A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f. Another can be inserted before processor #1's access to x to ensure the value of x is not read prior to seeing the change in the value of f.

As you can see,  processor #1 requires read memory fence and processor #2 requires write fence.

In Linux, memory barriers are implemented in following macros:

smp_rmb()  :  Symmetric Multi Processing & Read Memory Barrier
smp_wmb():  Symmetric Multi Processing & Write Memory Barrier.
smb_mb():    Symmetric Multi Processing * Memory Barrier - Both read & write memory barrier.

Saturday, December 26, 2009

WAN Optimization Devices - Features that should be expected by buyers

Many WAN optimization solutions are available today. What are the features one should look for?  This list is my view of features the box should provide.  Some features might or might not be used in some deployments though.

WAN Optimization is all about utilizing the WAN resources effectively. At very high level main features one should look for are:
  • Usability and  Experience
  • Application Detection and providing differential QoS based on applications.
  • Load balancing and failover of the connections across multiple WAN links 
  • De-Duplication of the data among offices 
  • Compression
  • Caching 
  • Security
  • Reliability

WAN optimization has been and will continue for at least a year or two to be a special device in Enterprise networks.  As the value of WAN optimization is realized,  more and more routing, switching, ADC and network security vendors are adding this  as an additional feature in their offerings as a blade or as a software component running on some cores of Multicore processors.  Buyers should not be just looking at the tick mark on WAN optimization, but should check the details.

Usability features: 
  • No configuration changes to the client and server machines and their applications should be expected as part of WAN optimization device installation in the network:  Client machines (Desktops, Laptops,  Mobiles etc..) and Server machines (Running HTTP Server, Email Server, CIFS Server) should not even know that WAN optimization device is installed in the network.  No changes to be expected to be made to the machines or applications running on those machines when these devices are added or removed.
  • No changes to the existing network infrastructure devices should be expected when WAN optimization devices are installed in the network, except for the cases where there is asymmetric routing.  It is understandable that,  In case of asymmetric routing,  routers need to be configured to redirect the packets to WAN optimization devices using WCCPv2 protocol.
  • WAN optimization device should not be appearing as a L3 hop:  WAN optimization devices are expected to provide Layer 2 transparency.  WAN Optimization Device is expected to intercept the packets at the Layer 2 level by acting as a L2 bridge.
  • IPv4 and IPv6 Support:  During this migration times to IPv6 from IPv4,  both types of networks, clients and servers are possible.  WAN Optimization device is expected to support IPv4, IPv6 networks, clients and servers.
  • When WAN Optimization device is inline of traffic, it is expected that the device has high availability feature using Active-Backup method at the minimum :  If one device fails, other device should be able to take over the WAN optimization functionality. UDP connections must work fine even after backup device takes over the WAN optimization functionality. Existing TCP connections might break, but it must ensure that new connections go successfully through the backup device.
  • Device is expected to provide GUI for configuration. 
    •  WAN optimization devices should have facility to learn other devices and their reachability information dynamically. WAN devices are also expected to provide configuration facility to add/remove reachability information of other WAN devices statically.
    • Any configuration made on a device should be propagated to other devices if some incarnation of this configuration is required on other peer devices.
    • It is normally expected that any configuration change done gets reflected immediately. that is, no restart should be necessary for the configuration to be effective.
    • Configuration through secure mechanism is expected:  SSH for CLI access,  HTTPS for GUI.
    • Configuration consistency across device restarts is expected.
    • For Common Critiria and other certifications,  Configuration facility are expected to have role based management with multiple roles with multiple users belonging to the roles.  It is also expected that audit trail is created upon configuration changes.  Audit logs are expected to have all the configuration information changed for the changed records.
    • Any configuration update on the Active device should reflect in the backup device without any additional effort by administrator.
    • Centralized Management System to configure multiple WAN optimization devices from a single console is normally expected when number of WAN optimization devices are more than few (example: more than 4).
  •  Device is expected to provide multiple kinds of reports and statistics to the admin.
    • Reports related to amount of WAN bandwidth savings that occurred over specific time period.
      • Due to de-duplication,  Due to compression, Due to Caching etc..
      • On different protocols (HTTP, NFS, CIFS etc..)
    • Reports related to Integrity of dedup repositories.
    • Reports related to possible savings if more memory/hard drive capability is added.
    • Reports related to traffic belonging to different applications over specific period of time.
    • Reports related to amount of WAN utilization and under-utilization.
    • Multiple different types of statistics collected over significant amount of time and represented in specific time periods such as hours, weeks, months etc..
    • Debug statistics which aid in field debugging.
    • Tracing facilities in field with different levels of traces.

Application Detection and providing differential QoS based on applications:

One of the features to utilize the WAN links effectively is to identify the applications and apply traffic management facilities such scheduling, marking and bandwidth control.  Lower priority application such as P2P and non-interactive/non-realtime applications can be ensured to use lesser bandwidth when higher priority application data is pending to be sent on the WAN links.

Many applications can be detected by based on the Destination Port of TCP or UDP protocols.  Application detection is expected to be provided by the WAN Optimization device to detect applications that do port hopping. Examples:  P2P and IM applications.  Application detection is also required to detect HTTP connections being used for social networking,  DDL (Direct Download Links) etc..  Application detection identifies the application ID for each connection. QoS policies would need to have application ID as one of the criteria elements to choose the policy so that the policy rule specific actions such as bandwidth control and prioritization of traffic on to the WAN link can be applied.

Load balancing and fail-over of the connections across multiple WAN links 

It appears that many deployments go with more number of WAN links to satisfy their bandwidth requirements than going for a bigger pipe.  I believe it is less expensive. Also it provides organization to scale the bandwidth as the organization grows.  Having multiple WAN links rather than single big pipe also avoid network discontinuity if one link fails.  These WAN links normally are also taken from different ISPs so as to avoid discontinuity in case of one ISP failure.

WAN optimization functionality is expected to provide capability to support multiple links going towards WAN.  These devices are expected to balance the traffic (based on hash result of IP packet header fields such as source IP, Destination IP etc.) across multiple WAN links. Also these devices are expected to transfer the existing connections on a failed WAN link to new links.  These devices are also expected to maintain order of packets in a flow and hence the balancing criteria configurations selection should be provided to administrators.

De-Duplication of the data among offices

This is one important feature to reduce the amount of traffic exchanged (on WAN links) among offices of an organization.  Basic purpose of de-duplication is to ensure duplicate data is not seen on the wire. Peer WAN optimization devices are expected to hide these details from clients and servers which are exchanging the data. Handling and processing of deduplication happens among WAN optimization devices. Different vendors may have implemented this in different ways. It is important to check the de-dup efficiency.   Some of the features to look for are:
  • Block-level Deduplication with configurable block size is to be expected by the administrators. 
  • De duplication must be across the protocols.  That is, if the data is downloaded by a client from a server using HTTP protocol first time and same data is downloaded by the client from the same server, but using CIFS, it is expected that actual data is not seen on the WAN link. That is deduplication must happen across protocols.  
  • Dedup feature efficiency when the data is not changed on the server, but being downloaded by the client again.  In this case, it is expected that no data, but only the blocks identifiers would be sent on the WAN link, that is, 100% dedup efficiency expected.
  • Minimal changes to the data on the server should also lead to near 100% dedup efficiency.  For example, if the additional data of few bytes is added in the beginning of the file in the server, only the changed data or atmost one or two additional blocks of data is expected to be seen on the WAN link when client downloads the changed file.  This change in the data should not lead to transfer of complete file content. In further attempts of same file download should have again 100% dedup efficiency assuming that the file is not changed on the server.
  • Blocks that are stored by the WAN optimization should be persistent. This data should be available after any device restarts.  Example scenario:  A file is downloaded by the client machine from server machine. WAN optimization devices cache the blocks of data. Restart the device and download the same file from the server. There should be 100% dedup efficiency.  It is understandable that devices take some time to recreate the internal serach lists from the disk when the device restarts.  During this time the any download of the file will not be able to achieve 100% dedup. But when the system is ready with internal lists, it should lead to 100% dedup efficiency.
  • Look for amount of disk space and memory the device has.  Dedup efficiency is directly proportional to this.  Some devices don't support the disk drives for storing the dedup data. It would have multiple problems: 
    • Dedup efficiency will not be good as DDR space is limited to store both search lists and data blocks.
    • When device restarts, there is no data in the DDR which leads to learning the data afresh.
  • Dedup efficiency, when tools like fragroute are used, is as good as the cases when it is not used.   It appears that some WAN optimization devices don't work well in these scenarios.  One might argue that fragroute is a lab tool, but it is necessary to remember that fragroute is simulating some real network conditions. For example, it is normal practice to break the TCP segments to smaller segments to reduce head-of-line blocking created by large TCP packets to allow VOIP RTP traffic.  'Frag Route' tool can be used to do multiple things such as:
    • Breaking the TCP segments to smaller segments.
    • Breaking IP datagrams to multiple IP fragments.
    • Reordering of TCP segments and IP datagrams/fragments.
  • Whatever disk size the WAN optimization devices have, it may not be sufficient when compared with amount of new data that is flowing on the WAN links. It is expected that WAN optimization devices throw the blocks which were not used for a long time to make space for new data. New data should be given higher priority all the time.  One way to test to ensure this is to fill the disk by sending unique blocks of data.  Then let the client download a big file from the server and ensure that there si 100% dedup efficiency when the client downloads the same file again.
  • Support for Protocol adapters :  Expect protocol adapters for different protocols for following reasons.
    • Some protocols such as CIFS are chatty. To reduce the chattiness, some intelligence of protocol is required to do operations such as 'read-ahead'. 
    • Knowing data boundary would make deduplication efficient.  WAN Optimization can wait for data with this boundary intelligence and then do dedup processing.  There is higher chance of finding the dedup blocks.
    • Doing ALG functionality to figure out data connections such as RTP to apply special processing to reduce any latency and jitter.
    • Decoding and Decompression of data before dedup processing occurs.
    • SSL Support :  For SSL termination and SSL connection establishment.
  • Support for dedup feature for real time and/or  streaming traffic:  When reliability channels are used between WOC devices, then real time traffic quality can suffer.  Hence it is expected from WAN optimization devices to use non-reliable channel for real time traffic.
Compression of data before sending it on the WAN links

WAN optimization devices are expected to provide compression feature to reduce the data on the WAN link. Dedup functionality reduces the data by not sending duplicate data.  Compression reduces the data that is being sent. Hence both functions are expected in the WAN optimization products. 
  • It is expected that compression is beyond packet level compression. It should be across the reliable channel (connection) normally established among WAN optimization devices. Compression by maintaining its history can do better job of reducing the data over time.

Caching feature completely eliminates any data including dedup block identifier data going on the wire if the file is not changed across downloads.  It is only possible to do this in HTTP protocol. 
  • WAN Optimization device should act as HTTP Proxy supporting HTTP/1.0 and HTTP/1.1
  • SSL Termination to ensure that it does both Caching and Deduplication across WAN Optimization devices.  Peer WAN Optimization device can make SSL connection to the Server.
Security on data at Rest:  WAN optimization device stores the dedup data in the hard drive.  It could be confidential data.  Expect WAN optimization to support secure storage of the data.
  • Crypto file system versus normal file system:  Devices must ensure that confidential data is stored in crypto file sysetm.
  • Encryption key used by crypto file system must not be stored in the same device.  Expect it to provide KMIP or equivalent functionality to get the keys from Key Management Server. It ensures that when device is stolen,  thieves don't get hands on the clear data.
Security on data on wire:  WAN Optimization devices are expected to provide secure transfer of data among them.  Note that these devices are terminating SSL connection at one device and make new SSL connection from remote device to the ultimate server.  Hence it is necessary that communication among these devices is secured.  IPsec is one popular technology to secure the data on the wire.  Expect IPsec kind of functionality in WAN Optimization devices.

Reliability & Data Consistency:

As part of dedup and caching, data is stored in the disks.  Expect some functionality to ensure that the data written to the disk is same as data being read.  RAID is one method by which it can ensure that kind of integrity upon any disk related errors.

As always, it is always good to get the devices and evaluate them in your network for significant of time before buying them. 

Network Packet Processing Applications - Example code using RCUs.

Data plane portion of embedded network devices typically does the packet processing of various data paths - Firewall, NAT,  IPsec, routing and bridging.  Close observation of these datapaths follow similar packet processing path.

Data path consists specific packet processing and interfaces with control plane. Structure of the datapath typically consists of:

Packet processing:
  • Packet-In (Get hold of packet).
  • Parse and Extract fields.
  • Check for matching context in 'Context' table.
  • Context would give information to act on the packet.
  • Modify the packet as per the datapath functionality and based on the information in the matching context.
  • Send the packet out.
At various stages, appropriate statistics counters are incremented. If there is no matching context for a given packet, packet is handed over to Control plane.  Control plane, based on its database, creates the context in the datapath.

Datapath-Control Plane interface:

  • Acting on Create/Delete/Modify/Query Context commands from CP
  • Responding to commands after acting on them.
  • Yet times, data paths create unsolicited events to CP
    • Exception packets - Packets that don't have matching context.
    • Local packets - Packets meant for local applications or packets matching context that are specifically marked to send the packets to CP.
    • Logging/Audit events.
In almost all cases, the context database is searched for every packet and context database is updated on behalf of control plane for exceptions.  That is,  search operations are lot higher than the update operations on these databases.  In Multicore environment, this context database is accessed from multiple cores.  To ensure that integrity of the context databases,  mutex locks are normally used.  For databases where the search operations are more frequent than update operations,  read-write locks are used.  As discussed in the previous post on RCUs,  read-write locks are expensive in SMP systems such as Linux.  RCUs are best suited for these databases as rcu_read_lock and rcu_read_unlock are very light and in some CPU architectures, these functions are empty functions. 

RCUs also provide good functionality and supports safe-reference to these contexts when external modules require the reference to the contexts.   As explained in the post,,  external modules need not store the context reference as pointer. Please note that I have used the term 'record' in that post instead of term 'context'.  You can treat them to be one another same.

With both of these advantages,  I thought it is better to showcase these two techniques using C code (sort of).

Hash table is one of the common data structure used in datapaths.  I will take hash table implementation and showcase how RCUs and safe reference techniques can be applied on hash table.   This hash table implementation is made as generic enough so that multiple datapaths can use these for their own context databases.   By making it generic library, each datapath developer don't have to worry about RCUs and Safe reference techniques and can achieve this using generic library.

My intention of C code is only to showcase the techniques. Code given below is done in a day or so and the code is not tested. There is no guarantee that this code works. 

MTHT Library:


  * DLL Node requires two pointers - Next and Prev. 
  * This structure defines these two pointers.
  * This hash table implementation supports the case
  * where there are two flows for a context and hence
  * there would be two DLL nodes.


typedef struct HashNode_s {
    struct HashNode_s *next;
    struct HashNode_s *prev;
    unsigned int sr_array_index;

/** HashCommonInfo_t memory is expected to be part of datapath
    application context
    table pointer is required to trace back to table as part of
    RCU callback function.

struct MT_Hash_Table_s;

typedef struct MTHashCommonInfo_s {
     struct rcu_head rcuh;
     HashNode_t  flow1;
     HashNode_t  flow2;
     struct MT_Hash_Table_s *table;

/** Hash Table contains multiple Hash Buckets.
    This structure defines the hash bucket.  In this case
    hash bucket contains the pointer to the first node in the bucket

typedef struct MT_Hash_Bucket_s {
     HashNode_t  head;

typedef struct MT_Safe_Reference_Node_s {
    unsigned int cookie_val;
    MTHashCommonInfo_t *common_node;
    int array_index;
    struct MT_Safe_Reference_Node_s *next;

/** Hash Table :  Main structure pointing to the hash table instance.
      It consists of buckets (Bucket list and number of buckets)
      lock variable, which is used when the hash table is updated.
      Cookie value and free safe reference list are mainly for safe reference

typedef struct MT_Hash_Table_s {
    int number_of_buckets;
    MT_Hash_Bucket_t *buckets;
    struct futex lock;
    unsigned int  cookie_val;
    unsigned int safe_reference_nodes;
    struct MT_Safe_Reference_Node_s *sr_array;
    struct MT_Safe_Reference_Node_s *free_index_list;
    unsigned int (*match_func)();
    unsigned int (*rcu_app_callback)(); 

/** Creates the hash table and returns the pointer to the hash table

MT_Hash_Table_t* MTHT_create_hash_Table( int number_of_buckets,
            int max_number_of_common_nodes, 
            unsigned int (*match)(),  
            unsigned int (*app_rcu_clbk)() )

        MT_Hash_Table_t *table;

        /** Allocate hash table **/
        table = calloc(1, sizeof(MT_Hash_Table_t));

        if (table == NULL )
            return (NULL);

        /** Allocate buckets **/

        table->buckets = calloc(number_of_buckets,  sizeof(MT_Hash_Bucket_t));

        if (table->buckets == NULL )

    /** Allocate space for safe reference nodes - common information nodes **/

        if ( MTHT_init_sr_array(table, max_number_of_common_nodes) == 0)
              return (NULL);

         /** Initialize the hash table **/
         /** Each bucket is arranged in circular linked list fashion
             so that the deletion of a hash node in future does not require
             to know the head pointer
            int ii;
            for ( ii = 0; ii < number_of_buckets; ii++ )
                table->buckets[ii] = table->buckets[ii].head.prev
                     = &table->buckets[ii].head;

        table->number_of_buckets = number_of_buckets;
        futex_init(&table->lock); /** Initialize the lock **/
        table->cookie_val =  1; /*Start the safe reference cookie value at 1*/
        table->match_func = match;
        table->rcu_app_callback = app_rcu_clbk;

  MTHT_remove_hash_table assumes that all nodes are freed already.
  This function frees up all the memory allocated during 'create' function.

void MTHT_remove_hash_Table( MT_Hash_Table_t *table)

unsigned int MTHT_get_free_sr_index(MT_Hash_Table_t *table);

/** This adds the node into hash table. 
    Since ther are two flows, there would bbe two keys passed to it **/

unsigned int MTHT_Add_Node( MT_Hash_Table_t *table,
  MTHashCommonInfo_t *flowsInfo, unsigned int hash_key1,
  unsigned int hash_key2)

   unsigned int free_index;


   /** Get free array index **/

   if ((free_index = MTHT_get_free_sr_index(table)) == 0 )
      return 0;

   table->sr_array[free_index].cookie_val = table->cookie_val++;
   table->sr_array[free_index].common_node = flowsInfo;
   table->sr_array[free_index].next = NULL;

   /** Adding the flow should be the last step **/
   /** Add flow1 **/
   /** Add flow1 to circular double linked list in the beginning,
       right after head

      HashNode_t *head;
      HashNode_t *flow;
      head =  &table->buckets[hash_key1 % table->number_of_buckets].head;
      flow = &flowsInfo->flow1;

      flow->sr_array_index = free_index;
     /** DLL Manipulations, but RCU purposes, we need to use
         rcu_assign_pointer to take care of CPUs with weak ordering
      flow->next = head->next;
      head->next = flow;
      flow->prev = head;
      head->next->prev = flow;

     /** Even though it may not be requierd in some CPUs, this is good
      ** practice. Also, it tells the pointers which are important
      rcu_assign_pointer(flow->next, head->next);
      rcu_assign_pointer(head->next, flow);
      rcu_assign_pointer(flow->prev,  head);


   /** Add flow2 **/
      HashNode_t *head;
      HashNode_t *flow;
      head =  &table->buckets[hash_key2 % table->number_of_buckets].head;
      flow = &flowsInfo->flow2;
      flow->sr_array_index = free_index;

      /** Following statements are converted to RCU
      flow->next = head->next;
      head->next = flow;
      flow->prev = head;
      head->next->prev = flow;

     /** Even though it may not be requierd in some CPUs, this is good
      ** practice. Also, it tells the pointers which are important
      rcu_assign_pointer(flow->next, head->next);
      rcu_assign_pointer(head->next, flow);
      rcu_assign_pointer(flow->prev,  head);




 * This function removes the node from the hash table
 * hash keys are not required as buckets maintain the double linked list.

void MTHT_RCU_free_fn(struct rcu_head *);

unsigned int MTHT_Remove_Node(MT_Hash_Table_t *table,
    MTHashCommonInfo_t *flowsInfo)
    unsigned int array_index;


    array_index = flowsInfo->flow1.sr_array_index;

    /** Invalidate the sr_array node, but don't put in the
        free list yet. Freeing to the linked list do be as part
        of RCU callback.  Otherwise, there is a possibility of
        same index being used by some other context
        Do this before removing the flows.

    table->sr_array[array_index].cookie_val = 0;
    table->sr_array[array_index].common_node = NULL;

    /** Remove flow1, but don't initialize the pointer to NULL as
     ** other thread will continue to travese the linked list

       HashNode_t *flow;

       flow = &flowsInfo->flow1;
       /** Make this DLL steps RCU friendly
       flow->next->prev = flow->prev;
       flow->prev->next = flow->next;

       rcu_assign_pointer(flow->next->prev, flow->prev);
       rcu_assign_pointer(flow->prev->next, flow->next);

    /** Remove flow2 **/
       HashNode_t *flow;

       flow = &flowsInfo->flow2;
       /** Make this DLL steps RCU friendly
       flow->next->prev = flow->prev;
       flow->prev->next = flow->next;

       rcu_assign_pointer(flow->next->prev, flow->prev);
       rcu_assign_pointer(flow->prev->next, flow->next);

    /** Set up the callback for RCU infrastructure to indicate us when
     ** all threads complete their current scheduled cycle

    call_rcu(&flowsInfo->rcuh,   MTHT_RCU_free_fn );

/** Seach function.
 ** Up to 4 search keys can be passed

HashNode_t *MTHT_Search_Node(MT_Hash_Table_t *table, unsigned int hash_key,
               void *search_key1, void *search_key2, void *search_key3,
               void *search_key4, unsigned int (*search)(),
               unsigned int *sf_index, unsigned int *cookie_val  )

   HashNode_t *temp, *head;
   unsigned int array_index;
   unsigned int cookie;
   MTHashCommonInfo_t *common;


       /** Start from head  **/
       head = &table->buckets[hash_key % table->number_of_buckets].head;
       temp = rcu_dereference_pointer(head->next);
       while(temp != head)
           if (search(temp, search_key1, search_key2, search_key3, search_key4)
                   == 1 ) /** Match found **/
              array_index = temp->sr_array_index;
              cookie = table->sr_array[array_index].cookie_val;
              common =  table->sr_array[array_index].common_node;
              if((cookie != 0) &&  (common != NULL))
                 /** while traversing the list, deletion process might have
                     started, if you find any of these variable being reset
                     assume that this node is not matched. Since these
                     variable are not reset, break

           temp = rcu_dereference_pointer(temp->next);

       if (temp == head )
          /** No Match found **/
          temp = NULL;
        *sf_index = array_index;
        *cookie_val = cookie;




void MTHT_free_sr_node(MT_Hash_Table_t *table, unsigned int array_index);

void MTHT_RCU_free_fn(struct rcu_head *ptr)
     MTHashCommonInfo_t *flowsInfo;

     flowsInfo = container_of(ptr,  MTHashCommonInfo_t, rcuh);

     /** Free the array index into free sr list **/
     MTHT_free_sr_node(flowsInfo->table, flowsInfo->flow1.sr_array_index); 


     /** Now call applicaton specific callback with common Info pointer


/** Initialize Safe Reference Array (also called Indirection Array)

int MTHT_init_sr_array(MT_Hash_Table_t *table,
              unsigned int max_number_of_common_nodes)
    unsigned int temp_index = 1;
    struct MT_Safe_Reference_Node_s *temp;

    table->sr_array =  calloc(max_number_of_common_nodes,
                            sizeof (MT_Safe_Reference_Node_t));

     if(table->sr_array == NULL )
         return 0;

     /** Create the free list of array indices **/

         temp = &table->sr_array[0];
         temp_index = 1;
         table->free_index_list = temp;

         while(temp_index <= max_number_of_common_nodes)
            temp= &table->sr_array[temp_index];
            temp->next =  table->free_index_list;
            table->free_index_list = temp;



unsigned int MTHT_get_free_sr_index(MT_Hash_Table_t *table )
  unsigned int free_index;

  if(table->free_index_list == NULL )
      return 0;

  free_index =  table->free_index_list->array_index;
  table->free_index_list = table->free_index_list->next;


void MTHT_free_sr_node(MT_Hash_Table_t *table, unsigned int array_index)
  MT_Safe_Reference_Node_t  *node;
  node = &table->sr_array[array_index];

  node->next = table->free_index_list->next;
  table->free_index_list = node;

I hope it helps in understanding how 'RCU's and safe reference techniques can be used using hash table.

Friday, December 25, 2009

DCE RPC Evasions - Developer tips

There are many vulnerabilities discovered and disclosed on applications based on DCE RPC.  Though many IDS/IPS devices have signatures to detect attacks exploiting these vulnerabilities,  these devices can be evaded easily.  This particular presentation has information on the methods used by attackers to avoid detections by network security devices.  Before I discuss on how to take care of these evasions in IDS/IPS, let me give some background on DCE RPC.

Some back ground on DCE RPC:

RPC (Remote Procedure Call) mechanism is used for distributed computing. Developers of applications on RPC mechanism need not worry about the underlying transport system. They only need to define the application methods, arguments and return values.  RPC mechanism takes care of every thing else such as transporting the method and arguments to the server in the  remote system and giving back the results to application after receiving results from the remote system.   RPC Server also need not worry about transporting the results after executing the method.

Predominantly there are two RPC systems that became popular.  Conceptually both are similar, but underlying implementation is different. Hence they are not interoperable. 

ONE/RPC :  Also called SUN RPC.  It is specified in RFC 5531
DCE RPC :  Distributed Computing Environment / Remote Procedure Call - This is adopted by Microsoft and hence is also called as MS RPC.  There is no RFC specifying this specifications. But open group published kind of specification. You can find it here.

Protocol information on DCE RPC:

DCE RPC transport can be  TCP based (Connection Oriented),  UDP (Connection Less) and on SMB (CIFS).  Any IDS/IPS device detecting attacks on DCE-RPC based applications need to look deep in the packets coming from any of the transport.  Though many protocol messages are same across TCP and UDP transports,  protocol data is different between them.

IDS/IPS devices need to worry about following PDUs:  BIND (Type 11),  BIND_ACK (Type 12), ALTER_CONTEXT (14),  ALTER_CONTEXT_RESP (15) , REQUEST (Type 0) and RESPONSE (Type 2) PDUs.

Each DCE RPC application (Client & Server) is identified by UUID (of 16 bytes).  Client  machines when it makes the connection to the Server machine, it indicates the applications it is interested in communicating by sending UUIDs of the applications in BIND and ALTER_CONTEXT PDUs.   Note that,  client machine may not know all UUIDs at one time.  It may come to know during run time. What it means is that the UUIDs need not be sent at one time. They can be sent any time during the RPC connection.   BIND-ACK,  ALTER_CONTEXT_RESPONSE return the context IDs for each UUID.  These context IDs are used by client application and server application to execute methods (and correspond arguments and return values) on the servers. Client application send method (identified by context ID,  opnum) and its arguments to the Server application via REQUEST PDU.  Server application, once it executes the methods, it sends the results via RESPONSE PDU.

Each PDU has RPC header as shown below.  (Copied from open group specification)

        u_int8  rpc_vers = 5;        /* 00:01 RPC version */
        u_int8  rpc_vers_minor;      /* 01:01 minor version */
        u_int8  PTYPE = bind;        /* 02:01 bind PDU */
        u_int8  pfc_flags;           /* 03:01 flags */
        byte    packed_drep[4];      /* 04:04 NDR data rep format label*/
        u_int16 frag_length;         /* 08:02 total length of fragment */
        u_int16 auth_length;         /* 10:02 length of auth_value */
        u_int32  call_id;            /* 12:04 call identifier */
 BIND specific data:

        u_int16 max_xmit_frag;     /* 16:02 max transmit frag size, bytes */
        u_int16 max_recv_frag;     /* 18:02 max receive  frag size, bytes */
        u_int32 assoc_group_id;    /* 20:04 incarnation of client-server
                                                            * assoc group */
      /* presentation context list */

        p_cont_list_t  p_context_elem; /* variable size */
      /* optional authentication verifier */
      /* following fields present iff auth_length != 0 */

        auth_verifier_co_t   auth_verifier; 
p_cont_list_t :
typedef   struct {
          u_int8          n_context_elem;      /* number of items */
          u_int8          reserved;            /* alignment pad, m.b.z. */
          u_short         reserved2;           /* alignment pad, m.b.z. */
          p_cont_elem_t [size_is(n_cont_elem)] p_cont_elem[];
          } p_cont_list_t;
typedef struct { 
    p_context_id_t p_cont_id; 
    u_int8 n_transfer_syn; /* number of items */ 
    u_int8 reserved; /* alignment pad, m.b.z. */ 
    p_syntax_id_t abstract_syntax; /* transfer syntax list */ 
    p_syntax_id_t [size_is(n_transfer_syn)] transfer_syntaxes[]; 
} p_cont_elem_t; 
typedef u_int16 p_context_id_t;
typedef struct { 
   uuid_t if_uuid; 
   u_int32 if_version; 
} p_syntax_id_t;

Some important points to note are:
  • Multiple UUIDs can be bound using one BIND PDU. n_context_elem gives the number of UUIDs that are being requested. 
  • Normally n_transfer_syn is 1, but it can be more than 1.  It provides the transfer encoding of the UUID in multiple forms.
  • BIND_ACK will provide context IDs for each UUID of the BIND PDU.
  • If there are more UUIDs need to be bound at later time, they can be bound using ALTER_CONTEXT PDUs.
  • Context IDs of new UUIDs are given by Server machine using ALTER_CONTEXT response PDU.
Other point that one must be aware are:
  •  DCE-RPC protocol itself has fragmentation supported. This is beyond the IP layer fragmentation.  pfc_flags bit 0 and bit 1 tells the receiving implementation on whether it is FIRST fragmentation or LAST fragmentation.  If both bits are 0, then it is MIDDLE fragment.  
  • Pipe lining:  DCE RPC allows multiple requests in transit. That is,  DCE RPC can send multiple REQUESTS before getting the RESPONSE messages. My understanding is that these multiple requests in transit belong to different contexts.  For a given context, only one REQUEST would be pending.  When pipelining is combined with fragmentation, it is possible that the fragments belonging to REQUEST messages (or RESPONSE messages) can go in any order. That is,  on the wire you can see frag1 of REQUEST 1,   frag 1 of REQUEST2, frag2 of REQUEST1  and frag2 of REQUEST2 or any order.
What IDS/IPS devices need to be aware of to avoid evasions by attackers?
  •  IDS/IPS devices typically look in the stub data for a specific pattern.  Stub data is specific to each application and method number.  So, any protection signature that searches for the content in the stub data should have other keywords such as UUID,  method number. Unfortunately UUID is not present along with method and stub data. That is UUID is not present in REQUEST PDU or RESPONSE PDU.  REQUEST and RESPONSE PDUs only have Context ID.  Hence IDS/IPS device need to maintain the UUID versus context ID mapping by interpreting BIND (Alter_Context) and BIND-ACK (Alter-Context-Resp) PDUs for each TCP or UDP connection.  This mapping information needs to be used when the signatures are analyzed as part of  attack detection on methods and its stub data processing. Note that, these mappings should be maintained on per connection basis.  Across the connections, for same UUID, there could be different context ID value.  IDS/IPS devices might be evaded by sending large number of UUIDs in the BIND/Alter-Context PDUs to overwhelm the mapping table.  There is no easy solution here, but IDS/IPS devices should have ability to generate log when this happens. It is also good to have some kind of configuration/anomaly-signature to control this behavior. 
  • In some IDS/IPS devices, interpretation of UUIDs goes wrong if there are more than one transfer encoding of UUID, that is, if 'n_transfer_syn' is more than one.  It is necessary to skip right amount of transfer encoded data to go to the next UUID in the BIND/Alter-Context PDUs.
  • Some IDS/IPS devices fail to detect attacks when PDUs are fragmented.  It is necessary that IDS/IPS devices interpret the data across multiple fragments. It is required to build UUID-to-context ID mapping table or to analyze the stub data of REQUEST and RESPONSE PDUs.
  • Some IDS/IPS devices also fail when there is pipelining of REQUEST ( orRESPONSE PDUs).   IDS/IPS devices while reassembling the fragmented PDU must ensure to reassemble right fragments by comparing not only checking the fragment flags, but also the type of message and context ID in case of REQUEST/RESPONSE messages.
  • Some IDS/IPS devices also fail when the stub data is sent in a different form. Stub data can be encoded in little endian form and big endian form.  IPS devices need to ensure to check the stub data based on the format.  'packed_drep' field in the header indicates the format used.
As a buyer of the IDS/IPS product,  one must ensure that the device withstands evasion techniques adopted by attacker to bypass detection and prevention of attack passing through it.