Sunday, February 22, 2009

Advice to Network Device testers - Simulate Capacity/Stress related faults

Capacity in network devices such as UTMs is specified in terms of simultaneous connections for the firewall, ALG, and intrusion prevention functions; tunnels for IPsec VPN; sessions for the anti-virus and anti-spam functions; and similar metrics for many smaller functions. Normally, not all functions are in use at the same time, and even when they are, not every session passes through every function. Because of this, network device vendors typically oversubscribe memory. That is, the memory needed to run all functions at their specified capacity would be a lot more than the memory available in the device.

This can pose interesting problems in the field. In a deployment where a large number of connections exercise multiple functions, the device can run short of memory and other resources, and resource allocations start returning errors. If error detection, propagation, and recovery are not handled well by the software, this can lead to instability, leaks, crashes, and lockups. It is the tester's job to ensure that these kinds of problems do not happen in the field. Typically, testers simulate different conditions and verify that the system remains stable. At times, though, it is not possible for testers to cover all combinations or simulate every condition.

I believe testers should be able to cover all such combinations by simulating every kind of error condition. To do so, testers should ask the development team to provide facilities to inject faults. In particular, testers should ask for fault-injection facilities for the following:
  • Memory allocation failures: Almost every function in the software allocates memory, whether at connection establishment, per packet, or to queue packets and control data. Testers should have the ammunition to inject memory faults into specific functions.
  • Socket/File open failures
  • Semaphore creation failures
  • Thread/Tasklet creation failures
  • Fault simulation for any other OS resource that is allocated after the software has fully initialized.
Testers should approach this testing methodically:

  • Keep a list of all functions and the OS resource allocations they make.
  • Create a test case for each one.
  • Before running the test case, configure the fault to be injected.
  • Run the test and ensure that the system behaves as expected.
  • Run the test without the fault and ensure that the system is stable.
I believe this kind of testing should happen for every release, whether a feature or maintenance release. Done manually, these tests take a very long time, so my suggestion is to automate them.
