Tuesday, 29 April 2014

When Testing Goes Wrong

This is a reminder to self, and a warning to others, to make sure your testing environment is as similar as possible to the real environment. This doesn't just go for computing either...

I am developing a web service using a 'proper multi-stage development environment'. At least I thought I was. I'd set up git to do version control and manage deployment to an online backup, a test server and a live production server. I develop code on a laptop, test major releases on the staging server, then once I know it's running as expected, and only then, push the code to the live production server. This approach is meant to filter out any mistakes in the code, and any odd behaviours, before they reach the end-user.

This was working brilliantly, safeguarding my good code from late-night hacking and making me feel in control and professional.

Then I hit a problem.

My production server would not maintain a connection on a particular page. My staging server worked fine. It turns out that a fundamental difference between my staging and production servers hid an error from me until I was on a train going to show off the web service.

Some detail for those interested. The problem was to do with firewalls, ports and incompatibilities between old and new technologies:

  • First, the staging server also hosts my webpage on www.staging.com. This meant that I accessed the site being developed using a different port number: www.staging.com:3000
  • The production server only hosts my production page, which I access without a port number: www.production.com, meaning the standard HTTP port 80
  • I added a REDIRECT to send port 80 to port 3000, which seemed to work fine. Exactly the same code as on staging, but with this small additional setting
  • My web page uses websockets, a fairly new technology that came in with HTML5 and is great for real-time communication
  • As far as I know, some firewall/port/redirection software is incompatible with websockets, and this stopped that key function from working
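A check like the one below might have caught this before the train journey. It is only a rough sketch, not what I actually ran: it opens a plain TCP connection, sends a websocket upgrade request and checks that the server answers with "101 Switching Protocols" and the matching Sec-WebSocket-Accept header, as RFC 6455 describes. The function names and the "/" path are made up for illustration, and it assumes an unencrypted (ws://) endpoint.

```python
import base64
import hashlib
import os
import socket

# Fixed GUID that RFC 6455 defines for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key: str) -> str:
    """Return the Sec-WebSocket-Accept value a compliant server must send back."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_handshake(host: str, port: int, path: str, key: str) -> bytes:
    """Build the HTTP upgrade request that starts a websocket connection."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}:{port}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def handshake_ok(response: bytes, key: str) -> bool:
    """True if the response is a 101 with the correct accept header."""
    head = response.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    status, *header_lines = head.split("\r\n")
    parts = status.split()
    if len(parts) < 2 or parts[1] != "101":
        return False
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers.get("sec-websocket-accept") == expected_accept(key)


def check_websocket(host: str, port: int, path: str = "/", timeout: float = 5.0) -> bool:
    """Attempt a websocket handshake; any network error counts as a failure."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_handshake(host, port, path, key))
            response = b""
            while b"\r\n\r\n" not in response:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                response += chunk
    except OSError:
        return False
    return handshake_ok(response, key)
```

Running check_websocket("www.staging.com", 3000) and check_websocket("www.production.com", 80) side by side would show whether the upgrade survives the port 80 redirect, which is exactly the path my staging tests never exercised.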

There are solutions that I won't go into here just yet. What I want to point out, to remind myself in the future, is that it is very important that your staging server is as similar to your production server as possible. Same versions of software, same security settings, same time zone... you never know what weirdness could break your site, so try to bring as much of that weirdness into your test environment as possible.
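One way to keep myself honest about this is to record a small "fingerprint" of each server and compare them. This is a minimal sketch under my own assumptions: the function names and the choice of settings are invented for illustration, and a real check would want to cover far more (package versions, firewall rules, redirects and so on).

```python
import platform
import time


def environment_fingerprint(port: int) -> dict:
    """Collect a few settings from the machine this runs on (an illustrative subset)."""
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "os_release": platform.release(),
        "timezone": time.tzname[0],
        "port": port,
    }


def diff_environments(staging: dict, production: dict) -> dict:
    """Return {setting: (staging_value, production_value)} for every mismatch."""
    diffs = {}
    for key in sorted(set(staging) | set(production)):
        s = staging.get(key, "<missing>")
        p = production.get(key, "<missing>")
        if s != p:
            diffs[key] = (s, p)
    return diffs
```

Run environment_fingerprint on each server, save the result, and feed both into diff_environments. In my case the port and the extra redirect would have shown up straight away as differences worth testing.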

I've personally seen similar situations testing missiles on fast-jets and controls for armoured vehicles, and there are many other examples of things going wrong because of this issue. The cause of unexpected failure is quite often in the moistware (aka humans), but it could be due to many different considerations throughout the life of a system. If you need to be absolutely sure that something is going to work, test it to death, but remember it can still fail. The devil is in the detail; it's also in the things you haven't even thought about. Having a system that copes with breakdowns and errors is as essential as testing.