Debugging software is best done using the scientific method: gather evidence about the effects of the bug, conjure up hypotheses to explain the behaviour, experiment to test the hypotheses and modify the code to change the behaviour. Rinse and repeat. If you can’t consistently reproduce the bug though, it can get tricky.
Recently, while developing a site targeted at mobile devices, we came across an intermittent problem when using a BlackBerry device. Testing mobile sites with desktop browsers and emulators can only take you so far. Eventually you reach the point where real devices begin to exhibit their own peccadillos and so we use DeviceAnywhere to access a whole host of remote-controlled physical devices.
Using the BlackBerry Curve, our login page would occasionally fail to proceed to the home page after successful authentication. But we could never reproduce this in our development environments, only on live; sometimes.
One major difference between the two environments was that the live one had dozens of servers behind a load-balancer which used a URL parameter for session affinity (we couldn’t assume all mobile devices would support cookies), whereas the development environment was a single server. We also had a staging environment which closely reproduced the live environment, although there were only a couple of servers behind its load-balancer. Initial tests on the staging environment indicated that the problem didn’t appear there either.
To rule out the mobile network provider, we installed the excellent Opera Mini browser on the BlackBerry and it worked every time. This also ruled out any issues with pages being cached by Akamai, the content delivery network. So we were now looking for a problem with our code interacting with the BlackBerry browser, but only behind our live load-balancer; sometimes.
After painstakingly tracing through the live Apache logs, we closed in on the unexpected cause: a bug in the BlackBerry browser. When a server tells a browser to redirect, it sends the full URL, including in our case the all-important session parameter. This URL was being tampered with before the browser navigated to it: the parameter name was being converted to lower-case (if it wasn’t preceded by a slash). This meant that the load-balancer didn’t use it for server affinity, so the request for the home page probably landed on a server that didn’t have a logged-in session, and it would bounce back to the login page.
The reason this problem had been so hard to reproduce was that in development there was only one server, so affinity wasn’t an issue, and the server software didn’t care about the case of the session parameter. Also, the development site URL was different, so the session parameter always had a preceding slash, which didn’t trigger the BlackBerry URL tampering; it never appeared as lower-case in the development logs. And on the staging environment, because there were only two servers, the device would hit the same server half of the time by chance alone, regardless of any affinity failure caused by the lower-casing. The live environment was more likely to fail, but even it gave a sizeable probability of hitting the same server successively by chance alone.
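To put rough numbers on that, here is a minimal sketch. It assumes that a request whose affinity parameter is ignored gets routed to a server chosen uniformly at random; we haven’t confirmed that this is how our load-balancer behaves, and round-robin or least-connections routing would give different figures. The function name and the live server count of 24 are made up for illustration.

```python
from fractions import Fraction


def chance_same_server(servers: int) -> Fraction:
    """Chance that a request with broken affinity still lands on the
    server holding the session, ASSUMING uniform random routing."""
    if servers < 1:
        raise ValueError("need at least one server")
    return Fraction(1, servers)


# Server counts: development and staging are from the post;
# 24 is a hypothetical stand-in for "dozens" on live.
for name, count in [("development", 1), ("staging", 2), ("live", 24)]:
    p = chance_same_server(count)
    print(f"{name}: {count} server(s), login survives with probability {p}")
```

The fewer servers there are, the more often the bug hides itself, which is why it never showed up on a single development server.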
We built a test server and, using some black box reverse-engineering (because the BlackBerry browser is closed-source), we reckon the logic inside the browser’s redirect code goes something like this: “lower-case all the characters in the location URL up to the first slash” presumably with the intention of making the DNS name lower-case. But it should be: “… up to the first slash or ?” to preserve the case of any query parameters.
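The guessed logic and the suggested fix can be sketched as below. This is only our reading of the behaviour from the outside, not BlackBerry’s actual code. The function names, the `SessionID` parameter name and the example host are hypothetical, and “the first slash” is taken to mean the first slash after the `://` of the scheme.

```python
def suspected_browser_logic(location: str) -> str:
    """Lower-case everything up to the first slash after the scheme.
    Our guess at what the browser does; not confirmed."""
    scheme, sep, rest = location.partition("://")
    if not sep:
        return location  # not an absolute URL; leave it alone
    end = rest.find("/")
    if end == -1:
        end = len(rest)  # no slash at all: the query string gets caught too
    return scheme.lower() + sep + rest[:end].lower() + rest[end:]


def corrected_logic(location: str) -> str:
    """Lower-case only up to the first slash or '?', whichever comes first."""
    scheme, sep, rest = location.partition("://")
    if not sep:
        return location
    stops = [i for i in (rest.find("/"), rest.find("?")) if i != -1]
    end = min(stops) if stops else len(rest)
    return scheme.lower() + sep + rest[:end].lower() + rest[end:]


no_slash = "http://M.Example.com?SessionID=AbC123"
with_slash = "http://M.Example.com/?SessionID=AbC123"

print(suspected_browser_logic(no_slash))    # parameter is lower-cased
print(suspected_browser_logic(with_slash))  # parameter survives
print(corrected_logic(no_slash))            # parameter survives
```

Note that this sketch lower-cases the parameter value as well as its name when there is no slash; whether the real browser does the same is not something the sketch settles.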
Googling for this issue returns a number of other sites having redirect and login issues with BlackBerrys. I wonder how many are caused by this subtle, case-sensitive bug?
We’ve since searched our logs and found the bug across this wide range of BlackBerry devices/versions: BlackBerry8100/4.2.0
We’ve logged it with BlackBerry. I’ll post an update if we receive any response.