Wednesday, March 5, 2014

Never, ever dismiss the most innocent failing test on your continuousintegration system!

Situation

As part of a project in C that requires some threading primitives, I wrote a piece of code to abstract a Windows timer into a small C class (yes, C class... please don't bash me here for mixing "classes" with plain C code, that will be material for another discussion).

You can find the header file of the basic classes provided by this library here:

https://gist.github.com/jsbattig/9379060

This timer primitive uses Windows threadpool timers. They are kind of nice because they have nothing to do with traditional windows handle linked timers which require a message loop marshaling messages sent to the handle neither they are "multimedia timers". Besides, this library already required thread pools so adding support for threadpool timers was a natural extension to the library itself.

By looking at Windows API documentation I find that to get one of these timers going, besides the fact you need a callback environment variable pre-initialized (or NULL) you need to use the following two APIs:

CreateThreadpoolTimer
SetThreadpoolTimer

SetThreadpoolTimer will actually "start" the timer.

The key parameters this function receives are the timer handle, a "due time" and a recurrence parameter or interval. There's a fourth parameter named msWindowLength which we don't particularly care at this point.

The due time, has the oddity to be the number of units of 100 nanoseconds since January 1, 1601 (UTC) expressed in FILETIME format (I personally found this kind of odd, for a timer... but anyway, MS folks did it like this for some reason I imagine). This parameter, if negative, can also mean a relative time since current time.
The msPeriod parameter is expressed in milliseconds. Why one parameter denoting timer timing in FILETIME and the other in milliseconds? Don't know...

Anyway, at first sight it seemed pretty straightforward, so I crafted the code bellow to get my timer created and going:


So, as you can quickly see, I missed the boat from the get going since I set the FILETIME parameter to be the actual interval in milliseconds converted to 100 nanosecond units since 0 (zero) time.

Here's where Mr Jenkins came to the rescue!
There's a particular test I run to exercise the functioning of this timer:

This test failed once today just before I wrote this blog entry.
This is the GoogleTest jenkins entry for the failure:

c_driver_SSvcBusThreadPoolTest_Win32_Release_singleMongo.SvcBusConsumer_testThreadPoolTimer

 Error Details

Value of: SvcBusThreadPoolTimer_getThreadId( timer)
  Actual: 24652
Expected: 0

 Stack Trace

threadpool_unittest.cc:132
Value of: SvcBusThreadPoolTimer_getThreadId( timer)
  Actual: 24652
Expected: 0

I could have easily dismiss it as a "fluke" or some kind of oddity, maybe the CPU was too busy and the program took to long to go from the creation of the timer to the actual check that's why the threadId recorded was != 0 (even tough the timer was created with 100ms due time).

I don't believe in "oddities" anymore at this point in my career and haven't done so for a while...

So, I dug a little bit to find the obvious which I pretty much imagined since I saw the Jenkins log.

The timer was kicking right away because I pass a positive number not adjusted to January 1st 1601. Now, I did write code to do the adjustment and showed the new working code to a colleague only for him to quickly point me down after checking the documentation why I didn't use the "relative to current" feature. Always good to have another pair of eyes checking at your code to uncover your own naiveness!

So, finally I did that, which by the way it was a really easy solution. All it took was casting the millis parameter to __int64 and negate it. That's it!

Solution

Here's the fixed code:

That's it. That fixes our "oddity", our "cosmic ray flipping that bit to zero", our bad bad Jenkins making our perfectly good code and perfectly written test to fail.

No, there's no oddities, cosmic rays, bad Jenkins or slow CPUs (maybe sometimes there's slow CPUs). A failing test is a failing test, it's signaling something. Either the test is flawed, or there's a bug hiding somewhere behind it.

Happy coding!