Wednesday, December 16, 2015

What's good logging for microservice-oriented components?

Motivation

I've noticed very different approaches to logging across all of the micro-services we currently have at Sovos (which are a lot at this point).

Prior to our move to a micro-services model based on our service bus, we did have a big portion of Sovos's architecture designed around the concept of coarser services. Nevertheless, we had a pretty good and comprehensive logging model, and most services logged enough to allow efficient and successful troubleshooting without having to reproduce problems in Development or use a debugger.

My perception now is that with our move to micro-services we are losing our grip on disciplined logging instrumentation in the services.

Let's start with a statement I want to be true for all micro-services we produce (essentially a desired future state):

If it takes more than five minutes of troubleshooting to get a rough idea of where a problem might be in a faulty service, the service lacks proper logging.

That's what we have to shoot for.

Since we moved to the micro-services model I've observed countless troubleshooting sessions that go on for hours, until someone finally has to put the service in a debugger just to figure out it had a bad initialization parameter, or something as simple as that...
It doesn't have to be like that.

Logging best practice #1


Every service should follow the best practice of logging, at startup, ALL of its initialization parameters (or critical commands), and of logging its shutdown. Use LogLevel=0 for this (meaning "always log").

This is an example of the initialization and initialization-parameter logs from our rxfile service:

11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:53 PM: Status:INFO -- Details=--        Remote Execution Module - File Manager        --
11/12/2015 1:53:53 PM: Status:INFO -- Details=CS: J3Y3MAJWPYBGJ250JDOEVQOEUP3LGXOSVSLADYCX55QRIILPDJPYLXVEVBDM10FHK24JNWNIR5TPKP
11/12/2015 1:53:53 PM: Status:INFO -- Details=Target Path: X:\Cerem\InBox
11/12/2015 1:53:53 PM: Status:INFO -- Details=Log Level: 5
11/12/2015 1:53:53 PM: Status:INFO -- Details=Restart Time: 06:00
11/12/2015 1:53:53 PM: Status:INFO -- Details=Timeout interval: 1000
11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------

As you can see, those are "INFO" log entries with the critical parameters for this particular application.
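
In code, this can be as simple as a dedicated routine called once from the service's startup path. Here's a minimal sketch of the idea in Delphi (the Log routine, the TRxFileService class, and the field names are hypothetical, not the actual rxfile implementation):

procedure TRxFileService.LogStartupParameters;
begin
  // LogLevel=0 entries are always written, regardless of the configured level
  Log(0, '----------------------------------------------------------');
  Log(0, '--        Remote Execution Module - File Manager        --');
  Log(0, 'Target Path: ' + FTargetPath);
  Log(0, 'Log Level: ' + IntToStr(FLogLevel));
  Log(0, 'Restart Time: ' + FRestartTime);
  Log(0, 'Timeout interval: ' + IntToStr(FTimeoutInterval));
  Log(0, '----------------------------------------------------------');
end;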

Shutdown log example:

11/12/2015 1:53:51 PM: Status:INFO -- Details=Stopping service
11/12/2015 1:53:51 PM: Status:INFO -- Details=Main timer disabled
11/12/2015 1:53:51 PM: Status:INFO -- Details=Waiting for execution queues
11/12/2015 1:53:51 PM: Status:INFO -- Details=Waiting for execution queues completed
11/12/2015 1:53:51 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:51 PM: Status:INFO -- Details=ServiceStop event complete
11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------

Notice the significant events logged upon shutdown. There's a reason for that: they allow you to troubleshoot the issues that commonly appear whenever a program has to shut down asynchronous queues or threads (who hasn't had a deadlock when trying to shut down a threaded process?).
With that informational logging on top of the shutdown logging, you can trace how every piece of the application gets shut down.
In the case of our micro-services, when running in console mode, we can capture the Ctrl-C key combination with a simple Windows hook and initiate a clean shutdown.
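
A minimal sketch of that hook, using the SetConsoleCtrlHandler Windows API (InitiateCleanShutdown is a hypothetical stand-in for whatever routine the service's normal stop path runs):

uses Windows;

function ConsoleCtrlHandler(CtrlType: DWORD): BOOL; stdcall;
begin
  Result := False;
  if CtrlType = CTRL_C_EVENT then
  begin
    InitiateCleanShutdown; // hypothetical: the same clean shutdown path the service stop event uses
    Result := True;        // tell Windows we handled it, so the process isn't killed abruptly
  end;
end;

// During console-mode startup:
SetConsoleCtrlHandler(@ConsoleCtrlHandler, True);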



The other thing I see often is exceptions occurring in services that are simply absorbed by the code with no logging whatsoever. I've done this myself in a couple of places, but typically I will at least put a "// ToDo: needs logging" comment or something of that nature.
We need to be vigilant about code that could absorb exceptions, to make sure there's proper logging in place.
A final thought on that: hybrid micro-services will have exceptions serialized back to the caller. That's OK, as long as the caller logs the exception if it can't properly handle it.

Logging best practice #2


Ensure that ALL unhandled exceptions are logged as errors. Use LogLevel=2 for this, unless the situation is fatal; use LogLevel=1 for fatal errors.

Example from rxfile service:

11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -QAINEWPKD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -TWED--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -DEVORAD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -DEVNEWKEYD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -NCOLLIND--

Those are log entries caused by tables missing from the target schema, a problem external to the service itself. The service can take the pounding of those exceptions and keep going fine, but it logs the problem nevertheless; hopefully someday someone will read the log and figure out that they have to either remove the entries so rxfile doesn't process those domains, fix the configuration, or add the missing tables.
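
The code pattern behind these entries could be as simple as the following sketch (Log, TCleaningAgent, and CleanEnvironment are hypothetical names; the point is that the catch-all handler logs before moving on instead of silently absorbing the exception):

procedure TCleaningAgent.ProcessEnvironment(const EnvName: string);
begin
  try
    CleanEnvironment(EnvName); // hypothetical unit of work against one schema
  except
    on E: Exception do
      // LogLevel=2: an error, but not fatal; the agent moves on to the next environment
      Log(2, Format('Exception received from the Agent "%s" processing environment "%s"',
        [E.Message, EnvName]));
  end;
end;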



The other situation I see happening is exceptional conditions that "somehow" are supposed to be handled within the code, and are therefore "hidden" from the developer's view when the software is running, but that in the long run end up derailing the program with a crash, or with errors that become hard to track unless you can log the trace of these unfortunate (handled) events.

Logging best practice #3


Log handled exceptions that could be "of interest" in the future to troubleshoot issues. Use LogLevel=3 or 4 for this, depending on how bad the handled exception is.

An example of this is a piece of code, again in the rxfile service, where a file being accessed is locked for exclusive access by another process, so the service will retry a few times. Again, this is part of the normal process, so I think we should write this log entry only with LogLevel=4:

11/12/2015 1:59:20 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000000-post
11/12/2015 1:59:20 PM: Status:INFO -- Details=Failed to open file -retry no- 1-

That simple log entry can help you troubleshoot potential issues with algorithms that rely on capturing certain exceptions to perform retries (or not).
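
A minimal sketch of what such a retry loop could look like (Log, MAX_RETRIES, and the delay are hypothetical; the actual rxfile code may differ):

const
  MAX_RETRIES = 5;

function TryOpenWithRetries(const FileName: string): TFileStream;
var
  Attempt: Integer;
begin
  Result := nil;
  Attempt := 0;
  repeat
    Inc(Attempt);
    try
      Result := TFileStream.Create(FileName, fmOpenRead or fmShareExclusive);
    except
      on E: EFOpenError do
      begin
        // A locked file is an expected, handled condition: log it at LogLevel=4
        Log(4, Format('Failed to open file (retry no. %d)', [Attempt]));
        if Attempt >= MAX_RETRIES then
          raise; // out of retries: now it really is an error, let it propagate
        Sleep(500); // hypothetical pause before the next attempt
      end;
    end;
  until Result <> nil;
end;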




Next up on my list is what I will call "major functions". Many times micro-service applications run big processes "in-line" as a direct response to a request, or asynchronously on a separate thread (some kind of garbage collection or resource cleanup of any sort).
When possible, it is good to log those, as it is not uncommon for these "large processes" to destabilize an application and make it crash. Many times you don't have a trace of the crash itself, but you do know what happened immediately before it...

Logging best practice #4


Log "big or special operations" of any kind. Use LogLevel=4 for this.

An example of these "big" operations is the file cleanup task that the rxfile service performs:

11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:54 PM: Status:INFO -- Details=Remote File Cleaning Agent running---
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -QAINEWPKD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -TWED--

Notice the interesting fact that there are errors coming right after the "Remote File Cleaning Agent running" entry. By merit of having that entry logged there, it's not hard to assess that those errors were in fact triggered as part of the execution of the file cleaning agent.
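
The pattern is simply to announce the big operation before it starts, so that anything logged afterwards can be attributed to it. A minimal sketch (all names here are hypothetical):

procedure TRxFileService.RunCleaningAgent;
begin
  // LogLevel=4: announce the "big" operation so subsequent entries can be attributed to it
  Log(4, 'Remote File Cleaning Agent running...');
  CleanRemoteFiles; // hypothetical: the actual cleanup work, which logs its own errors
  Log(4, 'Remote File Cleaning Agent finished');
end;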



Bear with me. I have one more.

Next up is logging to assist complex troubleshooting efforts. This is about logging pretty much everything, with enough granularity to add value when troubleshooting. It's very important to think about the granularity piece: you definitely don't want to be hitting the log function inside a loop that iterates 100,000 times adding numbers, because you will spend more time evaluating whether or not to log at the configured LogLevel than doing actual work.

Logging best practice #5


Log everything else that could help troubleshoot intricate issues, with fine granularity. Use LogLevel=5 for this.

You want to have a "verbose" mode of operation to troubleshoot the most intricate aspects of the application. Think of this logging level as the closest thing you will have to debugging in your IDE. Remember, you don't have access to your IDE debugger in Production, so you had better instrument your application to be an effective troubleshooting tool when the issues arise (and they will arise...).
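
Because verbose entries fire so frequently, it pays to guard them with a cheap level check so that a disabled verbose mode costs almost nothing. A minimal sketch (FLogLevel and Log are hypothetical):

// When FLogLevel < 5 we skip even building the message string
if FLogLevel >= 5 then
  Log(5, 'File found: ' + FileName);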

Here's an example of how the rxfile service logs each of the two actions it performs on files: opening and processing them, then deleting them:

11/12/2015 1:59:27 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000014-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File deleted: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000011-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File deleted: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000012-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000015-post
11/12/2015 1:59:28 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000016-post


Recommendation

Every time you go through a troubleshooting session, after fixing the issue you should think about what logging you could have added to find the issue faster in the first place. And when you identify it, get it implemented right away, in the current sprint or at most in the next one.

Tuesday, October 13, 2015

Software Engineering - Naming things

Naming things: maybe one of the toughest things to do in software engineering, and surely the single most important practice of all.

Check this out: http://martinfowler.com/bliki/TwoHardThings.html

Is it not true indeed that naming things is one of the two hardest things in software engineering, the other being cache invalidation?

I believe good naming, or why not, great naming, not only makes software easier to read, maintain, and extend, but also simpler to properly craft in the first place.

Good naming exposes intent, and with intent clearly exposed we can see the structure of what we are building and assess whether it makes sense or not. Good naming makes refactoring and evolutionary development possible; good naming allows patterns to emerge.

Bad naming typically tends to hide things, creating the illusion of "maybe not so bad software" behind the scenes. With bad naming you definitely don't see good software; but worst of all, you may not see the terribly bad software hiding under the surface.

So, what's good naming then? I was going to write my own "naming" rules, but I came across these two great articles that do a much better job at the task than I could:

http://parallelcross.com/post/36861185276/naming-strategies-uncle-bob-and-ottinger

http://www.objectmentor.com/resources/articles/Naming.pdf

Enjoy

C++ and automatic objects misconceptions

Let's start this blog entry with a disclaimer: I'm not an expert, or even an experienced, C++ developer.

Now, I do know a thing or two about C++, particularly about some of the primitives of the language, though not so much about its standard library.

One thing in particular I always liked is the idea of offering developers two models of object allocation:
  • "automatic" objects
  • "explicitly" allocated objects
Before we jump into automatic objects, just a note on the second bullet point, what I call "explicitly" allocated objects.
The reason I call that model "explicitly" allocated objects rather than, say, dynamically allocated or heap-allocated objects, is that the main difference between that model and "automatic" objects is that the developer tells the compiler, with no room for interpretation, that a new object will be allocated with memory provided by the new operator and that a particular constructor will be called.
There's no way around it.

Automatic objects, on the other hand, leave a lot of room for the compiler to decide how the final object is going to be provided to the developer.

With that said, the motivation for this post is that I see a lot of confusion among C++ developers about exactly how the compiler resolves certain constructs, leading to decisions that may not be the most efficient or cleanest ones when crafting code. In all honesty, most of this blog post is a small piece of code I wrote exercising some typical, commonly used constructs for creating objects, assigning them to variables, and passing them as function parameters. I wrote the code to answer questions I had myself, in a few cases with assumed answers (that were wrong!), and the C++ compiler surprised me positively in every one of the cases.

Without further ado, this is the test code (reconstructed here to match the console output further below; myString and iReceiveAMyString are the names taken from that output, while the member details are educated guesses):
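
// Reconstruction of the test code, inferred from the console output below.
// myString and iReceiveAMyString appear in that output; everything else,
// including the buffer size and exact messages, is an educated guess.
#include <cstdio>
#include <cstring>

class myString {
  char m_value[64];
public:
  explicit myString(const char* s) {
    strncpy(m_value, s, sizeof(m_value) - 1);
    m_value[sizeof(m_value) - 1] = '\0';
    printf("myString created\n");
    print(); // announce the new object's value and address
  }
  myString(const myString& other) {
    strcpy(m_value, other.m_value);
    printf("myString created using copy constructor\n");
    print();
  }
  myString& operator=(const myString& other) {
    strcpy(m_value, other.m_value);
    printf("myString operator= called\n");
    return *this;
  }
  ~myString() { printf("myString destroyed\n"); }
  void print() const {
    // Printing "this" reveals whether two variables share one object
    printf("%s my address: %lu\n", m_value, (unsigned long)(size_t)this);
  }
};

void iReceiveAMyString(myString s) { // takes its parameter by value, forcing a copy
  printf("iReceiveAMyString called\n");
}

int main() {
  printf("vvvvvvv\nExample with only one copy of myString class...\n");
  {
    myString str("hello 0");
    str.print(); // same address the constructor printed: one object only
  }
  printf("^^^^^^^\nvvvvvvv\nEven invoking as a constructor creates only one copy...\n");
  {
    myString str = myString("hello 1"); // the copy is elided: still one object
    str.print();
  }
  printf("^^^^^^^\nvvvvvvv\nWhen really copying into an uninitialized object copy constructor is finally used...\n");
  {
    myString str("hello 2");
    printf("now we will assign str to str2\n");
    myString str2 = str; // initializing a brand-new object: copy constructor runs
    str.print();
    str2.print(); // different address: two objects now
  }
  printf("^^^^^^^\nvvvvvvv\nWhat happens if we invoke function that doesn't use & operator?...\n");
  {
    myString str("hello 3");
    iReceiveAMyString(str); // pass by value: the copy constructor runs again
  }
  printf("^^^^^^^\nvvvvvvv\nWhen overwriting an object, assignment operator will be called...\n");
  {
    myString str("hello A 4");
    myString str2("hello B 5");
    printf("let's print str2\n");
    str2.print();
    printf("now we will assign str to str2\n");
    str2 = str; // both objects already exist: operator= runs, no construction
    printf("let's print str and after that str2\n");
    str.print();
    str2.print(); // same address as before, but now holding str's value
  }
  printf("^^^^^^^\n");
  return 0;
}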


And this is the console output if you run the program:

vvvvvvv
Example with only one copy of myString class...
myString created
hello 0 my address: 2030296
hello 0 my address: 2030296
myString destroyed
^^^^^^^
vvvvvvv
Even invoking as a constructor creates only one copy...
myString created
hello 1 my address: 2030252
hello 1 my address: 2030252
myString destroyed
^^^^^^^
vvvvvvv
When really copying into an uninitialized object copy constructor is finally used...
myString created
hello 2 my address: 2030208
now we will assign str to str2
myString created using copy constructor
hello 2 my address: 2030164
hello 2 my address: 2030208
hello 2 my address: 2030164
myString destroyed
myString destroyed
^^^^^^^
vvvvvvv
What happens if we invoke function that doesn't use & operator?...
myString created
hello 3 my address: 2030120
myString created using copy constructor
hello 3 my address: 2029768
iReceiveAMyString called
myString destroyed
myString destroyed
^^^^^^^
vvvvvvv
When overwriting an object, assignment operator will be called...
myString created
hello A 4 my address: 2030076
myString created
hello B 5 my address: 2030032
let's print str2
hello B 5 my address: 2030032
now we will assign str to str2
myString operator= called
let's print str and after that str2
hello A 4 my address: 2030076
hello A 4 my address: 2030032
myString destroyed
myString destroyed
^^^^^^^

Hopefully this depicts clearly how efficient the C++ compiler is in its dealings with automatic objects. No unnecessary copies, not a single one; even when returning an automatic object from a function, the storage used at construction time was the target storage location.

Think very carefully before using new()-allocated objects; it may not be worth it in many cases. Not to mention that you will typically be incurring expensive heap allocations.

Sunday, March 15, 2015

How to properly support Windows Fibers in Delphi without breaking exception handling

If you are giving Windows Fibers a shot under Delphi, you have probably faced the issue of how to properly handle Delphi's structured exception handling AND switch between fibers transparently.

Assuming you already know what Fibers are, how to use them, and when to use them, I will cut to the chase. If you don't know, or don't know enough, I recommend the following reads to get started:


https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661%28v=vs.85%29.aspx
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx


When exceptions happen in a Delphi program, the compiler generates code to store nested exceptions on an exception stack. This stack is kept in a TLS (thread-local storage) slot allocated by the program upon startup.

When using Delphi compiled for 32 bits, the issue of persisting and restoring this stack is trivial. Up to the latest versions of Delphi, the RTL exposes a couple of functions in the System unit that allow you to do the trick in two lines of code (literally).
The challenge with Delphi programs compiled for 64 bits is that Delphi no longer exposes an API to persist and restore the exception stack. It's not hard to overcome this, but it requires a bit of reverse engineering and watching what the compiler does with the CPU debugger enabled.
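
For reference, the 32-bit trick looks roughly like this (RaiseList and SetRaiseList are the two System-unit functions in question; the fiber record and its SavedRaiseList field are hypothetical names for this sketch):

// Before switching away, remember this fiber's exception stack.
CurrentFiber.SavedRaiseList := RaiseList;
SwitchToFiber(TargetFiber.Handle);
// Execution resumes here when some fiber switches back to us:
// put our saved exception stack back in place.
SetRaiseList(CurrentFiber.SavedRaiseList);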
I'll spare you the pain; below you will find a unit that does the trick:

uWin64ExceptionStack.pas


Here's the full code that I use for switching between fibers:



Before I forget!!!

Here's a sample test that performs a SwitchToFiber() inside an exception handler block. In fact, it's a triple-nested exception handler block, to test the mechanism of persisting and restoring more than just one element of the exception stack.

I hope this helps you on your adventure of implementing fibers in Delphi.

A note on testing the approach:

Code was verified to work properly on:
Delphi 2007
Delphi XE4 running in WIN32 and WIN64

A special note about Delphi 5: SetRaiseList was totally broken. When used, it will send the application into a tailspin of access violations. See the fixed code, mainly borrowed from the Delphi 2007 implementation.