Wednesday, December 16, 2015

What's good logging on microservice-oriented components

Motivation

I've noticed very different approaches to logging across the micro-services we currently have at Sovos (and there are a lot of them at this point).

Prior to our move to a micro-services model based on our service bus, a big portion of Sovos's architecture was designed around coarser services. Nevertheless, we had a pretty good and comprehensive logging model, and most services logged enough to allow efficient and successful troubleshooting without having to reproduce problems in Development or attach a debugger.

I get the perception now that with our move to micro-services we are losing our grip on disciplined logging instrumentation in our services.

Let's start with a statement I want to be true for all the micro-services we produce (essentially a desired future state):

If it takes more than 5 minutes of troubleshooting to get a rough idea of where a problem might be in a faulty service, the service lacks proper logging.

That's what we have to shoot for.

Since we moved to the micro-services model I've observed countless troubleshooting sessions that go on for hours, until someone finally has to put the service in a debugger just to figure out it had a bad initialization parameter, or something as simple as that...
It doesn't have to be like that.

Logging best practice #1


Every service should follow the best practice of logging ALL initialization parameters (or critical commands) at startup, and logging its shutdown. Use LogLevel=0 for this (meaning "always log").

Here's an example of the startup and initialization-parameter logs from our rxfile service:

11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:53 PM: Status:INFO -- Details=--        Remote Execution Module - File Manager        --
11/12/2015 1:53:53 PM: Status:INFO -- Details=CS: J3Y3MAJWPYBGJ250JDOEVQOEUP3LGXOSVSLADYCX55QRIILPDJPYLXVEVBDM10FHK24JNWNIR5TPKP
11/12/2015 1:53:53 PM: Status:INFO -- Details=Target Path: X:-Cerem-InBox
11/12/2015 1:53:53 PM: Status:INFO -- Details=Log Level: 5
11/12/2015 1:53:53 PM: Status:INFO -- Details=Restart Time: 06:00
11/12/2015 1:53:53 PM: Status:INFO -- Details=Timeout interval: 1000
11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------

As you can see, those are "INFO" log entries with the critical parameters for this particular application.
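As a sketch of how a service might emit that kind of startup banner, here's a small Python example. The `rxfile` logger name and the parameter set are illustrative stand-ins, not the service's actual code:

```python
import logging

# Hypothetical logger name; your service would use its own.
logger = logging.getLogger("rxfile")

def log_startup(params: dict) -> None:
    """Log every initialization parameter at service start.

    This corresponds to LogLevel=0 ("always log"): it should never
    be filtered out, no matter what level the service is configured at.
    """
    logger.info("-" * 58)
    logger.info("--        Remote Execution Module - File Manager        --")
    for name, value in params.items():
        logger.info("%s: %s", name, value)
    logger.info("-" * 58)

# Example parameters, mirroring the kind of entries shown above.
log_startup({"Log Level": 5, "Restart Time": "06:00", "Timeout interval": 1000})
```

The point is that the banner is unconditional: anyone reading the log after an incident can see exactly what configuration the service started with.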

Shutdown log example:

11/12/2015 1:53:51 PM: Status:INFO -- Details=Stopping service
11/12/2015 1:53:51 PM: Status:INFO -- Details=Main timer disabled
11/12/2015 1:53:51 PM: Status:INFO -- Details=Waiting for execution queues
11/12/2015 1:53:51 PM: Status:INFO -- Details=Waiting for execution queues completed
11/12/2015 1:53:51 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:51 PM: Status:INFO -- Details=ServiceStop event complete
11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------

Notice the significant events logged upon shutdown ("Waiting for execution queues" and its completion). There's a reason for that: they let you troubleshoot the common issues that arise whenever a program has to shut down asynchronous queues or threads (who hasn't hit a deadlock when trying to shut down a threaded process?).
With that informational logging on top of the shutdown logging, you can trace how every piece of the application shuts down.
In the case of our micro-services, when running in console mode, we can capture the Ctrl-C key combination with a simple Windows hook and initiate a clean shutdown.



The other thing I often see is exceptions occurring in services that are simply absorbed by the code with no logging whatsoever. I've done this myself in a couple of places, but typically I will at least leave a "// ToDo: needs logging" comment or something along those lines.
We need to be vigilant about code that could absorb exceptions and make sure proper logging is in place.
A final thought on that: hybrid micro-services will have exceptions serialized back to the caller. That's OK, as long as the caller logs the exception if it can't properly handle it.
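A minimal illustration of the difference between silently absorbing an exception and absorbing it with a trace (Python; the function and logger names are hypothetical):

```python
import logging
from typing import Optional

logger = logging.getLogger("rxfile")  # hypothetical logger name

def parse_optional_config(text: str) -> Optional[int]:
    """Parse an optional numeric setting, tolerating bad input.

    Bad:  `except ValueError: pass` hides the problem completely.
    Good: absorb the exception if the service can keep going, but leave
    a trace so troubleshooting doesn't require a debugger later.
    """
    try:
        return int(text)
    except ValueError:
        logger.warning("Ignoring bad config value %r; using default", text)
        return None
```

The service still degrades gracefully either way; the only difference is whether the log tells you it happened.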

Logging best practice #2


Ensure that ALL unhandled exceptions are logged as errors. Use LogLevel=2 for this, unless the situation is fatal; use LogLevel=1 for fatal errors.

Example from rxfile service:

11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -QAINEWPKD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -TWED--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -DEVORAD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -DEVNEWKEYD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -NCOLLIND--

Those log entries are caused by tables missing from the target schema. The service can take the pounding of those exceptions and keep going just fine, but it logs the problem nevertheless. Hopefully someone will someday read the log and figure out that they have to either remove the entries so rxfile doesn't process those domains, fix the configuration, or add the missing tables.
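One way to guarantee the "log ALL unhandled exceptions" rule is a single top-level hook installed at startup. Here's a sketch using Python's `sys.excepthook`, where the mapping of LogLevel=1/2 onto CRITICAL/ERROR (and MemoryError as the "fatal" example) is my assumption, not part of the original service:

```python
import logging
import sys

logger = logging.getLogger("rxfile")  # hypothetical logger name

def log_unhandled(exc_type, exc, tb):
    """Last-chance handler: nothing escapes the process without a log entry.

    Roughly: LogLevel=1 (fatal) -> CRITICAL, LogLevel=2 (error) -> ERROR.
    """
    if issubclass(exc_type, MemoryError):
        logger.critical("Fatal: %s", exc, exc_info=(exc_type, exc, tb))
    else:
        logger.error("Unhandled exception: %s", exc, exc_info=(exc_type, exc, tb))

# Install once at startup; anything that escapes the main loop gets logged
# with its full traceback before the process dies.
sys.excepthook = log_unhandled
```

The equivalent in a Windows service would sit in the top-level request/dispatch loop, but the idea is the same: one choke point, zero silent crashes.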



The other situation I see happening is exceptional conditions that are "somehow" supposed to be handled within the code, and are therefore "hidden" from the developer's view while the software is running, but that in the long run end up derailing the program with a crash or with errors that become hard to track, unless you log the trace of these unfortunate (handled) events.

Logging best practice #3


Log handled exceptions that could be "of interest" for troubleshooting future issues. Use LogLevel=3 or 4 for this, depending on how bad the handled exception is.

An example of this is a piece of code, again in the rxfile service, where a file being accessed is locked for exclusive access by another process, so the service will retry a few times. Since this is a normal part of the process, we emit this log entry only at LogLevel=4:

11/12/2015 1:59:20 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000000-post
11/12/2015 1:59:20 PM: Status:INFO -- Details=Failed to open file -retry no- 1-

That simple log entry can help you troubleshoot potential issues with algorithms that rely on catching certain exceptions to decide whether to retry.
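The retry-with-logging pattern above can be sketched like this (Python; `open_with_retry`, the retry count, and the messages are illustrative, not rxfile's actual code):

```python
import logging
import time

logger = logging.getLogger("rxfile")  # hypothetical logger name

def open_with_retry(open_fn, retries=3, delay=0.5):
    """Retry an operation that can fail transiently (e.g. a locked file).

    Each handled failure is logged so the retry behavior is visible in
    production logs (LogLevel=4 territory in the scheme described above).
    """
    for attempt in range(1, retries + 1):
        try:
            return open_fn()
        except OSError as exc:
            logger.info("Failed to open file (retry no. %d): %s", attempt, exc)
            time.sleep(delay)
    raise OSError("gave up after %d retries" % retries)
```

Without those per-attempt entries, a retry loop that eventually succeeds is invisible, and one that eventually gives up looks like a single mysterious failure.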




Next up on my list is what I will call "major functions". Many times micro-service applications run big processes "in-line" as a direct response to a request, or asynchronously on a separate thread (some kind of garbage collection or resource cleanup of any sort).
When possible, it is good to log those, as it is not uncommon for these "large processes" to destabilize an application and make it crash. Many times you don't have a trace of the crash itself, but you know what happened immediately before it...

Logging best practice #4


Log "big or special operations" of any kind. Use LogLevel=4 for this.

An example of these "big" operations is a file cleanup operation that the rxfile service performs:

11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:54 PM: Status:INFO -- Details=Remote File Cleaning Agent running---
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -QAINEWPKD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -TWED--

Notice the interesting fact that there are errors coming after the "Remote File Cleaning Agent running" entry. By merit of having that entry logged there, it's not hard to assess that those errors were in fact triggered as part of the execution of the file cleaning agent.
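That attribution trick is really just bracketing the big operation with log entries. A sketch (Python; the `run_agent` helper is hypothetical):

```python
import logging
import time

logger = logging.getLogger("rxfile")  # hypothetical logger name

def run_agent(name, fn):
    """Bracket any 'big' operation with log markers.

    Errors emitted while `fn` runs land between the two markers, so they
    can be attributed to this operation just by reading the log.
    """
    logger.info("%s running...", name)
    start = time.monotonic()
    try:
        return fn()
    finally:
        logger.info("%s finished in %.1fs", name, time.monotonic() - start)
```

The closing marker (with elapsed time) has a bonus: if the operation destabilizes the process, the missing "finished" line tells you exactly which big operation was in flight when things went wrong.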



Bear with me. I have one more.

Next up is logging to assist complex troubleshooting efforts. This is about logging pretty much everything, with enough granularity to add value when troubleshooting. It's very important to think about the granularity piece: you definitely don't want to be hitting the log function inside a loop that iterates 100,000 times adding numbers, because you will spend more time deciding whether to log at the configured LogLevel than doing actual work.

Logging best practice #5


Log fine-grained, step-by-step operations in "verbose" mode. Use LogLevel=5 for this.

You want to have a "verbose" mode of operation to troubleshoot the most intricate aspects of the application. Think of this logging level as the closest thing you will have to debugging in your IDE. Remember, you don't have access to your IDE's debugger in Production, so you'd better instrument your application to be an effective troubleshooting tool when issues arise (and they will arise...).
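One cheap way to keep verbose logging affordable is to check the configured level once, outside hot loops, rather than on every iteration. A sketch using Python's `logging` (with DEBUG standing in for LogLevel=5; the function is illustrative):

```python
import logging

logger = logging.getLogger("rxfile")  # hypothetical logger name

def process_files(files):
    """Verbose per-file logging, gated on the configured level.

    The level check is hoisted out of the loop: when verbose mode is off,
    the loop pays nothing per iteration for logging.
    """
    verbose = logger.isEnabledFor(logging.DEBUG)
    for path in files:
        if verbose:
            logger.debug("File found: %s", path)
        # ... open and process the file, then delete it ...
```

In verbose mode you get the per-file trail shown above; in normal operation the only cost is a single level check per batch.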

Here's an example of how the rxfile service logs both of the actions it performs on files: open and process them, then delete them:

11/12/2015 1:59:27 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000014-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File deleted: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000011-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File deleted: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000012-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000015-post
11/12/2015 1:59:28 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000016-post


Recommendation

Every time you go through a troubleshooting session, after fixing the issue think about what logging you could have added to find the issue faster in the first place. Then get that logging implemented right away, in the current sprint or at most the next one.