Wednesday, December 16, 2015

What's good logging on microservice-oriented components

Motivation

I've noticed very different approaches to logging on all of the micro-services we currently have at Sovos (which are a lot a this point).

Prior to our move to micro-services model based on our service-bus, we did have a big portion of Sovos's architecture designed around the concept of coarser services. Nevertheless, we did have pretty good and comprehensive logging model, and most services did log enough to be able to perform efficient and successful troubleshooting without requiring to reproduce problems in Development nor use a debugger.

I get the perception now that with our move to micro-services we are losing the grip on disciplined logging instrumentation on the services.

Let's start with a statement I want to be true for all micro-services we produce (this is a desire of future state essentially):

If it takes more than 5 minutes of troubleshooting to have a rough idea where a problem might be on a faulty service. The service lacks proper logging.

That's what we have to shoot for.

Since we moved to the micro-services model I've observed countless troubleshooting sessions that go for hours, until someone finally has to put the service in a debugger just to figure out it had a bad initialization parameter or something as simple as that...
It doesn't have to be like that.

Logging best practice #1


Every service should follow the best practice of logging at startup ALL initialization parameters (or critical commands) and shutdown. Use LogLevel=0 for this (meaning "always log").

This the example from our rxfile service initialization and initialization parameters logs:

11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:53 PM: Status:INFO -- Details=--        Remote Execution Module - File Manager        --
11/12/2015 1:53:53 PM: Status:INFO -- Details=CS: J3Y3MAJWPYBGJ250JDOEVQOEUP3LGXOSVSLADYCX55QRIILPDJPYLXVEVBDM10FHK24JNWNIR5TPKP
11/12/2015 1:53:53 PM: Status:INFO -- Details=Target Path: X:-Cerem-InBox
11/12/2015 1:53:53 PM: Status:INFO -- Details=Log Level: 5
11/12/2015 1:53:53 PM: Status:INFO -- Details=Restart Time: 06:00
11/12/2015 1:53:53 PM: Status:INFO -- Details=Timeout interval: 1000
11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------

As you can see, those are "INFO" log entries with the critical parameters for this particular application.

Shutdown log example:

11/12/2015 1:53:51 PM: Status:INFO -- Details=Stopping service
11/12/2015 1:53:51 PM: Status:INFO -- Details=Main timer disabled
11/12/2015 1:53:51 PM: Status:INFO -- Details=Waiting for execution queues
11/12/2015 1:53:51 PM: Status:INFO -- Details=Waiting for execution queues completed
11/12/2015 1:53:51 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:51 PM: Status:INFO -- Details=ServiceStop event complete
11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------

Notice the highlighted significant events logged upon shutdown. There's a reason for that. They allow to troubleshoot common issues every time a program has to deal with asynchronous queues or threads shutdown (who didn't have a deadlock when trying to shutdown a threaded process?).
With that informational logging on top of the shutdown logging you can trace how every piece of the application gets shutdown.
In the case of our micro-services, when running in console mode, we can capture Ctrl-C key combination with a simple Windows hook and initiate a clean shutdown.



The other thing I see often, is exceptions occurring on services that are simply absorbed by the code with no logging whatsoever. I've done this myself in a couple of places, but typically I will put at least a "// ToDo: needs logging" or something along that nature.
We need to be vigilant to code that could absorb exceptions to make sure there's proper logging in place.
A final though on that, hybrid micro-services will have exceptions serialized back to the caller. That's OK, as long as the caller logs the exception if it can't properly handled it.

Logging best practice #2


Ensure that ALL unhandled exceptions are logged as errors. Use LogLevel=2 for this unless the situation is Fatal, use LogLevel=1 for fatal errors.

Example from rxfile service:

11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -QAINEWPKD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -TWED--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -DEVORAD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -DEVNEWKEYD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -NCOLLIND--

Those are logging entries caused by external missing tables on that target schema. The service can take the pounding of those exceptions and keep going fine, but it will log the problem nevertheless, hopefully some day maybe someone will read the log and figure out that either has to remove the entries so rxfile doesn't process those domains, fix the configuration or add the missing tables.



The other situation I see happening is exceptional situations that "somehow" are supposed to be handled within the code, therefore "hidden" from developer view when the software is running, but in the long run end up derailing the program with a crash or errors that become hard to track, unless you can log the trace of this unfortunate (handled) events.

Logging best practice #3


Log handled exceptions that could be "of interest" in the future to troubleshoot issues. Use LogLevel=3 or 4 for this depending on how bad the handled exception.

An example of this is a piece of code, again in rxfile service, where a file being accessed is locked for exclusive access by another process, therefore the service will retry a few times. Again, this is part of the process, so this log entry I think we do it only with loglevel=4:

11/12/2015 1:59:20 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000000-post
11/12/2015 1:59:20 PM: Status:INFO -- Details=Failed to open file -retry no- 1-

That simple log entry can help you troubleshoot potentially issues with algorithms that rely on capturing certain exceptions to perform retries (or not).




Next one up my list is what I will call "major functions". Many times micro-service applications run big processes "in-line" as a direct response to a request, or on a separate thread asynchronously (some kind of garbage collection or resource cleanup of any sort).
When possible, it is good to log those as is not uncommon for this "large processes" to destabilize an application and make it crash. Many times you don't have a trace of the crash itself, but you know what happened immediately before that...

Logging best practice #4


Log "big or special operations" of any kind. Use LogLevel=4 for this.

Example of this "big" operations is a file cleanup operation that the rxfile service performs:

11/12/2015 1:53:53 PM: Status:INFO -- Details=----------------------------------------------------------
11/12/2015 1:53:54 PM: Status:INFO -- Details=Remote File Cleaning Agent running---
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -QAINEWPKD--
11/12/2015 1:53:55 PM: Status:ERROR -- Details=Exception received from the Agent -ORA-00942: table or view does not exist
- processing environment -TWED--

Notice the interesting fact that there's errors coming after the highlighted entry. By merit of having that entry logged there, it's not hard to assess that those errors were in fact triggered as part of the execution of the file cleaning agent.



Bear with me. I have one more.

Next one up is logging to assist complex troubleshooting efforts. This is about logging pretty much everything with enough granularity to add value when troubleshooting. It's very important to think about the granularity piece, definitively you don't want to be hitting the log function inside a loop that iterates 100000 times adding numbers because you will spend more time evaluating if log or not to log with the configured loglevel than doing work.

Logging best practice #5


Log all "big or special operations". Use LogLevel=5 for this.

You want to have a "verbose" mode of operation to troubleshoot the most intricate aspects of the application. Think of this logging level as the closest thing you will have to Debugging in your IDE. Remember, you don't have access to use your IDE debugger in Production, so you better instrument your application to be an effective troubleshooter when the issues arise (they will arise...).

Here's an example of how rxfile service logs every one of the two actions it performs on files: open and process them, then delete them:

11/12/2015 1:59:27 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000014-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File deleted: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000011-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File deleted: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000012-post
11/12/2015 1:59:27 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000015-post
11/12/2015 1:59:28 PM: Status:INFO -- Details=File found: X:-Cerem-InBox-JSBATTIGD-QFTAPDEPLOY-DMP@TPRDVYZ5BFJUT15HT5ZNYAZDY53IKDT43J2SFVBSIFSOCQBRWJGYNM55MP2FWT2Q2AHRVGQEMUPQFI0ESFDQQLLPDMFT45CXH4GPFZP-0000000000000016-post


Recommendation

Every time you go through a troubleshooting session, after fixing the issue you should think what logging could you have added to accelerate finding the issue in the first place. And when you do that, get it implemented right away in the current sprint or at most in the next sprint.

Tuesday, October 13, 2015

Software Engineering - Naming things

Naming things, maybe one of the toughest things to do in software engineering, and surely the single most important practice of all.

Check this out: http://martinfowler.com/bliki/TwoHardThings.html

Is it not true indeed? Naming things being one of the two hardest thing in software engineering, with the other being cache invalidation?

I believe the power of good naming, or why not, great naming, not only makes software easier to read, maintain and extend, but also simpler to properly craft in the first place.

Good naming exposes intent, and with intent clearly exposed we can see the structure of what we are building and assess if it makes sense or not. Good naming makes refactoring and evolutionary development possible, good naming allows for patterns emergence.

Bad naming typically tends to hide things creating the illusion of "maybe not so bad software" behind the scenes. With bad naming you definitively don't see good software, but worst of all, you may not see terribly bad software hiding under the surface.

So, what's good naming then? I was going to write my own "naming" rules, but I came across these two great articles that do a much better job I could at the task:

http://parallelcross.com/post/36861185276/naming-strategies-uncle-bob-and-ottinger

http://www.objectmentor.com/resources/articles/Naming.pdf

Enjoy

C++ and automatic objects misconceptions

Let's start this blog entry first with a disclaimer: I'm not a expert, not even experienced C++ developer.

Now, I do know a thing or two about C++, and particularly about some of the primitives of the language and not so much about its standard library.

One thing in particular I always liked is the idea of offering developers two models of object allocation:
  • "automatic" objects
  • "explicitly" allocated objects
Before we jump in to automatic objects, just a note on the second bullet point, what I call "explicitly" allocated objects.
The reason I call that model "explicitly" allocated objects rather than let's say, dynamically allocated objects or heap allocated objects is because really the main difference between that model and the "automatic" objects is that the developer tells the compiler with no room for interpretation that a new object will be allocated with memory provided by the new() operator and a particular constructor will be called.
There's no way around it.

Automatic objects, on the other hand, leave a lot of room to the compiler to decide how the final object is going to be provided to the developer.

With that, the motivation of this post is because I see a lot of confusion between C++ developers on exactly how the compiler solves certain constructs, leading to decisions that may not be the most efficient or cleanest ones when crafting code. In all honestly, most of this blog post will be a small piece of code I wrote with some typical and commonly used constructs when creating objects, assigning them to variables and passing them as function parameters. The code I wrote was to answer questions I had myself, in a few cases with assumed answers (that were wrong!) and the C++ compiler surprised positively in every one of the cases.

Without further due, this is the test code:
#include <stdio.h>
#include <string>
using namespace std;
class myString {
static int globalCounter;
string str;
public:
int counter;
myString(const myString& s) : str(s.str), counter(s.counter) {
printf("myString created using copy constructor\r\n");
print();
};
myString(const string& s) : str(s), counter(globalCounter++) {
printf("myString created\r\n");
print();
};
~myString() { printf("myString destroyed\r\n"); };
myString& operator=(const myString& s) {
str = s.str;
counter = s.counter;
printf("myString operator= called\r\n");
return *this;
};
void print() {
string msg = str;
msg += " ";
msg += to_string(static_cast<long long>(counter));
msg += " my address: ";
msg += to_string(reinterpret_cast<long long>(this));
msg += "\r\n";
printf(msg.c_str());
};
};
int myString::globalCounter = 0;
myString fn(const char* str) {
return myString(str);
}
void iReceiveAMyString(const myString) {
printf("iReceiveAMyString called\r\n");
}
void main(){
printf("vvvvvvv\r\n");
{
printf("Example with only one copy of myString class...\r\n");
myString str = fn("hello");
str.print();
} // str will be destroyed when leaving scope, right here
printf("^^^^^^^\r\n");
printf("vvvvvvv\r\n");
{
printf("Even invoking as a constructor creates only one copy...\r\n");
myString str(fn("hello"));
str.print();
} // str will be destroyed when leaving scope, right here
printf("^^^^^^^\r\n");
printf("vvvvvvv\r\n");
{
printf("When really copying into an uninitialized object copy constructor is finally used...\r\n");
myString str(fn("hello"));
printf("now we will assign str to str2\r\n");
myString str2 = str;
str.print();
str2.print();
} // str and str2 will be destroyed when leaving scope, right here
printf("^^^^^^^\r\n");
printf("vvvvvvv\r\n");
{
printf("What happens if we invoke function that doesn't use & operator?...\r\n");
myString str(fn("hello"));
iReceiveAMyString(str);
} // str will be destroyed when leaving scope, right here
printf("^^^^^^^\r\n");
printf("vvvvvvv\r\n");
{
printf("When overwriting an object, assignment operator will be called...\r\n");
myString str(fn("hello A"));
myString str2(fn("hello B"));
printf("let's print str2\r\n");
str2.print();
printf("now we will assign str to str2\r\n");
str2 = str;
printf("let's print str and after that str2\r\n");
str.print();
str2.print();
} // str and str2 will be destroyed when leaving scope, right here
printf("^^^^^^^\r\n");
}


And this is the console output if you run the program:

vvvvvvv
Example with only one copy of myString class...
myString created
hello 0 my address: 2030296
hello 0 my address: 2030296
myString destroyed
^^^^^^^
vvvvvvv
Even invoking as a constructor creates only one copy...
myString created
hello 1 my address: 2030252
hello 1 my address: 2030252
myString destroyed
^^^^^^^
vvvvvvv
When really copying into an uninitialized object copy constructor is finally use
d...
myString created
hello 2 my address: 2030208
now we will assign str to str2
myString created using copy constructor
hello 2 my address: 2030164
hello 2 my address: 2030208
hello 2 my address: 2030164
myString destroyed
myString destroyed
^^^^^^^
vvvvvvv
What happens if we invoke function that doesn't use & operator?...
myString created
hello 3 my address: 2030120
myString created using copy constructor
hello 3 my address: 2029768
iReceiveAMyString called
myString destroyed
myString destroyed
^^^^^^^
vvvvvvv
When overwriting an object, assignment operator will be called...
myString created
hello A 4 my address: 2030076
myString created
hello B 5 my address: 2030032
let's print str2
hello B 5 my address: 2030032
now we will assign str to str2
myString operator= called
let's print str and after that str2
hello A 4 my address: 2030076
hello A 4 my address: 2030032
myString destroyed
myString destroyed
^^^^^^^

Hopefully it depicts clearly how the C++ compiler is pretty efficient on its dealing with automatic objects. No unnecessary copies, not a single one, even when returning an automatic object from a function the construction storage was the target storage location.

Think very carefully before using new() allocated objects, it may not be worth on many cases. Not to mention you will be incurring typically on expensive heap allocations.

Sunday, March 15, 2015

How to properly support Windows Fibers in Delphi without breaking exception handling

If you are giving a shot to Windows Fibers under Delphi, you probably faced the issue of how to properly handle Delphi's structured exception handling AND switching between fibers transparently.

Assuming you already know what Fibers are, how to use them and when to use them, I will cut to the chase. If you don't know, or don't know enough, I recommend you the following reads to get started:


https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661%28v=vs.85%29.aspx
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx


When exceptions happen in a Delphi program, the compiler generates code to store nested exceptions on an exception stack. This stack is stored in a TLS allocated by the program upon startup.

When using Delphi compiled in 32 bits, the issue of persisting and restoring this stack is trivial. Up to the latest versions of Delphi the RTL exposes a couple of functions in the system unit that allow you to do the trick in two lines of code (literally).
The challenge with Delphi programs compiled in 64 bits is that Delphi doesn't expose anymore an API to persist and restore the Exception stack. It's not hard overcome this, but it requires a bit of reverse engineering and watching what the compiler does by enabling the CPU debugger.
I'll spare you from the pain, and below you will find a unit that does the trick:

uWin64ExceptionStack.pas


unit uWin64ExceptionStack;
interface
{$IFDEF WIN64}
const
MAX_NESTED_EXCEPTIONS = 16;
type
TSavedRaiseFrame = record
NextRaiseOffset: Integer;
ExceptAddr: Pointer;
ExceptObject: TObject;
end;
TSavedRaiseFrames = array[0..MAX_NESTED_EXCEPTIONS - 1] of TSavedRaiseFrame;
TWin64ExceptionStack = record
ExceptionObjectCount : Integer;
SavedRaiseFrames : TSavedRaiseFrames;
RaiseListPtrOffset : Integer;
procedure LoadFromThreadExceptionStack;
procedure SaveToThreadExceptionStack;
end;
{$ENDIF}
implementation
{$IFDEF WIN64}
uses
Windows;
type
PPRaiseFrame = ^PRaiseFrame;
PRaiseFrame = ^TRaiseFrame;
TRaiseFrame = record
NextRaise: PRaiseFrame;
ExceptAddr: Pointer;
ExceptObject: TObject;
end;
PRaiseFrames = ^TRaiseFrames;
TRaiseFrames = array[0..MAX_NESTED_EXCEPTIONS - 1] of TRaiseFrame;
const
RAISEFRAMES_TLS_OFFSET = $0;
EXCEPTIONOBJECTCOUNT_TLS_OFFSET = sizeof(TRaiseFrames);
RAISELISTPTR_TLS_OFFSET = EXCEPTIONOBJECTCOUNT_TLS_OFFSET + sizeof(NativeUInt);
NULL_RAISE_FRAME = -1;
// GetTLS function extracted and simplified from SysInit.pas unit
function GetTLS : Pointer;
const
tlsArray = $58; { offset of tls array from FS: }
type
PPPointerArray = ^PPointerArray;
var
P: PPointerArray;
begin
if ModuleIsLib then
Result := TlsGetValue(TlsIndex)
else
begin
//P := PPPointerArray(ReadGSQWord(tlsArray));
P := PPPointerArray(PByte(@GSSegBase) + tlsArray)^;
Result := P^[TlsIndex];
end;
end;
procedure TWin64ExceptionStack.LoadFromThreadExceptionStack;
var
ATLS : Pointer;
RaiseFrame : PRaiseFrame;
i : integer;
begin
ATLS := GetTLS;
ExceptionObjectCount := PInteger(NativeUInt(ATLS) + EXCEPTIONOBJECTCOUNT_TLS_OFFSET)^;
for i := 0 to ExceptionObjectCount - 1 do
begin
RaiseFrame := @PRaiseFrames(NativeUInt(ATLS) + RAISEFRAMES_TLS_OFFSET)[i];
SavedRaiseFrames[i].ExceptAddr := RaiseFrame.ExceptAddr;
SavedRaiseFrames[i].ExceptObject := RaiseFrame.ExceptObject;
if RaiseFrame.NextRaise <> nil then
SavedRaiseFrames[i].NextRaiseOffset := NativeUInt(RaiseFrame.NextRaise) - NativeUInt(ATLS) - RAISEFRAMES_TLS_OFFSET
else SavedRaiseFrames[i].NextRaiseOffset := NULL_RAISE_FRAME;
end;
if PPRaiseFrame(NativeUInt(ATLS) + RAISELISTPTR_TLS_OFFSET)^ <> nil then
RaiseListPtrOffset := NativeUInt(PPRaiseFrame(NativeUInt(ATLS) + RAISELISTPTR_TLS_OFFSET)^) - NativeUInt(ATLS) - RAISEFRAMES_TLS_OFFSET
else RaiseListPtrOffset := NULL_RAISE_FRAME;
end;
procedure TWin64ExceptionStack.SaveToThreadExceptionStack;
var
ATLS : Pointer;
RaiseFrame : PRaiseFrame;
i : integer;
begin
ATLS := GetTLS;
PInteger(NativeUInt(ATLS) + EXCEPTIONOBJECTCOUNT_TLS_OFFSET)^ := ExceptionObjectCount;
for i := 0 to ExceptionObjectCount - 1 do
begin
RaiseFrame := @PRaiseFrames(NativeUInt(ATLS) + RAISEFRAMES_TLS_OFFSET)[i];
RaiseFrame.ExceptAddr := SavedRaiseFrames[i].ExceptAddr;
RaiseFrame.ExceptObject := SavedRaiseFrames[i].ExceptObject;
if SavedRaiseFrames[i].NextRaiseOffset <> NULL_RAISE_FRAME then
RaiseFrame.NextRaise := PRaiseFrame(NativeUInt(PRaiseFrames(NativeUInt(ATLS) + RAISEFRAMES_TLS_OFFSET)) + SavedRaiseFrames[i].NextRaiseOffset)
else RaiseFrame.NextRaise := nil;
end;
if RaiseListPtrOffset <> NULL_RAISE_FRAME then
PPRaiseFrame(NativeUInt(ATLS) + RAISELISTPTR_TLS_OFFSET)^ := PRaiseFrame(NativeUInt(PRaiseFrames(NativeUInt(ATLS) + RAISEFRAMES_TLS_OFFSET)) + RaiseListPtrOffset)
else PPRaiseFrame(NativeUInt(ATLS) + RAISELISTPTR_TLS_OFFSET)^ := nil;
end;
{$ENDIF}
end.
Here's the full code that I use for switching between fibers:



{
The idea of DelphiSwitchToFiber() function is to backup on a local variable in stack the
state of the Exception stack right before calling SwitchToFiber() and then restoring its state
right atfer returns from call to SwitchToFiber().
If SwitchToFiber() is used directly from within an Except or Finally block, and if there's an exception
raised after switching to another fiber, upon coming back the results will be unpredictable because
the exception stack will be completely unwinded and all raise exceptions destroyed.
In order to prevent this issue we must backup the Exception stack before the call to SwitchToFiber()
and restore it right after the call.
Details of 64 bits exception stack persistance in unit uWin64ExceptionStack.pas
It's important to notice that when getting back from SwitchToFiber() we can't even assume that
the call was done within the context of Thread where the call was initiated. That's why on
64 bits implementation of this function, we need to obtain TLS pointer to ThreadVar storage
area right after the call to SwitchToThread().
Notice also that 32 bits implementation is dramatically simpler. On 32 bits versions of Delphi
when compiling under Windows, the system unit simply kept a linked list of stacked exceptions
so backing up our Exception stack was as simple as storing a pointer to this list on a local
variable and then restoring this pointer with the provided system API calls.
Win64 is an entirely different story, system unit hides completely all of the exception management
details, so we have to essently do a hack by accessing the area of the TLS where threadvars are stored
to persist the three key elements that participate on Exception management stack. These are:
RaiseFrames : TRaiseFrames;
ExceptionObjectCount : Integer;
RaiseListPtr : PRaiseFrame;
Notice that when calculating offset into TLS area to backup and restore RaiseListPtr we consider
ExceptionObjectCount size as NativeUInt (64 bits) rather than a regular integer (32 bits). This is
because Delphi compiler aligns to 64 bits the elements stored in the ThreadVar storage area.
}
{$IFDEF WIN64}
// Details to implement this function pulled from System.pas unit
procedure DelphiSwitchToFiber(AFiber : Pointer);
var
Win64ExceptionStack : TWin64ExceptionStack;
begin
Win64ExceptionStack.LoadFromThreadExceptionStack;
SwitchToFiber(AFiber);
Win64ExceptionStack.SaveToThreadExceptionStack;
end;
{$ELSE}
{$IFDEF VER130}
// SetRaiseList is totally broken in Delphi 5. Needs to be replaced by this function
function SetRaiseList(NewPtr: Pointer): Pointer;
asm
PUSH EAX
CALL SysInit.@GetTLS
MOV EDX, [EAX].$0 // Offset of threadvar RaiseListPtr in TLS memory block
POP ECX
MOV [EAX], ECX
MOV EAX, EDX
end;
{$ENDIF}
procedure DelphiSwitchToFiber(AFiber : Pointer);
var
SavedRaiseList : Pointer;
begin
SavedRaiseList := RaiseList;
SwitchToFiber(AFiber);
SetRaiseList(SavedRaiseList);
end;
{$ENDIF}
Before I forget!!!

Here's a sample test that performs a SwitchToFiber() inside an exception handler block. In fact, it's a triple nested exception handler block to test the mechanism of persisting and restored more than just one element of the exception stack.

procedure TestTJobQueue.JobMethodRaiseAndSwitchToFiber(const AJob: IJob; const AParams: array of Variant);
begin
try
Check(True);
raise ETestJobQueue.Create('Hello');
except
on E : Exception do
begin
try
raise ETestJobQueue.Create('Hello Nested');
except
on E : Exception do
begin
try
raise ETestJobQueue.Create('Hello More Nested');
except
on E : Exception do
begin
AJob.Sleep(400); // <-- This call will perform a SwitchToFiber() to emulate a blocking wait
CheckEqualsString('Hello More Nested', E.Message);
end;
end;
CheckEqualsString('Hello Nested', E.Message);
end;
end;
CheckEqualsString('Hello', E.Message);
end;
end;
Check(True);
end;
I hope this helps you on your adventure trying to implement fibers using Delphi.

A note on testing the approach:

Code was verified to work properly on:
Delphi 2007
DelphiXE4 running in WIN32 and WIN64

Special note about Delphi 5. SetRaiseList was totally broken. When used, it will send the application into a tailspin of access violations. See the fixed code, mainly borrowed from Delphi 2007 implementation.