Reproduction Harness

Richard Warburton edited this page Aug 30, 2022 · 3 revisions

The Artio reproduction harness provides a way to reproduce sequences of events that happen through your Artio Gateway. The aim is to be able to reproduce problematic scenarios that lead to bugs within Artio or your own code. It requires cooperation from user code in order to be used effectively.

How to use it

To reproduce a previous run of Artio you need to keep your Artio logFileDir() and Aeron Archive instance from that run. These directories are the source of data for the reproduction. To use reproduction mode, your system must respond deterministically to every callback event from Artio, in the same way that it did on the original run. For example, if you received a NewOrderSingle message and replied with an ExecutionReport, then you must reply with the exact same values and fields as before. If you requested that a session be owned by a specific library on the original run, you must do so on the reproduction, and so on.
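The determinism requirement above can be sketched as follows. This is only an illustration in plain Java, not Artio code: DeterministicResponder, its method, and the string-based message format are hypothetical stand-ins for your own handler. The point it shows is that every outbound field is derived from the inbound message or from state that evolves identically on each run, never from the wall clock or a random generator.

```java
// Hypothetical handler, not an Artio type: builds a reply purely from its inputs.
final class DeterministicResponder {
    // A counter replaces random ids (e.g. UUIDs) so that both runs produce the same execId.
    private long nextExecId = 1;

    // eventTimeNs is assumed to come from the inbound event, not from System.nanoTime().
    String onNewOrder(String clOrdId, String symbol, long quantity, long eventTimeNs) {
        String execId = "EXEC-" + nextExecId++;
        return "ExecutionReport{execId=" + execId
            + ", clOrdId=" + clOrdId
            + ", symbol=" + symbol
            + ", quantity=" + quantity
            + ", transactTimeNs=" + eventTimeNs + "}";
    }
}
```

Feeding the same sequence of orders to two fresh instances yields identical replies, which is the property a reproduction run depends on.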

At the Engine

When configuring the FixEngine for reproduction mode you need to invoke EngineConfiguration.reproduceInbound() with the start and end times of your reproduction. Once you have launched your FixEngine and your system has started up and is ready to go, you can call startReproduction(). The reply returned from this method can be checked to find out when the reproduction has completed.
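As a small aid for choosing the start and end times, the sketch below converts a pair of human-readable instants into numeric bounds. It assumes the times are expressed as nanoseconds since the Unix epoch, which this page does not state, so check the unit against the reproduceInbound() Javadoc for your Artio version. ReproductionWindow is a hypothetical helper, not part of Artio.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: holds the bounds of a reproduction window.
// Assumption: bounds are nanoseconds since the Unix epoch.
final class ReproductionWindow {
    final long startInNs;
    final long endInNs;

    ReproductionWindow(Instant start, Instant end) {
        if (!start.isBefore(end)) {
            throw new IllegalArgumentException("start must be before end");
        }
        this.startInNs = toEpochNanos(start);
        this.endInNs = toEpochNanos(end);
    }

    // Combines whole seconds and the nanosecond fraction; addExact fails loudly on overflow.
    static long toEpochNanos(Instant instant) {
        return Math.addExact(
            TimeUnit.SECONDS.toNanos(instant.getEpochSecond()),
            instant.getNano());
    }
}
```

The two fields would then be the values handed to reproduceInbound() on both the engine and the library configuration, so that both sides agree on the window.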

The reproduction operation entails replaying messages from the Archive using Aeron's IPC_CHANNEL. If the default stream id clashes with an existing Aeron IPC stream, then EngineConfiguration.reproductionLogStream() can be used to configure the stream id.

To make a reproduction run more accurate, EngineConfiguration.writeReproductionLog(true) can be configured. By default Artio doesn't record when TCP channels get back-pressured, beyond noting when slow consumer mode switches on and off, because it's not normally useful information. Enabling the reproduction log records when back-pressure happens, and that information can be used during reproduction runs to control the order of events. Take care when enabling this flag. Normally it shouldn't generate too many events, but if a gateway is under constant back-pressure then it can be spammy. The stream used to record and replay the reproduction log can be configured using EngineConfiguration.reproductionLogStream().

At the Library

When configuring the FixLibrary for reproduction mode you need to invoke LibraryConfiguration.reproduceInbound() with the start and end times of your reproduction. Each library instance needs to use the same library id as on the original run. To facilitate this, the LibraryConfiguration.libraryId() method has been added, which sets a library id explicitly rather than leaving it to be randomly generated. Take care to give each library a distinct library id. Giving multiple libraries the same id within the same run isn't supported and can lead to bugs.
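One way to keep library ids both stable across runs and distinct within a run is to assign them from a single place in your own code. The sketch below is a hypothetical helper, not an Artio class, and it assumes library ids are plain integers. It rejects an id that has already been claimed by a differently named library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a central table of fixed library ids for one process or deployment.
final class LibraryIdRegistry {
    private final Map<Integer, String> ownerById = new HashMap<>();

    // Returns the id if it is free (or already owned by the same library), otherwise throws.
    int register(String libraryName, int libraryId) {
        String existingOwner = ownerById.putIfAbsent(libraryId, libraryName);
        if (existingOwner != null && !existingOwner.equals(libraryName)) {
            throw new IllegalStateException(
                "library id " + libraryId + " is already used by " + existingOwner);
        }
        return libraryId;
    }
}
```

The value returned by register() would be the one passed to LibraryConfiguration.libraryId(), with the same name-to-id mapping used on the original run and on the reproduction.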

Caveats and Limitations

  • In reproduction mode Artio ignores bind operations.
  • You cannot use a custom TcpChannelSupplier - Artio creates fake TCP channels for the purpose of replaying the reproduction.
  • You should not set a custom clock with reproduction mode - Artio uses timestamps from the event stream to create a fake clock that triggers events.
  • At the moment reproduction mode doesn't support initiated connections, only acceptor connections.
  • You also need a way to generate the stream of outbound events that comes from your internal matching engines, with timestamps that advance at the rate of the fake clock.
  • Reproducing complex interleaving bugs requires the exact order of events, and reproduction mode cannot guarantee that outbound events and replayed events interleave exactly as they did on the original run. In other words it isn't a 100% deterministic reproduction.
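The requirement to generate outbound events in step with the fake clock can be sketched as a queue of recorded events that are released only once the clock has reached their timestamp. This is a plain Java illustration under assumptions: RecordedOutboundSource is hypothetical, events are simplified to strings, and the current fake-clock time is passed in by the caller because this page does not describe how user code reads that clock.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical source of recorded outbound events (e.g. from a matching engine).
final class RecordedOutboundSource {
    private static final class Event {
        final long timestampNs;
        final String payload;

        Event(long timestampNs, String payload) {
            this.timestampNs = timestampNs;
            this.payload = payload;
        }
    }

    private final Deque<Event> pending = new ArrayDeque<>();

    // Events must be added in non-decreasing timestamp order.
    void record(long timestampNs, String payload) {
        pending.addLast(new Event(timestampNs, payload));
    }

    // Releases, in order, every event whose timestamp the fake clock has reached.
    List<String> poll(long fakeClockNowNs) {
        List<String> due = new ArrayList<>();
        while (!pending.isEmpty() && pending.peekFirst().timestampNs <= fakeClockNowNs) {
            due.add(pending.pollFirst().payload);
        }
        return due;
    }
}
```

Even with a source like this, the interleaving caveat still applies: the relative order of released outbound events and replayed inbound events is not guaranteed to match the original run.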

Implementation

The aim is for the changes to be minimally invasive, and for the system to behave as similarly to production as possible when running in reproduction mode.

In reproduction mode Artio takes the inbound events of the system from the archive. When it identifies a point in time at which a TCP connection was received, it creates a "fake" TCP connection for testing purposes; these connections also let you see the outbound messages that go back out, in order to reproduce the problem. It then replays the inbound messages, connection creations and disconnects into your system. It creates a fake clock, driven by timestamps from the stream, to trigger time-driven events emitted from the Session logic, such as heartbeats.
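The idea of a clock driven by stream timestamps can be illustrated with a minimal sketch. This is not Artio's implementation: StreamDrivenClock and HeartbeatTimer are hypothetical names, and the logic is reduced to the essentials. Time moves forward only when an event with a later timestamp is replayed, and a timer checked against that clock fires the heartbeats that fell due in between.

```java
// Hypothetical clock: "now" is the latest timestamp seen in the replayed stream.
final class StreamDrivenClock {
    private long nowNs;

    // Called for each replayed event; older timestamps never move the clock backwards.
    void onEventTimestamp(long timestampNs) {
        if (timestampNs > nowNs) {
            nowNs = timestampNs;
        }
    }

    long nanoTime() {
        return nowNs;
    }
}

// Hypothetical time-driven component, standing in for session heartbeat logic.
final class HeartbeatTimer {
    private final long intervalNs;
    private long nextDueNs;

    HeartbeatTimer(long startNs, long intervalNs) {
        if (intervalNs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.intervalNs = intervalNs;
        this.nextDueNs = startNs + intervalNs;
    }

    // Returns how many heartbeats became due up to the given time.
    int poll(long nowNs) {
        int fired = 0;
        while (nowNs >= nextDueNs) {
            fired++;
            nextDueNs += intervalNs;
        }
        return fired;
    }
}
```

Because the timer only ever sees the stream-driven time, the same replayed stream triggers the same heartbeats regardless of how fast the reproduction actually runs.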