Making Changes Persistent

Without an fsync to flush data in the filesystem cache to disk, we cannot be sure that the data will still be there after a power failure, or even after exiting the application normally. For Elasticsearch to be reliable, it needs to ensure that changes are persisted to disk.

In [dynamic-indices], we said that a full commit flushes segments to disk and writes a commit point, which lists all known segments. Elasticsearch uses this commit point during startup or when reopening an index to decide which segments belong to the current shard.

While we refresh once every second to achieve near real-time search, we still need to do full commits regularly to make sure that we can recover from failure. But what about the document changes that happen between commits? We don’t want to lose those either.

Elasticsearch added a translog, or transaction log, which records every operation in Elasticsearch as it happens. With the translog, the process now looks like this:

When a document is indexed, it is added to the in-memory buffer and appended to the translog, as shown in New documents are added to the in-memory buffer and appended to the transaction log.

Figure 1. New documents are added to the in-memory buffer and appended to the transaction log
The refresh leaves the shard in the state depicted in After a refresh, the buffer is cleared but the transaction log is not. Once every second, the shard is refreshed:
- The docs in the in-memory buffer are written to a new segment, without an fsync.
- The segment is opened to make it visible to search.
- The in-memory buffer is cleared.
Figure 2. After a refresh, the buffer is cleared but the transaction log is not
This process continues with more documents being added to the in-memory buffer and appended to the transaction log (see The transaction log keeps accumulating documents).

Figure 3. The transaction log keeps accumulating documents
Every so often—such as when the translog is getting too big—the index is flushed; a new translog is created, and a full commit is performed (see After a flush, the segments are fully commited and the transaction log is cleared):
- Any docs in the in-memory buffer are written to a new segment.
- The buffer is cleared.
- A commit point is written to disk.
- The filesystem cache is flushed with an fsync.
- The old translog is deleted.

The translog provides a persistent record of all operations that have not yet been flushed to disk. When starting up, Elasticsearch will use the last commit point to recover known segments from disk, and will then replay all operations in the translog to add the changes that happened after the last commit.

The translog is also used to provide real-time CRUD. When you try to retrieve, update, or delete a document by ID, it first checks the translog for any recent changes before trying to retrieve the document from the relevant segment. This means that it always has access to the latest known version of the document, in real-time.

Figure 4. After a flush, the segments are fully commited and the transaction log is cleared

flush API

The action of performing a commit and truncating the translog is known in Elasticsearch as a flush. Shards are flushed automatically every 30 minutes, or when the translog becomes too big. See the {ref}/index-modules-translog.html[translog documentation] for settings that can be used to control these thresholds:

The {ref}/indices-flush.html[flush API] can be used to perform a manual flush:

POST /blogs/_flush (1)

POST /_flush?wait_for_ongoing (2)

Flush the blogs index.
Flush all indices and wait until all flushes have completed before returning.

You seldom need to issue a manual flush yourself; usually, automatic flushing is all that is required.

That said, it is beneficial to flush your indices before restarting a node or closing an index. When Elasticsearch tries to recover or reopen an index, it has to replay all of the operations in the translog, so the shorter the log, the faster the recovery.

How Safe Is the Translog?

The purpose of the translog is to ensure that operations are not lost. This begs the question: how safe is the translog?

Writes to a file will not survive a reboot until the file has been fsync'ed to disk. By default, the translog is fsync'ed every 5 seconds and after a write request completes (e.g. index, delete, update, bulk). This process occurs on both the primary and replica shards. Ultimately, that means your client won’t receive a 200 OK response until the entire request has been fsync'ed in the translog of the primary and all replicas.

Executing an fsync after every request does come with some performance cost, although in practice it is relatively small (especially for bulk ingestion, which amortizes the cost over many documents in the single request).

But for some high-volume clusters where losing a few seconds of data is not critical, it can be advantageous to fsync asynchronously. E.g. writes are buffered in memory and fsync'ed together every 5s.

This behavior can be enabled by setting the durability parameter to async:

PUT /my_index/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
}

This setting can be configured per-index and is dynamically updatable. If you decide to enable async translog behavior, you are guaranteed to lose `sync_interval’s worth of data if a crash happens. Please be aware of this characteristic before deciding!

If you are unsure the ramifications of this action, it is best to use the default ("index.translog.durability": "request") to avoid data-loss.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

50_Persistent_changes.asciidoc

50_Persistent_changes.asciidoc

Making Changes Persistent

flush API

Files

50_Persistent_changes.asciidoc

Latest commit

History

50_Persistent_changes.asciidoc

File metadata and controls

Making Changes Persistent

flush API