
Audit data corruption on NFS volumes #1351

Closed
russjones opened this issue Sep 30, 2017 · 7 comments

@russjones (Contributor) commented Sep 30, 2017

Problem

When running multiple Teleport Auth Servers in an HA configuration, the recommended approach for the audit log is to mount a shared NFS volume that all Auth Servers write to. This, however, will not work: multiple clients opening a file with the O_APPEND flag leads to data corruption, as outlined in sections A8 and A9 of the NFS documentation.

  • TCP guarantees ordered delivery in the context of a single server; however, out-of-order writes are possible when there are multiple auth servers.
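
For illustration, a minimal Go sketch of the append pattern that breaks down here (the function name is hypothetical): the kernel serializes O_APPEND writes on a local filesystem, but an NFS client computes the write offset itself, so two servers appending concurrently can overwrite each other's data.

package example

import "os"

// appendChunk shows the unsafe-on-NFS pattern: O_APPEND is atomic on a
// local filesystem, but each NFS client computes the append offset on its
// own, so concurrent appends from two auth servers can clobber each other.
func appendChunk(path string, chunk []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0640)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(chunk)
	return err
}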

Proposed Solution

To clarify the design algorithm a little bit:

The only way to solve the problem with NFS, which does not guarantee atomicity of concurrent appends, is to make sure there is only one writer per opened file.

If several auth servers write concurrently in the context of the same session, they will write to different files.

The file format will be exactly the same as the existing format.

1. When receiving session chunks, the auth server opens a file whose name starts with the counter of the first received chunk.
2. The auth server continues writing to that file until one of the following happens:
2.a. the session ends, and the auth server closes the file;
2.b. the auth server receives a chunk whose counter is not successive to the previously received one, e.g. the previously written chunk has counter 8 while the newly received chunk has counter 10, which means another auth server wrote chunk 9. The auth server then resets its state to step 1.
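
A minimal Go sketch of this writer loop, assuming a hypothetical chunkWriter type and the counter-prefixed file naming used in the example below; this is an illustration of the proposed scheme, not Teleport's actual implementation:

package example

import (
	"fmt"
	"os"
	"path/filepath"
)

type chunkWriter struct {
	dir       string
	sessionID string
	file      *os.File // currently open chunk file, nil if none
	nextChunk int      // counter the next chunk must carry to stay in this file
}

// writeChunk implements steps 1 and 2: open a new file named after the first
// chunk's counter, keep appending while counters are successive, and reset to
// step 1 when a gap shows that another auth server wrote the missing chunk.
func (w *chunkWriter) writeChunk(counter int, data []byte) error {
	if w.file == nil || counter != w.nextChunk {
		if w.file != nil {
			w.file.Close() // step 2.b: gap detected
		}
		name := filepath.Join(w.dir,
			fmt.Sprintf("%d_%s.session.bytes", counter, w.sessionID))
		// O_EXCL enforces the one-writer-per-file invariant.
		f, err := os.OpenFile(name, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0640)
		if err != nil {
			return err
		}
		w.file = f
	}
	if _, err := w.file.Write(data); err != nil {
		return err
	}
	w.nextChunk = counter + 1
	return nil
}

// close implements step 2.a: the session has ended.
func (w *chunkWriter) close() error {
	if w.file == nil {
		return nil
	}
	err := w.file.Close()
	w.file = nil
	return err
}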

For example, auth server 1 will write the following blocks:

# will contain blocks from 0 to 92
0_b0ca00f5-a4a9-11e7-9b5d-0a6859bf1618.session.bytes
# will contain blocks from 103 to 500
103_b0ca00f5-a4a9-11e7-9b5d-0a6859bf1618.session.bytes

auth server 2 will write the following blocks:

# will contain blocks from 93 to 102
93_b0ca00f5-a4a9-11e7-9b5d-0a6859bf1618.session.bytes

Then for playback, we simply gather and join all chunks.
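
A sketch of that playback step, under the same counter-prefixed naming assumption; joinSession is an illustrative helper, not an existing API:

package example

import (
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// joinSession finds every chunk file written for a session, orders the files
// by the starting counter encoded in their names, and concatenates them.
func joinSession(dir, sessionID string) ([]byte, error) {
	files, err := filepath.Glob(filepath.Join(dir, "*_"+sessionID+".session.bytes"))
	if err != nil {
		return nil, err
	}
	// Sort numerically by the counter prefix, not lexicographically,
	// so that 93_... lands between 0_... and 103_... .
	sort.Slice(files, func(i, j int) bool {
		return counterOf(files[i]) < counterOf(files[j])
	})
	var joined []byte
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return nil, err
		}
		joined = append(joined, data...)
	}
	return joined, nil
}

// counterOf extracts the leading chunk counter from a file name.
func counterOf(path string) int {
	n, _ := strconv.Atoi(strings.SplitN(filepath.Base(path), "_", 2)[0])
	return n
}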

We would need to apply a similar scheme to the session metadata and to the audit log itself, because they also reside on an NFS volume and are subject to the same issues.

Audit events

The Web UI uses the audit log to figure out which sessions are complete and can be played back, and which are active and can be joined, based off the values of session.start and session.end.

Direct integrations with external structured logging facilities for querying and logging would solve this problem, e.g. using the ELK/Splunk APIs to query those backends would reduce the amount of work.

@klizhentas (Contributor)

Correction - as discussed, we don't need to implement this scheme for the audit log, since we don't have to put audit log entries on the NFS volume and can simply use local storage with log forwarders.

@klizhentas klizhentas changed the title Audit data corruption Audit data corruption on NFS volumes Sep 30, 2017
@mechastorm

So just to clarify: how do we centralize the session logs for those of us who may not have a preferred log forwarder yet?

The other concern is how the session logs are accessed from the Web UI when there are multiple Auth Servers.

@klizhentas (Contributor)

made several edits

@russjones (Contributor, Author) commented Sep 30, 2017

@klizhentas The Web UI uses the audit log to figure out which sessions are complete and can be played back, and which are active and can be joined, based off the values of session.start and session.end. So we need a way for each Auth Server to see all events that have occurred in the system, at least those of type session.start and session.end.

An idea: we store the audit log in the backend and provide a log forwarder that forwards to a file. This allows us to build more log forwarders in the future and maintain existing functionality with the file-based events.
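
A minimal sketch of that idea, assuming a hypothetical Forwarder interface and AuditEvent type, and assuming the existing file log is newline-delimited JSON:

package example

import (
	"encoding/json"
	"io"
)

// AuditEvent is a single structured audit entry, e.g. session.start.
type AuditEvent map[string]interface{}

// Forwarder ships audit events read from the backend to an external sink.
// A file-based forwarder preserves today's behavior; ELK/Splunk forwarders
// could be added later behind the same interface.
type Forwarder interface {
	Forward(event AuditEvent) error
}

// fileForwarder writes one JSON-encoded event per line.
type fileForwarder struct {
	w io.Writer
}

func (f *fileForwarder) Forward(event AuditEvent) error {
	line, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = f.w.Write(append(line, '\n'))
	return err
}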

@klizhentas (Contributor)

@russjones We can explore bringing audit logs back to the backends, or we can add direct integrations with external structured logging facilities for querying as well; e.g. it would be no problem to log directly to ELK/Splunk and simply query those backends, reducing the amount of work.

@pmorton commented Oct 25, 2017

Echoing @mechastorm, if using the recommended shared NFS volume causes corruption, how does one implement high availability? Is it possible?

@klizhentas klizhentas added this to the 2.5.0 milestone Jan 4, 2018
@klizhentas (Contributor)

Fixed in 2.5.0 by #1549.

@klizhentas klizhentas mentioned this issue Feb 19, 2018
hatched pushed a commit to hatched/teleport-merge that referenced this issue Nov 30, 2022
hatched pushed a commit that referenced this issue Dec 20, 2022