
Use atomic operations and a read lock instead of a write lock #945

Closed
wants to merge 2 commits

Conversation

@shanson7 (Collaborator) commented Jun 14, 2018

This PR was an attempt to reduce the exclusive lock section in the ingest path.

The idea is that the map itself isn't being modified, so we don't need to hold a write lock. The behavioral change is that if the same point is Updated by two threads, the resulting partition/LastUpdate values are not guaranteed to come from the same update. In practice, I believe LastUpdate should pretty much be near realtime (and is mostly a heuristic anyway). The partition shouldn't change frequently and should be eventually consistent.

Similar changes could be made to AddOrUpdate (optimistically acquiring a read lock) but I wasn't sure how many calls to AddOrUpdate actually resulted in a write.

In our setup, we saw a 30%-40% bump in our backlog processing from this change.
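
For readers following the thread, here is a minimal sketch of the approach described above, using hypothetical type and field names (entry, index) rather than the actual metrictank code: the map is only read, so a read lock suffices, and the two fields that do change are written through sync/atomic instead of under the write lock.

```go
package idx

import (
	"sync"
	"sync/atomic"
)

// entry stands in for the real MetricDefinition; only the two fields
// touched by Update are shown here.
type entry struct {
	LastUpdate int64
	Partition  int32
}

// index stands in for the in-memory index guarded by an RWMutex.
type index struct {
	sync.RWMutex
	defs map[string]*entry
}

// Update takes only the read lock: the map itself is never mutated, and the
// two mutated fields are written atomically. Note that the load-then-store
// on LastUpdate is still a check-then-act race; that trade-off is discussed
// further down in this thread.
func (ix *index) Update(key string, ts int64, partition int32) {
	ix.RLock()
	defer ix.RUnlock()

	e, ok := ix.defs[key]
	if !ok {
		return
	}
	if atomic.LoadInt64(&e.LastUpdate) < ts {
		atomic.StoreInt64(&e.LastUpdate, ts)
	}
	atomic.StoreInt32(&e.Partition, partition)
}
```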

@Dieterbe self-requested a review June 14, 2018 15:56
@replay (Contributor) left a comment

LGTM
Nice catch noticing that this can all be done with atomic operations, making the write lock unnecessary. I'm still surprised by how large the difference you measured is.

@Dieterbe (Contributor) commented Jun 18, 2018

I haven't looked in depth at this yet, but FWIW I'm pretty sure you can't just do atomic writes to integers while reading them non-atomically elsewhere (even when holding the read lock, which in this case isn't relevant).
I recently read an interesting article that explained some of this stuff (though I'm not sure it was exactly this case); it talked about instruction reordering and how there are far fewer memory guarantees than some people assume. The gist of it was that the compiler and/or hardware are free to reorder and repurpose memory accesses as they please - they only have to comply with the guarantees explicit in the memory model - and that this commonly happens. I couldn't find the article, but I did find golang/go#5045 (comment):

I'm fairly certain the rules will be that Go's atomics guarantee sequential consistency among the atomic variables (behave like C/C++'s seqconst atomics), and that you shouldn't mix atomic and non-atomic accesses for a given memory word.

we may want to pursue tracking of partition/lastUpdate in a separate structure that is entirely based on atomics, just thinking out loud.
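
As a hedged illustration of the rule referenced above (not code from this PR): once a word is written through sync/atomic, every concurrent reader of that word also has to go through sync/atomic; a plain read, even under the read lock, is a data race, and the -race detector will flag it.

```go
package idx

import "sync/atomic"

var lastUpdate int64

// Writer side: atomic store, without taking the write lock.
func setLastUpdate(ts int64) {
	atomic.StoreInt64(&lastUpdate, ts)
}

// Racy: a plain read of the same word that the writer stores atomically.
// Holding a read lock does not help, because the writer never takes the
// write lock.
func lastUpdateRacy() int64 {
	return lastUpdate
}

// Safe: all accesses to the word go through sync/atomic.
func lastUpdateAtomic() int64 {
	return atomic.LoadInt64(&lastUpdate)
}
```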

@shanson7 (Collaborator, Author)

It seems that it would be easy enough to create an AtomicUint32 (and the whole family) to encapsulate these actions. The trick is that the MetricDefinition would need to use it all over...
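
A sketch of what such a wrapper might look like (hypothetical, not from this PR; recent Go versions ship an equivalent atomic.Uint32 type in sync/atomic):

```go
package idx

import "sync/atomic"

// AtomicUint32 forces every access to go through sync/atomic, so the field
// can never be read or written non-atomically by accident.
type AtomicUint32 struct {
	v uint32
}

func (a *AtomicUint32) Load() uint32            { return atomic.LoadUint32(&a.v) }
func (a *AtomicUint32) Store(v uint32)          { atomic.StoreUint32(&a.v, v) }
func (a *AtomicUint32) Add(delta uint32) uint32 { return atomic.AddUint32(&a.v, delta) }
func (a *AtomicUint32) CompareAndSwap(old, new uint32) bool {
	return atomic.CompareAndSwapUint32(&a.v, old, new)
}
```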

@Dieterbe (Contributor)

The trick is that the MetricDefinition would need to use it all over...

related: the in-memory index and its types (including MetricDefinition) are due for a makeover anyway; they're quite inefficient.

@shanson7 (Collaborator, Author)

So, we were running the "unsafe" version of this (mixing atomic and non-atomic accesses) for over a month and didn't see issues, but I recently rolled out a new build without this branch, as I didn't want race conditions in there and I wasn't ready to commit to this change. However, without it I saw about a 30% slowdown in backfill processing. So, I decided to bite the bullet, go through all the .Partition and .LastUpdate accesses, and make them atomic (unless a write lock is already held).

@shanson7 (Collaborator, Author)

Opened #969 for failed test

@Dieterbe (Contributor) commented Aug 24, 2018

  1. did deploying this PR, with these atomic reads added, give you your 30% speedup back?
  2. would you say this PR is ready for merging? if so, I'll do an ingest speed test as well

```diff
-	if existing.LastUpdate < int64(point.Time) {
-		existing.LastUpdate = int64(point.Time)
+	if atomic.LoadInt64(&existing.LastUpdate) < int64(point.Time) {
+		atomic.SwapInt64(&existing.LastUpdate, int64(point.Time))
```
Contributor:

Note that this is racy.
Let's say existing.LastUpdate is very old (30 days ago).
Then a point comes in for 29 days ago, and concurrently another one for a day ago via a different kafka partition, and then no more points.
In that case, we can have concurrent Update calls resulting in the LastUpdate field being updated to 29 days ago, but never to a day ago.
Note that for any given kafka partition, carbon stream, or prometheus POST, we never have overlap in our Update calls.

So in practice this doesn't seem like an issue, but perhaps we should document it under "tradeoffs and extremely rare edge cases" or something.

Contributor:

or we can solve it by either:

  1. doing CompareAndSwap in a loop until we're able to swap in the value we wanted to swap (sketched below)
  2. confirming that the swapped-out value (the return value of SwapInt64) is smaller than what we swapped in; if not, put the old value back, check that we didn't swap out an even higher value (placed by a concurrent Update call), etc.
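
For option 1, a hedged sketch of what the CompareAndSwap loop could look like (a hypothetical helper, presumably close to the bumpLastUpdate function that shows up in the branch mentioned further down):

```go
package idx

import "sync/atomic"

// bumpLastUpdate raises *loc to ts unless a concurrent Update has already
// stored an equal or newer timestamp. The loop retries only when the CAS
// loses to another writer.
func bumpLastUpdate(loc *int64, ts int64) {
	for {
		prev := atomic.LoadInt64(loc)
		if prev >= ts {
			return // someone else already stored something at least as new
		}
		if atomic.CompareAndSwapInt64(loc, prev, ts) {
			return // our newer timestamp is in place
		}
		// CAS failed: *loc changed underneath us; reload and re-check.
	}
}
```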

Collaborator (Author):

This operation is racy at that point even with a write lock. It depends on the order in which individual threads hit that lock call, which (under the hypothetical assumption that data for the same series can come from different threads) can be out of order with respect to the kafka ingest step.

I'm hesitant to add anything overly complex to Update for no real world benefit, but I'll defer to your preference.

Contributor:

I don't understand this. If you use a write lock, you can lock, check the value, update if we have a larger one, and unlock. This works regardless of the order between two concurrent update operations (the most recent will always survive).

I think my proposal above will also solve it, and at almost no additional cost.
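
For contrast, a standalone sketch of the write-lock variant being described, with illustrative names only: the compare and the store form one critical section, so the largest timestamp always survives regardless of the order in which concurrent updates acquire the lock.

```go
package idx

import "sync"

var (
	mu         sync.Mutex
	lastUpdate int64
)

// updateLocked keeps the check and the write in one critical section.
func updateLocked(ts int64) {
	mu.Lock()
	if lastUpdate < ts {
		lastUpdate = ts
	}
	mu.Unlock()
}
```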

Collaborator (Author):

I was extending it to the Partition update as well. Perhaps Partition should only be updated when we have a newer timestamp as well?

Contributor:

The partition property is really only there to shard the index in cassandra, so that nodes know on startup which metrics they're responsible for.
I'm not sure we even properly support live partition changes of metrics (i.e. whether after the change we properly display both the old and new data).
Under concurrent updates it's probably OK for the partition to be ambiguous ("under transition"), but once update operations are serialized, the later one should probably always win, even when data is sent out of order. I think MT's behavior in these scenarios is so undefined that probably either way works.

@shanson7 (Collaborator, Author)

Yes, this PR definitely got us back where we wanted to be. Ingest rate is not very consistent, but we average ~90k dp/s/core (on 8 cores, we see spikes up to 1M dp/s but average ~700k). Without this change, we were barely breaking 400k dp/s. We have been running this in production for about 2 months now, with no noticeable issues.

I really look forward to seeing if it benefits your speeds as well (the trade-off, I suppose, is greater CPU usage during ingest).

@Dieterbe (Contributor) commented Aug 27, 2018

I have a new branch:
https://github.com/grafana/metrictank/tree/readLockUpdates

  1. rebased your branch on top of master
  2. added a commit with what I think is the solution to update LastUpdate properly. I think it should not be any slower (or more complicated) than what you had.
  3. my bumpLastUpdate function should be inlined, but for some reason I couldn't prove it via go build -gcflags '-m', so I also added a commit manually inlining it (but I don't intend to merge that commit)

Then I filled up kafka with some MetricData and tested ingestion with each version (twice):

Mon Aug 27 11:08:00 CEST 2018 running mt-dieter
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
Mon Aug 27 11:10:00 CEST 2018 running mt-dieter
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
/home/dieter/go/src/github.com/grafana/metrictank
Mon Aug 27 11:12:00 CEST 2018 running mt-dieter-inline
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
Mon Aug 27 11:14:01 CEST 2018 running mt-dieter-inline
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
/home/dieter/go/src/github.com/grafana/metrictank
Mon Aug 27 11:16:02 CEST 2018 running mt-master
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
Mon Aug 27 11:18:02 CEST 2018 running mt-master
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
/home/dieter/go/src/github.com/grafana/metrictank
Mon Aug 27 11:20:03 CEST 2018 running mt-sean
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
Mon Aug 27 11:22:03 CEST 2018 running mt-sean
Restarting docker-dev-custom-cfg-kafka_metrictank_1 ... done
/home/dieter/go/src/github.com/grafana/metrictank

https://snapshot.raintank.io/dashboard/snapshot/hMzSp4LGcBaJ5iKDrvrWMueMWzkxTUL6?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/OCgXyoe0RQEURuAS6gV3Iuzo3K7c5qJS?orgId=2

CPU difference looks fine (tiny; if anything, proportional to the increased ingest, but perhaps even less).
The sean, dieter, and dieter-inline versions all seem to perform the same, so I suggest we merge my branch minus the inline commit.
Important to note that this is not under an HTTP workload, so any gains here are baseline gains; under concurrency I expect to gain more.

sound good @shanson7 ?

@shanson7 (Collaborator, Author)

Yeah, looks great to me! I'm excited to see if you see a difference in ingest speed as pronounced as we did.

@Dieterbe (Contributor) commented Aug 27, 2018

Rollout to our internal monitoring environment:
backfill speeds of different instances as the pods restart one by one.
Left: old, right: new.
Note the increased speeds and also the shorter cluster restart time (both charts are exactly 7h wide).
old deploy duration: 8:30 - 14:15 = 5:45 (345 min)
new deploy duration: 14:51 - 19:00 = 4:09 (249 min)
(345 - 249) / 249 ≈ 38% faster

@shanson7 shanson7 deleted the readLockUpdates branch October 2, 2018 16:58