
refactor cassandra read queries #931

Merged: 8 commits from cass-multiple-rowkeys into master on Jun 16, 2018

Conversation

Dieterbe (Contributor) commented Jun 6, 2018

problem

When a customer queries 2 years' worth of data, we read from 27 different row keys (we use 4-weekly row keys), each with its own query. When using avgConsolidator, double that (we read both sum and cnt), so 54 queries. And this is per series.
This particular customer was querying 743 series, resulting in 743*54 = ~40k read queries (each generally completing in under 10ms, but still adding up to significant overall latency).

We speculate that by issuing fewer queries, each hitting multiple row keys at once, we can get better performance. I'm currently testing this.
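Roughly, the idea is to go from one query per row key to one query spanning all row keys. A minimal sketch of the before/after pattern (illustrative only: helper names, table name and signatures are made up for this comment; the 4-weekly row-key scheme is key_<ts/(28*24*3600)>, and the read query mirrors the constant introduced in this PR):

package storesketch

import (
	"fmt"

	"github.com/gocql/gocql"
)

// 4-weekly row keys: metric key + "_" + (ts / 28 days). Illustrative only.
const monthSec = 60 * 60 * 24 * 28

func rowKeys(metricKey string, start, end uint32) []string {
	var keys []string
	for m := start / monthSec; m <= end/monthSec; m++ {
		keys = append(keys, fmt.Sprintf("%s_%d", metricKey, m))
	}
	return keys
}

// before: one query (one round trip) per row key. A 2y query spans ~27 row
// keys per series, and avgConsolidator doubles that (sum + cnt).
func readPerRowKey(s *gocql.Session, table, metricKey string, start, end uint32) {
	q := fmt.Sprintf("SELECT ts, data FROM %s WHERE key = ? AND ts < ?", table)
	for _, k := range rowKeys(metricKey, start, end) {
		iter := s.Query(q, k, end).Iter()
		_ = iter // ... decode chunks from iter
	}
}

// after: a single query with an IN clause covering all row keys at once
func readAcrossRowKeys(s *gocql.Session, table, metricKey string, start, end uint32) {
	q := fmt.Sprintf("SELECT ts, data FROM %s WHERE key IN ? AND ts < ?", table)
	iter := s.Query(q, rowKeys(metricKey, start, end), end).Iter()
	_ = iter // ... decode chunks; they now need sorting, since keys can come back in any order
}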

procedure

./launch.sh docker-dev-custom-cfg-kafka

Wait until it's up, then (using the latest fakemetrics code) run:

./fakemetrics backfill --kafka-mdm-addr localhost:9092 --kafka-mdm-v2=false --offset $((2*366*24))h --period 1800s --speedup 360000 --mpo 500

This should complete in about 3 minutes.
Load the MT dashboard and confirm it ingested data at 100kHz for 3 minutes.
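Sanity check on those numbers: the backfill spans 2*366*24 = 17568 hours at one point per 1800s, i.e. 35136 points per series, times 500 series (--mpo 500) ≈ 17.6M points; at --speedup 360000 the wall-clock time is 17568*3600/360000 ≈ 176s ≈ 3 minutes, so the ingest rate works out to roughly 17.6M/176 ≈ 100k points/s.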

When done: in grafana, look at some.id.of.a.metric.{1,72,199,320,438};id=*;some=tag to confirm the data looks good (over 2y).

http benchmarking

when I was trying with only 1 series:

curl 'http://localhost:6060/render?target=some.id.of.a.metric.*&format=json&from=-2y' # to spot check result
time curl 'http://localhost:6060/render?target=some.id.of.a.metric.*&format=json&from=-2y' >/dev/null # benchmark

pre and post were both about 1.46 to 1.48s

same but with more series

input:

GET http://localhost:6060/render?target=some.id.of.a.metric.*&format=json&from=-2y
GET http://localhost:6060/
GET http://localhost:6060/
GET http://localhost:6060/
GET http://localhost:6060/

pre

date; cat input | vegeta attack -rate 1 -duration 300s | vegeta report; date
Thu Jun  7 21:04:00 CEST 2018
Requests      [total, rate]            300, 1.00
Duration      [total, attack, wait]    4m59.002713175s, 4m59.000036889s, 2.676286ms
Latencies     [mean, 50, 95, 99, max]  516.551765ms, 2.617993ms, 2.588240007s, 2.640280368s, 2.710676775s
Bytes In      [total, mean]            756023280, 2520077.60
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:
Thu Jun  7 21:08:59 CEST 2018

post

date; cat input | vegeta attack -rate 1 -duration 300s | vegeta report; date
Thu Jun  7 21:28:00 CEST 2018
Requests      [total, rate]            300, 1.00
Duration      [total, attack, wait]    4m59.002309685s, 4m59.000077373s, 2.232312ms
Latencies     [mean, 50, 95, 99, max]  509.845307ms, 2.964046ms, 2.549719772s, 2.589971517s, 2.648294993s
Bytes In      [total, mean]            756023172, 2520077.24
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:
Thu Jun  7 21:32:59 CEST 2018

ram/cpu/io:
https://snapshot.raintank.io/dashboard/snapshot/K3C0XYLkWl4Vjq3nZtI5oih7wLek4h0J?orgId=2
MT dashboard:
https://snapshot.raintank.io/dashboard/snapshot/OrKp30bsRTMfS2pBsXNBWw6r48rV224I?orgId=2

conclusion

  • far fewer, but slower, cassandra queries (as expected)
  • very minor decrease in http render latency
  • significant cassandra CPU reduction, in exchange for a slight MT CPU increase
  • iters-to-points became 6ms slower: why? Because of the increased cpu? Or did we accidentally return more iters/chunks?
  • cassandra-to-iter is 8ms slower: why? Perhaps because of the sorting?
  • cassandra get-chunks is about 20ms faster

Dieterbe (Contributor, Author) commented Jun 7, 2018

Not looking too promising.
I tried raising get-targets-concurrency, but to no avail: the cassandra searches simply become correspondingly slower, making the end result the same.
CPU spent in go-tsz.(*bstream).readBits, go-tsz.(*bstream).readByte, and go-tsz.(*Iter).Next is about 40%. I tried upgrading go-tsz to the latest version, but it's about the same. Cassandra latency outweighs the rest, so the cpu profile is misleading.

@Dieterbe Dieterbe force-pushed the cass-multiple-rowkeys branch 2 times, most recently from af9b8d0 to 952d194 on June 8, 2018 at 11:35
Dieterbe (Contributor, Author) commented Jun 8, 2018

PLOT TWIST:
The problem happens when enabling the chunk cache: cpu goes up, memory blows up, and the "Failed to submit event to accounting, channel was blocked" errors look very similar to the production event.
Note the 80% success rate (4 out of 5 requests are health checks, so effectively all the render requests were failing).
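For reference, the error comes from the accounting event submission being a non-blocking send, roughly this shape (a sketch with illustrative types, not the actual flat_accnt code):

package accntsketch

import "log"

type event struct {
	t  uint8
	pl interface{}
}

type FlatAccnt struct {
	eventQ chan event
}

// act submits an event without blocking: if the event loop can't keep up and
// the queue is full, the event is dropped and only the log line remains.
// Under a heavy read workload that means cache hit/add accounting gets lost.
func (a *FlatAccnt) act(t uint8, pl interface{}) {
	select {
	case a.eventQ <- event{t: t, pl: pl}:
	default:
		log.Println("Failed to submit event to accounting, channel was blocked")
	}
}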

see:
https://snapshot.raintank.io/dashboard/snapshot/EtYL25AHkAERJJa5p7KH9JrthFLDJgyb?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/qBbjmBD1uzeh3U6eCaM6mw1ltk1w9jgC?orgId=2

date; cat input | vegeta attack -rate 1 -duration 300s | vegeta report; date
Fri Jun  8 14:01:11 CEST 2018
Requests      [total, rate]            300, 1.00
Duration      [total, attack, wait]    4m59.000800494s, 4m58.999803659s, 996.835µs
Latencies     [mean, 50, 95, 99, max]  1.579504ms, 939.818µs, 3.072098ms, 12.04762ms, 97.612382ms
Bytes In      [total, mean]            480, 1.60
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  80.00%
Status Codes  [code:count]             200:240  0:60  
Error Set:
Get http://localhost:6060/render?target=some.id.of.a.metric.*&format=json&from=-2y: net/http: timeout awaiting response headers
Fri Jun  8 14:06:36 CEST 2018
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | 2018/06/08 12:06:07 [flat_accnt.go:136 act()] [E] Failed to submit event to accounting, channel was blocked
metrictank_1    | [Macaron] 2018-06-08 12:06:08: Started GET / for 172.20.0.1
metrictank_1    | [Macaron] 2018-06-08 12:06:08: Completed / 200 OK in 302.36µs
metrictank_1    | [Macaron] 2018-06-08 12:06:09: Started GET / for 172.20.0.1
metrictank_1    | [Macaron] 2018-06-08 12:06:09: Completed / 200 OK in 311.431µs
metrictank_1    | [Macaron] 2018-06-08 12:06:10: Started GET / for 172.20.0.1
metrictank_1    | [Macaron] 2018-06-08 12:06:10: Completed / 200 OK in 275.913µs
caddy           | 08/Jun/2018:12:06:11 +0000 [ERROR 502 /render] context canceled
metrictank_1    | [Macaron] 2018-06-08 12:06:16: Completed /render?target=some.id.of.a.metric.*&format=json&from=-2y 499  in 44.777299076s
caddy           | 08/Jun/2018:12:06:16 +0000 [ERROR 502 /render] context canceled
metrictank_1    | [Macaron] 2018-06-08 12:06:19: Completed /render?target=some.id.of.a.metric.*&format=json&from=-2y 499  in 43.01550749s
metrictank_1    | [Macaron] 2018-06-08 12:06:19: Completed /render?target=some.id.of.a.metric.*&format=json&from=-2y 499  in 38.015525732s
caddy           | 08/Jun/2018:12:06:21 +0000 [ERROR 502 /render] context canceled
caddy           | 08/Jun/2018:12:06:26 +0000 [ERROR 502 /render] context canceled

Dieterbe (Contributor, Author) commented Jun 13, 2018

After applying #943:

  • cpu is a bit lower
  • ram doesn't blow up anymore
  • requests all finish in under 3s, even when the cache is supposed to be warm (which is still way too high)
  • still seeing many cassandra requests (which I suspect is why latency doesn't go down), and the dashboard shows them all as hit-partial.
  • also still seeing "Failed to submit event to accounting, channel was blocked", which might be preventing the cache from working properly. Printing all events in FlatAccnt.act and running the workload shows about 90% are caused by evnt_hit_chnk and 10% by evnt_add_chnk => will work on batched events for the cache.
    CPU profile:
(pprof) top30 -cum cache
Active filters:
   focus=cache
Showing nodes accounting for 3.72s, 8.48% of 43.88s total
Dropped 91 nodes (cum <= 0.22s)
Showing top 30 nodes out of 125
      flat  flat%   sum%        cum   cum%
     0.20s  0.46%  0.46%      5.98s 13.63%  github.com/grafana/metrictank/mdata/cache.(*CCache).Search.func1
     0.06s  0.14%  0.59%      5.78s 13.17%  github.com/grafana/metrictank/mdata/cache/accnt.(*FlatAccnt).HitChunk
     0.04s 0.091%  0.68%      5.25s 11.96%  github.com/grafana/metrictank/mdata/cache/accnt.(*FlatAccnt).act
     0.04s 0.091%  0.77%      4.08s  9.30%  github.com/grafana/metrictank/vendor/github.com/raintank/worldping-api/pkg/log.Error
     0.03s 0.068%  0.84%      4.04s  9.21%  github.com/grafana/metrictank/vendor/github.com/raintank/worldping-api/pkg/log.(*Logger).Error
     0.13s   0.3%  1.14%      3.77s  8.59%  github.com/grafana/metrictank/mdata/cache/accnt.(*FlatAccnt).eventLoop
     0.07s  0.16%  1.30%      3.45s  7.86%  github.com/grafana/metrictank/vendor/github.com/raintank/worldping-api/pkg/log.(*Logger).writerMsg
     0.29s  0.66%  1.96%      2.70s  6.15%  github.com/grafana/metrictank/mdata/cache/accnt.(*LRU).touch
         0     0%  1.96%      2.51s  5.72%  runtime.systemstack
     0.49s  1.12%  3.08%      1.78s  4.06%  runtime.mallocgc
     0.02s 0.046%  3.12%      1.77s  4.03%  runtime.Caller
     0.85s  1.94%  5.06%      1.51s  3.44%  runtime.mapaccess2
         0     0%  5.06%      1.37s  3.12%  github.com/grafana/metrictank/api.(*Server).getSeriesFixed
         0     0%  5.06%      1.37s  3.12%  github.com/grafana/metrictank/api.(*Server).getTarget
         0     0%  5.06%      1.37s  3.12%  github.com/grafana/metrictank/api.(*Server).getTargetsLocal.func1
         0     0%  5.06%      1.35s  3.08%  github.com/grafana/metrictank/api.(*Server).getSeries
         0     0%  5.06%      1.35s  3.08%  github.com/grafana/metrictank/api.(*Server).getSeriesCachedStore
     0.19s  0.43%  5.49%      1.35s  3.08%  runtime.chansend
     0.02s 0.046%  5.54%      1.13s  2.58%  runtime.selectnbsend
         0     0%  5.54%      1.12s  2.55%  github.com/grafana/metrictank/mdata/cache.(*CCache).Search
         0     0%  5.54%      1.12s  2.55%  github.com/grafana/metrictank/mdata/cache.(*CCacheMetric).Search
     0.12s  0.27%  5.81%      1.12s  2.55%  github.com/grafana/metrictank/mdata/cache.(*CCacheMetric).searchForward
     0.02s 0.046%  5.86%      1.11s  2.53%  runtime.callers
         0     0%  5.86%      1.09s  2.48%  runtime.callers.func1
     0.20s  0.46%  6.31%      1.09s  2.48%  runtime.gentraceback
         0     0%  6.31%      1.01s  2.30%  fmt.Sprintf
     0.06s  0.14%  6.45%      0.90s  2.05%  container/list.(*List).MoveToFront
     0.19s  0.43%  6.88%      0.87s  1.98%  runtime.gcWriteBarrier
     0.34s  0.77%  7.66%      0.85s  1.94%  runtime.pcvalue
     0.36s  0.82%  8.48%      0.84s  1.91%  runtime.lock
(pprof) list LRU.*touch
Total: 43.88s
ROUTINE ======================== github.com/grafana/metrictank/mdata/cache/accnt.(*LRU).touch in /home/dieter/go/src/github.com/grafana/metrictank/mdata/cache/accnt/lru.go
     290ms      2.70s (flat, cum)  6.15% of Total
         .          .     14:		list:  list.New(),
         .          .     15:		items: make(map[interface{}]*list.Element),
         .          .     16:	}
         .          .     17:}
         .          .     18:
      20ms       20ms     19:func (l *LRU) touch(key interface{}) {
     270ms      1.78s     20:	if ent, ok := l.items[key]; ok {
         .      900ms     21:		l.list.MoveToFront(ent)
         .          .     22:	} else {
         .          .     23:		l.items[key] = l.list.PushFront(key)
         .          .     24:	}
         .          .     25:}
         .          .     26:

The logger.Error overhead is coming from the FlatAccnt.act() errors, so moving to batched events should save about 10% cpu.
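To make the batching idea concrete, a rough sketch (illustrative types and signatures; the real payloads differ): the cache search would submit one event per search carrying all hit chunks, instead of one event per chunk, so the event queue sees far fewer sends:

package accntsketch

const (
	evntHitChnk uint8 = iota
	evntHitChnks
)

type HitPayload struct {
	metric string
	ts     uint32
}

type HitsPayload struct {
	metric string
	tss    []uint32
}

type FlatAccnt struct{}

// non-blocking send into the event queue, as sketched earlier
func (a *FlatAccnt) act(t uint8, pl interface{}) {}

// before: one channel send (and potentially one dropped event plus log line) per chunk hit
func (a *FlatAccnt) HitChunk(metric string, ts uint32) {
	a.act(evntHitChnk, &HitPayload{metric: metric, ts: ts})
}

// after: one channel send per cache search, covering all hit chunks at once
func (a *FlatAccnt) HitChunks(metric string, tss []uint32) {
	a.act(evntHitChnks, &HitsPayload{metric: metric, tss: tss})
}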

Dieterbe (Contributor, Author) commented Jun 14, 2018

I ran the experiment above again (with chunkcache-AddRange merged, but not yet the batched accounting).
The stats look a bit different and I'm not sure why (maybe I ran with a different cpu governor last time), but anyway, I just want to compare this against the latest code, which uses batched accounting.

Differences with the previous run: cpu at 250%, full cache hits, no cassandra hits once cached. (This doesn't match my previous observation, where cpu was lower and we only had partial hits, leading to cassandra requests.)

date; cat input | vegeta attack -rate 1 -duration 300s | vegeta report; date
Thu Jun 14 12:50:59 CEST 2018
Requests      [total, rate]            300, 1.00
Duration      [total, attack, wait]    4m59.000888467s, 4m58.99990704s, 981.427µs
Latencies     [mean, 50, 95, 99, max]  506.915838ms, 1.762138ms, 2.570193057s, 2.676656689s, 4.827845583s
Bytes In      [total, mean]            756995640, 2523318.80
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:
Thu Jun 14 12:55:58 CEST 2018

https://snapshot.raintank.io/dashboard/snapshot/9j0cGMOzq3sz9b4E7zSO0yIfUa6Wb9dW?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/eZU42MDabp3yGaPxran2v1b4nep1bEQK?orgId=2

batched accounting:

date; cat input | vegeta attack -rate 1 -duration 300s | vegeta report; date
Thu Jun 14 13:47:37 CEST 2018
Requests      [total, rate]            300, 1.00
Duration      [total, attack, wait]    4m59.003074928s, 4m59.000059056s, 3.015872ms
Latencies     [mean, 50, 95, 99, max]  371.625767ms, 2.817488ms, 1.857461051s, 1.959102619s, 3.821066751s
Bytes In      [total, mean]            756931320, 2523104.40
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:
Thu Jun 14 13:52:36 CEST 2018

https://snapshot.raintank.io/dashboard/snapshot/EwZbEhluRrfeaurmDwxKNpS67Ax7Z0Zx?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/SFnItYfVuwx5JUce6Jhyo95x5r12qRkD

conclusion:

  • 20% cpu reduction
  • faster response times
  • hit rate is the same for both (100%); no cassandra queries after the initial ones.

Dieterbe added a commit that referenced this pull request Jun 14, 2018
1) less contention on channel
2) prevent channel overflowing
3) less CPU spent printing
   `Failed to submit event to accounting, channel was blocked`

For the record,
When I patch the code like so:

diff --git a/mdata/cache/accnt/flat_accnt.go b/mdata/cache/accnt/flat_accnt.go
index 4097e067..d202960c 100644
--- a/mdata/cache/accnt/flat_accnt.go
+++ b/mdata/cache/accnt/flat_accnt.go
@@ -1,7 +1,9 @@
 package accnt

 import (
+       "fmt"
        "sort"
+       "time"

        "github.com/grafana/metrictank/mdata/chunk"
        "github.com/raintank/worldping-api/pkg/log"
@@ -178,6 +180,7 @@ func (a *FlatAccnt) eventLoop() {
                                        },
                                )
                        case evnt_add_chnks:
+                               pre := time.Now()
                                payload := event.pl.(*AddsPayload)
                                a.addRange(payload.metric, payload.chunks)
                                cacheChunkAdd.Add(len(payload.chunks))
@@ -189,6 +192,7 @@ func (a *FlatAccnt) eventLoop() {
                                                },
                                        )
                                }
+                               fmt.Println("CODE evnt_add_chnks took", time.Since(pre).Nanoseconds())
                        case evnt_hit_chnk:
                                payload := event.pl.(*HitPayload)
                                a.lru.touch(
@@ -198,6 +202,7 @@ func (a *FlatAccnt) eventLoop() {
                                        },
                                )
                        case evnt_hit_chnks:
+                               pre := time.Now()
                                payload := event.pl.(*HitsPayload)
                                for _, chunk := range payload.chunks {
                                        a.lru.touch(
@@ -207,6 +212,7 @@ func (a *FlatAccnt) eventLoop() {
                                                },
                                        )
                                }
+                               fmt.Println("CODE evnt_hit_chnks took", time.Since(pre).Nanoseconds())

and run the workload described in
#931
(queries for 2y of data across 500 series) I get these durations:

awk '/evnt_add_chnks/ {print $4}' code2.txt |goplot hist
1187.00 -> 11160892.40: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  850
           22320597.80: ▇▇  39 (upto 22 ms)
           33480303.20: ▇  17 (upto 33 ms)
           44640008.60: ▏  5 (upto 45 ms)
           55799714.00: ▏  5 (upto 56 ms)
           66959419.40: ▏  1 (upto 67 ms)
           78119124.80: ▏  0
           89278830.20: ▏  0
          100438535.60: ▏  0
          111598241.00: ▏  1 (upto 111 ms)

awk '/evnt_hit_chnks/ {print $4}' code2.txt |goplot hist
45.00 -> 22366478.50: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  7624
         44732912.00: ▏  65 (upto 45 ms)
         67099345.50: ▏  29 (upto 67 ms)
         89465779.00: ▏  8 (upto 89 ms)
        111832212.50: ▏  1 (upto 112 ms)
        134198646.00: ▏  1 (upto 134 ms)
        156565079.50: ▏  0
        178931513.00: ▏  1 (upto 179 ms)
        201297946.50: ▏  0
        223664380.00: ▏  1 (upto 224 ms)
Dieterbe (Contributor, Author) commented Jun 14, 2018

The original problem is now addressed via #943, but this PR is still useful to merge, especially to lower cassandra CPU usage.

I just pushed these updates to this PR:

  • removed the "do not merge" commits that introduced and reverted MT_CHUNK_CACHE_MAX_SIZE: 0 and that adjusted the retentions for testing (retentions = 1s:6h:2min:2,30min:2y:6h:1 in docker/docker-dev-custom-cfg-kafka/storage-schemas.conf)
  • rebased to latest master

NOTE: when merging, we need to add nil checks for the new ccache methods introduced in #943.
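Roughly what I mean, as a sketch (types and field names are illustrative stand-ins; the assumption is that the chunk cache can be nil when disabled, so calls to the new #943 methods such as AddRange need guarding):

package apisketch

// illustrative stand-ins for the real types
type IterGen struct{}

type ChunkCache interface {
	AddRange(metric string, itergens []IterGen)
}

type Server struct {
	cache ChunkCache // nil when the chunk cache is disabled
}

func (s *Server) cacheResult(metric string, itergens []IterGen) {
	if s.cache == nil {
		return // chunk cache disabled: nothing to add
	}
	s.cache.AddRange(metric, itergens)
}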

@Dieterbe Dieterbe changed the title from "WIP Cassandra multiple rowkeys" to "refactor cassandra read queries" on Jun 14, 2018
@Dieterbe Dieterbe requested review from replay and woodsaj June 14, 2018 13:00
* CassandraStore.Search: instead of doing a query for each row,
  do a single query across all rows. With 4-week rows, long
  timerange queries would otherwise hit a lot of rows.
* outcome -> readResult
* no more need to sort readResults (there is only one); however,
  we must now sort all chunks coming back from Cassandra.
* change:
  store.cassandra.chunks_per_row to
  store.cassandra.chunks_per_response
* pre-prepared queries for better performance
* remove a bunch of unneeded cruft
* by keeping a few properties in the Table definition
  we can cleanup a bunch of code
const QueryFmtRead = "SELECT ts, data FROM %s WHERE key IN ? AND ts < ?"
const QueryFmtWrite = "INSERT INTO %s (key, ts, data) values(?,?,?) USING TTL ?"

// TTLTables stores table definitions keyed by their TTL
Review comment from a Contributor:

am i right that the TTLs are in hours? if so i'd specify that, otherwise it's very easy to assume that's seconds

Reply from Dieterbe (Contributor, Author):

they are not
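For context on the excerpt above, a rough sketch of the Table / TTLTables idea (only the query constants are taken from the diff; field names and the constructor are illustrative): each table keeps its name, its TTL and its pre-formatted read/write queries, so the rest of the code no longer has to build query strings:

package storesketch

import "fmt"

const (
	QueryFmtRead  = "SELECT ts, data FROM %s WHERE key IN ? AND ts < ?"
	QueryFmtWrite = "INSERT INTO %s (key, ts, data) values(?,?,?) USING TTL ?"
)

// Table bundles everything needed to talk to one Cassandra table.
type Table struct {
	Name       string
	QueryRead  string // QueryFmtRead with the table name substituted in
	QueryWrite string // QueryFmtWrite with the table name substituted in
	TTL        uint32
}

// TTLTables stores table definitions keyed by their TTL
type TTLTables map[uint32]Table

func newTable(name string, ttl uint32) Table {
	return Table{
		Name:       name,
		QueryRead:  fmt.Sprintf(QueryFmtRead, name),
		QueryWrite: fmt.Sprintf(QueryFmtWrite, name),
		TTL:        ttl,
	}
}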

@replay replay (Contributor) left a review:

LGTM

@Dieterbe Dieterbe merged commit d12b42f into master Jun 16, 2018
Dieterbe added a commit that referenced this pull request Jun 18, 2018 (same commit message as the Jun 14 commit above)
@Dieterbe Dieterbe mentioned this pull request Jun 29, 2018
@Dieterbe Dieterbe deleted the cass-multiple-rowkeys branch September 18, 2018 08:58