Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add a new version of the R30 phone-home metric, which removes a false impression of retention given by the old R30 metric #10332

Merged
merged 24 commits into from
Jul 19, 2021
Merged
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
899e7c4
Add initial R30v2 tests
reivilibre Jul 6, 2021
7a87cbe
Allow custom headers (e.g. User-Agent) in some test helpers
reivilibre Jul 6, 2021
e106b94
Add code to calculate R30v2
reivilibre Jul 7, 2021
98dcf2b
Tweak and add R30v2 tests
reivilibre Jul 7, 2021
8511e1a
Add R30v2 stats to the phone home data
reivilibre Jul 7, 2021
248966a
Newsfile
reivilibre Jul 7, 2021
39a3b8f
Explicitly report 0 when no clients of that type are present
reivilibre Jul 9, 2021
2eda2a7
Remove review question
reivilibre Jul 9, 2021
261b6c4
Remove some unneeded advances
reivilibre Jul 9, 2021
6b6b6f2
Apply rename
reivilibre Jul 9, 2021
fd4f493
Rewrite R30v2 query to not use window function
reivilibre Jul 9, 2021
92d215c
Cast things to BIGINT because poor Postgres cries
reivilibre Jul 9, 2021
9353939
Add alias to make Postgres happy
reivilibre Jul 9, 2021
6762bcf
Move multiplies to python from SQL
reivilibre Jul 19, 2021
2fca561
Don't bother ordering
reivilibre Jul 19, 2021
80c9187
Simplify WHEN chain
reivilibre Jul 19, 2021
79db58b
Remove review comment
reivilibre Jul 19, 2021
982b2d1
Clarify desired_time is secs
reivilibre Jul 19, 2021
7860bc9
Factorise out store
reivilibre Jul 19, 2021
a865726
Clean up and standardise the advance times
reivilibre Jul 19, 2021
44a0f91
Describe the structure of the dict
reivilibre Jul 19, 2021
46827cf
Clean up casts and move multiplications into Python
reivilibre Jul 19, 2021
899251c
Clarify comment about five minutes
reivilibre Jul 19, 2021
8a4f589
Factorise store again (oops)
reivilibre Jul 19, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/10332.feature
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add a new version of the R30 phone-home metric, which removes a false impression of retention given by the old R30 metric.
4 changes: 4 additions & 0 deletions synapse/app/phone_stats_home.py
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,10 @@ async def phone_stats_home(hs, stats, stats_process=_stats_process):
for name, count in r30_results.items():
stats["r30_users_" + name] = count

r30v2_results = await hs.get_datastore().count_r30_users()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these repeated calls to hs.get_datastore() are fugly :/.

It's fine for now, for consistency with the rest of the code, but at some point it would be nice to factor this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a follow-up PR ready to go after this one

for name, count in r30v2_results.items():
stats["r30v2_users_" + name] = count

stats["cache_factor"] = hs.config.caches.global_factor
stats["event_cache_size"] = hs.config.caches.event_cache_size

Expand Down
129 changes: 129 additions & 0 deletions synapse/storage/databases/main/metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -316,6 +316,135 @@ def _count_r30_users(txn):

return await self.db_pool.runInteraction("count_r30_users", _count_r30_users)

async def count_r30v2_users(self) -> Dict[str, int]:
"""
Counts the number of 30 day retained users, defined as users that:
- Appear more than once in the past 60 days
- Have more than 30 days between the most and least recent appearances that
occurred in the past 60 days.

(This is the second version of this metric, hence R30'v2')

Returns:
A mapping from client type to the number of 30-day retained users for that client.

The dict keys are:
- "all" (a combined number of users across any and all clients)
- "android" (Element Android)
- "ios" (Element iOS)
- "electron" (Element Desktop)
- "web" (any web application -- it's not possible to distinguish Element Web here)
"""

def _count_r30v2_users(txn):
thirty_days_in_secs = 86400 * 30
now = int(self._clock.time())
sixty_days_ago_in_secs = now - 2 * thirty_days_in_secs
one_day_from_now_in_secs = now + 86400

# This is the 'per-platform' count.
sql = """
SELECT
client_type,
count(client_type)
FROM
(
SELECT
user_id,
CASE
WHEN
LOWER(user_agent) LIKE '%%riot%%' OR
LOWER(user_agent) LIKE '%%element%%'
THEN CASE
WHEN
LOWER(user_agent) LIKE '%%electron%%'
THEN 'electron'
WHEN
LOWER(user_agent) LIKE '%%android%%'
THEN 'android'
WHEN
LOWER(user_agent) LIKE '%%ios%%'
THEN 'ios'
ELSE 'unknown'
END
WHEN
LOWER(user_agent) LIKE '%%mozilla%%' OR
LOWER(user_agent) LIKE '%%gecko%%'
THEN 'web'
ELSE 'unknown'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please could we avoid having two slightly different copies of the User-Agent to platform mapping? factor it out to a constant, maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to change the pre-existing R30 mapping, but as far as I know, we only want to count Element clients for R30v2.

END as client_type
FROM
user_daily_visits
WHERE
timestamp > ?
AND
timestamp < ?
GROUP BY
user_id,
client_type
HAVING
max(timestamp) - min(timestamp) > ?
) AS temp
GROUP BY
client_type
;
"""

# We initialise all the client types to zero, so we get an explicit
# zero if they don't appear in the query results
results = {"ios": 0, "android": 0, "web": 0, "electron": 0}
txn.execute(
sql,
(
sixty_days_ago_in_secs * 1000,
one_day_from_now_in_secs * 1000,
thirty_days_in_secs * 1000,
),
)

for row in txn:
if row[0] == "unknown":
continue
results[row[0]] = row[1]

# This is the 'all users' count.
sql = """
SELECT COUNT(*) FROM (
SELECT
1
FROM
user_daily_visits
WHERE
timestamp > ?
AND
timestamp < ?
GROUP BY
user_id
HAVING
max(timestamp) - min(timestamp) > ?
) AS r30_users
"""

txn.execute(
sql,
(
sixty_days_ago_in_secs * 1000,
one_day_from_now_in_secs * 1000,
thirty_days_in_secs * 1000,
),
)
row = txn.fetchone()
if row is None:
results["all"] = 0
else:
results["all"] = row[0]

return results

return await self.db_pool.runInteraction(
"count_r30v2_users", _count_r30v2_users
)

def _get_start_of_day(self):
"""
Returns millisecond unixtime for start of UTC day.
Expand Down
Loading