Using Self Joins To Calculate Your Retention, Churn, And Reactivation Metrics

27 Jun 2014

Want to learn how to write faster SQL?

Get our eBook

Joining a table to itself

Most sophisticated user analysis looks at repeat usage: How many users come back? How many don't? How many come back after an extended absence?

To get at these answers, we have to look at what our current users have done in the past. This is tricky in SQL, which doesn't have explicit ways to bucket users based on past behavior.

The answer is to join tables to themselves. By using a template like select ... from events past_events join events current_events, you can get users' past activity and current activity in a single row, making it much easier to reason about both at the same time.

In this post, we'll take you through a few common metrics — Retention, Churn, and Reactivation — and show you how to build them in SQL.

Retention

We'll start with the most common metric: User retention. Simply put, we want to know how many users were active both last month and this month.

Let's start by aggregating our events table into a monthly rollup, like so:

with monthly_activity as (
  select distinct 
    date_trunc('month', created_at) as month, 
    user_id
  from events
)

Now that we have this table, we'll join it to itself. The left-hand-side of the join will represent this month, and the right-hand side will represent last month. We want to make sure that there's only one row per current_month per user_id, and that row only gets included in the join if the user was active this month and last.

Here's the query:

with monthly_activity as (
  select distinct
    date_trunc('month', created_at) as month,
    user_id
  from events
)
select 
  this_month.month, 
  count(distinct user_id)
from monthly_activity this_month
join monthly_activity last_month
  on this_month.user_id = last_month.user_id
  and this_month.month = last_month.month + interval '1 month'
group by month

Our two join conditions are:

  1. this_month.month = last_month.month + interval '1 month': This sets up how the join works. We want rows that include activity from this month and activity from last month, so we can reason about both time periods in the same statement.
  2. this_month.user_id = last_month.user_id: Ensures one row per user_id, and importantly, excludes rows where the user wasn't present both months.

Together, these two conditions give us a table containing only retained users every month, which we simply group and count over!

Churn

Now we'll take retention and turn it on its head: How many users last month did not come back this month?

We'll use the same self join, except this time, we want to only include rows in last_month that do not have equivalents in this_month. Here's the query:

with monthly_activity as (
  select distinct 
    date_trunc('month', created_at) as month, 
    user_id
  from events
)
select 
  this_month.month, 
  count(distinct user_id)
from monthly_activity this_month
left join monthly_activity last_month
  on this_month.user_id = last_month.user_id
  and this_month.month = last_month.month + interval '1 month'
where this_month.user_id is null
group by 1

We've changed our query in two ways:

  1. left join: This includes every row from last month, not just the ones with users who were active this month. This sets up the next step:
  2. where this_month.user_id is null: After the join, we filter out rows where the user was active this month.

This leaves us with a table of users who were active last month but not this month, which once again we can group and count over!

Reactivated Users

Reactivated Users are users who previously churned, but have now come back. Perhaps they had a renewed need for the product, or perhaps your compelling email retention campaign swayed them.

For reactivated users, we'll introduce a new table, containing users' first active months. Here it is:

with first_activity as (
  select user_id, date(min(created_at)) as month
  from events
  group by 2
)

We're going to include all active users each month for whom:

  1. This is not their first month (because then they'd be new).
  2. They were not active the previous month (because then they'd be retained).

Here's how we do it:

with 
monthly_activity as (
  select distinct
    date_trunc('month', created_at) as month,
    user_id
  from events
),
first_activity as (
  select user_id, date(min(created_at)) as month
  from events
  group by 2
)
select 
  current_activity.month, 
  count(distinct user_id)
from monthly_activity this_month
left join monthly_activity last_month
  on this_month.user_id = last_month.user_id
  and this_month.month = last_month.month + interval '1 month'
join first_activity
  on current_activity.user_id = first_activity.user_id
  and first_activity.month != current_activity.month
where last_month.user_id is null
group by 1

Similar to our Churn query, we employ a couple things in tandem:

  1. left join: We want every activity from the current month, even if they weren't active last month.
  2. where last_month.user_id is null: This is the reverse of the trick we used for our Churn query. We want only users who were active this month and not last month.
  3. join first_activity ... on first_activity.month != current_activity.month: This clause excludes new users who joined this month.

Combined together, we get users who were active this month, were not active last month, and are not new: Reactivated users!

Let Periscope Optimize Monthly Activity

For reasons we discussed yesterday, self joins can be really slow. One way to optimize them is to sign up for Periscope and set up your monthly_activity and first_activity tables as views in Periscope!

We'll keep them hot in our cache for you, making the joins really fast. And you won't even have to specify them in every query. Learn more at our homepage.

Thank you