Use Subqueries to Count Distinct 50X Faster

22 Jan 2014

NB: These techniques are universal, but for syntax we chose Postgres. Thanks to the inimitable pgAdminIII for the Explain graphics.

So Useful, Yet So Slow

Count distinct is the bane of SQL analysts, so it was an obvious choice for our first blog post.

First things first: If you have a huge dataset and can tolerate some imprecision, a probabilistic counter like HyperLogLog can be your best bet. (We'll return to HyperLogLog in a future blog post.) But for a quick, precise answer, some simple subqueries can save you a lot of time.

Let's start with a simple query we run all the time: Which dashboards do most users visit?

select 
  dashboards.name, 
  count(distinct time_on_site_logs.user_id)
from time_on_site_logs 
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by name 
order by count desc

In Periscope, this would give you a graph like this:

For starters, let's assume the handy indices on user_id and dashboard_id are in place, and there are lots more log lines than dashboards and users.

On just 10 million rows, this query takes 48 seconds. To understand why, let's consult our handy SQL explain:

Explain Slow

It's slow because the database is iterating over all the logs and all the dashboards, then joining them, then sorting them, all before getting down to real work of grouping and aggregating.

Aggregate, Then Join

Anything after the group-and-aggregate is going to be a lot cheaper because the data size is much smaller. Since we don't need dashboards.name in the group-and-aggregate, we can have the database do the aggregation first, before the join:

select
  dashboards.name,
  log_counts.ct
from dashboards
join (
  select
    dashboard_id,
    count(distinct user_id) as ct
  from time_on_site_logs 
  group by dashboard_id
) as log_counts 
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc

This query runs in 20 seconds, a 2.4X improvement! Once again, our trusty explain will show us why:

Explain Faster

As promised, our group-and-aggregate comes before the join. And, as a bonus, we can take advantage of the index on the time_on_site_logs table.

First, Reduce The Data Set

We can do better. By doing the group-and-aggregate over the whole logs table, we made our database process a lot of data unnecessarily. Count distinct builds a hash set for each group — in this case, each dashboard_id — to keep track of which values have been seen in which buckets.

Instead of doing all that work, we can compute the distincts in advance, which only needs one hash set. Then we do a simple aggregation over all of them.

select
  dashboards.name,
  log_counts.ct
from dashboards 
join (
  select distinct_logs.dashboard_id, 
  count(1) as ct
  from (
    select distinct dashboard_id, user_id
    from time_on_site_logs
  ) as distinct_logs
  group by distinct_logs.dashboard_id
) as log_counts 
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc

We've taken the inner count-distinct-and-group and broken it up into two pieces. The inner piece computes distinct (dashboard_id, user_id) pairs. The second piece runs a simple, speedy group-and-count over them. As always, the join is last.

Explain Fastest

And now for the big reveal: This sucker takes 0.7 seconds! That's a 28X increase over the previous query, and a 68X increase over the original query.

As always, data size and shape matters a lot. These examples benefit a lot from a relatively low cardinality. There are a small number of distinct (user_id, dashboard_id) pairs compared to the total amount of data. The more unique pairs there are — the more data rows are unique snowflakes that must be grouped and counted — the less free lunch there will be.

Next time count distinct is taking all day, try a few subqueries to lighten the load.

Who Are You Guys, Anyway?

We make Periscope, a tool that makes SQL data analysis really fast. We'll be using this space to share the algorithms and techniques we've baked into our product.

You can sign up on our homepage to be notified as we take on new customers.

Update: More Data!

See our follow-up post, Count Distinct Compared on Top 4 SQL Databases to see how these improvements hold up on MySQL, Oracle and SQL Server.

Even More Updates: Count Distinct Even Faster with HyperLogLog!

See our more recent follow-up, HyperLogLog in Pure SQL to learn how to speed up count distinct even further with probabilistic counting and parallelism!

Thank you