Sampling is an incredibly powerful tool to speed up analyses at scale. While it's not appropriate for all datasets or all analyses, when it works, it really works. At Periscope, we've realized several orders of magnitude in speedups on large datasets with judicious use of sampling.
However, when sampling from databases, it's easy to lose all your speedups by using inefficient methods to select the sample itself. In this post we'll show you how to select random samples in fractions of a second.
The obvious, correct, slow solution
Let's say we want to send a coupon to a random hundred users as an experiment. Quick, to the database!
The naive approach sorts the entire table randomly and selects N results. It's slow, but it's simple and it works even when there are gaps in the primary keys.
Selecting a random row in MySQL
select * from users order by rand() limit 1
Selecting a random row in PostgreSQL
select * from users order by random() limit 1
Selecting a random row in Microsoft SQL Server
select top 1 column from users order by newid()
Selecting a random row in Oracle Database
select * from ( select * from users order by dbms_random.value ) where rownum = 1
Thanks to Pete Freitag's website for these starting points.
This query is taking forever!
On a Postgres database with 20M rows in the users table, this query takes 17.51 seconds! To find out why, let's return to our trusty explain:
The database is sorting the entire table before selecting our 100 rows! This is an O(n log n) operation, which can easily take minutes or longer on a 100M+ row table. Even on medium-sized tables, a full table sort is unacceptably slow in a production environment.
Query faster by sorting only a subset of the table
The most obvious way to speed this up is to filter down the dataset before doing the expensive sort.
We'll select a larger sample than we need and then limit it, because we might get randomly fewer than the expected number of rows in the subset. We also need to randomly sort afterward to avoid biasing towards earlier rows in the table.
Here's our new query:
select * from users where random() < 200 / (select count(1) from logs)::float order by random() limit 100
(We'll be using Postgres from this point forward for simplicity. Most of these techniques work well on other DBs.)
This baby runs in 7.97s: Twice as fast!
This is pretty good, but we can do better. You'll notice we're still scanning the table, albeit after the restriction. Our next step will be to avoid scans of any kind.
Generate random indices in the ID range
Ideally we wouldn't use any scans at all, and rely entirely on index lookups. If we have an upper bound on table size, we can generate random numbers in the ID range and then lookup the rows with those IDs.
select * from users where id in ( select round(random() * 21e6)::integer as id from generate_series(1, 110) group by id -- Discard duplicates ) limit 100
This puppy runs in 0.064s, a 273X speedup over the naive query!
Counting the table itself takes almost 8 seconds, so we'll just pick a constant beyond the end of the ID range, sample a few extra numbers to be sure we don't lose any, and then select the 100 we actually want.
Bonus: Random sampling with replacement
Imagine you want to flip a coin a hundred times. If you flip a heads, you need to be able to flip another heads. This is called sampling with replacement. All of our previous methods couldn't return a single row twice, but the last method was close: If we remove the inner
group by id, then the selected ids can be duplicated:
select * from users where id in ( select round(random() * 21e6)::integer as id from generate_series(1, 110) -- Preserve duplicates ) limit 100
Sampling is an incredibly powerful tool for speeding up statistical analyses at scale, but only if the mechanism for getting the sample doesn't take too long. Next time you need to do it, generate random numbers first, then select those records. Or, try Periscope, which will use sampling to speed up your analyses without any work on your end!