<p>Sampling is an incredibly powerful tool to speed up analyses at scale. While it’s not appropriate for all datasets or all analyses, when it works, it really works. We’ve realized several orders of magnitude in speedups on large datasets with judicious use of sampling.</p>



<p>However, when sampling from databases, it’s easy to lose all your speedups by using inefficient methods to select the sample itself. In this post we’ll show you how to select random samples in fractions of a second.</p>



<h2 class="wp-block-heading"><strong>The obvious, correct, slow solution</strong></h2>



<p>Let’s say we want to send a coupon to a random hundred users as an experiment. Quick, to the database!</p>



<p>The naive approach sorts the entire table randomly and selects N results. It’s slow, but it’s simple and it works even when there are gaps in the primary keys. The examples below select a single row; raise the limit to 100 for our coupon experiment.</p>



<h3 class="wp-block-heading"><strong>Selecting a random row in MySQL</strong></h3>



<pre class="wp-block-code"><code>select * from users
order by rand()
limit 1</code></pre>



<h3 class="wp-block-heading"><strong>Selecting a random row in PostgreSQL</strong></h3>



<pre class="wp-block-code"><code>select * from users
order by random()
limit 1</code></pre>



<h3 class="wp-block-heading"><strong>Selecting a random row in Microsoft SQL Server</strong></h3>



<pre class="wp-block-code"><code>select top 1 * from users
order by newid()</code></pre>



<h3 class="wp-block-heading"><strong>Selecting a random row in Oracle Database</strong></h3>



<pre class="wp-block-code"><code>select * from (
  select * from users
  order by dbms_random.value
)
where rownum = 1</code></pre>



<p><em>Thanks to </em><a href="https://www.petefreitag.com/item/466.cfm" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)"><em>Pete Freitag’s website</em></a><em> for these starting points.</em></p>



<h2 class="wp-block-heading"><strong>This query is taking forever!</strong></h2>



<p>On a Postgres database with 20M rows in the users table, this query takes <strong>17.51 seconds</strong>! To find out why, let’s return to our trusty explain:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Postgres-map.png" alt="Postgres map" class="wp-image-75979"/></figure>



<p>The database is sorting the entire table before selecting our 100 rows! This is an O(n log n) operation, which can easily take minutes or longer on a 100M+ row table. Even on medium-sized tables, a full table sort is unacceptably slow in a production environment.</p>



<h3 class="wp-block-heading"><strong>Query faster by sorting only a subset of the table</strong></h3>



<p>The most obvious way to speed this up is to <a href="https://www.sisense.com/blog/use-subqueries-to-count-distinct-50x-faster/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">filter down the dataset</a> before doing the expensive sort.</p>



<p>We’ll select a larger sample than we need and then limit it, because we might get randomly fewer than the expected number of rows in the subset. We also need to randomly sort afterward to avoid biasing towards earlier rows in the table.</p>
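<p>To get a feel for how much headroom the oversampling gives, here is a minimal sketch in Python. It assumes the 20M-row table mentioned above, a target of 200 expected rows, and a Poisson approximation to the number of rows that pass the random filter. The function name poisson_cdf is ours, not part of any library.</p>

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        total += math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
    return total

table_rows = 20_000_000  # row count quoted in the post
expected = 200           # rows we expect the random filter to keep
wanted = 100             # rows we keep after the limit

# Each row passes the filter independently with this probability.
keep_probability = expected / table_rows

# Approximate chance that the filter keeps fewer rows than we need.
shortfall = poisson_cdf(wanted - 1, expected)

print(f"per-row keep probability: {keep_probability:.2e}")
print(f"chance of fewer than {wanted} rows: {shortfall:.2e}")
```

<p>Under these assumptions the chance of coming up short is negligible, which is why doubling the target is a comfortable margin.</p>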



<p>Here’s our new query:</p>



<pre class="wp-block-code"><code>select * from users
where 
  random() &lt; 200 / (select count(1) from users)::float
order by random()
limit 100</code></pre>



<p><em>(We’ll be using Postgres from this point forward for simplicity. Most of these techniques work well on other DBs.)</em></p>



<p>This baby runs in <strong>7.97s: more than twice as fast!</strong></p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/User-table-map.png" alt="User table map" class="wp-image-75984"/></figure>



<p>This is pretty good, but we can do better. You’ll notice we’re still scanning the entire table; only the sort now runs on the restricted subset. Our next step will be to avoid scans of any kind.</p>



<h2 class="wp-block-heading"><strong>Generate random indices in the ID range</strong></h2>



<p>Ideally we wouldn’t use any scans at all, and rely entirely on index lookups. If we know an upper bound on the ID values, we can generate random numbers in the ID range and then look up the rows with those IDs.</p>



<pre class="wp-block-code"><code>select * from users
where id in (
  select round(random() * 21e6)::integer as id
  from generate_series(1, 110)
  group by id -- Discard duplicates
)
limit 100</code></pre>



<p>This puppy runs in <strong>0.064s</strong>, a <strong>273X speedup</strong> over the naive query!</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/function-table-map.png" alt="Function table map" class="wp-image-75991"/></figure>



<p>Counting the table itself takes almost 8 seconds, so we’ll just pick a constant beyond the end of the ID range, sample a few extra numbers to be sure we don’t lose any, and then select the 100 we actually want.</p>
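<p>How many extra numbers are enough? The sketch below estimates the chance of ending up with fewer than 100 rows. It assumes ids are dense from 1 to 20M, so that a random number below 21e6 hits an existing row with probability 20/21, and it ignores the rare case of two draws landing on the same id. The function name shortfall_probability is ours.</p>

```python
import math

def shortfall_probability(draws, wanted, hit_rate):
    """P(fewer than `wanted` of `draws` random ids match an existing row)."""
    return sum(
        math.comb(draws, hits) * hit_rate**hits * (1 - hit_rate)**(draws - hits)
        for hits in range(wanted)
    )

# Assumption: 20M dense ids, random numbers drawn below 21e6.
hit_rate = 20_000_000 / 21_000_000

for draws in (110, 120, 130):
    p = shortfall_probability(draws, 100, hit_rate)
    print(f"{draws} draws: chance of fewer than 100 rows = {p:.2e}")
```

<p>Under these assumptions, 110 draws still leave a small but real chance of a short sample, so it is worth raising the count in generate_series if you need exactly 100 rows every time. The extra index lookups are cheap.</p>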



<h2 class="wp-block-heading"><strong>Bonus: Random sampling with replacement</strong></h2>



<p>Imagine you want to flip a coin a hundred times. If you flip a heads, you need to be able to flip another heads. This is called <a href="https://web.ma.utexas.edu/users/parker/sampling/repl.htm" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">sampling with replacement</a>. None of our previous methods can return the same row twice, but the last one is close. Removing the inner group by id lets the generated ids repeat, but an in filter still returns each matching row only once, so we also join against the generated ids instead:</p>



<pre class="wp-block-code"><code>select users.* from users
join (
  select round(random() * 21e6)::integer as id
  from generate_series(1, 110) -- Preserve duplicates
) as sampled_ids using (id)
limit 100</code></pre>
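<p>Whether a duplicated id yields a duplicated row depends on how the ids are matched against the table. As a small check, the sketch below compares an in filter with a join. It uses Python’s built-in sqlite3 module as a stand-in for Postgres, and the five-row users table and the sampled table of generated ids are hypothetical.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer primary key, name text)")
conn.executemany(
    "insert into users (id, name) values (?, ?)",
    [(i, f"user{i}") for i in range(1, 6)],
)

# Hypothetical sampled ids: 3 was drawn twice.
conn.execute("create table sampled (id integer)")
conn.executemany("insert into sampled (id) values (?)", [(3,), (3,), (5,)])

# An in filter only asks whether a user id appears in the sample.
in_rows = conn.execute(
    "select id from users where id in (select id from sampled) order by id"
).fetchall()

# A join produces one output row per sampled id.
join_rows = conn.execute(
    "select users.id from users join sampled on sampled.id = users.id "
    "order by users.id"
).fetchall()

print("in:  ", in_rows)
print("join:", join_rows)
```

<p>The in version returns user 3 once, while the join returns it once per draw, which is the behavior sampling with replacement needs.</p>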



<p>Sampling is an incredibly powerful tool for speeding up statistical analyses at scale, but only if the mechanism for getting the sample doesn’t take too long. Next time you need to do it, generate random numbers first, then select those records.</p>


