So, a couple of days ago, someone from the Periscope company wrote a blogpost about getting the number of distinct elements, per group, faster using subqueries.
This was then submitted to Hacker News and r/Programming on Reddit.
Then the original authors published a second blogpost comparing speed across four different DB engines, which, in turn, was also discussed on Reddit.
I found the numbers presented by Periscope (as their improvement) not all that great.
Unfortunately, their blog doesn't allow comments, so I decided to test it myself and write up, on my own blog, what I found.
First, what I gathered from their blogposts:
- 2 tables: dashboards (with columns: name, id and possibly something else) and time_on_site_logs (with columns: user_id, dashboard_id, and possibly something else)
- dashboards has 1200 rows
- time_on_site_logs has 14 million rows
- there are 1700 distinct (dashboard_id, user_id) pairs in time_on_site_logs
Since I have no idea about the table width (which can be important), I will make some guesses about the number of columns that should be there. There is also no information about the data distribution in time_on_site_logs (as in: average number of dashboards per user, average number of rows per user, and so on). So I'll have to improvise.
Dashboards:
$ CREATE TABLE dashboards (
    id      serial      PRIMARY KEY,
    name    text        NOT NULL UNIQUE,
    created timestamptz NOT NULL DEFAULT now(),
    visited int8        NOT NULL DEFAULT 0
);

$ INSERT INTO dashboards (name, created, visited)
    WITH x AS (
        SELECT now() - '1 year'::INTERVAL * random() AS c
        FROM generate_series(1,1200)
        ORDER BY c
    )
    SELECT 'Dashboard #' || ROW_NUMBER() OVER (ORDER BY c), c, random() * 10000
    FROM x;
This gives me data like this:
SELECT * FROM dashboards ORDER BY id DESC LIMIT 20;
  id  |      name       |            created            | visited
------+-----------------+-------------------------------+---------
 1200 | Dashboard #1200 | 2014-01-29 09:16:01.894632+01 |    3264
 1199 | Dashboard #1199 | 2014-01-29 05:27:19.501032+01 |    5889
 1198 | Dashboard #1198 | 2014-01-29 02:52:03.939432+01 |     595
 1197 | Dashboard #1197 | 2014-01-28 11:22:32.134632+01 |    8283
 1196 | Dashboard #1196 | 2014-01-28 07:43:48.406632+01 |    5713
 1195 | Dashboard #1195 | 2014-01-28 03:16:50.537832+01 |    4082
 1194 | Dashboard #1194 | 2014-01-27 23:22:24.877032+01 |    7836
 1193 | Dashboard #1193 | 2014-01-27 11:16:10.765032+01 |    9849
 1192 | Dashboard #1192 | 2014-01-27 10:34:12.032232+01 |    7207
 1191 | Dashboard #1191 | 2014-01-27 10:34:10.304232+01 |    2657
 1190 | Dashboard #1190 | 2014-01-27 03:49:03.565032+01 |    1996
 1189 | Dashboard #1189 | 2014-01-26 08:06:21.257832+01 |    9084
 1188 | Dashboard #1188 | 2014-01-26 02:46:08.921832+01 |    9549
 1187 | Dashboard #1187 | 2014-01-25 21:19:41.869032+01 |    7073
 1186 | Dashboard #1186 | 2014-01-25 15:07:59.264232+01 |    9089
 1185 | Dashboard #1185 | 2014-01-25 14:12:43.750632+01 |    3228
 1184 | Dashboard #1184 | 2014-01-25 10:03:10.198632+01 |    9176
 1183 | Dashboard #1183 | 2014-01-25 07:31:07.395432+01 |    5257
 1182 | Dashboard #1182 | 2014-01-25 01:29:44.710632+01 |     633
 1181 | Dashboard #1181 | 2014-01-24 21:51:53.773032+01 |    9663
(20 ROWS)
Now, for the time_on_site_logs table:
$ CREATE TABLE time_on_site_logs (
    id              serial      PRIMARY KEY,
    dashboard_id    int4        NOT NULL REFERENCES dashboards (id),
    user_id         int4        NOT NULL,
    session_started timestamptz NOT NULL,
    session_time    INTERVAL    NOT NULL
);
Schema is ready. Now for the data. Since I have no idea about the data distribution or anything like that, let's just create 1700 random dashboard/user pairs, repeat each of them 8236 times (roughly 14000000/1700), and call it a day:
INSERT INTO time_on_site_logs (dashboard_id, user_id, session_started, session_time)
WITH x AS (
    SELECT d, u
    FROM generate_series(1,1200) AS d, generate_series(1,10) AS u
    ORDER BY random()
    LIMIT 1700
)
SELECT d, u, now() - random() * '2 years'::INTERVAL, random() * '30 minutes'::INTERVAL
FROM x, generate_series(1,8236) AS q;
Stats afterwards:
WITH x AS (
    SELECT dashboard_id, user_id, COUNT(*)
    FROM time_on_site_logs
    GROUP BY dashboard_id, user_id
)
SELECT SUM(COUNT) AS all_rows, COUNT(*) AS distinct_pairs
FROM x;
 all_rows | distinct_pairs
----------+----------------
 14001200 |           1700
(1 ROW)
Some of the comments on Reddit (very ill-informed, in my opinion) suggested that the speed of the queries (as shown in the original blogposts) depends on indexes, and that the query plans showing seq scans mean there are no indexes.
I don't believe this explanation, so let me add some additional indexes:
$ CREATE INDEX i1 ON time_on_site_logs (dashboard_id);
CREATE INDEX

$ CREATE INDEX i2 ON time_on_site_logs (user_id);
CREATE INDEX

$ CREATE INDEX i3 ON time_on_site_logs (dashboard_id, user_id);
CREATE INDEX

$ CREATE INDEX i4 ON time_on_site_logs (user_id, dashboard_id);
CREATE INDEX
Now the tables look like this:
$ \d dashboards
                          TABLE "public.dashboards"
 COLUMN  |           TYPE           |                        Modifiers
---------+--------------------------+---------------------------------------------------------
 id      | INTEGER                  | NOT NULL DEFAULT NEXTVAL('dashboards_id_seq'::regclass)
 name    | text                     | NOT NULL
 created | TIMESTAMP WITH TIME zone | NOT NULL DEFAULT now()
 visited | BIGINT                   | NOT NULL DEFAULT 0
Indexes:
    "dashboards_pkey" PRIMARY KEY, btree (id)
    "dashboards_name_key" UNIQUE CONSTRAINT, btree (name)
Referenced BY:
    TABLE "time_on_site_logs" CONSTRAINT "time_on_site_logs_dashboard_id_fkey" FOREIGN KEY (dashboard_id) REFERENCES dashboards(id)

$ \d time_on_site_logs
                             TABLE "public.time_on_site_logs"
     COLUMN      |           TYPE           |                           Modifiers
-----------------+--------------------------+----------------------------------------------------------------
 id              | INTEGER                  | NOT NULL DEFAULT NEXTVAL('time_on_site_logs_id_seq'::regclass)
 dashboard_id    | INTEGER                  | NOT NULL
 user_id         | INTEGER                  | NOT NULL
 session_started | TIMESTAMP WITH TIME zone | NOT NULL
 session_time    | INTERVAL                 | NOT NULL
Indexes:
    "time_on_site_logs_pkey" PRIMARY KEY, btree (id)
    "i1" btree (dashboard_id)
    "i2" btree (user_id)
    "i3" btree (dashboard_id, user_id)
    "i4" btree (user_id, dashboard_id)
Foreign-KEY constraints:
    "time_on_site_logs_dashboard_id_fkey" FOREIGN KEY (dashboard_id) REFERENCES dashboards(id)
That should handle all indexing needs for now 🙂
Now – the Periscope guys wrote that they tested it on an Amazon EC2 instance. I'm cheap, so I'm testing on my desktop. The numbers will thus not be directly comparable, but the ratios should be.
So, let's test the queries.
First, the naive approach:
SELECT
    dashboards.name,
    COUNT(DISTINCT time_on_site_logs.user_id)
FROM time_on_site_logs
JOIN dashboards ON time_on_site_logs.dashboard_id = dashboards.id
GROUP BY name
ORDER BY COUNT DESC;
I ran it 3 times; the best result was 492 seconds.
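(A side note on methodology, since I'll be quoting timings a lot: the Periscope posts don't say exactly how they measured, and I simply take the best of three runs in psql. For anyone wanting to reproduce this, something along these lines works:)

-- client-side timing of every statement:
\timing on

-- or, to get server-side time plus the actual plan
-- (careful: ANALYZE really runs the query):
EXPLAIN ANALYZE
SELECT dashboards.name, COUNT(DISTINCT time_on_site_logs.user_id)
FROM time_on_site_logs
JOIN dashboards ON time_on_site_logs.dashboard_id = dashboards.id
GROUP BY name
ORDER BY COUNT DESC;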
Their (Periscope's) second approach was:
SELECT
    dashboards.name,
    log_counts.ct
FROM dashboards
JOIN (
    SELECT dashboard_id, COUNT(DISTINCT user_id) AS ct
    FROM time_on_site_logs
    GROUP BY dashboard_id
) AS log_counts ON log_counts.dashboard_id = dashboards.id
ORDER BY log_counts.ct DESC;
Again, the best of three runs was 23.1 seconds.
Third query by Periscope:
SELECT
    dashboards.name,
    log_counts.ct
FROM dashboards
JOIN (
    SELECT distinct_logs.dashboard_id, COUNT(1) AS ct
    FROM (
        SELECT DISTINCT dashboard_id, user_id
        FROM time_on_site_logs
    ) AS distinct_logs
    GROUP BY distinct_logs.dashboard_id
) AS log_counts ON log_counts.dashboard_id = dashboards.id
ORDER BY log_counts.ct DESC;
It yielded a best time of slightly less than 4.6s.
So far, we have:
Query | Periscope test | depesz test
---|---|---
Naive | 348s | 492s
Aggregate, Then Join | 10.6s | 23.1s
First, Reduce The Data Set | 7.13s | 4.6s
This wraps up my re-testing of what Periscope wrote.
Couple of comments:
- to the person who said (on Reddit) that PostgreSQL cannot use Hash for count(distinct …) – well, true. But rewriting the query so that it will use hashes is trivial, as shown above
- to the people saying “Yet again they say the query plans include table scans, which imply either no indexes or an awful index design” – well, look at this: I have indexes on everything, yet PostgreSQL doesn't use them for such queries. You are of course welcome to explain how you think indexes could help with these queries – with sample schema, data, and queries (see the quick check below)
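For the record, both points are easy to check against the schema above (this is just a verification sketch – the exact plans you get may depend on your settings):

-- the rewritten form can hash: the plan should show a HashAggregate
-- on top of a seq scan (note: no index is used at all)
EXPLAIN
SELECT DISTINCT dashboard_id, user_id
FROM time_on_site_logs;

-- while the count(DISTINCT ...) form does not get hashed:
EXPLAIN
SELECT dashboard_id, COUNT(DISTINCT user_id)
FROM time_on_site_logs
GROUP BY dashboard_id;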
But – all things said – the numbers are, in my opinion, still too large.
I mean – if I were to count how many rows there are in time_on_site_logs for every (dashboard, user) pair – sure, that can take a while, as it has to scan the whole table. But just getting the distinct count? Especially with so few rows of output (946 rows)? That's got to be optimizable.
Luckily, it is:
I can define a simple function which gives me the list of distinct (dashboard_id, user_id) pairs:
CREATE OR REPLACE FUNCTION get_list_of_unique_pairs(
    OUT dashboard_id INT4,
    OUT user_id      INT4
) RETURNS setof record AS $$
DECLARE
    v_dash ALIAS FOR dashboard_id;
    v_user ALIAS FOR user_id;
BEGIN
    -- get the smallest (dashboard_id, user_id) pair
    SELECT l.dashboard_id, l.user_id INTO v_dash, v_user
    FROM time_on_site_logs l
    ORDER BY l.dashboard_id, l.user_id
    LIMIT 1;
    LOOP
        EXIT WHEN NOT FOUND;
        RETURN NEXT;
        -- jump straight to the next distinct pair, skipping all
        -- duplicates, using the index on (dashboard_id, user_id)
        SELECT l.dashboard_id, l.user_id INTO v_dash, v_user
        FROM time_on_site_logs l
        WHERE (l.dashboard_id, l.user_id) > (v_dash, v_user)
        ORDER BY l.dashboard_id, l.user_id
        LIMIT 1;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;
With this function in place, I can just:
WITH x AS (
    SELECT dashboard_id, COUNT(*)
    FROM get_list_of_unique_pairs()
    GROUP BY dashboard_id
)
SELECT d.name, x.count
FROM x
JOIN dashboards d ON x.dashboard_id = d.id
ORDER BY x.count DESC;
This query runs faster than 4.6s. It runs faster than 2.13s. It runs faster than 1s. It runs in 39 ms (0.039s).
Of course – you can say that using functions is cheating. Well, it can be done without functions, but then the query is more complicated:
WITH recursive distinct_pairs AS (
    (
        -- whole row of the first (dashboard_id, user_id) pair
        SELECT l AS rl
        FROM time_on_site_logs l
        ORDER BY l.dashboard_id, l.user_id
        LIMIT 1
    )
    UNION ALL
    -- each step fetches the next distinct pair, after the previous one
    SELECT (
        SELECT l
        FROM time_on_site_logs l
        WHERE (l.dashboard_id, l.user_id) > ((p.rl).dashboard_id, (p.rl).user_id)
        ORDER BY l.dashboard_id, l.user_id
        LIMIT 1
    )
    FROM distinct_pairs p
    WHERE (p.rl).id IS NOT NULL
),
unpacked_counts AS (
    SELECT (rl).dashboard_id, COUNT(*)
    FROM distinct_pairs
    WHERE (rl).dashboard_id IS NOT NULL
    GROUP BY (rl).dashboard_id
)
SELECT d.name, uc.count
FROM dashboards d
JOIN unpacked_counts uc ON d.id = uc.dashboard_id
ORDER BY uc.count DESC;
This query (which I didn't write myself – RhodiumToad helped me a lot) runs in just a bit below 23ms. Both the function and the recursive CTE are fast for the same reason: instead of scanning all 14 million rows, they do on the order of 1700 index lookups – one per distinct pair, plus one more to detect the end.
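And in case anyone doubts that the skip-scan approaches return the same pairs as scanning the whole table – a simple (if brute-force) sanity check, using the function defined above:

-- pairs that the function missed (should return zero rows):
SELECT dashboard_id, user_id FROM time_on_site_logs
EXCEPT
SELECT dashboard_id, user_id FROM get_list_of_unique_pairs();

-- pairs that the function invented (should also return zero rows):
SELECT dashboard_id, user_id FROM get_list_of_unique_pairs()
EXCEPT
SELECT dashboard_id, user_id FROM time_on_site_logs;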
Final comments:
- I would really expect more from a company that writes: “We make … tool that makes SQL data analysis really fast.” or “building a much faster way to do SQL data analysis.” – sorry, but it can be done faster. Way faster.
- The queries that I wrote rely on the fact that there are comparatively few distinct (dashboard_id, user_id) values. If there were many more (for example, 50% of the row count of time_on_site_logs) – they wouldn't work nicely.
- This kind of analysis should really be done using rollup tables – side tables which gather the information incrementally, so that to get the results you just query these small side tables (a minimal sketch follows this list). Otherwise – it might be fast for 14M rows, but for 14B rows it will be slow again
- The whole process, as shown in the Periscope posts, can be parallelized relatively simply – even in PostgreSQL
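To illustrate the rollup idea – a bare-bones sketch (the table, function, and trigger names here are mine, and a real version would also need to handle DELETEs and batch the writes somehow):

CREATE TABLE dashboard_user_pairs (
    dashboard_id int4 NOT NULL REFERENCES dashboards (id),
    user_id      int4 NOT NULL,
    PRIMARY KEY (dashboard_id, user_id)
);

CREATE OR REPLACE FUNCTION remember_pair() RETURNS TRIGGER AS $$
BEGIN
    BEGIN
        INSERT INTO dashboard_user_pairs (dashboard_id, user_id)
            VALUES (NEW.dashboard_id, NEW.user_id);
    EXCEPTION WHEN unique_violation THEN
        NULL;  -- pair already recorded, nothing to do
    END;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER remember_pairs
    AFTER INSERT ON time_on_site_logs
    FOR EACH ROW EXECUTE PROCEDURE remember_pair();

With that in place, the per-dashboard counts come from the (tiny) side table, no matter how large time_on_site_logs grows:

SELECT d.name, COUNT(*)
FROM dashboard_user_pairs p
JOIN dashboards d ON d.id = p.dashboard_id
GROUP BY d.name
ORDER BY COUNT(*) DESC;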
And finally – that was fun.
Well, for me at least, the claims about “making SQL data analysis really fast” were more about making it “faster” for non-database-experts to answer business questions about data, and not about how many seconds it takes to get that answer. Also, the post that started it all was more about showing how to do a simple query optimization, and the authors themselves believed it should work similarly in all databases… But that aside.
The whole story got me interested in what makes it difficult for the original, naive query to be automatically transformed into the logically equivalent (it is, right?) final query shown here. Is there some information missing for the optimizer, or what? Could you recommend any good entry-level introduction to the topic?
@Cc: not really. It’s a matter of PostgreSQL not being able to do “skip scan”. Not sure about other databases, but that’s about it.
Not sure we understand each other. The initial query here takes 492s, the final one less than a second. I'm not asking why the initial query is slow (to which “skip scans” are the answer, I presume) but what prevents the database from transforming one query into the other – the one that works fast even with the current db engine. And as shown in the second blogpost by the Periscope guys, every db tested could benefit from that kind of query rewriting…
So, what should I google to understand what is so hard about it?
“I don’t believe this explanation, so let me ass some additional indexes…”
That typo is fantastic.
@RobJ: fixed, thanks 🙂
@Cc:
well, what prevents dbs? Well, the fact that no one wrote such a heuristic. That's about it.
Plus – in the case of Pg – please note that I used a procedural language, which is not a simple “rewrite” of the query. And as for the recursive CTE – well, it's a relatively new thing. Long story short – every database has a limit on what it can do, based on what the programmers put in (at least until we have A.I.). Simply put – programmers never put in the logic to transform such a query into the other one.
Which, I believe, is a good thing. In a lot of cases the normal plan would be faster. The fact that my query is faster here is because the dataset is seriously skewed.
OK, thanks!
“to the person that said (on Reddit) that PostgreSQL cannot use Hash for count(distinct …) – well, true. But rewriting the query so that it will use hashes is trivial, as shown above”
Agreed; my point was that PostgreSQL currently does not do much to optimize count(DISTINCT …) while other databases seemingly do. As we can see, PostgreSQL started performing well as soon as the query was rewritten.
@CC asks: “what makes it difficult for the original, naive query to be automatically transformed into the logically equivalent (it is, right?) final query shown here?”
There is nothing preventing a database from doing this. SQL Server and Oracle automatically do the optimization. See Periscope’s second blog post, linked to at the beginning of this page.
@Peter: well, no, they don't. They do some kind of optimization, but they don't do a skip scan. Otherwise you'd get results in the millisecond range, and not in 2-4 seconds.
Speaking of “skip scan” – do you mean this: http://wiki.postgresql.org/wiki/Loose_indexscan ?
If yes, which name suits the feature better here?
Oracle's “skip scan” has a slightly different meaning – it actually matches the second case on the above wiki page, but not the first one.
@Victor:
yes, the first example is exactly the same thing I used.
@Victor: as for the names – no idea. I'm not really big on “naming” things. If you prefer to use “Loose indexscan” – fine by me. I use “skip scan” because I encountered that name first.
Thanks for posting this. It really helped me take another look at how I approach queries. I was able to restructure a 70s select down to 7s. (I still need to cut it down a lot more, but this is a great start.)
Thanks a lot, that's great!