In 95% of the cases, cartesian products originate from accidental cross join operations and cause unnecessary high load on a database. Maybe the results aren’t even wrong, as someone may have applied a UNION or a DISTINCT keyword, to remove unwanted duplicates.
But there are those 5% of SQL queries, where the cartesian product is intentional, even desireable. Here’s a nice example of such a query. Let’s assume we have these two tables (for simplicity, constraints are omitted):
create table content_hits ( content_id int, subscriber_id int ); create table content_tags ( content_id int, tag_id int );
The above tables model some blog content, where content_hits shows how many subscribers are concerned with a given blog entry, and content_tags shows which tag is applied to a given blog entry. So let’s say, we want to have some statistics about how many times any given subscriber/tag combination was actually seen. We cannot just join and group by subscriber_id, tag_id if we want the complete list of combinations (including the ones that have a zero count). So a cross join comes to the rescue!
SELECT combinations.tag_id, combinations.subscriber_id, -- correlated subquery to count the actual hits by tag/subscriber when -- joining the two tables using content_id (SELECT count(*) FROM content_hits AS h JOIN content_tag AS t ON h.content_id = t.content_id WHERE h.subscriber_id = combinations.subscriber_id AND t.tag_id = combinations.tag_id) as cnt -- Create all combinations of tag/subscribers first, before counting -- anything. This will be necessary to have "zero-counts" for any -- combination of tag/subscriber FROM ( SELECT DISTINCT tag_id, subscriber_id FROM content_tag CROSS JOIN content_hits ) AS combinations
Instead of using a correlated subquery, for performance reasons (depending on the database) it might be more interesting to left outer join the first subquery to the “combinations” relation and then group by subscriber_id, tag_id, and count(content_id) for every group, resulting in this query:
SELECT combinations.tag_id, combinations.subscriber_id, count(content_relation.content_id) -- Create all combinations of tag/subscribers first, before counting -- anything. This will be necessary to have "zero-counts" for any -- combination of tag/subscriber FROM ( SELECT DISTINCT tag_id, subscriber_id FROM content_tag CROSS JOIN content_hits ) AS combinations -- Now, outer join every expected combination to its -- actual relation joined on content_id LEFT OUTER JOIN ( SELECT h.content_id, h.subscriber_id, t.tag_id FROM content_hits AS h JOIN content_tag AS t ON h.content_id = t.content_id ) AS content_relation ON combinations.tag_id = content_relation.tag_id AND combinations.subscriber_id = content_relation.subscriber_id -- And now, finally, group, in order to get the count(content_id) GROUP BY combinations.tag_id, combinations.subscriber_id
The original Stack Overflow question can be seen here:
http://stackoverflow.com/questions/9924433/mysql-select-query-using-count
Filed under: sql Tagged: cartesian product, correlated subquery, cross join, example, sql
