Searching for parents without certain children in SQL
I want to query for parent records (those on the one side of a one-to-many relationship) that have no children that meet a specific set of criteria. The parents in the result set may have no children at all, or they may have many children, just none that meet the criteria. And I only want distinct parents in the result set. If a parent has two children that do not meet the criteria, I only want that parent returned once.
It turns out this type of behavior comes up more often than I’d expect. Imagine a database as part of a web service that has a table of users and a table logging the “lifecycle” email updates sent to those users. (Lifecycle emails are the periodic emails a service sends out over the course of your engagement with that service—they can be purely time based, e.g. 2 weeks after sign up, or they can be criteria based, e.g. after you place your first order. But in order to avoid being spammy/annoying, you usually only want to send these once, or once every X months.) In this scenario, I would want to query for all users (these are the parents) without a record of having received a certain type of email (the children). They may have received other emails, or they may have received no emails at all, but in either case, I want them returned in the result set. Let’s build up to the solution.
An inner join, usually abbreviated as join
in MySQL, will return only those results where a parent has a child and a child has a parent. I’d use this if I wanted to avoid returning orphans: parents without children and children without parents. This is not what I want in this case.
select * from parent join child on parent.id = child.parent_id
An outer join, usually abbreviated as left join
in MySQL, will return every record on the left side of the join statement, regardless of whether it has a child or not. So I’ll rest assured that every parent is returned in the result set. And just like an inner join, I’ll see the parent repeated for each of its children (which may or may not be what I want, depending on the context). Also like the inner join, I won’t get back any orphaned (parentless) children.
select * from parent left join child on parent.id = child.parent_id
A left join with a where clause specifying that the primary key field of the table on the right side of the join is null is a way to select all the orphan (or childless) parents. And since by definition there’s no successful join happening, I can be sure that each orphaned parent is returned only once—no need for the distinct
keyword, or a group by
clause.
select * from parent left join child on parent.id = child.parent_id where child.id is null
Most of the time, I use some combination of the 3 SQL join scenarios above to get my work done. But none of them, at least alone, solve my problem. The last query gets half way there—I definitely want the childless parents—but how do I query for the parents that don’t have certain children?
The first solution I stumbled upon piggybacks off the little-known fact that you can include criteria as a part of the on
clause of your left join
to further filter the results being joined, before your where
clause is applied. By combining this with the child.id is null
check, we can achieve our goal:
select * from parent left join child on parent.id = child.parent_id and child.name = 'foo' where child.id is null
The problem with that solution is it’s super hard to remember, and what’s happening is completely opaque, if not misleading, to the average developer. On one hand it looks like you want parents with children named “foo”, but then the where
clause appears to contradict that. And this only works because it’s a left join
. To make matters worse, very few queries are that trivial. They have many other joins, and other criteria, further separating the connection between the 3 elements of this query that make it work.
The other thing I don’t like about the solution above is it’s kind of hacking the left join
to just return just parents. But by definition, a join is supposed to be joining tables. It implies that you want both of the joined tables to be returned in the results. Even if you know exactly what you’re doing, and you comment the above SQL statement to within an inch of it’s life, you’ll always have that sneaking suspicion that maybe you should add the distinct
keyword, just in case, or group by parent.id
, just to be sure. And it might not be you, oh sage DBA, it might be some developer down the line with less experience, tasked to fix a bug in some legacy system after you’re long gone.
It turns out the best solution I’ve found (so far) to solve this is with a correlated not exists subquery. It looks like this:
select * from parent where not exists ( select * from child where child.parent_id = parent.id and child.name = 'foo' )
I’ve always found correlated subqueries a little hard to wrap my head around, but I think the example above makes perfect sense. A rough gloss in English goes like this: I want all columns from the parent table where no records in the child table exist that have a foreign key matching the parent’s primary key and that have the name “foo”. The beauty is that the whole not exists
clause captures the criteria in one place, rather than 3, and I think makes the intention of the query much more transparent. I’ve tested both solutions in MySQL using tables with millions of rows, and found them to perform comparably.
Have other thoughts or solutions? Leave ’em in the comments. If you want to play around with the queries, here are two test tables and some data to get started with.
CREATE TABLE `parent` ( `id` int(11) NOT NULL AUTO_INCREMENT, `name` varchar(50) DEFAULT NULL, PRIMARY KEY (`id`) ); CREATE TABLE `child` ( `id` int(11) NOT NULL AUTO_INCREMENT, `name` varchar(50) DEFAULT NULL, `parent_id` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ); INSERT INTO parent VALUES (1, 'A'); INSERT INTO parent VALUES (2, 'B'); INSERT INTO parent VALUES (3, 'C'); INSERT INTO parent VALUES (4, 'D'); INSERT INTO child VALUES (1, 'foo', 2); INSERT INTO child VALUES (2, 'foo', 3); INSERT INTO child VALUES (3, 'bar', 3); INSERT INTO child VALUES (4, 'baz', 4);
Great tip, thank you!