Recursive Queries - Iterative Awesome.

I’ve taken to trying to push my skills in SQL lately. Driven by recent requests for more complex reports or wanting to simplify some code. After all, when it comes to your declarative data mangling needs, SQL is hard to beat.

This particular adventure was into ‘Recursive Queries’, or ‘Common Table Expressions’. My database application of choice is PostgreSQL.

This type of query is super useful if you have, or would like to be able to, structure your data in a tree or graph like manner. But…

You don’t want to leave the convenience of an existing relational database
Graph databases are a bit too new or too heavy for your needs
“You can pry <SQL Compliant Database> from my cold, dead, segfaulting hands!”

Lets see if we can build a useful view of our ‘Wibbles’. Here is out Wibble structure:

-- Build a house for our Wibbles.
CREATE TABLE wibbles (
       id            SERIAL     PRIMARY KEY,
       fub           VARCHAR(8) NOT NULL,
       wub           INT        NOT NULL,
       parent_wibble INT        REFERENCES wibbles(id),
       CONSTRAINT un_fub UNIQUE (id,fub)
);

Lets put the Wibbles in there…

-- Invite some wibbles to stay in the new digs
INSERT INTO wibbles (fub,wub,parent_wibble) VALUES
  ('chair',    0, null),
  ('lavalamp', 3, null),
  ('fudge',    1, 1),
  ('choc',     2, 3),
  ('cake',     9, 2),
  ('feet',    12, 3),
  ('pants',   11, 4)
;

This gives us the following Wibble family tree:

lavalamp
|
`- cake

chair
|
`- fudge
 |
 +- choc
 |  |
 |  `- pants
 |
 `- feet

You can see how you could query for the top-most wibbles:

SELECT * FROM wibbles WHERE parent_wibble IS NULL;

 id |   fub    | wub | parent_wibble
----|----------|-----|---------------
  1 | chair    |   0 |
  2 | lavalamp |   3 |
(2 rows)

Or we could find the descendants of ‘chair’:

SELECT * FROM wibbles WHERE parent_wibble = 1;

 id |  fub  | wub | parent_wibble
----|-------|-----|---------------
  3 | fudge |   1 |             1
(1 row)

But that doesn’t return all the Wibbles! But you’re not always going to be able to eyeball your data and know if a query is returning enough data. Were you to rely on this query, you would have to keep querying the database until there were no more results. Replacing the ‘parent_wibble’ value with the new potential ancestor on each iteration.

Concerned Reader: “But that’s madness, surely there is a better way?!”
Me: “I’m glad you asked, dear reader. And don’t call me Shirley.”

… I’m hilarious

This helpful shenanigan is a ‘Recursive Query’. They are a variation on ‘Common Table Expressions’. We’ll start with those as they are useful too.

Common Table Expressions let you;

Separate large, or complex query components into manageable pieces. Providing an alias in the process:

WITH fudge_kids AS (
     SELECT * FROM wibbles WHERE parent_wibble = 1
)
SELECT wub FROM wibbles WHERE parent_wibble IN (
       SELECT id FROM fudge_kids
);

Here is an example from the PostgreSQL docs:

WITH regional_sales AS (
        SELECT region, SUM(amount) AS total_sales
        FROM orders
        GROUP BY region
     ), top_regions AS (
        SELECT region
        FROM regional_sales
        WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
     )
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;

This turns complex queries into something more manageable and approachable. Helpful for eyes that are new, or tired from debugging. Recursive queries build on this structure, adding a base case, and an iterative step.

Without further ado, lets view our wibbles in a new light:

WITH RECURSIVE fobs(id, fub, parent_fub) AS (
     SELECT id, fub, ( null :: varchar(8) ) FROM wibbles WHERE parent_wibble IS NULL
     UNION
     SELECT w.id, w.fub, f.fub
     FROM fobs f, wibbles w
     WHERE w.parent_wibble = f.id
)
SELECT * FROM fobs;

 id |   fub    | parent_fub
----|----------|------------
  1 | chair    |
  2 | lavalamp |
  3 | fudge    | chair
  5 | cake     | lavalamp
  4 | choc     | fudge
  6 | feet     | fudge
  7 | pants    | choc
(7 rows)

There’s a lot going on now, so as is the custom in these situations, lets break it down. First we declare our intentions and describe our results table, and give it a name:

WITH RECURSIVE fobs(id, fub, parent_fub)

This next section is a union of two queries that must return the same type of results. The first section is the base case, or starting point of the query. I have to add a type annotation here as PostgresSQL needs a bit of help in this instance.

SELECT id, fub, ( null :: varchar(8) )
FROM wibbles
WHERE parent_wibble IS NULL

The results of that query are then fed into next query under the name of the entire expression, ‘fobs’, in this case. This is the part of the query where you connect the individual results to one another.

SELECT w.id, w.fub, f.fub
FROM fobs f, wibbles w
WHERE w.parent_wibble = f.id

You can see we select the wibbles where their parent matches the current wibble. The results of both queries are then combined using ‘UNION’. Which removes duplicate rows (more on this later). This gives us a Common Table Expression which we use as before:

SELECT * FROM fobs;

If we adjust our starting point (base case) query, we’re able to find the descendants of a particular wibble:

WITH RECURSIVE fobs(id, fub, parent_fub) AS (
     SELECT id, fub, ( null :: varchar(8) ) FROM wibbles WHERE fub = 'chair'
     UNION
     SELECT w.id, w.fub, f.fub
     FROM fobs f, wibbles w
     WHERE w.parent_wibble = f.id
)
SELECT * FROM fobs;

 id |  fub  | parent_fub
----|-------|------------
  1 | chair |
  3 | fudge | chair
  4 | choc  | fudge
  6 | feet  | fudge
  7 | pants | choc
(5 rows)

Now with one query we’ve found the parent wibble and their descendants!

chair
|
`- fudge
   |
   +- choc
   |  |
   |  `- pants
   |
   `- feet

Neat! Notice that the result of a recursive query is a normal table of results. We can order results, filter them with WHERE clauses. All the other useful relational database tools are available to us.

The recursive component of the query can also have restrictions on it to limit the depth of your search. In fact it needs them, otherwise your query will never return. Lets add a threshold to stop searching the descendants of chair.

     SELECT w.id, w.fub, f.fub
     FROM fobs f, wibbles w
     WHERE w.parent_wibble = f.id
     AND w.wub < 10

Once the recursive component returns an empty set, the query is complete.

If that was all we could do, then it wouldn’t be worth writing about. But the fun is only getting started! As this is SQL, you’re able to use the built in functions to build new values in the results. You can concatenate strings together, add numbers, use aggregation functions, etc.

SELECT (number_1 + number_2) FROM number_table WHERE number IN good_numbers;

When combined with recursive queries, some very cool functionality becomes available.

But what if you had data that was not a simple tree form? But a proper graph, with interconnecting nodes. Perhaps even some cycles.

How then to traverse such a structure, whilst only being able to access the row of results that sent you here in the first place? The answer is both infuriatingly obvious, and chock full of possibilities for cleverness.

The answer is to keep track of where you have been, and you’ll be able to decide if you’re still on the right path.

Me: “See, simple! Now draw the rest of the owl.”
Concerned Reader: *squints*

If these next examples seem familiar, it’s because I copied them verbatim from the PostgreSQL manual page on ‘WITH’ Queries. The documentation is very comprehensive and I recommend checking it out.

WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
      ARRAY[g.id], -- <-- A
      false
    FROM graph g
  UNION -- <-- B
    SELECT g.id, g.link, g.data, sg.depth + 1,
      path || g.id, -- <-- C
      g.id = ANY(path) -- <-- D
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle -- <-- E
)
SELECT * FROM search_graph;

This query will: - Navigate the graph’s structure - Keep track of the nodes we’ve visited - Prevent any cycles from causing the query to loop forever.

There are quite a few pieces involved, some are general concepts. Others obscured by PostgreSQL specific operators or functions. We’ll go through it and see if we can make it a bit more digestible.

We’ve covered the larger mechanics involved, such as WITH RECURSIVE foo AS. So you should be able to identify the primary components of this recursive query. The base or starting query, and the recursive component.

Other parts have some marks against them and we’ll check them off in a bit more detail. I would like to mention that anything here is unlikely to only exist in PostgreSQL. So you don’t need to get all shirty about; "OMG MySQL can do this too

!!eleventy".

We’re using PostgreSQL at the moment, so roll with it.

A

This is the PostgreSQL syntax for initialising an ARRAY with a starting value. The purpose is to create the value for tracking our visited nodes as we traverse the graph. We’ll be updating this value later.

Remember that this is our starting or base query. So it must have a value for everything we want to have in the final result. The types must match up.

If you want to find out more about PostgreSQL Arrays, the docs are really quite nice.

B - More on UNION [ALL]

The use of UNION is what combines the results so far, with the results from the next iterative step. The keyword UNION will combine query results and remove duplicate rows.

When used with the optional ALL keyword, it will not remove duplicates. This may produce unexpected results, or in some cases cause the query to loop forever.

C

PostgreSQL syntax for concatenation. In this case it is pushing the g.id of the current node onto the list of visited nodes. If this result in the iterative or recursive component satisfies our WHERE conditions, then the id value is added to the ARRAY.

From the PostgreSQL docs on the concatenation operator ||:

“The concatenation operator allows a single element to be pushed onto the beginning or end of a one-dimensional array. It also accepts two N-dimensional arrays, or an N-dimensional and an N+1-dimensional array.”

You’re of course free to create other structures as you traverse the results. Calculating the cost of arriving at a particular node? Min-Cost-Max-Flow in a query, anyone?

D & E

This is a PostgreSQL expression that combines the expression and the operator of the left of the ANY. Applying it to every value within the array value on the right side. Consulting our very helpful docs again.

expression operator ANY(array expression)

In our case, we use this function to check if the id of the node we’re currently on has exists on the path. This lets us know if we’ve stepped onto a cycle. This expression in our query will return true or false.

We’re able to then use this value in our WHERE clause to avoid including any results where a cycle has been detected. When an empty result is returned from the recursive component of the query, the traversal is finished.

Aside “Why aren’t cycle and path accessed using ‘g.’ ?”’

Note the use of the aliased table names, and that path and cycle are not referenced in the same way; g.id or g.link. This is because these fields only exist on one of the two tables we’re querying, the graph table. So PostgreSQL is not confused about where the field is coming from.

Back to our wibbles…

If we create a similar query for our wibbles. We can see something that can be very very useful when dealing with graph and tree traversals.

Here are some results:

 id |   fub    | wub | parent_wibble |   path    | cycle
----|----------|-----|---------------|-----------|-------
  1 | chair    |   0 |               | {1}       | f
  2 | lavalamp |   3 |               | {2}       | f
  3 | fudge    |   1 |             1 | {1,3}     | f
  5 | cake     |   9 |             2 | {2,5}     | f
  4 | choc     |   2 |             3 | {1,3,4}   | f
  6 | feet     |   8 |             3 | {1,3,6}   | f
  7 | pants    |  11 |             4 | {1,3,4,7} | f
(7 rows)

You don’t have to select things like the cycle or any other column you don’t need in your final results. But you notice in the path column that we constructed, we can see HOW we ended up at a particular result. This is invaluable for debugging.

It may have other applications, but that’s between you and your data. You could present informative, “You’re seeing this because…”, messages for your users, as an example.

Performance

Recursive queries are an expensive operation.

The process of; - Querying for results, placing them onto a temporary table. - Then using those results to query for results. - Add those results (optionally removing duplicates) to the temporary table. - Repeat.

Is costly when compared to the more traditional types of queries.

The \timing flag in psql will produce timing information for individual queries in the REPL.

Examine the output of EXPLAIN to see what, if any, optimisations the query engine is able to perform. The use of Common Table Expressions prevents the query engine from making certain adjustments.

With any “performance concern”, there are few hard and fast rules. Excepting that YOU must examine YOUR use case with YOUR data to be able to make a sensible decision.

Conclusion

If you have or were looking to have some graphs as part of your data design. But weren’t sure on how to manage the data. Or how to integrate it without needing yet another application. Or you weren’t interested in writing all that graph traversing/managing code yourself. Recursive queries may be what you need.

I had a lot of fun learning about recursive queries. I hope I’ve been able to shed some light on a very useful tool. Also that I’ve explained them well enough that you feel confident in applying them.