When we look a new SQL query the majors obstacles to our full understanding are
- Intensive use of sub-queries and nested logic
- Unknown clauses, functions, syntaxes, database-specific commands
- Poorly structured or poorly documented code
So, supposing that we need to understand a SQL query quickly, can we do anything to optimize our SQL analysis approach?
In this section I will present the 4 main things that can make our SQL analysis more efficient.
1. Improve the code format and structure
2. High level SQL review
3. Clarify unknown SQL clauses, syntaxes and functions
4. Split and analyse inner to outer SQL
I will explain each concept and apply it to a SQL example, which actually isn’t a very complex nor a long one, but the very same ideas apply to any SQL query.
Here we go, somebody gave us to use this piece of SQL for our new Data Mining workflow, and told us that it runs on the SQL Server database. Yeah, great stuff, and now?
WITH tmp_1 AS
(
SELECT Calc1 =
( (SELECT TOP 1 DataValue
FROM (
SELECT TOP 50 PERCENT DataValue
FROM SOMEDATA
WHERE DataValue IS NOT NULL
ORDER BY DataValue
) AS A
ORDER BY DataValue DESC
) + (SELECT TOP 1 DataValue
FROM (
SELECT TOP 50 PERCENT DataValue
FROM SOMEDATA
WHERE DataValue Is NOT NULL
ORDER BY DataValue DESC
)AS A
ORDER BY DataValue ASC))/2
),tmp_2 AS
(SELECT AVG(DataValue) Mean, MAX(DataValue) — MIN(DataValue) AS MaxMinRange
FROM SOMEDATA
),tmp_3 AS
(
SELECT TOP 1 DataValue AS Calc2, COUNT(*) AS Calc2Count
FROM SOMEDATA
GROUP BY DataValue
ORDER BY Calc2Count DESC
)
SELECT Mean, Calc1, Calc2 , MaxMinRange AS [Range]
FROM tmp_1 CROSS JOIN tmp_2 CROSS JOIN tmp_3;
1. Improve the code format and structure
It could be silly, but as our eyes appreciate order, the first thing do it is structure the SQL script in a nice and readable way. It takes 5 minutes, by using any SQL client’s script formatting built-in tool, or there are plenty free online pages as well.
Once done on the Example below, we’d get something like this:
The big added value is that now we can clearly see all the sub-queries, how deeply they are nested, and each SELECT is very easily identifiable. Those will be the bricks of our next steps.
2. High level SQL review
2.1 Identify the main SQL layer
Any query can be seen as composed by layers of other queries, and sub-subqueries. We can find those layers by looking at the SELECT clauses and their dependencies.
Now, complex queries might have many layers, but to have a general understanding we can just look at Layer 1, and see how it combines the data coming from its inner layers : does it do JOINs, UNIONs, INTERSECTs, etc.?
This gives you some hints on the general purpose of that query.
In our example
So we can understand that LEVEL #1 combines the data of 3 sub-queries tmp_1, tmp_2, tmp_3 by CROSS JOIN.
Let’s go on and learn more on the tables that feed our query.
2.2 Find out tables and data model
To understand what our SQL query does, we need to identify the tables that provide data to each SELECT clause, and also to deduct the relations among them, mentally building a sort of data model implicitly used by the query.
To do so, we just need to look at the FROM clauses of all the possible layers and their sub-queries.
In our simple example, all the sub-queries use only one table named “SOMEDATA”. We should have a look to its data, which will be helpful for the next step.
3. Clarify unknown SQL clauses, syntaxes and functions
Now we want to understand in more detail what our SQL does. To do so we need to learn more about all the SQL functions, operators, clauses and objects that we are not familiar with. It’s perfectly normal if we don’t know everything about all the existing SQL flavours, of all possible database vendors and versions.
So just find out any piece of code that you don’t know and lookup the documentation.
For instance, the query in our example here is written to run on the database SQL Server.
In our query example, let’s suppose that we don’t understand two parts:
“..what is “PERCENT”? Might it be some SQL Server’s-specific operator?”
“ .. I’ve seen WITH before, but I can’t remember what it does”
After reading the doc, we learned that
TOP 50 PERCENT splits the data in column DataValue in 2 sets, each with half of the data, and takes the top half. Using it in combination with the ORDER BY clause, allows us to consider the top and bottom 50% of the data values.
WITH .. AS creates record sets on the fly, and names it “tmp_1”, which can be then referred to in other parts of the same query.
4. Split and analyse inner to outer SQL
Now we have all the elements to understand the whole query. To do so we need to analyse SQL part by part, staring with the inner query layers, then moving up towards the layer #1.
Every time that we have understood all the sub-queries belonging to a layer, we should be able to understand what their outer layer does as well.
Let’s apply this principle to our query example: we start by looking at the inner sub-queries of section WITH tmp_1 (see the comments 1.), then we move to 2., then 3. combines the knowledge of 1. and 2. together, and that allows to fully understand the whole tmp_1 section.
Once we got to 7. , we understood that the SQL script purpose is just to calculate median, mean, range and mode on a single table’s column.
In this story I presented a very effective way to analyze complex SQL queries. The 4 main steps are very practical and can be applied to query of any complexity. The approach works well for anybody from SQL beginners to experts, it will just change the time needed to complete it.
Thanks for reading.