
Understanding CTEs and Their Impact
Common Table Expressions (CTEs) are widely used in SQL to simplify complex queries and enhance readability. However, there's a prevalent misconception about how they handle memory and execution. Let’s clarify this with an example and address some common misunderstandings.
A Sample Query Using a CTE
WITH AuthorStats AS (
SELECT
author_id,
COUNT(post_id) AS post_count,
SUM(views) AS total_views
FROM
blog_posts
GROUP BY
author_id
)
SELECT
top_posts.author_id AS top_author_by_posts,
top_posts.post_count,
top_views.author_id AS top_author_by_views,
top_views.total_views
FROM
(SELECT
author_id,
post_count
FROM
AuthorStats
ORDER BY
post_count DESC
LIMIT 5) AS top_posts
CROSS JOIN
(SELECT
author_id,
total_views
FROM
AuthorStats
ORDER BY
total_views DESC
LIMIT 5) AS top_views;
Hypothetical Scenario
Consider a scenario like the above query. We have:
- Two Subqueries
- One CTE
What happens when the query executes?
- The CTE executes once.
- The 2 subqueries use the output of the CTE one after another.
The total time taken to execute is the sum of the CTE execution time and the two subqueries.
However, there's an important aspect called Processing Rate, which we can relate to RAM. This is a limited resource, and there is a cost associated with every unit of RAM.
Explaining for Better Understanding
Using a CTE:
Imagine you have 12 containers to process your data, and the data to be processed is 100 GB.
The query executed in step 1 needs to be stored somewhere, which is the RAM. It occupies space (let’s say 6 containers) and stays there until the query completes. This space remains occupied, whether it’s being actively used or not. Now, the remainder of the query has only 6 containers to process the rest of the data.
Using a Subquery Instead:
The query at any point has the entirety of 12 containers to process the data.
More containers mean faster execution—here, potentially 2x faster because the query processes in parallel. Imagine each container processes 10 GB. Even if the subquery portion needs to be executed twice, there's sufficient processing space, and once completed, the rest of the query has all 12 containers available.
This not only saves runtime but also optimizes query performance.
Best Practices
To optimize performance and manage memory effectively:
- Materialize Large CTEs:
For large datasets used multiple times, consider creating temporary tables instead of CTEs.
This can reduce RAM consumption and improve performance. - Use Subqueries:
Run them all as Subqueries. This should give you enough processing space to process them faster.