How Fast Can You Read 100 Billion Records?

SQL Optimization Techniques for Big Data Processing


By hand? Maybe a few hundred years.

Even with a computer, a naive scan through all 100 billion records can still take hours.

Companies like Amazon, Meta, PhonePe, and Google churn out billions of records every day, thanks to the enormous user bases these products serve.

Retrieving data at that scale demands large amounts of computing power, and computing power always carries a dollar cost $$ for the company.

The only alternative is to reduce the computation required, and this is where optimisation comes into the picture.

Optimisation Methods

1. Partitioning a Table

What is partitioning?

By definition, partitioning segregates a table's data into separate segments, think folders, for example by year and then by month.

Suppose the records are laid out in such folders, covering 15 year-month combinations, and you need January 2024. Since everything is in ascending order, your eyes scroll down to 2024, then to January, and finally to the records against it. There is absolutely no need to look at other years like 2023 or other months like February and March. That is roughly 3 units of effort, against 15 units when every folder has to be scanned.

The computation is down by 80%. This is how optimisation works: filtering on the partition column in the WHERE clause fetches only the records needed in the output, which reduces both the computing power consumed and the time the query takes to run.

Now scale this to, say, 1 TB of available data: you may end up processing only 200 GB or less (depending on the use case), saving on computation and time.
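The folder analogy can be sketched outside a database engine: if each year/month pair lives in its own directory, a query that filters on those columns only has to open one directory. A minimal Python sketch, where the layout and helper names are illustrative rather than any real engine's API:

```python
import csv
import tempfile
from pathlib import Path

def write_partition(root: Path, year: int, month: int, rows: list[dict]) -> None:
    """Write rows into the folder for (year, month) -- one 'partition'."""
    part_dir = root / f"year={year}" / f"month={month:02d}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)

def read_partition(root: Path, year: int, month: int) -> list[dict]:
    """Scan only the requested partition, ignoring every other folder."""
    part_file = root / f"year={year}" / f"month={month:02d}" / "data.csv"
    with open(part_file, newline="") as f:
        return list(csv.DictReader(f))

root = Path(tempfile.mkdtemp())
write_partition(root, 2023, 12, [{"order_id": "1", "amount": "10"}])
write_partition(root, 2024, 1, [{"order_id": "2", "amount": "25"}])
write_partition(root, 2024, 2, [{"order_id": "3", "amount": "40"}])

# Equivalent of: SELECT * FROM orders WHERE year = 2024 AND month = 1
# Only the year=2024/month=01 folder is ever opened.
jan_2024 = read_partition(root, 2024, 1)
```

Partitioned tables in warehouses like Hive or BigQuery work on the same principle: the engine prunes every partition whose folder cannot match the WHERE clause.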

2. Refrain from using ORDER BY in a Sub Query

But why?

Consider this simple Query:

SELECT
    main.marks
FROM
    (SELECT
        marks
     FROM
        fact.subject_marks
     ORDER BY
        marks) AS main
WHERE
    marks >= 50

What difference does it make by using and not using the ORDER BY clause?

Using ORDER BY

With an ORDER BY clause, the subquery main first sorts the data set in ascending order before the outer query begins to execute on it.

But the outer query's output is identical whether the data set arrives in a randomized fashion or in ascending order.

If you are working on a dataset that runs into millions of records, the subquery will first sort all of that data, which only adds to the query's processing time.
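This is easy to verify: dropping the subquery's ORDER BY (or moving it to the outer query, if ordered output is actually wanted) returns the same rows. A small check using Python's sqlite3 with the article's column name; the table name is shortened to subject_marks and the sample data is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject_marks (marks INTEGER)")
conn.executemany("INSERT INTO subject_marks VALUES (?)",
                 [(72,), (35,), (50,), (91,)])

# Original shape: ORDER BY inside the subquery -- wasted work,
# because the outer query only filters.
with_order_by = conn.execute("""
    SELECT main.marks
    FROM (SELECT marks FROM subject_marks ORDER BY marks) AS main
    WHERE marks >= 50
""").fetchall()

# Same rows without sorting the subquery; if ordered output is needed,
# the ORDER BY belongs on the outer query instead.
without_order_by = conn.execute("""
    SELECT main.marks
    FROM (SELECT marks FROM subject_marks) AS main
    WHERE marks >= 50
    ORDER BY marks
""").fetchall()
```

Both queries return the marks 50, 72, and 91; only the second avoids sorting rows that the WHERE clause is about to discard anyway.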


Balaji Kasiraj

Senior Product and Business Analyst with 6+ years of experience in Analytics. He owns and operates SQL Online, helping thousands of developers format and optimize their SQL queries.