How fast can you read 100 Billion records of data?

Last updated: 23 July 2024

Maybe a few hundred years?

Bring in a computer, and it would still take a few hours to traverse all of these records.

Companies like Amazon, Meta, PhonePe, and Google churn out a few billion records of data every day, owing to the extensive user bases of these products.

With such large data comes the need for large computing power to retrieve it, and that always carries a dollar cost $$ for the company.

The only alternative is to reduce the computation required, and this is where optimisation comes into the picture.

Optimisation Methods

1. Partitioning a Table

What is partitioning?

By definition, partitioning means segregating items into folders.
Ex: If you were a teacher, you would segregate each student's records into a separate folder so that they can be fetched easily. The process is similar here.

Let's understand it through an example.

Problem Statement: Given an order list, fetch all the orders that were executed in January 2024.

Given below are two layouts of the same data.

Case 1: Non-Partitioned Dataset

[Image: Non-partitioned data]

If you had to fetch the orders from January 2024, you would have to traverse all 15 records to produce an output of just 3 records. That is 15 units of effort, where reading each line counts as one unit.
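To make this concrete, here is a minimal sketch of the Case 1 query. The table name orders and the column order_date are assumptions for illustration, not from the example above.

    -- Full scan (assumed names): all 15 records are read even though
    -- only 3 of them fall in January 2024.
    SELECT
        *
    FROM
        orders
    WHERE
        order_date >= DATE '2024-01-01'
        AND order_date < DATE '2024-02-01';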

Case 2: Partitioned Dataset

[Image: Partitioned data]

Notice that each record is placed into folders, i.e. by year and month.
Since everything is in ascending order, your eyes scroll down to 2024, then to January, and finally to the records under it. There is absolutely no need to look at other years like 2023 or other months like February and March. This is 3 units of effort, against the 15 units of Case 1.

The entire computation is down by 80%. This is how optimisation works: using a WHERE clause on the partition columns to fetch only the records needed in the output reduces both the computation required and the time the query takes to run.

Now scale this to, say, 1 TB of available data: you end up processing only 200 GB of data or less (depending on the use case), and you save on both computation and time.
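As a sketch of how this looks in practice, here is a partitioned table and a partition-pruned query. The syntax assumes a Hive/Spark-style SQL engine, and the names (orders, order_id, amount, year, month) are illustrative assumptions, not taken from the example above.

    -- Assumed Hive/Spark-style DDL: data files land in folders such as
    -- orders/year=2024/month=1/...
    CREATE TABLE orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2)
    )
    PARTITIONED BY (year INT, month INT);

    -- Filtering on the partition columns lets the engine prune folders:
    -- only year=2024/month=1 is read; the rest of the table is skipped.
    SELECT
        order_id,
        amount
    FROM
        orders
    WHERE
        year = 2024
        AND month = 1;

Because the WHERE clause touches only the partition columns, the engine can decide which folders to read before scanning a single row.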

2. Refrain from Using ORDER BY in a Subquery

But why?

Consider this simple query:


    SELECT
        main.marks
    FROM
        (SELECT
            marks
         FROM
            fact.subject_marks
         ORDER BY
            marks
        ) AS main
    WHERE
        main.marks >= 50

What difference does using or omitting the ORDER BY clause make?

Using ORDER BY

When the ORDER BY clause is present, the subquery main first arranges the data set in ascending order before the outer query begins to execute on it.

The fact is that the outer query's output is not affected by whether the data set comes in randomized or ascending order; the sort is wasted work.

If you are working on a dataset that runs into millions of records, the subquery will first sort the data, which increases query processing time for no benefit. (Some optimizers discard a redundant subquery ORDER BY, but it is safer not to rely on that.)
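If sorted output is actually needed, the fix is to drop the ORDER BY from the subquery and sort once, at the outermost level. A sketch of the rewritten query, using the same fact.subject_marks table:

    SELECT
        main.marks
    FROM
        (SELECT
            marks
         FROM
            fact.subject_marks
        ) AS main
    WHERE
        main.marks >= 50
    -- Sort once, after filtering, and only if the caller needs sorted output.
    ORDER BY
        main.marks;

This way the sort runs on the filtered result rather than on the full table, and it runs only once.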

