Iterating in chunks over a large dataset in Django

I came across a bit of a challenge today while building my log file parsing tool. The client dataset I'd built the demo on was almost 1TB, with a total of 2.5 billion rows. Let that sink in for a minute: 2.5 BILLION! Googlebot has been busy over the last 20 months. I needed to come up with a way of loading this data in chunks.

For this tool I wanted to use Big Query to query and derive insights from the data, as using my own server would be too expensive and slow. The purpose of my Django instance is to manage and parse the information, so I'd like it to remain quite lightweight. This will be released soon, so sign up to be kept in the loop.

So the logic I went with was processing in chunks of 10,000, as this is the maximum upload size for Big Query.

My logic was as follows:

  • Get the min id of the values in the current project
  • Get the max id of the values in the current project
  • Create a global variable with the minimum id
  • Create a range of chunks between the min and max values with 10,000 increments
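The steps above can be sketched in plain Python. This is just an illustration of the plan (the id values and names here are hypothetical; the real min and max come from the Django queries below):

```python
# Illustrative sketch of the four steps above (ids are made up)
CHUNK_SIZE = 10000  # Big Query's max upload size, per the note above

min_id, max_id = 500, 32000      # steps 1 & 2: would come from the database
working_id = min_id              # step 3: the "global" lower bound
chunk_starts = range(min_id, max_id, CHUNK_SIZE)  # step 4

print(list(chunk_starts))  # → [500, 10500, 20500, 30500]
```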

In order to do this, the following queries were made:

How to get the MIN value of a field in Django

In order to get the minimum id in Django 2.0 I used the earliest() queryset method:

Table.objects.earliest('id')

This gives you the earliest object in the table for the specified field.

You can use this with filter parameters and my actual example looks like the following:

Raw_results.objects.filter(project=__project_query).earliest('id').id

As this returns an object, to get the id value you just need to access the attribute you're after.


How to get the Max value of a field in Django

In order to get the maximum id in Django 2.0 I used the latest() queryset method:

Table.objects.latest('id')

You can use this with filter parameters and my actual example looks like the following:

Raw_results.objects.filter(project=__project_query).latest('id').id

As this returns an object, to get the id value you just need to access the attribute you're after.

Now that we have the min and max values of the data we want, we can use the range function to generate the chunk boundaries:

iteration_values = range(min_value, max_value, 10000)
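One thing worth flagging: `range` stops before `max_value`, so the ids between the last boundary and the max still need a final pass. A quick illustration with made-up values:

```python
min_value, max_value = 1, 25000   # illustrative ids
iteration_values = range(min_value, max_value, 10000)

print(list(iteration_values))  # → [1, 10001, 20001]
# max_value itself is never a boundary, so ids >= 20001 need a final chunk
print(max_value in iteration_values)  # → False
```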

The last thing to do is initialise a global variable holding the min_value.

Now that we have all the prerequisites, it's time to loop through our newly created range and use each value alongside the global value in the query. In order to do this you need to know about the less-than and greater-than query functions in Django.


How to query items less than or greater than in Django

Django has really great built-in functions for less than and greater than. I fully recommend you study the QuerySet API reference, as loads of things you'll want to accomplish have been made easy in Django.

In our loop we want to include ids greater than or equal to the min id and less than the max id.

To apply greater than or equal to, add __gte to the column name; to apply less than, add __lt to the column name.
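The __gte/__lt pair expresses a half-open interval: the lower bound is included, the upper bound is not, which is exactly what stops chunks overlapping. The same check in plain Python (a sketch over a list of ids, not the ORM):

```python
def in_chunk(row_id, lower, upper):
    """Mimics filter(id__gte=lower, id__lt=upper): inclusive lower, exclusive upper."""
    return lower <= row_id < upper

ids = [9999, 10000, 15000, 19999, 20000]
print([i for i in ids if in_chunk(i, 10000, 20000)])  # → [10000, 15000, 19999]
```

Because each chunk's upper bound becomes the next chunk's lower bound, no id is counted twice.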

Before the end of each loop we'll just need to update the global value so it points at the start of the next chunk.

Here is the final code.

current_max_id = Table.objects.filter(project=__project_query).latest('id').id
current_min_id = Table.objects.filter(project=__project_query).earliest('id').id

working_id = current_min_id

# Create chunks of 10k so we can filter between the values.
# Extend the range one chunk past the max id so the final partial chunk is included.

to_loop = range(current_min_id + 10000, current_max_id + 10001, 10000)

for item in to_loop:
    # do stuff here, e.g. get a list of rows in the current chunk
    listy = list(Table.objects.filter(id__gte=working_id, id__lt=item).values())
    # Update the global min so that the query above cycles in chunks
    working_id = item

Overall this uses around 100MB of RAM to load 900GB into Big Query, which isn't that bad. Currently it is running at 5,000 rows per second. I will be looking at adding concurrency to hit the limits, which are 100,000 rows per second.
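A rough sketch of what that concurrency could look like, using Python's standard `concurrent.futures` over the chunk boundaries. This is my assumption of one possible shape, not the final implementation; `upload_chunk` is a hypothetical placeholder for the fetch-and-stream step:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 10000

def upload_chunk(bounds):
    """Placeholder: would fetch rows with id in [lower, upper) and stream them to Big Query."""
    lower, upper = bounds
    return upper - lower  # pretend we uploaded this many ids

def concurrent_upload(min_id, max_id, workers=4):
    # Chunk boundaries, extended one chunk past max_id to cover the final partial chunk
    edges = list(range(min_id, max_id + CHUNK_SIZE, CHUNK_SIZE))
    chunks = list(zip(edges, edges[1:]))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(upload_chunk, chunks))

print(concurrent_upload(1, 25000))  # → 30000 (three 10k chunks)
```

Whether threads help will depend on where the bottleneck sits (database reads vs the upload API), so this would need profiling first.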

Cons of chunking a query

If your database is under heavy load, chunking is going to massively increase the IO it needs to process and could cause problems, so be careful.

It's also going to be slower, because you're making multiple queries.

Will Cecil: Digital Marketer, Python Tinkerer & Tech Enthusiast. Follow me: Website / Twitter / Github