If you need to process data from large database tables that can't be loaded into EasyMorph entirely due to memory limitations, use partitioning and loops. With partitioning, you can process datasets with hundreds of millions or even billions of rows.
Partitioning, in this case, means logically splitting the source dataset into smaller chunks that are then processed one at a time, on the assumption that any single chunk fits into memory.
Partitioning can be done in two ways (both are illustrated in the code sketch after the list):
- By distinct values in a column (e.g., by date, region, or category)
- By ranges of incremental IDs (e.g., 1-1000, 1001-2000, ...)
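In EasyMorph this pattern is built with a loop that runs a module once per partition, but the underlying idea is easy to see in plain code. Below is a minimal Python sketch of both methods; the database file, the `sales` table, the `id` and `region` columns, and the `process()` function are all hypothetical placeholders.

```python
import sqlite3

def process(rows):
    """Placeholder for whatever per-chunk transformation is needed."""
    print(f"processing {len(rows)} rows")

# Hypothetical database and table; substitute your own connection and schema.
conn = sqlite3.connect("warehouse.db")

# Method 1: partition by distinct values of a column (date, region, category, ...).
regions = [r[0] for r in conn.execute("SELECT DISTINCT region FROM sales")]
for region in regions:
    chunk = conn.execute("SELECT * FROM sales WHERE region = ?", (region,)).fetchall()
    process(chunk)

# Method 2: partition by ranges of an incremental ID (1-1000, 1001-2000, ...).
min_id, max_id = conn.execute("SELECT MIN(id), MAX(id) FROM sales").fetchone()
step = 1000
for start in range(min_id, max_id + 1, step):
    chunk = conn.execute(
        "SELECT * FROM sales WHERE id BETWEEN ? AND ?", (start, start + step - 1)
    ).fetchall()
    process(chunk)
```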
A few tips on selecting a field for partitioning (a quick way to vet a candidate field is sketched after the list):
- The number of partitions shouldn't be too large - ideally between 10 and 1,000 (the more partitions, the more queries will be made)
- Use a field that has a database index, because that field will be used for filtering in every query
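Before committing to a partition field, you can estimate how many partitions it would produce and confirm it is indexed. The sketch below does this for SQLite; the table and column names are made up, and other databases expose index information through their own system catalogs.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical database

# How many partitions (and therefore queries) would this column produce?
# Aim for roughly 10-1,000.
n = conn.execute("SELECT COUNT(DISTINCT region) FROM sales").fetchone()[0]
print(f"Partitioning by 'region' would produce {n} queries")

# List the indexes on the table and the columns they cover,
# to confirm the partition field is indexed.
for row in conn.execute("PRAGMA index_list('sales')"):
    index_name = row[1]
    cols = [c[2] for c in conn.execute(f"PRAGMA index_info('{index_name}')")]
    print(index_name, "covers", cols)
```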
Below is an example that demonstrates how partitioning can be done using both methods:
Partitioning.zip (246.1 KB)
PS. If you need to export a large partition, use the "Bulk export to database" action instead of the regular "Export to database" action whenever possible. Bulk export is many times faster, but it is only supported for a few database types.
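To see why bulk (batched) export tends to win by a wide margin, compare row-by-row inserts with a single batched call. The snippet below only illustrates the general effect using Python's built-in sqlite3 module; it is not a description of how EasyMorph's action works internally.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, value TEXT)")
rows = [(i, f"row {i}") for i in range(100_000)]

# Row-by-row: one INSERT statement per row.
t0 = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO target VALUES (?, ?)", row)
print(f"row-by-row: {time.perf_counter() - t0:.2f} s")

# Batched: the whole chunk in one call.
conn.execute("DELETE FROM target")
t0 = time.perf_counter()
conn.executemany("INSERT INTO target VALUES (?, ?)", rows)
print(f"batched:    {time.perf_counter() - t0:.2f} s")
```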
Other posts on the same subject:
- (obsolete) How to process Large dataset from Database