How I Optimize Memory Usage in Pandas

This is not a blog post about clever code optimizations or exotic data structures for reducing memory usage. Those would be pretty useful too, but this post covers something more fundamental: choosing the right data types in Pandas. So let’s see how we can optimize memory usage in our ML workflows.

Figure: Text output showing memory usage before and after optimization in Pandas. Memory usage drops from 18.22 MB to 293.39 KB after data type optimization.

While training a tree model, I ran into memory issues, a common problem with large datasets. I started with a machine that had 16 GB of RAM, which was nowhere close to enough. So I slapped on more RAM. Now armed with 32 GB, I still hit the same issue: OOM (out of memory).

I soldiered on and added yet more RAM, up to 64 GB now. But guess what, still falling short. In frustration, I loaded a sample of my data and checked the RAM consumption. Turns out I would need to 4x the RAM to 256 GB to make this work.
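Here is a minimal sketch of that sample-based check. The sample DataFrame, its column names, and the total row count are all made up for illustration; the idea is just to measure bytes per row on a sample and scale up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sample of the real data.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "flag": rng.integers(0, 2, size=10_000).astype("float64"),
    "count": rng.integers(0, 1_000, size=10_000, dtype=np.int64),
})

total_rows = 50_000_000  # assumed size of the full dataset

# deep=True also counts the contents of object (string) columns.
sample_bytes = sample.memory_usage(deep=True).sum()
bytes_per_row = sample_bytes / len(sample)
estimated_gb = bytes_per_row * total_rows / 1024**3

print(f"{bytes_per_row:.1f} bytes per row")
print(f"about {estimated_gb:.2f} GB for {total_rows:,} rows")
```

This is only a rough estimate, since the model training step will need memory on top of the raw DataFrame.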

This got me thinking: there must be a better way to do this. Throwing more RAM at the problem can’t be my only solution every time I face this issue.

Optimize Memory Usage in Pandas: Converting Data Types

Looking closer at my data, I realized all my flag columns were being loaded as floats, and 64-bit floats at that. I could save a lot of RAM if I started loading the data in chunks and converted the flags to booleans.

df[flag_cols] = df[flag_cols].astype("bool")

Each float64 value takes 8 bytes and a boolean takes 1 byte. That’s an 87.5% reduction for my flag columns. And this one line does it all for me! One caveat: missing values (NaN) convert to True, so check for them before converting.
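To make the chunked version concrete, here is a small sketch. The CSV is generated in memory and the column names (is_active, is_churned) are hypothetical; with a real file you would pass its path to read_csv instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV with two 0/1 flag columns, standing in for the real file.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "is_active": rng.integers(0, 2, size=1_000).astype("float64"),
    "is_churned": rng.integers(0, 2, size=1_000).astype("float64"),
})
csv_text = raw.to_csv(index=False)
flag_cols = ["is_active", "is_churned"]

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    # NaN would silently become True, so refuse to convert if any is present.
    if chunk[flag_cols].isna().any().any():
        raise ValueError("flag columns contain missing values")
    chunk = chunk.astype({col: "bool" for col in flag_cols})
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

before_bytes = raw[flag_cols].memory_usage(index=False).sum()
after_bytes = df[flag_cols].memory_usage(index=False).sum()
print(f"float64 flags: {before_bytes} bytes, bool flags: {after_bytes} bytes")
```

Converting each chunk before concatenating means the full float64 version of the data never has to sit in memory at once.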

Taking It Further: Optimize Memory Usage in Pandas via Downcasting

With my newfound trick, I got cracking. I wanted to see how far I could take this. So I started looking at integer columns and checked their ranges. The ones that had a limited range, I converted to either int32 or int16.

Data Type	Minimum Value	Maximum Value	Memory Savings from `int64`
`int64`	-9,223,372,036,854,775,808	9,223,372,036,854,775,807	baseline
`int32`	-2,147,483,648	2,147,483,647	50% (8B → 4B)
`int16`	-32,768	32,767	75% (8B → 2B)

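If your integers are never negative, you can use the unsigned types (uint8, uint16, uint32) instead. Their range starts from zero and the maximum roughly doubles, for example 0 to 65,535 for uint16. This nifty conversion led to even more savings.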

import numpy as np
df[int64_col] = df[int64_col].astype(np.int32)  # or, if the range permits, use the line below
df[int64_col] = df[int64_col].astype(np.int16)

Caution! Don’t force every integer column to 32-bit or 16-bit if its range doesn’t permit it. For columns such as revenue, population, or counts, check the actual minimum and maximum first. Values that don’t fit the smaller type can silently wrap around and mess up your data.
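One way to guard against that is to check each column’s range before casting. Below is a minimal sketch; downcast_int is a hypothetical helper written for this post (not a Pandas function), and it assumes integer columns with no missing values.

```python
import numpy as np
import pandas as pd


def downcast_int(series: pd.Series) -> pd.Series:
    """Cast to the smallest integer dtype that holds every value (hypothetical helper)."""
    lo, hi = series.min(), series.max()
    # Smallest types first; the unsigned type of each size is tried before the signed one.
    for dtype in (np.uint8, np.int8, np.uint16, np.int16, np.uint32, np.int32):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return series.astype(dtype)
    return series  # nothing smaller fits, so keep the original dtype


df = pd.DataFrame(
    {
        "age": [23, 45, 67],
        "temperature": [-40, 0, 35],
        "population": [8_000_000_000, 1_400_000_000, 330_000_000],
    },
    dtype="int64",
)

small = pd.DataFrame({col: downcast_int(df[col]) for col in df.columns})
print(small.dtypes)
print(df.memory_usage(index=False).sum(), "->", small.memory_usage(index=False).sum())
```

Pandas also ships pd.to_numeric with a downcast argument ("integer" or "unsigned"), which picks the smallest fitting type for you.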

Taking it Even Further: Optimizing String Memory Usage in Pandas

All good so far but we have spoken about numerical data only. Let’s move towards something more sinister– string columns. These guys can run up memory requirements and wreak havoc on performance like no other. These innocuous-looking strings are the ones we need to watch out for.

Now the exact memory savings you get from these depend on two factors:

cardinality (number of unique values)
length of the strings

If the cardinality is low, then converting to “category” type can lead to huge savings in memory requirements. Below is a small example:

import pandas as pd
df = pd.DataFrame({"city": ["Mumbai", "London", "New York"] * 100000})
print(df["city"].memory_usage(deep=True))
df["city"] = df["city"].astype("category")
print(df["city"].memory_usage(deep=True))

In the above toy example, we see a roughly 63x reduction. The exact factor depends on the string lengths and the cardinality, and it holds up as the data grows so long as the number of unique values stays small. Why do we see this reduction in memory usage? It’s because each unique value is stored only once, in a separate lookup table. Each row stores only a small integer reference instead of the full string, which is much cheaper to keep in memory.
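You can peek at the lookup table and the integer references yourself through the .cat accessor. This is a small sketch on the same toy data; the object dtype is set explicitly here as an assumption, so the comparison is against plain Python strings.

```python
import pandas as pd

city = pd.Series(["Mumbai", "London", "New York"] * 100_000, dtype="object")
as_category = city.astype("category")

# The lookup table: each unique string is stored once.
print(as_category.cat.categories.tolist())

# The per-row integer references into that table.
print(as_category.cat.codes.dtype)
print(as_category.cat.codes.head(3).tolist())

ratio = city.memory_usage(deep=True) / as_category.memory_usage(deep=True)
print(f"memory reduction: {ratio:.1f}x")
```

With only three unique values, each row’s code fits in a single byte (int8), which is where most of the saving comes from.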

Final Thoughts

Selecting appropriate datatypes and downcasting selectively can significantly reduce memory usage and rule out the need for hardware upgrades. In some cases, these memory savings can be around 80% or even more. If you have more tricks up your sleeve, then let me know!

You can find more practical tips here.
