Enabling Copy on Write in Pandas for Efficient Memory Management

Sachin Pal

Pandas supports Copy-on-Write, an optimization technique that helps improve memory use, particularly when working with large datasets.

Copy-on-Write (CoW) was introduced in Pandas 2.0 and, while not yet fully implemented, most of the optimizations it enables are already supported.

Aim of Copy-on-Write

As the name suggests, the data will be copied only when it is modified. What does this mean in practice?

When a new DataFrame or Series is derived from an existing one, it initially shares the same memory rather than receiving a copy of the data. Only when the data of either the original or the new object is modified is a copy created, and only for the object being modified.

This saves memory and improves performance when working with large datasets.
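To see this sharing in action, we can check whether a derived Series still points at its parent's buffer using numpy.shares_memory. This is an illustrative sketch, assuming CoW is enabled (how to enable it is covered in the next section):

```python
import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True  # assumes pandas >= 2.0

df = pd.DataFrame({"A": [1, 2, 3]})
view = df["A"]  # no copy yet; shares df's memory

# Before any write, both objects reference the same buffer
shared_before = np.shares_memory(view.to_numpy(), df["A"].to_numpy())

view.iloc[0] = 10  # the write triggers a copy for `view` only
shared_after = np.shares_memory(view.to_numpy(), df["A"].to_numpy())

print(shared_before, shared_after)  # sharing stops after the write
print(df.iloc[0, 0])  # df itself is untouched
```
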

Enabling CoW in Pandas

It is not enabled by default, so we need to enable it using the copy_on_write configuration option in Pandas.

import pandas as pd

# Option 1
pd.options.mode.copy_on_write = True
# Option 2
pd.set_option("mode.copy_on_write", True)

You can use either option to turn on CoW globally in your environment.

Note: CoW will be enabled by default in Pandas 3.0, so get used to it early on.
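Once set, the current value can be read back at any time, which is handy when a script may run in environments with different defaults:

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

# Verify the current setting
print(pd.get_option("mode.copy_on_write"))  # True
```
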

Impact of CoW in Pandas

CoW disallows updating multiple pandas objects at the same time. Here's how that plays out.

import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
subset = df["A"]
subset.iloc[0] = 10
df

With CoW, the above snippet will not modify df; it modifies only the data of subset.

# df
    A    B
0    1    4
1    2    5
2    3    6

# subset
0    10
1     2
2     3
Name: A, dtype: int64

inplace Operations will Not Work

Similarly, inplace operations on a column selection, which used to modify the original df directly, will not work with CoW enabled.

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df["A"].replace(1, 5, inplace=True)
df
--------------------
    A    B
0    1    4
1    2    5
2    3    6

We can see that df has remained unchanged; additionally, pandas emits a ChainedAssignmentError warning.

The above operation can be performed in two different ways: avoid inplace altogether, or apply inplace at the DataFrame level so the original df is modified directly.

# Avoid inplace
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df["A"] = df["A"].replace(1, 5)
df
--------------------
    A    B
0    5    4
1    2    5
2    3    6
# Using inplace at DataFrame level
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.replace({"A": {2: 34}}, inplace=True)
df
--------------------
    A    B
0    1    4
1    34    5
2    3    6

Chained Assignment Will Never Work

Modifying a DataFrame or Series through multiple indexing operations in a single line of code is what we call chained assignment.

# CoW disabled
with pd.option_context("mode.copy_on_write", False):
    df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
    df["B"][df['A'] > 2] = 10
df

The above code snippet tries to change column B of the original df where column A is greater than 2, i.e., the values at index 2 and 3 of column B will be modified.

Since the CoW is disabled, this operation is allowed, and the original df will be modified.

    A    B
0    1    5
1    2    6
2    3    10
3    4    10

But this will never work with CoW enabled in pandas; the assignment is silently discarded and a ChainedAssignmentError warning is raised.

# CoW enabled
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
df["B"][df["A"] > 2] = 10
df
--------------------
    A    B
0    1    5
1    2    6
2    3    7
3    4    8

Instead, with copy-on-write, we can use .loc to modify the df using multiple indexing conditions.

# CoW enabled
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
df.loc[(df["A"] == 1) | (df["A"] > 3), "B"] = 100
df

This will modify column B where column A is either 1 or greater than 3. The original df will look like the following.

    A    B
0    1    100
1    2    6
2    3    7
3    4    100

Read-only Arrays

When a Series or DataFrame is accessed as a NumPy array, that array will be read-only if the array shares the same data with the initial DataFrame or Series.

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ['5', '6', '7', '8']})
arr = df.to_numpy()
arr
--------------------
array([[1, '5'],
       [2, '6'],
       [3, '7'],
       [4, '8']], dtype=object)

In the above code snippet, arr is a copy because df contains columns of two different types (int and str), so there is no single array to share. We can therefore perform modifications on arr.

arr[1, 0] = 10
arr
--------------------
array([[1, '5'],
       [10, '6'],
       [3, '7'],
       [4, '8']], dtype=object)

Take a look at this case.

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
arr = df.to_numpy()
arr

The DataFrame df holds only one NumPy array (all columns have the same data type), so arr shares its data with df. This means arr will be read-only and cannot be modified in place.

print(arr.flags.writeable)
arr[0, 0] = 10
arr
--------------------
False
ValueError: assignment destination is read-only
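If we do need to mutate the array, we can request an explicit copy so it is detached from df and writable. A small sketch, using the `copy` parameter of `to_numpy`:

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # assumes pandas >= 2.0

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

# Force a fresh array instead of a read-only view of df's data
arr = df.to_numpy(copy=True)
print(arr.flags.writeable)  # True

arr[0, 0] = 10  # modifies only the copy
print(df.iloc[0, 0])  # df stays unchanged
```
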

Lazy Copy Mechanism

When two or more DataFrames share the same data, the copies will not be created immediately.

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = df.reset_index(drop=True)

Both df and df2 share the same data in memory. The copy mechanism triggers only when either DataFrame is modified.

df2.iloc[0, 0] = 10
print(df2)
print(df)
--------------------
    A  B
0  10  4
1   2  5
2   3  6
   A  B
0  1  4
1  2  5
2  3  6

But this is not always necessary: if we don't need the initial df, we can simply reassign the result to the same variable (df). This creates a new reference, the old object is discarded, and the copy-on-write step is avoided entirely.

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print("Initial reference:", id(df))
df = df.reset_index(drop=True)
print("New reference:", id(df))
df.iloc[0, 0] = 10
print(df)
--------------------
Initial reference:  138400246865760
New reference:      138400246860336
    A  B
0  10  4
1   2  5
2   3  6

This same optimization (the lazy copy mechanism) is applied to methods that don't require copying the original data.

DataFrame.rename()

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "X", "B": "Y"})
--------------------
    X    Y
0    1    4
1    2    5
2    3    6

When CoW is enabled, this method returns a new DataFrame that shares its data with the original instead of copying everything up front, unlike the regular execution.
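We can verify the lazy copy behind rename the same way as before, by checking memory sharing with numpy.shares_memory (a sketch, assuming CoW is enabled):

```python
import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True  # assumes pandas >= 2.0

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
renamed = df.rename(columns={"A": "X", "B": "Y"})

# The renamed frame still shares its buffers with df
shared = np.shares_memory(renamed["X"].to_numpy(), df["A"].to_numpy())
print(shared)

renamed.iloc[0, 0] = 10  # only now is a copy made
print(df.iloc[0, 0])  # df stays unchanged
```
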

DataFrame.drop() for axis=1

Similarly, the same mechanism is implemented for DataFrame.drop() for axis=1 (axis='columns').

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df.drop(["A"], axis=1)
--------------------
    B    C
0    4    7
1    5    8
2    6    9

Conclusion

Pandas will enable Copy-on-Write (CoW) by default in version 3.0. The optimizations that come with CoW lead to more efficient memory and resource management when working with large datasets.

CoW also reduces unpredictable or inconsistent behavior and can greatly improve performance.


🏆Other articles you might be interested in if you liked this one

Merge, combine, and concatenate multiple datasets using pandas.

Find and delete duplicate rows from the dataset using pandas.

Find and Delete Mismatched Columns From DataFrames Using pandas.

Create temporary files and directories using tempfile module in Python.

Upload and display images on the frontend using Flask.

How does the learning rate affect the ML and DL models?


That's all for now

Keep Coding✌✌

