Maximizing Spark Performance: When, Where, and How to Use Caching Techniques

Caching is a technique of storing intermediate results in memory or on disk, so that data we use again later in the pipeline does not have to be recomputed from scratch.

In Spark we cache a DataFrame so its result can be reused in subsequent transformations.

A typical use case looks like this:

PROBLEM:

We join Table 1 and Table 2, and the result of that join is then joined with another table (Table 3).

The same Table 1 / Table 2 join result is also used in a join with Table 4.

Because Spark recomputes the lineage for each downstream action separately, it reprocesses the entire Table 1 / Table 2 join every time that result is used in another join.

SOLUTION:

We need to cache the joined result so that the downstream DataFrames can reuse it.

The joined result is then read from memory instead of pulling the whole data again from storage and recomputing the join.
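
A minimal PySpark sketch of this pattern (the table names and the join column `id` are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-example").getOrCreate()

# Hypothetical source tables
table1 = spark.table("table1")
table2 = spark.table("table2")
table3 = spark.table("table3")
table4 = spark.table("table4")

# Join Table 1 and Table 2 once and cache the result,
# so both downstream joins reuse it instead of recomputing it.
joined_12 = table1.join(table2, on="id").cache()

result_a = joined_12.join(table3, on="id")
result_b = joined_12.join(table4, on="id")

result_a.write.mode("overwrite").parquet("/tmp/result_a")
result_b.write.mode("overwrite").parquet("/tmp/result_b")

# Release the cached data once the job no longer needs it.
joined_12.unpersist()
```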

WHEN AND WHAT:

There are two caching techniques in Spark, cache() and persist() (in addition, Databricks provides its own io.cache functionality):

CACHING:

Caching stores the result DataFrame in the executors' memory, so whenever we need it again in further transformations we can read it from the cache.

Make sure your worker memory has sufficient space for the cached partitions to fit, plus headroom for the intermediate structures Spark builds internally (for example the hash maps used for joins).

Go with caching when the DataFrame is small enough to fit comfortably in memory.

Make sure to un-cache it at the end of the job, since intermediate results are generally only needed within a single job.

DF.cache()
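
Note that cache() is lazy: nothing is materialized until an action runs. A small sketch (the table and column names are illustrative):

```python
df = spark.table("sales").filter("year = 2023")

df.cache()    # marks the DataFrame for caching; nothing happens yet
df.count()    # first action materializes the cache

# Subsequent actions are served from the cached data.
df.groupBy("region").sum("amount").show()

df.unpersist()  # free the cache at the end of the job
```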

PERSIST:

persist() lets you choose where the results are stored: in memory, on the local disks (SSDs) of the compute nodes, or both. Whenever they are needed for further processing, they are pulled from wherever they were persisted.

We can use persist to store data both in memory and on SSD; Spark uses an LRU (least recently used) policy to decide which blocks stay in memory and which are pushed out to disk.

Check the size of your compute nodes' SSDs and verify that the cache fits. If you are on a cloud provider such as Databricks, check their instance matrix before choosing a cluster.

Make sure to un-persist at the end of the job.

DF.persist(StorageLevel.<chooseOption>)

We can choose from:

1. DISK_ONLY

2. MEMORY_ONLY

3. MEMORY_AND_DISK

There are also options for storing the data in serialized form (the _SER levels in the Scala/Java API) and for replicating each partition to a second node (the _2 suffixes, for example MEMORY_AND_DISK_2).

MEMORY_AND_DISK means the data is stored in memory, and any partitions that do not fit are spilled to disk.
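
A short sketch of persist() with an explicit storage level (the table names are illustrative):

```python
from pyspark.storagelevel import StorageLevel

df = spark.table("orders").join(spark.table("customers"), on="customer_id")

# Keep what fits in memory and spill the rest to the executors' local disks.
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()  # action that materializes the persisted data

# ... further transformations reuse the persisted result here ...

df.unpersist()  # release memory and disk at the end of the job
```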

AUTO CACHE BY DATABRICKS (specific to Databricks; enabled by default on supported instance types, increase the disk storage if needed):

Databricks can automatically enable this cache on the cluster.

When the same queries are run repeatedly, leveraging this feature lets your analytics team get query results back faster.

For better results, maintain a separate cluster for each workload. For example, with a dedicated cluster for sales reports, the repetitive queries thrown at it by analysts overlap far more, so more of them are answered from the cache and return faster.

Here, the results are stored on the local SSDs of the compute nodes.

You can see it in the Spark UI under the Storage tab.

spark.conf.set("spark.databricks.io.cache.enabled", "true")
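
If you need to size the disk cache explicitly, Databricks exposes a few additional settings; the names below follow the Databricks disk-cache documentation, so verify them against your runtime version:

```python
# Enable the Databricks disk (IO) cache and bound how much local SSD it may use.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")        # per-node data cache limit
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")     # per-node metadata cache limit
spark.conf.set("spark.databricks.io.cache.compression.enabled", "false")
```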
