How to decide bucket count in hive

navinkumarnavinkumar
1 min read

Table of contents

Steps

  1. Calculate Expected Bucket Size:

    • Divide the table size by the block size on Hadoop to get an initial estimate.
    Expected Bucket Size = Table Size / Block Size on Hadoop
  1. Find the Nearest Power of 2:

    • Take the base-2 logarithm of the initial estimate to find the nearest power of 2.
    n = log(Expected Bucket Size, 2)
  1. Calculate Final Bucket Size:

    • Use the rounded-off value to calculate the final bucket size by raising 2 to the power of the rounded value.
    Bucket Size = 2 ^ Rounded Value
  1. Example Calculation:

    Assume:

    • Table Size = 2300 MB

    • Block Size on Hadoop = 128 MB

    • Calculate Expected Bucket Size:

        Expected Bucket Size = 2300 / 128 = 17.96875
      
    • Find the Nearest Power of 2:

        n = log(17.96875, 2) ≈ 4.167418
      
    • Round off to the Nearest Integer:

        Rounded Value ≈ 5
      
    • Calculate Final Bucket Size:

        Bucket Size = 2^5 = 32
      
  2. Result:

    Therefore, the bucket size for the given example is 32.

Reference:

How to Decide Bucket Count in Hive-data engineering

0
Subscribe to my newsletter

Read articles from navinkumar directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

navinkumar
navinkumar

A passionate data engineer skilled in ETL, Data warehousing, SQL, Python, Docker & Kubernetes and more. Follow my blog for tech insights!