Mastering Data Management with Hive: Internal vs. External Tables


When it comes to data management in Hive, understanding the difference between internal and external tables is crucial. Here’s a comprehensive guide on how to efficiently manage your data using Hive on Hadoop clusters.

1. Setting Up Your Environment

First, authenticate and set up your environment: on your local machine, open a command prompt and install the gcloud SDK, then run the following commands:


gcloud auth login
gcloud config set project <your-project-id>

Transfer your data to the Hadoop cluster: create a Dataproc cluster on Compute Engine (enable the Dataproc and Compute Engine APIs first). Set the master node to 32 GB and the two worker nodes to 32 GB each, then create the cluster.
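If you prefer the CLI to the console, roughly the same cluster can be created with gcloud. This is a minimal sketch: the region and machine types are placeholders, and the 32 GB figures are read here as boot-disk sizes. The cluster name hive-data matches the hive-data-m master address used below.

gcloud dataproc clusters create hive-data \
  --region=us-central1 \
  --num-workers=2 \
  --master-machine-type=e2-standard-4 \
  --worker-machine-type=e2-standard-4 \
  --master-boot-disk-size=32GB \
  --worker-boot-disk-size=32GB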

SSH into the master node. From your local machine, copy the sample file to a path on the server. If the file is large, compress it first and upload the archive (a sketch of this follows the command below).


gcloud compute scp depart_data.csv user@hive-data-m:/home/username
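If the file is large, a compressed transfer looks like this (the archive name is hypothetical):

# on the local machine
zip depart_data.zip depart_data.csv
gcloud compute scp depart_data.zip user@hive-data-m:/home/username

# then, after SSHing into the master node
unzip depart_data.zip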

2. Create and Use Databases

On the master node, type hive to start the Hive shell (it helps to keep two SSH sessions open to the master: one for the Hive shell and one for HDFS commands).

hive> CREATE DATABASE hive_data;
hive> USE hive_data;
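To confirm the database was created before switching into it:

hive> SHOW DATABASES;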

3. Internal Tables

Create an internal table:

Internal (managed) tables are Hive's default: the table data is stored under Hive's warehouse directory in HDFS (by default something like /user/hive/warehouse/hive_data.db for this database), so dropping the table deletes the data along with the metadata. The steps below create the table, load a file into it, and show what a drop removes.

hive> CREATE TABLE department_data (
  dept_id INT,
  dept_name STRING,
  manager_id INT,
  salary INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
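You can verify that Hive registered this as a managed table: DESCRIBE FORMATTED prints, among other details, a Table Type line reading MANAGED_TABLE.

hive> DESCRIBE FORMATTED department_data;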

Load data from the local filesystem:

hive> LOAD DATA LOCAL INPATH 'file:///home/username/depart_data.csv' INTO TABLE department_data;
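A quick sanity check that the rows landed (the output depends on your file):

hive> SELECT * FROM department_data LIMIT 5;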

When you drop an internal table, both the table metadata and the data under the warehouse path are deleted:

hive> DROP TABLE department_data;
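After the drop, the table's directory is gone from the warehouse. Assuming the default warehouse location, you can check with:

hdfs dfs -ls /user/hive/warehouse/hive_data.db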

Another method: recreate department_data with the same CREATE TABLE statement as before, then create an HDFS directory /tmp/input_data, place the file there from the local server (staging commands below), and load from that HDFS path.
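The staging itself is plain HDFS shell work; these are the same commands used again later for the external table:

hdfs dfs -mkdir /tmp/input_data
hdfs dfs -put depart_data.csv /tmp/input_data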

hive> LOAD DATA INPATH '/tmp/input_data/depart_data.csv' INTO TABLE department_data;

LOAD DATA INPATH moves the file rather than copying it: after the load, the file is gone from /tmp/input_data and sits in the table's warehouse directory instead (Hive does this to avoid holding two copies of the data in HDFS). And because department_data is still a managed table, dropping it deletes both the metadata and the data from the warehouse path, so the file survives in neither location.

4. External Tables

Create an external table. Note what happened above: LOAD DATA INPATH moved the file out of /tmp/input_data (Hive moves rather than copies to save storage), so you need to put the file back into the HDFS path from the local server. This time, define the table as EXTERNAL with its LOCATION pointing at that directory:

hive> CREATE EXTERNAL TABLE department_data_external (
  dept_id INT,
  dept_name STRING,
  manager_id INT,
  salary INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/input_data/';
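DESCRIBE FORMATTED now reports the Table Type as EXTERNAL_TABLE, in contrast with the managed table earlier:

hive> DESCRIBE FORMATTED department_data_external;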

Load data from HDFS. Because the table's LOCATION already points at /tmp/input_data, the hdfs dfs -put alone makes the file queryable; the LOAD statement is shown for completeness:

hdfs dfs -mkdir -p /tmp/input_data   # -p: the directory may still exist from the earlier load
hdfs dfs -put depart_data.csv /tmp/input_data
hive> LOAD DATA INPATH '/tmp/input_data/depart_data.csv' INTO TABLE department_data_external;

Dropping an external table only removes the metadata, not the data itself:

Because the file was copied to the /tmp/input_data HDFS path and the table was defined as external, the file remains in /tmp/input_data even after you drop the table, and no copy is ever made under the warehouse path:

hive> DROP TABLE department_data_external;
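Listing the directory after the drop confirms the CSV is untouched:

hdfs dfs -ls /tmp/input_data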

5. Handling Complex Data Types

Arrays and Maps in Hive:

hive> CREATE TABLE employee (
  id INT,
  name STRING,
  skills ARRAY<STRING>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':';
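With those delimiters, rows of array_data.csv would look like the following (hypothetical sample rows, for illustration only):

1,Irfan,hadoop:hive:spark
2,Asha,python:sql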

Load and query array data:

hive> LOAD DATA LOCAL INPATH 'file:///home/username/array_data.csv' INTO TABLE employee;
hive> SELECT id, name, skills[0] AS primary_skill FROM employee;
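Hive also has built-in functions for collections; for example, size() returns the number of elements in an array:

hive> SELECT id, name, size(skills) AS skill_count FROM employee;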

Using Maps:

hive> CREATE TABLE employee_map_data (
  id INT,
  name STRING,
  details MAP<STRING, STRING>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':';
hive> LOAD DATA LOCAL INPATH 'file:///home/username/map_data.csv' INTO TABLE employee_map_data;
hive> SELECT id, name, details['gender'] AS employee_gender FROM employee_map_data;
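For reference, a matching map_data.csv row separates key-value pairs with '|' and keys from values with ':' (hypothetical sample rows):

1,Irfan,gender:male|age:30
2,Asha,gender:female|age:28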