Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive Query Language(HQL) is SQL like query language which is used to query data from hive tables.
First create the database
testdb if does not exists and list the databases.
-- Create Database if does not exists CREATE DATABASE IF NOT EXISTS testdb; -- Use the newly craete Database USE testdb; -- List all the databases SHOW DATABASES;
Upload Data in HDFS
drivers.csv and ‘'’timesheet data into HDFS(/user/maria-dev/) from the below data source
Now Upload the Files
Create Internal Table for Drivers Data
As database is recenly created ,no tables exists in the new database. Following query will result in a empty result.
-- Show all the Tables SHOW TABLES;
Now Create Table using the Hive DDL statement.
-- Create Table drivers if does not exists CREATE TABLE IF NOT EXISTS drivers_int(driverId INT ,name STRING ,ssn BIGINT, location STRING ,certified STRING ,wageplan STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES('skip.header.line.count' = '1'); SHOW TABLES;
In this DDL statement notice the folloing things
FIELDS TERMINATED BY ','
It means thad the data for this table will be seerated by comma or the data will be in csv format
STORED AS TEXTFILE
ROW FORMAT DELIMITED
Describe Tables Schema
-- Describes Table DESCRIBE EXTENDED drivers_int;
Load the data by overwriting the existing data
-- Insert the drivers_int table with the drivers.csv data by overwriting existing data LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE drivers_int; --List the data from Tables SELECT * FROM temp_drivers LIMIT 10;
After executing LOAD DATA we can see table
drivers_int was populated with data from
Note that Hive consumed the data file
drivers.csv during this step. If you look in the File Browser you will see drivers.csv is no longer there.
As we are creating internal table ,data from HDFS will be moved to Hive ware house.
Create table driver time sheet
-- Create Table TimeSheet CREATE TABLE IF NOT EXISTS timesheet_int( driverId INT ,week INT ,hours_logged INT ,miles_loggged INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES('skip.header.line.count' = '1'); SHOW TABLES;
Load CSV data into Table time sheet
-- Load data into Time Sheet Table LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE timesheet_int; -- List the data from TimeSheet Table SELECT * FROM timesheet_int LIMIT 10;
Find the sum of hours and miles logged score for an year fro each driver
SELECT driverId ,sum(hours_logged) ,sum(miles_loggged) FROM timesheet_int GROUP BY driverId LIMIT 10;
Join the two tables to get the drivers Name for aggregrated data
-- Join the two tables to get the aggregrated hours and miles data along with the respective driver name SELECT d.driverId, d.name ,t.total_hours ,t.total_miles FROM drivers_int d JOIN (SELECT driverId ,sum(hours_logged) total_hours ,sum(miles_loggged) total_miles FROM timesheet_int GROUP BY driverId) t ON (d.driverId = t.driverId) LIMIT 10;