Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Data flows through a pipeline step by step and can be stored at any point in the pipeline. It allows users to analyze large unstructured datasets by transforming them and applying functions to them.
Apache Pig can be used to process complex data flows and extend them with custom code. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users’ IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.
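A flow like the one just described can be sketched in a few lines of Pig Latin. This is only an illustrative sketch: the file names, schemas, jar, and the `GeoLookup` UDF are all hypothetical, not part of Pig itself.

```pig
-- Hypothetical sketch of a geo-located web traffic flow
REGISTER 'geo-udfs.jar';  -- assumed jar containing a custom GeoLookup UDF

logs    = LOAD 'weblogs.txt' USING PigStorage('\t')
          AS (ip:chararray, url:chararray, ts:long);
geo     = FOREACH logs GENERATE ip, url, myudfs.GeoLookup(ip) AS country;
clicks  = LOAD 'clickmap.json' USING JsonLoader('url:chararray, clicks:int');
traffic = JOIN geo BY url, clicks BY url;
STORE traffic INTO 'geo_traffic' USING PigStorage(',');
```

Each statement defines a relation from the previous one, which is how Pig expresses a pipeline: external programs and custom code plug in as UDFs, while loading, joining, and storing are handled by built-in operators.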
Pig Architecture and Phases
Pig has an intermediate layer that converts Pig Latin scripts into MapReduce jobs. The main phases in this layer are:
(a) Query Parsing: the Pig command is parsed as a query.
(b) Semantic Checking: the syntax of the Pig script is checked.
(c) Optimization: Pig looks for the easiest and fastest way to get the data.
(d) Physical Planning: the logical plan of the Pig script is translated into a physical plan.
(e) MapReduce Processing: the script is converted into MapReduce code and processed.
Note that logical validation is not possible in Pig, but semantic checking is.
Components of Pig
Apache Pig is a scripting language that can explore huge data sets with the help of an engine that executes data flows in parallel on Hadoop.
There are two main components in Apache Pig
- Pig Latin: Language for expressing data flows
- Pig Engine: converts Pig Latin operators or transformations into a series of MapReduce jobs and executes them. It supports various execution modes, as explained in the subsequent section.
Features of Pig Latin
Pig Latin provides a number of features, some of which are mentioned below.
- Pig Latin scripts can be executed either in interactive mode through the Grunt shell or in batch mode
- It includes operators for many of the traditional data operations (join, sort, filter, etc.)
- It provides flexibility and extensibility, so that users can develop their own functions for reading, processing, and writing data
- It is made up of a series of operations, or transformations, that are applied to the input data to produce output
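A series of such transformations might look like the following (the input file and its schema are assumptions for illustration):

```pig
-- Hypothetical pipeline: load, filter, group, aggregate, sort, store
users   = LOAD 'users.csv' USING PigStorage(',')
          AS (name:chararray, age:int, city:chararray);
adults  = FILTER users BY age >= 18;
by_city = GROUP adults BY city;
counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS n;
ordered = ORDER counts BY n DESC;
STORE ordered INTO 'city_counts';
```

FILTER, GROUP, FOREACH, and ORDER here are the traditional data operations mentioned above, expressed as built-in operators rather than hand-written MapReduce code.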
Limitations of Pig
Below are some of the limitations of Apache Pig, given its architecture and working mechanism.
- Apache Pig does not support random reads or queries that demand low latency, in the order of tens of milliseconds.
- Because low-latency queries are not supported, Pig is not suitable for OLAP and OLTP applications.
- Apache Pig does not support random writes to update a small portion (delta) of data. All writes are bulk, streaming writes, just as in MapReduce.
Pig Execution Modes
The latest version of Pig has six execution modes, or exectypes, some of which are experimental and may not be available in all versions.
- Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system.
Use the command below to run Pig in local mode.
pig -x local
It is useful for debugging and for checking syntax errors in a Pig script using a small subset of data.
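For example, a minimal debugging script run against a small local file might look like this (the file and script names are hypothetical):

```pig
-- sample.pig; run with: pig -x local sample.pig
lines = LOAD 'small_sample.txt' AS (line:chararray);
first = LIMIT lines 10;
DUMP first;  -- print a handful of records to verify the script behaves as expected
```

Because local mode uses the local file system and a single machine, a run like this surfaces syntax and logic errors quickly before the script is submitted to a cluster.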
- Tez Local Mode - Runs Pig in local mode with Tez as the runtime engine.
pig -x tez_local
Note: Tez local mode is experimental. There are some queries which just error out on bigger data in local mode.
- Spark Local Mode - Runs Pig in local mode with Apache Spark as the runtime engine.
pig -x spark_local
Note: Spark local mode is experimental. There are some queries which just error out on bigger data in local mode.
- Mapreduce Mode - The default mode in Pig; it needs a Hadoop cluster and HDFS installation and runs Pig in Mapreduce mode.
# Two ways to invoke Pig in Mapreduce mode
pig
pig -x mapreduce
- Tez Mode - To run Pig in Tez mode, you need access to a Hadoop cluster and HDFS installation.
pig -x tez
- Spark Mode - To run Pig in Spark mode, you need access to a Spark, Yarn or Mesos cluster and HDFS installation.
pig -x spark
In Spark execution mode, it is necessary to set env::SPARK_MASTER to an appropriate value (local - local mode, yarn-client - yarn-client mode, mesos://host:port - Spark on Mesos, or spark://host:port - Spark cluster).
Ways to run Pig
We can run Pig commands in three ways.
- Grunt Shell/Interactive Shell
- Pig Script File
- Embedded Program in Java or another Language
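In the Grunt shell, for instance, statements are entered interactively one at a time; the session below is illustrative (the data file is hypothetical):

```pig
grunt> lines = LOAD 'data.txt' AS (line:chararray);
grunt> DUMP lines;
```

A script file simply collects the same statements into a `.pig` file passed to the `pig` command, while the embedded option drives Pig from a host program.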
Advantage of Apache Pig over MapReduce
Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data – exactly the operations that MapReduce was originally designed for.
Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script.
Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
Use cases of Apache Pig
- Data processing for web search platforms.
- Ad hoc queries across large data sets.
- Rapid prototyping of algorithms for processing large data sets.
- Complex data flows to process web server logs, web analytic data in CSV format, and spreadsheets from the advertising department
Pig vs Hive
Hive provides data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface which makes your work easier if you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.
Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes life a lot easier, because writing MapReduce directly is not always easy; in some cases it can really become a pain.
Hive should be used for analytical querying of data collected over a period of time - for instance, to calculate trends or analyze website logs. Hive should not be used for real-time querying, since it can take a while before any results are returned.
Why/When Use Pig in Addition to MapReduce
- Use Pig when you want to do a lot of transformations on your data and don't want to write a lot of Java code.
- MapReduce requires a Java programmer and multiple stages to arrive at a solution.
- With plain MapReduce, users have to reinvent common functionality (join, filter, etc.).
- Writing MapReduce jobs means a long development cycle with rigorous testing stages.