1. What is Apache Pig?
Ans: Pig is an Apache open-source project that runs on Hadoop, providing an engine for parallel data flow. It includes a language called Pig Latin for expressing these data flows. Pig Latin provides operations such as sort, join, and filter, and lets users write UDFs (User Defined Functions) for reading, writing, and processing data. Pig uses HDFS for storage and MapReduce to carry out the processing.
2. Compare Pig and Hive.
| Criteria | Pig | Hive |
| --- | --- | --- |
| Language | Pig Latin | SQL-like |
| Application | Programming purposes | Report creation |
| Operation | Client side | Server side |
| Data support | Semi-structured | Structured |
| Connectivity | Can be called by other applications | JDBC & BI tool integration |
Ans: Apache Pig programs are written in the Pig Latin query language, which plays a role similar to SQL. To execute these queries, an execution engine is required. The Pig engine converts the queries into MapReduce jobs, so MapReduce acts as the execution engine that runs the programs as required.
Ans: Apache Pig supports 3 complex data types: map, tuple, and bag.
Ans: We can use Pig in three categories; they are:
Ans: The scalar data types in Pig are int, float, double, long, chararray, and bytearray.
The complex data types in Pig are map, tuple, and bag.
Map: A set of key-value pairs. The keys have the data type chararray, while the values can be of any Pig data type, including complex types.
Example: ['city'#'bang', 'pin'#560001]
Here city and pin are keys mapping to the values bang and 560001.
Tuple: A fixed-length, ordered collection of fields; the fields may be of different data types.
Bag: An unordered collection of tuples; the tuples in a bag are separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
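The three complex types can appear together in one schema. A minimal sketch (the file name and field names below are hypothetical):

```pig
-- 'employees.txt' is a hypothetical tab-delimited input file
emp = LOAD 'employees.txt' AS (
        name:   chararray,
        addr:   tuple(city: chararray, pin: int),    -- tuple: fixed-length, ordered fields
        phones: bag{t: tuple(number: chararray)},    -- bag: unordered collection of tuples
        props:  map[]                                -- map: chararray keys to values of any type
      );
```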
Ans: Pig Latin departs from SQL by using a procedural, data-flow style of coding, whereas Hive's query language is declarative and similar to SQL.
* Pig runs on top of Hadoop and, in principle, could sit on top of other execution engines such as Dryad too.
* Both Hive and Pig commands compile to MapReduce jobs.
Ans: Often a tuple or a bag contains nested data from which we want to remove a level of nesting. In those cases FLATTEN, a modifier built into Pig, is used. FLATTEN un-nests bags and tuples: un-nesting a tuple substitutes its fields directly into the enclosing tuple, whereas un-nesting a bag is more complex because it requires creating new tuples, one for each tuple in the bag.
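A minimal sketch of un-nesting a bag with FLATTEN (the file name and schema are hypothetical):

```pig
-- 'contacts.txt' is a hypothetical input; each record has a name and a bag of phone numbers
A = LOAD 'contacts.txt' AS (name: chararray, phones: bag{t: tuple(number: chararray)});
-- FLATTEN crosses each record with every tuple in its bag, e.g.
-- ('alice', {(111),(222)}) becomes ('alice', 111) and ('alice', 222)
B = FOREACH A GENERATE name, FLATTEN(phones);
```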
Ans: PigStorage loads or stores relations using a field-delimited text format.
Each line is broken into fields using a configurable field delimiter (defaulting to a tab character) to be stored in the tuple's fields. It is the default storage function when none is specified.
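For example, loading comma-delimited data and storing it pipe-delimited (file names below are hypothetical):

```pig
-- 'input.csv' and 'output_dir' are hypothetical paths
A = LOAD 'input.csv' USING PigStorage(',') AS (name: chararray, age: int);
STORE A INTO 'output_dir' USING PigStorage('|');  -- writes delimited text back to HDFS
```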
Ans: If the built-in operators do not provide some required function, developers can implement it by writing a User Defined Function (UDF) in a programming language such as Java, Python, or Ruby. These UDFs are then embedded into the Pig Latin script.
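Once written, a UDF is made available to a script with REGISTER and, optionally, DEFINE. The jar name and class below are hypothetical:

```pig
REGISTER 'my-udfs.jar';                      -- hypothetical jar containing the UDF
DEFINE TO_UPPER com.example.pig.ToUpper();   -- hypothetical Java UDF class
A = LOAD 'names.txt' AS (name: chararray);
B = FOREACH A GENERATE TO_UPPER(name);       -- the UDF is invoked like a built-in function
```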
Ans: Like the WHERE clause in SQL, Apache Pig uses FILTER to extract records based on a predicate or specified condition. A record passes through the pipeline only if the condition evaluates to true. A predicate may use a variety of operators such as ==, <=, !=, >=. For instance:
X = LOAD 'inputs' AS (name, address);
Y = FILTER X BY name MATCHES 'Mr.*';
Ans: We can join multiple fields in Pig with the JOIN operator, which takes records from one input and joins them with records from another specified input. This is done by specifying a key for each input; two rows are joined whenever their keys are equal.
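A join on one key per input might look like this (file names and schemas are hypothetical):

```pig
customers = LOAD 'customers.txt' AS (cust_id: int, name: chararray);
orders    = LOAD 'orders.txt'    AS (order_id: int, cust_id: int, amount: double);
-- records pair up whenever customers::cust_id equals orders::cust_id
joined    = JOIN customers BY cust_id, orders BY cust_id;
```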
Ans: Following are some of the limitations of Apache Pig:
* Apache Pig is not suitable for looking up a single record in huge data sets.
* The Pig platform is designed for ETL-type use cases; it is not a good choice for synchronized or real-time scenarios.
* Apache Pig is built on top of MapReduce, which is itself batch-processing oriented.
Ans: LOAD is a relational operator that loads data from the file system.
The first step in a data-flow language is to specify the input, which is done with the 'load' keyword.
The LOAD syntax is
LOAD 'mydata' [USING function] [AS schema];
Example:
A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
Ans: The relational operators in Pig include foreach, order by, filter, group, distinct, join, and limit.
foreach: It takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.
A = LOAD 'input' AS (emp_name: chararray, emp_id: long, emp_add: chararray, phone: chararray, preferences: map[]);
B = FOREACH A GENERATE emp_name, emp_id;
filter: It contains a predicate and allows us to select which records will be retained in the data pipeline.
Syntax: alias = FILTER alias BY expression;
Here alias is the name of the relation, BY is a required keyword, and the expression is a Boolean condition.
Example: M = FILTER N BY F5 == 4;
Ans: The following diagnostic operators are used to debug and inspect a Pig script:
DUMP: It helps to display the results on screen.
DESCRIBE: It helps to display the schema of a particular relation.
ILLUSTRATE: It helps to display step-by-step execution of a sequence of Pig statements.
EXPLAIN: It helps to display the execution plan for Pig Latin statements.
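Applied to a relation, these operators can be used as follows (input path and schema are hypothetical):

```pig
A = LOAD 'input' AS (name: chararray, id: long);
DESCRIBE A;    -- prints A's schema
DUMP A;        -- executes the pipeline and prints A's records to the screen
EXPLAIN A;     -- prints the logical, physical, and MapReduce plans for computing A
ILLUSTRATE A;  -- runs sample data through each step and shows the intermediate results
```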
Ans: Pig Latin is a procedural language, whereas HiveQL is declarative.
In HiveQL it is necessary to specify the schema, whereas in Pig Latin it is optional.
Ans: Pig Latin is a scripting language, like Perl, for exploring huge data sets; it is made up of a series of transformations and operations that are applied to the input data to produce output data.
Pig engine is an environment to execute the Pig Latin programs. It converts Pig Latin operators into a series of MapReduce jobs.
Stat classes are in the package