1. What is Apache Pig?
Ans: Pig is an Apache open-source project that runs on Hadoop, providing an engine for parallel data flow. It includes a language called Pig Latin for expressing these data flows. Pig Latin provides operations such as sort, join, and filter, and lets users write UDFs (User Defined Functions) for reading, writing, and processing data. Pig uses HDFS for storage and MapReduce to carry out the processing.
2. Compare Pig and Hive.
| Criteria | Pig | Hive |
| --- | --- | --- |
| Language | Pig Latin | SQL-like |
| Application | Programming purposes | Report creation |
| Operation | Client side | Server side |
| Data support | Semi-structured | Structured |
| Connectivity | Can be called by other applications | JDBC & BI tool integration |
Ans: Apache Pig programs are written in the Pig Latin query language, which plays a role similar to SQL. To execute these queries, an execution engine is required. The Pig engine converts the queries into MapReduce jobs, so MapReduce acts as the execution engine that runs the programs as required.
Ans: Apache Pig supports 3 complex data types: map, tuple, and bag.
Ans: We can use Pig in three categories; they are:
Ans: The scalar data types in Pig are int, float, double, long, chararray, and bytearray.
The complex data types in Pig are map, tuple, and bag.
Map: A set of key-value pairs. The keys have the data type chararray, while the values can be of any Pig data type, including complex types.
Example: ['city'#'bang', 'pin'#560001]
Here city and pin are keys mapping to the values bang and 560001.
Tuple: A fixed-length, ordered collection of fields; the fields may be of different data types.
Bag: An unordered collection of tuples; the tuples in a bag are separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
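The three complex types can appear together in one schema. A minimal sketch (the file name and field names below are hypothetical):

```pig
-- 'employees.txt' is a hypothetical tab-delimited input file
emp = LOAD 'employees.txt' AS (
        name:   chararray,
        addr:   tuple(city: chararray, pin: int),    -- tuple: fixed-length, ordered fields
        phones: bag{t: tuple(number: chararray)},    -- bag: unordered collection of tuples
        props:  map[]                                -- map: chararray keys to values of any type
      );
```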
Ans: Pig Latin departs from SQL by using a procedural, data-flow style of coding, whereas Hive's query language is declarative and similar to SQL.
* Pig runs on top of Hadoop and, in principle, could sit on top of other execution engines such as Dryad too.
* Both Hive and Pig commands compile to MapReduce jobs.
Ans: Often a tuple or a bag contains nested data from which we want to remove a level of nesting. In those cases FLATTEN, a modifier built into Pig, is used. FLATTEN un-nests bags and tuples: un-nesting a tuple substitutes its fields directly into the enclosing tuple, whereas un-nesting a bag is more complex because it requires creating new tuples, one for each tuple in the bag.
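A minimal sketch of un-nesting a bag with FLATTEN (the file name and schema are hypothetical):

```pig
-- 'contacts.txt' is a hypothetical input; each record has a name and a bag of phone numbers
A = LOAD 'contacts.txt' AS (name: chararray, phones: bag{t: tuple(number: chararray)});
-- FLATTEN crosses each record with every tuple in its bag, e.g.
-- ('alice', {(111),(222)}) becomes ('alice', 111) and ('alice', 222)
B = FOREACH A GENERATE name, FLATTEN(phones);
```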
Ans: PigStorage loads or stores relations using a field-delimited text format.
Each line is broken into fields using a configurable field delimiter (defaulting to a tab character) to be stored in the tuple's fields. It is the default storage function when none is specified.
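For example, loading comma-delimited data and storing it pipe-delimited (file names below are hypothetical):

```pig
-- 'input.csv' and 'output_dir' are hypothetical paths
A = LOAD 'input.csv' USING PigStorage(',') AS (name: chararray, age: int);
STORE A INTO 'output_dir' USING PigStorage('|');  -- writes delimited text back to HDFS
```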
Ans: If the built-in operators do not provide some required function, developers can implement it by writing a User Defined Function (UDF) in a programming language such as Java, Python, or Ruby. These UDFs are then embedded into the Pig Latin script.
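Once written, a UDF is made available to a script with REGISTER and, optionally, DEFINE. The jar name and class below are hypothetical:

```pig
REGISTER 'my-udfs.jar';                      -- hypothetical jar containing the UDF
DEFINE TO_UPPER com.example.pig.ToUpper();   -- hypothetical Java UDF class
A = LOAD 'names.txt' AS (name: chararray);
B = FOREACH A GENERATE TO_UPPER(name);       -- the UDF is invoked like a built-in function
```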
Ans: Like the WHERE clause in SQL, Apache Pig uses FILTER to extract records based on a predicate or specified condition. A record passes through the pipeline only if the condition evaluates to true. A predicate may use a variety of operators such as ==, <=, !=, >=. For instance:
X = LOAD 'inputs' AS (name, address);
Y = FILTER X BY name MATCHES 'Mr.*';
Ans: We can join multiple fields in Pig with the JOIN operator, which takes records from one input and joins them with records from another specified input. This is done by specifying a key for each input; two rows are joined whenever their keys are equal.
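A join on one key per input might look like this (file names and schemas are hypothetical):

```pig
customers = LOAD 'customers.txt' AS (cust_id: int, name: chararray);
orders    = LOAD 'orders.txt'    AS (order_id: int, cust_id: int, amount: double);
-- records pair up whenever customers::cust_id equals orders::cust_id
joined    = JOIN customers BY cust_id, orders BY cust_id;
```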
Ans: Following are some of the limitations of Apache Pig:
* Apache Pig is not suitable for looking up a single record in huge data sets.
* The Pig platform is designed for ETL-type use cases; it is not a good choice for synchronized or real-time scenarios.
* Apache Pig is built on top of MapReduce, which is itself batch-processing oriented.
Ans: LOAD is a relational operator that loads data from the file system.
The first step in a data-flow language is to specify the input, which is done with the 'load' keyword.
The LOAD syntax is
LOAD 'mydata' [USING function] [AS schema];
Example:
A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
Ans: The relational operators in Pig include foreach, order by, filter, group, distinct, join, and limit.
foreach: It takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.
A = LOAD 'input' AS (emp_name: chararray, emp_id: long, emp_add: chararray, phone: chararray, preferences: map[]);
B = FOREACH A GENERATE emp_name, emp_id;
filter: It contains a predicate and allows us to select which records will be retained in the data pipeline.
Syntax: alias = FILTER alias BY expression;
Here alias is the name of the relation, BY is a required keyword, and the expression is a Boolean condition.
Example: M = FILTER N BY F5 == 4;
Ans: The following diagnostic operators are used to debug and inspect a Pig script:
DUMP: It helps to display the results on screen.
DESCRIBE: It helps to display the schema of a particular relation.
ILLUSTRATE: It helps to display step-by-step execution of a sequence of Pig statements.
EXPLAIN: It helps to display the execution plan for Pig Latin statements.
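Applied to a relation, these operators can be used as follows (input path and schema are hypothetical):

```pig
A = LOAD 'input' AS (name: chararray, id: long);
DESCRIBE A;    -- prints A's schema
DUMP A;        -- executes the pipeline and prints A's records to the screen
EXPLAIN A;     -- prints the logical, physical, and MapReduce plans for computing A
ILLUSTRATE A;  -- runs sample data through each step and shows the intermediate results
```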
Ans: Pig Latin is a procedural language, whereas HiveQL is declarative.
In HiveQL it is necessary to specify the schema, whereas in Pig Latin it is optional.
Ans: Pig Latin is a scripting language, like Perl, for exploring huge data sets; it is made up of a series of transformations and operations that are applied to the input data to produce output data.
Pig engine is an environment to execute the Pig Latin programs. It converts Pig Latin operators into a series of MapReduce jobs.
Stat classes are in the package