Ans: Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Ans: SparkContext is the entry point to Spark. Using SparkContext you create RDDs, which provide various ways of churning data.
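A minimal sketch of creating a SparkContext and an RDD (local mode; the app name and data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure and create the SparkContext, the entry point to Spark.
val conf = new SparkConf().setAppName("example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Use the context to create an RDD and churn the data.
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(rdd.map(_ * 2).sum()) // 30.0
```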
Ans: There are a few important reasons why Spark is faster than MapReduce, and some of them are below:
In MapReduce, the intermediate data is stored in HDFS, so it takes longer to get the data from the source between stages; this is not the case with Spark, which keeps intermediate data in memory.
Ans:
Ans:
Ans: Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data.
Ans: Spark can run on the following platforms:
1. Standalone mode, using Spark's own cluster manager
2. Hadoop YARN
3. Apache Mesos
4. Amazon EC2
Ans: Though Spark is written in Scala, it lets the users code in various languages such as:
1. Scala
2. Java
3. Python
4. R (with SparkR)
Also, by way of piping the data via other commands, we should be able to use all kinds of programming languages or binaries.
| Criteria | Hadoop MapReduce | Apache Spark |
| --- | --- | --- |
| Memory | Does not leverage the memory of the Hadoop cluster to the maximum. | Lets us save data in memory with the use of RDDs. |
| Disk usage | MapReduce is disk oriented. | Spark caches data in-memory and ensures low latency. |
| Processing | Only batch processing is supported. | Supports real-time processing through Spark Streaming. |
| Installation | Is bound to Hadoop. | Is not bound to Hadoop. |
Ans: Transformations create new RDDs from existing RDDs; these transformations are lazy and will not be executed until you call an action.
Eg: map(), filter(), flatMap(), etc.
Actions return results computed from an RDD.
Eg: reduce(), count(), collect(), etc.
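A short sketch (reusing the `sc` context from the earlier example) showing that transformations only build a lineage lazily, while actions trigger execution:

```scala
// Transformations are lazy: nothing runs when these lines execute.
val words  = sc.parallelize(Seq("spark", "hadoop", "spark"))
val pairs  = words.map(word => (word, 1)) // transformation
val counts = pairs.reduceByKey(_ + _)     // transformation

// Actions trigger execution of the whole lineage.
counts.collect().foreach(println) // action: prints (spark,2) and (hadoop,1)
println(words.count())            // action: 3
```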
Ans: Spark has been designed to process data from various sources. Whether the data is stored in HDFS, Cassandra, EC2, Hive, HBase, or Alluxio (previously Tachyon), Spark can process it. It can also read data from any system that supports a Hadoop data source.
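For example, the same RDD API can read from different storage systems just by changing the URI (the paths below are illustrative):

```scala
// Reading from different storage systems through the same API.
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input.txt")
val fromLocal = sc.textFile("file:///tmp/input.txt")
// Any Hadoop InputFormat works as well, e.g. via sc.newAPIHadoopFile(...).
```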
Ans:
Ans: The Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark Master. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
Ans: Accumulators are write-only variables that are initialized once and sent to the workers. The workers update them based on the logic written, and the updates are sent back to the driver, which aggregates or processes them.
Only the driver can access the accumulator's value; for tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
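A minimal sketch of the error-counting example, using the Spark 2.x `longAccumulator` API (data is illustrative):

```scala
// A long accumulator created on the driver; tasks can only add to it.
val errorCount = sc.longAccumulator("errors")

sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { line =>
  if (line == "error") errorCount.add(1) // workers write, never read
}

// Only the driver can read the aggregated value.
println(errorCount.value) // 2
```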
Ans: Hive contains significant support for Apache Spark, wherein Hive's execution engine can be configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
Ans: Broadcast variables are read-only shared variables. Suppose there is a set of data that may have to be used multiple times by the workers at different phases; we can share that data with the workers from the driver, and every machine can read it.
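A minimal sketch (the lookup table and codes are illustrative): the map is shipped to every executor once, instead of with every task.

```scala
// Ship a lookup table to every executor once.
val countryCodes = Map("IN" -> "India", "US" -> "United States")
val broadcastCodes = sc.broadcast(countryCodes)

val resolved = sc.parallelize(Seq("IN", "US", "IN")).map { code =>
  broadcastCodes.value.getOrElse(code, "unknown") // read-only access on workers
}
resolved.collect().foreach(println)
```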
Ans:
1. Spark is memory intensive; whatever you do, it does in memory.
2. You can adjust how long Spark will wait before it times out on each of the phases of data locality (process local –> node local –> rack local –> any).
3. Filter out data as early as possible. For caching, choose wisely from the various storage levels.
4. Tune the number of partitions in Spark (a few of these knobs are sketched below).
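A hedged sketch of some of these knobs; the property values, paths, and partition counts are illustrative examples, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Locality and parallelism are set on the conf before the context is created.
val conf = new SparkConf()
  .setAppName("tuned-job")
  .set("spark.locality.wait", "1s")        // wait time per data-locality level
  .set("spark.default.parallelism", "200") // default partition count for shuffles
val sc = new SparkContext(conf)

// Filter early, then pick a storage level deliberately when caching.
val errors = sc.textFile("hdfs:///logs").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Adjust the partition count explicitly where the default does not fit.
val repartitioned = errors.repartition(100)
```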
Ans: Spark SQL is a module for structured data processing, where we take advantage of SQL queries running on datasets.
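A minimal sketch using the Spark 2.x SparkSession entry point (the file, view, and column names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the Spark 2.x entry point for Spark SQL.
val spark = SparkSession.builder().appName("sql-example").getOrCreate()

// Register a dataset as a temporary view and query it with SQL.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age >= 18").show()
```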
Ans: Whenever data flows continuously and you want to process it as early as possible, you can take advantage of Spark Streaming. It is the API for stream processing of live data. Data can flow from Kafka, Flume, TCP sockets, Kinesis, etc., and you can do complex processing on the data before pushing it to its destination. Destinations can be file systems, databases, or other dashboards.
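A minimal streaming word-count sketch over a TCP socket source (host and port are illustrative; reuses the `sc` context from earlier examples):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process incoming data in 10-second batches.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

lines.flatMap(_.split(" "))
     .map((_, 1))
     .reduceByKey(_ + _)
     .print()

ssc.start()
ssc.awaitTermination()
```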
Ans: In Spark Streaming, you have to specify the batch interval. For example, if your batch interval is 10 seconds, Spark will process whatever data it received in the last 10 seconds, i.e., the last batch interval. But with a sliding window, you can specify how many of the last batches have to be processed together. Apart from the window length, you can also specify the slide interval, i.e., when the next window should be computed. For example, you may want to process the last 3 batches whenever there are 2 new batches; that defines when the window slides and how many batches are processed in each window.
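Continuing the streaming sketch above, this expresses exactly that example: with a 10-second batch interval, a 30-second window (3 batches) that slides every 20 seconds (2 new batches):

```scala
// Count words over the last 30 seconds, recomputed every 20 seconds.
val windowedCounts = lines.flatMap(_.split(" "))
                          .map((_, 1))
                          .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(20))
windowedCounts.print()
```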
Ans: MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
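A small clustering sketch with MLlib's KMeans (the data points, k, and iteration count are illustrative):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Cluster a handful of 2-D points into two groups.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)
```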
Ans: Spark SQL is capable of:
1. Loading data from a variety of structured sources.
2. Querying data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
3. Providing rich integration between SQL and regular Scala/Java/Python code, including the ability to join SQL tables with RDDs.
Ans: To connect Spark with Mesos:
1. Configure the Spark driver program to connect to Mesos.
2. Put the Spark binary package in a location accessible by Mesos.
3. Install Spark in the same location as Mesos and configure the property spark.mesos.executor.home to point to the location where it is installed.
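A hedged sketch of the driver-side configuration (the master host/port and package path are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Point the driver at the Mesos master and tell executors where to fetch Spark.
val conf = new SparkConf()
  .setAppName("spark-on-mesos")
  .setMaster("mesos://mesosMaster:5050")                      // Mesos master URL
  .set("spark.executor.uri", "hdfs:///spark/spark-2.x.x.tgz") // Spark binary package
val sc = new SparkContext(conf)
```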
Ans: Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
Ans: Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Ans: Catalyst is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
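One way to observe Catalyst at work (reusing the illustrative `spark` session and `people.json` from above):

```scala
// explain(true) prints the parsed, analyzed, optimized (Catalyst), and
// physical plans, so you can see the rewrites Catalyst applied.
val df = spark.read.json("people.json")
df.filter("age > 18").select("name").explain(true)
```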
Ans: Pinterest, Conviva, Shopify, OpenTable
Ans: BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.
Ans: Developers often make the mistake of:
1. Hitting the web service several times by using multiple clusters.
2. Running everything on the local node instead of distributing it.
Developers need to be careful with this, as Spark makes use of memory for processing.
Ans: Parquet file is a columnar format file that helps –
1. Limit I/O operations by reading only the columns a query needs.
2. Consume less space than row-oriented formats.
3. Fetch only specific columns that you need to access.
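A minimal sketch of writing and reading Parquet (paths are illustrative; reuses the `spark` session from above):

```scala
// Write a DataFrame as Parquet, then read back only one column.
val people = spark.read.json("people.json")
people.write.parquet("people.parquet")

// Column pruning: only the 'name' column is actually read from disk.
spark.read.parquet("people.parquet").select("name").show()
```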