Azure Data Factory Interview Questions and Answers
by Bharathkumar, on Sep 10, 2022 11:03:07 AM
1. What is Azure Data Factory?
Ans: Azure Data Factory is a cloud-based data integration service for creating data-driven workflows that orchestrate and automate data movement and data transformation.
- Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
- It can process and transform the data by using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
2. Why do we need Azure Data Factory?
Ans:
- The amount of data generated these days is huge, and it comes from many different sources. When we move this data to the cloud, a few things need to be taken care of.
- Because the data comes from different sources, each source may deliver it in a different way and in a different format. When we bring this data to the cloud or to a particular storage, we need to make sure it is well managed, i.e. we need to transform the data and delete the unnecessary parts. As far as moving the data is concerned, we need to pick it up from the different sources, bring it to one common place, store it, and, if required, transform it into something more meaningful.
- A traditional data warehouse can do this as well, but there are certain disadvantages. Sometimes we are forced to build custom applications that handle each of these processes individually, which is time-consuming, and integrating all these sources is a huge pain. We need a way to automate this process or to create proper workflows.
- Data Factory helps to orchestrate this complete process in a more manageable and organized manner.
3. What are the top-level concepts of Azure Data Factory?
Ans:
- Pipeline: It acts as a carrier in which various processes take place. Each individual process is an activity.
- Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or multiple activities. An activity can be any process, such as querying a dataset or moving a dataset from one source to another.
- Datasets: Sources of data. In simple words, a dataset is a data structure that points to our data.
- Linked services: These store the information needed to connect to an external source. For example, to connect to SQL Server you need a connection string. You need a linked service for both the source and the destination of your data.
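The relationship between these four concepts can be sketched in a few lines of Python. This is only a conceptual model, not the Data Factory SDK: the class names, field names, the cars example, and the connection strings are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkedService:
    """Connection information for an external data store."""
    name: str
    connection_string: str


@dataclass
class Dataset:
    """A named pointer to data that lives behind a linked service."""
    name: str
    linked_service: LinkedService
    location: str  # e.g. a table name or a file path


@dataclass
class Activity:
    """One processing step, such as copying from a source to a sink."""
    name: str
    kind: str
    source: Dataset
    sink: Dataset


@dataclass
class Pipeline:
    """The carrier that groups one or more activities."""
    name: str
    activities: List[Activity] = field(default_factory=list)


# Hypothetical names and placeholder connection strings throughout.
sql = LinkedService("SqlServerLS", "Server=<server>;Database=<db>;")
lake = LinkedService("DataLakeLS", "<data-lake-connection-info>")

cars_table = Dataset("CarsTable", sql, "dbo.Cars")
cars_file = Dataset("CarsFile", lake, "raw/cars.csv")

pipeline = Pipeline("CopyCars", [Activity("CopySqlToLake", "Copy", cars_table, cars_file)])

for activity in pipeline.activities:
    print(f"{activity.name}: {activity.source.location} -> {activity.sink.location}")
```

Note how the activity never touches a connection string directly: it refers to datasets, and each dataset refers to its linked service.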
4. What is the difference between Azure Data Lake and Azure Data Warehouse?
Ans: A data warehouse is the traditional way of storing data and is still widely used. A data lake is complementary to a data warehouse, i.e. data held in a data lake can be stored in the data warehouse as well, but it has to follow the warehouse's rules first.
DATA LAKE | DATA WAREHOUSE |
Complementary to the data warehouse | May be sourced from the data lake |
Data is detailed or raw and can be in any form. You just take the data and dump it into your data lake | Data is filtered, summarised, and refined |
Schema on read (not structured; you can define your schema in any number of ways) | Schema on write (data is written in a structured form or in a particular schema) |
One language to process data of any format (U-SQL) | Uses SQL |
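The schema-on-read versus schema-on-write row is easier to see in code. The following is a minimal sketch using plain Python lists as stand-ins for the two stores; the column names, the sample records, and the helper functions are assumptions made up for illustration.

```python
import json

# Schema-on-write: the schema is enforced before a row is stored.
WAREHOUSE_SCHEMA = {"id": int, "model": str, "price": float}


def write_to_warehouse(table, row):
    """Reject rows that do not match the fixed schema."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(row)


# Schema-on-read: raw records are stored as-is and interpreted when queried.
def write_to_lake(lake, raw_record):
    lake.append(raw_record)  # no validation at write time


def read_prices(lake):
    """Apply a schema at read time, skipping records that do not fit it."""
    prices = []
    for raw in lake:
        try:
            record = json.loads(raw)
            prices.append(float(record["price"]))
        except (ValueError, KeyError, TypeError):
            continue
    return prices


warehouse, lake = [], []
write_to_warehouse(warehouse, {"id": 1, "model": "A", "price": 19999.0})
write_to_lake(lake, '{"id": 1, "price": "19999"}')
write_to_lake(lake, "free-form log line")
print(read_prices(lake))
```

The lake accepts the free-form line without complaint; it is only at read time that the reader decides which records fit the shape it cares about.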
5. What is the integration runtime?
Ans: The integration runtime is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across various network environments.
There are three types of integration runtime:
- Azure Integration Runtime: It can copy data between cloud data stores, and it can dispatch activities to a variety of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
- Self-Hosted Integration Runtime: This is software with essentially the same code as the Azure Integration Runtime, but you install it on an on-premises machine or on a virtual machine in a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store in a private network. It can also dispatch transformation activities against compute resources in a private network. We use a self-hosted IR because Data Factory cannot directly access on-premises data sources that sit behind a firewall. It is sometimes possible to establish a direct connection between Azure and on-premises data sources by configuring the firewall in a specific way; if we do that, we don't need a self-hosted IR.
- Azure-SSIS Integration Runtime: With the Azure-SSIS Integration Runtime, you can natively execute SSIS packages in a managed environment. So when we lift and shift SSIS packages to Data Factory, we use the Azure-SSIS Integration Runtime.
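As a rough rule of thumb, the choice between the three types can be written as a small decision function. This is a simplification of the descriptions above, not an official selection algorithm, and the function and parameter names are hypothetical.

```python
def choose_integration_runtime(runs_ssis_package: bool, data_in_private_network: bool) -> str:
    """Return the integration runtime type suggested by the descriptions above."""
    if runs_ssis_package:
        # Lifted-and-shifted SSIS packages run on the Azure-SSIS IR.
        return "Azure-SSIS"
    if data_in_private_network:
        # On-premises or virtual-network data stores need a self-hosted IR.
        return "Self-hosted"
    # Cloud-to-cloud copies and dispatching to cloud compute services.
    return "Azure"


print(choose_integration_runtime(runs_ssis_package=False, data_in_private_network=True))
```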
6. What is Cloud Computing?
Ans: Microsoft Azure (formerly Windows Azure) is a cloud platform developed by Microsoft that enables businesses to run completely in the cloud.
Cloud computing is web-based computing that allows businesses and individuals to consume computing resources such as virtual machines, databases, processing, memory, services, storage, or even a number of calls or events, on a pay-as-you-go basis. The pay-as-you-go model charges only for the resources you use. Unlike traditional computing, if you do not use any resources, you do not pay. It is similar to having a water connection or an electricity line: a meter keeps track of your monthly usage, and you pay for that usage at a given rate.
Cloud computing is a culmination of numerous attempts at large-scale computing with seamless access to virtually limitless resources.
Here are some key advantages of cloud computing:
- Cloud computing allows businesses to cut their operational and fixed monthly costs for hardware, employees, and software licenses. Hardware, database servers, web servers, software, products, and services are hosted in the cloud and added to the account as needed.
- Cloud computing offers round-the-clock availability; providers commonly advertise uptime targets such as 99.99%. Cloud servers and data centers are managed by the cloud service provider, so you do not need your own employees to manage them.
- Cloud computing is scalable and reliable. There is no limit on the number of users or resources. Cloud increases processing and resources as needed. If you do not need resources, you can always scale down. A cloud service provider such as Azure or AWS
- Cloud computing provides maintainability and automatic updates of software, operating systems, databases, and third-party software. This reduces IT labor costs for a business.
- Cloud service providers have data centers in various locations around the globe, which makes their services faster and more reliable.
7. What is Azure Table Storage?
Ans: Azure Table storage is a service that stores structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design. Because of this schemaless design, Table storage is fast and cost-effective for many types of applications.
You can use it to store flexible datasets such as user data for a web application, device information, or any other type of metadata that your service requires.
You can store any number of entities in a table. One storage account may contain any number of tables, up to the capacity limit of the storage account.
Table storage can also hold large amounts of structured data. The service is a NoSQL data store that accepts authenticated calls from inside and outside the Azure cloud.
Table storage is a good fit for:
- Storing TBs of structured data.
- Storing datasets that don't require complex joins, foreign keys, or stored procedures.
- Quickly querying data using a clustered index.
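To make "key/attribute store with a schemaless design" concrete, here is a toy in-memory sketch. It is not the Azure SDK: the `InMemoryTable` class and the sample entities are hypothetical. The one detail taken from the real service is that each entity is addressed by its `PartitionKey` and `RowKey`.

```python
from typing import Any, Dict, Optional, Tuple

Entity = Dict[str, Any]


class InMemoryTable:
    """Toy stand-in for a table: entities addressed by (PartitionKey, RowKey)."""

    def __init__(self) -> None:
        self._entities: Dict[Tuple[str, str], Entity] = {}

    def upsert(self, entity: Entity) -> None:
        """Insert the entity, or replace the one stored under the same keys."""
        key = (entity["PartitionKey"], entity["RowKey"])
        self._entities[key] = dict(entity)

    def get(self, partition_key: str, row_key: str) -> Optional[Entity]:
        """Look an entity up by its two keys; return None if it is absent."""
        return self._entities.get((partition_key, row_key))


users = InMemoryTable()
# Schemaless: the two entities carry different sets of properties.
users.upsert({"PartitionKey": "users", "RowKey": "u1", "email": "a@example.com"})
users.upsert({"PartitionKey": "users", "RowKey": "u2", "device": "tablet", "os_version": 14})

print(users.get("users", "u2"))
```

There are no joins or foreign keys here: every read is a lookup by key, which is the access pattern the bullets above describe.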
8. What is Microsoft Azure and why is it used?
Ans: Companies that provide cloud services are called cloud providers. There are many cloud providers, and Microsoft Azure is one of them. It is used to access Microsoft's cloud infrastructure.
Intermediate Interview Questions
9. What is blob storage in Azure?
Ans: Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob Storage to expose data publicly to the world or to store application data privately. Common uses of Blob Storage include:
- Serving images or documents directly to a browser
- Storing files for distributed access
- Streaming video and audio
- Storing data for backup and restore, disaster recovery, and archiving
- Storing data for analysis by an on-premises or Azure-hosted service
10. What are the steps for creating the ETL process in Azure Data Factory?
Ans: Suppose we want to extract some data from an Azure SQL Server database, process it if required, and store the result in the Data Lake Store.
Steps for creating the ETL process:
- Create a linked service for the source data store, which is the SQL Server database
- Create a dataset for the source data (assume that we have a cars dataset)
- Create a linked service for the destination data store, which is Azure Data Lake Store
- Create a dataset for saving the data
- Create the pipeline and add a copy activity
- Schedule the pipeline by adding a trigger
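The steps above can be sketched as a set of definitions written as Python dictionaries, loosely modeled on the JSON shape Data Factory uses. Treat this as an illustration only: every name is hypothetical, the connection values are placeholders, and the `type` strings are assumptions that should be checked against the service's JSON reference before use.

```python
import json


def reference(name, ref_type):
    """Build a reference to another named definition."""
    return {"referenceName": name, "type": ref_type}


# Steps 1 and 3: linked services for the source and the destination.
linked_services = [
    {"name": "SqlServerLS", "properties": {
        "type": "SqlServer",  # assumed type string
        "typeProperties": {"connectionString": "<placeholder>"}}},
    {"name": "DataLakeLS", "properties": {
        "type": "AzureDataLakeStore",  # assumed type string
        "typeProperties": {"dataLakeStoreUri": "<placeholder>"}}},
]

# Steps 2 and 4: one dataset per side, each pointing at its linked service.
datasets = [
    {"name": "CarsTable", "properties": {
        "type": "SqlServerTable",  # assumed type string
        "linkedServiceName": reference("SqlServerLS", "LinkedServiceReference")}},
    {"name": "CarsFile", "properties": {
        "type": "AzureDataLakeStoreFile",  # assumed type string
        "linkedServiceName": reference("DataLakeLS", "LinkedServiceReference")}},
]

# Step 5: a pipeline with a single copy activity.
pipeline = {"name": "CopyCarsPipeline", "properties": {"activities": [{
    "name": "CopyCars",
    "type": "Copy",
    "inputs": [reference("CarsTable", "DatasetReference")],
    "outputs": [reference("CarsFile", "DatasetReference")],
}]}}

# Step 6: a trigger that runs the pipeline once a day.
trigger = {"name": "DailyTrigger", "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {"recurrence": {"frequency": "Day", "interval": 1}},
    "pipelines": [{"pipelineReference": reference("CopyCarsPipeline", "PipelineReference")}],
}}


def unresolved_references():
    """List referenced names that have no matching definition."""
    known_services = {ls["name"] for ls in linked_services}
    known_datasets = {ds["name"] for ds in datasets}
    missing = [ds["properties"]["linkedServiceName"]["referenceName"]
               for ds in datasets
               if ds["properties"]["linkedServiceName"]["referenceName"] not in known_services]
    for activity in pipeline["properties"]["activities"]:
        for ref in activity["inputs"] + activity["outputs"]:
            if ref["referenceName"] not in known_datasets:
                missing.append(ref["referenceName"])
    return missing


print(json.dumps(pipeline, indent=2))
print("unresolved:", unresolved_references())
```

The order of the steps matters because each definition refers to the previous one by name: datasets name their linked service, the copy activity names its datasets, and the trigger names the pipeline.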
11. What are the top-level concepts of Azure Data Factory?
Ans:
- Pipeline: It acts as a carrier in which various processes take place. Each individual process is an activity.
- Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or multiple activities. An activity can be any process, such as querying a dataset or moving a dataset from one source to another.
- Datasets: Sources of data. In simple words, a dataset is a data structure that points to our data.
- Linked services: These store the information needed to connect to an external source. For example, to connect to SQL Server you need a connection string. You need a linked service for both the source and the destination of your data.
12. What is the difference between HDInsight and Azure Data Lake Analytics?
Ans:
HDInsight (PaaS) | ADLA (SaaS) |
HDInsight is Platform as a Service. | Azure Data Lake Analytics is Software as a Service. |
To process a dataset, we first have to configure a cluster with predefined nodes, and then we use a language such as Pig or Hive to process the data. | We only pass the queries written for processing the data, and Azure Data Lake Analytics creates the necessary compute nodes on demand, as per our instructions, and processes the dataset. |
Since we configure the cluster ourselves with HDInsight, we can create it and control it as we want. Hadoop subprojects such as Spark and Kafka can be used without any limitation. | Azure Data Lake Analytics does not give much flexibility in provisioning the cluster, but Azure takes care of it. We don't need to worry about cluster creation. Nodes are assigned based on the instructions we pass. In addition, we can use U-SQL, taking advantage of .NET, for processing data. |
Advanced Interview Questions
13. How do I access data by using the other 80 dataset types in Data Factory?
Ans: The Mapping Data Flow feature currently supports Azure SQL Database, Azure SQL Data Warehouse, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively as source and sink.
- Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform data after it’s been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in the source to transform that data.
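The stage-then-transform pattern comes down to one activity depending on another. The sketch below writes two activities as Python dictionaries, loosely modeled on the pipeline JSON, and derives the run order from their dependencies. The activity names are hypothetical, the `type` strings are assumptions, and the ordering helper is only an illustration of the idea, not how the service schedules work internally.

```python
from graphlib import TopologicalSorter

# Listed out of order on purpose: dependencies decide the run order, not position.
activities = [
    {
        "name": "TransformStagedData",
        "type": "ExecuteDataFlow",  # assumed type string for a Data Flow activity
        "dependsOn": [{"activity": "StageToBlob", "dependencyConditions": ["Succeeded"]}],
    },
    {
        "name": "StageToBlob",
        "type": "Copy",
        "dependsOn": [],
    },
]


def run_order(activity_list):
    """Order activities so that each one runs after the activities it depends on."""
    graph = {
        activity["name"]: {dep["activity"] for dep in activity["dependsOn"]}
        for activity in activity_list
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(activities))
```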
14. How can I schedule a pipeline?
Ans: You can use the schedule trigger or the tumbling window trigger to schedule a pipeline.
- The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
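To see what a calendar-based recurrence such as "Mondays at 6:00 PM and Thursdays at 9:00 PM" means in practice, here is a small sketch that computes the upcoming run times with the standard library. It only illustrates the wall-clock pattern; it is not how the trigger is configured in Data Factory, it ignores time zones, and the `SCHEDULE` table and `next_runs` helper are hypothetical.

```python
from datetime import datetime, timedelta

# (weekday, hour, minute) with Monday == 0, matching datetime.weekday().
SCHEDULE = [(0, 18, 0), (3, 21, 0)]  # Mondays 6:00 PM, Thursdays 9:00 PM


def next_runs(after: datetime, count: int) -> list:
    """Return the next `count` scheduled run times strictly after `after`."""
    runs = []
    day = after.replace(hour=0, minute=0, second=0, microsecond=0)
    while len(runs) < count:
        for weekday, hour, minute in SCHEDULE:
            candidate = day.replace(hour=hour, minute=minute)
            if day.weekday() == weekday and candidate > after:
                runs.append(candidate)
        day += timedelta(days=1)
    return runs[:count]


start = datetime(2022, 9, 10, 11, 3)  # a Saturday
for run in next_runs(start, 3):
    print(run.isoformat(sep=" ", timespec="minutes"))
```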
15. What is Azure Service Fabric?
Ans: Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage scalable and reliable microservices. Service Fabric also addresses the significant challenges in developing and managing cloud applications. Developers and administrators can avoid complex infrastructure problems and focus on implementing mission-critical, demanding workloads that are scalable, reliable, and manageable. Service Fabric represents the next-generation middleware platform for building and managing these enterprise-class, tier-1, cloud-scale applications.
16. What has changed from private preview to limited public preview in regard to data flows?
Ans: You will no longer have to bring your own Azure Databricks clusters.
- Data Factory will manage cluster creation and tear-down.
- Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets.
- You can still use Data Lake Storage Gen2 and Blob storage to store those files. Use the appropriate linked service for those storage engines.
17. What are the benefits of the traffic manager in Windows Azure?
Ans: Traffic Manager controls how user traffic is distributed across cloud service deployments. Its benefits include:
- It makes the application available worldwide through automated traffic control.
- It contributes to high performance: pages load faster and the application is more convenient to use.
- There is no downtime for maintaining or upgrading the existing system. The system keeps running in the background while the upgrade takes place.
- The configuration is made easy through the Azure portal.
18. What is Azure Redis Cache?
Ans: Redis is an open-source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Azure Redis Cache is based on the popular open-source Redis cache. It gives you access to a secure, dedicated Redis cache, managed by Microsoft, and accessible from any application within Azure. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, and geospatial indexes with radius queries.
19. How do I gracefully handle null values in an activity output?
Ans: You can use the @coalesce construct in the expressions to handle the null values gracefully.
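The behavior of coalesce is simple: it returns the first of its arguments that is not null. In a pipeline expression this might look like `@coalesce(activity('CopyCars').output.rowsCopied, 0)`, where the activity name and the output property are hypothetical. The Python sketch below mimics the same semantics so the behavior is easy to test; it is not part of Data Factory.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


# A hypothetical activity output in which the expected property is null.
activity_output = {"rowsCopied": None}

rows = coalesce(activity_output.get("rowsCopied"), 0)
print(rows)
```

Only null counts as missing. A value such as `0` or an empty string is returned as-is, so the fallback is used only when the property is genuinely absent or null.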
20. What is the difference between IaaS, PaaS, and SaaS?
Ans: IaaS, PaaS, and SaaS are the three main service models of Azure and of cloud computing in general.
Infrastructure as a Service (IaaS):
With IaaS, you rent IT infrastructure – servers and virtual machines (VMs), storage, networks, operating systems – from a cloud provider on a pay-as-you-go basis.
Platform as a Service (PaaS):
Platform as a service (PaaS) refers to cloud computing services that supply an on-demand environment for developing, testing, delivering, and managing software applications.
Software as a Service (SaaS):
Software as a service (SaaS) is a method for delivering software applications over the Internet, on-demand, and typically on a subscription basis. With SaaS, cloud providers host and manage the software application and underlying infrastructure and handle any maintenance, such as software upgrades and security patching.