ETL Testing Interview Questions and Answers
by Rupa.R, on Sep 10, 2022 2:56:18 PM
1.What is ETL Testing?
ETL stands for Extract-Transform-Load and it is a process of how data is loaded from the source system to the data warehouse. Data is extracted from an OLTP database, transformed to match the data warehouse schema and loaded into the data warehouse database.
2.Why is this ETL process used?
Data has become the critical part of all kinds of businesses and operations. Because data is so important to a successful business, poor performance or inaccurate procedure can cost time and money. Therefore, ETL testing is designed to ensure that the data processing is done in the expected way for the business/enterprise to get the benefit out of it.
3.What is Full load & Incremental or Refresh load?
– Initial Load : It is the process of populating all the data warehousing tables for the very first time
– Full Load : While loading the data for the first time, all the set records are loaded at a stretch depending on the volume. It erases all the contents of tables and reloads with fresh data
– Incremental Load : Applying the dynamic changes as and when necessary in a specific period. The schedule is predefined each period
4. Where are this ETL systems exactly used?
ETL systems are used by businesses to integrate data from multiple sources. These software systems are key components in ensuring that your company is processing its data efficiently, allowing your business to run smooth and without interruption.
5. What tools you have used for ETL testing?
1. Data access tools e.g., TOAD, WinSQL, AQT etc. (used to analyze content of tables)
2. ETL Tools e.g. Informatica, DataStage
3. Test management tool e.g. Test Director, Quality Center etc. ( to maintain requirements, test cases, defects and traceability matrix)
6. Explain what are Cubes and OLAP Cubes?
Cubes are data processing units comprised of fact tables and dimensions from the data warehouse. It provides multi-dimensional analysis.
OLAP stands for Online Analytics Processing, and OLAP cube stores large data in muti-dimensional form for reporting purposes. It consists of facts called as measures categorized by dimensions.
7.What are ETL tester roles and responsibilities?
Requires in depth knowledge on the ETL tools and processes
Needs to write the SQL queries for the various given scenarios during the testing phase.
Test components of ETL data warehouse
Execute backend data-driven test
Create, design and execute test cases, test plans and test harness
Identify the problem and provide solutions for potential issues
Approve requirements and design specifications
Data transfers and Test flat file
Writing SQL queries for various scenarios like count test
Should be able to carry our different types of tests such as Primary Key, defaults and keep a check on the other functionality of the ETL process.
8.What is ETL testing in comparison with data base testing?
- Verifies whether data is moved as expected• Verifies whether counts in the source an target are matching• Verifies whether data is transformed as expected• Verifies that the foreign primary key relations are preserved during the ETL• Verifies for duplication in loaded data
9.What Is The Difference Between Etl Tool And Olap Tools?
ETL tool is meant for extraction data from the legacy systems and load into specified database with some process of cleansing data.
ex: Informatica, data stage ….etc
OLAP is meant for Reporting purpose in OLAP data available in Multidirectional model. so that you can write simple query to extract data from the data base.
ex: Business objects, Cognos….etc
10. Explain what factless fact schema is and what is Measures?
A fact table without measures is known as Factless fact table. It can view the number of occurring events. For example, it is used to record an event such as employee count in a company.
11.What Is Ods (operation Data Source)?
- ODS – Operational Data Store.
2.ODS Comes between staging area & Data Warehouse. The data is ODS will be at the low level of granularity.
- Once data was populated in ODS aggregated data will be loaded into EDW through ODS.
12.What Is A Staging Area? Do We Need It? What Is The Purpose Of A Staging Area?
Data staging is actually a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps:
- Source data extraction, Data transformation (restructuring),
- Data transformation (data cleansing, value transformations),
- Surrogate key assignments.
13.What is a three-tier data warehouse?
Most data warehouses are considered to be a three-tier system. This is essential to their structure. The first layer is where the data lands. This is the collection point where data from outside sources is compiled. The second layer is known as the ‘integration layer.’ This is where the data that has been stored is transformed to meet company needs. The third layer is called the ‘dimension layer,’ and is where the transformed information is stored for internal use.
14.What is the difference between data mining and data warehousing?
Data warehousing comes before the mining process. This is the act of gathering data from various exterior sources and organizing it into one specific location: the warehouse. Data mining is when that data is analyzed and used as information for making decisions.
15. What is partitioning and what are some types of partitioning?
Partitioning is when an area of data storage is sub-divided to improve performance. Think of it as an organizational tool. If all your collected data is in one large space without organization the digital tools used for analyzing it will have a more difficult time finding the information in order to analyze it. Partitioning your warehouse will create an organizational structure that will make locating and analyzing easier and faster.
Two types of partitioning are round-robin partitioning and Hash Partitioning. Round-robin partitioning is when the data is evenly distributed among all partitions. This means that the number of rows in each partition is relatively the same. Hash partitioning is when the server applies a hash function in order to create partition keys to group data.
16.What process is exactly involved in ETL testing?
The process of ETL allows a business/enterprise to collect important data from different source systems and validate/change it to fit their goals and models, and then store it in data warehouse for analytic, forecasts and other kinds of reports for daily use. In a world of digital enterprise, it is a critical part of running an effective and efficient business.
17. What are the different ETL testing operations?
ETL testing includes the following :
Verify whether the data is transforming correctly according to business requirements
Verify that the projected data is loaded into the data warehouse without any truncation and data loss
Make sure that ETL application reports invalid data and replaces with default values
Make sure that data loads at expected time frame to improve scalability and performance
18. Explain what is tracing level and what are the types?
Tracing level is the amount of data stored in the log files. Tracing level can be classified in two Normal and Verbose. Normal level explains the tracing level in a detailed manner while verbose explains the tracing levels at each and every row.
19. Explain what is Grain of Fact?
Grain fact can be defined as the level at which the fact information is stored. It is also known as Fact Granularity
20. What is Data base testing?
Data base testing contains different steps compared to data ware house testing:
• Database testing is done using smaller scale of data normally with OLTP (Online transaction processing) type of databases.
• In database testing normally data is consistently injected from uniform sources.
• We generally perform only CRUD (Create, read, update and delete) operation in database testing.
• Normalized databases are used in DB testing.
21.Why ETL testing is required?
To verify the Data which are being transferred from one system to the other in the described patter/manner by the business (requirements)
22.What are snapshots? What are materialized views & where do we use them? What is a materialized view log?
– Snapshots are copies of read-only data of a master table.
– They are located on a remote node that is refreshed periodically to reflect the changes made to the master table.
– They are replica of tables
– Views are built by using attributes of one or more tables.
– View with single table can be updated, whereas view with multiple tables cannot be updated
Materialized View log
– A materialized view is a pre computed table that has aggregated or joined data from fact tables and dimension tables.
– To put it simple, a materialized view is an aggregate table.
23.What is partitioning? Explain about Round-Robin, Hash partitioning.
– Partitioning is to sub divide the transactions to improve performance.
– Increasing the number of partitions enables Informatica Server for creation of multiple connections to various sources.
– The following are the partitions
– Data is distributed evenly by Informatica among all partitions.
– This partitioning is used where the number of rows to process in each partition are approximately same
– Informatica server applies a hash function for the purpose of partitioning keys to group data among partitions.
– It is used where ensuring the processes groups of rows with the same partitioning key in the same partition, need to be ensured.
24.What are the differences between Connected and Unconnected lookup?
– participates in mapping
– Returns multiple values
– Can be connected to another transformations and returns a value
– It is used when lookup function is used instead of an expression transformation while mapping, where lookup does not available in the main flow
– Returns only one output port
– It cannot be connected to another transformation
– Unconnected Lookups are reusable.
25. How to fine tune mappings?
The following are the steps to fine tune mappings:
– Use the condition for filter in source qualifies without using filter
– Utilize persistence and cache that is shared in look up t/r
– Use the aggregations t/r in sorted i/p group by different ports
– Use operators in expressions instead of functions
– Increase the cache size and commit interval