Testing in a Big Data World – Software Testing With Big Data

  • 05/11/2013
  • Posted by EuroSTAR

As more organizations adopt "Big Data" as their data analytics solution, they are finding it difficult to define a robust testing strategy and to set up an optimal test environment for Big Data. This is mostly due to a lack of knowledge and understanding of Big Data testing, as the technology is still gaining traction in the industry. Big Data involves processing huge volumes of structured and unstructured data across different nodes using frameworks and languages such as MapReduce, Hive and Pig. A robust testing strategy needs to be defined well in advance to ensure that the functional and non-functional requirements are met and that the data conforms to an acceptable level of quality. In this blog we define recommended approaches for testing Hadoop-based applications.

Traditional software testing approaches on Hadoop are based on sample record sets, which is fine for unit testing. The challenge lies in validating an entire data set consisting of millions, or even billions, of records.

To test a Big Data analytics application successfully, the test strategy should include at least the following considerations.

Data Staging Validation

Data from various source systems such as RDBMS, social media, web logs, etc. should be validated to ensure that the correct data is pulled into the Hadoop system. Some of the high-level validations to be performed are:

  • Compare the source data with the data landed on the Hadoop system to ensure they match
  • Verify that the right data is extracted and loaded into the correct HDFS location

Some teams verify only sample sets of data using a sampling algorithm, because verification of the full data set is difficult to achieve. However, this approach may not uncover all data inconsistencies and may result in data quality issues within HDFS. It is therefore very important to include full data set validation in your test strategy. Tools such as Datameer, Talend or Informatica can be used to validate the staged data: import jobs should be created to pull the data into these tools from the source and staging systems, and the data should then be compared using the data analytics capabilities of these tools. The diagram below describes the overall approach for staging validation.

[Diagram: overall approach for data staging validation]
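As a lightweight complement to such tools, the same checks can be sketched directly in HiveQL once both the source extract and the staged data are visible to Hive. In the sketch below, src_customer_extract (an external table over a flat extract from the source RDBMS), stg_customer, customer_id and account_balance are hypothetical names used only for illustration.

    -- Reconcile the source extract with the data staged in HDFS.
    -- All table and column names below are illustrative assumptions.

    -- 1. Row counts on both sides must match.
    SELECT * FROM (
      SELECT 'source'  AS side, COUNT(*) AS row_cnt FROM src_customer_extract
      UNION ALL
      SELECT 'staging' AS side, COUNT(*) AS row_cnt FROM stg_customer
    ) counts;

    -- 2. Column-level checksums (a numeric total and the distinct key count)
    --    must also match.
    SELECT * FROM (
      SELECT 'source'  AS side,
             SUM(account_balance)        AS balance_sum,
             COUNT(DISTINCT customer_id) AS key_cnt
      FROM   src_customer_extract
      UNION ALL
      SELECT 'staging' AS side,
             SUM(account_balance)        AS balance_sum,
             COUNT(DISTINCT customer_id) AS key_cnt
      FROM   stg_customer
    ) checksums;

    -- 3. Full data set comparison on the key: any row returned here is a
    --    record that exists on one side only.
    SELECT s.customer_id AS source_key, t.customer_id AS staged_key
    FROM   src_customer_extract s
    FULL OUTER JOIN stg_customer t ON s.customer_id = t.customer_id
    WHERE  s.customer_id IS NULL OR t.customer_id IS NULL;

Row counts and checksums catch gross load failures quickly; the full outer join is the expensive check that actually delivers the full data set validation recommended above.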

Transformation or "MapReduce" Validation

This type of validation is similar to data warehouse testing, in which a tester verifies that the business rules have been applied to the data. In this case, however, the test approach differs slightly, as Hadoop data should also be tested for volume, variety and velocity.

Typical DWH testing involves gigabytes of data, whereas Hadoop testing involves petabytes. There is a well-established approach to testing a DWH using "sampling" techniques, but this cannot simply be carried over to a Hadoop application, because even sampled testing is challenging in a Hadoop framework: the sheer number of possible value combinations in such large volumes of data renders sampling ineffective as a validation approach.

DWH systems can only process structured data, whereas Hadoop systems can handle both structured and unstructured data with limited additional effort. This capability is already leading to new ways of exploring data, which in turn will increase the number of scenarios for Hadoop testing.
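To illustrate the point about unstructured data, a small HiveQL sketch: assuming web-log events land in HDFS as one JSON document per line (the path and field names below are purely illustrative), Hive can expose the raw text as a table and project structured columns out of it with its built-in get_json_object function.

    -- Expose raw, line-oriented JSON web logs as a Hive table.
    -- The HDFS path and field names are illustrative assumptions.
    CREATE EXTERNAL TABLE raw_weblogs (json_line STRING)
    LOCATION '/data/raw/weblogs';

    -- Project structured, query-able columns out of the raw JSON text.
    SELECT get_json_object(json_line, '$.userId')    AS user_id,
           get_json_object(json_line, '$.url')       AS url,
           get_json_object(json_line, '$.timestamp') AS event_ts
    FROM   raw_weblogs
    LIMIT  100;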

The key validations to be performed are:

  • Verification of the ETL logic implemented on the data
  • Verification of the data aggregation/segregation rules implemented on the data
  • Verification of the output data: validate that the processed data remains the same even when the jobs are executed in a distributed environment
  • Verification of the batch processes designed for data transformation

Hive is the most reliable language for performing this validation. Testers should write HiveQL (HQL) queries that replicate the data requirements and compare their results with the output produced by the development team's MapReduce jobs. If no discrepancies are found, the test script is considered passed.
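A minimal sketch of this approach is shown below, assuming a hypothetical business rule of "total transaction amount per customer per day". The table and column names (stg_transactions for the staged input, mr_daily_totals for the MapReduce output) are assumptions made for illustration.

    -- Re-derive the business rule in HiveQL and diff it against the table
    -- written by the development team's MapReduce job. All table and column
    -- names are illustrative assumptions.
    SELECT e.customer_id,
           e.txn_date,
           e.expected_total,
           a.total_amount AS actual_total
    FROM (
           SELECT customer_id,
                  txn_date,
                  SUM(amount) AS expected_total
           FROM   stg_transactions
           GROUP  BY customer_id, txn_date
         ) e
    FULL OUTER JOIN mr_daily_totals a
           ON  e.customer_id = a.customer_id
           AND e.txn_date    = a.txn_date
    WHERE  a.customer_id IS NULL                -- aggregate missing from the MR output
       OR  e.customer_id IS NULL                -- extra row produced by the MR job
       OR  e.expected_total <> a.total_amount;  -- value mismatch
    -- An empty result set means the MapReduce output matches the tester's
    -- independent HiveQL derivation; any returned row is a discrepancy.

Re-running the same comparison after executing the job on a multi-node cluster also covers the check that the processed data remains the same in a distributed environment.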

 

Data Warehouse Validation

This testing is performed after the data processed in the Hadoop environment has been loaded into the Enterprise Data Warehouse (EDW). The high-level scenarios to be tested include:

  • Verify that the processed data is moved correctly from HDFS to the EDW tables
  • Verify that the EDW data requirements are met
  • Verify that the data is aggregated as per the specified requirements

Data warehouse validation is similar to data staging validation: tools such as Datameer, Talend or Informatica can be used to validate the data loads from Hadoop into the traditional data warehouse.
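Where those tools are not available, a hedged HiveQL sketch of the same reconciliation is shown below. It assumes the EDW target table (or an extract of it) has been exposed to Hive as an external table named edw_daily_totals_ext, and it reuses the illustrative mr_daily_totals table from the previous section; all names are assumptions.

    -- Reconcile the Hive results table with the corresponding EDW target table.
    -- All table and column names are illustrative assumptions.

    -- Row counts on both sides must match.
    SELECT * FROM (
      SELECT 'hadoop' AS side, COUNT(*) AS row_cnt FROM mr_daily_totals
      UNION ALL
      SELECT 'edw'    AS side, COUNT(*) AS row_cnt FROM edw_daily_totals_ext
    ) counts;

    -- Aggregation check: per-date totals must agree between the two systems.
    SELECT h.txn_date,
           h.total_amount AS hadoop_total,
           w.total_amount AS edw_total
    FROM (
           SELECT txn_date, SUM(total_amount) AS total_amount
           FROM   mr_daily_totals
           GROUP  BY txn_date
         ) h
    JOIN (
           SELECT txn_date, SUM(total_amount) AS total_amount
           FROM   edw_daily_totals_ext
           GROUP  BY txn_date
         ) w
      ON h.txn_date = w.txn_date
    WHERE h.total_amount <> w.total_amount;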

 

Architecture Testing

As Hadoop involves processing large volumes of data, architecture testing is critical for the application to be successful. Poorly designed systems may suffer performance degradation and fail to meet the SLAs agreed with the business. At a minimum, performance and failover testing should be performed in a Hadoop environment.

Performance testing should be conducted with large volumes of data in an environment similar to production. Metrics such as job completion time, data throughput, memory utilization and similar system-level metrics should be captured and verified as part of performance testing.

Failover testing should be performed because a Hadoop cluster consists of a name node and several data nodes hosted on different machines. The objective of failover testing is to verify that data processing continues seamlessly when data nodes fail.

——————————————————-

About the Authors

Vinaykumar Chandrashekar works as a QA lead at Accenture. He has over 6 years of experience in testing data warehousing applications in the banking and healthcare domains. He has recently started designing test strategies to effectively test data warehouses built on the Hadoop framework for a Top-5 US Financial Institution. Vinay also has extensive experience in back-end automation testing approaches and strategies.

Vinyas Shrinivas Shetty works as a QA Engineer at Accenture, with over 3 years of development and testing background in C++/Python across industry domains. He has been providing input into the test strategy for a Top-5 US Financial Institution client transitioning to the Hadoop platform, drawing on his knowledge of software testing, development and banking data warehouse applications to contribute to the program strategy efforts.


2 Responses to Testing in a Big Data World – Software Testing With Big Data

  1. The biggest obsession with big data applications is testing. Manually testing a big data application is nearly impossible, hence many are moving ahead with automation. Though there aren’t many tools available at this time, sooner or later big data testing is going to be automated. Check what TestingWhiz has to say about big data testing – http://www.testing-whiz.com/get-ready-for-the-next-release-of-testingwhiz.
    This tool already supports Teradata, and Hadoop support is in the next release.

  2. shiva says:

    Hi Vinay – Can you provide your email ID? I have a couple of questions about the testing framework in Hadoop and would like to take them offline.
