
Challenges in Testing Data Applications

  • 18/09/2013
  • Posted by EuroSTAR

In their first blog post Jeff and Matthias introduced the concepts of testing data applications and discussed why testing data is difficult. You can read the blog post here.

In this article we are going to discuss some of the challenges in testing data applications – both traditional data warehouse applications that are often ETL/ELT driven, as well as some of the changes that new Big Data technologies add.


What do testers need to do, and what are some of the challenges we face?


• Defining Test Strategy and building appropriate test cases.

o I have found testing functional requirements for these applications to be challenging. Rarely are you testing just one thing at a time. We tend to process sets of data, so there are dozens, if not hundreds, of specific attributes per target data set, each with separate rules that need to be accounted for. While it would be easier to test individual things in isolation, execution times are often long, or the time required to set up and change input data to narrow down the tests is prohibitive.

o Also, there is rarely just one source dataset, or even a handful, that impacts a process. More often there are dozens, potentially hundreds, of lookups and incoming datasets that need to be manipulated.

o Granularity is difficult to track, as data is often normalized or denormalized between the source system and the target system, aggregations combine rows, and joins create additional records.
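One way to make the granularity problem visible is to count rows per business key on both sides of a step and report the keys where the counts differ. The following is only a minimal sketch: the function name `fan_out_keys`, the `order_id` key and the sample rows are hypothetical, and real datasets would normally be read from tables or files rather than from in-memory lists.

```python
from collections import Counter


def fan_out_keys(source_rows, target_rows, key):
    """Return {key value: (source count, target count)} where the counts differ."""
    source_counts = Counter(row[key] for row in source_rows)
    target_counts = Counter(row[key] for row in target_rows)
    return {
        value: (source_counts[value], target_counts[value])
        for value in source_counts.keys() | target_counts.keys()
        if source_counts[value] != target_counts[value]
    }


# Hypothetical example: a join against a lookup duplicates order 1.
orders = [{"order_id": 1}, {"order_id": 2}]
joined = [
    {"order_id": 1, "region": "EU"},
    {"order_id": 1, "region": "UK"},
    {"order_id": 2, "region": "US"},
]
granularity_issues = fan_out_keys(orders, joined, "order_id")
```

A non-empty result is not automatically a defect, since some steps are supposed to change granularity, but it tells the tester exactly which keys to look at.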

• Creating appropriate test data for specified test cases.

o Input data must be identified, and modified to provide an initial set of data that provides good coverage. Various test coverage techniques like Risk-Based Testing, Orthogonal Array analysis or dedicated data models have been introduced to the industry for this.

o Alternatively, if we use smaller, more focused sets of data that cover only a few specific rules, we need to maintain multiple copies of potential input data. These sets are then multiplied by the number of sources in a process.

o Master Data and system configuration need to match this input data.
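To illustrate the coverage idea, here is a small sketch that picks input rows so that every pair of attribute values appears together at least once. Note that this is a simple greedy pairwise selection, not an orthogonal array construction and not a risk-based method. The attribute names and values are hypothetical, and the approach assumes at least two attributes and a value space small enough to enumerate.

```python
from itertools import combinations, product


def covers(row, pair):
    """True if the row contains both (attribute, value) items of the pair."""
    (name_a, value_a), (name_b, value_b) = pair
    return row[name_a] == value_a and row[name_b] == value_b


def pairwise_rows(params):
    """Greedily choose rows until every value pair is covered at least once."""
    names = list(params)
    uncovered = {
        ((a, value_a), (b, value_b))
        for a, b in combinations(names, 2)
        for value_a, value_b in product(params[a], params[b])
    }
    # Candidate rows: the full cartesian product (fine for small value spaces).
    candidates = [dict(zip(names, values)) for values in product(*params.values())]
    rows = []
    while uncovered:
        best = max(candidates, key=lambda row: sum(covers(row, p) for p in uncovered))
        rows.append(best)
        uncovered = {p for p in uncovered if not covers(best, p)}
    return rows


# Hypothetical attributes of an input record.
params = {
    "country": ["US", "UK", "PL"],
    "product": ["life", "auto"],
    "channel": ["web", "agent"],
}
test_rows = pairwise_rows(params)
```

The selected rows are a starting point for an input file. Master data and configuration still have to be aligned with whatever values end up in them.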


• Setting up test data and cases to be executed.

o After we have set up good testing scenarios and test cases to prove out those scenarios, we need to actually put the data in place. This has proven to be time intensive and sometimes difficult, due not only to the number of physical data sets that need to be put in place, often on different servers, but also to the actual format required by the data. Either we need to create SQL scripts to insert or update input data in tables or, in the case of actual files, testers either need to be able to modify the files natively, or need some sort of program that reads the format the tester provides (for example, a CSV exported from Excel) and converts it into the format the application expects to read: for example an XML file, a binary data set, or a complex flat file that uses non-printable characters as delimiters or mixes fixed-length and variable-length attributes. This step may or may not require the help of the developers to format the data properly.

o There are a number of tools from a development, ETL or test perspective that support this process, some aimed at assisting the testers with creating data, and some focused on extracting, manipulating and providing test data from production copies.
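As a rough sketch of the kind of conversion program described above, the following reads CSV text and writes flat-file records that mix a fixed-length attribute with variable-length attributes separated by a non-printable delimiter. The layout (a 6-character `id`, then `name` and `city` separated by `\x01`) is purely an assumption for illustration. A real target format would come from the application's own specification.

```python
import csv
import io


def to_flat_record(row, fixed_widths, variable_fields, delimiter="\x01"):
    """Build one record: padded fixed-length fields, then delimited fields."""
    fixed = "".join(row[name].ljust(width)[:width] for name, width in fixed_widths)
    variable = delimiter.join(row[name] for name in variable_fields)
    return fixed + variable


def csv_to_flat(csv_text, fixed_widths, variable_fields):
    """Convert CSV text with a header row into a list of flat-file records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [to_flat_record(row, fixed_widths, variable_fields) for row in reader]


# Hypothetical tester-maintained input, e.g. exported from a spreadsheet.
csv_text = "id,name,city\n42,Ann,Leeds\n7,Bob,Gdansk\n"
records = csv_to_flat(csv_text, fixed_widths=[("id", 6)], variable_fields=["name", "city"])
```

Keeping the layout in parameters rather than in code means the same small script can serve several input files, which matters when a process has many sources.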


• Executing the test cases.

o After the environment has been prepared for a single test, then we need to actually execute the code.

o For Unit and Integration testing, this is often just a call to the scheduling script, which will then call all of the individual bits and pieces.

o For system testing though, this can often be much more difficult, as there are often many different commands that need to be kicked off, sometimes at the same time.

o The time of execution, particularly when running large volumes of data, can be prohibitive, and can require an entire night or more just to run through a single set of test cases. This leaves the tester with the difficult task of choosing between test coverage and efficiency / test cycle duration.
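For system tests where several commands have to be kicked off at the same time, a small driver can start them in parallel and collect the exit codes. This is only a sketch: the two commands below are stand-ins that print a message, and in practice they would be whatever scripts or job launchers the application actually uses.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_step(command):
    """Run one command and return (command, exit code, captured stdout)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return command, completed.returncode, completed.stdout.strip()


# Placeholder commands; real ones would start the actual load jobs.
commands = [
    [sys.executable, "-c", "print('load customers')"],
    [sys.executable, "-c", "print('load orders')"],
]

# One worker per command so that all of them start together.
with ThreadPoolExecutor(max_workers=len(commands)) as pool:
    results = list(pool.map(run_step, commands))

failed = [command for command, code, _ in results if code != 0]
```

Capturing the exit code and output per command also gives us something concrete to attach when a run fails.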

• Evaluating the results of the test execution.

o After we have completed running the tests, the real work kicks in, and we need to determine whether the test was actually successful or not.

o While ensuring that the test case actually completed is an important measurement, we also need to look at the detail.

o Again, there tend to be many different output datasets that need to be evaluated, and potentially hundreds to thousands of attributes that need to be measured.

o We will also run into similar issues with complex data formats that the output is stored in, that we encountered when setting the data up.

o We need to identify how we want to evaluate the data:

  • Do we have expected results prepared? Are we going to manually compare the results by eye, or do we have a process set up to programmatically compare and find the differences?
  • If we can’t set up expected data, do we have SQL scripts, or another language that we can execute business rules in to evaluate the computations?
  • Do these scripts work with the native formats of the data, or do we need to transform and manipulate the data so the comparison utilities work? Are we sure that we didn’t do something wrong in the transformation logic that creates a false positive? Do we need to test this transformation logic, and can we even replicate the actual application’s business logic (think capital markets algorithms, life insurance premium calculations, manufacturing production processes, embedded systems, etc.)?
  • If we’re developing scripts that compute similar business logic, we need to be wary of completely redeveloping the application a second time just to test it.
  • We potentially need to decide what percentage of the total output we want to check, and what level of risk we’re willing to take on by assuming that the data we didn’t look at has a similar level of quality as the data we did compare.

o If we’re doing unit or integration testing where we are dealing with intermediary datasets, the data is often split up across different files, potentially on different servers. This is a particular problem when testing map reduce, or other distributed big data applications.
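Where expected results are available, the comparison itself can be programmatic. The sketch below compares expected and actual rows by a key and reports missing keys, unexpected keys and attribute-level differences. It assumes the key is unique within each dataset and that both datasets fit in memory. The column names and values are hypothetical.

```python
def compare_datasets(expected, actual, key):
    """Compare two lists of row dicts by key and describe the differences."""
    expected_by_key = {row[key]: row for row in expected}
    actual_by_key = {row[key]: row for row in actual}

    mismatched = {}
    for value in expected_by_key.keys() & actual_by_key.keys():
        exp, act = expected_by_key[value], actual_by_key[value]
        diffs = {
            column: (exp.get(column), act.get(column))
            for column in exp.keys() | act.keys()
            if exp.get(column) != act.get(column)
        }
        if diffs:
            mismatched[value] = diffs

    return {
        "missing": sorted(expected_by_key.keys() - actual_by_key.keys()),
        "unexpected": sorted(actual_by_key.keys() - expected_by_key.keys()),
        "mismatched": mismatched,
    }


# Hypothetical expected and actual output rows.
expected = [{"id": 1, "premium": 100}, {"id": 2, "premium": 250}, {"id": 3, "premium": 75}]
actual = [{"id": 1, "premium": 100}, {"id": 2, "premium": 260}, {"id": 4, "premium": 10}]
report = compare_datasets(expected, actual, "id")
```

If only a percentage of the output is checked, the same function can be run on a sample of keys, with the accepted risk stated alongside the result.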

• Identifying Application Logic defects vs. Environment Stability issues.

o Once we have identified defects, we need to make sure that we have sufficient knowledge to be able to trace to where and hopefully why the error occurred.

o Is it because of faulty logic in the application, or is it caused by the environment that the application is running in? For example, are there other processes that manipulate or modify the data while we’re running, or is there possibly a race condition due to the execution order or the number of different technologies being used to process the data?
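One cheap check for the "did something else touch the data while we were running" question is to fingerprint the input files before and after the run. The sketch below uses a SHA-256 digest per file. The helper names and the temporary file are assumptions for illustration, and the check only applies to inputs that the process under test is not itself supposed to modify.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_with_input_check(input_paths, run):
    """Run the test step and return the inputs whose content changed meanwhile."""
    before = {path: file_digest(path) for path in input_paths}
    run()
    return [path for path in input_paths if file_digest(path) != before[path]]


# Demonstration with a throwaway file and a run that interferes with its input.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "input.csv")
    with open(path, "w") as handle:
        handle.write("id,amount\n1,10\n")

    def interfering_run():
        with open(path, "a") as handle:
            handle.write("2,20\n")

    changed = run_with_input_check([path], interfering_run)
    unchanged = run_with_input_check([path], lambda: None)
```

A changed input does not prove the application logic is correct, but it does tell us that the result of that run cannot be trusted as evidence either way.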

• Reporting defects.

o And finally we need to be able to report the defects in a manageable manner.

o Are log files easily accessible and known?

o How much of the data do we save, so that a developer can reproduce the error? This is particularly an issue where the input and output datasets can be hundreds of gigabytes or terabytes.
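When the datasets are too large to attach to a defect, one option is to save only the input rows that share a key with the failing output records. The sketch below assumes that such a shared key exists across the sources, which is often not the case after aggregations or key changes. The source names and rows are hypothetical.

```python
def reproduction_subset(sources, key, failing_keys):
    """Keep only the rows of each source whose key is among the failing keys."""
    wanted = set(failing_keys)
    return {
        name: [row for row in rows if row[key] in wanted]
        for name, rows in sources.items()
    }


# Hypothetical sources feeding the failing process.
sources = {
    "customers": [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}],
    "orders": [
        {"customer_id": 1, "amount": 10},
        {"customer_id": 2, "amount": 20},
        {"customer_id": 2, "amount": 35},
    ],
}
subset = reproduction_subset(sources, "customer_id", failing_keys=[2])
```

The subset should be rerun through the process before it is attached, to confirm that it still reproduces the defect on its own.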

That is a summary of some of the common issues that we face when testing data. I am sure that many of you can share more issues with us and help start a discussion of ways to manage them, so that we can do our job and provide valuable feedback on what is happening in the applications we test and which areas require improvement and fixing.


Compact Solutions was founded on a passion for cutting edge technology and the need for better access to corporate information assets. Formed in May 2002 and headquartered in Chicago, Compact Solutions has offices in four countries: United States, United Kingdom, Poland and India. Compact’s goal is to bring every customer Speed, Power and Profit from their information. We provide both development and testing services for data applications, as well as software products for Metadata Integration and Automated Testing through TestDrive.

Jeffrey Pascoe is the Director of Solutions Delivery Europe for Compact Solutions. With 15 years of experience, he has a proven track record in providing traditional consulting services, training and education. In particular, he has worked on a number of automated testing frameworks and software products. His particular fields of interest and passions lie in large data integration applications, meta-programming, data governance, and automated testing.
