How To Streamline Your Test Data Management

  • 12/02/2014
  • Posted by EuroSTAR

Test data is an area that’s rife with problems, ranging from the trivial to the terrifying. Some common problems include:

• The difficulty of creating a set of comprehensive test data. Sure, you could just make a copy of the entire production environment to be sure you’ve captured everything, but that’s usually cumbersome and painful. Batch runs take unnecessarily long when the test environment is stuffed with too much data.
• Testers need reliable test data that hasn’t already been hacked to pieces by other testers.
• Having multiple systems that share a common test environment is complicated. Different systems are usually not in sync, and test planning can get really complex quickly. So pretty much every system needs its own test data.

In this article, I’ll tell you about a process for streamlining test data management across multiple test environments. I’ll discuss the pros and cons of the process, and the roles you need to make the process as efficient as possible.

This article is based on the approach I developed for testing large administrative systems at an insurance company. My process isn’t plug-and-play and won’t fit every organization on earth, and it’s unlikely that you can apply the whole process as is. But key parts of it can be quite general and you can use them to further develop your own processes. For instance, you may have to adjust the process depending on how your systems integrate with one another, and whether you have multiple platforms. I’ve used the process in integration testing, system testing and acceptance testing. For component testing, however, we always used the developers’ local environment and developed test data in ways that this article doesn’t go into. That said, it should be possible for you to apply these principles to component testing too.

A process for establishing test data and test environments

The process consists of the following six steps:

  1. Generate a sample from the production environment
  2. Perform consistency checks
  3. Anonymize sensitive/personal information
  4. Create and prepare the central test environment
  5. Upload test data for each system
  6. Update the backup copy with any changes

Each step is described in more detail below.

1. Generate a sample from the production environment

The test manager or a designated test data manager carries out this step. For example, you might extract every tenth or twentieth person from the production environment, depending on the volume of data available (if you have too much data, select one record of twenty; if you have too little, take one out of ten). Then store the samples in a separate staging environment for further processing.

Even after you’ve pulled a sample, you’ll probably find that you need to complete some data by hand. Usually there are several groups that have special requirements of the test data, such as business specialists, testers and IT operations. It’s often the exceptions and “outliers” (such as people with unique insurance policies), or groups (such as all high-net-worth customers) that business managers and testers need to be sure you include in the test data. In addition, some test data should always be included: especially tricky situations, complex business logic, system components where errors have been a problem in the past, or test data associated with test cases based on previously identified bugs. Other test data may be needed only for a specific system and can be removed the next time new test data is created. The test data manager or a DBA (Database Administrator) runs queries against the production database to extract test data; if possible, you should save those queries so that you can run them again at a later date. You may also need to specify parts of the test data manually if they don’t come back via the queries.
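The sampling described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (a dict with an `id` field), the function name `sample_records` and the list of must-include ids are all hypothetical, and in practice the extraction would usually be done with saved database queries rather than in application code.

```python
def sample_records(records, step, must_include_ids=frozenset()):
    """Keep every `step`-th record, plus any record that was explicitly requested."""
    sample = []
    for index, record in enumerate(records):
        # Systematic sampling: positions 0, step, 2*step, ... are always kept.
        # Records requested by business specialists or testers are kept regardless.
        if index % step == 0 or record["id"] in must_include_ids:
            sample.append(record)
    return sample


# Hypothetical production extract of 100 customers.
production = [{"id": i, "name": f"customer-{i}"} for i in range(1, 101)]

# One record out of twenty, plus two outliers the testers asked for.
sample = sample_records(production, step=20, must_include_ids={7, 42})
print([record["id"] for record in sample])
```

Keeping the must-include ids in a saved list serves the same purpose as saving the queries: the next time you create test data, the special cases come along automatically.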

In a subsequent step, you need to scrub any sensitive information; this may even be necessary earlier in the process in order to exclude certain information, such as data about fellow employees, payroll information, medical history, etc.

2. Perform consistency checks

Next, you’ll want to run tests to verify that any and all connections have survived the sampling. Connections can tie together data within the same system or across adjacent systems. An example of an intrasystem connection is when your tests relate to a widow/widower collecting on the husband/wife’s life insurance policy; you have to make sure the deceased’s insurance is included in test data for the transaction to be tested correctly. You might find links to other systems in payment, storage or personal data, etc.
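A consistency check of this kind boils down to looking for references that point at records missing from the sample. The sketch below assumes a hypothetical layout in which each claim carries a `policy_id`; the field names and the function `find_broken_links` are made up for illustration.

```python
def find_broken_links(policies, claims):
    """Return the claims whose policy is not present in the sampled policies."""
    known_policy_ids = {policy["policy_id"] for policy in policies}
    return [claim for claim in claims if claim["policy_id"] not in known_policy_ids]


policies = [
    {"policy_id": "P1", "holder": "holder-1"},
    {"policy_id": "P2", "holder": "holder-2"},
]
claims = [
    {"claim_id": "C1", "policy_id": "P1"},  # policy made it into the sample
    {"claim_id": "C2", "policy_id": "P9"},  # policy fell outside the sample
]

broken = find_broken_links(policies, claims)
for claim in broken:
    print(f"Claim {claim['claim_id']} refers to missing policy {claim['policy_id']}")
```

Each broken link means you either add the missing record to the sample by hand or drop the record that refers to it.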

3. Anonymize sensitive/personal information

The next step is anonymization (also called de-identification). You’ll almost certainly encounter various views about de-identification, depending on how sensitive the information is. If you’re unsure, contact the company’s lawyers to verify how you need to go about de-identification.

The most important first step is to de-identify the name and address. A good idea is to change all individuals’ addresses so that they go to your company. If an error occurs and the system fires off a letter to one of the test subjects, you can rest assured that the message will come back to the company. You can also rename the test subjects to the company’s name followed by “test case”.
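As a rough sketch, this first de-identification pass could look like the following. The company name, the address and the record fields are placeholders I have made up for the example, not values from a real system.

```python
COMPANY_NAME = "Example Insurance Ltd"          # hypothetical company name
COMPANY_ADDRESS = "1 Example Street, Stockholm"  # hypothetical company address


def deidentify_person(person, sequence_number):
    """Return a copy of the record with name and address replaced."""
    masked = dict(person)  # copy, so the staging record itself is left untouched
    masked["name"] = f"{COMPANY_NAME} test case {sequence_number}"
    masked["address"] = COMPANY_ADDRESS
    return masked


# Made-up person record for illustration.
person = {"id": 1, "name": "Anna Svensson", "address": "Storgatan 1", "policy": "P1"}
masked = deidentify_person(person, sequence_number=1)
print(masked["name"])
print(masked["address"])
```

Numbering the test cases keeps the subjects distinguishable once they all share the company’s name and address.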

Further de-identification can make Social Security numbers, customer account numbers, and other data anonymous. Obviously the system will check the accuracy of the data, for example, that a Social Security number is valid, so you can’t always simply replace live information with nonsense data. In some cases, you have to use a formula to calculate information that the system will accept.
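Which formula you need depends entirely on the identifier and on the validation your system performs, so check that first. Purely as an illustration, the sketch below uses the Luhn checksum, a widely used check-digit scheme (payment card numbers use it, for example). Using Luhn here is my assumption, not something every system requires, and the ten-digit length and function names are hypothetical.

```python
import random


def luhn_check_digit(payload):
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk from right to left; the digit next to the check digit is doubled first.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_valid(number):
    """True if the last digit is the correct Luhn check digit for the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def make_fake_number(rng, length=10):
    """Generate a random number of `length` digits that passes the Luhn check."""
    payload = "".join(str(rng.randint(0, 9)) for _ in range(length - 1))
    return payload + str(luhn_check_digit(payload))


rng = random.Random(2014)  # seeded so the generated test data is reproducible
fake = make_fake_number(rng)
print(fake, is_valid(fake))
```

Seeding the generator means a rebuilt test data set gets the same fake numbers, which helps testers who have written those numbers into their test cases.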

One problem you may run into is testers being unable to find their test data. For example, if a tester includes a specific Social Security number in a test case to be able to test a special situation in the system, and you replace this number with a made-up one, she won’t be able to find the information in the test environment. In this case, you need a translation table: a two-column table that pairs each real Social Security number with its corresponding fake one.

Another problem that you may need to consider is whether external partners, such as tax authorities, are participating in the testing process. It may not be possible to require your partner to introduce the same de-identification system you’re using, so if they send a file with accurate Social Security numbers, you’ll need to de-identify the data according to the translation table.
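A translation table can be as simple as a dictionary from real number to fake number. The sketch below is a minimal illustration with made-up values: the `TEST-000001` placeholder format, the `ssn` field name and both function names are assumptions, and in a real setup the fake values would have to be numbers your system accepts as valid.

```python
import itertools


def build_translation_table(real_numbers):
    """Map each distinct real number to a unique placeholder number."""
    counter = itertools.count(1)
    table = {}
    for real in real_numbers:
        if real not in table:  # the same person always gets the same fake number
            table[real] = f"TEST-{next(counter):06d}"
    return table


def translate_file(rows, table, field="ssn"):
    """De-identify incoming rows; rows with unknown numbers are set aside."""
    translated, unknown = [], []
    for row in rows:
        fake = table.get(row[field])
        if fake is None:
            unknown.append(row)  # never let an untranslated real number through
        else:
            translated.append({**row, field: fake})
    return translated, unknown


table = build_translation_table(["111", "222", "111"])

# A tester who knows the real number looks up the fake one to find her test data.
print(table["222"])

# A file from an external partner arrives with real numbers.
partner_rows = [{"ssn": "222", "amount": 10}, {"ssn": "333", "amount": 5}]
translated, unknown = translate_file(partner_rows, table)
print(translated, unknown)
```

Because the table links real people to their test identities, it is itself sensitive and should be stored and shared as carefully as the production data.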

4. Create and prepare the central test environment

Now that the test data is ready, you can introduce it into the test environment. In our case, we uploaded the data to the environment we used for testing with other integrated systems (sometimes called the system test environment, system integration test environment, or interoperability test environment). To prepare the test environment, programs, files, etc., should be moved to the environment, either by the developers individually, or by a test environment manager.

During test planning, the test environment manager should already have determined the fictitious date to apply to the tests. The fictitious date is the date to which the system clock is set when the tests begin. You usually want test runs to cross into a new year, since there are often a lot of extra activities that kick in when the year changes. It’s useful to create a test suite that shows which test cases the team should execute.

For the system to get to this notional date, you may have to “roll forward” the test environment so that it’s right in the starting blocks for testing. You can do this by running batch routines that bring the system to the correct date and adjust the clock. When this is done, you take a backup of the whole environment including its test data. This backup will become the starting position for all test environments regardless of the system and test level.
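Conceptually, rolling forward is a loop that runs the daily batch once for each day until the environment reaches the fictitious date. The sketch below only illustrates that idea: `run_daily_batch` is a hypothetical stand-in for whatever batch routines your systems have, the dates are examples, and how you actually set the clock and take the backup depends on your platform.

```python
from datetime import date, timedelta


def roll_forward(current, target, run_daily_batch):
    """Run the daily batch for each day after `current` up to and including `target`."""
    if target < current:
        raise ValueError("the fictitious date must not be earlier than the current date")
    while current < target:
        current += timedelta(days=1)
        run_daily_batch(current)  # hypothetical hook for the real batch routines
    return current


processed = []  # here the "batch" just records the dates it was run for
reached = roll_forward(date(2014, 10, 29), date(2014, 11, 1), processed.append)
print(reached, len(processed))
```

Only when the loop has reached the target date do you take the backup that becomes the common starting position.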

5. Upload test data for each system

Now you can upload the backup to any one of the test environments to be used in any of the various systems. That way, any test will be based on the same test data and the same versions of the tested system. Links to all other systems will also be available so even if you’re testing at a level where you normally would just be testing the system, you’ll still uncover integration problems early. Since each system is reading a copy of the key backup, you minimize disruption of other tests. You should update the shared environment with new programs certified OK after the integration tests, which means that every system will have access to the latest software version.

6. Update the backup copy with any changes

If you make changes to the files, you’ll need to make a new central backup file. If needed, the respective systems can download this updated environment to their own test environment. You need to communicate to the developers working on every system that there’s a new backup, and a designated person needs to be responsible for downloading the backup into each environment.

Overview of the process

In my case, we had the same central database for all of our systems. This meant, for example, that the insurance system, payment system, and claims system all used customer data from that central environment. A graphic illustration of the connection looks like this:

Figure 1 – Overview of the relationships between environments.

Different situations where you can use this process

Of course, you can use this process in all kinds of everyday testing work, whether it’s in new development projects or ongoing maintenance. Additionally, you can use it for other more specific situations like the ones I’ll describe below.

Sometimes you need to have test data that reflects the complete production environment. That’s often the case when “parallel testing” involves comparing an old version of an application with new versions to see whether the end result is identical (or one version is superior). This approach is especially useful in the context of major changes to the system, and you can easily implement it in a separate environment for the sole purpose of parallel testing. You load the entire production extract into the testing environment, anonymize the data, and then run it with the old program(s). Then you run the same tests, but using the new programs, and compare to see whether they achieved the same results.
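The comparison at the end of a parallel test can be sketched as a diff of two result sets keyed by record. In this hypothetical example the results are plain dictionaries from a record id to a computed amount; real output would more likely be files or database tables, but the principle is the same.

```python
def compare_runs(old_results, new_results):
    """Return {record_id: (old_value, new_value)} for every record that differs.

    A value of None means the record is missing from that run.
    """
    differences = {}
    for key in sorted(old_results.keys() | new_results.keys()):
        old_value = old_results.get(key)
        new_value = new_results.get(key)
        if old_value != new_value:
            differences[key] = (old_value, new_value)
    return differences


# Made-up output from the old and the new program on the same test data.
old_run = {"A": 100, "B": 250, "C": 75}
new_run = {"A": 100, "B": 260, "D": 40}

for record_id, (old_value, new_value) in compare_runs(old_run, new_run).items():
    print(f"{record_id}: old={old_value} new={new_value}")
```

An empty result means the two versions agree on this data set; anything else is a list of records to investigate.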

You can use this process even when you don’t need more than a small set of test data. In some situations you might even be able to cut down test data sets to just 10 or 15 records, where you can manually download a few test subjects to a specific test environment. This approach is useful when you want to follow a few subjects and have deeper follow-up on the results. For example, you might want to follow the flow of an insurance claim through the system all the way down to the financial system and the general ledger (in what’s called “end to end testing”).

Advantages of this approach

Since you’ve created a backup that can be used by all systems in all test environments at all levels of testing, no one has to create their own test data set for each one of them. This saves a lot of time for the entire team, and also has the advantage that all systems have the same consistent set of test data.

This approach is also particularly effective if you conduct tests in test runs. Test runs involve conducting a test cycle with an arbitrary test date, for example over the course of four months from November to February. After you run the test cases for a test cycle and fix the bugs, you start from scratch again with the corrected software. You make a new backup and download it into the test environment. Then you start a new test run with the same test cycle, fictitious date, test data and test cases. The testers can use the same test data as before and don’t need to look up new test data. Since you have a higher quality dataset, you’ll have fewer defects that developers are unable to reproduce because they lack adequate test data, and retesting will also be more efficient.

Disadvantages of this approach

This process saves time once you’ve gotten it established, but it takes time to develop a workable process. Every time you go to create new test data, you need a certain amount of time that you have to include in the schedule. That said, the advantages clearly outweigh these disadvantages.

Required roles

The work is considerably easier if you have a specifically designated person in charge of the test data, i.e. a test data manager. A DBA (database administrator), developer, or test manager can take responsibility for seeking out and creating test data. It also helps if you have a designated test environment manager documenting the test environment requirements, setting up the testing environment, ensuring that the test environment works, and even helping administer testing tools. It’s possible for one person to play both roles, but if the workload is heavy, a specific individual should be assigned to each role separately.


An effective process for handling test data offers a lot of advantages. Test runs take less time, you maintain data integrity, and you automatically retain data for special situations. Establishing the process of creating test data demands an investment of time, but when it’s in place, you can use the same process over and over again. You need a test environment manager and a test data manager responsible for keeping everything up to date, and informing responsible parties about any changes.

About the author

Ulf Eriksson is one of the founders of ReQtest, an online bug tracking tool hand-built and developed in Sweden. ReQtest is the culmination of Ulf’s decades of work in development and testing. Ulf is a huge fan of Agile and counts himself as an early adopter of the philosophy, which he has abided by for a number of years in his professional life as well as in private.

Ulf’s goal is to make life easier for everyone involved in testing and requirements management, and he works towards this goal in his role of Product Owner at ReQtest, where he strives to make ReQtest easy and logical for anyone to use, regardless of their technical knowledge or lack thereof.

The author of a number of white papers and articles, mostly on the world of software testing, Ulf is also slaving over a book, which will be a compendium of his experiences in the industry. Ulf lives in Stockholm, Sweden.

Twitter – @ulf_reqtest
