Want a Data Validation Framework that Works? Involve Your Business People.


What’s the cost of bad data quality? Gartner estimates that the average organization loses $15 million per year to bad data. IBM puts the cost to U.S. companies at $3 trillion per year (or 15% of GDP). These tallies account for inaccurate analysis, poor decision-making, damaged customer relationships, and more.


But set those numbers aside for a moment. The real cost of bad data is that your business users won’t use data they don’t trust. Failing to take advantage of your data – your most valuable non-human business asset – means lost productivity and lost opportunities.

What is one of the best ways to improve the quality of your data? Use a data validation framework that involves your business users.

Why Data Accuracy is So Difficult

We all know that data accuracy is an ongoing challenge. The trouble faced by many data stewards is trying to imagine all the myriad things that could possibly go wrong in their data, and then finding ways to guard against them (via monitoring, auditing, validation, etc.).

But with modern data stores, the data may come from multiple sources, and every column has its own range of possible values, so the number of possible combinations becomes astronomically high. The number of invalid combinations is astronomically high as well.
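To get a feel for the scale, here is a back-of-the-envelope sketch in Python. The column names and distinct-value counts are invented purely for illustration; substitute the profile of your own table.

```python
import math

# Hypothetical table profile: column name -> number of distinct values.
# These figures are made up for illustration only.
distinct_values = {
    "country": 200,
    "product_category": 50,
    "customer_segment": 10,
    "order_status": 8,
    "payment_method": 6,
    "sales_channel": 5,
}

# Each row picks one value per column, so the possibilities multiply.
total_combinations = math.prod(distinct_values.values())

print(f"{total_combinations:,} possible combinations")
```

Under these assumptions, six low-cardinality columns already allow 24,000,000 combinations. Add a numeric or free-text column and the count grows far beyond what anyone could enumerate by hand.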

In short, you can’t proactively identify everything that could go wrong. And even if you could, your victory would be short-lived. Why? Because the data is constantly changing. New data is arriving: new types of data, new patterns of data. You may have heard the variation of Murphy’s Law that states: “If you make something idiot-proof, someone will just make a better idiot.” This is especially true in a dynamic environment like data.

Your objective, then, is to design a data validation framework that accomplishes two separate goals:

  • Scales up to handle the large number of validations needed
  • Continuously adapts to ongoing changes in the data itself

A Powerful Data Validation Framework: Crowd-Sourcing

One solution that addresses both of these goals is crowd-sourcing: harnessing the knowledge of all of your data stakeholders and allowing them to add and improve your data validations on an ongoing basis. Your stakeholders include everyone who touches the published data and thus has the potential to find problems, including:

  • Data analysts
  • Data stewards
  • Data scientists
  • Developers and QA people
  • Business users (perhaps most importantly)

Your business users know your data the best. They are closest to the data, and they will often be the first to find any data problems or corruption that may exist.

You might wonder: If my business users find problems in the data, won’t that break their trust? Perhaps. But consider this: Your business users will likely find problems in the data anyway. That risk is largely unavoidable. Their sense of trust will hinge primarily on how you react to the problems once they are found.

Use Different Approaches for Different Types of Business Users

Involving the business users in both the discovery and the resolution of data problems may be your best approach to establishing a sense of trust. In essence, you want to make the business users stakeholders in the data, as opposed to just consumers.

Harnessing the capabilities of ALL the data stakeholders, including your business users, is the goal of a crowd-sourced solution. Business users, while not usually in a technical role, will generally fall into one of three categories:

  • Data Natives – People who will not just find a data problem, but also identify the missing validation and add that validation themselves.
  • Data Immigrants – People who can find a problem and identify the missing validation, but are perhaps not confident enough to actually add the validation into the system.
  • Data Refugees – People who can find a data problem but are not fluent enough to identify a generalized validation, or to add that validation to the system.

You should assume that you will encounter all three categories of business users on your team. You want to fashion a data validation framework that supports all three levels of data literacy so that you can take advantage of all of their capabilities.

For the Data Refugees, you will need a communication and tracking mechanism that allows these users to submit requests and problems. Your data stewards and data engineers can work the submissions and report the results back to the business users via that same mechanism. Tools like the open-source Bugzilla or Atlassian’s Jira can be used for issue tracking and workflow management.

For the Data Natives and Data Immigrants, you will want to offer more self-service options, both for scalability and to reap the benefits of their expertise. You want to encourage and support their participation in the data management processes, and especially the data validation processes. One approach that allows for the direct involvement of these subject matter experts is to create an open validation architecture.
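One way to picture this kind of self-service validation is a shared registry: anyone can contribute a named rule, and a runner applies every registered rule to each record. The Python sketch below is a minimal illustration under that assumption, not a description of any particular product. The names `ValidationRegistry`, `register`, and `run`, along with the sample rules and records, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Rule = Callable[[Record], bool]  # returns True when the record is valid


@dataclass
class ValidationRegistry:
    """Hypothetical shared store of validations contributed by stakeholders."""

    rules: Dict[str, Tuple[str, Rule]] = field(default_factory=dict)

    def register(self, name: str, author: str, rule: Rule) -> None:
        # Keep the author so questions can go back to the subject matter expert.
        self.rules[name] = (author, rule)

    def run(self, records: List[Record]) -> List[Tuple[int, str, str]]:
        """Return (row index, rule name, author) for every failed check."""
        failures = []
        for i, record in enumerate(records):
            for name, (author, rule) in self.rules.items():
                if not rule(record):
                    failures.append((i, name, author))
        return failures


registry = ValidationRegistry()

# A business user contributes a rule drawn from domain knowledge.
registry.register(
    "discount_not_above_price",
    author="sales_ops",
    rule=lambda r: r["discount"] <= r["price"],
)
# A data steward adds a structural check.
registry.register(
    "order_id_present",
    author="data_steward",
    rule=lambda r: bool(r.get("order_id")),
)

# Made-up sample data: the second record violates both rules.
records = [
    {"order_id": "A1", "price": 100.0, "discount": 10.0},
    {"order_id": "", "price": 50.0, "discount": 75.0},
]

failures = registry.run(records)
for row, name, author in failures:
    print(f"row {row} failed {name} (contributed by {author})")
```

Because each rule is just a named function, new validations can be added as the data changes without touching the runner. Recording the contributor keeps the person who understands the rule attached to it.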

In sum, the people who are closest to the data are in the best position to help you drive your data quality problems to zero. Involving those people to help find the problems, and also enabling them to contribute data validations to avoid problems in the future, can improve the speed with which problems are found and fixed. And that responsiveness will both improve your data quality and build trust with your business users.
