Architect an Open Data Validation Technique to Get the Best Data Quality


No matter how creative you are, you can’t possibly proactively identify every possible data validation. You have to accept that some, perhaps even most, of the possible data problems will be found after the data has already been published.


That’s why data validation is an ongoing process, rather than just a one-time task. In a previous blog, we wrote about Involving Your Business Users in Your Data Validation Framework. This article describes an open data validation technique to involve your business users and drive your data quality problems close to zero.

First Rule: Make It Easy

How do you build a data validation system suited to different levels of data literacy, and motivate people to use it on an ongoing basis? Make it easy. And make it SQL.

SQL is a universal language for accessing data. Almost every developer, QA, data analyst, data scientist, or business user knows at least some basic SQL. So start by building your validation system around SQL.

Also, architect an open strategy. Assume that new validations may be needed on a daily or weekly basis, and architect your validation logic around that assumption. To encourage people to actually perform this constant upkeep, the process has to be quick and simple, so that it does not interfere with a stakeholder’s daily work or stream of thought.

A 5-Step Open Data Validation Technique

  1. Start by getting all your data into a relational database that speaks SQL. Use something big and fast – like Snowflake or RedShift. But get all your data into a SQL-based database.
  2. Make sure your data is stable and segmented. If your data is already batched, then assign it a batch number so that you can query it distinctly. If your data is streaming, then assign a batch number anyway, so every hour or every day represents a separate batch.
  3. Create validation checks using SQL queries. Almost every validation you need to perform can be represented using a SQL query. If possible, structure each SQL query so that data is only returned if the validation has failed. For example, suppose that every day, a data source sends you the list of the 50 U.S. states: SELECT COUNT(*) AS state_count
    FROM state_list
    HAVING ((state_count < 1) OR (state_count > 50))This way you have a query that will only return a row if the data has FAILED its validation. Capture the query results to a separate table, so you have a history of when exactly the data started failing. Then you can write triggers alerts and notifications to let people know when a validation failure has occurred.
  4. Now, this step is important: Store the validation SQL queries directly in the database itself. Create a table called VALIDATION_SQL_QUERIES, and fill it up with all the validations that need to run on every batch of data that arrives. Don’t put those validation SQL queries in your code, or in a configuration file, or check them into a GitHub repository somewhere. That makes them inaccessible to everybody except the developers. (And then your developers are stuck maintaining them and updating them.)We want the validation queries to be easily accessible to anyone who knows basic SQL. And those same people can then create them and update them, again using only SQL.
  5. Finally, write up instructions on how to add a validation query to the database, and publish those instructions to all of the data users: the data analysts, the operations people, the data scientists, the business users, the data stewards – everyone. Every one of those roles should know how to use SQL. They can use SQL to create a new validation query. And then they can use SQL to insert that new query into the database. Then they can use SQL to insert any alerts or notifications that might be associated with that validation. Could you create a nice little front-end that allows users to manage their validation SQL queries via a web page? Check that their INSERTs and UPDATEs are correct before you allow them to be executed against your database. Yes. That would be great. But you don’t have to. You could just write some SQL validations that check the content and format of the table that holds your SQL validations. (Using the validation functionality recursively to validate itself.) Either way is fine.

The key takeaway is that developers do NOT have a monopoly on finding data problems. Once the system goes live, 99% of the data problems are going to be found by other roles. The key thing is to make your data validation technique fast and easy for those other roles to create and manage their own validations. The faster and easier it is, the more validations they will create, and that is a win-win for everyone.

Note: For extra credit, you can reuse this same validation infrastructure to execute and track your metrics KPIs on an hourly or daily basis. You can even add validations against your results table to detect and automatically notify the stakeholders when the KPIs are out of bounds.

CoStrategix helps companies solve business problems with digital and data solutions. We help bridge the gap between business and technology teams in applying data science to drive better decision-making. If you are struggling with how to apply data science to your business challenge, feel free to reach out to us.