Validate, Validate, and Validate Data. But, in terms of what?

2022-02-20

This article is part of the series: Productizing Data with People

When it comes to modern machine learning and data analytics applications, I cannot stress the importance of data validation enough. However, what actually defines the validity of our data is rarely discussed.

Most importantly, accuracy, the first thing many practitioners think of, is just a single aspect of the problem, and I strongly believe privacy, security, and ethics measurements must be treated on an equal footing with accuracy metrics. Is our job done once we confirm a statistically significant increase in recall/precision and/or certain business metrics? No, absolutely not. On top of that, we (as modern data-driven developers) must be more conscious of the individual data points we are interacting with, as I discussed in Data Ethics with Lineage.

That is, developers need to implement a way to ensure that the data is truly "good" to use. For instance, if the data contains PII, your machine learning model may well show better performance, but from a privacy standpoint the model must not be deployed. Or, when the data is highly skewed toward a certain population (e.g., by gender, country, religion), the prediction results will inevitably be biased. These situations need to be handled carefully by a proper mechanism embedded in the data pipeline.
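As a rough illustration of the two situations above, here is a minimal sketch of a PII check and a population-skew check in plain Python. Everything in it is an assumption made for illustration: the function names, the email-only regular expression (real PII detection covers far more than email addresses), the sample rows, and the 0.6 threshold are not taken from any particular tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple pattern: it only catches email-like
# strings. Real PII detection needs to cover many more categories.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_pii_columns(rows):
    """Return sorted names of columns where any string value looks like an email."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and EMAIL_PATTERN.search(value):
                flagged.add(column)
    return sorted(flagged)


def max_group_share(rows, attribute):
    """Return (group, share) for the most frequent value of `attribute`."""
    counts = Counter(row[attribute] for row in rows)
    group, count = counts.most_common(1)[0]
    return group, count / len(rows)


# Made-up sample data for illustration only.
rows = [
    {"user": "a@example.com", "country": "JP", "clicks": 3},
    {"user": "b@example.com", "country": "JP", "clicks": 5},
    {"user": "c@example.com", "country": "JP", "clicks": 1},
    {"user": "d@example.com", "country": "CA", "clicks": 2},
]

pii_columns = find_pii_columns(rows)
dominant_group, share = max_group_share(rows, "country")

MAX_SHARE = 0.6  # assumed threshold; the right value depends on the use case
is_skewed = share > MAX_SHARE

print(pii_columns, dominant_group, share, is_skewed)
```

Neither check says anything about model accuracy, which is the point: a dataset could pass every accuracy-oriented test and still fail both of these.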

Lack of non-accuracy aspects of data validation in the TensorFlow Extended (TFX) paper. I was recently reviewing Google's classic TFX paper (2017). The idea behind the data validation mechanism of TFX is schema matching; by validating the data schema, the system prevents users from using or publishing corrupted data they did not expect. It contributes to maintaining the accuracy of the dataset itself, as well as the accuracy of downstream machine learning models. It should be noted that, even though the paper spends a good number of paragraphs defining a "good" model and how to judge model goodness, that discussion is not extended to the data itself. As far as I have read, such consideration is missing even from its follow-up paper Data Validation for Machine Learning (2019).
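To make the idea of schema matching concrete, the following is a minimal sketch in plain Python. It is not the TFX or TensorFlow Data Validation API; the schema format, the `schema_violations` helper, and the sample batch are all hypothetical and only illustrate the general technique of comparing incoming rows against an expected schema.

```python
# Hypothetical expected schema: column name -> expected Python type.
EXPECTED_SCHEMA = {
    "user_id": int,
    "country": str,
    "clicks": int,
}


def schema_violations(rows, schema):
    """Compare each row against the schema and describe every mismatch."""
    violations = []
    for index, row in enumerate(rows):
        # Columns the schema expects but the row lacks, and vice versa.
        for column in sorted(schema.keys() - row.keys()):
            violations.append(f"row {index}: missing column '{column}'")
        for column in sorted(row.keys() - schema.keys()):
            violations.append(f"row {index}: unexpected column '{column}'")
        # Type check for columns that are present.
        for column, expected_type in schema.items():
            if column in row and not isinstance(row[column], expected_type):
                violations.append(
                    f"row {index}: '{column}' should be {expected_type.__name__}, "
                    f"got {type(row[column]).__name__}"
                )
    return violations


# Made-up batch: row 1 has a wrongly typed value, row 2 lacks a column
# and carries one the schema does not know about.
batch = [
    {"user_id": 1, "country": "JP", "clicks": 3},
    {"user_id": 2, "country": "CA", "clicks": "7"},
    {"user_id": 3, "clicks": 1, "email": "x@example.com"},
]

violations = schema_violations(batch, EXPECTED_SCHEMA)
for message in violations:
    print(message)
```

Note what this kind of check does and does not catch: it flags the unexpected `email` column only because the column is absent from the schema, not because it recognizes the value as personal information.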

Emerging trend in data observability and quality. I might be biased by the topics I regularly follow, but it can hardly be a coincidence that, over the last couple of months, I have repeatedly heard the names of enterprise vendors tackling the same underlying data validation issues. These include ones I heard about on the Software Engineering Daily podcast, e.g., Monte Carlo, Trifacta, and Anomalo. I'm not working in a B2B data business anymore, so I have zero opportunity to work with these third parties in my day-to-day job, yet they show up on my radar on a regular basis. Thus, there seems to be a trend in the data observability and quality domain. Personally, it is really nice to see active discussions about "how data itself should be"; a few years after Google gently opened up the domain of data validation in academia, practitioners have finally started thinking about what happens somewhere between ETL and modeling work.

Towards "zero-trust" data pipeline. That said, the domain is still immature as far as I can see, and the measures users can validate are limited to basic ones, such as traditional accuracy metrics, missing values, and PII detection. You might think security-, privacy-, and ethics-related treatments (e.g., data anonymization, normalization, amplification) should be done by the upstream jobs in charge of ETL and data wrangling, but relying on them is not good practice unless you have full control and ownership of those upstream jobs; any part of the pipeline could suddenly break for unforeseen reasons, and the issues don't always surface as a clear deterioration of accuracy metrics. Therefore, I would rather suggest building a data pipeline in a zero-trust manner. Having a solid definition of "valid" data and ensuring that validity within your own scope are crucial to preventing undesired consequences, and non-accuracy aspects must be properly taken into account to keep product quality and reliability high.
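One way to picture a zero-trust stage boundary is a small gate that re-runs its own checks on every incoming batch, regardless of what upstream jobs claim to have done. The sketch below is only an assumption of how that might look: `DataValidationError`, `validate_or_raise`, the two checks, and the `user_id`/`user_key` column names are hypothetical, and it presumes upstream was supposed to replace raw user IDs with an anonymized key.

```python
class DataValidationError(Exception):
    """Raised when a batch fails any check at a stage boundary."""


def no_missing_values(rows):
    """True if no row contains a None value."""
    return all(value is not None for row in rows for value in row.values())


def no_raw_user_ids(rows):
    """True if no row still carries a raw 'user_id' column.

    Assumes upstream was supposed to replace it with an anonymized key.
    """
    return all("user_id" not in row for row in rows)


def validate_or_raise(rows, checks):
    """Run every check; raise with the names of all that failed."""
    failed = [check.__name__ for check in checks if not check(rows)]
    if failed:
        raise DataValidationError(f"failed checks: {', '.join(failed)}")
    return rows


# The definition of "valid" for this stage, owned by this stage.
CHECKS = [no_missing_values, no_raw_user_ids]

# Made-up batches for illustration.
clean_batch = [{"user_key": "9f2c", "clicks": 3}]
leaky_batch = [{"user_key": "9f2c", "user_id": 42, "clicks": None}]

accepted = validate_or_raise(clean_batch, CHECKS)

try:
    validate_or_raise(leaky_batch, CHECKS)
    rejection = None
except DataValidationError as error:
    rejection = str(error)

print(accepted, rejection)
```

The design choice here is that the consuming stage owns the list of checks, so an upstream anonymization step that silently breaks is caught at the boundary rather than discovered after a model has shipped.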

Author: Takuya Kitazawa

Takuya Kitazawa is a freelance software developer, previously working at a Big Tech and Silicon Valley-based start-up company where he wore multiple hats as a full-stack software developer, machine learning engineer, data scientist, and product manager. At the intersection of technological and social aspects of data-driven applications, he is passionate about promoting the ethical use of information technologies through his mentoring, business consultation, and public engagement activities. See CV for more information, or contact at [email protected].

Disclaimer

Opinions are my own and do not represent the views of the organizations I belong to or have belonged to.
I am doing my best to ensure the accuracy and fair use of the information. However, there might be some errors, outdated information, or biased subjective statements because the main purpose of this blog is to jot down my personal thoughts as soon as possible before conducting an extensive investigation. Visitors understand the limitations and rely on any information at their own risk.
That said, if there is any issue with the content, please contact me so I can take the necessary action.