2022-01-30

## Data Ethics with Lineage

Nowadays, there is almost no question that data is the new oil. But do we really have a clear picture of where the oil comes from, by which route, by whom, how, and when?

In practice, the massive, continuous flow of this oil has forced us to radically automate and abstract complex data pipelines. As a result, we are now able to "forget" about data sources and intermediates, for better and for worse. I strongly believe that we, as modern data-driven developers, must be more conscious of the individual data points we interact with, and I see data lineage as a means of ethical product development in this regard.

To give an example, if you own a complex data pipeline on Apache Airflow, the tool will nicely automate your day-to-day ETL processes and hide the deeper context of the data.

On the one hand, this frees developers to focus purely on the essential problem, which is commonly tied to the final destination of the data flow. That is, abstraction and automation improve developer productivity.

On the negative side, however, such a modern data-driven system erodes developers' consciousness, awareness, and understanding of the details of the data.

When we see a dataset in a Jupyter notebook, we shouldn't forget that there is a real world upstream. The oil is not a toy for unconscious software engineers and data scientists, and it's not a tool for capitalistic competition driven by extrinsic motivation. Rather, it is highly sensitive and precious information depicting everyone's beautiful life. Thus, I personally want to be a person who doesn't overlook the hidden aspects of the data in front of me; I should literally appreciate the individual rows in a pandas DataFrame.

As I mentioned at the top, this is where data lineage can help. Generally speaking, by tracking the source, destination, and metadata of every single step in a pipeline, data lineage helps developers reduce the chance of errors, debug issues more easily, and accurately audit what's going on behind the scenes.
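To make the auditing benefit concrete, here is a minimal sketch that is independent of any particular tool. It assumes each step's inputs and outputs have already been recorded, and it walks backwards from a final output to its original sources. The step names, file paths, and the `trace_sources` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage records: (step name, input datasets, output datasets).
LINEAGE = [
    ("extract", ["s3://raw/users.csv"], ["/tmp/users_clean.parquet"]),
    ("join", ["/tmp/users_clean.parquet", "s3://raw/orders.csv"], ["/tmp/user_orders.parquet"]),
    ("aggregate", ["/tmp/user_orders.parquet"], ["/tmp/report.csv"]),
]


def trace_sources(target, lineage):
    """Return the original (root) datasets that a target dataset derives from."""
    # Map every output dataset to the inputs it was produced from.
    parents = defaultdict(set)
    for _step, inputs, outputs in lineage:
        for out in outputs:
            parents[out].update(inputs)

    sources, stack, seen = set(), [target], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in parents:
            # Keep walking upstream through the recorded producers.
            stack.extend(parents[node])
        else:
            # No recorded producer: treat this as an original source.
            sources.add(node)
    return sources


print(sorted(trace_sources("/tmp/report.csv", LINEAGE)))
```

With records like these, the question "which raw files fed this report?" becomes a lookup rather than an archaeology project.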

Every step in a pipeline emits "what's incoming from where, and what's outgoing to where" information to a lineage backend, and the system eventually forwards that information to dedicated service(s) where we can train our consciousness about the data.
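As a rough, tool-agnostic illustration of that emission step, the sketch below wraps a unit of work so that it reports what it read and wrote once it finishes. `LineageEvent`, `InMemoryBackend`, and `run_step` are hypothetical names I made up for this post, not part of any library, and the paths are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One "what came in from where, what went out to where" record."""

    step: str
    inlets: list
    outlets: list
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class InMemoryBackend:
    """Stand-in for a real lineage backend; it only collects events."""

    def __init__(self):
        self.events = []

    def send(self, event):
        self.events.append(event)


def run_step(name, inlets, outlets, backend, work=lambda: None):
    """Run a unit of work, then report what it read and wrote."""
    work()
    backend.send(LineageEvent(step=name, inlets=list(inlets), outlets=list(outlets)))


backend = InMemoryBackend()
run_step("split_by_category", ["/tmp/whole_directory/"], ["/tmp/a", "/tmp/b", "/tmp/c"], backend)
for event in backend.events:
    print(f"{event.step}: {event.inlets} -> {event.outlets}")
```

A real backend would forward each event to an external service instead of keeping it in a list, but the shape of the information stays the same.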

In the case of Apache Airflow, there is an experimental feature we could use. Assume a BashOperator task takes one file (1 inlet) and generates three resulting files (3 outlets) as follows:

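```python
from airflow.lineage.entities import File
from airflow.operators.bash import BashOperator

# FILE_CATEGORIES (a list of three category names) and dag are assumed to be
# defined earlier in the same DAG file.
f_in = File(url="/tmp/whole_directory/")
outlets = []
for file in FILE_CATEGORIES:
    # Quadruple braces survive str.format() as a Jinja template: {{ data_interval_start }}
    f_out = File(url="/tmp/{}/{{{{ data_interval_start }}}}".format(file))
    outlets.append(f_out)

run_this = BashOperator(
    task_id="run_me_first", bash_command="echo 1", dag=dag, inlets=f_in, outlets=outlets
)
```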


Once the task is executed, its lineage becomes accessible through XCom.

Meanwhile, a configured lineage backend separately processes the information:

```python
import requests
from airflow.lineage.backend import LineageBackend


class ExampleBackend(LineageBackend):
    def send_lineage(self, operator, inlets=None, outlets=None, context=None):
        url = "https://webhook.site/8fe779dc-10ae-4917-8f41-685723a31064"
        data = {
            "operator": operator.__class__.__name__,
            # inlets/outlets default to None, so fall back to an empty list.
            "inlets": [str(inlet) for inlet in inlets or []],
            "outlets": [str(outlet) for outlet in outlets or []],
            "context": str(context),
        }
        # A timeout keeps a slow webhook from blocking the task indefinitely.
        requests.post(url, json=data, timeout=10)
        operator.log.info(f"Sent lineage to {url}: {data}")
```


* See takuti-sandbox/airflow-test for the complete example.

Although this is a simple example in which a webhook receives the plain data, the downstream service can be anything, e.g., an RDB, a monitoring tool, a notification system, or a Slack channel.
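For instance, if the downstream service were a relational database, the receiving side could look roughly like the following. This is only a sketch using Python's built-in `sqlite3` with an in-memory database; the table name, column names, helper functions, and sample paths are assumptions for illustration and are not taken from the example repository.

```python
import json
import sqlite3


def store_lineage(conn, operator_name, inlets, outlets):
    """Persist one lineage record; table and column names are illustrative."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lineage ("
        "id INTEGER PRIMARY KEY, operator TEXT, inlets TEXT, outlets TEXT)"
    )
    conn.execute(
        "INSERT INTO lineage (operator, inlets, outlets) VALUES (?, ?, ?)",
        (operator_name, json.dumps(inlets), json.dumps(outlets)),
    )
    conn.commit()


def outlets_written_by(conn, operator_name):
    """List every outlet recorded for a given operator, oldest record first."""
    rows = conn.execute(
        "SELECT outlets FROM lineage WHERE operator = ? ORDER BY id",
        (operator_name,),
    ).fetchall()
    return [path for (outlets,) in rows for path in json.loads(outlets)]


conn = sqlite3.connect(":memory:")
store_lineage(conn, "BashOperator", ["/tmp/whole_directory/"], ["/tmp/CAT1/x", "/tmp/CAT2/x"])
print(outlets_written_by(conn, "BashOperator"))
```

Once the records live in a queryable store, questions like "which task wrote this file?" can be answered long after the pipeline run itself has been forgotten.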

The point here is how easily we can take our first step toward conscious data engineering. Implementing a mechanism that forces us to be aware of not only the final outputs but also the original sources and intermediates would be critical to growing data ethics at large.

Last updated: 2022-09-02

#### Author: Takuya Kitazawa

Takuya Kitazawa is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. While my area of specialty is in data & AI ethics and machine learning productization, I have worked full-stack throughout my career, e.g., as a frontend/backend engineer, OSS developer, technical evangelist, solution architect, data scientist, and product manager. You can find what I am doing lately at my "now" page, and your inquiry is always welcome at [email protected], including comments on my blog posts.

Opinions are my own and do not represent the views of organizations I belong or belonged to.
