It "Was" Ethical: Key Takeaways from UMich's Data Science Ethics Course

March 20, 2022 | 3 min read | 674 words

This article is part of the series: Productizing Data with People

One of the most important takeaways from UMich's "Data Science Ethics" on Coursera is that ethics is defined by social consensus.

"What's good" changes

First of all, since the definition of "right" and "wrong" changes as time goes by and technology advances, we as data science practitioners should keep asking ourselves "Is this socially acceptable?" throughout the data-driven product development lifecycle. In particular, our deliverables must balance individual values against public benefit and meet society's shared expectations.

To give an example, we commonly see cameras at supermarkets or shopping malls nowadays, and many of us do not feel that these devices threaten our privacy; there is social consensus that the cameras are "good" for security reasons and do contribute to making public places safer. However, if the videos or images captured by the cameras are used for unintended purposes, it becomes questionable whether the practice is ethical or not.

Thus, the boundary between good and bad tends to be fuzzy, and social acceptance criteria depend largely on context when it comes to technology- and data-driven solutions. Even if a behavior was ethical 10 years ago, the same action may be considered unethical by today's society. This is how ethics differs from religion, law, policy, and regulation, which are typically more stable and don't fluctuate as much. These fixed criteria, though, rely on the ethical behaviors defined by society at the time of their original publication; normally, society defines ethics, and regulation follows (after a long time).

Ethical data-driven work

Meanwhile, even though the ultimate societal impact is unpredictable, there are several good practices data scientists can follow to minimize the risk of doing something unethical. For instance, when we collect data, we should check whether the sample is statistically representative of the population, i.e., that no attribute is badly imbalanced. Otherwise, it is "easy" to make algorithms racially biased, because a model fits to (is biased by) the majority; minority samples are easily suppressed in the resulting predictor unless we deliberately account for them.
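As a rough illustration of the representativeness check described above, here is a minimal sketch that compares each group's share in a sample with its share in the population. The function names, the 5-point tolerance, and the toy data are all my own assumptions for illustration; they do not come from the course, and a real project would need a proper statistical test and domain-specific group definitions.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return (sample share - population share) for each known group.

    sample_groups: one group label per sampled record (hypothetical input).
    population_shares: dict of group label -> expected share in the population.
    """
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in population_shares.items()
    }


def underrepresented(sample_groups, population_shares, tolerance=0.05):
    """List groups whose sample share falls short of the population share
    by more than `tolerance` (an arbitrary threshold chosen for this sketch)."""
    gaps = representation_gaps(sample_groups, population_shares)
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)


# Made-up example: group "B" is 30% of the population but only 10% of the sample.
sample = ["A"] * 90 + ["B"] * 10
population = {"A": 0.7, "B": 0.3}
flagged = underrepresented(sample, population)
print(flagged)
```

A flagged group is only a prompt for a closer look, e.g., collecting more data or re-weighting; the check by itself does not make a dataset fair.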

Moreover, it is crucial to validate both the data and the model against proper measurements: validate, validate, and validate. But validate in terms of what? A validation phase makes the whole system less error-prone and helps it cope with potential drift (i.e., change over time). As the professor mentioned in the course, systems are only as good as their data.
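To make the idea of checking for drift a little more concrete, the sketch below compares a recent batch of a numeric feature against a reference batch by measuring how far the mean has moved, in units of the reference standard deviation. This is a deliberately simple heuristic that I picked for illustration, not a method taught in the course; the function names, the threshold of 1.0, and the numbers are assumptions.

```python
from statistics import mean, stdev


def mean_shift_in_std(reference, current):
    """How far the mean of `current` has moved from the mean of `reference`,
    expressed in reference standard deviations."""
    ref_std = stdev(reference)  # sample standard deviation; needs >= 2 values
    if ref_std == 0:
        raise ValueError("reference values have zero variance")
    return abs(mean(current) - mean(reference)) / ref_std


def has_drifted(reference, current, threshold=1.0):
    """Flag drift when the shift exceeds `threshold` (arbitrary for this sketch)."""
    return mean_shift_in_std(reference, current) > threshold


# Made-up feature values: the data seen at training time vs. a recent batch.
training_values = [1, 2, 3, 4, 5]
recent_values = [3, 4, 5, 6, 7]
print(has_drifted(training_values, recent_values))
```

A mean shift misses many kinds of drift (for example, a change in spread or in the relationship between features and labels), so in practice it would be one of several checks run at validation time.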

Another important topic in the ethics context is users' privacy and anonymity. Historically, these were considered part of trust relationships built on local, face-to-face communication, but maintaining them has become challenging on the internet; we don't have enough information about the person we are interacting with, and hence there is no guarantee that the person is trustworthy. Even though blockchain has slightly changed the situation, it is still mandatory for developers to provide privacy-related features by design, e.g., allowing users to control their own data, preventing the possibility of leakage, and taking the recall-precision trade-off into account.

Mindful data science

Last but not least, because regulation tends to come later, the course emphasized the value of a code of ethics at the individual level. For example, Ethical Product Developer describes my own code of ethics, although I should make it more succinct. Being mindful of the externalities of data-driven decisions is clearly the first step toward being ethical, and we should not forget the fact that there are real human beings behind the data.

When we see a dataset on a Jupyter notebook, we shouldn't forget the fact that there is a real world upstream. The oil is not a toy for unconscious software engineers and/or data scientists, and it's not a tool for capitalistic competition based on an extrinsic motivation. It is rather highly sensitive and precious information depicting everyone's beautiful life. (Data Ethics with Lineage)

In that sense, there should be no difference between medical or social-science studies and data science at large. Therefore, obtaining informed consent and receiving objective review from third parties would also need to be the norm for real-world data science.

Author: Takuya Kitazawa

I am an independent consultant, mentor, and advocate for sustainable technology development with a decade of experience in AI/ML products, data systems, and digital transformation. Based in Canada and originally from Japan, I have lived and worked globally, including part-time residence in Malawi, Africa. See CV for more information, or contact at [email protected].

Disclaimer

Opinions are my own and do not represent the views of the organizations I belong or have belonged to.
I am doing my best to ensure the accuracy and fair use of the information. However, there might be some errors, outdated information, or biased subjective statements because the main purpose of this blog is to jot down my personal thoughts as soon as possible before conducting an extensive investigation. Visitors understand the limitations and rely on any information at their own risk.
That said, if there is any issue with the content, please contact me so I can take the necessary action.