How Data Leads to Bias in AI Systems

An analysis of embedded prejudice in datasets

Nov 05, 2023

Data is not objective; it reflects pre-existing social and cultural biases. In this essay, I will attempt to describe how the data used to train AI systems serves as a mirror for the biased environment that we humans have created.

This is the first in a series where I hope to discuss how data leads to bias, how we can find bias in data, how best to avoid biased data while training AI models, and how data can serve as a means for humans to restrict the abilities of AI systems.

I’m going to experiment a little with this essay by deviating from my usual structured approach of headings and sub-headings. Instead, I’ll try to pass on my thoughts in one well-structured body, with paragraphs representing different thoughts. I’m doing this because I feel an essay of this type doesn’t require explicit structure and formality.

Let’s dive right in.

Human Biases

A lot of people perceive data to be objective and factual, but research shows that human judgements and biases shape data. Look at it this way: data is essentially gathered from human interaction, so doesn’t it make sense that the biases which exist in the real world would also be found in datasets?

This bias exhibits itself in different ways. The first is through the researchers building the models: their race, geographical region, and gender can unconsciously affect the types of questions they ask and the problems they work on. But the role of humans in creating bias within data doesn’t stop at the expert level.

Research also shows that bias can arise from unfair causal pathways in the data generation process; that is, bias lives in the complex relationships between variables. In other words, certain links within a dataset encode discriminatory or biased relationships. A more intuitive way of describing this is through an analogy.

Imagine a simple AI system built to identify potential lawbreakers among a group of people in a city park. The model would be trained on the criminal records of all the citizens living in that city. On the surface, the data seems objective; it is, after all, just the criminal record data for people living in a particular city.

However, a deeper analysis shows inherent bias. For example, the data might show higher arrest rates for individuals who live in low-income neighbourhoods. There’s a high probability that those higher arrest rates aren’t due to higher crime rates but to over-policing: low-income neighbourhoods are patrolled more heavily than higher-income ones with equal crime rates.

This is an example of how pathways in data generation can lead to bias in how a model correlates variables within a dataset. Our fictional model, for instance, would learn a spurious relationship between low-income backgrounds and criminal tendencies, and this leads to unfair predictions.
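To make the analogy concrete, here is a minimal Python sketch of the over-policing scenario. Everything in it is an assumption made up for illustration: the function name, the population size, the 5% offence rate, and the two detection rates standing in for policing intensity. It is not drawn from any real dataset; it only shows how equal behaviour can produce unequal records.

```python
import random


def simulate_arrest_rate(population, offence_rate, detection_rate, rng):
    """Return the observed arrest rate for one hypothetical neighbourhood.

    A person ends up in the arrest record only if they offend AND the
    offence is detected, so the record reflects policing intensity as
    well as behaviour.
    """
    arrests = 0
    for _ in range(population):
        offended = rng.random() < offence_rate
        detected = rng.random() < detection_rate
        if offended and detected:
            arrests += 1
    return arrests / population


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed numbers: both areas share the SAME underlying offence rate,
# but the low-income area is policed three times as heavily.
OFFENCE_RATE = 0.05
low_income = simulate_arrest_rate(100_000, OFFENCE_RATE, 0.60, rng)
high_income = simulate_arrest_rate(100_000, OFFENCE_RATE, 0.20, rng)

print(f"low-income arrest rate:  {low_income:.2%}")
print(f"high-income arrest rate: {high_income:.2%}")
```

Under these assumptions the recorded arrest rates should land near 3% and 1%, even though the true offence rate is identical in both areas. A model trained only on the arrest records would see income as a strong predictor of crime, when the difference was created entirely by how the data was generated.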

Data Collection Bias

Bias can also be found in the methods used to collect data. By nature, data collection procedures involve trade-offs which affect the diversity of datasets. For example, data collected solely from digital sources reflects the demographics of those with access to technology rather than providing a diverse sample descriptive of the general populace.

Going back to our earlier analogy: if the people in the low-income neighbourhoods don’t have access to technology, then they’re going to be essentially unrepresented in the compiled dataset.
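The same point can be shown with a little arithmetic. The sketch below assumes a made-up city where 40% of residents live in low-income neighbourhoods, and assumes hypothetical rates of access to technology for each group (30% and 90%). The function name and all of the figures are illustrative, not measured values.

```python
def sample_share(population_share, access_rate):
    """Expected share of each group in a digitally collected sample.

    Only people with access to technology can appear in the sample, so
    each group's weight is its population share times its access rate,
    renormalised so the shares sum to 1.
    """
    reachable = {
        group: population_share[group] * access_rate[group]
        for group in population_share
    }
    total = sum(reachable.values())
    return {group: weight / total for group, weight in reachable.items()}


# Assumed figures for illustration only.
population_share = {"low_income": 0.40, "high_income": 0.60}
access_rate = {"low_income": 0.30, "high_income": 0.90}

shares = sample_share(population_share, access_rate)
for group, share in shares.items():
    print(f"{group}: {population_share[group]:.0%} of the city, "
          f"{share:.1%} of the dataset")
```

With these assumed numbers, low-income residents make up 40% of the city but only about 18% of the dataset. Nothing in the dataset itself reveals the gap; it only becomes visible when the sample is compared against the population it is supposed to describe.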

Ultimately, all data collection techniques involve some kind of trade-off that advantages certain groups over others. Consequently, no dataset can fully represent reality. So while datasets might provide objective numbers, they essentially mirror existing divides within society.

Conclusion

Most of the biases present in data result either from capturing existing human bias or from flaws in the techniques used to collect data. In the next essay in this series, we’ll discuss how causal modelling can help discover biases within data.

Aurum Finds

Here is a non-exhaustive list of articles I’ve recently read and highly recommend.

Will AI kill us all: Alejandro Piad Morffis, in collaboration with Oleg Davydov, attempts to answer the question: Will AI kill us all? Fear of AI has been a huge issue recently, and with it has come a lot of crazy theories. It’s refreshing to read a level-headed view from people who know what they are talking about.
Compilations and Thoughts on Marc Andreessen's Techno-Optimism: Zan Tafakari provides a grounding and convincing view of Marc Andreessen’s popular techno-optimist manifesto, so grounding in fact that it made me remove the techno-optimist tag from my bio after seeing it for what it was.
Becoming Polymathic: Andrew Smith and Michael Woudenberg provide a beautiful counter-argument to the stated belief that specialization is the road to human progress. They argue that humans were meant to be polymaths and that this is how we will solve problems.
Albuquerque Part 1: Everything is Activism (even cancer patient advocacy): Rudy Fischmann is a new friend I made along the way while working on a new project of mine. This essay talks about a conference Rudy recently attended during his work as a cancer patient advocate.