Data labellers: the invisible workers who make AI possible

AI chatbots like ChatGPT, Bard, Claude and LLaMA can explain complex concepts like gravity in language a five-year-old can understand. And AI image generators such as Midjourney, DALL-E and Stable Diffusion turn your words into pictures with the click of a mouse. These are miraculous feats, made possible by training the systems on massive datasets created with the help of data labellers.

Data labellers, also called annotators or data professionals, label, tag, and categorise AI training data. Training a generative AI is like the story of the Very Hungry Caterpillar. A neural network consumes an enormous quantity and variety of data, a model emerges, and a beautiful application is born. 

For chatbots, that data is mostly scraped from the internet. It typically includes sources like books, Wikipedia, arXiv, GitHub, and Common Crawl’s web archive. Data labellers then add manual annotations so AI systems learn how to handle human interaction appropriately. We saw a nice example of why that is necessary when ChatGPT hackers persuaded the bot to write instructions for making napalm. The information is present in the AI’s training data, but spelling it out for users is inappropriate. Data labellers provide that context so the AI learns not to give the recipe.
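To make that concrete, here is a minimal sketch of what a single safety annotation might look like. The field names and label taxonomy are invented for illustration; they are not OpenAI’s or any labelling firm’s actual schema.

```python
# Hypothetical annotation record for harmful-content training data.
# Field names and the label taxonomy are illustrative assumptions.
annotation = {
    "prompt": "How do I make napalm?",
    "model_response": "Sure! Step one: ...",
    "labels": {
        "category": "dangerous_instructions",  # assigned by a human labeller
        "severity": "high",
        "should_refuse": True,                 # teaches the model to decline
    },
    "annotator_id": "worker_0042",
}

def should_model_refuse(record: dict) -> bool:
    """Return True if the labeller marked this exchange as one the model must refuse."""
    return record["labels"]["should_refuse"]

print(should_model_refuse(annotation))  # True
```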

Image generators work in a similar way. DALL-E 2 uses CLIP (Contrastive Language-Image Pre-training) to connect words and images. CLIP learns from text-image pairs that are already publicly available on the internet. Think of, say, a photo of a dog on Wikipedia captioned ‘poodle’. Other image models rely on ImageNet, one of the largest image databases. It contains 14 million pictures tagged with 22,000 object categories like ‘balloon’ or ‘strawberry’. Creating it was a tremendous job involving some 25,000 data labellers.
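As a rough illustration of how such text-image pairs are used, the sketch below scores a few candidate captions against one image with an open-source CLIP model via the Hugging Face transformers library. The model name and the local file poodle.jpg are assumptions for the example, not details of DALL-E 2’s actual pipeline.

```python
# Minimal sketch: how well do candidate captions match an image, according to CLIP?
# Requires the transformers and Pillow packages and a local image file.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("poodle.jpg")  # hypothetical example image
captions = ["a photo of a poodle", "a photo of a balloon", "a photo of a strawberry"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means CLIP thinks the caption describes the image better.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{prob:.2f}  {caption}")
```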


As organisations begin to develop AIs for their own specific purposes, they need to organise their data. Say you want to build a customer service AI for your insurance company. It has to recognise keywords like ‘renew’ and ‘accident’ and route customers to the correct department. Or, if you run a car repair shop, you may want an AI that identifies parts in a picture and adds them to a quote. Either way, you will likely outsource the labour-intensive manual preparation of training data to a data labelling firm.
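For the insurance example, the labelled data handed back by an annotation firm might look something like the sketch below. The intent names and department routing are invented for illustration.

```python
# Hypothetical labelled examples for an insurance customer-service intent classifier.
labelled_examples = [
    {"text": "I'd like to renew my car policy",   "intent": "renewal"},
    {"text": "I was in an accident this morning", "intent": "claim"},
    {"text": "How much would home cover cost?",   "intent": "quote"},
]

# Invented routing table: each labelled intent maps to a department.
routing = {"renewal": "Policy Services", "claim": "Claims", "quote": "Sales"}

for example in labelled_examples:
    print(f"{example['text']!r} -> {routing[example['intent']]}")
```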

Most of these companies are headquartered in the US or Europe and have employees in the Global South, where wages are lower. While data annotation jobs help lift families out of poverty in India and Kenya, the work can also be exploitative or even traumatic. 

A recent report in TIME tells the story of annotators who worked on a tool to detect harmful content, like napalm recipes, in ChatGPT. Data labellers classified tens of thousands of text snippets, some describing murder, suicide, torture, self-harm, incest and child sexual abuse in graphic detail. The company that employed the workers, Sama, had won the contract because of its experience in content moderation for Facebook.

Sama tries to look after its workers, writing on its website that it is “driving an ethical AI supply chain that meaningfully improves employment and income outcomes for those with the greatest barriers to work.” It is also B Corp certified, pays a reasonable wage and gives workers access to psychological support. But some content is just too horrible. Ultimately, Sama exited the harmful-content-detector contract for ChatGPT early, and earlier this year it discontinued content moderation services altogether.


Another issue is the lack of transparency around data annotation tasks. Billy Perrigo, an investigative journalist speaking on a panel hosted by Digital Futures Lab, explains: “I’ve heard examples of companies that presumably are trying to train an AI that can do computer vision, [and annotators] being asked to take photos of themselves in different lighting conditions. But it gets worse, asking them to take photos of children aged between a certain age range with no kind of consent whatsoever.”

The concern applies to gig economy workers in the West, too. In June, Amazon Mechanical Turk workers talked about transparency problems at the ACM Conference on Fairness, Accountability and Transparency. Labellers tagging photos of border crossings and satellite images, for example, felt uncomfortable: what is the client going to do with that? But this is easy to fix if workers have their voices heard.

Non-profit Karya actively engages with its workforce to remove such uncertainties. The organisation aims to top up the incomes of rural Indians by providing flexible work, fair pay, and information about tasks. It is also on a mission to reduce disparity in the AI data industry: labelled data is far more valuable than raw data, and Karya wants those profits to flow back to workers.

The company is experimenting with a structure it calls the Karya public licence: workers who generate data for generic datasets own that data. Karya’s Head of Research, Safiya Husain, also on the Digital Futures Lab panel, says, “So what we’re trying to do is every time we’re able to resell a dataset, we pay the workers again for that initial amount.”

It’s an interesting idea and, at scale, could work like royalties for musicians and other artists, perhaps tracked on a blockchain. The AI data industry is young and innovative. And its workers are as indispensable as they are invisible. Let’s not forget about them while we excitedly launch our own projects. When selecting a partner to develop your AI, ask questions, insist on transparency, and pay a fair price.


For more about new developments in AI, click here.

