Reward hacking: AI finds a way

HennyGe Wichers

September 6, 2023

Image showing code. Reward hacking: AI finds a way — Photo by Markus Spiske on Unsplash

Reward hacking happens when an Artificial Intelligence (AI) achieves its goal – but not in the way the programmer intended. The computer finds a loophole or takes a shortcut, usually resulting in unintended consequences.

GenProg, an automated bug-fixing tool, gives us a nice illustration. Its job was to keep lists clear of sorting errors. When it encountered an incorrectly sorted list, GenProg would identify the cause and correct the program making the mistake. At least, that was the idea. But the AI worked out an easier solution: keep lists clear of errors – by deleting their contents. No one said it shouldn’t.

Programming an AI is tricky. The system needs an objective, or proxy, to bring about the behaviour we want. In the case of GenProg, keeping lists clear of errors was a proxy for fixing bugs. That could work – but not if simply deleting everything is an option. Ultimately, the AI will solve errors only if it is the most efficient way to achieve its goal of error-free lists.

Many AI researchers and developers have bumped into unexpected solutions because of such exploits. It’s not a new phenomenon either. Steven Kerr wrote the aptly titled paper ‘On the Folly of Rewarding A, While Hoping for B’ nearly three decades ago. It’s now a much-quoted classic.

Sometimes, the hacks AI comes up with are ingenious. Other times, they are a problem. But a lot of the time, they’re funny. Here are six that made me smile.

1. optical engineering

Optical engineers use AI to help design lenses for sophisticated equipment like cameras and microscopes. Finding the optimal shape and position of lenses involves a lot of number-crunching, so it’s a perfect job for a computer. Researchers in Canada devised a specialist algorithm and found a unique new design that outperformed any alternative by at least factor two – only it called for a lens that was 20 meters thick. Yes, 20 meters. That’s 66 feet.

2. tic-tac-toe

In her book You Look Like a Thing and I Love You, Janelle Shane describes programmers building algorithms to play tic-tac-toe against each other. To make the game more interesting, the board was infinitely large. One of the programmers allowed an AI to develop its own approach – and it began winning all its games. On closer inspection, they saw the machine placed its moves VERY far away on the board. When the opponent’s computer tried to simulate the expanded board, it would run out of memory and crash – forfeiting the game. Ingenious indeed.

3. skin cancer screening

ModelDerm, a skin cancer detection tool, famously became as good as human dermatologists at diagnosing malignant skin lesions. But its developers later warned that the AI was diagnosing rulers instead of cancer. The classification algorithm was trained on images of skin lesions labelled as cancerous or not cancerous. The images of cancerous lesions included a ruler for scale. So, the model learnt to identify rulers as a sign of malignancy – that’s easier than learning about different kinds of lesions. But real-world patients with malignant tumours, unfortunately, don’t come with rulers attached.

Image showing a measuring tape in curls. Reward hacking: AI finds a way — Photo by Diana Polekhina on Unsplash

4. boat racing

AI researchers often use video games to test ideas. It’s convenient because the game producers have already created a world and set of rules, so the researchers don’t have to. OpenAI experimented with CoastRunners. The goal of the game is to finish a boat race quickly, and players earn higher scores by hitting targets laid out along the route. The team programmed an AI to get as many points as possible, assuming that would make it win the race. But it didn’t. The AI found that it could rack up points by doing doughnuts in a little harbour, continuously collecting re-spawning turbos – and occasionally catching fire. Given its objective, that was better than finishing the race.

5. autopilot

Here’s another gem highlighted in You Look Like a Thing and I Love You. An AI that was supposed to land a plane on an aircraft carrier figured out how to get a perfect score. It discovered that if it applied a large enough force to the landing, its simulation memory would overflow, like an odometer rolling over from 99999 to 00000. As a result, the simulation would register zero force. That’s a very smooth landing – but only in theory because, in reality, the plane would smash into the deck.

6. simulated worlds

Simulated organisms are very good at evolving to make the most of energy sources in their world. In that way, they’re a lot like biological organisms, which have adapted to extract energy from oil, caffeine, and even farts. Astrophysicist David L. Clements happened across an AI organism getting free food by exploiting arithmetic rounding. The creature started with just a little food, yet decided to have lots of children. The simulation gave each child a tiny bit of food – if a child had only a fraction of food, the simulation would round it up to one. Eventually, many fractions across many children would become a lot of food. Yep, eat that!

Tweet by David L. Clements describing the organism rounding exploit. Reward hacking: AI finds a way — Tweet by David L. Clements

I hope you enjoyed this little parade of unintended side effects. There’s only one takeaway: With AI, you literally get what you ask for.

You might also like this article about programming AI: You can’t fetch the coffee if you’re dead

For more about new developments in tech and AI, click here.