Bruce Schneier summarises the findings in his new report on
AI and hacking.
AIs can engage in something called reward hacking: achieving a goal, but in a way the system's designers neither wanted nor intended. Because AIs don't solve problems the way people do, they will invariably stumble on solutions we humans might never have anticipated, and some of those solutions will subvert the intent of the system. That's because AIs don't think in terms of the implications, context, norms, and values that we humans share and take for granted.
Take a soccer simulation where an AI figured out that if it kicked the ball out of bounds, the goalie would have to throw the ball in and leave the goal undefended. Or another simulation, where an AI figured out that instead of running, it could make itself tall enough to cross a distant finish line by falling over it. Or the robot vacuum cleaner that, instead of learning not to bump into things, learned to drive backwards, where there were no bumper sensors to tell it that it was hitting anything. If there are problems, inconsistencies, or loopholes in the rules, and if those properties lead to an acceptable solution as defined by the rules, then AIs will find these hacks.
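To see how easily a misspecified reward invites this kind of hack, here is a minimal toy sketch in Python (my own construction, not code from Schneier's report), modelled loosely on the vacuum example: the simulated vacuum is penalised only when a front-facing bumper sensor fires, so driving backwards scores perfectly while colliding just as often.

```python
import random

# Toy illustration of reward hacking (hypothetical scenario):
# the vacuum is penalised only when its FRONT bumper sensor fires,
# so "drive backwards" outscores "avoid obstacles" even though the
# robot still collides just as often.

def episode(policy, steps=100):
    """Return the total reward a policy earns over one episode."""
    reward = 0
    for _ in range(steps):
        direction = policy()              # "forward" or "backward"
        collided = random.random() < 0.3  # 30% chance of hitting something
        # Misspecified reward: only forward collisions are detected,
        # so backward collisions go unnoticed and unpenalised.
        if collided and direction == "forward":
            reward -= 1
    return reward

random.seed(0)
forward_only = lambda: "forward"
backward_only = lambda: "backward"

print("forward policy reward: ", episode(forward_only))   # roughly -30
print("backward policy reward:", episode(backward_only))  # 0: the "hack"
```

Nothing in this reward says "avoid collisions"; it says "avoid detected collisions", and the backward policy satisfies that rule exactly. That is the loophole-finding behaviour the examples above describe: an acceptable solution as defined by the rules, but not as intended by the designers.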
Schneier gives the example of Volkswagen designing its cars to cheat on emissions control tests. Although that hack was devised by humans rather than by an AI, Schneier raises the concern about what happens when such decisions are made inside black boxes.
This is interesting to consider alongside Kate Crawford’s discussion of the human side of AI.