Artificially Intelligent Vision Systems are Overconfident, Like Humans
From self-driving vehicles to security and surveillance to robot vacuums, AI systems are increasingly integrated into our lives. Many of these innovations rely on AIs trained in object recognition: identifying objects such as vehicles, people, or obstacles. Safety requires that a system know its limitations and recognize when it cannot identify something.
Just how well-calibrated are the accuracy and confidence of the object recognition AIs that power these technologies? Our team set out to assess the calibration of AIs and compare them with human judgments.
Artificial Intelligence Identifications & Confidence
Our study required a set of novel visual stimuli that we knew were not already posted online and so could not be familiar to any of the systems or individuals we wanted to test. We asked 100 workers on Amazon Mechanical Turk to take 15 pictures each in and around their homes, with every picture featuring an object. After removing submissions that failed to follow these instructions, we were left with 1,208 images. We uploaded these photos to four AI systems (Microsoft Azure, Facebook Detectron2, Amazon Rekognition, and Google Vision), which labeled objects identified in each image and reported confidence for each label. We showed these same images to people and asked them to identify objects and report their confidence.
To measure the accuracy of the labels, we asked a different set of human judges to estimate the percentage of other humans who would report that the identified label is present in the image, and paid them based on these estimates. These judges assessed the accuracy of labels from both human participants and AIs.
AI vs. Humans: Confidence and Accuracy Calibration
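Both humans and AIs are, on average, overconfident: their average confidence exceeds their average accuracy. Humans reported an average confidence of 75% but were only 66% accurate, a gap of 9 percentage points. AIs displayed an average confidence of 46% and accuracy of 44%, a gap of 2 percentage points.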
| | Average Confidence | Average Accuracy |
|---|---|---|
| AIs | 46% | 44% |
| Humans | 75% | 66% |
Table 1. Average confidence and accuracy levels for the full range of confidence levels.
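The overconfidence in Table 1 can be read as average confidence minus average accuracy. The short Python sketch below works through that arithmetic using the table's figures; the function name `overconfidence_gap` is our own label for illustration, not a term from the study.

```python
# Average confidence and accuracy (in percent), as reported in Table 1.
TABLE_1 = {
    "AIs": {"confidence": 46, "accuracy": 44},
    "Humans": {"confidence": 75, "accuracy": 66},
}


def overconfidence_gap(confidence, accuracy):
    """Return confidence minus accuracy, in percentage points.

    A positive value means overconfidence; a negative value means underconfidence.
    """
    return confidence - accuracy


gaps = {who: overconfidence_gap(**values) for who, values in TABLE_1.items()}

for who, gap in gaps.items():
    print(f"{who}: {gap:+d} percentage points")
# AIs: +2 percentage points
# Humans: +9 percentage points
```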
Overconfidence is most prominent at high levels of confidence, as the figure below shows.
Identifications at a Glance
Before concluding that humans are more overconfident than AIs, we must note an important difference between them. The AIs each generated a list of objects identified with varying levels of confidence. Human participants, by contrast, identified only the objects most likely to be present in the image, so high-confidence labels were overrepresented in the human-generated set compared to the AI-generated set. Since the risk of overconfidence increases with confidence, comparing all labels could be misleading.
To make a more equivalent comparison, we repeated our analysis using only labels identified with confidence of 80% or greater. This analysis again showed that both humans and AIs are overconfident, but in this subset human judgments were not more overconfident than AIs. Humans were 94% confident but only 70% accurate (a 24-point gap), while AIs were 90% confident but only 63% accurate (a 27-point gap).
| | Average Confidence (>80%) | Average Accuracy |
|---|---|---|
| AIs | 90% | 63% |
| Humans | 94% | 70% |
Table 2. Average confidence and accuracy levels for labels with confidence over 80%.
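The restriction to high-confidence labels can be sketched as a filter followed by two averages. The Python example below is illustrative only: the `labels` values are invented for the demonstration and are not data from the study, and `summarize` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical (confidence, accuracy) pairs on a 0-1 scale.
# These values are made up for illustration; they are NOT the study's data.
labels = [
    (0.95, 0.70),
    (0.85, 0.60),
    (0.60, 0.55),
    (0.30, 0.35),
]


def summarize(labels, min_confidence=0.0):
    """Return (mean confidence, mean accuracy) for labels at or above a cutoff."""
    kept = [(c, a) for c, a in labels if c >= min_confidence]
    if not kept:
        raise ValueError("no labels meet the confidence cutoff")
    return mean(c for c, _ in kept), mean(a for _, a in kept)


# All labels versus only those with confidence of 80% or greater.
all_conf, all_acc = summarize(labels)
high_conf, high_acc = summarize(labels, min_confidence=0.80)

print(f"all labels:  gap = {all_conf - all_acc:.3f}")
print(f"conf >= 0.8: gap = {high_conf - high_acc:.3f}")
```

In this toy data the gap is larger in the high-confidence subset, which is why two sets of labels with different confidence mixes are hard to compare on their overall averages alone.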
One notable finding is how humans and AIs generated different types of labels. The image below illustrates this: humans labeled objects such as "remote" (85% confidence) and "buttons" (52%), while AI-generated labels with similar confidence were "indoor" (87%) and "font" (75%).
Conclusions
The results support our prediction that artificially intelligent agents are vulnerable to being too sure of themselves, just like people. This is relevant for tools guided by artificial intelligence: autonomous vehicles, security and surveillance, and robot assistants. Because AIs are prone to overconfidence, users and operators should keep this in mind when using these tools. One response would be to make the system more risk-averse as the consequences of an error grow, so that it is less likely to act on imperfect beliefs. Another would be to require AIs to have checks and verification systems that might catch errors. Provably safe AI systems must know their limitations and demonstrate well-calibrated confidence.
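One way to picture the first response is a standard expected-value rule in which the confidence required before acting rises with the cost of an error. The sketch below is our illustration under stated assumptions, not a method from the study: the function names and the benefit and cost figures are hypothetical, and the rule assumes the confidence passed in is already well calibrated.

```python
def min_confidence_to_act(benefit_if_right, cost_if_wrong):
    """Smallest probability of being right at which acting breaks even.

    Acting has non-negative expected value when
        p * benefit - (1 - p) * cost >= 0,
    which rearranges to p >= cost / (cost + benefit).
    """
    if benefit_if_right <= 0 or cost_if_wrong < 0:
        raise ValueError("benefit must be positive and cost non-negative")
    return cost_if_wrong / (cost_if_wrong + benefit_if_right)


def should_act(confidence, benefit_if_right, cost_if_wrong):
    """Act only if (calibrated) confidence clears the break-even threshold."""
    return confidence >= min_confidence_to_act(benefit_if_right, cost_if_wrong)


# Hypothetical stakes: the same benefit, but very different error costs.
low_stakes = min_confidence_to_act(benefit_if_right=1.0, cost_if_wrong=1.0)
high_stakes = min_confidence_to_act(benefit_if_right=1.0, cost_if_wrong=99.0)

print(f"low stakes threshold:  {low_stakes:.2f}")
print(f"high stakes threshold: {high_stakes:.2f}")
print(should_act(0.90, 1.0, 1.0))   # clears the low-stakes threshold
print(should_act(0.90, 1.0, 99.0))  # does not clear the high-stakes threshold
```

Because the study found that labels reported at about 90% confidence were only about 63% accurate for AIs, a rule like this would need an accuracy-adjusted confidence rather than the raw reported value.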