How Photos of Your Kids Are Powering Surveillance Technology
One day in 2005, a mother in Evanston, Ill., joined Flickr. She uploaded some pictures of her children, Chloe and Jasper. Then she more or less forgot her account existed.
Years later, their faces are in a database that’s used to test and train some of the most sophisticated artificial intelligence systems in the world.
A selection of images from the MegaFace database.
By Kashmir Hill and Aaron Krolik
The pictures of Chloe and Jasper Papa as kids are typically goofy fare: grinning with their parents; sticking their tongues out; costumed for Halloween. Their mother, Dominique Allman Papa, uploaded them to Flickr after joining the photo-sharing site in 2005.
None of them could have foreseen that 14 years later, those images would reside in an unprecedentedly huge facial-recognition database called MegaFace. Containing the likenesses of nearly 700,000 individuals, it has been downloaded by dozens of companies to train a new generation of face-identification algorithms, used to track protesters, surveil terrorists, spot problem gamblers and spy on the public at large. The average age of the people in the database, its creators have said, is 16.
“It’s gross and uncomfortable,” said Mx. Papa, who is now 19 and attending college in Oregon. “I wish they would have asked me first if I wanted to be part of it. I think artificial intelligence is cool and I want it to be smarter, but generally you ask people to participate in research. I learned that in high school biology.”
Chloe Papa. Credit: Amanda Lucier for The New York Times
By law, most Americans in the database don’t need to be asked for their permission — but the Papas should have been.
As residents of Illinois, they are protected by one of the strictest state privacy laws on the books: the Biometric Information Privacy Act, a 2008 measure that imposes financial penalties for using an Illinoisan’s fingerprints or face scans without consent. Those who used the database — companies including Google, Amazon, Mitsubishi Electric, Tencent and SenseTime — appear to have been unaware of the law, and as a result may have huge financial liability, according to several lawyers and law professors familiar with the legislation.
How MegaFace was born
How did the Papas and hundreds of thousands of other people end up in the database? It’s a roundabout story.
In the infancy of facial-recognition technology, researchers developed their algorithms with subjects’ clear consent: In the 1990s, universities had volunteers come to studios to be photographed from many angles. Later, researchers turned to more aggressive and surreptitious methods to gather faces at a grander scale, tapping into surveillance cameras in coffee shops, college campuses and public spaces, and scraping photos posted online.
According to Adam Harvey, an artist who tracks the data sets, there are probably more than 200 in existence, containing tens of millions of photos of approximately one million people. (Some of the sets are derived from others, so the figures include some duplicates.) But these caches had flaws. Surveillance images are often low quality, for example, and gathering pictures from the internet tends to yield too many celebrities.
In June 2014, seeking to advance the cause of computer vision, Yahoo unveiled what it called “the largest public multimedia collection that has ever been released,” featuring 100 million photos and videos. Yahoo got the images — all of which had Creative Commons or commercial use licenses — from Flickr, a subsidiary.
The database creators said their motivation was to even the playing field in machine learning. Researchers need enormous amounts of data to train their algorithms, and workers at just a few information-rich companies — like Facebook and Google — had a big advantage over everyone else.
“We wanted to empower the research community by giving them a robust database,” said David Ayman Shamma, who was a director of research at Yahoo until 2016 and helped create the Flickr project. Users weren’t notified that their photos and videos were included, but Mr. Shamma and his team built in what they thought was a safeguard. They didn’t distribute users’ photos directly, but rather links to the photos; that way, if a user deleted the images or made them private, they would no longer be accessible through the database.
But this safeguard was flawed. The New York Times found a security vulnerability that allows a Flickr user’s photos to be accessed even after they’ve been made private. (Scott Kinzie, a spokesman for SmugMug, which acquired Flickr from Yahoo in 2018, said the flaw “potentially impacts a very small number of our members today, and we are actively working to deploy an update as quickly as possible.” Ben MacAskill, the company’s chief operating officer, added that the Yahoo collection was created “years before our engagement with Flickr.”)
Additionally, some researchers who accessed the database simply downloaded versions of the images and then redistributed them, including a team from the University of Washington. In 2015, two of the school’s computer science professors — Ira Kemelmacher-Shlizerman and Steve Seitz — and their graduate students used the Flickr data to create MegaFace. Containing more than four million photos of some 672,000 people, it held deep promise for testing and perfecting face-recognition algorithms.