AOL discloses 650,000 AOL users' search data

UPDATE: Apparently Findory CEO Greg Linden had the story first. He also makes some level-headed comments about the severity of the privacy breach:

Nevermind that no one actually has come up with an example where someone could be identified. Just the theoretical possibility is enough to create a privacy firestorm in some people's minds.

I am as concerned about privacy as any tech geek, but most of my concern is focused on things like millions of credit cards being leaked and millions of social security numbers being lost.

Agreed. For example, 196,000 HP employee records, including names, addresses, Social Security numbers, dates of birth and other employment-related information (one of them mine), seem far worse.

Further update: The NY Times found one. Also, Andrew Orlowski weighs in on the 'database of intentions' in light of the AOL mess.

Well, this isn't going to help AOL's image. Over the weekend, AOL researchers posted a 400MB+ tarball containing the raw search queries of some 650K AOL users, covering the period from March 1, 2006 to May 30, 2006. While users' screen names were "anonymized" by replacing them with numbers, each user's query stream was left intact.
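To see why swapping screen names for numbers doesn't buy much, here is a minimal Python sketch. The record layout (a screen name paired with a query) and the sample records are invented for illustration and are not taken from the AOL file; `pseudonymize` and `query_streams` are hypothetical helper names, not anything AOL used.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(records):
    """Replace each screen name with a stable number; queries are untouched."""
    ids = {}
    next_id = count(1)
    released = []
    for screen_name, query in records:
        if screen_name not in ids:
            ids[screen_name] = next(next_id)
        released.append((ids[screen_name], query))
    return released


def query_streams(released):
    """Group the released rows by numeric ID, as any downloader could."""
    streams = defaultdict(list)
    for user_id, query in released:
        streams[user_id].append(query)
    return dict(streams)


# Hypothetical records, made up for this example (assumed layout).
records = [
    ("alice123", "landscapers in springfield"),
    ("bob456", "cheap flights"),
    ("alice123", "homes sold on elm street"),
    ("alice123", "dr. example podiatrist"),
]

released = pseudonymize(records)
streams = query_streams(released)

for user_id, queries in streams.items():
    print(user_id, queries)
```

The screen name is gone from the released rows, but because the same number is reused for every query from the same person, the full trail for each user can be reassembled with one pass over the data. Whether a given trail identifies anyone depends entirely on what that person typed.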

Given the recent widespread outcry over the US government's move to request search data from Google, Yahoo, AOL and others, I can only marvel at the astoundingly poor judgment shown by the research team that did this. Given his position and extensive track record of research and publications, how could this guy not have realized what a Bad Idea this would be? Well, apparently someone alerted him, because the data was removed from the AOL site by Sunday evening. Too late, of course; several hundred copies had already been downloaded, and they are now being posted, mirrored and grep'd widely.

There's of course a lot of commentary about this around the blogs:

* Adam at Caltech seems to have been first to pick up the story
* SiliconBeat
* TechCrunch, of course
* Zoli Erdos

I guess by morning you'll see a whole lot more shouting, and I expect AOL will be doing big-time damage control. Let's also watch for a couple of other, less obvious things that might/should happen:

* Bloggers refusing to post the data on ethical grounds, even if they got a copy ethically. I have a copy that I downloaded while AOL's site was up, and I'm not posting it (no, it's not on this website, so please don't bother trying to pwn me ;-). I guess the real ethical question is: should I delete it?

* On examination of the data, people may realize that in most cases, there will be insufficient data to derive personally identifying information on these users. Yes, there are some disgusting and disturbing query trails that shed an unpleasant light on human obsessions. And yes, there will probably be a few people who can be pinpointed from their search terms. But after a cursory look at the data, I am going to suggest that it's mostly the mundane, everyday, largely anonymous stuff of our online hunter-gatherer lifestyle, and we aren't going to see a huge class action suit or a vast surge in identity theft.

People are going to shout though, and some other people are definitely going to squirm.

Posted by Gene at August 7, 2006 1:19 AM

I just whipped up a searchable database to help you people who don't want to download 2 gigs and grep your way through. Check it out.

Posted by: devon at August 8, 2006 12:30 AM
