Netflix Prize Anonymous Dataset Broken
Posted 12/2/07 at 11:40:48PM | by Erin Simon

Arvind Narayanan and Vitaly Shmatikov, two computer scientists at the University of Texas at Austin, have just broken the anonymity of the Netflix Prize dataset. The dataset, which contains over 100 million film ratings by about 500,000 Netflix customers, is the basis of a contest to see if developers can improve on Netflix's recommendation engine. Netflix removed all personally identifiable information from the data before releasing it, of course, but (just like last year's AOL dataset) the data itself contained enough information to be traced back to individuals. Narayanan and Shmatikov simply compared ratings in the Netflix dataset to publicly accessible ratings on the Internet Movie Database (IMDb), and could frequently match an anonymized Netflix ID with an IMDb user. While this might seem minor, movie ratings can reveal political affiliations, religious beliefs, or sexual orientation. Perhaps especially sexual orientations.
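
To make the idea of matching anonymized ratings against public ones concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not the researchers' actual algorithm: the function name link_records, the scoring rule (exact rating matches, weighted so that rarely rated titles count for more), the min_margin threshold, and all titles, usernames, and IDs are hypothetical.

```python
import math
from collections import Counter


def link_records(anonymous, public, min_margin=1.0):
    """Guess which anonymized record belongs to each public profile.

    anonymous: {anon_id: {title: rating}}   (hypothetical layout)
    public:    {username: {title: rating}}  (hypothetical layout)
    Returns {username: anon_id}, keeping only matches that beat the
    runner-up by at least min_margin (an assumed, arbitrary threshold).
    """
    # How many anonymized records rated each title. A match on a
    # rarely rated title is far more identifying than a blockbuster.
    support = Counter(
        title for ratings in anonymous.values() for title in ratings
    )

    def score(pub_ratings, anon_ratings):
        total = 0.0
        for title, rating in pub_ratings.items():
            if anon_ratings.get(title) == rating:
                # support[title] >= 1 here, so the log is positive.
                total += 1.0 / math.log(1 + support[title])
        return total

    matches = {}
    for user, pub_ratings in public.items():
        ranked = sorted(
            ((score(pub_ratings, r), anon_id) for anon_id, r in anonymous.items()),
            reverse=True,
        )
        if not ranked:
            continue
        best_score, best_id = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
        # Only claim a match when the best candidate clearly stands out.
        if best_score - runner_up >= min_margin:
            matches[user] = best_id
    return matches


# Made-up data: three anonymized records and two public profiles.
anonymous = {
    1001: {"Film A": 5, "Film B": 2, "Obscure Film C": 4},
    1002: {"Film A": 5, "Film B": 3},
    1003: {"Film A": 4, "Film B": 2, "Obscure Film D": 1},
}
public = {
    "movie_fan": {"Film A": 5, "Obscure Film C": 4},
    "casual": {"Film A": 5},
}

result = link_records(anonymous, public)
print(result)  # {'movie_fan': 1001}
```

In this made-up data, "casual" is left unmatched because a single rating of a popular film fits two records equally well, while "movie_fan" is pinned down by one obscure title. That is the intuition behind why a handful of ratings can be enough to single someone out.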

In fact, just that kind of sensitivity motivated Congress to pass the Video Privacy Protection Act in 1988, after Supreme Court nominee Robert Bork's video rental history was published in a newspaper. The Act refers to “prerecorded video cassette tapes or similar audio visual materials,” which would appear to include DVDs (though I'm not aware of any judicial decision so construing the statute). But it only prevents disclosure of “information which identifies a person as having requested or obtained specific video materials or services,” which seems limited to rental and sale history, not ratings information. So while this release of data may not have been unlawful, it points to a serious problem in how our legal system treats sensitive information. What you rate may tell more about you than what you view, but only the latter is protected by federal law, and that's only because members of Congress didn't want their own porn rental histories disclosed.

Privacy protections are patchy and piecemeal, enacted after some glaring event frightens the public and the legislature, and they can be quickly outstripped by technological developments. Some analysts are calling for companies to stop releasing datasets at all, given how easily they have been de-anonymized, while others bemoan the loss of such valuable data for research purposes. I say be wary of what information you give out, or you'll learn that lesson the hard way.

Comments

Interesting links you chose to add
Submitted by Fenthic on Tue, 2007-12-04 02:37.

I found it rather funny that the links in the phrase "especially sexual orientations" point to those articles of yours. Not upset about it, just think it's funny.


