Netflix Prize Anonymous Dataset Broken


Arvind Narayanan and Vitaly Shmatikov, two computer scientists at the University of Texas, have just broken the Netflix Prize dataset. The dataset, which contains over 100 million film ratings by roughly 500,000 Netflix customers, is the basis of a contest to see if developers can improve on Netflix's recommendation engine. Netflix removed all personally identifiable information from the data before releasing it, of course, but (just like last year's AOL dataset) the data itself contained enough information to be traced back to individuals. Narayanan and Shmatikov simply compared ratings in the Netflix dataset to publicly accessible ratings on the Internet Movie Database, and could frequently match a Netflix anonymized ID with an IMDB user. While this might seem minor, movie ratings can reveal political affiliations, religious beliefs, or sexual orientation. Perhaps especially sexual orientation.
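To make the idea of matching ratings across two sources concrete, here is a toy sketch in Python. It is not the researchers' actual algorithm, just an illustration of the general approach under some assumptions of mine: the function name `link_records`, the `tolerance` and `margin` parameters, the rarity weighting, and all of the sample data are hypothetical. It scores each anonymized record against a public profile, gives more weight to rarely rated titles, and accepts a match only when one candidate clearly stands out.

```python
import math
from collections import Counter


def link_records(anonymized, public, tolerance=1, margin=1.5):
    """Return {public_name: anon_id} for profiles with one clearly best match.

    Illustrative only: the scoring rule and thresholds are assumptions.
    """
    # Count how many anonymized records rated each title.
    # Rarely rated titles are treated as more identifying.
    popularity = Counter(
        title for ratings in anonymized.values() for title in ratings
    )

    def score(pub_ratings, anon_ratings):
        total = 0.0
        for title, stars in pub_ratings.items():
            if title in anon_ratings and abs(anon_ratings[title] - stars) <= tolerance:
                total += 1.0 / math.log(1 + popularity[title])
        return total

    matches = {}
    for name, pub_ratings in public.items():
        ranked = sorted(
            ((score(pub_ratings, r), anon_id) for anon_id, r in anonymized.items()),
            reverse=True,
        )
        if not ranked:
            continue
        best_score, best_id = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
        # Accept only if the best candidate beats the runner-up by a clear margin.
        if best_score > 0 and best_score >= margin * runner_up:
            matches[name] = best_id
    return matches


# Made-up sample data: anonymized IDs and public usernames are invented.
anonymized = {
    1001: {"Obscure Documentary": 5, "Blockbuster A": 4, "Indie Drama": 2},
    1002: {"Blockbuster A": 4, "Blockbuster B": 3},
    1003: {"Blockbuster A": 5, "Blockbuster B": 3, "Indie Drama": 4},
}
public = {
    "film_fan_42": {"Obscure Documentary": 5, "Indie Drama": 2, "Blockbuster A": 4},
    "casual": {"Blockbuster A": 4},
}

print(link_records(anonymized, public))
```

In this toy data, the profile that rated an obscure title links to a single anonymized record, while the profile that rated only a popular film matches several records equally well and is left unlinked. That is the intuition behind the attack: a handful of uncommon ratings can act like a fingerprint.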

In fact, just that kind of sensitivity motivated Congress to pass the Video Privacy Protection Act in 1988, after Supreme Court nominee Robert Bork's video rental history was published in a newspaper. The Act refers to “prerecorded video cassette tapes or similar audio visual materials,” which would appear to include DVDs (though I'm not aware of any judicial decision so construing the statute), but it only prevents disclosure of “information which identifies a person as having requested or obtained specific video materials or services,” which seems limited to rental and sale history, not ratings information. So while this release of data may not have been unlawful, it points to a serious problem in how our legal system treats sensitive information. What you rate may tell more about you than what you view, but only the latter is protected by federal law, and that's only because members of Congress didn't want their own porn rental histories disclosed. Privacy protections are patchy and piecemeal, enacted after some glaring event frightens the public and the legislature, and they can be quickly outstripped by technological developments. Some analysts are calling for companies to stop releasing datasets at all, given how easily they have been de-anonymized, while others bemoan the loss of such valuable data for research purposes. I say be wary of what information you give out, or you'll learn that lesson the hard way.
