Monday, April 29, 2013

The double-edged sword of losing our privacy

Today's New York Times had a fascinating pair of articles that, seemingly without the editors intending it, nicely illustrates some of the pros and cons of applying data mining to publicly available private information.

"I was discovered by an algorithm", the lead story in the business section, is about a headhunter start-up company that aggregates information from a variety of public sources to identify high-end programming and development talent.  They use this data to supplement the standard information that an employer would receive (eg, degrees, schools, awards, work history, etc) and identify high potential candidates whose talents don't always come through in a typical resume or CV.  The article describes how "big data" techniques allow employers to utilize a richer array of variables to identify and evaluate prospective job candidates, and highlights the case of an individual who received a lucrative programming job but who would otherwise not have even passed a standard recruiting screen due to poor high school performance and lack of a college degree.  Would prospective recruits feel violated by this black-box search and evaluation process conducted without their permission or awareness?  Both the individual and his employer say no.  Score one for lack of privacy being a good thing.

"When your data wanders to places you've never been", buried inside the business section, tells the tale of a woman who gets targeted by pharma direct marketers who have mistakenly identified her as a multiple sclerosis patient based on "big data" searches of publicly available information on the web.  She ends up feeling both violated, and worse, too daunted by the complex chain of data brokers and marketing companies behind the error to do anything about it.  Score one for lack of privacy being a bad thing.

It's interesting that neither article really dealt with the obvious flip side of its situation.  Information gleaned from outside of a traditional recruiting process can be used to discriminate just as easily as it can be used to create new job opportunities.  And my health and demographic information can just as easily lead me to valuable treatments and support communities as it can subject me to unwanted marketing and possible discrimination.

A common thread in these articles is that neither involved the collection or use of illicitly obtained data (such as an SSN or DOB).  Rather, the data mining leveraged information that was voluntarily provided by the individuals in question, albeit for other purposes.  Though the information was available in the clear on the internet and was not illegally obtained, the individuals probably thought of it as, if not private, at least shielded or too isolated to be found and put to use through random or targeted public searches.  In both cases they were wrong, one pleasantly and the other not so pleasantly.

The "big data" privacy issue is not so much about what a bad actor would do if they could get rare data gems like my SSN or my bank account number; it's about the inferential mosaic that could be assembled by good, neutral, and bad actors alike from the many small pebbles of information that I myself have strewn across the web, such as what I say on an affinity user site, a web-based survey, an Amazon review, or a Yelp comment (or a public blog).

I'm reminded of the story of an app called "Girls Around Me" that matched location data from Foursquare with profile data from Facebook to pinpoint women in a particular location and automatically pull up their Facebook pages to get pictures, background information, and messaging capability.  That is not what the women, Foursquare, or Facebook had intended when they opened up their data and their APIs.

What's scary is not that there are unintended consequences; it's that there are unintended AND unpredictable consequences.  In health care, Latanya Sweeney has launched an interesting project to show how individual health information routinely and legally diffuses through a broad array of companies and websites.  Patients probably know bits and pieces of it, but probably not the scale and scope of it, as shown below.

This chart is most interesting for what it doesn't show, rather than what it shows:  It doesn't include the patient-generated data behind the NYT articles noted above.  As big data advances in scale and scope, it is the information that we voluntarily share -- like on PatientsLikeMe and CureTogether and SmartPatients -- that will eventually get fed into "big data" black-boxes and used in ways both good and bad that we are unable to foresee right now.