Where does Data Privacy technology stand today?

YA Zardari
4 min read · Mar 22, 2020

‘Software is Eating the World’ was Marc Andreessen’s early declaration about the technology industry. That phrase served as a thesis for our future, and the basis for many new innovations.

There is another phrase that seems set to be similarly prescient: ‘Data is the New Oil’. Data has become central, and with the rise of data has come the rise of analytics, automation, and artificial intelligence. Increasingly, the world looks to be shaped by the impact of these technologies.

But there is no popular phrase yet to underscore their externalities, or the growing nexus of technology forming around them.

The area with the most momentum and the most pressing concern is privacy, though it would be a mistake to view it as an externality alone: it is a thriving space in its own right. Privacy is not only an ethical concern, but a functional necessity for effective data use, and a business problem to overcome.

No one will let a service collect their data if it handles that data unsafely. Companies need to prevent security incidents and privacy leaks, which is easier said than done when data is being shared and made accessible across an organization. They also need a way to collaborate and exchange data with external parties that is secure and avoids loss of IP.

Given the conversations I’ve had with market leaders, and reports I’ve read from Gartner and other organizations, these areas of the data journey can be segmented into 1) data protection and 2) data collaboration. This segmentation is still early, and I’ll soon write about the data journey in more detail.

The goal of data protection is to comprehensively privacy-protect data sets for a company’s internal use (i.e. mitigate re-identification risk for individuals) while preserving analytical value, so the data can actually be leveraged.

The way to do this is to build equivalence classes. Take common feature attributes in a data set, and group records that share the same identifying values into classes. Then ensure that there are a minimum of x individuals in every equivalence class (e.g. a minimum of 10 people with the same zip code, a grouping large enough to mitigate privacy risk); a minimal sketch of this check follows.
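Here is a minimal sketch of that check, assuming a tabular data set in pandas; the column names, sample values, and the choice of minimum class size are all hypothetical:

```python
# A minimal sketch of building equivalence classes over quasi-identifiers.
# Column names and data are illustrative, not from any real data set.
import pandas as pd

df = pd.DataFrame({
    "zip_code": ["10001", "10001", "10001", "10002", "10002"],
    "age_band": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "salary":   [52000, 61000, 58000, 75000, 70000],
})

QUASI_IDENTIFIERS = ["zip_code", "age_band"]
K = 3  # minimum class size; a policy choice, not a universal constant

# Each unique combination of quasi-identifier values is one equivalence class.
class_sizes = df.groupby(QUASI_IDENTIFIERS).size()

# Classes smaller than K carry elevated re-identification risk and need
# further generalization or suppression before the data is released.
risky_classes = class_sizes[class_sizes < K]
print(risky_classes)
```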

There are two primary technology avenues to do this:

K-Anonymity: Remove precision from data sets, generalizing or suppressing values, until every record is indistinguishable from at least K-1 others, i.e. every equivalence class contains at least K individuals.

Differential Privacy: Inject noise into the data set (or into query results) using a mathematical methodology that provides provable guarantees against re-identification (though only so much noise can be injected before the analytical value is consumed). This can be more useful for transactional data sets (i.e. those based on time, or an event) that cannot be so easily classified into equivalence classes.
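To make the second avenue concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query; the epsilon value, records, and predicate are illustrative choices, not a full differential privacy framework:

```python
# A minimal Laplace-mechanism sketch for a differentially private count.
import numpy as np

rng = np.random.default_rng()

def noisy_count(records, predicate, epsilon=0.5):
    """Return a differentially private count of records matching predicate.

    A count query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so Laplace noise with scale 1/epsilon
    satisfies epsilon-differential privacy for this query.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 52, 38]
print(noisy_count(ages, lambda a: a >= 30, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the trade-off the parenthetical above describes is exactly this dial.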

Of course, it is a bit more complicated to implement this than it seems. Many organizations use a form of k-anonymity today, but they have their data scientists apply it manually. This is laborious, and hard to keep consistent across an enterprise-grade organization; it needs to be automated. In addition, it is difficult to realistically assess privacy risk. Privacy risk must be understood objectively, as Google and the University of Chicago found out.

Both of these requirements mean a computer must be able to understand what is in a data set, but data is diverse. Devising an algorithm that is agnostic to all kinds of data, with different identifiers, and that works on clean, unclean, structured, and unstructured data is at once difficult and necessary. The best companies in the data privacy space are solving all of these problems.

Data collaboration, as it stands today, has similar fundamental challenges. I don’t trust another party enough to share my data with them, but we would both benefit from working together. Why not share data with a mutual third party that performs the analysis for us? Well, you would still need to trust them to view your IP-sensitive information, and you would still be sharing the data and compounding exposure risk: what if they are trustworthy, but suffer a security breach? In addition, data residency laws often mean that sensitive data cannot be moved across geographic lines, regardless of the business case.

You should not need to move data sets out of their data centers. You should not need to co-locate data sets and combine them into one large set. But you should be able to build models whose results are equivalent to having done so.

The goal is to perform machine learning across distributed data sets, and secure multi-party computation, or SMC, does just that.
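To see the core idea, here is a toy additive secret-sharing sketch: three parties jointly compute a sum while each party’s input stays hidden. The values and party count are illustrative, and production SMC relies on hardened protocols and libraries; full SMC-based machine learning builds on aggregation primitives like this one.

```python
# A toy additive secret-sharing sketch: three parties compute the sum of
# their private values without any party seeing another's input.
# Illustrates the core SMC idea only; not a production protocol.
import secrets

Q = 2**61 - 1  # all arithmetic is modulo a large prime

def share(value, n_parties=3):
    """Split a value into n random shares that sum to it mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

# Each party secret-shares its private input (values are hypothetical).
inputs = [120, 300, 45]
all_shares = [share(v) for v in inputs]

# Party i receives the i-th share of every input and sums them locally.
partial_sums = [sum(s[i] for s in all_shares) % Q for i in range(3)]

# Combining the partial sums reveals only the total, never the inputs.
total = sum(partial_sums) % Q
print(total)  # 465
```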

Another technique is to encrypt data so that no one can see it, period, but use cryptographic techniques that enable computation, including machine learning, directly on the encrypted data. This is called homomorphic encryption, and while promising, it is still too computationally burdensome to use at any practical scale.
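For a feel of the partially homomorphic case, here is a small sketch assuming the open-source python-paillier package (pip install phe). Paillier supports addition on ciphertexts, so an untrusted party can aggregate values it cannot read; the salary figures are illustrative, and this demonstrates encrypted aggregation rather than full encrypted machine learning:

```python
# A small additively homomorphic sketch using the python-paillier library.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

salaries = [52000, 61000, 58000]  # hypothetical sensitive values
encrypted = [public_key.encrypt(s) for s in salaries]

# An untrusted party can sum and scale ciphertexts without seeing the data.
encrypted_total = sum(encrypted[1:], encrypted[0])
encrypted_mean = encrypted_total * (1 / len(salaries))

# Only the key holder can decrypt the result.
print(private_key.decrypt(encrypted_mean))  # ~57000
```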

Just as before, more pieces are needed before SMC can be implemented. A process of private set intersection is required to match up corresponding columns and records between different data sets (one data set uses first name as an identifier, while the other says employee first name, for example). That means revealing metadata to both companies, or to a third party. You also have to figure out the financial model: how do you structure a deal so that both parties benefit? The hub-and-spoke model (a definition can be found here under discussion) might be it, though it will take a strong commercial push to get everyone involved.
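To make the intersection step concrete, here is a toy Diffie-Hellman-style private set intersection sketch. Each party blinds hashed identifiers with a secret exponent; because exponentiation commutes, doubly-blinded values match exactly when the underlying identifiers match. The identifiers and modulus are illustrative, and a real deployment would use a vetted protocol and library:

```python
# A toy commutative-exponentiation PSI sketch. Not production cryptography.
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime, used here as a toy modulus

def h(item: str) -> int:
    """Hash an identifier into the group."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

a_key = secrets.randbelow(P - 2) + 1  # Party A's secret exponent
b_key = secrets.randbelow(P - 2) + 1  # Party B's secret exponent

party_a = {"alice@x.com", "bob@x.com", "carol@x.com"}
party_b = {"bob@x.com", "dave@x.com"}

# Each party sends its singly-blinded set; the other applies its own key.
a_once = {item: pow(h(item), a_key, P) for item in party_a}
b_once = {item: pow(h(item), b_key, P) for item in party_b}
a_twice = {pow(v, b_key, P) for v in a_once.values()}             # B re-blinds A's set
b_twice = {item: pow(v, a_key, P) for item, v in b_once.items()}  # A re-blinds B's set

# B learns only which of its own items fall in the intersection.
intersection = {item for item, v in b_twice.items() if v in a_twice}
print(intersection)  # {'bob@x.com'}
```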

These are the top technologies in the space today. Major venture capital funding is being raised for start-ups tackling these problems, and companies like Amazon are launching their own data marketplaces.

Privacy technology is an important part of our AI future, and it will play a fundamental role in shaping it. My next posts will go into further detail as I watch the space play out.
