De-identifying data is really hard, and it only gets harder over time. Say the NHS releases prescribing data: date, doctor, prescription, and a random identifier. That's a hugely useful dataset for medical research.
And say the next year, Addison Lee or another large minicab company suffers a breach (no human language contains the phrase “as secure as minicab IT”) that exposes the journeys many of those patients took to the appointments where their prescriptions were written.
Merge those two datasets and you re-identify many of the patients. Subsequent releases and breaches compound the problem, and there’s nothing the NHS can do to either predict or prevent a breach by a minicab company.
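The merge step is a classic linkage attack: the “anonymous” dataset and the leaked one share quasi-identifiers (here, date and clinic), and joining on them attaches real names to the random IDs. A minimal sketch, with entirely made-up records and field names (this is not a real NHS or minicab schema):

```python
# Hypothetical "de-identified" prescribing records: random ID, no names.
prescriptions = [
    {"patient_id": "a91f", "date": "2019-03-04", "clinic": "Camden Surgery", "drug": "X"},
    {"patient_id": "77c2", "date": "2019-03-04", "clinic": "Islington GP", "drug": "Y"},
]

# Hypothetical breached minicab records: real identities, trip destinations.
cab_journeys = [
    {"name": "Alice Smith", "date": "2019-03-04", "dropoff": "Camden Surgery"},
    {"name": "Bob Jones", "date": "2019-03-04", "dropoff": "Islington GP"},
]

# Join on the shared quasi-identifiers: anyone dropped at a clinic on the day
# a prescription was written there becomes a re-identification candidate.
reidentified = [
    {"name": j["name"], "drug": p["drug"]}
    for p in prescriptions
    for j in cab_journeys
    if j["date"] == p["date"] and j["dropoff"] == p["clinic"]
]

print(reidentified)
```

In real attacks the join is fuzzier (nearby timestamps, approximate locations) and produces candidate sets rather than exact matches, but each additional leaked dataset shrinks those sets further.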
Even if the NHS is confident in its anonymization, it can never be confident in the sturdiness of that anonymity over time.
Cory Doctorow discusses this problem of de-identified data losing its anonymity over time. He points to a paper in Nature on using generative models to estimate how easily individuals in de-identified datasets can be re-identified, and to a site built by Imperial College London that demonstrates how shaky the notion of de-identified data is.