Why Obfuscation Does Not Mean Security
Whether you choose to redact, mask, or anonymize a document or a form, you must be aware that this does not mean the data is now fully protected from
wandering eyes. For instance, consider the following data report from a hospital.
Patient Visit Report:
First Name | Last Name | Gender | Zip Code | Date of Birth | Chief Complaint
Eric | Gardner | Male | 89876 | 05/08/1978 | Red Eyes
Tony | Foyle | Male | 89877 | 09/07/1986 | Sore Throat
Amanda | Sampson | Female | 89879 | 11/23/1990 | Nasal congestion
Dona | Garcia | Female | 89872 | 02/17/1972 | Chest Pain
Masked Patient Visit Report:
First Name | Last Name | Gender | Zip Code | Date of Birth | Chief Complaint
EXXX | xxxxxxxer | Male | 89876 | XX/XX/1978 | Red Eyes
XXXy | Fxxxxxx | Male | 89877 | XX/XX/1986 | Sore Throat
Axxxxa | Sxxxxxxxxx | Female | 89879 | XX/XX/1990 | Nasal congestion
Doxxxxx | XXXXXia | Female | 89872 | XX/XX/1972 | Chest Pain
At first glance, the masked report appears to conceal the patients' identities. We might hand this data to a researcher with peace of mind and feel
quite comfortable that we have not violated any patient's data privacy rights under HIPAA.
But if this same report were handed to an insurance company investigator, he might very well be able to cross-reference it with existing data sets and
fully identify a person.
Thus it is important to remember that obfuscating identity is effective only when the consumer of the obfuscated data does not have access to
other pieces of data that are in some way associated with the shared data.
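The cross-referencing risk can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `auxiliary` list stands in for some outside data set and its extra entries are invented for this example, and `link_records` is a hypothetical helper name, not part of any library.

```python
# Fields left readable in the masked report above:
# (gender, zip code, birth year, chief complaint).
masked_report = [
    ("Male", "89876", 1978, "Red Eyes"),
    ("Male", "89877", 1986, "Sore Throat"),
    ("Female", "89879", 1990, "Nasal congestion"),
    ("Female", "89872", 1972, "Chest Pain"),
]

# Hypothetical outside data set (invented for illustration),
# for example a customer or marketing list that includes names.
auxiliary = [
    {"name": "Eric Gardner", "gender": "Male", "zip": "89876", "birth_year": 1978},
    {"name": "Evan Gould", "gender": "Male", "zip": "89876", "birth_year": 1981},
    {"name": "Dona Garcia", "gender": "Female", "zip": "89872", "birth_year": 1972},
    {"name": "Dana Green", "gender": "Female", "zip": "89871", "birth_year": 1972},
]


def link_records(report, outside):
    """Pair each masked row with the outside records that share
    its gender, zip code, and birth year."""
    links = []
    for gender, zip_code, year, complaint in report:
        matches = [
            person["name"]
            for person in outside
            if (person["gender"], person["zip"], person["birth_year"])
            == (gender, zip_code, year)
        ]
        links.append((complaint, matches))
    return links


for complaint, names in link_records(masked_report, auxiliary):
    print(complaint, "->", names)
```

A row that matches exactly one outside record has effectively been re-identified, chief complaint included, even though the name itself was masked.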
It is a bit like solving a multi-variable algebra problem in high school, as in the following examples.
Example 1: X = Y + 10, Z = Y + 30, Z = 40 -> Y = 10, X = 20
Example 2: K = 7 + J, J = J * 0 -> J = 0, K = 7
Example 3: A = B^2, A = 49 -> B = +/- 7
The point of the algebra examples above is that it is sometimes possible to pinpoint, or at least narrow down, the missing pieces of data by
cross-referencing the limited known data against other known pieces of data, or against values that are known to be ruled out.
In the patient's case, if I know the first letter of the first name, the gender, the postal code of the neighborhood where the person lives, and the
year of birth, I have already eliminated many possibilities from my search population. Imagine how quickly somebody with access to marketing databases
could come up with a list of the candidates most likely to fill in the missing data.
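The narrowing effect can be illustrated with a small Python sketch. The `population` list below is a made-up stand-in for a marketing database, and `narrow` is a hypothetical helper. Only the clues come from the masked row "Axxxxa | Female | 89879 | XX/XX/1990" in the report above.

```python
# Hypothetical marketing list (invented for illustration).
population = [
    {"first": "Amanda", "gender": "Female", "zip": "89879", "birth_year": 1990},
    {"first": "Alicia", "gender": "Female", "zip": "89879", "birth_year": 1985},
    {"first": "Andrew", "gender": "Male", "zip": "89879", "birth_year": 1990},
    {"first": "Amelia", "gender": "Female", "zip": "89872", "birth_year": 1990},
    {"first": "Brenda", "gender": "Female", "zip": "89879", "birth_year": 1990},
    {"first": "Carlos", "gender": "Male", "zip": "89876", "birth_year": 1978},
]

# What the masked row still reveals, one clue at a time.
clues = [
    ("first initial", lambda p: p["first"].startswith("A")),
    ("gender", lambda p: p["gender"] == "Female"),
    ("zip code", lambda p: p["zip"] == "89879"),
    ("birth year", lambda p: p["birth_year"] == 1990),
]


def narrow(people, filters):
    """Apply each clue in turn and record how many candidates survive."""
    sizes = [len(people)]
    for _label, keep in filters:
        people = [p for p in people if keep(p)]
        sizes.append(len(people))
    return people, sizes


candidates, sizes = narrow(population, clues)
print(sizes)       # candidate count after each clue: [6, 4, 3, 2, 1]
print(candidates)  # the single remaining record
```

Each clue looks harmless on its own, but in this toy list the four together leave exactly one candidate. A real database would be far larger, yet the same combination of attributes can still shrink it to a short list.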