Depersonalization of data is a growing issue for modelers as privacy concerns about consumers' data increase.  It is often necessary to de-associate personal identifiers from datasets or take other precautions to assure the anonymity of the individuals studied.  This is difficult because many of the fields we use in modeling, such as gender, date of birth, and zip code, can be used in combination to identify individuals.  A study by Latanya Sweeney showed that gender, date of birth, and zip code together uniquely identify about 87% of the US population.  To meet privacy concerns, removing the driver's license number, Social Security number, and full name is often not enough.


Here is an example. You are given two datasets: one has a demographic profile of each individual along with results from a medical study, and the other has full name, address, and date of birth.  The concern is that someone could link records across these datasets and so uniquely identify the individuals in the study.  As mentioned before, if both datasets contain gender, date of birth, and home zip code, most individuals can be matched uniquely.  In this case there has been no depersonalization.  If age had replaced date of birth in the study dataset, one-to-one identification across the datasets would not have been as easily achievable.
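The linkage described above can be sketched in a few lines of Python. This is only an illustration: the records, field names, reference date, and the helper functions age_on and link are all invented for the example and do not come from any real dataset or library.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records, made up purely for illustration.
identified = [
    {"name": "Ann Example", "gender": "F", "dob": date(1980, 3, 14), "zip": "15213"},
    {"name": "Bea Sample", "gender": "F", "dob": date(1980, 9, 2), "zip": "15213"},
    {"name": "Carl Placeholder", "gender": "M", "dob": date(1975, 6, 30), "zip": "15217"},
]
study = [
    {"gender": "F", "dob": date(1980, 3, 14), "zip": "15213", "result": "positive"},
    {"gender": "M", "dob": date(1975, 6, 30), "zip": "15217", "result": "negative"},
]

REFERENCE_DATE = date(2020, 1, 1)  # assumed date on which ages are computed


def age_on(dob, today):
    """Age in whole years on the given reference date."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))


def link(study_rows, identified_rows, study_key, identified_key):
    """For each study row, list the names of identified rows with the same key."""
    index = defaultdict(list)
    for row in identified_rows:
        index[identified_key(row)].append(row["name"])
    return [index.get(study_key(row), []) for row in study_rows]


# No depersonalization: both datasets carry the exact date of birth.
by_dob = link(
    study, identified,
    study_key=lambda r: (r["gender"], r["dob"], r["zip"]),
    identified_key=lambda r: (r["gender"], r["dob"], r["zip"]),
)

# Study dataset released with age in place of date of birth.
study_with_age = [
    {"gender": r["gender"], "age": age_on(r["dob"], REFERENCE_DATE),
     "zip": r["zip"], "result": r["result"]}
    for r in study
]
by_age = link(
    study_with_age, identified,
    study_key=lambda r: (r["gender"], r["age"], r["zip"]),
    identified_key=lambda r: (r["gender"], age_on(r["dob"], REFERENCE_DATE), r["zip"]),
)

for label, matches in (("date of birth", by_dob), ("age", by_age)):
    print(label, [len(names) for names in matches])
```

In this toy data, matching on date of birth links every study record to exactly one name, while matching on age leaves the first study record with two candidates. The second record is still unique even with age, which shows that coarsening a single field reduces the risk but does not by itself guarantee anonymity.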

Concept: K-anonymization

K-anonymization lets you talk about the degree to which records in one dataset can be related to records in another.  It is not the only measure of depersonalization and it has some issues, namely that finding an optimal k-anonymization is NP-hard, but it is an important concept to understand.  If each record in one dataset can be matched to k records in another dataset, the K-anonymization of that dataset is said to be k-1.  For example, if you can uniquely match each record across the two datasets (one-to-one matching), K-anonymization is zero.  If, however, many records can match a given record, K-anonymization is greater than zero.  A larger value indicates a greater degree of depersonalization of the study dataset.  When calculating the value, you use a full-information dataset and a study dataset that requires depersonalization.  Note that the privacy literature usually counts the record itself: a dataset is called k-anonymous when every record is indistinguishable from at least k-1 others on the identifying fields, so one-to-one matching corresponds to k = 1 under that convention.
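As a rough sketch of the calculation, the Python below counts, for each study record, how many records in the full-information dataset share its identifying fields, and then reports the value using the matches-minus-one convention described above. The function names match_counts and k_value, the field names, and the data are hypothetical. Taking the minimum over all study records, so that the most exposed record sets the value, is an assumption made for this example.

```python
from collections import Counter


def match_counts(study_rows, full_rows, fields):
    """For each study row, count the full rows with identical values in `fields`."""
    counts = Counter(tuple(row[f] for f in fields) for row in full_rows)
    return [counts[tuple(row[f] for f in fields)] for row in study_rows]


def k_value(study_rows, full_rows, fields):
    """Fewest matches over all study rows, minus one (one-to-one matching gives 0).

    Assumes study_rows is non-empty and that every study row appears in
    full_rows; a study row with no match at all would yield -1.
    """
    return min(match_counts(study_rows, full_rows, fields)) - 1


# Hypothetical data, made up purely for illustration.
full_rows = [
    {"gender": "F", "age": 39, "zip": "15213"},
    {"gender": "F", "age": 39, "zip": "15213"},
    {"gender": "M", "age": 44, "zip": "15217"},
    {"gender": "M", "age": 44, "zip": "15217"},
    {"gender": "M", "age": 44, "zip": "15217"},
]
study_rows = [
    {"gender": "F", "age": 39, "zip": "15213", "result": "positive"},
    {"gender": "M", "age": 44, "zip": "15217", "result": "negative"},
]

fields = ("gender", "age", "zip")
print(match_counts(study_rows, full_rows, fields))
print(k_value(study_rows, full_rows, fields))
```

Measuring the value for a given pair of datasets is cheap, as the sketch shows. The NP-hard part is the reverse problem: choosing how to generalize or suppress fields so that a target value is reached with as little loss of information as possible.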


Further Reading

L. Sweeney, Uniqueness of Simple Demographics in the U.S. Population (2000), Carnegie Mellon University, Laboratory for International Data Privacy