Hopping Hurdles: Anonymizing Personally Identifiable Data to Foster Collaboration

Last year, Data Driven Detroit entered a partnership with Microsoft and The Skillman Foundation to coordinate the collective use of data on opportunity youth – young adults, 16-24 years old, who are neither working nor enrolled in school or a vocational training program – to generate a data-driven response to the needs of this demographic. Accomplishing this mission means overcoming many challenges, and each one requires a new strategy, process, or tool.

One challenge to such coordination is the need to maintain data privacy. In fact, we found this to be the single greatest priority for our partners. To overcome this hurdle, we developed a computer program called the D3Anonymizer. The D3Anonymizer allows sensitive individual-level information to be shared and analyzed without compromising the privacy of individuals within the dataset, while still maintaining the ability to control for duplicate records. It does this by using three identifiable fields (first name, last name, and birthday) to generate an anonymized key field that is unique to the individual record but contains no identifiable information. It then removes the original identifiable fields, leaving an analyzable document without personally identifying information.

Let’s see how this works on an individual level. Say a record belongs to a person named Erica Powell with a birthday of 11-Jun-95. For the D3Anonymizer to work, the first name, last name, and birthday have to be in three separate columns:

|Erica|Powell|11-Jun-95|

The D3Anonymizer takes these columns and combines components extracted from each into a single column, removing the ability to identify the person while keeping the ability to identify a unique record.

|Erica|Powell|11-Jun-95| …… turns into …… |ecall199523|

How is the new field generated?

1)   All punctuation is removed from the first and last name and all letters are converted to lower case.

2)   The program recognizes the date, regardless of how it is formatted and even when the format varies from record to record.

3)   It determines the length of each name. If a name is shorter than 4 letters, it adds q’s to the end of the name until it is 4 letters long. This prevents short first or last names from being revealed.

4)   It takes the first letter and last two letters of the first name, along with the last two letters of the last name, and combines them (for Erica Powell: e + ca + ll = ecall).

5)   It adds the birth year, followed by the week number of the birth date, to the end of the new key (1995 and week 23 in the example).

6)   Lastly, the old columns are removed, leaving just the new key and the non-sensitive columns.
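
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the D3Anonymizer's actual code: the function names, the short list of accepted date formats, the use of ISO week numbering, and the zero-padded two-digit week are all assumptions. ISO weeks were chosen here only because they reproduce week 23 for the example birthday.

```python
from datetime import datetime

# Assumed date formats for this sketch; the real tool reportedly handles many more.
DATE_FORMATS = ["%d-%b-%y", "%Y-%m-%d", "%m/%d/%Y"]

def clean_name(name):
    """Steps 1 and 3: keep letters only, lower-case, pad with q's to 4 letters."""
    letters = "".join(ch for ch in name if ch.isalpha()).lower()
    return letters.ljust(4, "q")

def parse_birthday(text):
    """Step 2: try each assumed format until one matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def make_key(first, last, birthday):
    """Steps 4 and 5: name fragments, then birth year and week number."""
    first, last = clean_name(first), clean_name(last)
    birth = parse_birthday(birthday)
    week = birth.isocalendar()[1]  # assumption: ISO week numbering
    return f"{first[0]}{first[-2:]}{last[-2:]}{birth.year}{week:02d}"

print(make_key("Erica", "Powell", "11-Jun-95"))  # ecall199523
```

Under the same assumptions, a short name such as "Al" is first padded to "alqq", so the fragment taken from it is "aqq" rather than the full name.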

An optional extra layer of complexity, which can vary with each use of the D3Anonymizer, is to swap the first and last name columns, which produces a completely different key. For example:

|Erica|Powell|11-Jun-95| …… would become …… |pllca199523|
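
As a rough sketch of how the swap and the final column removal might fit together, the Python below builds the key for one record and drops the identifying fields. The `flip` parameter, the column names (`first_name`, `last_name`, `birth_date`, `program`), and the ISO week numbering are hypothetical choices for illustration; the birthday is assumed to be already parsed into a `date`.

```python
from datetime import date

SENSITIVE = ("first_name", "last_name", "birth_date")  # hypothetical column names

def name_part(name):
    """Letters only, lower-cased, padded with q's to at least 4 characters."""
    letters = "".join(ch for ch in name if ch.isalpha()).lower()
    return letters.ljust(4, "q")

def make_key(first, last, birth, flip=False):
    """Build the key; with flip=True the two names trade roles."""
    a, b = name_part(first), name_part(last)
    if flip:
        a, b = b, a
    week = birth.isocalendar()[1]  # assumption: ISO week numbering
    return f"{a[0]}{a[-2:]}{b[-2:]}{birth.year}{week:02d}"

def anonymize_row(row, flip=False):
    """Return a copy of the record with the key added and sensitive fields removed."""
    key = make_key(row["first_name"], row["last_name"], row["birth_date"], flip)
    kept = {k: v for k, v in row.items() if k not in SENSITIVE}
    return {"key": key, **kept}

record = {"first_name": "Erica", "last_name": "Powell",
          "birth_date": date(1995, 6, 11), "program": "GED"}
print(anonymize_row(record, flip=True))  # {'key': 'pllca199523', 'program': 'GED'}
```

Because the flip setting changes every key, all parties whose records need to be matched against each other would have to use the same setting.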

One of the greatest challenges faced by this collaborative effort, and by any collaborative effort involving individual-level data, is maintaining the privacy of sensitive information. The D3Anonymizer meets this challenge, allowing our data collaborators to protect the identities of their clients while still providing actionable information for decision making.