## It's everywhere!

## Stuff is totally insecure!

## It's really difficult!

# What topics to cover?

## A really, really vast field
# Defining privacy

## What does privacy mean?

- Many meanings of privacy
- Many kinds of "privacy breaches"
    - Obvious: third party learns your private data
    - Retention: you give data, company keeps it forever
    - Passive: you don't know your data is collected

## Why is privacy hard?

- Hard to pin down what privacy means!
- Once data is out, can't put it back into the bottle
- Privacy-preserving data release today may violate privacy tomorrow, when
  combined with "side information"
- Data may be used many times, often doesn't change
## Hiding private data

- Remove "personally identifiable information"
    - Name and age
    - Birthday
    - Social security number
    - ...
- Publish the "anonymized" or "sanitized" data (naive sketch below)
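To make the naive approach concrete, here is a minimal sketch of a sanitizer
that simply drops the PII fields and publishes the rest. The field names and
records are illustrative, not from any real dataset:

```python
# Naive "sanitizer": drop fields commonly treated as PII, publish the rest.
# Field names and records are made up for illustration.
PII_FIELDS = {"name", "age", "birthday", "ssn"}

def sanitize(records):
    return [{k: v for k, v in r.items() if k not in PII_FIELDS}
            for r in records]

records = [
    {"name": "Alice", "age": 34, "ssn": "123-45-6789", "diagnosis": "flu"},
    {"name": "Bob",   "age": 41, "ssn": "987-65-4321", "diagnosis": "none"},
]
print(sanitize(records))  # [{'diagnosis': 'flu'}, {'diagnosis': 'none'}]
```

As the next slides show, this is not enough: the remaining fields can still
identify people.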
## Problem: not enough

- Can match up anonymized data with public sources
    - *De-anonymize* data, associate names to records
- Really, really hard to think about side information
    - May not even be public at time of data release!

## Netflix challenge

- Database of movie ratings
    - Published: ID number, movie rating, and rating date
- Attack: from public IMDB ratings, recover names for Netflix data (toy sketch below)
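A toy sketch in the spirit of this attack: match "anonymized" IDs against a
public source by overlap in (movie, rating, date) triples. All records below
are invented, and exact-match overlap is a simplification of the real
statistical matching:

```python
# Toy linkage attack: re-identify anonymized IDs by matching their
# (movie, rating, date) triples against public profiles.

netflix = {  # anonymized release: user ID -> set of (movie, rating, date)
    17: {("Movie A", 5, "2005-03-01"), ("Movie B", 1, "2005-04-22")},
    42: {("Movie A", 3, "2005-06-11"), ("Movie C", 4, "2005-07-02")},
}
imdb = {  # public profiles: reviewer name -> set of (movie, rating, date)
    "alice": {("Movie A", 5, "2005-03-01"), ("Movie B", 1, "2005-04-22")},
    "bob": {("Movie C", 2, "2006-01-15")},
}

for uid, ratings in netflix.items():
    # Pick the public profile with the largest rating/date overlap.
    best = max(imdb, key=lambda name: len(ratings & imdb[name]))
    if ratings & imdb[best]:
        print(f"Netflix ID {uid} matches IMDB reviewer {best!r}")
# Only ID 17 is re-identified; ID 42 overlaps with no public profile.
```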
## "Blending in a crowd" |
|
|
|
|
- Only release records that are similar to others |
|
|
|
|
- *k-anonymity*: require at least k identical records |
|
|
|
|
- Other variants: *l-diversity*, *t-closeness*, ... |
|
|
|
|
|
|
|
|
|
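A minimal sketch of a k-anonymity check, assuming the caller already knows
which attributes count as quasi-identifiers (the attacker-linkable ones):

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(size >= k for size in groups.values())

# Generalized records: zip codes truncated, ages bucketed.
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "none"},
    {"zip": "902**", "age": "40-49", "diagnosis": "none"},
    {"zip": "902**", "age": "40-49", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "age"], k=2))  # True
```

The catch, as the next slide notes, is that checking one release says nothing
about combinations of releases.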
## Problem: composition

- Repeating k-anonymous releases may lose privacy (toy example below)
- Privacy protection may fall off a cliff
    - First few queries fine, then suddenly total violation
- Again, interacts poorly with side information
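A toy illustration with made-up data: two releases that are each 2-anonymous
on their own combine, with a little side information, to reveal one person's
diagnosis:

```python
rows = [
    {"name": "Alice", "decade": "30s", "zip": "02139", "diag": "flu"},
    {"name": "Bob",   "decade": "30s", "zip": "90210", "diag": "cold"},
    {"name": "Carol", "decade": "40s", "zip": "02139", "diag": "cancer"},
    {"name": "Dave",  "decade": "40s", "zip": "90210", "diag": "none"},
]

# Release 1 hides zip: both decade groups have 2 records -> 2-anonymous.
release1 = [(r["decade"], r["diag"]) for r in rows]
# Release 2 hides decade: both zip groups have 2 records -> 2-anonymous.
release2 = [(r["zip"], r["diag"]) for r in rows]

# Side information: Alice is in her 30s and lives in zip 02139.
from_release1 = {d for g, d in release1 if g == "30s"}    # {'flu', 'cold'}
from_release2 = {d for z, d in release2 if z == "02139"}  # {'flu', 'cancer'}
print(from_release1 & from_release2)  # {'flu'}: Alice's diagnosis leaks
```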
## Differential privacy

- Proposed by Dwork, McSherry, Nissim, Smith (2006)

> A new approach to formulating privacy goals: the risk to one’s privacy, or in
> general, any type of risk... should not substantially increase as a result of
> participating in a statistical database. This is captured by differential
> privacy.

## Basic setting

- Private data: set of records from individuals
    - Each individual: one record
    - Example: set of medical records
- Private query: function from database to output
    - Randomized: adds noise to protect privacy
## Basic definition

A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every two
databases $db, db'$ that differ in **one individual's record**, and for every
subset $S$ of outputs, we have:

$$
\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
$$
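A minimal sketch of one standard mechanism satisfying this definition with
$\delta = 0$: a counting query with Laplace noise. Changing one individual's
record changes the true count by at most 1 (sensitivity 1), so noise of scale
$1/\varepsilon$ gives $(\varepsilon, 0)$-differential privacy. The database
and predicate below are illustrative:

```python
import random

def private_count(db, predicate, epsilon):
    """Counting query with Laplace noise: (epsilon, 0)-differentially
    private, since one record changes the true count by at most 1."""
    true_count = sum(1 for record in db if predicate(record))
    # Laplace(scale = 1/epsilon) noise, sampled as the difference of
    # two independent exponentials with mean 1/epsilon.
    scale = 1.0 / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

db = [{"age": 34, "smoker": True}, {"age": 41, "smoker": False}]
print(private_count(db, lambda r: r["smoker"], epsilon=0.5))
```

Smaller $\varepsilon$ means more noise and a stronger guarantee.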