diff --git a/website/docs/resources/slides/images/broken.png b/website/docs/resources/slides/images/broken.png
new file mode 100644
index 0000000..1d2563a
Binary files /dev/null and b/website/docs/resources/slides/images/broken.png differ
diff --git a/website/docs/resources/slides/images/iot-cameras.png b/website/docs/resources/slides/images/iot-cameras.png
new file mode 100644
index 0000000..a1bee50
Binary files /dev/null and b/website/docs/resources/slides/images/iot-cameras.png differ
diff --git a/website/docs/resources/slides/images/netflix.png b/website/docs/resources/slides/images/netflix.png
new file mode 100644
index 0000000..59d8a4d
Binary files /dev/null and b/website/docs/resources/slides/images/netflix.png differ
diff --git a/website/docs/resources/slides/lecture01.md b/website/docs/resources/slides/lecture01.md
index a5dda20..0817bec 100644
--- a/website/docs/resources/slides/lecture01.md
+++ b/website/docs/resources/slides/lecture01.md
@@ -8,10 +8,16 @@ date: September 05, 2018
 
 ## It's everywhere!
 
+![](images/iot-cameras.png)
+
 ## Stuff is totally insecure!
 
+![](images/broken.png)
+
 ## It's really difficult!
 
+![](images/netflix.png)
+
 # What topics to cover?
 
 ## A really, really vast field
@@ -109,19 +115,68 @@ date: September 05, 2018
 # Defining privacy
 
 ## What does privacy mean?
-- Many meanings of privacy
+- Many kinds of "privacy breaches"
+    - Obvious: third party learns your private data
+    - Retention: you give data, company keeps it forever
+    - Passive: you don't know your data is collected
 
 ## Why is privacy hard?
+- Hard to pin down what privacy means!
+- Once data is out, can't put it back into the bottle
+- Privacy-preserving data release today may violate privacy tomorrow, combined
+  with "side-information"
+- Data may be used many times, often doesn't change
 
 ## Hiding private data
-- Remove "personally identifiable information"
+- Delete "personally identifiable information"
+    - Name and age
+    - Birthday
+    - Social security number
+    - ...
+- Publish the "anonymized" or "sanitized" data
 
 ## Problem: not enough
+- Can match up anonymized data with public sources
+- *De-anonymize* data, associate names to records
+- Really, really hard to think about side information
+    - May not even be public at time of data release!
+
+## Netflix challenge
+- Database of movie ratings
+- Published: ID number, movie rating, and rating date
+- Attack: from public IMDB ratings, recover names for Netflix data
 
 ## "Blending in a crowd"
+- Only release records that are similar to others
+- *k-anonymity*: require at least k identical records
+- Other variants: *l-diversity*, *t-closeness*, ...
 
 ## Problem: composition
+- Repeating k-anonymous releases may lose privacy
+- Privacy protection may fall off a cliff
+    - First few queries fine, then suddenly total violation
+- Again, interacts poorly with side-information
 
 ## Differential privacy
+- Proposed by Dwork, McSherry, Nissim, Smith (2006)
+
+> A new approach to formulating privacy goals: the risk to one’s privacy, or in
+> general, any type of risk... should not substantially increase as a result of
+> participating in a statistical database. This is captured by differential
+> privacy.
+
+## Basic setting
+- Private data: set of records from individuals
+    - Each individual: one record
+    - Example: set of medical records
+- Private query: function from database to output
+    - Randomized: adds noise to protect privacy
 
 ## Basic definition
+A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every two
+databases $db, db'$ that differ in **one individual's record**, and for every
+subset $S$ of outputs, we have:
+
+$$
+\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
+$$
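
Review note on the "Basic definition" slide: a small numeric example may make the inequality easier to read. The following is a minimal sketch, not part of the patch. It assumes a counting query with Laplace noise of scale $1/\varepsilon$, which the slides do not mention; the function names `private_count`, `laplace_cdf`, and `prob_in_interval` are hypothetical. Changing one record moves the true count by at most 1, so under this assumption the query should satisfy the definition with $\delta = 0$.

```python
import math
import random


def laplace_cdf(x: float, loc: float, scale: float) -> float:
    """CDF of the Laplace distribution with the given location and scale."""
    z = (x - loc) / scale
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def private_count(db: list[bool], epsilon: float, rng: random.Random) -> float:
    """Hypothetical private query Q: true count plus Laplace(1/epsilon) noise."""
    # The difference of two independent Exp(epsilon) draws is Laplace
    # distributed with location 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return sum(db) + noise


def prob_in_interval(db: list[bool], epsilon: float, lo: float, hi: float) -> float:
    """Exact Pr[Q(db) in [lo, hi]] for the noisy count above."""
    loc = sum(db)
    scale = 1.0 / epsilon
    return laplace_cdf(hi, loc, scale) - laplace_cdf(lo, loc, scale)


if __name__ == "__main__":
    db = [True, False, True]        # three individuals, one record each
    db_prime = [True, True, True]   # differs in one individual's record
    epsilon = 0.5

    # One randomized answer from the query.
    print(private_count(db, epsilon, random.Random(0)))

    # Compare both sides of the definition for the output set S = [2.5, 4.0].
    p = prob_in_interval(db, epsilon, 2.5, 4.0)
    q = prob_in_interval(db_prime, epsilon, 2.5, 4.0)
    print(p, math.exp(epsilon) * q)
```

The sketch checks the inequality on intervals only, which illustrates the definition but does not prove it for every subset $S$.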