---
author: Security and Privacy in Data Science (CS 763)
title: Course Welcome
date: September 04, 2019
---
# Security and Privacy
## It's everywhere!
![](images/iot-cameras.png)
## Stuff is totally insecure!
![](images/broken.png)
# What topics to cover?
## A really, really vast field
- Things we will not be able to cover:
    - Real-world attacks
    - Computer systems security
    - Defenses and countermeasures
    - Social aspects of security
    - Theoretical cryptography
    - ...
## Theme 1: Formalizing S&P
- Mathematically formalize notions of security
- Rigorously prove security
- Guarantee that certain breakages can't occur
> Remember: definitions are tricky things!
## Theme 2: Automating S&P
- Use computers to help build more secure systems
- Automatically check security properties
- Search for attacks and vulnerabilities
## Five modules
1. Differential privacy
2. Adversarial machine learning
3. Cryptography in machine learning
4. Algorithmic fairness
5. PL and verification
## This course is broad!
- Each module could be its own course
    - We won't be able to go super deep
    - You will probably get lost
- Our goal: broad survey of multiple areas
    - Lightning tour, focus on high points
> Hope: find a few things that interest you
## This course is technical!
- Approach each topic from a rigorous point of view
- Parts of "data science" with **provable guarantees**
- This is not a "theory course", but...
. . .
![](images/there-will-be-math.png)
# Differential privacy
##
![](images/privacy.png)
## A mathematically solid definition of privacy
- Simple and clean formal property
- Satisfied by many algorithms
- Degrades gracefully under composition
# Adversarial machine learning
##
![](images/aml.jpg)
## Manipulating ML systems
- Crafting examples to fool ML systems
- Messing with training data
- Extracting training information
# Cryptography in machine learning
##
![](images/crypto-ml.png)
## Crypto in data science
- Learning models without raw access to private data
- Collecting analytics data privately, at scale
- Side channels and implementation issues
- Verifiable execution of ML models
- Other topics (e.g., model watermarking)
# Algorithmic fairness
##
![](images/fairness.png)
## When is a program "fair"?
- Individual and group fairness
- Inherent tradeoffs and challenges
- Fairness in unsupervised learning
- Fairness and causal inference
# PL and verification
##
![](images/pl-verif.png)
## Proving correctness
- Programming languages for security and privacy
- Interpreting neural networks and ML models
- Verifying properties of neural networks
- Verifying probabilistic programs
# Tedious course details
## Lecture schedule
- First ten weeks: **lectures MWF**
    - Intensive lectures, get you up to speed
    - M: I will present
    - WF: You will present
- Last five weeks: **no lectures**
    - Intensive work on projects
    - I will be available to meet, one-on-one
> You must attend lectures and participate
## Class format
- Three components:
    1. Paper presentations
    2. Presentation summaries
    3. Final project
- Announcement/schedule/materials: on [website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
- Class mailing list: [compsci763-1-f19@lists.wisc.edu](mailto:compsci763-1-f19@lists.wisc.edu)
## Paper presentations
- In pairs, lead a discussion on a group of papers
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/presentations/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **before** presentation: meet with me
    - Come prepared with presentation materials
    - Run through your outline, I will give feedback
## Presentation summaries
- In pairs, prepare a written summary of another group's presentation
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/summaries/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **after** presentation: send me your summary
    - I will work with you to polish the report
    - Writeups will be shared with the class
## Final project
- In groups of three (or very rarely two)
- See website for [project details](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/project/)
- Key dates:
    - **October 11**: Milestone 1
    - **November 8**: Milestone 2
    - **End of class**: Final writeups and presentations
## Todos for you
0. Complete the [course survey](https://forms.gle/NvYx3BM7HVkuzYdG6)
1. Explore the [course website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
2. Think about which lecture you want to present
3. Think about which lecture you want to summarize
4. Form project groups and brainstorm topics
> Sign up for slots and projects [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
## We will move quickly
- First deadline: **next Monday, September 9**
    - Form paper and project groups
    - Signup sheet [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
    - Please: don't sign up for the same slot as another group
- First slot is soon: **next Friday, September 13**
    - Only slot for presenting differential privacy
    - I will help the first group prepare
# Defining privacy
## What does privacy mean?
- Many kinds of "privacy breaches"
    - Obvious: third party learns your private data
    - Retention: you give data, company keeps it forever
    - Passive: you don't know your data is collected
## Why is privacy hard?
- Hard to pin down what privacy means!
- Once data is out, can't put it back into the bottle
- Privacy-preserving data release today may violate privacy tomorrow, when combined with "side information"
- Data may be used many times, often doesn't change
## Hiding private data
- Delete "personally identifiable information"
    - Name and age
    - Birthday
    - Social security number
    - ...
- Publish the "anonymized" or "sanitized" data
## Problem: not enough
- Can match up anonymized data with public sources
- *De-anonymize* data, associate names to records
- Really, really hard to think about side information
    - May not even be public at time of data release!
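As a toy illustration, a minimal sketch of a linkage attack (Python, with hypothetical data; real attacks tolerate much fuzzier matches):

```python
def link(anon_records, public_records, join_keys):
    """Match "anonymized" records against a named public dataset:
    an anonymized record agreeing with exactly one public record
    on all join keys is re-identified."""
    matches = {}
    for anon in anon_records:
        candidates = [pub for pub in public_records
                      if all(pub[k] == anon[k] for k in join_keys)]
        if len(candidates) == 1:  # unique match: identity recovered
            matches[anon["id"]] = candidates[0]["name"]
    return matches

anon = [{"id": 17, "zip": "53706", "birthdate": "1990-05-01"}]
public = [{"name": "Alice", "zip": "53706", "birthdate": "1990-05-01"}]
print(link(anon, public, ["zip", "birthdate"]))  # {17: 'Alice'}
```

Famously, zip code, birth date, and sex alone are enough to uniquely identify a large fraction of the US population.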
## Netflix prize
- Database of movie ratings
    - Published: ID number, movie rating, and rating date
- Competition: predict which movies each ID will like
- Result
    - Tons of teams competed
    - Winner: beat Netflix's best by **10%**
> A triumph for machine learning contests!
##
![](images/netflix.png)
## Privacy flaw?
- Attack
    - Public info on IMDB: names, ratings, dates
    - Reconstruct names for Netflix IDs
- Result
    - Netflix settled a lawsuit ($10 million)
    - Netflix canceled future challenges
## "Blending in a crowd"
- Only release records that are similar to others
- *k-anonymity*: require at least k identical records
- Other variants: *l-diversity*, *t-closeness*, ...
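To make the requirement concrete, a minimal sketch (Python, hypothetical records) of the k-anonymity check: every combination of quasi-identifier values must occur at least k times.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check that every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(tuple(rec[a] for a in quasi_ids) for rec in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "53706", "age": 20, "diagnosis": "flu"},
    {"zip": "53706", "age": 20, "diagnosis": "cold"},
    {"zip": "53715", "age": 35, "diagnosis": "flu"},
]
# False: the (53715, 35) group contains only one record
print(is_k_anonymous(records, ["zip", "age"], k=2))
```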
## Problem: composition
- Repeating k-anonymous releases may lose privacy
- Privacy protection may fall off a cliff
- First few queries fine, then suddenly total violation
- Again, interacts poorly with side-information
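A toy illustration of the cliff (hypothetical data): two releases that are each 2-anonymous on their own can single out one record when combined.

```python
# Each release maps generalized quasi-identifiers to the set of
# record IDs in that group; every group has size >= 2.
release_by_zip = {"53706": {"r1", "r2"}, "53715": {"r3", "r4"}}
release_by_age = {"20s": {"r1", "r3"}, "30s": {"r2", "r4"}}

# Side information: the target lives in 53706 and is in their 20s.
print(release_by_zip["53706"] & release_by_age["20s"])  # {'r1'}
```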
# Differential privacy
## Yet another privacy definition
> A new approach to formulating privacy goals: the risk to one's privacy, or in
> general, any type of risk... should not substantially increase as a result of
> participating in a statistical database. This is captured by differential
> privacy.
- Proposed by Dwork, McSherry, Nissim, Smith (2006)
## Basic setting
- Private data: set of records from individuals
    - Each individual: one record
    - Example: set of medical records
- Private query: function from database to output
    - Randomized: adds noise to protect privacy
## Basic definition
A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every two
databases $db, db'$ that differ in **one individual's record**, and for every
subset $S$ of outputs, we have:
$$
\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
$$
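For a first example, a minimal sketch (Python/NumPy, hypothetical data) of the classic Laplace mechanism on a counting query: changing one record changes the true count by at most 1 (sensitivity 1), so Laplace noise with scale $1/\varepsilon$ gives $(\varepsilon, 0)$-differential privacy.

```python
import numpy as np

def laplace_count(db, predicate, epsilon):
    """Answer a counting query with Laplace noise: a sensitivity-1
    query plus Laplace(1/epsilon) noise is (epsilon, 0)-DP."""
    true_count = sum(1 for record in db if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical database: how many salaries exceed 50k?
db = [{"salary": 42_000}, {"salary": 73_000}, {"salary": 55_000}]
print(laplace_count(db, lambda r: r["salary"] > 50_000, epsilon=0.5))
```

Smaller $\varepsilon$ means more noise and a stronger guarantee.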
## Basic reading
> Output of program doesn't depend too much on any single person's data
- Property of the algorithm/query/program
    - No: "this data is differentially private"
    - Yes: "this query is differentially private"