---
author: Security and Privacy in Data Science (CS 763)
title: Course Welcome
date: September 04, 2019
---
# Security and Privacy
## It's everywhere!
![](images/iot-cameras.png)
## Stuff is totally insecure!
![](images/broken.png)
# What topics to cover?
## A really, really vast field
- Things we will not be able to cover:
    - Real-world attacks
    - Computer systems security
    - Defenses and countermeasures
    - Social aspects of security
    - Theoretical cryptography
    - ...
## Theme 1: Formalizing S&P
- Mathematically formalize notions of security
- Rigorously prove security
- Guarantee that certain breakages can't occur
> Remember: definitions are tricky things!
## Theme 2: Automating S&P
- Use computers to help build more secure systems
- Automatically check security properties
- Search for attacks and vulnerabilities
## Five modules
1. Differential privacy
2. Adversarial machine learning
3. Cryptography in machine learning
4. Algorithmic fairness
5. PL and verification
## This course is broad!
- Each module could be its own course
    - We won't be able to go super deep
    - You will probably get lost
- Our goal: broad survey of multiple areas
    - Lightning tour, focus on high points
> Hope: find a few things that interest you
## This course is technical!
- Approach each topic from a rigorous point of view
- Parts of "data science" with **provable guarantees**
- This is not a "theory course", but...
. . .
![](images/there-will-be-math.png)
# Differential privacy
##
![](images/privacy.png)
## A mathematically solid definition of privacy
- Simple and clean formal property
- Satisfied by many algorithms
- Degrades gracefully under composition
# Adversarial machine learning
##
![](images/aml.jpg)
## Manipulating ML systems
- Crafting examples to fool ML systems
- Messing with training data
- Extracting training information
# Cryptography in machine learning
##
![](images/crypto-ml.png)
## Crypto in data science
- Learning models without raw access to private data
- Collecting analytics data privately, at scale
- Side channels and implementation issues
- Verifiable execution of ML models
- Other topics (e.g., model watermarking)
# Algorithmic fairness
##
![](images/fairness.png)
## When is a program "fair"?
- Individual and group fairness
- Inherent tradeoffs and challenges
- Fairness in unsupervised learning
- Fairness and causal inference
# PL and verification
##
![](images/pl-verif.png)
## Proving correctness
- Programming languages for security and privacy
- Interpreting neural networks and ML models
- Verifying properties of neural networks
- Verifying probabilistic programs
# Tedious course details
## Lecture schedule
- First ten weeks: **lectures MWF**
    - Intensive lectures, get you up to speed
    - M: I will present
    - WF: You will present
- Last five weeks: **no lectures**
    - Intensive work on projects
    - I will be available to meet, one-on-one
> You must attend lectures and participate
## Class format
- Three components:
    1. Paper presentations
    2. Presentation summaries
    3. Final project
- Announcement/schedule/materials: on [website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
- Class mailing list: [compsci763-1-f19@lists.wisc.edu](mailto:compsci763-1-f19@lists.wisc.edu)
## Paper presentations
- In pairs, lead a discussion on a group of papers
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/presentations/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **before** presentation: meet with me
    - Come prepared with presentation materials
    - Run through your outline, I will give feedback
## Presentation summaries
- In pairs, prepare a written summary of another group's presentation
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/summaries/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **after** presentation: send me your summary
    - I will work with you to polish the report
    - Writeups will be shared with the class
## Final project
- In groups of three (or very rarely two)
- See website for [project details](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/project/)
- Key dates:
    - **October 11**: Milestone 1
    - **November 8**: Milestone 2
    - **End of class**: Final writeups and presentations
## Todos for you
0. Complete the [course survey](https://forms.gle/NvYx3BM7HVkuzYdG6)
1. Explore the [course website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
2. Think about which lecture you want to present
3. Think about which lecture you want to summarize
4. Form project groups and brainstorm topics
> Sign up for slots and projects [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
## We will move quickly
- First deadline: **next Monday, September 9**
    - Form paper and project groups
    - Signup sheet [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
    - Please: don't sign up for the same slot as another group
- First slot is soon: **next Friday, September 13**
    - Only slot for presenting differential privacy
    - I will help the first group prepare
# Defining privacy
## What does privacy mean?
- Many kinds of "privacy breaches"
    - Obvious: third party learns your private data
    - Retention: you give data, company keeps it forever
    - Passive: you don't know your data is collected
## Why is privacy hard?
- Hard to pin down what privacy means!
- Once data is out, can't put it back into the bottle
- Privacy-preserving data release today may violate privacy tomorrow, when combined with "side information"
- Data may be used many times, often doesn't change
## Hiding private data
- Delete "personally identifiable information"
    - Name and age
    - Birthday
    - Social security number
    - ...
- Publish the "anonymized" or "sanitized" data
## Problem: not enough
- Can match up anonymized data with public sources
- *De-anonymize* data, associate names to records
- Really, really hard to think about side information
    - May not even be public at time of data release!
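As a toy illustration, a minimal sketch of a linkage attack (Python, with hypothetical data; real attacks tolerate much fuzzier matches):

```python
def link(anon_records, public_records, join_keys):
    """Match "anonymized" records against a named public dataset:
    an anonymized record agreeing with exactly one public record
    on all join keys is re-identified."""
    matches = {}
    for anon in anon_records:
        candidates = [pub for pub in public_records
                      if all(pub[k] == anon[k] for k in join_keys)]
        if len(candidates) == 1:  # unique match: identity recovered
            matches[anon["id"]] = candidates[0]["name"]
    return matches

anon = [{"id": 17, "zip": "53706", "birthdate": "1990-05-01"}]
public = [{"name": "Alice", "zip": "53706", "birthdate": "1990-05-01"}]
print(link(anon, public, ["zip", "birthdate"]))  # {17: 'Alice'}
```

Famously, zip code, birth date, and sex alone are enough to uniquely identify a large fraction of the US population.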
## Netflix prize
- Database of movie ratings
    - Published: ID number, movie rating, and rating date
- Competition: predict which movies each ID will like
- Result
    - Tons of teams competed
    - Winner: beat Netflix's best by **10%**
> A triumph for machine learning contests!
##
![](images/netflix.png)
## Privacy flaw?
- Attack
    - Public info on IMDB: names, ratings, dates
    - Reconstruct names for Netflix IDs
- Result
    - Netflix settled a lawsuit ($10 million)
    - Netflix canceled future challenges
## "Blending in a crowd"
- Only release records that are similar to others
- *k-anonymity*: require at least k identical records
- Other variants: *l-diversity*, *t-closeness*, ...
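To make the requirement concrete, a minimal sketch (Python, hypothetical records) of the k-anonymity check: every combination of quasi-identifier values must occur at least k times.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check that every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(tuple(rec[a] for a in quasi_ids) for rec in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "53706", "age": 20, "diagnosis": "flu"},
    {"zip": "53706", "age": 20, "diagnosis": "cold"},
    {"zip": "53715", "age": 35, "diagnosis": "flu"},
]
# False: the (53715, 35) group contains only one record
print(is_k_anonymous(records, ["zip", "age"], k=2))
```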
## Problem: composition
- Repeating k-anonymous releases may lose privacy
- Privacy protection may fall off a cliff
- First few queries fine, then suddenly total violation
- Again, interacts poorly with side-information
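A toy illustration of the cliff (hypothetical data): two releases that are each 2-anonymous on their own can single out one record when combined.

```python
# Each release maps generalized quasi-identifiers to the set of
# record IDs in that group; every group has size >= 2.
release_by_zip = {"53706": {"r1", "r2"}, "53715": {"r3", "r4"}}
release_by_age = {"20s": {"r1", "r3"}, "30s": {"r2", "r4"}}

# Side information: the target lives in 53706 and is in their 20s.
print(release_by_zip["53706"] & release_by_age["20s"])  # {'r1'}
```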
# Differential privacy
## Yet another privacy definition
> A new approach to formulating privacy goals: the risk to one's privacy, or in
> general, any type of risk... should not substantially increase as a result of
> participating in a statistical database. This is captured by differential
> privacy.
- Proposed by Dwork, McSherry, Nissim, Smith (2006)
## Basic setting
- Private data: set of records from individuals
    - Each individual: one record
    - Example: set of medical records
- Private query: function from database to output
    - Randomized: adds noise to protect privacy
## Basic definition
A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every two
databases $db, db'$ that differ in **one individual's record**, and for every
subset $S$ of outputs, we have:
$$
\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
$$
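For a first example, a minimal sketch (Python/NumPy, hypothetical data) of the classic Laplace mechanism on a counting query: changing one record changes the true count by at most 1 (sensitivity 1), so Laplace noise with scale $1/\varepsilon$ gives $(\varepsilon, 0)$-differential privacy.

```python
import numpy as np

def laplace_count(db, predicate, epsilon):
    """Answer a counting query with Laplace noise: a sensitivity-1
    query plus Laplace(1/epsilon) noise is (epsilon, 0)-DP."""
    true_count = sum(1 for record in db if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical database: how many salaries exceed 50k?
db = [{"salary": 42_000}, {"salary": 73_000}, {"salary": 55_000}]
print(laplace_count(db, lambda r: r["salary"] > 50_000, epsilon=0.5))
```

Smaller $\varepsilon$ means more noise and a stronger guarantee.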
## Basic reading
> Output of program doesn't depend too much on any single person's data
- Property of the algorithm/query/program
    - No: "this data is differentially private"
    - Yes: "this query is differentially private"