This repository has been archived on 2024-11-04. You can view files and clone it, but cannot push or open issues or pull requests.
cs763/website/docs/resources/slides/lecture-welcome.md

283 lines
8.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
author: Security and Privacy in Data Science (CS 763)
title: Course Welcome
date: September 04, 2019
---
# Security and Privacy
## It's everywhere!
![](images/iot-cameras.png)
## Stuff is totally insecure!
![](images/broken.png)
# What topics to cover?
## A really, really vast field
- Things we will not be able to cover:
- Real-world attacks
- Computer systems security
- Defenses and countermeasures
- Social aspects of security
- Theoretical cryptography
- ...
## Theme 1: Formalizing S&P
- Mathematically formalize notions of security
- Rigorously prove security
- Guarantee that certain breakages can't occur
> Remember: definitions are tricky things!
## Theme 2: Automating S&P
- Use computers to help build more secure systems
- Automatically check security properties
- Search for attacks and vulnerabilities
## Five modules
1. Differential privacy
2. Adversarial machine learning
3. Crytpography in machine learning
4. Algorithmic fairness
5. PL and verification
## This course is broad!
- Each module could be its own course
- We won't be able to go super deep
- You will probably get lost
- Our goal: broad survey of multiple areas
- Lightning tour, focus on high points
> Hope: find a few things that interest you
## This course is technical!
- Approach each topic from a rigorous point of view
- Parts of "data science" with **provable guarantees**
- This is not a "theory course", but...
. . .
![](images/there-will-be-math.png)
# Differential privacy
##
![](images/privacy.png)
## A mathematically solid definition of privacy
- Simple and clean formal property
- Satisfied by many algorithms
- Degrades gracefully under composition
# Adversarial machine learning
##
![](images/aml.jpg)
## Manipulating ML systems
- Crafting examples to fool ML systems
- Messing with training data
- Extracting training information
# Cryptography in machine learning
##
![](images/crypto-ml.png)
## Crypto in data science
- Learning models without raw access to private data
- Collecting analytics data privately, at scale
- Side channels and implementation issues
- Verifiable execution of ML models
- Other topics (e.g., model watermarking)
# Algorithmic fairness
##
![](images/fairness.png)
## When is a program "fair"?
- Individual and group fairness
- Inherent tradeoffs and challenges
- Fairness in unsupervised learning
- Fairness and causal inference
# PL and verification
##
![](images/pl-verif.png)
## Proving correctness
- Programming languages for security and privacy
- Interpreting neural networks and ML models
- Verifying properties of neural networks
- Verifying probabilistic programs
# Tedious course details
## Lecture schedule
- First ten weeks: **lectures MWF**
- Intensive lectures, get you up to speed
- M: I will present
- WF: You will present
- Last five weeks: **no lectures**
- Intensive work on projects
- I will be available to meet, one-on-one
> You must attend lectures and participate
## Class format
- Three components:
1. Paper presentations
2. Presentation summaries
3. Final project
- Announcement/schedule/materials: on [website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
- Class mailing list: [compsci763-1-f19@lists.wisc.edu]()
## Paper presentations
- In pairs, lead a discussion on group of papers
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/presentations/jjj)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **before** presentation: meet with me
- Come prepared with presentation materials
- Run through your outline, I will give feedback
## Presentation summaries
- In pairs, prepare written summary of another group
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/summaries/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **after** presentation: send me summary
- I will work with you to polish report
- Writeups will be shared with the class
## Final project
- In groups of three (or very rarely two)
- See website for [project details](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/project/)
- Key dates:
- **October 11**: Milestone 1
- **November 8**: Milestone 2
- **End of class**: Final writeups and presentations
## Todos for you
0. Complete the [course survey](https://forms.gle/NvYx3BM7HVkuzYdG6)
1. Explore the [course website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
2. Think about which lecture you want to present
3. Think about which lecture you want to summarize
4. Form project groups and brainstorm topics
> Signup for slots and projects [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
## We will move quickly
- First deadline: **next Monday, September 9**
- Form paper and project groups
- Signup sheet [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
- Please: don't sign up for the same slot
- First slot is soon: **next Friday, September 13**
- Only slot for presenting differential privacy
- I will help the first group prepare
# Defining privacy
## What does privacy mean?
- Many kinds of "privacy breaches"
- Obvious: third party learns your private data
- Retention: you give data, company keeps it forever
- Passive: you don't know your data is collected
## Why is privacy hard?
- Hard to pin down what privacy means!
- Once data is out, can't put it back into the bottle
- Privacy-preserving data release today may violate privacy tomorrow, combined
with "side-information"
- Data may be used many times, often doesn't change
## Hiding private data
- Delete "personally identifiable information"
- Name and age
- Birthday
- Social security number
- ...
- Publish the "anonymized" or "sanitized" data
## Problem: not enough
- Can match up anonymized data with public sources
- *De-anonymize* data, associate names to records
- Really, really hard to think about side information
- May not even be public at time of data release!
## Netflix prize
- Database of movie ratings
- Published: ID number, movie rating, and rating date
- Competition: predict which movies IDs will like
- Result
- Tons of teams competed
- Winner: beat Netflix's best by **10%**
> A triumph for machine learning contests!
##
![](images/netflix.png)
## Privacy flaw?
- Attack
- Public info on IMDB: names, ratings, dates
- Reconstruct names for Netflix IDs
- Result
- Netflix settled lawsuit ($10 million)
- Netflix canceled future challenges
## "Blending in a crowd"
- Only release records that are similar to others
- *k-anonymity*: require at least k identical records
- Other variants: *l-diversity*, *t-closeness*, ...
## Problem: composition
- Repeating k-anonymous releases may lose privacy
- Privacy protection may fall off a cliff
- First few queries fine, then suddenly total violation
- Again, interacts poorly with side-information
# Differential privacy
## Yet another privacy definition
> A new approach to formulating privacy goals: the risk to ones privacy, or in
> general, any type of risk... should not substantially increase as a result of
> participating in a statistical database. This is captured by differential
> privacy.
- Proposed by Dwork, McSherry, Nissim, Smith (2006)
## Basic setting
- Private data: set of records from individuals
- Each individual: one record
- Example: set of medical records
- Private query: function from database to output
- Randomized: adds noise to protect privacy
## Basic definition
A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every two
databases $db, db'$ that differ in **one individual's record**, and for every
subset $S$ of outputs, we have:
$$
\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
$$
## Basic reading
> Output of program doesn't depend too much on any single person's data
- Property of the algorithm/query/program
- No: "this data is differentially private"
- Yes: "this query is differentially private"