---
author: Security and Privacy in Data Science (CS 763)
title: Course Welcome
date: September 02, 2020
---

# Security and Privacy

## It's everywhere!

![](images/iot-cameras.png)

## Stuff is totally insecure!

![](images/broken.png)

# What topics to cover?

## A really, really vast field

- Things we will not be able to cover:
    - Real-world attacks
    - Computer systems security
    - Defenses and countermeasures
    - Social aspects of security
    - Theoretical cryptography
    - ...

## Theme 1: Formalizing S&P

- Mathematically formalize notions of security
- Rigorously prove security
- Guarantee that certain breakages can't occur

> Remember: definitions are tricky things!

## Theme 2: Automating S&P

- Use computers to help build more secure systems
- Automatically check security properties
- Search for attacks and vulnerabilities

## Five modules

1. Differential privacy
2. Adversarial machine learning
3. Cryptography in machine learning
4. Algorithmic fairness
5. PL and verification

## This course is broad!

- Each module could be its own course
    - We won't be able to go super deep
    - You will probably get lost
- Our goal: broad survey of multiple areas
    - Lightning tour, focus on high points

> Hope: find a few things that interest you

## This course is technical!

- Approach each topic from a rigorous point of view
- Parts of "data science" with **provable guarantees**
- This is not a "theory course", but...

. . .

![](images/there-will-be-math.png)

# Differential privacy

##

![](images/privacy.png)

## A mathematically solid definition of privacy

- Simple and clean formal property
- Satisfied by many algorithms
- Degrades gracefully under composition

# Adversarial machine learning

##

![](images/aml.jpg)

## Manipulating ML systems

- Crafting examples to fool ML systems
- Messing with training data
- Extracting training information

# Cryptography in machine learning

##

![](images/crypto-ml.png)

## Crypto in data science

- Learning models without raw access to private data
- Collecting analytics data privately, at scale
- Side channels and implementation issues
- Verifiable execution of ML models
- Other topics (e.g., model watermarking)

# Algorithmic fairness

##

![](images/fairness.png)

## When is a program "fair"?

- Individual and group fairness
- Inherent tradeoffs and challenges
- Fairness in unsupervised learning
- Fairness and causal inference

# PL and verification

##

![](images/pl-verif.png)

## Proving correctness

- Programming languages for security and privacy
- Interpreting neural networks and ML models
- Verifying properties of neural networks
- Verifying probabilistic programs

# Tedious course details

## Lecture schedule

- First ten weeks: **lectures MWF**
    - Intensive lectures, get you up to speed
    - M: I will present
    - WF: You will present
- Last five weeks: **no lectures**
    - Intensive work on projects
    - I will be available to meet, one-on-one

> You must attend lectures and participate

## Class format

- Three components:
    1. Paper presentations
    2. Presentation summaries
    3. Final project
- Announcements/schedule/materials: on the [website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
- Class mailing list: [compsci763-1-f19@lists.wisc.edu](mailto:compsci763-1-f19@lists.wisc.edu)

## Paper presentations

- In pairs, lead a discussion on a group of papers
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/presentations/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **before** presentation: meet with me
    - Come prepared with presentation materials
    - Run through your outline, I will give feedback

## Presentation summaries

- In pairs, prepare a written summary of another group's presentation
- See website for [detailed instructions](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/summaries/)
- See website for [schedule of topics](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/schedule/lectures/)
- One week **after** presentation: send me summary
    - I will work with you to polish the report
    - Writeups will be shared with the class

## Final project

- In groups of three (or very rarely two)
- See website for [project details](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/assignments/project/)
- Key dates:
    - **October 11**: Milestone 1
    - **November 8**: Milestone 2
    - **End of class**: Final writeups and presentations

## Todos for you

0. Complete the [course survey](https://forms.gle/NvYx3BM7HVkuzYdG6)
1. Explore the [course website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs763/)
2. Think about which lecture you want to present
3. Think about which lecture you want to summarize
4. Form project groups and brainstorm topics

> Sign up for slots and projects [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)

## We will move quickly

- First deadline: **next Monday, September 9**
    - Form paper and project groups
    - Signup sheet [here](https://docs.google.com/spreadsheets/d/1hSbRy0mo3PjlozN0Ph1JkP5JwlRG8y7ukuCdorofncA/edit?usp=sharing)
    - Please: don't sign up for the same slot
- First slot is soon: **next Friday, September 13**
    - Only slot for presenting differential privacy
    - I will help the first group prepare

# Defining privacy

## What does privacy mean?

- Many kinds of "privacy breaches"
    - Obvious: a third party learns your private data
    - Retention: you give data, the company keeps it forever
    - Passive: you don't know your data is collected

## Why is privacy hard?

- Hard to pin down what privacy means!
- Once data is out, it can't be put back in the bottle
    - A privacy-preserving data release today may violate privacy tomorrow, when combined with "side information"
- Data may be used many times, often doesn't change

## Hiding private data

- Delete "personally identifiable information"
    - Name and age
    - Birthday
    - Social security number
    - ...
- Publish the "anonymized" or "sanitized" data (see the sketch below)
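
In code, this strawman pipeline is just a column drop. A minimal pandas sketch (the table, column names, and values here are all hypothetical):

```python
import pandas as pd

# Hypothetical table of medical records; all names and values are made up.
records = pd.DataFrame({
    "name":      ["Alice", "Bob"],
    "birthday":  ["1985-02-03", "1990-07-21"],
    "ssn":       ["123-45-6789", "987-65-4321"],
    "zip_code":  ["53706", "53715"],
    "diagnosis": ["flu", "asthma"],
})

# "Sanitize" by deleting the personally identifiable columns...
sanitized = records.drop(columns=["name", "birthday", "ssn"])

# ...and publish the rest. As the next slide argues, this is not enough:
# the surviving columns can often be linked back to public data.
print(sanitized)
```
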
## Problem: not enough

- Can match up anonymized data with public sources
    - *De-anonymize* data, associate names to records
- Really, really hard to think about side information
    - May not even be public at time of data release!

## Netflix prize

- Database of movie ratings
    - Published: ID number, movie rating, and rating date
    - Competition: predict which movies IDs will like
- Result
    - Tons of teams competed
    - Winner: beat Netflix's best by **10%**

> A triumph for machine learning contests!

##

![](images/netflix.png)

## Privacy flaw?

- Attack
    - Public info on IMDB: names, ratings, dates
    - Reconstruct names for Netflix IDs
- Result
    - Netflix settled lawsuit ($10 million)
    - Netflix canceled future challenges

## "Blending in a crowd"

- Only release records that are similar to others
- *k-anonymity*: require at least $k$ identical records (see the sketch below)
- Other variants: *l-diversity*, *t-closeness*, ...
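
One concrete reading of the definition: group released records by their quasi-identifier attributes and require every group to have size at least $k$. A minimal Python sketch of such a check (the records and column names are hypothetical):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in groups.values())

# Hypothetical sanitized records; zip code and age act as quasi-identifiers.
rows = [
    {"zip": "537**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "537**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "537**", "age": "30-39", "diagnosis": "flu"},
]

print(is_k_anonymous(rows, ["zip", "age"], k=2))  # False: last record is unique
```
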
## Problem: composition

- Repeating k-anonymous releases may lose privacy
- Privacy protection may fall off a cliff
    - First few queries fine, then suddenly total violation
- Again, interacts poorly with side information

# Differential privacy

## Yet another privacy definition

> A new approach to formulating privacy goals: the risk to one's privacy, or
> in general, any type of risk... should not substantially increase as a
> result of participating in a statistical database. This is captured by
> differential privacy.

- Proposed by Dwork, McSherry, Nissim, Smith (2006)

## Basic setting

- Private data: set of records from individuals
    - Each individual: one record
    - Example: set of medical records
- Private query: function from database to output
    - Randomized: adds noise to protect privacy (see the sketch below)

## Basic definition

A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every
two databases $db, db'$ that differ in **one individual's record**, and for
every subset $S$ of outputs, we have:

$$
\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
$$

## Basic reading

> Output of the program doesn't depend too much on any single person's data

- Property of the algorithm/query/program
    - No: "this data is differentially private"
    - Yes: "this query is differentially private"
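
One classic query satisfying the definition with $\delta = 0$ is the Laplace mechanism: answer a counting query, then add Laplace noise of scale $1/\varepsilon$. A minimal numpy sketch on a hypothetical database (not production-ready; real implementations must also handle floating-point subtleties):

```python
import numpy as np

def laplace_count(db, predicate, epsilon):
    """Noisy count of the records in db satisfying `predicate`.

    A counting query has sensitivity 1: adding or removing one
    individual's record changes the true count by at most 1. Laplace
    noise with scale 1/epsilon then gives (epsilon, 0)-differential
    privacy.
    """
    true_count = sum(1 for record in db if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical medical database: one record per individual.
db = [{"age": 34, "diagnosis": "flu"},
      {"age": 51, "diagnosis": "asthma"},
      {"age": 47, "diagnosis": "flu"}]

# How many patients have the flu? (epsilon = 0.5)
print(laplace_count(db, lambda r: r["diagnosis"] == "flu", epsilon=0.5))
```

Note that the privacy guarantee belongs to the query `laplace_count`, not to the data it touches, exactly matching the reading above.
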