UWisconsin CS 763: Security and Privacy in Data Science (Previously CS 839: Topics in Security and Privacy)

---
author: Topics in Security and Privacy Technologies (CS 839)
title: Course Welcome
date: September 05, 2018
---
# Security and Privacy
## It's everywhere!
![](images/iot-cameras.png)
## Stuff is totally insecure!
![](images/broken.png)
## It's really difficult!
![](images/netflix.png)
# What topics to cover?
## A really, really vast field
- Things we will not be able to cover:
    - Real-world attacks
    - Computer systems security
    - Defenses and countermeasures
    - Social aspects of security
    - Theoretical cryptography
    - ...
## Theme 1: Formalizing S&P
- Mathematically formalize notions of security
- Rigorously prove security
- Guarantee that certain breakages can't occur
> Remember: definitions are tricky things!
## Theme 2: Automating S&P
- Use computers to help build more secure systems
- Automatically check security properties
- Search for attacks and vulnerabilities
## Our focus: four modules
1. Differential privacy
2. Applied cryptography
3. Language-based security
4. Adversarial machine learning
# Differential privacy
## A mathematically solid definition of privacy
- Simple and clean formal property
- Satisfied by many algorithms
- Degrades gracefully under composition
# Applied crypto
## Computing in an untrusted world
- Proving you know something without revealing it
- Certifying that you did a computation correctly
- Computing on encrypted data, without decryption
- Computing joint answer without revealing your data
# Language-based security
## Ensure security by construction
- Programming languages for security
- Compiler checks that programs are secure
- Information flow, privacy, cryptography, ...
# Adversarial machine learning
## Manipulating ML systems
- Crafting examples to fool ML systems
- Messing with training data
- Extracting training information
# Tedious course details
## Class format
- Three components:
    1. Paper presentations
    2. Final project
    3. Class participation
- Announcements/schedule/materials: on [website](https://pages.cs.wisc.edu/~justhsu/teaching/current/cs839/)
- Class mailing list: [compsci839-1-f18@lists.wisc.edu](mailto:compsci839-1-f18@lists.wisc.edu)
## Paper presentations
- Sign up to lead a discussion on one paper
- Suggested topic, papers, and schedule on website
- Before each presentation:
    - I will send out brief questions
    - Please email me brief answers
> If you want advice, come talk to me!
## Final project
- Work individually or in pairs
- Project details and suggestions on website
- Key dates:
    - **September 19**: Pick groups and topic
    - **October 15**: Milestone 1
    - **November 14**: Milestone 2
    - **End of class**: Final writeups and presentations
> If you want advice, come talk to me!
## Todos for you
0. Complete the course survey
1. Check out the course website
2. Think about what paper you want to present
3. Brainstorm project topics
# Defining privacy
## What does privacy mean?
- Many kinds of "privacy breaches"
    - Obvious: third party learns your private data
    - Retention: you give data, company keeps it forever
    - Passive: you don't know your data is collected
## Why is privacy hard?
- Hard to pin down what privacy means!
- Once data is out, can't put it back into the bottle
- A data release that preserves privacy today may violate it tomorrow, once combined with "side-information"
- Data may be used many times, often doesn't change
## Hiding private data
- Delete "personally identifiable information"
    - Name and age
    - Birthday
    - Social security number
    - ...
- Publish the "anonymized" or "sanitized" data
## Problem: not enough
- Can match up anonymized data with public sources
    - *De-anonymize* data, associate names to records
- Really, really hard to think about side information
    - May not even be public at time of data release!
## Netflix challenge
- Database of movie ratings
- Published: ID number, movie rating, and rating date
- Attack: from public IMDb ratings, recover names for Netflix data
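
A toy sketch of this kind of linkage, not the actual attack on the Netflix data. All records, names, and the `max_days` tolerance below are made up for illustration: a public rating "matches" a released one if the movie and rating agree and the dates are close.

```python
from datetime import date

# Hypothetical "anonymized" release: ID number -> (movie, rating, rating date).
released = {
    101: [("Movie A", 5, date(2005, 3, 1)),
          ("Movie B", 2, date(2005, 3, 4)),
          ("Movie C", 4, date(2005, 4, 9))],
    102: [("Movie A", 3, date(2005, 1, 7)),
          ("Movie D", 5, date(2005, 2, 2))],
}

# Hypothetical public side information: name -> publicly posted ratings.
public = {
    "alice": [("Movie A", 5, date(2005, 3, 2)),
              ("Movie C", 4, date(2005, 4, 10))],
    "bob": [("Movie D", 5, date(2005, 2, 1))],
}

def matches(pub, rel, max_days=3):
    """Same movie, same rating, and dates at most max_days apart."""
    return (pub[0] == rel[0] and pub[1] == rel[1]
            and abs((pub[2] - rel[2]).days) <= max_days)

def score(public_ratings, released_ratings):
    """Number of public ratings that match some released rating."""
    return sum(any(matches(p, r) for r in released_ratings)
               for p in public_ratings)

def deanonymize(public, released):
    """Guess, for each public name, the released ID with the most matches."""
    guesses = {}
    for name, ratings in public.items():
        best = max(released, key=lambda rid: score(ratings, released[rid]))
        if score(ratings, released[best]) > 0:
            guesses[name] = best
    return guesses

print(deanonymize(public, released))  # {'alice': 101, 'bob': 102}
```

Even though no names were published, a couple of approximate matches are enough to attach a name to each ID in this toy data.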
## "Blending in a crowd"
- Only release records that are similar to others
- *k-anonymity*: every released record must be identical to at least k - 1 others
- Other variants: *l-diversity*, *t-closeness*, ...
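
A minimal sketch of a k-anonymity check. It assumes records only need to agree on a chosen set of identifying columns (here called `quasi_identifiers`, a name introduced for this example); the table is made up.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical generalized table: zip codes and ages already coarsened.
records = [
    {"zip": "537**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "537**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "537**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "537**", "age": "30-39", "diagnosis": "diabetes"},
]

print(is_k_anonymous(records, ["zip", "age"], k=2))  # True
print(is_k_anonymous(records, ["zip", "age"], k=3))  # False
```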
## Problem: composition
- Repeating k-anonymous releases may lose privacy
- Privacy protection may fall off a cliff
    - First few queries fine, then suddenly total violation
- Again, interacts poorly with side-information
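
A toy illustration of how two releases can compose badly. Both tables are made up, and the side information is an assumption: the attacker knows the target appears in both releases and which group the target falls into in each.

```python
def candidate_values(release, group):
    """Sensitive values of all records in the target's group."""
    return {sensitive for g, sensitive in release if g == group}

# Two hypothetical releases of (group, diagnosis); each group has 2 records.
hospital_a = [("537**/20-29", "flu"), ("537**/20-29", "asthma"),
              ("537**/30-39", "flu"), ("537**/30-39", "diabetes")]
hospital_b = [("53706/2*", "flu"), ("53706/2*", "diabetes"),
              ("53715/3*", "asthma"), ("53715/3*", "flu")]

# Each release alone leaves two possible diagnoses for the target...
after_a = candidate_values(hospital_a, "537**/20-29")
after_b = candidate_values(hospital_b, "53706/2*")

# ...but intersecting the two candidate sets leaves only one.
print(after_a & after_b)  # {'flu'}
```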
## Differential privacy
- Proposed by Dwork, McSherry, Nissim, Smith (2006)
> A new approach to formulating privacy goals: the risk to one’s privacy, or in
> general, any type of risk... should not substantially increase as a result of
> participating in a statistical database. This is captured by differential
> privacy.
## Basic setting
- Private data: set of records from individuals
    - Each individual: one record
    - Example: set of medical records
- Private query: function from database to output
    - Randomized: adds noise to protect privacy
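
A minimal sketch of a randomized query, under assumptions: the slide only says the query adds noise, so the choice of Laplace noise, the `scale` parameter, and the toy database are illustrative. How to pick the noise level is left open here.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two
    independent exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(db, predicate, scale, rng=random):
    """Randomized query: true count of matching records plus noise."""
    true_count = sum(1 for record in db if predicate(record))
    return true_count + laplace_noise(scale, rng)

# Hypothetical database: one record per individual.
db = [
    {"id": 1, "smoker": True},
    {"id": 2, "smoker": False},
    {"id": 3, "smoker": True},
    {"id": 4, "smoker": True},
    {"id": 5, "smoker": False},
]

rng = random.Random(0)  # seeded only to make the demo repeatable
print(noisy_count(db, lambda r: r["smoker"], scale=2.0, rng=rng))
```

Each call returns a different answer near the true count of 3; a larger `scale` hides any one record better but gives a less accurate answer.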
## Basic definition
A query $Q$ is **$(\varepsilon, \delta)$-differentially private** if for every two
databases $db, db'$ that differ in **one individual's record**, and for every
subset $S$ of outputs, we have:
$$
\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
$$
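
To see the definition in action, here is a small check on randomized response, chosen as an illustration (it is not part of the slides above). The database holds one individual's bit, and the query reports the true bit with probability `p_truth` and the flipped bit otherwise. The helper names are made up; the check enumerates every subset $S$ of outputs.

```python
from itertools import chain, combinations
from math import exp, log

def randomized_response(bit, p_truth=0.75):
    """Output distribution {output: probability} of the query
    on a one-record database holding `bit`."""
    return {bit: p_truth, 1 - bit: 1 - p_truth}

def is_dp(dist_a, dist_b, eps, delta=0.0, tol=1e-12):
    """Check Pr[A in S] <= e^eps * Pr[B in S] + delta, and the same
    with A and B swapped, for every subset S of outputs."""
    outputs = sorted(set(dist_a) | set(dist_b))
    subsets = chain.from_iterable(
        combinations(outputs, n) for n in range(len(outputs) + 1))
    for s in subsets:
        pa = sum(dist_a.get(o, 0.0) for o in s)
        pb = sum(dist_b.get(o, 0.0) for o in s)
        if pa > exp(eps) * pb + delta + tol:
            return False
        if pb > exp(eps) * pa + delta + tol:
            return False
    return True

# Neighboring databases: they differ in the one individual's record.
on_db, on_db_prime = randomized_response(0), randomized_response(1)

print(is_dp(on_db, on_db_prime, eps=log(3)))              # True
print(is_dp(on_db, on_db_prime, eps=log(2)))              # False
print(is_dp(on_db, on_db_prime, eps=log(2), delta=0.25))  # True
```

With `p_truth = 0.75` the worst-case ratio of output probabilities is 0.75 / 0.25 = 3, so the check passes for $\varepsilon = \ln 3$ with $\delta = 0$, and fails for a smaller $\varepsilon$ unless $\delta$ absorbs the gap.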