NeurIPS 2020 Workshop
Virtual workshop, December 11, 2020
This one-day workshop focuses on privacy-preserving techniques for machine learning and disclosure in large-scale data analysis, both in the distributed and centralized settings, and on scenarios that highlight the importance and need for these techniques (e.g., via privacy attacks). There is growing interest from the Machine Learning (ML) community in leveraging cryptographic techniques such as Multi-Party Computation (MPC) and Homomorphic Encryption (HE) for privacy-preserving training and inference, as well as Differential Privacy (DP) for disclosure. Simultaneously, the systems security and cryptography community has proposed various secure frameworks for ML. We encourage both theory- and application-oriented submissions exploring a range of approaches listed below. Additionally, given the tension between the adoption of machine learning technologies and the ethical, technical and regulatory issues around privacy, as highlighted during the COVID-19 pandemic, we invite submissions to a special track on this topic.
Submission deadline: Oct 02, 2020, 23:59 (Anywhere on Earth)
Notification of acceptance: Oct 23, 2020
Submissions in the form of extended abstracts must be at most 4 pages long (not including references; additional supplementary material may be submitted but may be ignored by reviewers), non-anonymized, and adhere to the NeurIPS format. We encourage submission of work that is new to the privacy-preserving machine learning community. Submissions based solely on work that has been previously published in conferences on machine learning and related fields are not suitable for the workshop. On the other hand, we allow submission of work currently under review as well as relevant work recently published in privacy and security venues. The workshop will not have formal proceedings, but authors of accepted abstracts can choose to have a link to arXiv or a PDF added to the workshop webpage.
View the recordings on SlidesLive!
The workshop will be hosted in two blocks: BLOCK I accommodates Asia and Europe (morning) time zones, BLOCK II accommodates U.S. and Europe (evening) time zones. Unless otherwise noted, all listed times are CET (UTC+1).
Each block will contain three main components: an hour to view two recorded talks by invited speakers, followed by a 30-minute live joint Q&A with both; a poster session/social via Gather.Town (a short tutorial for our venue is provided); and several contributed talks highlighting submissions to this workshop, with corresponding live Q&A sessions. Due to time zone constraints, most contributed talks will be in BLOCK II, but all talks will be recorded on SlidesLive for viewing afterwards.
To join the workshop you will need a NeurIPS 2020 workshop registration ticket (see neurips.cc for more information). Instructions on how to join the workshop will be provided by NeurIPS.
BLOCK I, Asia/Europe: (17:20-21:00 Beijing) (14:50-18:30 Delhi) (10:20-14:00 Paris)
10:20-10:30 | Welcome & Introduction |
10:30-11:00 | Invited talk (1): Reza Shokri — Data privacy at the intersection of trustworthy machine learning |
Machine learning models leak a significant amount of information about their training data through their predictions and parameters. In this talk, we discuss the impact of trustworthy machine learning, notably interpretability and fairness, on data privacy. We present the privacy risks of model explanations and the effects of differential privacy on interpretability. We will also discuss the trade-off between privacy and (group) fairness, and how training fair models can make underrepresented groups more vulnerable to inference attacks.
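For readers less familiar with the inference attacks the talk refers to, below is a minimal sketch of a loss-threshold membership inference attack. It is a toy illustration under assumed data, model, and threshold, not the specific attacks discussed in the talk.

```python
# Toy loss-threshold membership inference attack (illustrative only).
# Idea: records the model fits well (low loss) are guessed to be training members.
import numpy as np

rng = np.random.default_rng(0)

def predict_proba(x, w):
    """Toy logistic model standing in for the attacked classifier."""
    return 1.0 / (1.0 + np.exp(-x @ w))

def loss(x, y, w):
    p = np.clip(predict_proba(x, w), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Fake data: "members" carry labels the model agrees with (low loss), while
# "non-members" carry independent labels -- a stand-in for the train/test
# loss gap that real membership inference attacks exploit.
d = 20
w = rng.normal(size=d)
members = rng.normal(size=(100, d)) + 0.3 * w
non_members = rng.normal(size=(100, d))
y_members = (members @ w > 0).astype(float)
y_non = rng.integers(0, 2, size=100).astype(float)

threshold = 0.5  # assumed; in practice calibrated via shadow models or held-out data
def is_member(x, y):
    return loss(x, y, w) < threshold

tpr = np.mean([is_member(x, y) for x, y in zip(members, y_members)])
fpr = np.mean([is_member(x, y) for x, y in zip(non_members, y_non)])
print(f"attack TPR={tpr:.2f}, FPR={fpr:.2f}")  # TPR well above FPR => leakage
```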
11:00-11:30 | Invited talk (2): Katrina Ligett — The Elephant in the Room: The Problems that Privacy-Preserving ML Can’t Solve |
In this talk, I attempt to lay out the problems of the data ecosystem, and to explore which of them can potentially be addressed by the toolkit of privacy-preserving machine learning. What we see is that while privacy-preserving machine learning has made amazing advances over the past decade and a half, there are enormous and troubling problems with the data ecosystem that seem to require an entirely different set of solutions.
11:30-12:00 | Invited Talk Q&A with Reza and Katrina |
12:00-12:10 | Break 10min |
12:10-12:30 | Contributed talk: POSEIDON: Privacy-Preserving Federated Neural Network Learning (15min presentation + 5min Q&A) |
Sinem Sav, Apostolos Pyrgelis, Juan Ramón Troncoso-Pastoriza, David Froelicher, Jean-Philippe Bossuat, João Sá Sousa and Jean-Pierre Hubaux
We address the problem of privacy-preserving training and evaluation of neural networks in an N-party, federated learning setting. We propose a novel system, POSEIDON, that employs multiparty lattice-based cryptography and preserves the confidentiality of the training data, the model, and the evaluation data, under a passive-adversary model and collusions between up to N−1 parties. Our experimental results show that POSEIDON achieves accuracy similar to centralized or decentralized non-private approaches and that its computation and communication overhead scales linearly with the number of parties. POSEIDON trains a 3-layer neural network on the MNIST dataset with 784 features and 60K samples distributed among 10 parties in less than 2 hours.
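As a point of reference for the N-party confidentiality goal, here is a minimal sketch of federated aggregation with pairwise additive masks that cancel in the sum. This is not POSEIDON's protocol (which uses multiparty lattice-based cryptography and also protects the model and evaluation data); it only illustrates how individual updates can stay hidden while their aggregate is revealed.

```python
# Toy secure aggregation with pairwise cancelling masks (illustration only;
# not POSEIDON's multiparty lattice-based protocol).
import numpy as np

rng = np.random.default_rng(1)
N, dim = 10, 5                          # 10 parties, 5-dimensional model update
updates = rng.normal(size=(N, dim))     # each party's private local update

# Each pair (i < j) agrees on a random mask m_ij; party i adds it and
# party j subtracts it, so every mask cancels in the global sum.
masks = {(i, j): rng.normal(size=dim) for i in range(N) for j in range(i + 1, N)}

def masked_update(i):
    out = updates[i].copy()
    for (a, b), m in masks.items():
        if a == i:
            out += m
        elif b == i:
            out -= m
    return out

masked = np.array([masked_update(i) for i in range(N)])
# The aggregator only ever sees `masked`: each row looks random on its own,
# yet the column-wise sum equals the true aggregate of all updates.
assert np.allclose(masked.sum(axis=0), updates.sum(axis=0))
print("aggregate recovered:", masked.sum(axis=0))
```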
12:30-14:00 | Gather.Town Poster Session and Social (log in with NeurIPS registration credentials) |
BLOCK II, U.S./Europe: (8:30-13:25 LA) (11:30-16:25 NYC) (17:30-22:25 Paris)
17:30-17:40 | Welcome & Introduction |
17:40-18:05 | Invited talk (1): Carmela Troncoso — Is Synthetic Data Private? |
Synthetic datasets produced by generative models have been advertised as a silver-bullet solution to privacy-preserving data publishing. In this talk, we show that such claims are unfounded. We show how synthetic data does not stop linkability or attribute inference attacks; and that differentially-private training does not increase the privacy gain of these datasets. We also show that some target records receive substantially less protection than others and that the more complex the generative model, the more difficult it is to predict which targets will remain vulnerable to inference attacks. We finally challenge the claim that synthetic data is an appropriate solution to the problem of privacy-preserving microdata publishing.
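To make the attack setting concrete, a toy attribute-inference attack against a synthetic release might look as follows. This is a hypothetical nearest-neighbour attack for illustration only; the attacks and generative models studied in the talk are more involved.

```python
# Toy attribute inference from synthetic data (illustrative only).
# The adversary knows a target's quasi-identifiers and guesses a sensitive
# attribute from the most similar synthetic records.
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic release: two quasi-identifiers plus a sensitive
# binary attribute that correlates with them.
n = 500
quasi = rng.normal(size=(n, 2))
sensitive = (quasi[:, 0] + 0.5 * quasi[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)
synthetic = np.column_stack([quasi, sensitive])

def infer_sensitive(target_quasi, synthetic, k=15):
    """Majority vote over the k synthetic records closest in quasi-identifier space."""
    dists = np.linalg.norm(synthetic[:, :2] - target_quasi, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(synthetic[nearest, 2].mean() > 0.5)

# A target whose quasi-identifiers the adversary already knows.
target_quasi = np.array([1.2, 0.4])
print("inferred sensitive attribute:", infer_sensitive(target_quasi, synthetic))
```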
18:05-18:30 | Invited talk (2): Dan Boneh — Proofs on secret shared data: an overview |
Many consumer devices these days are Internet-enabled and locally record information about how the device is used by its owner. Manufacturers have a strong interest in mining this data in order to improve their products, but concerns over data privacy often prevent this from taking place. A recent collection of techniques enables companies to process this distributed data without ever seeing the data in the clear. One obstacle is that a malfunctioning device might send invalid data and throw off the analysis; no one will ever know, because no one can see the data in the clear. To prevent this, there is a need for ultra-lightweight zero-knowledge techniques to prove properties about the hidden collected data. This talk will survey some recent progress in this area.
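The core mechanism behind the systems the talk surveys can be sketched in a few lines: each device splits its value into additive shares modulo a prime so that no single server sees the data, and the servers combine only aggregated shares. The toy below assumes a particular prime and two servers; the lightweight zero-knowledge validity proof that keeps a malfunctioning device from poisoning the sum (the focus of the talk) is deliberately omitted.

```python
# Toy two-server aggregation over additive secret shares mod a prime.
# Omitted: the zero-knowledge proof that each shared value is well formed
# (e.g., a 0/1 report), which is what the talk is about.
import secrets

P = 2**61 - 1  # prime modulus, assumed for illustration

def share(value):
    """Split `value` into two shares that individually look uniformly random."""
    s0 = secrets.randbelow(P)
    s1 = (value - s0) % P
    return s0, s1

# Each device reports a private 0/1 value (e.g., "feature X was used today").
device_values = [1, 0, 1, 1, 0, 1]
shares = [share(v) for v in device_values]

# Server 0 and server 1 each aggregate only their own shares.
agg0 = sum(s0 for s0, _ in shares) % P
agg1 = sum(s1 for _, s1 in shares) % P

# Only the combined aggregates are published; individual values stay hidden.
total = (agg0 + agg1) % P
assert total == sum(device_values)
print("number of devices reporting 1:", total)
```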
18:30-19:00 | Invited Talk Q&A with Carmela and Dan |
19:00-19:10 | Break 10min |
19:10-20:10 | Gather.Town Poster Session and Social (log in with NeurIPS registration credentials) |
20:10-20:20 | Break 10min |
20:20-20:35 | Contributed talk: On the (Im)Possibility of Private Machine Learning through Instance Encoding (15min presentation) |
Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramèr
A learning algorithm is private if the produced model does not reveal (too much) about its training set. In this work, we study whether a non-private learning algorithm can be made private by relying on an instance-encoding mechanism that modifies the inputs before they are fed to the normal learner. We formalize the notion of instance encoding and its privacy by providing two attack models. We first prove impossibility results for achieving the first (stronger) model. We further demonstrate practical attacks in the second (weaker) attack model on recent proposals that aim to use instance encoding for privacy.
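For context, a toy instance encoder in the spirit of the schemes studied (and attacked) in this line of work might mix each private input with other inputs and apply random sign flips before handing it to an ordinary learner. The sketch below is a hypothetical encoder for illustration only; it is not the paper's formalization, attack models, or impossibility results.

```python
# Toy mixup-style instance encoding (illustrative; the paper demonstrates
# attacks against proposals in this spirit).
import numpy as np

rng = np.random.default_rng(3)

def encode(x_private, x_public, k=2, sign_flips=True):
    """Mix each private example with k randomly chosen public examples."""
    encoded = []
    for x in x_private:
        idx = rng.choice(len(x_public), size=k, replace=False)
        lam = rng.dirichlet(np.ones(k + 1))          # random convex weights
        mix = lam[0] * x + sum(l * x_public[i] for l, i in zip(lam[1:], idx))
        if sign_flips:
            mix = mix * rng.choice([-1.0, 1.0], size=x.shape)
        encoded.append(mix)
    return np.array(encoded)

x_private = rng.normal(size=(8, 32 * 32))   # pretend flattened images
x_public = rng.normal(size=(100, 32 * 32))
x_encoded = encode(x_private, x_public)
# A "normal" learner would now train on x_encoded instead of x_private;
# the paper studies how much x_encoded still reveals about x_private.
print(x_encoded.shape)
```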
20:35-20:50 | Contributed talk: Poirot: Private Contact Summary Aggregation (15min presentation) |
Chenghong Wang, David Pujol, Yaping Zhang, Johes Bater, Matthew Lentz, Ashwin Machanavajjhala, Kartik Nayak, Lavanya Vasudevan and Jun Yang
Physical distancing between individuals is key to preventing the spread of a disease such as COVID-19. On the one hand, having access to information about physical interactions is critical for decision makers; on the other, this information is sensitive and can be used to track individuals. In this work, we design Poirot, a system to collect aggregate statistics about physical interactions in a privacy-preserving manner. We show a preliminary evaluation of our system that demonstrates the scalability of our approach even while maintaining strong privacy guarantees.
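A minimal sketch of the kind of privacy-preserving aggregate the abstract describes, releasing per-location contact counts with differential privacy, is shown below. The privacy budget and sensitivity are assumptions for illustration; this is not Poirot's actual architecture.

```python
# Toy differentially private release of aggregate contact counts
# (illustration only; not Poirot's actual design).
import numpy as np

rng = np.random.default_rng(4)

# Per-location contact counts for one day (true, sensitive values).
true_counts = {"library": 120, "gym": 45, "cafeteria": 230}

epsilon = 1.0        # assumed privacy budget per release
sensitivity = 1.0    # assumed: one person changes each count by at most 1

def laplace_release(counts, epsilon, sensitivity):
    """Add Laplace(sensitivity/epsilon) noise to each count before publishing."""
    scale = sensitivity / epsilon
    return {loc: c + rng.laplace(0.0, scale) for loc, c in counts.items()}

noisy = laplace_release(true_counts, epsilon, sensitivity)
for loc, c in noisy.items():
    print(f"{loc}: ~{c:.1f} contacts (noisy)")
```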
20:50-21:05 | Contributed talk: Greenwoods: A Practical Random Forest Framework for Privacy Preserving Training and Prediction (15min presentation) |
Harsh Chaudhari and Peter Rindal
In this work we propose two prediction protocols for a random forest model. The first takes a traditional approach and requires the trees in the forest to be complete in order to hide sensitive information. Our second protocol takes a novel approach which allows the servers to obliviously evaluate only the “active path” of the trees. This approach can easily support trees with large depth while revealing no sensitive information to the servers. We then present a distributed framework for privacy-preserving training which circumvents the expensive procedure of privately training the random forest on a combined dataset, and propose an alternative, efficient collaborative approach with the help of users participating in the training phase.
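To make the “active path” idea concrete, plain (non-private) evaluation of a decision tree only visits the nodes on the path selected by the input, as in the sketch below. The paper's contribution is performing exactly this traversal obliviously between servers, so that neither the path taken nor the feature values are revealed; the cryptographic machinery is omitted here.

```python
# Plain "active path" evaluation of one decision tree (no privacy here);
# the protocol in the paper performs this traversal obliviously via MPC.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

def evaluate(node, x):
    """Follow only the path that x activates, regardless of tree depth."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

tree = Node(feature=0, threshold=0.5,
            left=Node(label="low risk"),
            right=Node(feature=1, threshold=2.0,
                       left=Node(label="medium risk"),
                       right=Node(label="high risk")))
print(evaluate(tree, x={0: 0.9, 1: 3.1}))   # -> "high risk"
```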
21:05-21:20 | Joint Q&A with the three speakers above |
21:20-21:25 | Break 5min |
21:25-21:40 | Contributed talk: Shuffled Model of Federated Learning: Privacy, Accuracy, and Communication Trade-offs (15min presentation) |
Antonious Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz and Ananda Theertha Suresh
We study empirical risk minimization (ERM) optimization with communication efficiency and privacy under the shuffled model. We use our communication-efficient schemes for private mean estimation in the optimization solution of the ERM. By combining this with privacy amplification by client sampling and data sampling at each client, as well as the shuffled privacy model, we demonstrate that one can reach the same privacy/optimization-performance operating point as recent methods that use full-precision communication, but at a lower communication cost, i.e., effectively getting communication efficiency for “free”.
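The shuffled model the paper works in can be pictured as follows: each client applies a local randomizer to its value, an intermediary shuffles the anonymous reports, and the analyzer sees only the shuffled multiset, which amplifies the local privacy guarantee. The sketch below is a toy 1-bit randomized-response mean estimator under assumed parameters, not the paper's communication-efficient ERM schemes.

```python
# Toy shuffled-model mean estimation with 1-bit randomized response
# (illustration; not the paper's communication-efficient schemes).
import numpy as np

rng = np.random.default_rng(5)

def local_randomizer(x, eps):
    """Encode x in [0,1] as one biased bit, then apply randomized response."""
    bit = rng.random() < x                       # unbiased 1-bit encoding of x
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)   # eps-LDP randomized response
    return bit if rng.random() < p_keep else not bit

def analyzer(shuffled_bits, eps):
    """Debias the mean of the received (shuffled, anonymous) bits."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    y = np.mean(shuffled_bits)
    return (y - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

n, eps = 10_000, 1.0
data = rng.beta(2.0, 5.0, size=n)                # clients' private values in [0,1]
# The shuffler strips order and identity from the clients' reports.
reports = rng.permutation([local_randomizer(x, eps) for x in data])
print("true mean:", data.mean(), "estimate:", analyzer(reports, eps))
```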
21:40-21:55 | Contributed talk: Sample-efficient proper PAC learning with approximate differential privacy (15min presentation) |
Badih Ghazi, Noah Golowich, Ravi Kumar and Pasin Manurangsi
In this paper we prove that the sample complexity of properly learning a class of Littlestone dimension d with approximate differential privacy is at most Õ(d^6) (ignoring privacy and accuracy parameters). This result answers a question of Bun et al. (FOCS 2020) by improving upon their upper bound of 2^O(d) on the sample complexity. Prior to our work, finiteness of the sample complexity for privately learning a class of finite Littlestone dimension was only known for improper private learners, and the fact that our learner is proper answers another question of Bun et al. which was also asked by Bousquet et al. (2019). Using machinery developed by Bousquet et al., we show that the sample complexity of sanitizing a binary hypothesis class is at most polynomial in its Littlestone dimension and dual Littlestone dimension. This implies that a class is sanitizable if and only if it has finite Littlestone dimension. An important ingredient of our proofs is a new property of binary hypothesis classes that we call irreducibility, which may be of independent interest.
21:55-22:10 | Contributed talk: Training Production Language Models without Memorizing User Data (15min presentation) |
Swaroop Ramaswamy, Om Dipakbhai Thakkar, Rajiv Mathews, Galen Andrew, Brendan McMahan and Françoise Beaufays
This paper presents the first consumer-scale next-word prediction (NWP) model trained with Federated Learning (FL) while leveraging the Differentially Private Federated Averaging (DP-FedAvg) technique. There has been prior work on building practical FL infrastructure, including work demonstrating the feasibility of training language models on mobile devices using such infrastructure. It has also been shown (in simulations on a public corpus) that it is possible to train NWP models with user-level differential privacy (DP) using DP-FedAvg. Nevertheless, training production-quality NWP models with DP-FedAvg in a real-world production environment on a heterogeneous fleet of mobile phones requires addressing numerous challenges. For instance, the coordinating central server has to keep track of the devices available at the start of each round and sample devices uniformly at random from them, while ensuring "secrecy of the sample", etc. Unlike all prior privacy-focused FL work of which we are aware, for the first time we demonstrate the deployment of a DP mechanism for the training of a production neural network in FL, as well as the instrumentation of the production training infrastructure to perform an end-to-end empirical measurement of unintended memorization.
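The core of DP-FedAvg, clipping each sampled client's model delta and adding Gaussian noise to the average before the server applies it, can be sketched in a few lines. The sketch below uses assumed hyperparameters and a plain numpy "model" vector; it is an illustration of the technique, not the production infrastructure the paper describes.

```python
# Toy DP-FedAvg round: clip per-client deltas, average, add Gaussian noise.
# Hyperparameters and the "model" are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(6)

dim = 100
global_model = np.zeros(dim)

clip_norm = 1.0         # L2 clip bound S
noise_multiplier = 1.1  # z; noise stddev is z * S / clients_per_round
clients_per_round = 50
num_clients = 1_000

def local_update(model, client_id):
    """Stand-in for local SGD on the client's private data (data omitted here)."""
    return model + rng.normal(scale=0.1, size=model.shape)

for round_num in range(5):
    sampled = rng.choice(num_clients, size=clients_per_round, replace=False)
    clipped_deltas = []
    for cid in sampled:
        delta = local_update(global_model, cid) - global_model
        norm = np.linalg.norm(delta)
        delta = delta * min(1.0, clip_norm / max(norm, 1e-12))   # clip to L2 <= S
        clipped_deltas.append(delta)
    avg = np.mean(clipped_deltas, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm / clients_per_round, size=dim)
    global_model = global_model + avg + noise   # server applies the noisy average

print("trained model norm:", np.linalg.norm(global_model))
```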
22:10-22:25 | Joint Q&A with the three speakers above |