December 1, 2025
This dissertation investigates how to reduce communication overhead in privacy-preserving applications. It develops two complementary directions toward this goal. The first studies how sublinear-communication protocols can be achieved for tasks such as secure search over encrypted datasets, secure collaborative sampling, and sublinear approximation algorithms such as Min-Hash for set similarity. These techniques serve as common building blocks across a wide range of privacy-preserving applications. Together, these results reveal a unified perspective on how limited communication and provable privacy reinforce each other across modern cryptographic and data-analytic settings. The second direction extends these techniques to practical systems such as the FACTS framework for privacy-preserving accountability in end-to-end encrypted messaging systems (EEMS). This work enables threshold tracebacks that identify the originator of reported messages at a cost that depends only on the number of reports, and it includes preliminary designs and analysis for extending these mechanisms to identify super-spreaders of reported messages.
The modern digital landscape is defined by a fundamental tension between the exponential growth of data and the imperative to protect the privacy of the individuals generating it. As datasets scale into the terabytes and petabytes, and as digital communication becomes the primary medium for societal discourse, the computational paradigms that served the early internet are reaching their limits. Traditional cryptographic protocols for privacy-preserving applications, such as Secure Multi-Party Computation (MPC) and Fully Homomorphic Encryption (FHE), offer mathematically robust guarantees of confidentiality. However, these protocols typically exhibit linear (\(O(n)\)) or super-linear complexity with respect to the input size. In a world of massive datasets, linear complexity is often synonymous with infeasibility. A query that requires touching every record in a billion-row database, even if cryptographically secure, remains practically useless if it takes days to execute.
Consequently, one important frontier of cryptographic research has shifted toward the development of sublinear algorithms—protocols whose complexity is logarithmic or constant with respect to the input size, or dependent only on the output size. This thesis proposal synthesizes four seminal contributions that pioneer this shift across two distinct but interrelated domains.
The first domain, Sublinear Secure Protocols, comprises three studies that attack the fundamental algorithmic bottlenecks of secure computation. These works introduce a new data structure for compressing large sparse data vectors for homomorphically encrypted search to allow for parallel, sublinear retrieval, develop protocols for secure sampling from distributed distributions without linear communication, and investigate the inherent privacy properties of sketching algorithms (Min-Hash).
The second domain, designated as FACTS, addresses the societal crisis of misinformation within end-to-end encrypted messaging systems (EEMS). Here, the challenge is not merely computational efficiency but the architectural reconciliation of user privacy with platform accountability. The proposed solution, the Fuzzy Anonymous Complaint Tally System (FACTS), leverages a novel probabilistic data structure to achieve sublinear scalability in complaint auditing, proving that privacy and moderation need not be mutually exclusive.
A unifying theme across both groups is the strategic utilization of approximation and probabilistic correctness. Whether it is the fuzzy counting of complaints in FACTS, the rejection sampling in secure estimation, or the false-positive rates of compressed encodings in search, these works demonstrate that controlled relaxation of exactness is the key to unlocking sublinear performance. This introduction delineates the theoretical underpinnings and practical implications of these advancements, setting the stage for a detailed technical analysis.
To understand the necessity of the contributions discussed herein, one must first appreciate the “Linearity Wall.” In standard secure multi-party computation, if Party A holds a vector \(x\) and Party B holds a vector \(y\), computing a function \(f(x, y)\) usually requires processing every bit of the inputs to avoid leaking information about which bits were “useful”. For example, in Private Information Retrieval (PIR), the server must process the entire database to answer a single query; otherwise, the server learns which items were not touched, narrowing down the user’s interest.
While techniques like Oblivious RAM (ORAM) can hide access patterns, they impose logarithmic overheads that, while asymptotically better than linear scanning for multiple accesses, still require significant bandwidth and state management. The research in Sublinear Secure Protocols specifically targets scenarios where even these overheads are unacceptable, aiming for protocols where communication depends on the output size or the security parameter, rather than the input database size.
This line of research explores the building blocks of sublinear privacy.
Encrypted Search: The first paper tackles the sequential bottleneck in FHE search. Prior systems required a round of interaction for every matching record found. The authors propose Compressed Oblivious Encoding (COE), allowing a server to compress all search results into a single package. This reduces the fetching time by orders of magnitude (1800x in experiments), effectively making FHE search viable for real-time applications.
Secure Sampling: The second paper addresses the problem of sampling from joint distributions (e.g., \(L_1, L_2\), Product) partitioned across users. It proves that while general product sampling is impossible with sublinear communication, specific correlated inputs allow for efficient protocols. The introduction of Corrective Sampling—a technique that corrects a proxy distribution into a target distribution without computing normalization factors—allows for constant-round sampling protocols.
Min-Hash Privacy: The third paper in this group interrogates the “folklore” that sketching algorithms like Min-Hash are inherently private. It rigorously proves that while sketches compress data (sublinear representation), they do not automatically guarantee Differential Privacy (DP) in the public hash setting. The authors introduce Distributional Differential Privacy (DDP) and Noisy Min-Hash to bridge this gap, allowing for constant-size comparisons of massive sets with formal privacy guarantees.
Collectively, these works argue that the future of privacy-preserving technologies lies in the domain of sublinear algorithms, where mathematical approximation provides the necessary slack to achieve scalability.
This line of research focuses on the application of “fighting fake news”. End-to-end encrypted messaging systems (EEMS) such as WhatsApp and Signal use encryption that hides message content from the service provider. This rules out any content-moderation method that relies on analyzing content in the clear, such as the tools used by platforms like Facebook or Twitter. The FACTS system introduces a paradigm where moderation is triggered only by a consensus of user complaints.
The core innovation here is the Collaborative Counting Bloom Filter (CCBF). This data structure allows the system to tally complaints against millions of messages without the server knowing which complaints correspond to which message until a threshold is crossed. This is achieved by “mixing” counters in a shared bit array. The efficiency is sublinear in the number of messages in the system per epoch: identifying a message for audit requires no linear scan of the database, and registering a complaint requires flipping a single bit. This work exemplifies how probabilistic data structures, traditionally used for efficiency in networking, can be repurposed as privacy primitives.
This section surveys existing literature relevant to the two primary lines of work. First, we discuss the foundational cryptographic primitives and protocols for Secure Search, Secure Sampling, and Privacy-Preserving Set Similarity. Second, we review the literature relevant to our specific application in accountability systems of end-to-end encrypted messaging systems, FACTS.
This line of work focuses on optimizing privacy-preserving protocols for retrieving and analyzing data. We divide the literature into three interconnected domains: secure search, secure sampling, and set similarity.
Differential privacy limits what an adversary can learn about any individual input from the output of a computation [1], [2]. For an overview of DP and standard mechanisms in both the curator setting and distributed settings, we refer the reader to the book by Dwork and Roth [3]. DP will appear repeatedly in our discussion, both as an explicit privacy goal and as a tool for trading accuracy for improved efficiency.
Secure search is a widely-studied problem with solutions spanning various cryptographic settings. In the discussion below, let \(n\) denote the number of data items.
In secure pattern matching (SPM), given an encrypted query \(\unicode{x27E6}q \unicode{x27E7}\) and \(n\) FHE-encrypted data items \((\unicode{x27E6}x_1 \unicode{x27E7}, \ldots, \unicode{x27E6}x_n \unicode{x27E7})\), the protocol returns a vector of \(n\) ciphertexts \(\unicode{x27E6}b_1 \unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7}\), where \(b_i\) indicates whether the \(i\)-th data element matches the query [4], [5], [6], [7]. These works primarily focus on optimizing the search circuits used to determine matches. Consequently, the communication complexity and the client’s running time remain proportional to the number of data items. In contrast, our work focuses on the orthogonal problem of optimizing the retrieval of matched data items, aiming for sublinear communication and client computation.
Searchable encryption (SE) [8], [9] enables highly efficient search (typically \(o(n)\) time) over encrypted data. Efficient SE schemes have been proposed for a wide variety of queries, including equality queries [10], [11], range queries [12], [13], and conjunctive queries [14], [15]. However, to achieve sublinear query performance, SE schemes generally require significant preprocessing and must relax security guarantees, allowing partial information (such as access patterns) to leak to the server [16]. Unlike standard SE, our work focuses on achieving preprocessing-free constructions that leak nothing about the queries or results beyond their sizes.
Property-preserving encryption (PPE) [17] produces ciphertexts that maintain certain relationships (e.g., equality or order) of the underlying plaintexts. Examples include deterministic encryption [18] and order-preserving encryption [19], [20]. However, it has been demonstrated [21], [22], [23] that such ciphertexts leak significant information, rendering them undesirable for many security-critical applications.
Private Information Retrieval (PIR) [24] allows retrieval while hiding the index, but requires the client to know the index beforehand. General MPC [25] and ORAM [26] can theoretically solve secure search, but generic constructions typically incur high communication (\(\Omega(n)\) for MPC) or significant overhead to hide access patterns. Oblivious data structures (ODS) can support richer data structures (e.g., search trees), but typical ODS constructions incur \(\Omega(\log^2 n)\) rounds per operation, which motivates constant-round designs for practical secure search.
Non-private sampling from data streams has been studied extensively [27], [28], [29], [30], [31], typically achieving \(\ell_p\) sampling with sublinear computation. These works generally operate in a single-party setting and do not consider privacy.
In the privacy domain, works like [32], [33] investigate sampling in the information-theoretic setting, while Champion et al. [34] consider the computational setting for publicly-known distributions. Our work is distinct in focusing on sampling from a private distribution in the computational setting, with a specific emphasis on reducing communication.
Since Dwork et al. [35], significant work has focused on using MPC to realize DP functionalities [36], [37], [38]. While these works address machine learning and aggregation, we focus specifically on optimizing the communication complexity for sampling functionalities.
A long line of work studies secure sketches for estimating statistics (e.g., Tor usage, web traffic, unique count, median) with sublinear communication and computation [39], [40], [41], [42], [43]. These works are closely related in spirit, but generally focus on building secure protocols for specific streaming-style statistics, whereas our focus is on sublinear primitives (search, sampling, and similarity) that can serve as broadly reusable building blocks.
Sketching algorithms are sublinear-space methods that produce compact summaries enabling efficient storage, merging, and processing. A growing body of work observes that sketches can also aid privacy, since information loss in the sketch can make the sketch itself differentially private or reduce the additional noise required [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57]. Related work also constructs private sketches for set cardinality and set operations, including mergeable sketches for estimating intersection and union [48], [58], [59], [60], [61], [62]. This perspective directly motivates our study of privacy properties of Min-hash and related similarity sketches.
Many works have constructed 2-party Jaccard index estimation protocols using Min-hash with sublinear communication to avoid the high cost of exact computation [63], [64], [65].
Recent research aims to output Jaccard similarity estimates while preserving differential privacy. Aumüller et al. [66] achieve local DP min-hash by perturbing vectors with noise. Other attempts have faced challenges; for example, [67] was shown to have flaws in its privacy claim, while [68] incurs high noise overhead.
Our work sits at the intersection of DP and MPC. A specific line of research relevant to our goals is optimizing secure computation using DP, initiated by Beimel et al. [69]. Subsequent works have applied this to set intersection [70], [71], graph-parallel computations [72], and shuffling [73]. We similarly leverage DP-style relaxations to improve the efficiency of similarity protocols.
Secure approximation studies what functions can be securely approximated without revealing anything beyond the true output [74], [75]. This notion is distinct from DP-style approximation, but it is conceptually adjacent: both investigate how controlled relaxation can enable more efficient privacy-preserving computation.
Property-preserving hash (PPH) compresses large inputs into short digests enabling computation of a property from the digests alone, and adversarially robust PPH further requires correctness even when inputs are adaptively chosen after the hash is fixed [76], [77], [78], [79]. Related work on robust sketching similarly studies sketches that remain accurate under adaptive inputs [80], [81]. These works focus on robustness to adversarial inputs, whereas our focus is on privacy when the adversary additionally sees the hash functions or sketch randomness.
Our second line of work shifts focus to a specific application: enabling accountability in end-to-end encrypted messaging systems.
The prevailing approach for reporting malicious messages is message franking [82], [83], [84]. This technique allows a recipient to cryptographically prove the identity of the sender. However, message franking is limited to identifying the immediate sender and cannot trace the original source in a forwarding chain. Furthermore, it does not provide threshold guarantees to prevent unmasking users based on a single complaint.
FACTS requires scalable oblivious storage to track complaints. While multi-client ORAM protocols exist [85], [86], [87], they do not yet scale to the millions of users required for messaging applications. Similarly, oblivious counters [88], [89] generally focus on exact counting or complex operations, lacking the specific compression traits of the Collaborative Counting Bloom Filter (CCBF) we utilize. More generally, oblivious data structures [90], [91], [92] enable higher-level oblivious operations over encrypted data, but do not directly provide the compression needed to store and update complaint tallies at large scale.
CCBF can be viewed as a compact sketch for storing complaint counts over a large set of messages. There has been significant interest in privacy-preserving sketching algorithms for cardinality estimation, frequency measurement, and related approximations [42], [43]. However, much of this literature targets multi-party settings (often with multiple servers) where parties run secure computation to evaluate the statistic. In contrast, FACTS restricts communication to the user–server model, which limits the direct applicability of these approaches.
As computing paradigms are shifting to cloud-centric technologies, users of these technologies are increasingly concerned with the privacy and confidentiality of the data they upload to the cloud. Specifically, a client uploads data to the server and expects the following guarantees:
The uploaded data should remain private, even from the server itself;
The server should be able to perform computations on the uploaded data in response to client queries;
The client should be able to efficiently recover the results of the server’s computation with minimal post-processing.
In this work, we will focus on the computational task of secure search. In this application, the client uploads a set of records to the server and later issues queries to the server. Computation proceeds in two steps, called matching and fetching. In the matching step, the server compares the encrypted search query from the client with all encrypted records in the database and computes an encrypted 0/1 vector, where 1 indicates that the corresponding record satisfies the query. The fetching step returns all the 1-valued indexes and the corresponding records to the client for decryption.
While these goals may seem to conflict, guarantees (1), (2), and (3) can be simultaneously achieved in the secure search setting via techniques such as secure multiparty computation and searchable encryption. Recently, a line of works has focused on Fully Homomorphic Encryption (FHE)-based secure search, which we describe next.
The simplicity of the framework of secure search on FHE-encrypted data is attractive. Compared to other secure search systems, no costly setup procedure is necessary; it is sufficient for the client merely to upload the encrypted database to the server. Confidentiality is provided because the server works only on the encrypted query and records. The server can nevertheless perform the search correctly thanks to the full homomorphism of the underlying encryption scheme.
For this reason, researchers have been paying increasing attention to this problem. In particular, Akavia et al. [93] introduce a framework of performing secure search on FHE-encrypted data (see Figure 1).
Informally, a secure, homomorphic encrypted search scheme has the following Setup:
(Setup) The client encrypts and uploads \(n\) items \(x = (x_1, \ldots, x_n)\) to the server. Let \(\unicode{x27E6}x \unicode{x27E7}= (\unicode{x27E6}x_1 \unicode{x27E7}, \ldots, \unicode{x27E6}x_n \unicode{x27E7})\) denote the encrypted data stored in the server.
Throughout the paper, we let \(\unicode{x27E6}\cdot \unicode{x27E7}\) denote an FHE-encrypted ciphertext. After the encrypted records have been uploaded, the client can perform a secure search using three algorithms, (Query, Match, Fetch).
(Query) The client sends an encrypted query \(\unicode{x27E6}q \unicode{x27E7}\) to the server.
(Match) The server homomorphically evaluates the query \(\unicode{x27E6}q \unicode{x27E7}\) on each record \(\unicode{x27E6}x_i \unicode{x27E7}\) to obtain the encrypted matching results \(\unicode{x27E6}b \unicode{x27E7}= (\unicode{x27E6}b_1 \unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7}).\) That is, \(b_i\) is 1 if item \(x_i\) satisfies the given query \(q\); otherwise, \(b_i\) is \(0\).
(Fetch) Given \(\unicode{x27E6}b \unicode{x27E7}\), the server homomorphically computes \(\unicode{x27E6}i^* \unicode{x27E7}\), where \(i^* = \min \{ i \in [n] : b_i = 1 \}\) which corresponds to the first matching record index. It fetches \(\unicode{x27E6} x_{i^*} \unicode{x27E7}\) (obliviously) and sends \((\unicode{x27E6}i^* \unicode{x27E7}, \unicode{x27E6}x_{i^*} \unicode{x27E7})\) to the client for decryption.
Akavia et al. also provide a construction that performs the fetching step in \(O(n \log^2 n)\) homomorphic multiplications. Subsequently, more efficient algorithms have been presented with \(O(n \log n)\) multiplications [94] and \(O(n)\) multiplications [95].
Suppose a client wants to fetch all matching items. Under the above framework, the client would first obtain the first matching index \(i^*\) and its corresponding item \(x_{i^*}\). To fetch the second matching item, the framework suggests that the client should slightly change the original query \(q\) to a new query \(q'_{i^*}\) as follows:
\(q'_{i^*}(i, x_i)\) return true if \(q(i, x_i)\) is true and \(i > i^*\).
Then, by executing a new instance of the protocol with the encrypted query \(\unicode{x27E6}q'_{i^*} \unicode{x27E7}\), the client will obtain the second matching item. By repeating this procedure, the client will ultimately obtain all the matching records.
Note that the query \(q'_{i^*}\) embeds \(i^*\) in itself as a constant, which implies that there is no way for the client to construct this query \(q'_{i^*}\) without obtaining \(i^*\) first. In other words, the client can construct the query for the second matching item, only after fetching the first matching item. In this sense, the framework inherently limits the client to fetch only a single matching record at a time in a sequential manner.
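To make the sequential bottleneck concrete, the following minimal Python sketch simulates the fetch-all loop in the clear (the function names are ours, chosen for illustration; in the actual protocol, the match and fetch steps operate on FHE ciphertexts, so every iteration costs a full round of homomorphic evaluation):

```python
def match(records, q):
    """Match step (in the clear): b_i = 1 iff record i satisfies query q."""
    return [1 if q(i, x) else 0 for i, x in enumerate(records)]

def fetch_first(records, b):
    """Fetch step: return the first matching (index, record), or None."""
    for i, bit in enumerate(b):
        if bit:
            return i, records[i]
    return None

def fetch_all_sequential(records, q):
    """One full Query/Match/Fetch execution per matching record."""
    results, last = [], -1
    while True:
        # q'_{i*}: matches only records strictly after the last fetched index
        q_next = lambda i, x, last=last: q(i, x) and i > last
        hit = fetch_first(records, match(records, q_next))
        if hit is None:
            return results
        last = hit[0]
        results.append(hit)

# Example: fetching all records equal to "spam" takes one round trip per hit.
records = ["ham", "spam", "eggs", "spam"]
assert fetch_all_sequential(records, lambda i, x: x == "spam") == [(1, "spam"), (3, "spam")]
```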
If there are \(\ell\) matching records, the client and server have to execute \(\ell\) instances of the Query, Match, and Fetch algorithms. Since each Match and Fetch step requires costly homomorphic multiplications, the limitation of sequential protocol execution creates a serious bottleneck with respect to the running time. This leads us to ask the following natural question:
Is there a different secure search framework that allows the client to fetch all the matching records by executing a smaller number of protocol executions, possibly avoiding sequential record fetching?
All previous schemes have to perform \(\Omega(n)\) homomorphic multiplications in the fetching step. Since homomorphic multiplications are costly operations, it is desirable to reduce such computations, which raises the following natural question:
Can we reduce the number of homomorphic multiplications in the fetching step?

In this paper, we answer both of the above questions affirmatively.
| Scheme | rounds | #Match | \({\mathsf hmult}\) | \({\mathsf hadd}\) | \({\mathsf smult}\) | communication | plaintext modulus |
|---|---|---|---|---|---|---|---|
| LEAF [95] | \(s\) | \(s\) | \(O(ns)\) | \(O(ns\log n)\) | 0 | \(O(s \cdot \log n \cdot |C|)\) | 2 |
| Protocol w/ BF-COIE | \(3\) | 1 | 0 | \(O(n \log \frac n s)\) | 0 | \(O(s^{1+\epsilon} \log \frac n s \cdot |C| + pir(s))\) | prime |
| Protocol w/ PS-COIE | \(3\) | 1 | 0 | \(n \cdot s\) | \(n \cdot s\) | \(O(s \cdot |C| + pir(s) )\) | prime |
| Protocol w/ BFS-CODE | \(2\) | 1 | \(n\) | \(O(\kappa n)\) | 0 | \(O(s \kappa\cdot |C|)\) | prime |
\(\kappa\): statistical security parameter.
\(n\): number of uploaded encrypted records.
\(s\): number of matching records.
\(\epsilon\): protocol parameter such that \(0 < \epsilon< 1\).
#Match: number of times the matching algorithm is executed.
\({\mathsf hmult}\): number of homomorphic multiplication operations used in the overall fetching step.
\({\mathsf hadd}\): number of homomorphic addition operations used in the overall fetching step.
\({\mathsf smult}\): number of scalar (plain) multiplication operations used in the overall fetching step.
\(|C|\): length of an FHE ciphertext.
\(pir(s)\): communication complexity required to retrieve \(s\) records via a PIR protocol.
To address these issues, we introduce a new secure search framework in which the matching items are retrieved in parallel in a constant number of rounds. Our Setup, Query, and Match algorithms are the same as in prior work. However, we modify the Fetch procedure, dividing it into two steps: Encode and Decode. In the Encode step, the server homomorphically inserts the matching items into a data structure; the particular structure depends on the construction, as we provide three different constructions, each using a different encoding. After receiving the encrypted encoding, the client decrypts the encoding and runs the Decode step to recover the items.
The encoding is computed homomorphically and, most importantly, allows the server to encode the full result set rather than just a single item. In particular, we introduce the notion of Compressed Oblivious Encoding (COE). A compressed oblivious encoding takes as input a large, but sparse, vector and compresses it to a much smaller encoding from which the non-zero entries of the original vector can be recovered. What makes this encoding oblivious is that the encoding procedure is performed on encrypted data. In certain constructions, the encoding includes the data values (CODE, compressed oblivious data encoding), and in others it only includes the indices (COIE, compressed oblivious index encoding). In the latter case, the Decode procedure is interactive, and allows the client to recover the values from the decoded set of indices.
For simplicity, when describing the generic syntax of secure search scheme, we denote the Encode procedure as taking both the indices and the values as input, and we suppress the fact that when the values are not used during Encoding, the Decoding step must be interactive. Recall, we use \(\unicode{x27E6}b \unicode{x27E7}= (\unicode{x27E6}b_1\unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7})\) to denote the encrypted bit vector that results from the Match step.
(Encode) Let \(S = \{ i \in [n] : b_i = 1 \}\) and \(V = \{v_i : i \in S\}\). The server homomorphically evaluates \(\unicode{x27E6}\mathsf{encoding}(S,V) \unicode{x27E7}\) and sends it to the client.
(Decode) The client decrypts \(\unicode{x27E6} \mathsf{encoding}(S,V) \unicode{x27E7}\) and runs the decoding procedure to recover \((S, V)\).
We assume that the result set \(S\) is small; that is, \(|S|\) is sublinear in \(n\). We would like the size of the compressed encoding to be sublinear in \(n\) as well, to maintain meaningful communication cost.
To ensure minimal computational cost for encoding the results, we also wish to minimize the number of homomorphic multiplications. Recall, the best prior work requires \(O(n)\) multiplications by the server. Somewhat surprisingly, we demonstrate three encoding algorithms that can be evaluated without any homomorphic multiplications!
The asymptotic complexities and trade-offs of the search protocols are presented in Figure [fig:comparison].
In some of our protocols (i.e., the search protocols with BF-COIE and PS-COIE; see Sections 4 and 6.3 for more detail), the indices and actual records are fetched in separate steps. This allows us to focus on optimizing the retrieval of the indices after which the values can be fetched using an efficient (setup-free) PIR protocol resulting in overall savings.
However, if reliance on PIR is undesirable, we also offer a variant that fetches the values directly (i.e., the protocol w/ BFS-CODE in Figure [fig:comparison]; see Sections 5 and 6.4 for more detail), as in prior work.
We implement all of our proposed schemes and compare their performance with that of prior work. Our experiments show that our schemes outperform the fetching procedure of prior work by a factor of 1800 when fetching 16 records, which results in a 26x speedup for the full search functionality.
Let \(\kappa\) be the security parameter. For a vector \(a\), let \(\mathsf{idx}(a)\) denote the set of all the positions \(i\) such that \(a_i\) is non-zero, i.e., \[\mathsf{idx}(a) := \{ i: a_i \ne 0\}.\]
We will use the following version of Chernoff bound.
Theorem 1. Let \(X_1, \ldots, X_n\) be independent random variables taking values in \(\{0,1\}\) such that \(\Pr[X_i = 1] = p\). Let \(\mu := {\bf Exp}[\sum X_i] = np\). Then for any \(\delta > 0\), it holds \[\Pr\left[ \sum_{i=1}^n X_i \ge (1 + \delta) \mu \right] \le \left (\frac {e^\delta} {(1+\delta)^{(1+\delta)}} \right)^\mu.\]
We use a standard CPA-secure (leveled) fully homomorphic encryption scheme \(({\mathsf Gen}, {\mathsf Enc}, {\mathsf Dec})\). We refer readers to [94], [95] for a formal definition. We use \(\unicode{x27E6} x \unicode{x27E7}\) to denote an encryption of \(x\).
We also use \(+\) (resp. \(\cdot\)) to denote homomorphic addition (resp., multiplication). For example, \(\unicode{x27E6}c \unicode{x27E7}:= \unicode{x27E6}a \unicode{x27E7}+ \unicode{x27E6}b \unicode{x27E7}\) means that homomorphic addition of two FHE-ciphertexts \(\unicode{x27E6}a \unicode{x27E7}\) and \(\unicode{x27E6}b \unicode{x27E7}\) has been applied, which results in \(\unicode{x27E6}c \unicode{x27E7}\).
A PIR protocol allows the client to choose the index \(i\) and retrieve the \(i\)th record from one (or more) untrusted server(s) while hiding the index value \(i\) [24].
Assume that each of the \(k\) servers holds the same \(n\) records \(D = (d_1, \ldots, d_n)\), where all items \(d_i\) have equal length. A single-round \(k\)-server PIR protocol consists of the following algorithms:
The query algorithms \(Q_j(i, r) \to q_j\) for each server \(j \in [k]\), which are executed by the client with input index \(i\) and randomness \(r\).
The answer algorithms \(A_j(D, q_j) \to a_j\) for each server \(j \in [k]\), each executed by the \(j\)th server.
The reconstruction algorithm \(R(i, r, (a_1, \ldots, a_k)) \to d_i\). The communication complexity of a PIR protocol is defined by the sum of the all query lengths and answer lengths, i.e., \[\sum_{j \in [k]} |q_j| + |a_j|.\]
A PIR protocol is correct if for any \(D = (d_1, \ldots, d_n)\) with \(|d_1| = \cdots = |d_n|\), and for any \(i \in [n]\), it holds that \[\Pr_r \bigg [ R \Big(i, r, \big \{ A_j(D, Q_j(i,r)) \big \}_{j=1}^k \Big ) = d_i \bigg ] = 1.\] A PIR protocol is private if for any \(j \in [k]\), for any \(i_0, i_1 \in [n]\) with \(i_0 \ne i_1\), the following distributions are computationally (or statistically) indistinguishable: \[\{Q_j(i_0, r)\}_r \approx \{Q_j(i_1, r)\}_r.\]
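As a toy instance of this syntax, the classic two-server scheme of Chor et al. [24] can be sketched as follows: the client sends a random subset of \([n]\) to one server and the same subset with index \(i\) flipped to the other, and XORs the two answers. (The sketch below is ours; its communication is linear in \(n\), so it only illustrates the syntax and privacy property, not sublinearity.)

```python
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def Q(i, n):
    """Client: the two query vectors differ exactly at index i."""
    q1 = [secrets.randbits(1) for _ in range(n)]  # random subset of [n]
    q2 = q1.copy()
    q2[i] ^= 1
    return q1, q2

def A(D, q):
    """Server j: XOR of the records selected by its query vector."""
    a = bytes(len(D[0]))
    for rec, bit in zip(D, q):
        if bit:
            a = xor_bytes(a, rec)
    return a

def R(a1, a2):
    """Client: the random subset cancels, leaving record d_i."""
    return xor_bytes(a1, a2)

D = [bytes([j]) * 8 for j in range(16)]           # 16 records of 8 bytes each
q1, q2 = Q(5, len(D))
assert R(A(D, q1), A(D, q2)) == D[5]
```

Each query vector alone is uniformly random, so neither server learns anything about \(i\).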
A Bloom filter [96] is a well-known space-efficient data structure that allows a user to insert arbitrary keywords and later check whether a certain keyword is in the filter.
The filter \(B\) is essentially an \(\ell\)-bit vector, where \(\ell\) is a parameter, which is initialized with all zeros. The filter is also associated with a set of \(\eta\) different hash functions \[\mathcal{H}= \{ h_q: \{0,1\}^* \to [\ell] \}_{q=1}^\eta.\]
To insert a keyword \(\alpha\), the hash results are added to the filter. In particular,
For \(q \in [\eta]\) do the following:
Compute \(j = h_q(\alpha)\) and set \(B_j := 1\). Here \(B_j\) is the \(j\)th bit of \(B\).
To check whether a keyword \(\beta\) has been inserted into a filter \(B\), one checks the filter at all hash positions. In particular,
For \(q \in [\eta]\) do the following:
Compute \(j = h_q(\beta)\) and check if \(B_j\) is set.
If all checks pass output "yes". Otherwise, output "no".
The main advantage of the filter is that it guarantees there will be no false negatives and allows a tunable rate of false positives: \[\bigg(1 - \Big (1 - \frac{1}{ \ell} \Big )^{\eta s} \bigg)^\eta \approx \Big(1 - e^{-\frac{\eta s}{ \ell}} \Big)^\eta,\] where \(s\) is the number of keywords in a Bloom filter.
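The following sketch implements this standard Bloom filter, modeling the \(\eta\) hash functions with salted SHA-256 (an illustrative choice on our part; any hash family mapping into \([\ell]\) works):

```python
import hashlib

class BloomFilter:
    def __init__(self, ell, eta):
        self.bits = [0] * ell          # the ell-bit vector B
        self.ell, self.eta = ell, eta

    def _positions(self, keyword):
        """h_q(keyword) for q = 1..eta, modeled via salted SHA-256."""
        return [int(hashlib.sha256(f"{q}|{keyword}".encode()).hexdigest(), 16) % self.ell
                for q in range(self.eta)]

    def insert(self, keyword):
        for j in self._positions(keyword):
            self.bits[j] = 1           # bit-wise OR semantics

    def check(self, keyword):
        # no false negatives; false positives occur at the rate derived above
        return all(self.bits[j] for j in self._positions(keyword))

bf = BloomFilter(ell=1024, eta=4)
bf.insert("alice")
assert bf.check("alice")               # always true
assert not bf.check("bob")             # true except with small probability
```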
We present our analysis in the random oracle model; that is, the hash functions are modeled as random functions.
In this work, we leverage a variant of the Bloom filter where, when inserting an item, the bit-wise OR operation is replaced by addition. There have been works using a similar idea of having each cell hold an integer instead of holding a bit [97], [98].
Moreover, we consider a limited scenario where an upper bound on the number of keywords to be inserted is known beforehand. In particular, let \(s\) denote this upper bound.
As before, the filter is also associated with a set of \(\eta\) different hash functions \(\mathcal{H}= \{ h_q: \{0,1\}^* \to [\ell] \}_{q=1}^\eta\). However, now the filter \(B\) is not an \(\ell\)-bit vector but a vector where each element is in \([s \eta]\) (i.e., \(B \in [s\eta]^\ell\)). Therefore, the number of bits needed to encode \(B\) grows by a multiplicative factor of \(\lceil \lg (s\eta) \rceil\).
The BF operations are described below where differences are marked by framed boxes.
To insert a keyword \(\alpha\), the hash results are added to the filter. In particular,
For \(q \in [\eta]\) do the following:
Compute \(j = h_q(\alpha)\) and set \(\boxed{B_j := B_j + 1}\).
To check whether a keyword \(\beta\) has been inserted into a filter \(B\), one checks the filter at all hash positions. In particular,
For \(q \in [\eta]\) do the following:
Compute \(j = h_q(\beta)\) and check if \(B_j\) is \(\boxed{\text{nonzero}}\).
If all checks pass output "yes". Otherwise, output "no".
It is easy to see that this variant construction enjoys the same properties as the original BF construction.
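In code, the change from the standard filter is one line on each path (reusing the BloomFilter sketch above):

```python
class AlgebraicBloomFilter(BloomFilter):
    def insert(self, keyword):
        for j in self._positions(keyword):
            self.bits[j] += 1          # addition replaces bit-wise OR

    def check(self, keyword):
        # a cell counts insertions, so "set" now means "nonzero"
        return all(self.bits[j] != 0 for j in self._positions(keyword))
```

After at most \(s\) insertions, every counter is at most \(s\eta\), matching \(B \in [s\eta]^\ell\).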
As our main building block, we introduce a new tool we call Compressed Oblivious Encoding. A compressed oblivious encoding takes as input a large, but sparse, vector and compresses it to a much smaller encoding from which the non-zero entries of the original vector can be recovered. What makes this encoding oblivious is that the encoding procedure is oblivious to the original data; in fact, in our constructions the original data will all be encrypted. An efficient encoding must satisfy the following two performance requirements: 1) The size of the encoding must be sublinear in the size of the original array, and 2) constructing the encoding should be computationally cheap. Our constructions only use (homomorphic) addition and multiplication by constant (i.e. plaintext values).
A related notion is that of compaction over encrypted data [99], [100], which aims to move all non-zero entries of a vector to the front of the output. Our encoding can be viewed as a form of noisy compaction where, in addition to keeping all the non-zero entries, a small number of zero entries may be mixed into the result. Thus, a compressed encoding trades some inaccuracy in the output for much cheaper construction costs.
We define two variants of compressed oblivious encodings, one that encodes the indices of non-zero entries and one that encodes the actual entries themselves.
A compressed oblivious index encoding (COIE) encodes the indices or locations of all the non-zero entries in the input array. We begin by defining the parameters and syntax for a COIE scheme.
A COIE scheme is parametrized as follows.
\(n\): Input size – The dimension of the input vector \(v\).
\(s\): Sparsity – Bound on the number of non-zero entries in \(v\).
\(c\): Compactness – The dimension of the output encoding.
\(f_p\): False positives – The upper bound on the number of false positives returned by the decoding algorithm.
An \((n, s, c, f_p)\)-COIE scheme has the following syntax:
\(\unicode{x27E6}\gamma_1 \unicode{x27E7}, \ldots, \unicode{x27E6}\gamma_c \unicode{x27E7}\leftarrow \mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})\). The \(\mathsf{Encode}\) algorithm takes as input a vector of ciphertexts with \(v_i \in \{0,1\}\) for all \(i \in [n]\). It outputs an encrypted encoding \(\unicode{x27E6}\gamma_1 \unicode{x27E7}, \ldots, \unicode{x27E6}\gamma_c \unicode{x27E7}\).
\(I \leftarrow\mathsf{Decode}(\gamma_1, \ldots, \gamma_c)\). The \(\mathsf{Decode}\) algorithm takes the encoding \((\gamma_1, \ldots, \gamma_c)\), in decrypted form, and outputs a set \(I \subseteq [n]\).
Let \((\gamma_1, \ldots, \gamma_c) \leftarrow{\mathsf Dec}(\unicode{x27E6}\gamma_1 \unicode{x27E7}, \ldots, \unicode{x27E6}\gamma_c \unicode{x27E7})\) denote a correct decryption of the encoding.
An \((n, s, c, f_p)\)-COIE scheme is correct if the following conditions are satisfied:
(No false negatives) For all \(v \in \{0,1\}^n\) with at most \(s\) non-zero positions, and for all \(i \in \mathsf{idx}(v)\), it should hold \[i \in \mathsf{Decode}( {\mathsf Dec}(\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})))\] with probability at least \(1-\mathsf{negl}(\kappa)\) where the random coins are taken from \(\mathsf{Encode}\).
(Few false positives) For all \(v \in D^n\) with at most \(s\) non-zero positions, consider the set of false positives \[E = \{i \in [n]: v_i = 0\mbox{, but } i \in I \},\] where \(I = \mathsf{Decode}({\mathsf Dec}(\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7}))).\)
We require that \(|E| \le f_p\) with overwhelming probability over the randomness of \(\mathsf{Encode}\).
For efficiency, we look at the following three parameters of a COIE:
The type and number of operations used by the \(\mathsf{Encode}\) algorithm.
The size of the encoding.
The computation cost of the \(\mathsf{Decode}\) algorithm.
For an efficient construction, we require that the latter two of these are sublinear in the size of the input vector.
A Compressed Oblivious Data Encoding (CODE) scheme is very similar to COIE except, rather than encoding the locations of non-zero entries, it encodes the values of these entries. We give a definition of CODE below where differences are marked by framed boxes.
A CODE scheme is parametrized by the same four parameters \((n,s,c,f_p)\) as a COIE.
An \((n, s, c, f_p)\)-CODE scheme over domain \(\boxed{D}\) has the following syntax:
\(\unicode{x27E6}\gamma_1 \unicode{x27E7}, \ldots, \unicode{x27E6}\gamma_c \unicode{x27E7}\leftarrow \mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})\). The \(\mathsf{Encode}\) algorithm takes as input a vector of ciphertexts with \(v_i \in \boxed{D}\) for all \(i \in [n]\). It outputs an encrypted encoding \(\unicode{x27E6}\gamma_1 \unicode{x27E7}, \ldots, \unicode{x27E6}\gamma_c \unicode{x27E7}\).
\(\boxed{V} \leftarrow\mathsf{Decode}(\gamma_1, \ldots, \gamma_c)\). The \(\mathsf{Decode}\) algorithm takes the encoding \((\gamma_1, \ldots, \gamma_c)\), in decrypted form, and outputs a set of values \(V\).
An \((n, s, c, f_p)\)-CODE scheme over domain \(D\) is correct if the following conditions are satisfied:
(No false negatives) For all \(v \in D^n\) with at most \(s\) non-zero positions, and for all \(i \in \mathsf{idx}(v)\), it should hold that \[\boxed{v_i \in \mathsf{Decode}( {\mathsf Dec}(\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})))}\]
with probability \(1-\mathsf{negl}(\kappa)\) where the random coins are taken from \(\mathsf{Encode}\).
(Few false positives) For all \(v \in D^n\) with at most \(s\) non-zero positions, consider the set of false-positive values \[\boxed{E = \{ z \in V: z \ne v_i \mbox{~for any~} i \in \mathsf{idx}(v)\}},\] where \(V = \mathsf{Decode}({\mathsf Dec}(\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7}))).\)
We require \(|E| \le f_p\) with overwhelming probability over the randomness of \(\mathsf{Encode}\).
We assume the input index vector \(v \in \{0,1\}^n\) is sparse. In particular, throughout the paper, we assume \(s = o(n)\).
Using an algebraic BF, we can create an \((n, s, c, f_p)\)-COIE scheme (the parameters \(c\) and \(f_p\) will be worked out after the description of the scheme).
The encoding algorithm works as follows:
Initialize a BF \(\unicode{x27E6}B \unicode{x27E7}:= (\unicode{x27E6}B_1 \unicode{x27E7}, \ldots, \unicode{x27E6} B_c \unicode{x27E7})\) with \(B_j = 0\) for all \(j\). Let \(\mathcal{H}= \{ h_q: \{0,1\}^* \to [c] \}_{q=1}^\eta\) be the associated hash functions.
For \(i = 1, \ldots, n\):
For \(q = 1, \ldots, \eta\), do the following: Compute \(j = h_q(i)\) and set \(\unicode{x27E6}B_j \unicode{x27E7}:= \unicode{x27E6}B_j \unicode{x27E7}+ \unicode{x27E6}v_i \unicode{x27E7}\).
Note that in the insertion loop above, if \(v_i = 0\), then \(B_j\) stays the same; if \(v_i = 1\), then \(B_j\) is increased by 1. This implies that \(B\) stores exactly the result of the operations \(\{ {\mathsf BF.Insert}(B, i) : i \in \mathsf{idx}(v)\}.\)
Given the algebraic BF \(B\), we can recover the indices for the nonzero elements as follows:
Initialize \(I\) to be the empty set.
For \(i \in [n]\): if \({\mathsf BF.Check}(B, i)\) = “yes", add \(i\) to \(I\).
return \(I\).
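A plaintext simulation of this warm-up scheme, reusing the AlgebraicBloomFilter sketch from earlier (in the real scheme, each addition \(B_j := B_j + v_i\) is a homomorphic addition on ciphertexts):

```python
def coie_encode(v, ell, eta):
    """Insert every index i with weight v_i; only additions are used."""
    bf = AlgebraicBloomFilter(ell, eta)
    for i, vi in enumerate(v):
        for j in bf._positions(i):
            bf.bits[j] += vi           # adds 0 or 1 (homomorphic in the scheme)
    return bf

def coie_decode(bf, n):
    """Linear scan: n BF.Check operations (improved by BF-COIE below)."""
    return [i for i in range(n) if bf.check(i)]

v = [0] * 1024
for i in (3, 77, 512):
    v[i] = 1
bf = coie_encode(v, ell=256, eta=4)
assert set(coie_decode(bf, len(v))) >= {3, 77, 512}   # no false negatives
```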
Since this is a warm-up construction, we perform only a rough estimation on the false positive parameter and the compactness parameter.
For reasons that will become clear later, we wish to keep the upper bound \(f_p\) on the number of false positives small. In particular, we use a BF with false-positive rate \(1/n\). Since there are \(n\) \(\mathsf BF.Check\) operations, the expected number of false positives is 1, and by the Chernoff bound, the number of false positives is \(O(\log \kappa)\) with overwhelming probability in \(\kappa\). Accordingly, it suffices to set \(f_p = \Theta(\log \kappa)\).
The dimension \(c\) of the Bloom filter \(B\) can be computed using the following equation of BF false positive ratio:
\[\Big(1 - e^{-\frac{\eta s}{c}} \Big)^\eta \le \frac 1 n,\]
Setting \(c = \eta s \cdot n^{\frac 1 \eta}\) will satisfy the equation. This can be verified by using the inequality \(1 - e^{-x} \le x\) for \(x \in [0,1]\); that is, \(1 - e^{-\frac{\eta s}{c}} \le \frac{\eta s}{c} = 1/n^{1/\eta}.\)
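For concreteness, with \(n = 2^{20}\), \(s = 16\), and \(\eta = 2\), this gives \(c = 2 \cdot 16 \cdot \sqrt{2^{20}} = 32768\) cells, far smaller than \(n = 2^{20}\).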
The encoding algorithm uses \(n\eta\) homomorphic addition operations and \(n\eta\) hash evaluations.
The dimension \(c\) of the encoding is \(\eta s \cdot n^{\frac 1 \eta}\). Usually, \(\eta\) is set between 2 and 32.
The decoding algorithm uses \(n\) operations of \(\mathsf BF.Check\).
In summary, we have reduced the encoding size \(c\) to be sub-linear in \(n\), as desired. However, we still need to reduce the number of \(\mathsf BF.Check\) operations in Decode to be sub-linear in \(n\). We show how to achieve that in our next construction.
We now show how to improve the above construction to achieve decoding in time \(o(n)\). The main idea of this improvement is to use Bloom filters to represent a binary search tree, one BF per level of the tree. We can then guide the decoding algorithm to avoid decoding branches that do not contain non-zero entries. As most branches can be truncated well before reaching the leaf-level Bloom filter, this results in sublinear total cost.
Before presenting the formal protocol for this construction we convey our idea through an example. Let \(n = 32\), and suppose we wish to encode the indices \(I = \{1, 15, 16\}\). Denote \[I^k = \left \{ \Big \lceil \frac i {2^k} \Big \rceil : i \in I \right \}.\]
Intuitively, an element \(i\) in \(I^k\) can be thought of as a range of length \(2^k\) covering \([(i-1)\cdot 2^{k}+1, i\cdot2^k]\). We have:
\(I^4 = \{1\}.\)
\(I^3 = \{1, 2\}.\)
\(I^2 = \{1, 4\}.\)
\(I^1 = \{1, 8\}.\)
\(I^0 = \{1, 15, 16\}.\)
Now, assume we insert each set \(I^k\) into its own BF. We can traverse these BF’s to decode the set \(I\) as follows:
Check \(I^4\) for all possible indices. The only possible indices at this level are \(1\) and \(2\), since \(n=32\) and \(I^4\) divides the original indices by \(2^4 = 16\).
In the above example, when we query the BF for \(I^4\), it contains only the index \(1\), which means that no values greater than 16 are contained in \(I\). We can thus avoid checking any such indices at the lower levels.
Now consider the BF at the next level (i.e., the BF for \(I^3\)). The only possible values at this level are \(1, 2, 3, 4\), but since we already know that there are no values greater than 16 in \(I\), we only need to check values \(1\) and \(2\) (index \(3\) at this level covers the range \([17, 24]\), which lies entirely beyond 16).
Check \(I^3\) for indices \(1, 2\). The BF will show that indices \(1\) and \(2\) are both present, which means that we need to check indices \(1,2\) and \(3,4\) in \(I^2\).
Check \(I^2\) for indices \(1, 2, 3, 4\). The BF will show that indices \(1\) and \(4\) are present, which means that we only need to check indices \(1, 2\) and \(7, 8\) in \(I^1\), all other indices can be skipped.
Check \(I^1\) for indices \(1, 2, 7, 8\). The BF will show that indices \(1\) and \(8\) are present, which means that we need to check indices \(1, 2\) and \(15, 16\).
Check \(I^0\) for indices \(1, 2, 15, 16\), and output the final present indices \(1, 15, 16\).
Assuming, for now, that there are no false positives, observe that this approach checks at most \(2 \cdot |I|\) values at each level, and there are \(\lg n\) levels. Therefore, the decoding algorithm will check \(O( |I| \cdot \lg n)\) indices, which is sub-linear in \(n\).
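The traversal in this example can be sketched as follows, with exact sets standing in for the per-level Bloom filters (so membership tests below play the role of \(\mathsf{BF.Check}\), minus false positives; the helper names are ours):

```python
import math

def levels(I, t):
    """I^k = { ceil(i / 2^k) : i in I } for k = 0..t."""
    return [{math.ceil(i / 2**k) for i in I} for k in range(t + 1)]

def decode(lv, n):
    """Walk top-down, expanding only the ranges that test positive."""
    t = len(lv) - 1
    cand = set(range(1, n // 2**t + 1))           # all indices at the top level
    for k in range(t, 0, -1):
        hits = {i for i in cand if i in lv[k]}    # stands in for BF.Check(B^k, i)
        cand = {2*i - 1 for i in hits} | {2*i for i in hits}
    return sorted(i for i in cand if i in lv[0])

lv = levels({1, 15, 16}, t=4)                     # the example above: n = 32
assert [sorted(s) for s in lv] == [[1, 15, 16], [1, 8], [1, 4], [1, 2], [1]]
assert decode(lv, n=32) == [1, 15, 16]
```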
We now describe our BF-COIE construction. As before, we will work out the parameters after describing our construction. The encoding algorithm is described in Algorithm [alg:BFCOIEE].
For simplicity, \(n\) and \(s\) are assumed to be powers of 2.
\(t := \lg {\frac n {2s}}\)
For \(k = 0, \ldots, t\):
(a) Initialize \(\unicode{x27E6}B^k \unicode{x27E7}= (\unicode{x27E6}B^k_1 \unicode{x27E7}, \ldots, \unicode{x27E6} B^k_\ell \unicode{x27E7}) := ({\mathsf{nil}}, \ldots, {\mathsf{nil}})\).
(b) Choose \(\mathcal{H}^k = \{ h^k_q: \{0,1\}^* \to [\ell] \}_{q=1}^\eta\) at random.
(c) For \(i \in [n]\) and for \(q \in [\eta]\): compute \(i' := \lceil i/2^k \rceil\) and \(j := h^k_q(i')\). If \(\unicode{x27E6}B^k_j \unicode{x27E7}\) is \({\mathsf{nil}}\), then set \(\unicode{x27E6}B^k_j \unicode{x27E7}:= \unicode{x27E6}v_i \unicode{x27E7}\); otherwise, set \(\unicode{x27E6}B^k_j \unicode{x27E7}:= \unicode{x27E6}B^k_j \unicode{x27E7}+ \unicode{x27E6}v_i \unicode{x27E7}\).
Output \(\unicode{x27E6}B^0 \unicode{x27E7}, \ldots, \unicode{x27E6}B^t \unicode{x27E7}\).
Protocol : \({\mathsf BF\mbox{-}COIE}.\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})\)
Note that in steps (a) to (c) above, the warm-up construction is used to construct BF \(B^k\) for indices \(I^k\).
In order to reduce the size of the output encoding, we set \(t\) to be \(\lg {\frac n {2s}}\) instead of \(\lg n\) as described previously. Note that when \(t\) is set in this way, \(I^t\) contains at most \(n/2^t = 2s\) possible values thus maintaining our invariant.
The decoding algorithm is described in Algorithm [alg:BFCOIED].
1. Initialize \(I, I^0, \ldots, I^{t-1} := \emptyset\).
2. Initialize \(I^t := \{1, \ldots, n/2^t\} = [2s]\).
3. For \(k = t, t-1, \ldots, 1\) and for \(i' \in I^k\): if \({\mathsf BF.Check}(B^k, i')\) is “yes”, add \(2i'-1\) and \(2i'\) to \(I^{k-1}\).
4. For \(i \in I^0\): if \({\mathsf BF.Check}(B^0, i)\) is “yes”, add \(i\) to \(I\).
5. Output \(I\).
Protocol : \({\mathsf BF\mbox{-}COIE}.\mathsf{Decode}(B^0, \ldots, B^t)\)
The following lemma will be useful to analyze the parameters \(c\) and \(f_p\).
Lemma 2. Consider a Bloom filter with false positive rate \(\frac{1}{m}\), where \(m\) is an arbitrary positive integer. Suppose at most \(m\) \(\mathsf BF.Check\) operations are performed in the BF. Then, for any \(\delta > 0\), we have: \[\Pr [ \mbox{\rm \# false positives} \ge 1+\delta] \le {\frac {e^\delta} {(1+\delta)^{(1+\delta)}}}.\]
The proof, by an application of the Chernoff bound, can be found in Appendix 7.1.1.
Regarding the above Lemma, we remark that setting \(\delta = \Omega(\log \kappa)\), we have \[\Pr\left[\sum_{i=1}^m X_i \ge 1 + \delta \right] = \mathsf{negl}(\kappa).\]
We set the false-positive upper bound \(f_p := \Theta(\log \kappa)\) for the BF-COIE scheme. In our experiments, we set \(f_p = 16\).
Now, letting \(m = \max(2s, s+2f_p)\), we set the BF false-positive rate to \(1/m\). Recall that in the BF-COIE construction, the topmost BF \(B^t\) undergoes the \(\mathsf BF.Check\) operation \(2s\) times; see lines (2)–(3) in Algorithm [alg:BFCOIED]. Using the above Lemma, the number of false positives in the top-level BF \(B^t\) is at most \(f_p\) with all but negligible probability in \(\kappa\). Furthermore, each index \(i\) in \(B^t\) is expanded into two indices \(2i-1\) and \(2i\) in \(B^{t-1}\). This means that the number of false indices to be checked in \(B^{t-1}\) due to the false positives in \(B^t\) is at most \(2f_p\).
Now consider an index \(i\) that belongs to \(B^{t}\). Algorithm [alg:BFCOIED] will run \({\mathsf BF.Check}\) on the values \(2i-1\) and \(2i\) in \(B^{t-1}\). Since at least one of these values must actually belong to \(B^{t-1}\), this leads to at most one false index being checked. Thus, the maximum number of false indices that would be checked in \(B^{t-1}\) is at most \(s + 2f_p\) (i.e., \(2f_p\) from false positives of \(B^t\) and \(s\) from true positives of \(B^t\)).
The above argument applies inductively all the way down to the bottommost level, which means that the maximum number of false indices checked in each level's BF \(B^k\) is at most \(s + 2f_p\). In the end, the bottom BF will have at most \(f_p\) false positives, and hence the overall BF-COIE scheme has at most \(f_p\) false positives with all but negligible probability in \(\kappa\).
For the compactness parameter \(c\), we must determine the dimension \(\ell\) of each BF. Recall that we set the BF false positive rate to \(1/m\) for \(m = \max(2s, s+2f_p)\):
\[\Big(1 - e^{-\frac{\eta s}{\ell}} \Big)^\eta \le \frac 1 m.\] Setting \(\ell = \eta \cdot s \cdot m^{\frac 1 \eta}\) would satisfy the above condition, which can be verified using an inequality \(1 - e^{-x} \le x\) for \(x \in [0,1]\); that is, \(1 - e^{-\frac{\eta s}{\ell}} \le \frac{\eta s}{\ell} = (1/m)^{1/\eta}.\)
Since the encoding has \(t+1\) BFs, the overall compactness parameter is as follows: \[c = (t + 1) \cdot \ell = O\left ( \eta \cdot s^{1 + \frac 1 \eta} \cdot \lg \frac n s \right ).\]
The size \(c\) of encoding is \(O\left ( \eta \cdot s^{1 + \frac 1 \eta} \cdot \lg \frac n s \right )\). In our experiment, we choose \(\eta = 2\).
The encoding algorithm uses \(O(\eta \cdot n \cdot \lg \frac n s)\) homomorphic addition operations and hash evaluations.
The decoding algorithm uses \(O(s \lg \frac n s)\) \(\mathsf BF.Check\) operations.
In summary, assuming \(s = o(n)\), we reduced the encoding size \(c\) to be sub-linear in \(n\). Moreover, we also reduced the number of \(\mathsf BF.Check\) operations to be sub-linear in \(n\).
Although this scheme has multiple BFs, the size of the encoding \(c\) is smaller than that of the warm-up scheme! This is because, with multiple levels of BFs, we can relax the false-positive ratio of each BF. The encoding computation time is increased by a multiplicative factor of \(\lg \frac n s\).
We offer another encoding scheme, using quite different techniques, that eliminates the false positives of the prior construction. To achieve this, we abandon Bloom filters and instead use a power-sum encoding, as has been done in several works using DC-nets for anonymous broadcast.
We describe a COIE scheme based on power sums, which we call PS-COIE. As before, we will work out the parameters after describing our construction. The encoding algorithm is shown below.
For \(j = 1, \ldots, s\): Compute \(\unicode{x27E6}w_j \unicode{x27E7}= \sum_{i=1}^n i^j \cdot \unicode{x27E6}v_i \unicode{x27E7}\).

Output \(\unicode{x27E6}w_1 \unicode{x27E7}, \ldots, \unicode{x27E6}w_s \unicode{x27E7}.\)
Protocol : \({\mathsf PS\mbox{-}COIE}.\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})\)
Note that the values of \(i^j\) (modulo the underlying plaintext modulus) are publicly computable, so computing \(i^j \cdot \unicode{x27E6}v_i \unicode{x27E7}\) only requires scalar multiplication and no homomorphic multiplication.
Recall that \(v_i \in \{0,1\}\). If we let \(I = \{i: v_i = 1\}\) denote the indices of the nonzero elements, then note that \[w_j = \sum_{i=1}^n i^j \cdot v_i = \sum_{i \in I} i^j.\] Therefore, this \(w_j\) is the \(j\)th power sum of the indices. Using the power sums, we present the decoding algorithm in Algorithm [alg:pscoie].
Recall that we have \(w_j = \sum_{x \in I} x^j,\) for \(j = 1, \ldots, s\), and we would like to reconstruct all \(x\)’s in \(I\).
Let \(f(x) = x^s - a_{s-1} x^{s-1} + a_{s-2} x^{s-2} - \cdots + (-1)^s a_0\) denote the monic polynomial whose roots are the indices in \(I\); up to sign, its coefficients \(a_{s-k}\) are the elementary symmetric polynomials of the roots.
Use Newton’s identities to compute the coefficients of this polynomial \(f(x)\): \[\begin{aligned} a_s &= 1\\ a_{s-1} &= w_1\\ a_{s-2} &= (a_{s-1} w_1 - w_2 )/2\\ a_{s-3} &= (a_{s-2} w_1 - a_{s-1} w_2 + w_3)/3\\ \vdots\\ a_{0} &= (a_{1} w_1 - a_{2} w_2 + \cdots w_s)/s \end{aligned}\]
Extract and output the roots of the polynomial \(f(x)\).
Protocol : \({\mathsf PS\mbox{-}COIE}.\mathsf{Decode}(w_1, \ldots, w_s)\)
This COIE scheme has no false positives; that is, \(f_p = 0\). The compactness parameter \(c\) is equal to \(s\).
The encoding algorithm uses \(s \cdot n\) homomorphic addition operations and \(s \cdot n\) scalar multiplications.
The encoding consists of \(s\) ciphertexts.
The decoding algorithm computes coefficients in time \(O(s^2)\). Roots of degree-\(s\) polynomial can be found in time \(O(s^3 \log p)\), where \(p\) is the plaintext modulus of the underlying FHE, by using the Cantor–Zassenhaus algorithm [101].
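The following sketch implements PS-COIE in the clear over \(\mathbb{F}_p\). For simplicity it finds the roots by trial evaluation over \([n]\) rather than with Cantor–Zassenhaus, and all helper names are ours:

```python
def ps_encode(v, s, p):
    """w_j = sum_i i^j * v_i mod p: additions and scalar multiplications only."""
    n = len(v)
    return [sum(pow(i, j, p) * v[i - 1] for i in range(1, n + 1)) % p
            for j in range(1, s + 1)]

def ps_decode(w, s, n, p):
    """Newton's identities yield the elementary symmetric polynomials a_{s-k};
    the roots of f(x) = x^s - a_{s-1} x^{s-1} + a_{s-2} x^{s-2} - ... are I."""
    e = [1] + [0] * s
    for k in range(1, s + 1):
        acc = sum((-1) ** (i - 1) * e[k - i] * w[i - 1] for i in range(1, k + 1))
        e[k] = acc * pow(k, -1, p) % p                 # divide by k mod p
    f = lambda x: sum((-1) ** k * e[k] * pow(x, s - k, p) for k in range(s + 1)) % p
    return [x for x in range(1, n + 1) if f(x) == 0]   # brute-force root finding

p, n, s = 1009, 32, 4          # prime p > n, so indices stay distinct mod p
v = [0] * n
for i in (1, 15, 16):
    v[i - 1] = 1
assert ps_decode(ps_encode(v, s, p), s, n, p) == [1, 15, 16]
```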
In the previous section, we showed two constructions of COIE schemes for encoding a vector of indices using sublinear storage. We now turn to the construction of CODE schemes, which, instead of encoding the indices of non-zero entries, encode the actual data values.
To construct our CODE scheme, we first construct an auxiliary data structure that supports the following operations:
\({\mathsf Init}()\). Initialize the data structure.
\({\mathsf Insert}(key, value)\). This operation allows the user to insert an item based on its key and value.
\({\mathsf Values}()\). Returns all values that have been inserted thus far.
This data structure is simpler than a typical key-value store, since it does not need to look up an individual item by its key. Note, however, that it is still sufficient for our purpose of constructing a CODE scheme.
We now show how to instantiate a simplified key-value store using a data structure we call a Bloom filter set (BFS) that is in turn based on the algebraic Bloom filter presented in Section 3.2.2. To insert a pair \((key, value)\), the Bloom filter set stores the actual \(value\) rather than an indicator bit. Items are inserted similar to before, by adding their value to the locations indicated by the hashes of the \(key\).
For our construction we make an assumption on the format of the inserted data. Specifically, we assume that all inserted values contain a unique checksum (e.g., a cryptographic hash of the value). We assume that this checksum is sufficiently long that a random sum of checksums does not give a valid checksum except with negligible probability (as a function of \(\kappa)\).
We first describe the construction of the data structure. We show below how to choose parameters in such a way that the client can extract all the matched items from this Bloom filter, with overwhelming probability.
\({\mathsf BFS.Init}() \to (B, \mathcal{H})\). Create an \(\ell\)-dimensional vector \(B\) where each element can store any possible value in the domain \(D\). Choose a set of \(\eta\) different hash functions \(\mathcal{H}= \{ h_q: \{0,1\}^* \to [\ell] \}_{q=1}^\eta\). Initialize \(B_i := 0\) for \(i \in [\ell]\).
\({\mathsf BFS.Insert}(B, \mathcal{H}, key, \alpha)\). To add \((key, \alpha)\), we add \(\alpha\) to the values stored at the locations indicated by the hashes of \(key\). Specifically,
For \(q \in [\eta]\):
Compute \(j = h_q(key)\) and set \(B_j := B_j + \alpha\).
\({\mathsf BFS.Values}(B).\) Initialize a set \(V\) to be the empty set. For \(j \in [\ell]\), if \(B_j\) has a valid checksum, add \(B_j\) to \(V\). Finally, output \(V\).
We note that, as previously proposed by Goodrich [102], it is possible to avoid the checksum by maintaining a counter of the number of values inserted for each location. Then, \({\mathsf BFS.Values}\) only returns values at locations with a counter of 1.
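A plaintext sketch of the BFS with a toy 32-bit checksum: each cell packs a value and its tag into a single integer, and a cell holding the sum of two or more encodings fails the tag check except with probability about \(2^{-32}\). (Names and parameters are ours; the real scheme uses a longer checksum and homomorphic additions.)

```python
import hashlib

TAG_BITS = 32
MASK = (1 << TAG_BITS) - 1

def tag(val):
    """Toy checksum: truncated SHA-256 of the value."""
    return int(hashlib.sha256(str(val).encode()).hexdigest(), 16) & MASK

def encode_value(val):
    return (val << TAG_BITS) | tag(val)            # value || checksum

class BloomFilterSet:
    def __init__(self, ell, eta):                  # BFS.Init
        self.cells = [0] * ell
        self.ell, self.eta = ell, eta

    def _positions(self, key):
        return [int(hashlib.sha256(f"{q}|{key}".encode()).hexdigest(), 16) % self.ell
                for q in range(self.eta)]

    def insert(self, key, val):                    # BFS.Insert
        enc = encode_value(val)
        for j in self._positions(key):
            self.cells[j] += enc                   # addition, as in the algebraic BF

    def values(self):                              # BFS.Values
        out = set()
        for c in self.cells:
            if c and tag(c >> TAG_BITS) == (c & MASK):
                out.add(c >> TAG_BITS)             # valid checksum: collision-free cell
        return out

bfs = BloomFilterSet(ell=256, eta=8)               # ell >= 2(s*eta - 1) for s = 3
for key, val in [(1, 111), (15, 222), (16, 333)]:
    bfs.insert(key, val)
assert bfs.values() == {111, 222, 333}             # holds w.h.p. by Lemma 3
```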
We show how to set the Bloom filter parameters to guarantee that all values can be recovered with all but negligible probability. We assume that we know the upper bound \(s\) on the number of inserted values. We prove the following lemma.
Lemma 3. If at most \(s\) values have been inserted in the \(\mathsf BFS\) data structure, then by setting \(\eta\) and \(\ell\) such that \[\ell \ge 2 (s\eta - 1),\] we can recover all \(s\) values with probability at least \(1 - s\cdot(1/2)^\eta\).
Proof. Consider a (key, value) pair \((k_i,\alpha_i)\). We say that this pair has a total collision if every hash position for the pair is also occupied by another inserted key, value pair. In this case, \(\alpha_i\) cannot be recovered. On the other hand, if at least one hash position has no collisions, then we can recover the value. Note that the collision depends on the key \(k_i\) but not the value \(\alpha_i\).
For a given key \(k_i\), we define the event TCOL\((k_i)\): \[{\mathsf TCOL}(k_i) = 1 \textrm{ if } \forall q \in [\eta], \exists (k', q') \ne (k_i,q): h_q(k_i) = h_{q'}(k').\] Here, \(k'\) can be the key of any item that has been inserted in the set. Since the set contains at most \(s\) items, there are at most \(s\) possible keys for \(k'\). Recall also that \(\eta\) hash functions are applied for each item.
Since for each \(k_i\), there are at most \(\eta s - 1\) pairs of \((k', q')\)s that are different from \((k_i,q)\), we can bound the collision probability as follows: \[\Pr[{\mathsf TCOL}(k_i) ] \le \left(\frac{(\eta s-1)}{\ell}\right)^\eta\]
Thus, if we choose \(\eta\) and \(\ell\) such that \(\ell \ge 2 (s\eta - 1)\), we have \[\Pr[{\mathsf TCOL}(k_i) ] \le (1/2)^\eta\]
Taking a union bound over all \(s\) inserted values, we have \[\Pr[ \exists k_i : {\mathsf TCOL}(k_i) ] \le s \cdot (1/2)^\eta\]. ◻
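For a concrete (illustrative) instance of the lemma: to push the failure probability below \(2^{-\kappa}\) for a statistical security parameter \(\kappa\), it suffices to take \(\eta = \kappa + \lg s\), which is exactly the choice made in the BFS-CODE construction below.

```python
import math

# Illustrative Lemma 3 parameters: s = 16 values, kappa = 40.
s, kappa = 16, 40
eta = kappa + int(math.log2(s))          # 44 hash functions
ell = 2 * (s * eta - 1)                  # 1406 slots, per Lemma 3
assert s * 0.5 ** eta <= 2.0 ** -kappa   # failure <= 16 * 2^-44 = 2^-40
```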
In this section, we construct a CODE scheme. Recall that unlike encoding the indices through a COIE scheme, a CODE scheme encodes data in a compressed manner. The main idea of our construction is simulating the operations of \({\mathsf BFS}\); we call our scheme \(\mathsf BFS\mbox{-}CODE\).
As mentioned in the description of the BF Set construction, we need to pre-process the input data so that each item has its checksum attached. Although a data item \(v\) is represented as a single number, it is assumed that \(v\) can be parsed as \(v.val\) for its actual value and \(v.tag\) for its checksum. Moreover, we assume that the checksum is long enough that a random linear combination of checksums produces a valid checksum only with negligible probability (i.e., \(|checksum| = \omega(\lambda)\)).
We stress that when our CODE scheme is used for secure search, this pre-processing can be performed locally by the client prior to encrypting his data. Moreover, computing the checksums adds only a small overhead.
We now describe our \((n, s, c, f_p)\)-BFS-CODE construction over domain \(D\). As before, we will work out the parameters after describing our construction. The encoding algorithm is shown below.
\(\eta = \kappa+\lg{s}\); \(\ell = 2 (\eta s - 1)\)
Initialize \(\unicode{x27E6}B \unicode{x27E7}= (\unicode{x27E6}B_1 \unicode{x27E7}, \ldots, \unicode{x27E6} B_\ell \unicode{x27E7}) := (\unicode{x27E6}0 \unicode{x27E7}, \ldots, \unicode{x27E6}0 \unicode{x27E7})\).
Choose \(\mathcal{H}= \{ h_q: \{0,1\}^* \to [\ell] \}_{q=1}^\eta\) at random.
For \(i \in [n]\) and for \(q \in [\eta]\):
\(j := h_q(i)\); \(\unicode{x27E6}B_j \unicode{x27E7}= \unicode{x27E6}B_j \unicode{x27E7}+ \unicode{x27E6}v_i \unicode{x27E7}\)
Output \(\unicode{x27E6}B \unicode{x27E7}\).
Protocol : \({\mathsf BFS\mbox{-}CODE}.\mathsf{Encode}(\unicode{x27E6}v_1 \unicode{x27E7}, \ldots, \unicode{x27E6}v_n \unicode{x27E7})\)
Note that at step 4 in the above, if \(v_i\) is 0, then \(B_j\) stays the same. On the other hand, if \(v_i\) is not 0, \(B_j\) will be increased by \(v_i\). This implies that \(B\) will exactly hold the result of operations \(\{{\mathsf BFS.Insert}(B, \mathcal{H}, i, v_i): i \in \mathsf{idx}(v) \}.\)
The decoding algorithm is simple; it is described in Algorithm [alg:BFSD].
Output \({\mathsf BFS.Values}(B)\)
Protocol : \({\mathsf BFS\mbox{-}CODE}.\mathsf{Decode}(B)\)
Correctness is immediate from the additive homomorphism of the underlying encryption scheme and the parameters chosen for the \(\mathsf BFS\). In particular, we set \(\eta = \kappa+\lg{s}\) so that the probability of a recovery error is at most \(2^{-\kappa}\).
The checksums attached to the data items ensure that we have no false positives with overwhelming probability, that is, \(f_p = 0\). The compactness parameter \(c\) is the dimension \(\ell\) of the BF, which is \(O(\eta s)\).
The encoding algorithm uses \(\ell = O(\eta s)\) encryption operations, \(\eta \cdot n\) homomorphic addition operations, and \(\eta \cdot n\) hash evaluations.
The encoding consists of \(\ell\) ciphertexts.
The decoding algorithm uses \(\ell\) decryption operations.
Since, by Lemma 3, the size \(\ell\) of the Bloom filter depends only on the number of matches \(s\) and the number of hash functions \(\eta\), the communication complexity of the above protocol is independent of the database size \(n\).
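As a sanity check, here is a plaintext round trip of \(\mathsf{BFS\mbox{-}CODE}\), reusing the BloomFilterSet and attach_checksum sketch from above; homomorphic additions are modeled by ordinary integer additions, and the vector v plays the role of the decrypted result vector, with \(v_i = 0\) for non-matching records.

```python
# Plaintext BFS-CODE round trip (illustrative; reuses BloomFilterSet,
# attach_checksum, and CHK_BITS from the BFS sketch above).

def bfs_code_encode(v, s, kappa=40):
    eta = kappa + max(1, s.bit_length() - 1)    # eta = kappa + lg(s), floored
    bfs = BloomFilterSet(ell=2 * (eta * s - 1), eta=eta)
    for i, vi in enumerate(v):
        bfs.insert(i, vi)      # inserting 0 leaves every slot unchanged
    return bfs

def bfs_code_decode(bfs):
    return {x >> CHK_BITS for x in bfs.values()}

v = [0] * 1000                                  # n = 1000 records
for i in (12, 345, 678):                        # pretend these matched
    v[i] = attach_checksum(1000 + i)
assert bfs_code_decode(bfs_code_encode(v, s=3)) == {1012, 1345, 1678}
```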
We implement secure search protocols by using compressed oblivious encoding schemes. We begin by defining a relaxed notion of correctness that allows for false positives, as is needed in some of our constructions. We then define the security of secure search.
We relax the correctness guarantee to allow the Client to retrieve a superset of the matching records. Specifically, if \(\mathcal{S}\) is the set of indexes matching a Client’s query \(q\), then at the end of the protocol, we require the Client to obtain a set \(\mathcal{S}'\) such that:
With all but negligible probability, \(\mathcal{S} \subseteq \mathcal{S}'\)
With all but negligible probability, \(|\mathcal{S}'\setminus \mathcal{S}| \leq f_p\).
We parameterize a secure search scheme by \((\ell, f_p)\), where \(\ell\) is the amortized communication complexity per matching record, and \(f_p\) is the number of “false positives,” as defined above.
To define security of our secure search schemes, we use a game-based security definition similar to that of Akavia et al. [94]. The game is between a challenger and an adversary \(\mathcal{A}\) with regard to a setup-free search scheme, \(\mathsf{sec\mbox{-}search}\), and an FHE scheme, \(\mathsf{FHE}\).
\(\Game\):
The challenger runs a key generation algorithm (with computational security parameter \(\kappa\)) and sends the evaluation key to \(\mathcal{A}\) so that \(\mathcal{A}\) can perform homomorphic additions and multiplications.
\(\mathcal{A}\) chooses either:
Two databases \(x^0 = (x^0_1, \ldots, x^0_n)\) and \(x^1 = (x^1_1, \ldots, x^1_n)\) of the same length, and a query \(q\), or
A single database \(x = (x_1, \ldots, x_n)\) and two queries \(q^0, q^1\) of the same circuit size.
In both cases, we require that the sizes of the two result sets (denoted by \(s\)) are equal.
The challenger samples \(b \leftarrow \{0,1\}\). Then, either
Runs Setup on input \(x^b\) and the search protocol from \(\mathsf{sec\mbox{-}search}\) on input \(q\), or
Runs Setup on input \(x\), and the search protocol from \(\mathsf{sec\mbox{-}search}\) on input \(q^b\).
\(\mathcal{A}\) outputs a bit \(b'\).
We say that \(\mathcal{A}\) has advantage \[\mathsf{Adv}_\mathsf{FHE}^\mathsf{sec\mbox{-}search}(\mathcal{A})= |\Pr[b=b']-1/2|.\]
Definition 4. A setup-free \((\ell, f_p)\)-secure search scheme \(\mathsf{sec\mbox{-}search}\) is fully secure if every PPT adversary \(\mathcal{A}\) controlling the server has a negligible advantage \(\mathsf{Adv}_\mathsf{FHE}^\mathsf{sec\mbox{-}search}(\mathcal{A})\le \mathsf{negl}(\kappa)\) in the game above.
We next present our framework for obtaining Secure Search from COIE. The intuition is likely already clear from the previous descriptions: the encrypted client query is applied to the dataset, returning an encrypted bit vector indicating where the matches lie. The Server homomorphically computes the Hamming weight of this vector and sends it to the Client, who decrypts it and returns the result set size \(s\). This allows the Server to encode the result vector in the COIE. The encoding is sent to the Client for decryption and decoding.
Because the COIE only encodes the indices, and not the data values, we then add a PIR step to fetch the corresponding data. Note that if the COIE scheme admits false positives, it is possible that the number of false positives, and therefore the number of PIR queries, depends on the data, leaking something to the Server. To fix this problem, the client pads the number of PIR queries as follows. It fixes a bound \(f_p\) on the number of false positives, and aborts if the actual number of false positives exceeds this bound. Otherwise, the client uses enough dummy queries to pad the number of PIR queries to \(s + f_p\); we sketch this padding logic in code after the protocol listing below.
Client runs the FHE key generation algorithm and encrypts database \(x = (x_1, \ldots, x_n)\) with \(x_i \in \{0,1\}^m\). It then sends \(\unicode{x27E6}x \unicode{x27E7}= (\unicode{x27E6}x_1 \unicode{x27E7}, \ldots, \unicode{x27E6}x_n \unicode{x27E7})\) and the evaluation key to Server.
Client sends an encrypted query \(\unicode{x27E6}q \unicode{x27E7}\).
Server homomorphically evaluates the encrypted query \(\unicode{x27E6}q \unicode{x27E7}\) on each encrypted record. In particular, let \(\unicode{x27E6}b \unicode{x27E7}= (\unicode{x27E6}b_1 \unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7})\) where \(\unicode{x27E6}b_i \unicode{x27E7}= \unicode{x27E6}q(x_i) \unicode{x27E7}\). Note that \(q(x_i) = 1\) if record \(i\) is a match and is equal to \(0\) otherwise.
Server homomorphically computes \(\unicode{x27E6}s \unicode{x27E7}= \sum_{i=1}^n \unicode{x27E6}b_i \unicode{x27E7}\), and sends to Client for decryption.
Client decrypts \(\unicode{x27E6}s \unicode{x27E7}\) to obtain \(s\), and sends \(s\) to Server.
Server calls COIE.\(\mathsf{Encode}(\unicode{x27E6}b \unicode{x27E7})\) with sparsity parameter \(s\), to obtain an encrypted encoding \(\unicode{x27E6}C \unicode{x27E7}\). It sends \(\unicode{x27E6}C \unicode{x27E7}\) to Client.
Client decrypts \(\unicode{x27E6}C \unicode{x27E7}\) into \(C\) and calls COIE.\(\mathsf{Decode}(C)\) to obtain a set \(\mathcal{S}'\) of \(s + e\) indexes. If \(e > f_p\), Client aborts. Otherwise, Client adds \(f_p - e\) dummy indexes to \(\mathcal{S}'\).
Client runs a PIR protocol with the Server to obtain the records corresponding to the indexes in \(\mathcal{S}'\).
Protocol : Secure search with a \((n, s, c, f_p)\)-COIE scheme.
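The abort-and-pad logic of steps 7 and 8 is what keeps the number of PIR queries independent of the data; a minimal client-side sketch (function and variable names are ours):

```python
import secrets

# Client-side handling of the decoded COIE output: abort if the number of
# false positives e exceeds the bound f_p; otherwise pad the request list
# to exactly s + f_p indexes, so the Server always sees the same number of
# PIR queries regardless of e.

def pad_pir_queries(decoded, s, f_p, n):
    e = len(decoded) - s
    if e > f_p:
        raise RuntimeError("abort: false positives exceed the bound f_p")
    queries = list(decoded)
    while len(queries) < s + f_p:
        queries.append(secrets.randbelow(n))   # dummy index in [0, n)
    return queries

# 3 decoded indexes with s = 2 true matches: one false positive, padded
# with one dummy so the Server sees s + f_p = 4 queries.
assert len(pad_pir_queries({3, 17, 42}, s=2, f_p=2, n=1000)) == 4
```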
Theorem 5. Given an FHE scheme, an \((n, s, c, f_p)\)-COIE scheme in the random oracle model, and a PIR scheme in the random oracle model with communication complexity \(\ell_p\) for records in \(\{0,1\}^m\), the construction in Algorithm [alg:ssCOIE] yields an \((\ell,f_p)\)-secure search scheme for records in \(\{0,1\}^m\) in the random oracle model, where \(\ell = \frac{c\cdot \ell_c + (s+f_p)\cdot \ell_{p}}{s}\), \(\ell_c\) is the length of an FHE ciphertext, and \(s\) is the number of matching records.
Proof. We begin by proving that the adversary cannot distinguish between two different queries. The adversary chooses a database \(x\) and two queries \(q^0\) and \(q^1\), with the promise that \(s = \sum_{i=1}^n q^0(x_i) = \sum_{i=1}^n q^1(x_i)\).
The entire view of the adversary during the experiment can be reconstructed efficiently given (1) the encrypted database \(\unicode{x27E6}x \unicode{x27E7}\), (2) the encrypted query \(\unicode{x27E6}q \unicode{x27E7}\), and (3) \(s + f_p\) iterations of the PIR protocol, requesting the indexes in \(\mathcal{S}'_b\), where \(s\) is the number of matching records.
Since the value of \(s\) is the same for \(q^0\) and \(q^1\), the only two things that change in the view of the adversary when switching from \(b = 0\) to \(b=1\) are (1) the encrypted query \(\unicode{x27E6}q^b \unicode{x27E7}\), and (2) the set of indexes \(\mathcal{S}'_b\) (but not their number) requested during the PIR step.
We also note that the experiment only aborts when the number of false positives \(e\) exceeds the bound \(f_p\), which happens with probability \(\mathsf{negl}(\kappa)\) for a statistical security parameter \(\kappa\). Thus, we ignore this possibility in the following.
We can now proceed via a standard hybrid argument:
We first consider the real experiment with \(b=0\).
We then switch the encrypted query from \(q^0\) to \(q^1\), but leave the set of indexes in the PIR step as \(\mathcal{S}'_0\). Indistinguishability of the adversary’s view follows from the IND-CPA security of the FHE scheme.
Next, we switch the set of indexes in the PIR step from \(\mathcal{S}'_0\) to \(\mathcal{S}'_1\). Indistinguishability of the adversary’s view now follows from the security of the PIR scheme. This is now identical to the real experiment with \(b=1\).
We conclude that the probability the adversary outputs \(0\) or \(1\) differs by a negligible amount when \(b=0\) versus \(b=1\). Therefore, the advantage of the adversary in guessing \(b\) is negligible.
The proof that the adversary cannot distinguish between the same query applied to two different databases follows nearly identically. ◻
We next present our framework for obtaining Secure Search from CODE.
Client runs the FHE key generation algorithm and encrypts database \(x = (x_1, \ldots, x_n)\) with \(x_i \in D\). It then sends \(\unicode{x27E6}x \unicode{x27E7}= (\unicode{x27E6}x_1 \unicode{x27E7}, \ldots, \unicode{x27E6}x_n \unicode{x27E7})\) and the evaluation key to Server.
Client sends an encrypted query \(\unicode{x27E6}q \unicode{x27E7}\).
Server homomorphically evaluates the encrypted query \(\unicode{x27E6}q \unicode{x27E7}\) on each encrypted record. In particular, let \(\unicode{x27E6}b \unicode{x27E7}= (\unicode{x27E6}b_1 \unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7})\) where \(\unicode{x27E6}b_i \unicode{x27E7}= \unicode{x27E6}q(x_i) \unicode{x27E7}\). Note that \(q(x_i) = 1\) if record \(i\) is a match and is equal to \(0\) otherwise.
Server homomorphically computes \(\unicode{x27E6}s \unicode{x27E7}= \sum_{i=1}^n \unicode{x27E6}b_i \unicode{x27E7}\) and sends \(\unicode{x27E6}s \unicode{x27E7}\) to Client.
Client decrypts \(\unicode{x27E6}s \unicode{x27E7}\) to obtain \(s\) and sends it back to the Server.
Server computes \(\unicode{x27E6}d_i \unicode{x27E7}= \unicode{x27E6}b_i \unicode{x27E7}\cdot \unicode{x27E6}x_i \unicode{x27E7}\) for \(i \in [n]\). Then, it applies CODE.\(\mathsf{Encode}(\unicode{x27E6}d_1 \unicode{x27E7}, \ldots \unicode{x27E6}d_n \unicode{x27E7})\) with sparsity parameter \(s\), to obtain an encrypted encoding \(\unicode{x27E6}C \unicode{x27E7}\). It sends \(\unicode{x27E6}C \unicode{x27E7}\) to Client.
Client decrypts \(\unicode{x27E6}C \unicode{x27E7}\) to \(C\) and decodes \(C\) to obtain a set \(\mathcal{S}\) of size \(s\) matching records.
Protocol : Secure search with a \((n, s, c, f_p)\)-CODE scheme.
Theorem 6. Given an FHE scheme and an \((n, s, c, f_p)\)-CODE scheme over domain \(D\) in the random oracle model, the construction in Algorithm [alg:ssCODE] yields an \((\ell, f_p)\)-secure search scheme for records in domain \(D\) in the random oracle model, where \(\ell = \frac{c(s)\cdot \ell_c}{s}\), \(\ell_c\) is the length of an FHE ciphertext with plaintext space \(D\), and \(s\) is the number of matching records.
The proof is similar to the COIE-based scheme and can be found in Appendix 7.1.2.
As described, our CODE-based search scheme uses \(n\) homomorphic multiplications to create the vector \(\unicode{x27E6}d \unicode{x27E7}\). However, it may be the case that this vector is already produced as part of the match step, for example for arithmetic queries. In this case, our CODE scheme requires no further homomorphic multiplications.
In our secure search schemes, the client sends the number \(s\) of matching records to the server so that the server can create an oblivious compressed encoding. A recent line of work has developed attacks that exploit volume leakage (e.g., [103], [104], [105]), and in principle such attacks apply to our scheme as well.
In our scheme, volume attacks can be mitigated by hiding \(s\) in a differentially private manner. In particular, the client can add a small amount of noise to \(s\) before sending it to the server. A similar approach was used in previous work, e.g., [106].
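One plausible instantiation (our sketch, not necessarily the mechanism of [106]): add discrete Laplace noise shifted upward and clamped so that the reported sparsity never falls below the true count, since an undershooting \(s\) would make the encoding too small to decode.

```python
import math, random

def geometric(alpha):
    # failures before the first success, success probability 1 - alpha
    k = 0
    while random.random() < alpha:
        k += 1
    return k

def noisy_count(s, eps, delta):
    """Report a noisy sparsity bound s' >= s (illustrative mechanism)."""
    alpha = math.exp(-eps)
    shift = math.ceil(math.log(1 / delta) / eps)   # clamping occurs w.p. ~<= delta
    noise = geometric(alpha) - geometric(alpha)    # two-sided geometric noise
    return max(s, s + shift + noise)

s_reported = noisy_count(s=16, eps=0.5, delta=1e-9)   # e.g., 16 + 42 +/- noise
```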
We implemented our search protocols based on the BF-COIE, PS-COIE, and BFS-CODE schemes. All protocols were implemented using PySEAL [107], a Python wrapper of the Microsoft Research SEAL library (version 3.6) [108], using the BFV encryption scheme [109]. We instantiated the single-server PIR protocol in our construction using SealPIR [110]. For the root-finding step of the decoding procedure in PS-COIE, we use an implementation based on SageMath 9.2 [111].
Our search framework improves the overall search time by executing the Match step only once, while the LEAF protocol must execute the Match step \(s\) times. However, since we do not optimize the Match step itself over prior work, we focus on measuring the cost of the Fetch procedure. That is, our experiments measure the time from when the server holds encrypted query results, i.e., \((\unicode{x27E6}b_1 \unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7})\) with \(b_i \in \{0,1\}\), to when the client recovers all \(s\) records matching the query. Specifically, we measure the cost of steps 4 and up in Algorithms [alg:ssCOIE] and [alg:ssCODE]. Similarly, for LEAF+, we only measure the cost of the Fetch step.
To measure the performance of our protocols, we run experiments with database size \(n\) ranging from 1000 to 100,000 data items and the result set size \(s\) set to between 8 and 128. As in the LEAF+ experiments [95], all data items are \(16\)-bit integers.
For the BF-COIE secure search, we set the parameters as indicated in Section 3.4.2.
We set the false positive upper bound \(f_p = 16\). Recall that the client aborts (without executing the PIR) if the actual number of false positives exceeds this bound, but this happens only with probability negligible in the security parameter, which we set to \(\kappa= 40\).
We set the number of hash functions to \(\eta=2\) for each Bloom filter, so each BF has size \(\ell = 2s \cdot \sqrt{2s}\). (If \(2s < s + 2f_p\), we set \(\ell = 2s \cdot \sqrt{s + 2f_p}\).)
For BFS-CODE secure search, with \(\kappa= 40\), the number of hash functions \(\eta\) is set to \(\kappa+\lg{s}\), and the Bloom filter size is set to \(2(\eta s - 1)\). Additionally, each data item carries a 40-bit checksum to guarantee a \(2^{-\kappa}\) collision probability. We use SHA-2 to compute the checksums.
For comparison, we also implemented the fetch step of the LEAF+ protocol [95], since their implementation is not publicly available.
Their protocol has multiplicative depth \(O(\log \log n)\), and must therefore use bootstrapping to reduce the accumulated noise. However, SEAL does not provide bootstrapping, and we suspect that the authors added a customized implementation of bootstrapping on top of SEAL. Unfortunately, their implementation is not available.
We address this issue by ignoring the time for bootstrapping when measuring the running time of our implementation of LEAF+. Of course, our implementation does not output correct results, but the measured running time is a lower bound on the actual running time, and therefore serves as a conservative baseline.
All our experiments were performed on an Intel Core i9-9900K @ 4.7 GHz with 64GB of memory. For a fair comparison, the tests were performed on a single thread with no batching optimizations for computation. The server and client communicate over a 1 Gbps LAN.
Figure 2: Fetch time vs. database size with \(s=16\).
Figure 2 shows the performance of our protocols as a function of database size, while the result set size \(s\) is fixed to 16. However, for LEAF+, we plot the time for fetching only a single record, since fetching \(s\) records takes too long. In our implementation of LEAF+, fetching even a single record when \(n=10,000\) requires 1872 seconds. We note that the authors of LEAF+ report about 60 seconds for a single fetch [95]. We conjecture that they parallelize the scheme with 32 threads. Here, we only use a single thread.
All three of our protocols greatly outperform LEAF+. Looking at BF-COIE in particular:
In BF-COIE search, fetching 16 records with \(n=10,000\) takes 16.7 seconds, compared to 1872 seconds for a single record fetch in LEAF+. We believe that the speed up is due to the fact that LEAF+ (with a single-record fetching) needs \(O(n \log n)\) homomorphic additions and \(O(n)\) homomorphic multiplications, while BF-COIE search needs only \(O(n \log \frac n s)\) homomorphic additions with no homomorphic multiplications. In addition, as Figure 3 shows, the overhead of the PIR step to retrieve the actual data is small.
Due to the sequential limitation in LEAF+, fetching \(16\) records with LEAF+ is extrapolated to take about \(16 \cdot 1872 = 29952\) seconds. Overall, BF-COIE search is about 1800 times faster than LEAF+.
The time for all three of our protocols is dominated by the server’s computation during encode, which grows linearly with the DB size.
Since the number of hash functions \(\eta\) is larger in the BFS-CODE protocol than in the BF-COIE protocol, the encoding step of this protocol takes longer.
Figure 3: Fetch time vs. result set size with \(n=10,000\).
Figure 3 shows the performance of our protocols as a function of the result set size \(s\) while \(n\) is fixed to \(10,000\). Here again the performance is dominated by the encoding step, but the relative costs have changed. Due to the need to compute more power sums, the PS-COIE protocol performs worse than BF-COIE and BFS-CODE when \(s\) becomes moderately large.
The time for transmitting the data over the network (green in Figure 3) increases for larger \(s\), but remains small for all three schemes. In lower-bandwidth settings, batching can be used to pack a vector of ciphertexts into a single ciphertext with relatively low computational overhead. We discuss communication costs further in Section 3.7.3.
Although we do not optimize the Match step itself over prior work, we provide an estimated comparison of the running time for the end-to-end flow.
Our search framework improves the overall search time by executing the Match step only once, while the LEAF protocol must execute the Match step \(s\) times. Based on this, we can extrapolate the running time as follows:
The overall running time for LEAF: \[Time({\mathsf LEAF}) = s \cdot {\mathsf MT}({\mathsf LEAF}) + s \cdot {\mathsf FT}({\mathsf LEAF}).\] Here, \(\mathsf MT\) and \(\mathsf FT\) denote the match time and fetch time respectively.
The overall running time for the BF-COIE scheme: \[Time({\mathsf BF\mbox{-}COIE}) = {\mathsf MT}({\mathsf BF\mbox{-}COIE}) + {\mathsf FT}({\mathsf BF\mbox{-}COIE})\]
Although neither the implementation nor the algorithm of the matching step of the LEAF protocol is available in [95], we expect that \({\mathsf MT}({\mathsf LEAF}) \approx {\mathsf MT}({\mathsf BF\mbox{-}COIE})\). In the experiments reported for LEAF (see Figure 9 in [95]), we have \(m = \frac{ {\mathsf MT}({\mathsf LEAF})}{{\mathsf FT}({\mathsf LEAF})} \approx 1.5\). For \(s=16\), setting \({\mathsf FT}({\mathsf LEAF}) = 1800 \cdot {\mathsf FT}({\mathsf BF\mbox{-}COIE})\) based on the above discussion, we can estimate the speed-up as follows: \[\frac {Time({\mathsf LEAF})}{Time({\mathsf BF\mbox{-}COIE})} = \frac{ s \cdot (m + 1) } { m + 1/1800}.\] Thus, with \(s = 16\), we estimate that our BF-COIE scheme achieves roughly a 26X end-to-end speed-up.
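Plugging in the numbers confirms the estimate:

```python
# Evaluating the speed-up estimate with s = 16, m = 1.5, and
# FT(LEAF) = 1800 * FT(BF-COIE):
s, m = 16, 1.5
speedup = s * (m + 1) / (m + 1 / 1800)
print(f"estimated end-to-end speed-up: {speedup:.1f}x")   # ~26.7x
```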
We now look at the communication required by each of our schemes and by LEAF+. Figure 4 shows the network cost of the protocols when the result set size \(s\) is 16 and the size of the database is \(n=10,000\). In our implementations, the length of an FHE ciphertext is approximately 103KB and the communication cost of PIR is approximately 369KB.
| | LEAF+ | BF-COIE | PS-COIE | BFS-CODE |
|---|---|---|---|---|
| #ct’s | \(704\) | \(1323\) | \(17\) | \(1321\) |
| #PIR | \(0\) | \(32\) | \(16\) | \(0\) |
| #ct’s (w/ batching) | \(32\) | \(2\) | \(2\) | \(2\) |
Figure 4: The communication costs (\(n=10,000\) and \(s=16\)).
To explain this table, we first need to explain how we determined the costs of LEAF+ and PIR.
LEAF+. Since LEAF+ fetches each data item and the corresponding index one by one, it needs 16 rounds of communication to retrieve 16 data items. Worse yet, LEAF+ requires the client to send the index of the previous match (requiring \(\lg{n}\) bits) in its next query to ensure correctness. Finally, LEAF+ uses bitwise encryption, requiring a ciphertext for each bit of the encrypted communication. Thus, in a single round, the client must send \(\lg n=14\) ciphertexts and the server returns \(16+\lg n = 30\) ciphertexts: \(16\) ciphertexts for the matching data item and \(\lg n\) ciphertexts for its index. This amounts to 704 ciphertexts for fetching 16 items (excluding the query).
PIR costs. We reduce the cost of PIR for the COIE-based schemes by making a slight modification. In addition to storing the FHE-encrypted database, the server also stores a copy of each record encrypted using a symmetric-key encryption scheme (resulting in much shorter ciphertexts). Then, in the PIR step, the client fetches this symmetrically encrypted ciphertext instead of the FHE-encrypted one.
We use SealPIR for our PIR protocol, which requires 368.6 KB per request. We remark that the very recently introduced SealPIR+ requires only 80KB per request (see Table 1 in [112]), which would reduce the communication further.
We can now compare the communication costs based on rows 1 and 2 of Figure 4. We see that the communication of BF-COIE and BFS-CODE are roughly twice that of LEAF+, while PS-COIE requires almost 10X less communication. The extra communication needed by BF-COIE and BFS-CODE can likely be offset by the much lower round complexity required by our protocol since the latency costs are likely higher than the cost for the extra bandwidth.
We now describe an optimization that significantly reduces the communication of our protocols at the cost of slightly increased server computation. SEAL allows thousands of encrypted values to be packed into a single ciphertext. This allows us to pack the ciphertexts in all of our protocols into a single ciphertext to be sent from the server to the client. However, the server must do some additional computation to pack the ciphertexts prior to sending them. We experimentally measured this packing step, and it requires approximately 3 seconds on a single thread.
LEAF+ can also take advantage of packing to reduce the communication of their protocols. However, since the results must be returned one at a time, the best LEAF+ can do is to pack all ciphertexts that are sent in each round, resulting in a total of 32 ciphertexts.
We note that the cost of PIR is unchanged by this modification. Thus, with the packing optimization, the communication of BFS-CODE is roughly 1/16 of the communication needed by LEAF+, but BF-COIE and PS-COIE require approximately 4X and 2X more communication than LEAF+ respectively when SealPIR is used; however, when SealPIR+ is used, both schemes have slightly less communication than LEAF+.
In SPM, given an encrypted query \(\unicode{x27E6}q \unicode{x27E7}\) and \(n\) FHE-encrypted data items \((\unicode{x27E6}x_1 \unicode{x27E7}, \ldots, \unicode{x27E6}x_n \unicode{x27E7})\), the server returns a vector of \(n\) ciphertexts \(\unicode{x27E6}b_1 \unicode{x27E7}, \ldots, \unicode{x27E6}b_n \unicode{x27E7}\), where \(b_i\) indicates whether the \(i\)th data element is a match [4], [5], [6], [7]. These works focus on optimizing the search circuits that determine whether a data item matches the query, and therefore the communication complexity and the client's running time are proportional to the number of data items. Our work focuses on the orthogonal problem of optimizing the retrieval of the matched data items with sublinear communication and client computation.
Searchable encryption [8], [9] allows highly efficient search (usually in \(o(n)\) time) over encrypted data. Efficient SE schemes have been proposed for a wide variety of queries including equality queries [10], [11], range queries [12], [13], and conjunctive queries [14], [15]. However, to achieve sublinear query performance, SE schemes require significant preprocessing and relax security, allowing some partial information about the queries and data (e.g. access patterns) to leak to the server. For a recent survey on SE constructions and security, see Fuller et al. [16]. In contrast, our work focuses on achieving preprocessing-free secure constructions, leaking nothing about the queries or results other than their sizes.
As a different approach, property-preserving encryption [17] produces ciphertexts that maintain certain relationships (e.g., equality, and order) of the underlying plaintexts. This allows queries to be performed over ciphertexts in the same way that they can be carried out over plaintexts. Examples of PPE include deterministic encryption [18] allowing equality queries, and order-preserving encryption [19], [20] allowing range queries. However, it has been shown [21], [22], [23] that such property-preserving ciphertexts leak a lot of information about the underlying plaintexts. See [16] for a survey of constructions and attacks.
PIR allows the client to choose the index \(i\) and retrieve the \(i\)th record from an untrusted server while hiding the index \(i\) [24]. However, this protocol by itself provides only a limited search functionality requiring the client to know the index of the data to retrieve. In this work, we aim at protocols supporting any arbitrary search functionality.
Secure two-party computation [25], [113] allows players to compute any function of their private inputs without compromising privacy of their inputs. For example, the client and the server can run a protocol for secure two-party computation to solve the secure search problem. While there has been much progress in improving efficiency of MPC protocols, such protocols still require \(\Omega(n)\) communication and \(\Omega(n)\) client computation per query. In this work, we aim to achieve protocols with sublinear communication and client work.
ORAM [26] is a protocol which allows a client to store an array of \(n\) items on an untrusted server and to access an item obliviously, that is, hiding contents and which item is accessed (i.e., the access pattern). Likewise, ODS [90] allows the client to store and use a data structure obliviously. One could implement secure search by utilizing an ODS for a search tree. However, ODS constructions typically need \(\Omega(\log^2 n)\) rounds for each operation. In this work, we aim at achieving a constant round protocol.
We have presented several new constructions of secure search based on fully homomorphic encryption. Prior constructions were inherently sequential, returning only a single record from the result set, and requiring a new query from the client that depended on the index of the previous match. We have demonstrated several new methods for encoding the entire result set at one time, removing the added rounds and allowing the server work to be parallelized. Additionally, we have shown that this can be done without homomorphic multiplication, ensuring low computational cost at the server. Finally, we have implemented our constructions and demonstrated up to three orders of magnitude speed-up over prior work. We also introduced the notion of compressed oblivious encoding, which may be of independent interest.
Random sampling is an important tool when computing over massive data sets. It has wide application in generating small summaries of data, and serves as a key building block in the design of many algorithms and estimation procedures. In particular, \(L_p\) sampling has been used to develop important streaming algorithms such as the heavy hitters, \(L_p\) norm estimation, cascaded norm estimation, and finding duplicates in data streams [30], [114], [115], [116].
In this work, we introduce and explore the problem of private two-party sampling. We consider a setting in which two parties would like to sample from a distribution whose probability mass function is distributed across the two parties. Specifically, we assume parties \(P_1\) and \(P_2\) each hold \(n\)-dimensional vectors \(\boldsymbol{w}_1 = (w_{1,1}, \ldots, w_{1,n})\) and \(\boldsymbol{w}_2 = (w_{2,1}, \ldots, w_{2,n})\) respectively where every \(w_{b,j}\) is non-negative. These vectors each represent a (possibly non-normalized) probability mass function of a distribution. Specifically, for \(b \in \{1,2\}\), \(i \in [n]\), the non-negative value \(\frac{w_{b,i}}{||\boldsymbol{w}_b||_1}\) represents the probability mass placed by distribution \(\mathcal{D}_b\) on element \(i\). We assume that the dimension \(n\) is very large, and our goal is to obtain secure sampling protocols with communication that is sub-linear in \(n\).
We consider various ways of deriving the probability mass function \(\mathcal{D}\) of the joint distribution from the two individual probability mass functions. Specifically, we consider:
\(L_1\) distribution: Sample item \(i\) with probability \(\frac {w_{1,i} + w_{2,i}} {||\boldsymbol{w}_1 + \boldsymbol{w}_2||_1} = \frac{w_{1,i} + w_{2,i}} {\sum_j (w_{1,j} + w_{2,j})}\).
\(L_2\) distribution: Sample item \(i\) with probability \(\frac {(w_{1,i} + w_{2,i})^2} {||\boldsymbol{w}_1+\boldsymbol{w}_2||_2^2} = \frac{(w_{1,i} + w_{2,i})^2} {\sum_j (w_{1,j} + w_{2,j})^2}\).
Product distribution: Sample item \(i\) with probability \(\frac{w_{1,i} \cdot w_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle} = \frac{w_{1,i} \cdot w_{2,i}} {\sum_j (w_{1,j} \cdot w_{2,j})}\).
Realizing these sampling functionalities securely is immediate via generic 2PC techniques, but the resulting protocols will require communication that is linear in the input length. With sublinear communication, however, it is unclear how to perform some of these tasks (or whether it is even possible to do so), even with an insecure protocol. We give a (partial) characterization of when such sublinear sampling is possible, and give secure protocols for realizing these functionalities where possible.
While \(L_1\) and \(L_2\) sampling are well-studied, to the best of our knowledge, we are the first to consider the notion of product sampling. We describe a concrete, independent application for this new notion: product sampling can be used to implement a distributed version of the well-known exponential mechanism for differentially-private data release [117].
We explore the problems described above, providing multiple two-party protocols, all with sub-linear communication, in the semi-honest security model. We note that our protocol for product sampling has additional leakage, beyond what is revealed by the sampling functionality. We characterize exactly what this leakage is, and provide evidence that similar leakage is necessary to achieve sublinear communication. Specifically, we show the following.
We begin by constructing a two-party protocol for \(L_1\) sampling that relies on fully homomorphic encryption (FHE). The main idea behind the protocol is to obliviously sample from each of the two parties' inputs independently, and then to securely choose one of the two samples using an appropriately biased coin toss. The results are described in Section 4.2.
We also provide a protocol for secure \(L_2\) sampling that relies on fully homomorphic encryption (see Section 4.3). In this case, however, achieving \(L_2\) sampling is non-trivial. In fact, even relying on FHE, it is not immediately clear how to compute \(\|\boldsymbol{w}_1 + \boldsymbol{w}_2\|^2_2\) with sublinear communication.
Surprisingly, our \(L_2\) sampling protocol runs in constant rounds and with \(\tilde O(1)\) communication. Interestingly, it does not require us to compute \(\|\boldsymbol{w}_1 + \boldsymbol{w}_2\|^2_2\). To achieve this, we develop a novel technique called “corrective sampling”, which we overview in the next subsection. We note that our techniques extend straightforwardly to \(L_p\) sampling for constant \(p\).
We then turn to product sampling. We assume, without loss of generality, that the vectors \(\boldsymbol{w}_b\) are normalized (see Section 4.4 for justification).
We first establish a communication lower bound, demonstrating that product sampling with sublinear communication is impossible, even without privacy guarantees, if the two input distributions are insufficiently correlated (i.e., \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2\rangle = o(\frac{1}{n^2})\)). We show this through a reduction from the Set Disjointness problem.
Given this lower bound, we consider the problem under a promise that the input vectors are sufficiently correlated. Assuming that \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2\rangle = \omega(\frac{\log n}{n})\), we provide a two-party protocol for secure product sampling that leaks (at most) the inner product of the two parties' inputs. We note that the promise itself leaks some information, so some leakage here is inevitable. Interestingly, we observe that the protocol can be modified to provide a trade-off between the communication cost and the leakage. We also discuss why this trade-off is inherent.
Our product sampling protocol has a round complexity that depends on the inner product. In Section 4.5, we show how to make our construction constant round while incurring small additional leakage. Importantly, we must do this without computing the exact inner product which itself requires \(O(n)\) communication [118].
As mentioned previously, one important application of product sampling is the exponential mechanism for providing differential privacy [117]. In Section 4.6, we describe this application in detail.
For this particular application we face an additional challenge: the leakage of \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2\rangle\) that we relied on for achieving sub-linear communication in product sampling does not preserve differential privacy. To overcome this issue, we construct a new, differentially-private approximation for inner product, and show how to use this for building a sub-linear communication secure computation of the exponential mechanism.
In the following, we overload notation and let \(\mathcal{D}\) denote a distribution as well as its probability mass function. As discussed previously, we consider the case where a probability mass function is distributed across two parties, and the parties would like to securely sample from the corresponding distribution. We consider several ways in which the probability mass function can be distributed across the two parties.
In this case, party 1 (resp. party 2) holds a vector \(\boldsymbol{w}_1\) (resp. \(\boldsymbol{w}_2\)), indexed from \(1\) to \(n\). For \(i \in [n]\), \(w_{1,i}/||\boldsymbol{w}_1||_1\) (resp. \(w_{2,i}/||\boldsymbol{w}_2||_1\)) corresponds to the probability mass of \(i\) under distribution \(\mathcal{D}_1\) (resp. \(\mathcal{D}_2\)). The goal of the parties is to sample from the distribution \(\mathcal{D}\), defined as follows for \(i \in [n]\): \[\begin{aligned} \mathcal{D}[i] &:= \frac{||\boldsymbol{w}_1||_1}{||\boldsymbol{w}_1||_1 + ||\boldsymbol{w}_2||_1} \cdot \frac{w_{1,i}}{||\boldsymbol{w}_1||_1} + \frac{||\boldsymbol{w}_2||_1}{||\boldsymbol{w}_1||_1 + ||\boldsymbol{w}_2||_1} \cdot \frac{w_{2,i}}{||\boldsymbol{w}_2||_1}\\ &= \frac{||\boldsymbol{w}_1||_1}{||\boldsymbol{w}_1||_1 + ||\boldsymbol{w}_2||_1} \cdot \mathcal{D}_1[i] + \frac{||\boldsymbol{w}_2||_1}{||\boldsymbol{w}_1||_1 + ||\boldsymbol{w}_2||_1} \cdot \mathcal{D}_2[i] \end{aligned}\] Note that the target distribution \(\mathcal{D}\) is a convex combination of the distributions \(\mathcal{D}_1\) and \(\mathcal{D}_2\) held by the two parties.
A potentially straightforward sampling protocol is to therefore have party \(1\) locally draw a sample \(i_1\) from \(\mathcal{D}_1\), party \(2\) locally draw a sample \(i_2\) from \(\mathcal{D}_2\), and then run a secure two party computation that outputs \(i_1\) with probability \(\frac{||\boldsymbol{w}_1||_1}{||\boldsymbol{w}_1||_1 + ||\boldsymbol{w}_2||_1}\) and \(i_2\) with probability \(\frac{||\boldsymbol{w}_2||_1}{||\boldsymbol{w}_1||_1 + ||\boldsymbol{w}_2||_1}\).
This protocol clearly has sublinear communication, but it unfortunately does not securely realize the ideal functionality. The reason is as follows: conditioned on the ideal functionality outputting a certain index \(i^*\), the probability that \(i^*\) was drawn by party \(1\) (resp. party \(2\)) is \(\frac{w_{1,i^*}}{w_{1,i^*} + w_{2,i^*}}\) (resp. \(\frac{w_{2,i^*}}{w_{1,i^*} + w_{2,i^*}}\)). Thus, if the simulator receives \(i^*\) from the ideal functionality and has to simulate the view of party \(1\), it needs to set \(i_1 = i^*\) with probability \(\frac{w_{1,i^*}}{w_{1,i^*} + w_{2,i^*}}\) and set \(i_1 \neq i^*\) with probability \(\frac{w_{2,i^*}}{w_{1,i^*} + w_{2,i^*}}\). However, the simulator is not able to simulate these probabilities correctly, since it does not know \(w_{2,i^*}\).
To get around this issue, we have the parties sample \(i_1\) and \(i_2\) obliviously. To do this with sublinear communication, we can use fully homomorphic encryption (FHE). Specifically, to sample \(i_1\), player 1 first encrypts his input \(\boldsymbol{w}_1\) using an FHE scheme for which he does not know the secret key. The players then jointly choose a random value \(r \in [0,||\boldsymbol{w}_1||_1)\). Player 1 then uses the homomorphic operations to find the value \(i_1\) chosen by this \(r\), and the parties use threshold decryption to recover a secret sharing of \(i_1\). The parties reverse roles to sample \(i_2\). Details of this construction are provided in Section 4.2.
Additionally, an alternative construction that uses sub-linear OT for the oblivious sampling is provided in Section 7.2.3.
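In the clear, the two local draws and the biased selection look as follows. This sketch computes only the sampling functionality; the whole point of the protocol above is to perform the same inverse-CDF scan obliviously under FHE.

```python
import random

# Plaintext view of the L1 sampling functionality. l1_index_for mirrors
# the oblivious scan that the protocol performs under FHE on the
# encrypted weight vector; here everything runs in the clear.

def l1_index_for(w, r):
    # return the unique i whose prefix-sum interval contains r,
    # for r in [0, ||w||_1)
    acc = 0.0
    for i, wi in enumerate(w):
        acc += wi
        if r < acc:
            return i
    return len(w) - 1          # guard against floating-point round-off

def sample_l1(w1, w2):
    n1, n2 = sum(w1), sum(w2)
    if random.random() < n1 / (n1 + n2):   # bias ||w1||_1 / (||w1||_1 + ||w2||_1)
        return l1_index_for(w1, random.uniform(0, n1))
    return l1_index_for(w2, random.uniform(0, n2))
```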
In this case, party 1 (resp. party 2) holds a vector \(\boldsymbol{w}_1\) (resp. \(\boldsymbol{w}_2\)), indexed from \(1\) to \(n\). The goal of the parties is to sample from the distribution \(\mathcal{D}\) defined as follows for \(i \in [n]\): \[\mathcal{D}[i] := \frac{(w_{1,i} + w_{2,i})^2}{||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2}.\]
We present a protocol that samples from this distribution with \(\tilde{O}(1)\) communication. This protocol relies on a novel technique that we call “corrective sampling”, which is an interesting type of rejection sampling. In what follows, we describe an insecure version of our protocol to give the intuition behind it. To make it secure, we carry out the corrective sampling under FHE as described in Protocol [fig:l2].
The main challenge that we face here, unlike in the case of \(L_1\) sampling, is that it is impossible to compute \(||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\) (and therefore impossible to compute \(\mathcal{D}[i]\) for each \(i\)) with sublinear communication [118]. Instead, we sample index \(i\) from a different, related, distribution, which is easy to sample with sub-linear communication. We then show that we can efficiently correct this distribution by rejecting with the appropriate probability. Interestingly, we show that corrective rejection, which depends on the index \(i\), doesn’t require us to explicitly compute \(||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\). In fact, the parties never learn the corrective term at all!
First, as in rejection sampling, corrective sampling proceeds in trials; in each trial, for every \(i\), the probability that the protocol successfully samples index \(i\) is \(\alpha \cdot {\cal D}[i]\) for some unknown constant \(0 < \alpha < 1\). Since the same constant \(\alpha\) applies to every index \(i\), repeating the trials samples index \(i\) correctly without skewing the distribution \(\mathcal{D}\). The expected number of trials is \(1/\alpha\), so we need to keep \(1/\alpha \in {O}(1)\) to reach our target communication complexity.
As mentioned above, we observe that the protocol never has to explicitly compute \(\alpha\). Towards describing how this is done, first note that in \({\cal D}[i]\) the denominator \(||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\), which we assume for purposes of this exposition is at least \(1\), is the same for every \(i\), so it can be pushed into \(\alpha\) without impacting the discussion above: letting \(\alpha' = \alpha/(||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2)\), it suffices to implement rejection sampling with a protocol that samples index \(i\) with probability \(\alpha'\cdot (w_{1,i} + w_{2,i})^2 = \alpha \cdot {\cal D}[i]\). This protocol would only need to explicitly compute \((w_{1,i} + w_{2,i})^2\) (which can be done efficiently given \(i\)), but not \(\alpha'\).
Unfortunately, this does not quite work. \(||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\) can be very large, which would then make \(1/\alpha'\) large. We therefore must combine the above with another idea to ensure that our corrective term introduces at most a \({O}(1)\) overhead.
We achieve this by having each trial of the protocol work as follows:
It samples index \(i\) from distribution \(\mathcal{D}_{\mathrm{ignore}}\), which is easy to sample. We note that the contribution of this distribution will be eventually canceled out through rejection. In particular, we choose the following distribution for \(\mathcal{D}_{\mathrm{ignore}}\): \[{\cal D}_\mathrm{ignore}[i] := \frac{w^2_{1,i} + w^2_{2,i}}{\mathsf{denom}},\] where we set \(\mathsf{denom}= ||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2\) to make the distribution well-defined. Note that \(\mathsf{denom}\) can be computed with \(\tilde O(1)\) communication.
After sampling \(i\) from \({\cal D}_\mathrm{ignore}\), the protocol computes a “corrective bias” for a coin flip that is dependent on \((w_{1,i} + w_{2,i})^2\). We stress that once \(i\) is determined, computing \((w_{1,i} + w_{2,i})^2\) is easy. In particular, a coin is flipped with the following bias: \[\Pr[ coin | i] := \frac{(w_{1,i} + w_{2,i})^2}{2 \cdot {\cal D}_\mathrm{ignore}[i] \cdot \mathsf{denom}}\] Overall, this makes sure that the probability that each trial outputs index \(i\) is \[{\cal D}_\mathrm{ignore}[i] \cdot \Pr[coin|i] = \frac{(w_{1,i} + w_{2,i})^2}{2 \cdot \mathsf{denom}} = \alpha \mathcal{D}[i],\] where \(\alpha = \frac{||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2}{2 \cdot \mathsf{denom}}\).
To conclude that this is a valid and efficient sampling procedure, we need to show the following (a plaintext sketch of the full trial loop appears after these two conditions):
\(\alpha\) must be less than \(1\) for the procedure to be valid. This is implied by the fact that \(\|\boldsymbol{w}_1 + \boldsymbol{w}_2 \|_2^2\le 2 \cdot \mathsf{denom}\).
\(1/\alpha\) must be in \(\tilde{O}(1)\) so that the procedure is efficient. We have \(2 \cdot \mathsf{denom}\le 2 \|\boldsymbol{w}_1 + \boldsymbol{w}_2 \|_2^2\), which implies that \(\alpha\) is at least 1/2. So, the expected number of trials is at most 2.
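Putting the two steps together, here is an insecure plaintext walkthrough of the trial loop; in the actual protocol, both steps run under FHE as in Protocol [fig:l2].

```python
import random

# Insecure plaintext sketch of corrective sampling. Each trial outputs i
# with probability alpha * D[i] for the same alpha >= 1/2 at every i, so
# looping until acceptance samples exactly from the L2 distribution in at
# most 2 expected trials.

def l2_sample(w1, w2):
    sq1 = [x * x for x in w1]
    sq2 = [x * x for x in w2]
    # pair[i] = w1_i^2 + w2_i^2; denom = sum(pair) = ||w1||_2^2 + ||w2||_2^2
    pair = [a + b for a, b in zip(sq1, sq2)]
    while True:
        # step 1: draw i from D_ignore[i] = (w1_i^2 + w2_i^2) / denom
        i = random.choices(range(len(w1)), weights=pair)[0]
        # step 2: corrective coin with bias
        #   (w1_i + w2_i)^2 / (2 (w1_i^2 + w2_i^2)) <= 1
        bias = (w1[i] + w2[i]) ** 2 / (2 * pair[i])
        if random.random() < bias:
            return i
```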
We extend our techniques to the setting of \(L_p\) sampling for constant \(p\) in Section 4.3.3.
In this case, party 1 (resp. party 2) holds a normalized vector \(\boldsymbol{w}_1\) (resp. \(\boldsymbol{w}_2\)), indexed from \(1\) to \(n\). For \(i \in [n]\), \(w_{1,i}\) (resp. \(w_{2,i}\)) corresponds to the probability mass of \(i\) under distribution \(\mathcal{D}_1\) (resp. \(\mathcal{D}_2\)). The goal of the parties is to sample from the distribution \(\mathcal{D}\) defined as follows for \(i \in [n]\): \[\mathcal{D}[i] := \frac{w_{1,i} \cdot w_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}.\]
We begin by noting (via a simple reduction from Set Disjointness) that it is impossible to achieve sublinear product sampling when no restrictions are placed on the inputs \(\boldsymbol{w}_1, \boldsymbol{w}_2\). We further show (via a more complex reduction from Set Disjointness) that for every protocol \(\Pi\) (parametrized by dimension \(n\)) that correctly samples from \(\mathcal{D}\), there are inputs \(\boldsymbol{w}_1 := \boldsymbol{w}_1(n), \boldsymbol{w}_2 := \boldsymbol{w}_2(n)\), with \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle \in \Omega(1/n^2)\), that require linear communication complexity. See Section 4.4.1 for details.
This means that in order to achieve sublinear communication complexity, we need, at a minimum, a promise on the inputs guaranteeing that \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle \in \omega(1/n^2)\). We then present a protocol that has the following properties:
When \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle \in \omega(\log n/n)\), the protocol achieves expected communication \(\frac{\log n}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}\).
The execution of the protocol leaks nothing more than the sampled output, and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\). This is formalized via an Ideal/Real paradigm simulation, in which the simulator receives leakage of \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) in the Ideal world.
The idea for the protocol is the following. The protocol proceeds in rounds: in round \(j\), party 1 and 2 obliviously sample values \(i_1, i_2\) from \(\mathcal{D}_1, \mathcal{D}_2\), respectively (as described for \(L_1\) sampling). Then the parties run a secure protocol that checks whether \(i_1 = i_2\). If yes, they output \(i_1\). Otherwise, the parties repeat the process in the next round.
The main technical portion of our security analysis is to show that the number of rounds (which is the only information leaked) is distributed as a geometric distribution with success probability \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\). This implies that the expected number of rounds is \(1/\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\), and furthermore, it implies that a simulator who knows \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) can simulate the terminating round by making a draw from this geometric distribution. See Section 4.4.2 for more details. There, we also describe how we can pad the communication cost to the worst-case, which depends on the given promise, thereby removing the leakage of \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
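In plaintext form, the round structure is just repeated matching (the equality test is, of course, performed via secure computation in the actual protocol):

```python
import random

# Plaintext view of the round structure for product sampling, assuming
# normalized inputs. Each round succeeds with probability exactly
# <w1, w2>, so the number of rounds (the only information revealed) is
# geometrically distributed.

def product_sample(w1, w2):
    idx = range(len(w1))
    rounds = 0
    while True:
        rounds += 1
        i1 = random.choices(idx, weights=w1)[0]   # P1's oblivious draw from D1
        i2 = random.choices(idx, weights=w2)[0]   # P2's oblivious draw from D2
        if i1 == i2:      # done via a secure equality test in the protocol
            return i1, rounds
```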
The protocol presented above for product sampling required a large number of rounds stemming from the iterative rejection sampling procedure. We now consider how to parallelize this process. To do so, we need to compute the inner product in order to determine, a priori, how many samples will suffice. However, computing this value requires \(O(n)\) communication [118]!
The natural thing to do is therefore to use an approximation to the inner product that can be computed with sublinear communication. However, when replacing an exact computation of a function \(f(\boldsymbol{w}_1, \boldsymbol{w}_2)\) with an approximation \(\tilde{f}(\boldsymbol{w}_1, \boldsymbol{w}_2; r)\), one needs to be careful that more information is not leaked by the output. Specifically, Ishai et al. [74], [119] introduced the notion of secure multiparty computation of approximations and, loosely speaking, their security definition says that the approximate computation is secure if its output can be simulated from the exactly correct output. While our result falls slightly short of that definition, we are still able to give a rigorous guarantee on the amount of additional information leaked by our approximate functionality. Specifically, we present an approximate functionality \(\tilde{f}\) and prove that the output of \(\tilde{f}(\boldsymbol{w}_1, \boldsymbol{w}_2; r)\) can be simulated given both the exactly correct output \(f(\boldsymbol{w}_1, \boldsymbol{w}_2)\) (where \(f\) is the inner product), as well as the \(L_2\) norms of the individual inputs.
To achieve this, we use a sublinear protocol based on the Johnson–Lindenstrauss transform (JLT) to approximate the inner product of the input vectors. This can be done with sublinear communication by having the parties jointly sample a \(k \times n\) JLT matrix \(\boldsymbol{M}\), for \(k \ll n\), by choosing a short seed and expanding it under FHE. The rest of the computation is then done by communicating the vectors \(\boldsymbol{M}\boldsymbol{w}_b\), which are of length \(k\) rather than \(n\). Based on this approximation, the parties can obliviously pre-sample a number of inputs that is sufficient with all but negligible probability, and then feed them into a constant-round secure computation protocol.
Our contribution here is to show that this variant protocol requires only the additional leakage of \(||\boldsymbol{w}_1||_2^2\) and \(||\boldsymbol{w}_2||_2^2\), beyond what is already leaked by the original protocol (i.e., the inner product). Our analysis may be of independent interest, since it shows that given \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\), \(||\boldsymbol{w}_1||_2^2\), and \(||\boldsymbol{w}_2||_2^2\), the values \(\boldsymbol{M} \boldsymbol{w}_1\) and \(\boldsymbol{M}\boldsymbol{w}_2\) can be efficiently sampled from exactly the correct distribution when \(\boldsymbol{M}\) is a JLT matrix that is kept private from both parties. We prove this result by analyzing the underlying joint multivariate normal distributions corresponding to \(\boldsymbol{M} \boldsymbol{w}_1\) and \(\boldsymbol{M} \boldsymbol{w}_2\), and showing that the mean and covariance (which fully determine the distribution) depend only on the values \(||\boldsymbol{w}_1||_2\), \(||\boldsymbol{w}_2||_2\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\). See Section 4.5 for more details.
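The approximation itself is easy to demonstrate in the clear. In this toy sketch (our instance: Gaussian \(\boldsymbol{M}\) with entries \(N(0, 1/k)\) and a fixed seed standing in for the jointly chosen one), \(\langle \boldsymbol{M}\boldsymbol{w}_1, \boldsymbol{M}\boldsymbol{w}_2 \rangle\) is computed from two length-\(k\) vectors yet estimates the full length-\(n\) inner product.

```python
import numpy as np

rng = np.random.default_rng(seed=7)    # stand-in for the joint seed
n, k = 10_000, 512

w1 = rng.random(n); w1 /= w1.sum()     # toy normalized inputs
w2 = rng.random(n); w2 /= w2.sum()

# Gaussian JLT matrix with entries N(0, 1/k); hidden from both parties
# in the protocol, where it is expanded from the seed under FHE.
M = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))

approx = float((M @ w1) @ (M @ w2))    # computable from the k-dim vectors
exact = float(w1 @ w2)
print(f"exact={exact:.3e}  approx={approx:.3e}")   # error shrinks as k grows
```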
We first briefly describe the connection between product sampling and the exponential mechanism. Ignoring many details, the joint exponential mechanism \(M\) outputs a value \(i\) on input \(X=(x_1,\ldots,x_n)\) with probability proportional to \[w_i = e^{c \cdot f(x_i)} = e^{c \cdot f(x_{1,i} + x_{2,i})},\] where \(c\) is some constant, \(f\) is some scoring function, and the data values \(x_i\) are partitioned between the two parties (as \(x_{1,i}, x_{2,i}\)). If the scoring function \(f\) is linear, it holds that \(f(x_{1,i} + x_{2,i}) = f(x_{1,i}) + f(x_{2,i})\), and, letting \(w_{b,i} = e^{c \cdot f(x_{b,i})}\), we can rewrite \(w_i\) as follows: \[w_i = w_{1,i} \cdot w_{2,i}.\] Therefore, using product sampling, the parties can sample each item \(i\) with probability proportional to \(w_i\).
Based on this connection, we present an application of our constant-round, product sampling protocol to realize a two-party exponential mechanism in Section 4.6. However, to use our sampling protocol in this application, we must show that the leakage of our protocols preserves the differential privacy guarantee. We indeed prove that our constant-round JLT-based protocol can achieve differential privacy—even when the JLT matrix \(\boldsymbol{M}\) is public—by adding correctly distributed noise to \(\langle \boldsymbol{M} \boldsymbol{w}_1, \boldsymbol{M} \boldsymbol{w}_2 \rangle\). This allows parties to execute the exponential mechanism when the cost function is additively distributed across the two parties, with sublinear communication, in the case that \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle \in \omega(\log n/n)\).
Many prior papers (e.g. [27], [28], [29], [30], [31]) have studied the problem of sampling data from a data stream. In this setting the goal is to achieve \(L_p\) sampling for arbitrary \(p\) without having to process or store all the streaming data, thus requiring sublinear computation. These works generally operate in the one-party setting and do not consider privacy.
A few prior works [32], [33] have investigated the problem of two and multi-party private sampling in the information theoretic setting. These works focus on identifying the necessary setup to enable sampling from various distributions. We instead focus on the computational setting, and focus on reducing communication. Recently, Champion et al. [34] also considered the computational setting, but they focus on sampling from a publicly-known distribution whereas we sample from a private one.
Starting with the work of Dwork et al. [35], there has been a substantial body of work (e.g., [36], [37], [38], [120], [121], [122]) on using MPC to realize differentially private functionalities that protect the privacy of individual inputs given the output of the MPC. These works have focused on building efficient, private applications in machine learning and other fields, whereas we focus on reducing the communication necessary for the specific functionality of sampling.
A long line of work [39], [40], [41], [42], [43] has investigated building secure sketches for securely estimating statistics of Tor usage, web traffic, and other applications. These works focus on building sublinear communication and computation protocols for computing specific statistics such as unique count, median, etc.
In this section, we describe a secure two-party \(L_1\) sampling protocol. Given two \(n\)-dimensional vectors \(\boldsymbol{w}_1 = (w_{1,1}, \ldots, w_{1,n})\) and \(\boldsymbol{w}_2 = (w_{2,1}, \ldots, w_{2,n})\) as the private inputs from parties \(P_1\) and \(P_2\) respectively, the protocol samples from the \(L_1\) distribution according to \(\boldsymbol{w}_1 + \boldsymbol{w}_2\).
Let \(\boldsymbol{w} = (w_1, \ldots, w_n) \in \mathbb{R}^n\) be a non-zero vector. The \(L_p\) norm \(\Vert \boldsymbol{w}\Vert_p\) of \(\boldsymbol{w}\) is defined as \(\Vert \boldsymbol{w}\Vert_p := \left ( \sum_j |w_j|^p \right )^{1/p} .\) When there is no subscript, we mean the \(L_2\) norm; that is, \(\Vert \boldsymbol{w}\Vert := \Vert \boldsymbol{w}\Vert_2\).
Throughout the paper, we assume that the values \(w_{b,i}\) are represented by fixed-point precision numbers, and consider the cost of communicating such a number to be independent of \(n\). We assume all weights in vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are non-negative.
We first define an ideal functionality for the two-party \(L_1\) sampling. Slightly abusing the notation, let \(L_1(\boldsymbol{w}_1, \boldsymbol{w}_2)\) be a two-input sampling procedure based on the \(L_1\) distribution of \(\boldsymbol{w}_1 + \boldsymbol{w}_2\): \[\Pr[ L_1(\boldsymbol{w}_1, \boldsymbol{w}_2) \mbox{ samples } i ] = \frac{w_{1,i} + w_{2,i}}{ \|\boldsymbol{w}_1 + \boldsymbol{w}_2\|_1 }.\]
We give a more formal description of the functionality \(\mathcal{F}_{L_1}\) in the figure below. In Section 4.2.2, we present a protocol that securely realizes this functionality.
\(\mathcal{F}_{L_1}\): Ideal functionality for two-party \(L_1\) sampling
The functionality has the following parameter:
\(n \in \mathbb{N}\). The dimension of the input weight vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\).
The functionality proceeds as follows:
Receive inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) from \(P_1\) and \(P_2\) respectively.
Sample \(i \in [n]\) with probability \(\frac{w_{1,i} + w_{2,i}}{ \|\boldsymbol{w}_1 + \boldsymbol{w}_2\|_1 }\).
Send \(i\) to \(P_1\) and \(P_2\).
We describe our first attempt, which is insecure but provides good intuition for how we construct a secure protocol. In fact, the attack on this broken protocol, as well as the fix presented in the next subsection, remain relevant when we move to product sampling and \(L_2\) sampling as well. Since we assume that all the weights are non-negative, we observe that, letting \(p = \frac{\|\boldsymbol{w}_1\|_1}{ \|\boldsymbol{w}_1\|_1 + \|\boldsymbol{w}_2\|_1}\), the above measure can be rewritten as follows:
\[\label{eqn:l1} \Pr[ L_1(\boldsymbol{w}_1, \boldsymbol{w}_2) \mbox{ samples } i ] = \frac{w_{1,i}}{\|\boldsymbol{w}_1\|_1} \cdot p + \frac{w_{2,i}}{\|\boldsymbol{w}_2\|_1} \cdot (1-p).\]
Equation ([eqn:l1]) leads us to the following natural approach.
Party \(P_1\) samples \(i_1\) from the \(L_1\) distribution according to \(\boldsymbol{w}_1\), such that \(\Pr[P_1 \mbox{ samples } i_1] = \frac{w_{1,i_1}}{\|\boldsymbol{w}_1\|_1}\).
Party \(P_2\) samples \(i_2\) from the \(L_1\) distribution according to \(\boldsymbol{w}_2\), such that \(\Pr[P_2 \mbox{ samples } i_2] = \frac{w_{2,{i_2}}}{\|\boldsymbol{w}_2\|_1}\).
Then, \(P_1\) and \(P_2\) execute a secure protocol for the following procedure:
Execute a coin toss protocol with bias \(p\). Let \(b\) be the output of the coin-flip.
If \(b=0\) (resp., \(b=1\)), output \(i_1\) (resp., \(i_2\)).
The output of the protocol will achieve correct sampling.
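To make this concrete, the following (insecure) Python sketch simulates the toy protocol in the clear and checks empirically that the mixture of Equation ([eqn:l1]) yields the correct \(L_1\) distribution; the vectors and sample counts are illustrative:

```python
import numpy as np

def toy_l1_sample(w1, w2, rng):
    """Insecure two-party L1 sampling via the mixture of Equation (eqn:l1)."""
    s1, s2 = w1.sum(), w2.sum()
    p = s1 / (s1 + s2)                     # coin bias p = ||w1||_1 / (||w1||_1 + ||w2||_1)
    i1 = rng.choice(len(w1), p=w1 / s1)    # P1's local L1 sample
    i2 = rng.choice(len(w2), p=w2 / s2)    # P2's local L1 sample
    return i1 if rng.random() < p else i2  # b = 0 (probability p) selects i1, else i2

rng = np.random.default_rng(1)
w1, w2 = rng.random(5), rng.random(5)
draws = np.bincount([toy_l1_sample(w1, w2, rng) for _ in range(100_000)], minlength=5)
print(draws / draws.sum())                 # approaches (w1 + w2) / ||w1 + w2||_1
print((w1 + w2) / (w1 + w2).sum())
```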
However, this protocol has a subtle security issue. For example, let \(i\) be the eventual output index of the protocol. Then, we have the following:
If the coin flip \(b\) is 0, which happens with probability \(p\), it holds that \(i\) is always the same as \(i_1\).
On the other hand, if the coin flip \(b\) is 1, then \(i\) will be the same as \(i_1\) if and only if \(i_2 = i_1\), which happens with probability \(\frac{w_{2, i_1}}{ \| \boldsymbol{w}_2 \|_1 }\). This implies that we have \[\Pr[ i = i_1 | i_1 ] = p + (1-p) \cdot \frac{w_{2, i_1}}{ \| \boldsymbol{w}_2 \|_1 }\]
Now consider a distinguisher that corrupts \(P_1\), chooses inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\), and checks the above conditional probability, which is possible since the distinguisher can also see \(i_1\) through the corrupted \(P_1\). To prove security, we should be able to construct a simulator for \(P_1\) that fools this distinguisher. However, a simulator for \(P_1\) does not know \(\boldsymbol{w}_2\), which makes the above conditional probability unsimulatable.
In a sense, by having \(P_1\) choose \(i_1\), the protocol allows \(P_1\) to measure the conditional probability \(\Pr[ i = i_1 | i_1 ]\), which depends on the value \(w_{2, i_1}\) thereby leaking information about \(P_2\)’s input to \(P_1\).
We address the insecurity of the toy protocol by having the parties sample obliviously from \(\boldsymbol{w}_1\), \(\boldsymbol{w}_2\). This way, each party would not know whether the final output index matches the sample taken from its own vector, or the sample taken from the other party’s vector. Specifically, we will construct our protocol under the framework described below:
The parties obliviously sample \(i_1\) according to \(L_1\) distribution of \(\boldsymbol{w}_1\). The output index \(i_1\) is secret shared between the two parties. Let \(\langle i_1 \rangle\) denote the secret share of \(i_1\). Likewise, they obliviously sample \(\langle i_2 \rangle\) from \(L_1\) distribution of \(\boldsymbol{w}_2\).
Execute a secure two-party protocol to compute the following:
Flip a coin \(b\) with bias \(p\).
If \(b=0\), output the reconstruction of \(i_1\); otherwise output the reconstruction of \(i_2\).
Formally, we define an ideal functionality \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\) as follows:
\(\mathcal{F}_{{\mathsf{osample(L_1)}}}\): Ideal functionality for oblivious \(L_1\) sampling.
The functionality considers two participants, the sender and the receiver. The functionality is parameterized with a number \(n\).
Inputs: The sender has an \(n\)-dimensional weight vector \(\boldsymbol{w}\). The receiver has no input.
The functionality proceeds as follows:
Receive \(\boldsymbol{w}\) from the sender.
Sample \(i \in [n]\) with probability \(\frac{w_i}{ \|\boldsymbol{w}\|_1 }\).
Choose a random pad \(\pi \in \{0,1\}^\ell\), where \(\ell = \lceil \log_2 n \rceil\).
Send \(\pi\) to the sender and \(i \oplus\pi\) to the receiver.
We also give an ideal functionality \(\mathcal{F}_{{\mathsf{biasCoin}}}\) for the biased coin tossing.
\(\mathcal{F}_{{\mathsf{biasCoin}}}\): Ideal functionality for biased coin tossing.
The functionality considers two participants \(P_1\) and \(P_2\) and proceeds as follows:
Receive a number \(s_1\) as input from \(P_1\) and \(s_2\) from \(P_2\).
Flip a coin \(b\) with bias \(p = \frac{s_1}{s_1 + s_2}\).
Choose a random bit \(r \in \{0,1\}\).
Send \(r\) to \(P_1\) and \(r \oplus b\) to \(P_2\).
Based on the above functionalities, we describe a protocol securely realizing \(\mathcal{F}_{L_1}\) in the \((\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{\mathsf{biasCoin}})\)-hybrid.
Inputs: Party \(P_b\) has input \(\boldsymbol{w}_b\).
Execute \(\mathcal{F}_{\mathsf{osample(L_1)}}\) with \(P_1\) as a sender with input \(\boldsymbol{w}_1\) and \(P_2\) as a receiver. Let \(\langle i_1 \rangle\) be the secret share of the output index.
Execute \(\mathcal{F}_{\mathsf{osample(L_1)}}\) with \(P_2\) as a sender with input \(\boldsymbol{w}_2\) and \(P_1\) as a receiver. Let \(\langle i_2 \rangle\) be the secret share of the output index.
Execute \(\mathcal{F}_{\mathsf{biasCoin}}\) where \(P_1\) has input \(\|\boldsymbol{w}_1\|_1\) and \(P_2\) has input \(\|\boldsymbol{w}_2\|_1\). Let \(\langle b \rangle\) be the secret share of the output bit.
Execute \(\mathcal{F}_{2PC}\) for the following circuit:
Input: \(\langle i_1 \rangle, \langle i_2 \rangle, \langle b \rangle\).
Output: \(i_1 \cdot (1-b) + i_2 \cdot b\).
Protocol : Two-party \(L_1\) sampling in the \((\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{\mathsf{biasCoin}})\)-hybrid.
Theorem 7. Protocol [fig:l1] securely realizes \(\mathcal{F}_{L_1}\) with semi-honest security in the \((\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{\mathsf{biasCoin}})\)-hybrid.
The proof is found in Section [apx:pfTheorem1].
The main idea of the protocol is to have the parties securely sample a random number \(r\) from \([0, s)\), where \(s := \| \boldsymbol{w} \|_1\). Our construction is found in Protocol [fig:os-fhe].
Inputs: The sender has input \(\boldsymbol{w} = (w_1, \ldots, w_n)\).
The sender computes \(s := \| \boldsymbol{w} \|_1\).
The sender and the receiver execute \(\mathcal{F}_{2PC}\) to uniformly sample \(r\) from the range \([0, s)\). This is possible, since \(s\) has a fixed-point representation. Let \(r_1\) and \(r_2\) be the secret shares of \(r\) given to the sender and the receiver respectively.
The sender and the receiver set up a threshold FHE scheme. The plaintext space of the FHE is \(GF(2)\), which allows homomorphic bitwise-xor and bitwise-AND operations. Let \(\unicode{x27E6}m \unicode{x27E7}\) denote an FHE encryption of plaintext \(m\) which can be a bit or bits depending on the context.
The receiver sends \(\unicode{x27E6}r_2 \unicode{x27E7}\) so that the sender can compute \(\unicode{x27E6}r \unicode{x27E7}:= \unicode{x27E6}r_1 \unicode{x27E7}\oplus\unicode{x27E6}r_2 \unicode{x27E7}\).
The sender homomorphically evaluates the following circuit:
Let \(cnt_0 = 0\). For \(j=1,\ldots,n\), let \(cnt_j = cnt_{j-1} + w_{j}\).
Output \(i \in [1,n]\) such that \(r \in [cnt_{i-1}, cnt_{i})\). Let \(\unicode{x27E6}i \unicode{x27E7}\) be the output encryption from the above homomorphic evaluation.
The sender chooses a random pad \(\pi\), and then it sends \(\unicode{x27E6}c \unicode{x27E7}= \unicode{x27E6}i \unicode{x27E7}\oplus\unicode{x27E6}\pi \unicode{x27E7}\) to the receiver.
The two parties perform threshold decryption so that \(c\) is decrypted to the receiver.
The sender outputs \(\pi\) and the receiver outputs the decryption of \(c\).
Protocol : Oblivious sampling from threshold FHE
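In the clear, the circuit the sender evaluates under FHE is simply inverse-CDF sampling over the prefix sums \(cnt_j\). The following Python sketch shows that plaintext logic; the secret sharing of \(r\), the encryption, and the threshold decryption are all omitted:

```python
import numpy as np

def inverse_cdf_sample(w, rng):
    """Plaintext analogue of the homomorphically evaluated circuit: draw r
    uniformly from [0, s) with s = ||w||_1, then return the index i such
    that r lands in the prefix-sum interval [cnt_{i-1}, cnt_i)."""
    cnt = np.cumsum(w)                         # cnt_j = cnt_{j-1} + w_j
    r = rng.uniform(0.0, cnt[-1])              # in the protocol, r is secret shared
    return int(np.searchsorted(cnt, r, side="right"))

rng = np.random.default_rng(2)
w = np.array([0.1, 0.4, 0.2, 0.3])
draws = [inverse_cdf_sample(w, rng) for _ in range(100_000)]
print(np.bincount(draws) / len(draws))         # approaches w / ||w||_1
```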
Theorem 8. Assuming the existence of threshold FHE with \(\mathsf{IND}\mbox{-}\mathsf{CPA}\) security, Protocol [fig:os-fhe] securely realizes \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\) in the semi-honest security model.
The proof is found in Section [apx:pfTheorem2].
We note that, in Section 7.2.3, we give another construction that relies on sublinear 1-out-of-\(m\) oblivious transfer (OT), but requires computation that is exponential in the bit precision.
The secure construction for \(\mathcal{F}_{\mathsf{biasCoin}}\) is straightforward and can be found in Section [apx:bcoin].
In this section we consider the two-party \(L_2\) sampling functionality. Given input vectors \(\boldsymbol{w}_1, \boldsymbol{w}_2\), this functionality samples from the distribution \(D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) with the following probability mass function: \[\Pr[ D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2) \mbox{ samples } i ] = \frac{(w_{1,i} + w_{2,i})^2}{ \sum_j (w_{1,j} + w_{2,j})^2 } = \frac{(w_{1,i} + w_{2,i})^2}{ \| \boldsymbol{w}_1 + \boldsymbol{w}_2 \|_2^2 }.\]
We begin by presenting a non-private protocol for two-party \(L_2\) sampling with \(\tilde{O}(1)\) communication in Section 4.3.1; the construction is found in Protocol [fig:l2approx]. We then show how to implement the protocol securely in Section 4.3.2.
We begin by defining and showing how to sample from a helper distribution \(D_{\mathsf{ignore}}\).
Definition 9. For input vectors \(\boldsymbol{w}_1, \boldsymbol{w}_2\), let \(D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) be the distribution that “ignores” the cross term in \(D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)\). I.e. \(D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) samples index \(i \in [n]\) with probability \(\frac{w^2_{1,i} + w^2_{2,i}}{||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2}\).
Lemma 10. There exists a protocol \(\Pi_{\mathsf{ignore}}\) for sampling from \(D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) with \(\tilde{O}(1)\) communication.
Proof. Let \(\boldsymbol{w'}_b = (w_{b,1}^2, \ldots, w_{b,n}^2).\) The lemma follows by observing the following: \[D_\mathrm{ignore}(\boldsymbol{w}_1, \boldsymbol{w}_2) = D_{L_1}(\boldsymbol{w'}_1, \boldsymbol{w'}_2).\] ◻
Definition 11. For \(i \in [n]\), let the corrective parameter function be defined as \[f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i) := \frac{w_{1,i}^2 + 2w_{1,i}w_{2,i} + w_{2,i}^2}{||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2}.\]
Definition 12. The constant \(c := c(\boldsymbol{w}_1, \boldsymbol{w}_2)\) is defined as \[c(\boldsymbol{w}_1, \boldsymbol{w}_2) := \frac{||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2}{||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2}\] This ensures that for every \(i\), \(f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i) = c \cdot \Pr_{D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]\).
The following lemma will be useful for arguing the validity of the final protocol.
Lemma 13. For all \(i \in \mathsf{supp}(D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2))\), \(\Pr_{D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] \leq 2/c \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]\).
Proof. \[\begin{aligned} \Pr_{D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] &= \frac{w^2_{1,i} + 2w_{1,i}w_{2,i} + w^2_{2,i}}{||\boldsymbol{w}_1||_2^2 + 2\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle + ||\boldsymbol{w}_2||_2^2}\\ &= \frac{w^2_{1,i} + 2w_{1,i}w_{2,i} + w^2_{2,i}}{c \cdot (||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2)}\\ &\le \frac{2 \cdot (w_{1,i}^2 + w_{2,i}^2)}{c \cdot (||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2)}\\ & = \frac{2}{c} \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] \end{aligned}\] The inequality holds since \[\begin{aligned} 2(w^2_{1,i} + w^2_{2,i}) - (w^2_{1,i} + 2w_{1,i}w_{2,i} + w^2_{2,i}) &= w^2_{1,i} -2w_{1,i}w_{2,i} + w^2_{2,i}\\ &= (w_{1,i} - w_{2,i})^2\\ &\geq 0. \end{aligned}\] ◻
We now present the \(L_2\) sampling protocol \(\Pi_{L_2}\), which is described in Protocol [fig:l2approx]. We show the correctness and efficiency of the protocol.
Parties \(P_1\) and \(P_2\) have inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) respectively.
The protocol proceeds as follows:
Parties run \(\Pi_{\mathsf{ignore}}\) with inputs \(\boldsymbol{w}_1, \boldsymbol{w}_2\) that samples from \(D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) and obtain output \(i\).
For \(b \in \{1, 2\}\), \(P_b\) sends \(w_{b,i}, ||\boldsymbol{w}_b||_2^2\). Both parties compute \[\Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] = \frac{w_{1,i}^2 + w_{2,i}^2}{||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2} \quad \mbox{and} \quad f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i) = \frac{w_{1,i}^2 + 2w_{1,i}w_{2,i} + w_{2,i}^2}{||\boldsymbol{w}_1||_2^2 + ||\boldsymbol{w}_2||_2^2}\]
Parties output \(i\) with probability \[\begin{aligned} \frac{f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i)}{2 \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]} &= \frac{c \cdot \Pr_{D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]}{2 \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]}\\ &= \frac{\Pr_{D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]}{2/c \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]} \end{aligned}\] and otherwise return to step 1.
Protocol : Protocol for exact \(L_2\) sampling (\(\Pi_{L_2}\))
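The following Python sketch implements \(\Pi_{L_2}\) in the clear (no secret sharing or FHE) and checks that rejection sampling against \(D_{\mathsf{ignore}}\) reproduces \(D_{L_2}\); the input vectors are illustrative:

```python
import numpy as np

def l2_sample(w1, w2, rng):
    """Cleartext rejection sampler for D_L2(w1, w2), following Protocol Pi_L2."""
    q = w1**2 + w2**2                        # unnormalized D_ignore weights
    denom = q.sum()                          # ||w1||_2^2 + ||w2||_2^2
    while True:
        i = rng.choice(len(q), p=q / denom)       # step 1: sample from D_ignore
        f_c = (w1[i] + w2[i])**2 / denom          # corrective parameter f_c(w1, w2, i)
        p_ignore = q[i] / denom                   # Pr_{D_ignore}[i]
        if rng.random() < f_c / (2 * p_ignore):   # accept with prob. f_c / (2 Pr_ignore) <= 1
            return i                              # otherwise, retry from step 1

rng = np.random.default_rng(3)
w1, w2 = rng.random(6), rng.random(6)
draws = np.bincount([l2_sample(w1, w2, rng) for _ in range(100_000)], minlength=6)
print(draws / draws.sum())                        # approaches D_L2(w1, w2)
print((w1 + w2)**2 / ((w1 + w2)**2).sum())
```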
Lemma 14. With all but negligible probability, on inputs \(\boldsymbol{w}_1, \boldsymbol{w}_2\), \(\Pi_{L_2}\) samples exactly correctly from \(D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)\), and has communication \(\tilde{O}(1)\).
Proof. Note that \(\Pi_{L_2}\) simply performs rejection sampling in a distributed setting where sampling from \(D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) and computing the probabilities is done in a distributed manner. It is therefore well-known that as long as for all \(i \in [n]\), \[\label{eq:rejection} \Pr_{D_{L_2}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] \leq 2/c \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i],\] then \(\Pi_{L_2}\) samples from the exact correct distribution, and the number of samples required from \(D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) in protocol \(\Pi_{L_2}\) follows a geometric distribution with probability \(c/2\). Thus, if condition ([eq:rejection]) is met, the protocol samples exactly correctly and completes in an expected \(2/c\) (with \(2/c \leq 2\), since \(c \ge 1\)) number of rounds. Further, it can be immediately noted that condition ([eq:rejection]) is met due to Lemma 13. Finally, each round has \(\tilde{O}(1)\) communication, since \(\Pi_{\mathsf{ignore}}\) has communication \(\tilde{O}(1)\) (by Lemma 10) and since, in addition to that, only a constant number of length \(\tilde{O}(1)\) values are exchanged in each round. Combining the above, we have that \(\Pi_{L_2}\) has expected communication \(\tilde{O}(1)\) and worst case (with all but negligible probability) communication \(\tilde{O}(1)\). ◻
Remark 15. Note that the protocol and analysis above did not require that vectors \(\boldsymbol{w}_1, \boldsymbol{w}_2\) are normalized. I.e. we do not require that \(||\boldsymbol{w}_1||_1\) or \(||\boldsymbol{w}_2||_1\) are equal to \(1\) or to each other.
Inputs: Party \(P_b\) has input \(\boldsymbol{w}_b\).
Let \(B \in \tilde{O}(1)\). The parties perform the following steps for \(j \in [B]\):
Sample from \(D_{\mathsf{ignore}}(\boldsymbol{w}_1,\boldsymbol{w}_2)\) by doing the following: Invoke ideal functionality \(\mathcal{F}_{L_1}^{ss}\) with \(P_1\)’s input set to \(\boldsymbol{w}_1 \odot \boldsymbol{w}_1\) and \(P_2\)’s input set to \(\boldsymbol{w}_2 \odot \boldsymbol{w}_2\). Let \(\langle i_j \rangle\) be the secret share of the output index.
Parties compute encryptions of \(w_{1,i_j}, w_{2,i_j}\) using a threshold FHE scheme as follows.
Parties compute an encryption of \(i_j\) by exchanging encryptions of their shares and adding them.
Party \(b\) encrypts \(\boldsymbol{w}_b\) and uses FHE to locally compute an encryption of \(w_{b,i_j}\).
The parties then send these ciphertexts to each other.
Rejection Sampling. Compute a threshold FHE ciphertext \(\widehat{\mathsf{bias}}_j\) that encrypts \[\frac{f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i_j)}{2 \cdot \Pr_{D_{\mathsf{ignore}}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i_j]} = \frac{w_{1,i_j}^2 + 2w_{1,i_j}w_{2,i_j}+w_{2,i_j}^2}{2(w_{1,i_j}^2 + w_{2,i_j}^2)}.\] Invoke ideal functionality \(\mathcal{F}_{2PC}\) that takes encrypted bias \(\widehat{\mathsf{bias}}_j\), the threshold decryption keys, index \(i_j\), and random bits. The functionality executes a circuit that flips a coin with bias \(\widehat{\mathsf{bias}}_j\) and returns a ciphertext \(\widehat{\mathsf{out}}_j\), which is an encryption of \(i_j\) if the coin evaluates to \(1\) and an encryption of \(0\) otherwise.
Execute \(\mathcal{F}_{2PC}\) for the following circuit:
Input: \((\widehat{\mathsf{out}}_1, \ldots, \widehat{\mathsf{out}}_B)\) and threshold decryption keys.
Output: \(i_j\) corresponding to the minimum \(j\) such that \(\widehat{\mathsf{out}}_j\) decrypts to \(i_j \neq 0\). Or \(\bot\) if no such \(j \in [B]\) exists.
Protocol : Two-party \(L_2\) sampling in the \((\mathcal{F}_{L_1}^{ss}, \mathcal{F}_{2PC})\)-hybrid.
We present our secure \(L_2\) sampling protocol in Protocol [fig:l2]. For two \(n\)-dimensional vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\), we denote by \(\boldsymbol{w}_1 \odot \boldsymbol{w}_2\) the \(n\)-dimensional vector whose \(i\)-th entry is equal to \(w_{1,i} \cdot w_{2,i}\).
Our \(L_2\) sampling protocol uses ideal functionality \(\mathcal{F}_{L_1}^{ss}\), which works essentially the same as \(\mathcal{F}_{L_1}\) except that the output index is secret shared among both parties. We can securely realize this functionality with semi-honest security through a trivial change in the protocol \(\Pi_{L_1}\); for the sake of completeness, we provide the details in Section 7.2.4.
It is clear that the total communication complexity of the protocol is \(\tilde{O}(1)\), since each step in the loop has complexity \(\tilde{O}(1)\) and the loop iterates \(B \in \tilde{O}(1)\) times. Correctness is also immediate, since the protocol simply implements the \(\Pi_{L_2}\) sampling procedure, which was proven in Section 4.3.1 to be correct and to require at most \(B \in \tilde{O}(1)\) samples, with all but negligible probability.
Security of our protocol is stated through the following theorem.
Theorem 16. Assuming the existence of threshold FHE with \(\mathsf{IND}\mbox{-}\mathsf{CPA}\) security, Protocol [fig:l2] securely realizes the \(L_2\) sampling functionality in the \(\{\mathcal{F}_{L_1}^{ss}, \mathcal{F}_{2PC}\}\)-hybrid model with semi-honest security.
We provide the proof in Section [apx:pfTheorem3].
Parties \(P_1\) and \(P_2\) have inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) respectively.
The protocol proceeds as follows:
Parties run \(\Pi_{\mathsf{ignore}}\) with inputs \(\boldsymbol{w}_1, \boldsymbol{w}_2\) that samples from \(D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) and obtain output \(i\).
For \(b \in \{1, 2\}\), \(P_b\) sends \(w_{b,i}, ||\boldsymbol{w}_b||_p^p\). Both parties compute \[\Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] = \frac{w_{1,i}^p + w_{2,i}^p}{||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p} \quad \mbox{and} \quad f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i) = \frac{(w_{1,i} + w_{2,i})^p}{||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p}\]
Parties output \(i\) with probability \[\begin{aligned} \frac{f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i)}{2^{p-1} \cdot \Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]} &= \frac{c \cdot \Pr_{D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]}{2^{p-1} \cdot \Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]}\\ &= \frac{\Pr_{D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]}{2^{p-1}/c \cdot \Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]} \end{aligned}\] and otherwise return to step 1.
Protocol : Protocol for exact \(L_p\) sampling (\(\Pi_{L_p}\))
In this section we present an \(\tilde{O}(1)\)-communication sampling protocol for \(L_p\) sampling for constant \(p\). We present only the insecure version; extending it to a secure sampling protocol can be done entirely analogously to the construction for \(L_2\) sampling given in Section 4.3.2.
Given input vectors \(\boldsymbol{w}_1, \boldsymbol{w}_2\), \(L_p\) sampling refers to sampling from the distribution \(D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) with the following probability mass function: \[\Pr[ D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2) \mbox{ samples } i ] = \frac{(w_{1,i} + w_{2,i})^p}{ \sum_j (w_{1,j} + w_{2,j})^p } = \frac{(w_{1,i} + w_{2,i})^p}{ \| \boldsymbol{w}_1 + \boldsymbol{w}_2 \|_p^p }.\]
We begin by defining and showing how to sample from a helper distribution \(D_{\mathsf{ignore}, p}\).
Definition 17. For input vectors \(\boldsymbol{w}_1, \boldsymbol{w}_2\), let \(D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) be the distribution that “ignores” the cross term in \(D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\). I.e. \(D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) samples index \(i \in [n]\) with probability \(\frac{w^p_{1,i} + w^p_{2,i}}{||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p}\).
Lemma 18. There exists a protocol \(\Pi_{\mathsf{ignore}}\) for sampling from \(D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) with \(\tilde{O}(1)\) communication.
Proof. Let \(\boldsymbol{w'}_b = (w_{b,1}^p, \ldots, w_{b,n}^p).\) The lemma follows by observing the following: \[D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2) = D_{L_1}(\boldsymbol{w'}_1, \boldsymbol{w'}_2).\] ◻
Definition 19. For \(i \in [n]\), let the corrective parameter function be defined as \[f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i) := \frac{(w_{1,i} + w_{2,i})^p}{||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p}.\]
Definition 20. The constant \(c := c(\boldsymbol{w}_1, \boldsymbol{w}_2)\) is defined as \[c(\boldsymbol{w}_1, \boldsymbol{w}_2) := \frac{||\boldsymbol{w}_1 + \boldsymbol{w}_2||_p^p}{||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p}\] This ensures that for every \(i\), \(f_c(\boldsymbol{w}_1, \boldsymbol{w}_2, i) = c \cdot \Pr_{D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]\).
The following lemma will be useful for arguing the validity of the final protocol.
Lemma 21. For all \(i \in \mathsf{supp}(D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2))\), \[\Pr_{D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] \leq 2^{p-1}/c \cdot \Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i].\]
The proof is found in Section [apx:pfLemma5].
We now present the \(L_p\) sampling protocol \(\Pi_{L_p}\) in Protocol [fig:lpexact]. We show the correctness and efficiency of the protocol below.
Lemma 22. With all but negligible probability, on inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\), protocol \(\Pi_{L_p}\) samples exactly correctly from \(D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\). Further, for any constant \(p\), the protocol has communication \(\tilde{O}(1)\).
The proof is found in Section [apx:pfLemma6]. We note that this result strictly generalizes Lemma 14. In particular, setting \(p=2\) in the above protocol yields a protocol with exactly the same parameters as the \(L_2\) sampling protocol.
We next consider the problem of two-party sampling from a product distribution. Specifically, given \(n\)-dimensional vectors \(\boldsymbol{w_1}=(w_{1,1},\ldots,w_{1,n})\) and \(\boldsymbol{w_2}=(w_{2,1},\ldots,w_{2,n})\) as the private inputs from \(P_1\) and \(P_2\) respectively, we wish to sample from the distribution \(D_{\mathsf{prod}}\) defined by
\[\Pr[D_{\mathsf{prod}}(\boldsymbol{w}_1,\boldsymbol{w}_2)=i] = \frac{w_{1,i}\cdot w_{2,i}}{\sum_{j=1}^n w_{1,j} \cdot w_{2,j}} = \frac{w_{1,i} \cdot w_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}\]
Of course, if \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle= 0\), the probability space is not well-defined, and in this case, we require the protocol to simply output \(\bot\).
As before, we assume that all weights in \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are non-negative.
We now define an ideal functionality \(\mathcal{F}_{\mathsf{prod}}\) for two-party product sampling. This functionality is parametrized by a function \(f_{\mathsf{Leak}}\) capturing the leakage that the functionality gives to the adversary.
\(\mathcal{F}_{\mathsf{prod}}\): Ideal functionality for two-party product sampling
The functionality has the following parameters:
\(n \in \mathbb{N}\). The dimension of the input weight vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\).
A function \(f_{\mathsf{Leak}}\) describing the leakage.
The functionality proceeds as follows:
Receive inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) from \(P_1\) and \(P_2\) respectively.
Compute \(\mathsf{leak}=f_{\mathsf{Leak}}(\boldsymbol{w}_1,\boldsymbol{w}_2)\)
If \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle=0\), send \(\mathsf{leak}\) to the adversary and \(\bot\) to \(P_1\) and \(P_2\).
Otherwise, sample \(i\) with probability \(\frac{w_{1,i} \cdot w_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}\), send \(\mathsf{leak}\) to the adversary, and send \(i\) to \(P_1\) and \(P_2\).
Our goal is to find a protocol for two-party sampling with sublinear (in \(n\)) communication. However, unlike the case for \(L_1\) sampling, we show that this goal is actually impossible. Roughly speaking, if parties are allowed to have arbitrary input vectors, then a sublinear communication solution to product sampling implies a sublinear communication solution to the disjointness problem, which is known to be impossible.
For our impossibility result, we first define the two-party disjointness problem.
The disjointness problem checks if two input sets \(S\) and \(T\) are disjoint (i.e., \(S \cap T = \emptyset\)). Specifically, we consider a function \(\mathsf{DISJ}^n: \{0,1\}^n \times \{0,1\}^n \to \{0,1\}\) defined as: \[\mathsf{DISJ}^n(v_S, v_T) = \left \{ \begin{array}{l} 1 \mbox{ if } \langle v_S, v_T \rangle= 0 \\ 0 \mbox{ otherwise } \end{array} \right .\] In the above, \(v_S\) and \(v_T\) are the characteristic vectors of \(S\) and \(T\) respectively. The communication complexity of the disjointness problem is known to have a linear lower bound, as shown in the following theorem:
Theorem 23 ([123], [124], [125]). For any (even non-private) two-party protocol \(\Pi\) where the parties hold \(v_S\) and \(v_T\) respectively, if \(\Pi\) computes \(\mathsf{DISJ}^n(v_S, v_T)\) correctly with probability at least 2/3, the communication complexity of \(\Pi\) is \(\Omega(n)\).
We first observe that a simple reduction from disjointness shows that it is impossible to achieve sublinear product sampling in general: disjointness can be directly learned from whether the product sampling protocol outputs \(\bot\) or not.
Our impossibility result is stronger. We show that it is impossible to achieve sublinear product sampling even when the product sampling protocol is executed with input vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) in which all coordinates are bounded away from \(0\), which in particular guarantees that \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) is bounded away from \(0\).
Before stating a formal theorem below, for \(0 < \gamma < 1\), we first define \(\gamma\)-heaviness; we say that a vector \(\boldsymbol{w}\) is \(\gamma\)-heavy when each coordinate of \(\boldsymbol{w}\) is a number contained in \([\gamma,1]\).
Theorem 24. Let \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) be \(\gamma\)-heavy vectors of length \(n\), each respectively held by \(P_1\) and \(P_2\). Assume there exists a two-party protocol \(\Pi_{\mathsf{prod}}\) for the product sampling from \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\), with communication at most \(C := C(n, \gamma)\).
Then, for any \(\gamma \le \frac{1}{2n}\), there exists a constant \(\rho\) and a probabilistic protocol computing \(\mathsf{DISJ}^n\) correctly with probability at least 2/3 that has communication at most \(\log(n) + 1 + \rho \cdot (C + 1)\).
We construct a protocol computing \(\mathsf{DISJ}^n\) by taking advantage of \(\Pi_{\mathsf{prod}}\) as follows:
Parties \(\mathbf{A}\) and \(\mathbf{B}\) each get as input a vector \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}} \in \{0,1\}^n\). The goal is to output \(1\) if the vectors are “disjoint” and \(0\) otherwise.
Edge Case: If one of the parties’ inputs has Hamming weight \(0\), then they output \(1\) and send \(1\) to the other party. From now on, we assume that the Hamming weight of each party’s input is at least \(1\).
Preamble: We call the party with the lower Hamming weight input the designated party. To determine this, \(\mathbf{A}\) sends to \(\mathbf{B}\) the Hamming weight of its input vector \(\tilde{\boldsymbol{a}}\). If \(\mathbf{B}\)’s input has higher Hamming weight, it sends back the bit \(1\) to \(\mathbf{A}\); otherwise it sends \(0\).
Input Transformation: Let \(g_\gamma: \{0,1\}\to \mathbb{R}\) be a boosting function defined as \(g_\gamma(0) = \gamma\) and \(g_\gamma(1) = 1\). Each party \(\mathbf{A}\), \(\mathbf{B}\) locally transforms their input vector \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}}\) to \(\boldsymbol{a}\), \(\boldsymbol{b}\) by applying the boosting function in order to ensure \(\gamma\)-heaviness. That is, for \(i \in [n]\), set \(a_i = g_\gamma(\tilde{a}_i)\) and \(b_i = g_\gamma(\tilde{b}_i)\).
Sampling Protocol: The parties run the sampling protocol \(\Pi_{\mathsf{prod}}(\boldsymbol{a},\boldsymbol{b})\) and both receive some output \(i^*\).
Output Computation: The designated party checks the \(i^*\)th bit of its input by which we denote \(x\) (i.e., \(x = \tilde a_{i^*}\) or \(x = \tilde b_{i^*}\) depending on which party is the designated party). It sends \(1-x\) to the other party. Both parties output \(1-x\).
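The reduction can be simulated in the clear. In the following Python sketch, an ordinary (non-private) product sampler stands in for the hypothetical sublinear protocol \(\Pi_{\mathsf{prod}}\); the input length and trial counts are illustrative:

```python
import numpy as np

def disj_via_prod_sampling(a_tilde, b_tilde, gamma, rng):
    """One run of the reduction; returns 1 when it reports 'disjoint'."""
    if a_tilde.sum() == 0 or b_tilde.sum() == 0:        # Edge Case
        return 1
    designated = a_tilde if a_tilde.sum() <= b_tilde.sum() else b_tilde  # Preamble
    a = np.where(a_tilde == 1, 1.0, gamma)              # Input Transformation g_gamma
    b = np.where(b_tilde == 1, 1.0, gamma)
    w = a * b
    i_star = rng.choice(len(w), p=w / w.sum())          # stand-in for Pi_prod(a, b)
    return 1 - int(designated[i_star])                  # Output Computation

rng = np.random.default_rng(4)
n = 16
a_tilde, b_tilde = rng.integers(0, 2, n), rng.integers(0, 2, n)
gamma = 1.0 / (2 * n)
est = np.mean([disj_via_prod_sampling(a_tilde, b_tilde, gamma, rng)
               for _ in range(50_000)])
# By the lemmas below, est >= 2/5 on disjoint inputs and est <= 1/3 otherwise.
print(est, "disjoint" if np.dot(a_tilde, b_tilde) == 0 else "not disjoint")
```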
The following lemmas give the completeness and soundness of the protocol.
Lemma 25. If \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}}\) are disjoint, then the parties both output \(1\) with probability at least \(\frac{1}{2 + n \cdot \gamma}\).
Lemma 26. If \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}}\) are not disjoint, then the parties both output \(1\) with probability at most \(1-\frac{1}{1 + n \cdot \gamma}\).
Before we prove the lemmas, we briefly describe how we can use them to achieve a protocol that correctly computes \(\mathsf{DISJ}\) with probability at least \(2/3\). Note that we obtain a constant gap by setting \(\gamma = \frac{1}{2n}\): the parties output \(1\) on disjoint inputs with probability at least \(\frac{2}{5}\), and output \(1\) on non-disjoint inputs with probability at most \(\frac{1}{3}\). Since we have a constant gap between completeness and soundness, these can be amplified to \(2/3\) and \(1/3\) by running the protocol a constant number of times and thresholding the fraction of \(1\) outputs.
We would like to characterize the sublinearity condition for product sampling protocols using the normalized input vectors. We can do this since without loss of generality we can assume that input vectors to the product sampling protocols are normalized; in particular, for any (non-normalized) vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\), we have
\[\Pr\left[D_{\mathsf{prod}}\left(\frac{\boldsymbol{w}_1}{\|\boldsymbol{w}_1\|_1},\frac{\boldsymbol{w}_2}{\|\boldsymbol{w}_2\|_1}\right)=i\right] = \frac{ \frac{w_{1,i}}{\|\boldsymbol{w}_1\|_1} \cdot \frac{w_{2,i}}{\|\boldsymbol{w}_2\|_1}}{ \langle \frac{\boldsymbol{w}_1}{\|\boldsymbol{w}_1\|_1}, \frac{\boldsymbol{w}_2}{\|\boldsymbol{w}_2\|_1} \rangle} = \Pr[D_{\mathsf{prod}}(\boldsymbol{w}_1,\boldsymbol{w}_2)=i].\]
Specifically, we show below that the impossibility theorem implies that in order to achieve sublinear communication complexity for product sampling, we would need, at the minimum, a promise on the inputs that guarantees that \[\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle \in \Omega(1/n^2),\] when \(\boldsymbol{w}_1, \boldsymbol{w}_2\) are normalized vectors.
To do this, first note that the theorem implies that sublinear communication product sampling needs to have \(\gamma \in \Omega(1/n)\). Now, in the proof, any non-disjoint binary input vectors \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}}\) to the \(\mathsf{DISJ}\) problem have \(\langle \tilde{\boldsymbol{a}}, \tilde{\boldsymbol{b}} \rangle \ge 1\), and these vectors are transformed to \(g_\gamma(\tilde{\boldsymbol{a}})\) and \(g_\gamma(\tilde{\boldsymbol{b}})\). Let \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) be the normalizations of \(g_\gamma(\tilde{\boldsymbol{a}})\) and \(g_\gamma(\tilde{\boldsymbol{b}})\); that is, \(\boldsymbol{w}_1 := g_\gamma(\tilde{\boldsymbol{a}}) / \| g_\gamma(\tilde{\boldsymbol{a}}) \|_1\) and \(\boldsymbol{w}_2 := g_\gamma(\tilde{\boldsymbol{b}}) / \| g_\gamma(\tilde{\boldsymbol{b}}) \|_1\). Since each entry of \(g_\gamma(\tilde{\boldsymbol{a}})\) and \(g_\gamma(\tilde{\boldsymbol{b}})\) is at most 1, we have \(\|g_\gamma(\tilde{\boldsymbol{a}})\|_1 \le n\) and \(\|g_\gamma(\tilde{\boldsymbol{b}})\|_1 \le n\). Therefore, we have \[\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\ge \frac {\langle g_\gamma(\tilde{\boldsymbol{a}}), g_\gamma(\tilde{\boldsymbol{b}}) \rangle } { n \cdot n } \ge \frac 1 {n^2}.\]
Proof of Lemma 25. Assume that \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}}\) are disjoint, and moreover, assume WLOG that \(\mathbf{A}\) is the designated party, and its input vector has Hamming weight \(w\). Recall that \(a_i = g_\gamma(\tilde a_i)\) and \(b_i = g_\gamma(\tilde b_i)\). Let \[\begin{aligned} && W_{0,0} := \sum_{i : \tilde{a}_i=0, \tilde{b}_i=0} a_i \cdot b_i, ~~~~~~ W_{1,0} := \sum_{i : \tilde{a}_i=1, \tilde{b}_i=0} a_i \cdot b_i \\ && W_{0,1} := \sum_{i : \tilde{a}_i=0, \tilde{b}_i=1} a_i \cdot b_i, ~~~~~~ W_{1,1} := \sum_{i : \tilde{a}_i=1, \tilde{b}_i=1} a_i \cdot b_i \\ \end{aligned}\]
Note that \(W_{0,0} \leq n \cdot \gamma^2\). Further, \(W_{1,1} = 0\), since the vectors are disjoint, and \(W_{1,0} = w \cdot \gamma\) since the Hamming weight of \(\tilde{\boldsymbol{a}}\) is exactly \(w\). Additionally, note that \(W_{0,1} \geq W_{1,0}\), since \(\mathbf{A}\) is the designated party, so the Hamming weight of \(\tilde{\boldsymbol{a}}\) is less than or equal to the Hamming weight of \(\tilde{\boldsymbol{b}}\).
Note that when the designated party is \(\mathbf{A}\), the output of the protocol is \(1 - \tilde{a}_{i^*}\). Using the above facts, the probability of outputting \(1\) is \[\begin{aligned} \frac{W_{0,0} + W_{0,1}}{W_{1,1} + W_{0,0} + W_{0,1} + W_{1,0}} &\ge \frac{W_{0,1}}{W_{0,0} + W_{0,1} + W_{1,0}}\\ &\geq \frac{W_{0,1}}{n\gamma^2 + 2W_{0,1}}\\ &\geq\frac{w \cdot \gamma}{n\gamma^2 + 2w \cdot \gamma}\\ &=\frac{w}{n\gamma + 2w}\\ &\geq \frac{1}{n\gamma + 2}, \end{aligned}\] where the third inequality holds because \(x \mapsto \frac{x}{n\gamma^2 + 2x}\) is non-decreasing and \(W_{0,1} \geq W_{1,0} = w \cdot \gamma\), and the last inequality follows since \(w \geq 1\), due to the Edge Case step of the protocol. ◻
Proof of Lemma 26. Assume that \(\tilde{\boldsymbol{a}}\), \(\tilde{\boldsymbol{b}}\) are not disjoint. As before, consider \(W_{0,0}\), \(W_{1,0}\), \(W_{0,1}\), and \(W_{1,1}\). Note that \(W_{1,1} \geq 1\) since the inputs are not disjoint. We also have \(W_{0,0} + W_{0,1} + W_{1,0} \leq n \cdot \gamma\), since \(a_i\) or \(b_i\) is \(\gamma\) in these cases.
Using the above facts, the probability of outputting \(0\) is \[\begin{aligned} \frac{W_{1,0} + W_{1,1}}{W_{1,1} + W_{0,0} + W_{0,1} + W_{1,0}} &\geq \frac{W_{1,1}}{W_{1,1} + n \cdot \gamma}\\ &\ge \frac{1}{1 + n \cdot \gamma}. \end{aligned}\] ◻
As before, we assume that all weights in \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are non-negative. As discussed in the previous subsection, we also assume, without loss of generality, that \[\| \boldsymbol{w}_1 \|_1 = \| \boldsymbol{w}_2 \|_1 = 1.\]
We now show that the impossibility result of Section 4.4.1 can be bypassed if we make some assumptions on the inputs. Specifically, if we restrict ourselves to the case when \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle=\omega\left(\frac{\log n}{n}\right)\), then we can achieve a sublinear communication protocol for product sampling on inputs \(\boldsymbol{w}_1, \boldsymbol{w}_2\). Of course, by observing that the protocol uses sublinear communication, both parties will learn that such a promise on the inputs is satisfied; due to our lower bound, some leakage about the inputs is therefore necessary. In our protocol, we show that the information leaked is at most the inner product \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\). (Formally, we set \(f_{\mathsf{Leak}}(\boldsymbol{w}_1, \boldsymbol{w}_2) =\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).) Interestingly, we show that this is the case even though our protocol does not, and cannot, actually compute \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
Roughly, the protocol works as follows. The protocol proceeds in rounds where in each round \(P_1\) and \(P_2\) use the oblivious \(L_1\) sampling with a single input vector (\(\mathcal{F}_{\mathsf{osample(L_1)}}\)) to produce two secret-shared sampled indices, one from \(P_1\)’s input vector, and one from \(P_2\)’s input vector. The parties then run a secure 2-PC protocol to securely compare these values, and if they are equal, output the sampled index. If the two sampled indices are not equal, the parties move to the next round.
We describe a private two-party protocol for product sampling leaking at most the inner product (see Protocol [prot:prodsampapx]). This protocol is in the \(\{\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{2PC} \}\)-hybrid model.
Inputs: Party \(P_b\) has input \(\boldsymbol{w}_b\) of length \(n\).
Invoke the \(\mathcal{F}_{\mathsf{osample(L_1)}}\) ideal functionality with \(P_1\) as the sender with input \(\boldsymbol{w}_1\) and \(P_2\) as the receiver. Let \(i_{1,1}\) and \(i_{1,2}\) be the output from the ideal functionality to \(P_1\) and \(P_2\) respectively.
Invoke the \(\mathcal{F}_{\mathsf{osample(L_1)}}\) ideal functionality with \(P_2\) as the sender with input \(\boldsymbol{w}_2\) and \(P_1\) as the receiver. Let \(i_{2,1}\) and \(i_{2,2}\) be the output from the ideal functionality to \(P_1\) and \(P_2\) respectively.
Invoke the \(\mathcal{F}_{2PC}\) ideal functionality with the following circuit:
Input: \((i_{1,j}, i_{2,j})\) for \(j = 1, 2\).
Let \(i_1 = i_{1,1} \oplus i_{1,2}\), \(i_2 = i_{2,1} \oplus i_{2,2}\).
If \(i_1\) is equal to \(i_2\), output \(i_1\) to both \(P_1\) and \(P_2\). Otherwise, output \(\bot\).
If the output from the ideal functionality is \(\bot\), go back to Step 1. Otherwise, output whatever \(\mathcal{F}_{2PC}\) outputs.
Output: Both parties output the sampled value \(i\).
Protocol : Product sampling (\(\Pi_{\mathsf{prod}}^{IP}\)) in the \(\{\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{2PC} \}\)-hybrid.
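Stripped of the oblivious sampling and 2PC machinery, the protocol is simply “sample from each vector until the samples collide.” The following Python sketch checks that the colliding index indeed follows \(D_{\mathsf{prod}}\); the vectors and bounds are illustrative:

```python
import numpy as np

def prod_sample(w1, w2, rng, max_rounds=10**6):
    """Cleartext analogue of Pi_prod^IP: resample until the two L1 samples collide."""
    p1, p2 = w1 / w1.sum(), w2 / w2.sum()
    for _ in range(max_rounds):
        i1 = rng.choice(len(w1), p=p1)   # sample from P1's vector
        i2 = rng.choice(len(w2), p=p2)   # sample from P2's vector
        if i1 == i2:                     # the secure protocol tests this inside F_2PC
            return i1
    return None                          # corresponds to outputting bot

rng = np.random.default_rng(5)
w1, w2 = rng.random(8), rng.random(8)
w1, w2 = w1 / w1.sum(), w2 / w2.sum()
draws = np.bincount([prod_sample(w1, w2, rng) for _ in range(50_000)], minlength=8)
print(draws / draws.sum())               # approaches w1 * w2 / <w1, w2>
print(w1 * w2 / np.dot(w1, w2))
```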
We will prove the following theorem.
Theorem 27. Protocol \(\Pi_{\mathsf{prod}}^{IP}\) securely realizes \(\mathcal{F}_{\mathsf{prod}}\) with leakage \(f_{\mathsf{Leak}}(\boldsymbol{w}_1, \boldsymbol{w}_2) = \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) in the \(\{\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{2PC} \}\)-hybrid model with semi-honest security.
Proof. We describe the simulator \({\mathsf{Sim}}\) in the \(\{\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{2PC} \}\)-hybrid model for the case that Party 1 is corrupted. The simulator and proof of security are analogous in the case that Party 2 is corrupted.
\({\mathsf{Sim}}\) receives as input \(\boldsymbol{w}_1\), the output \(i^*\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\). \({\mathsf{Sim}}\) samples \(r^*\) from a geometric distribution with success probability \(p = \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
\({\mathsf{Sim}}\) invokes Party 1 on input \(\boldsymbol{w}_1\). For \(i \in [r^*-1]\), Party 1 sends its input to the first invocation of \(\mathcal{F}_{\mathsf{osample(L_1)}}\) and \({\mathsf{Sim}}\) returns to it a random value in \(\mathbb{Z}_n\). Party 1 sends its input to the second invocation of \(\mathcal{F}_{\mathsf{osample(L_1)}}\) and \({\mathsf{Sim}}\) returns to it a random value in \(\mathbb{Z}_n\). Party 1 sends its input to the \(\mathcal{F}_{2PC}\) functionality and \({\mathsf{Sim}}\) returns to it \(\bot\). For \(i = r^*\), Party 1 sends its input to the first invocation of \(\mathcal{F}_{\mathsf{osample(L_1)}}\) and \({\mathsf{Sim}}\) returns to it a random value in \(\mathbb{Z}_n\). Party 1 sends its input to the second invocation of \(\mathcal{F}_{\mathsf{osample(L_1)}}\) and \({\mathsf{Sim}}\) returns to it a random value in \(\mathbb{Z}_n\). Party 1 sends its input to the \(\mathcal{F}_{2PC}\) functionality and \({\mathsf{Sim}}\) returns to it \(i^*\).
It is clear that the view of Party 1 is identical in the ideal and real world, assuming that \({\mathsf{Sim}}\) samples the first succeeding round, \(r^*\), from the correct distribution. In the following, we argue that this is indeed the case.
First, note that on any given round, we have \[p_c := \Pr[\mbox{collision}] = \sum_i \Pr[i_1 = i \wedge i_2 = i] = \sum_i w_{1,i} \cdot w_{2,i} = \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle.\]
Let \(\mathsf{FirstSuccess}(r)\) denote the event in which the protocol succeeds for the first time on the \(r\)-th round. Now, for \(r \in \mathbb{N}\), we have \[\begin{aligned} &\Pr[\mathsf{FirstSuccess}(r) \mbox{ AND the output is } i^*]\\ &= \Pr[\mbox{no collision in first } r-1 \mbox{ rounds}] \cdot \Pr[i_1 = i^* \wedge i_2 = i^* \mbox{ on the $r$th round}]\\ &= (1-p_c)^{r-1} \cdot \Pr[i_1 = i^* \wedge i_2 = i^*] \end{aligned}\] Now, the probability that the protocol eventually outputs \(i^*\) is: \[\begin{aligned} &\Pr[\mbox{protocol eventually outputs } i^* \mbox{ after some number of rounds}]\\ &= \sum_{j=1}^\infty \Pr[\mathsf{FirstSuccess}(j) \mbox{ AND the output is } i^*] \\ &= \Pr[i_1 = i^* \wedge i_2 = i^*] \sum_{j=1}^\infty (1-p_c)^{j-1} = \Pr[i_1 = i^* \wedge i_2 = i^*] \cdot \frac{1}{p_c}. \end{aligned}\] Thus, the probability of \(\mathsf{FirstSuccess}(r)\) conditioned on the output being \(i^*\) is: \[\begin{aligned} &\Pr[\mathsf{FirstSuccess}(r) | \mbox{ the output is $i^*$}]\\ &= \frac{\Pr[\mathsf{FirstSuccess}(r) \mbox{ AND the output is } i^*]}{\Pr[\mbox{protocol eventually outputs } i^* \mbox{ after some number of rounds}]}\\ &= \frac{(\Pr[i_1 = i^* \wedge i_2 = i^*]) \cdot (1-p_c)^{r-1}}{\Pr[i_1 = i^* \wedge i_2 = i^*] \cdot \frac{1}{p_c}}\\ &= p_c \cdot (1-p_c)^{r-1}. \end{aligned}\]
The above is exactly the probability that the first success in a sequence of Bernoulli trials (each with success probability \(p_c = \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\)) occurs on trial \(r\). Sampling the first succeeding round is therefore equivalent to sampling from a geometric distribution with success probability \(p_c = \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\), which is exactly what \({\mathsf{Sim}}\) does. ◻
As shown above, the number of rounds \(r\) needed by this protocol is distributed as the number of Bernoulli trials (with probability \(p = \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\)) needed to get one success. Thus, the expected number of rounds is \(r=\frac{1}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}\). In each round, the communication consists of a secure 2-PC of equality on \(O(\log n)\)-bit inputs, which can be done in \(O(\log n)\) communication and \(O(1)\) rounds. Thus, in total, this protocol has expected communication \(O(\frac{\log n}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle})\) and \(O(\frac{1}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle})\) rounds. This communication is sublinear in \(n\) when \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle= \omega\left(\frac{\log n}{n}\right)\).
Trading efficiency for privacy. In the proof above, the simulator requires the value of \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\), which is not revealed by the output. However, a slight modification to the protocol allows us to remove this leakage at the cost of additional, though still sub-linear, communication. Instead of terminating the protocol the first time there is a collision in the \(L_1\) samples, we can pad the communication cost by making \(O(\frac{n}{\log n})\) calls to \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\). Under the promise of \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle= \omega(\frac{\log n}{n})\), this ensures a collision in the outputs (with all but negligible probability). The parties can then use \(O(\frac{n}{\log n})\) communication to obliviously find and output the collision, without revealing the index, and avoiding the leakage of \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
Generalizing this idea, we arrive at a set of similar protocol modifications that support a continuous set of tradeoffs: instead of choosing between leaking \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) to the simulator, or padding to the maximum communication, we can choose to leak some lower bound on \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\), and modify the protocol to make a proportionate number of calls to \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\), search (obliviously) for a collision, and repeat if necessary.
Without a full proof, we provide some intuition for the fact that this tradeoff between leakage and communication is inherent. We can do so by generalizing the statement of Theorem 24. We first modify the definition of \(\gamma\)-heaviness given previously: for any \(t(n) = O(n)\), we say that a vector \(\boldsymbol{w}\) of length \(n\) is \(\gamma_{t,n}\)-heavy if each of its first \(t := t(n)\) coordinates is a number contained in \([\gamma, 1]\). In particular, we now allow \(t(n) = o(n)\). Then, with a small modification to the reduction, we can prove that if \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are \(\gamma_{t,n}\)-heavy, and if there exists a protocol \(\Pi_{\mathsf{prod}}\) for product sampling with communication at most \(C := C(n, \gamma)\), then there exists a protocol for computing \(\mathsf{DISJ}^t\) with communication \(\log(n) + O(C)\). In the modified reduction, the parties simply increase the weights of the \(t\) input slots (as before), and append \(n-t\) entries containing 0 at the end. Since we know that \(\mathsf{DISJ}^t\) requires \(\Omega(t)\) communication, the implication is that we have increasingly weaker communication bounds as we are provided increasingly strong promises on the inner product. Conversely, for a certain set of input vectors, observing the communication of the sampling protocol gives a bound on the inner product of the inputs. The less communication observed, the tighter that bound, and the greater the leakage.
In Section 4.4, we showed a sublinear communication protocol for product sampling when \(\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle\) is sufficiently large. Moreover, this protocol provably leaks no more information than the inner product. However, it requires \(O(1/\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle)\) rounds of communication. This raises the question of whether constant-round sublinear product sampling is possible under the same restrictions on the inputs.
Our protocol to achieve this takes a relatively standard approach. Suppose that we are given the value of \(\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle\). Then, since the expected number of samples until a collision is a function of \(\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle\), we can just run the inner loop of protocol \(\Pi_{\mathsf{prod}}\) in parallel sufficiently many times to guarantee that the protocol would terminate with all but negligible probability.
However, there is one catch: it is not actually possible to compute \(\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle\) in sublinear communication! One simple solution is to use our promise on the input: we could run the inner loop enough times to guarantee termination for any inputs satisfying the promise (e.g. \(\omega(\frac{n}{\log n})\) times). However, this forces us to adopt the worst-case communication cost, which might be undesirable. (Recall, it also offers the least leakage, which might be desirable.) Instead, we re-establish the trade-off between leakage and efficiency as follows. We begin by computing an approximation of the inner product in sublinear communication (see Section 4.5.1). Using this approximation, we then realize our sublinear-communication, constant-round protocol for product sampling in the next subsection.
We achieve a protocol that securely approximates the inner product with sublinear communication. In particular, we take advantage of the well-known Johnson–Lindenstrauss Transform (JLT) sketch [126], [127].
We assume that \(\boldsymbol{w_1}\) and \(\boldsymbol{w_2}\) are normalized and correlated such that \(\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle = \omega(\log n/n)\). In a similar vein, we assume that the cosine similarity of the two vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) is not small, e.g., \(\omega(1/\log n)\).
Recall that the cosine similarity between the two vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) is defined as \(\cos(\boldsymbol{w}_1, \boldsymbol{w}_2) = \frac{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}{ \| \boldsymbol{w}_1 \|_2 \cdot \| \boldsymbol{w}_2 \|_2}\). Since the \(L_1\) norm of each vector is equal to 1, their \(L_2\) norms will typically be much smaller than 1, which implies that the cosine similarity is usually much larger than \(\langle \boldsymbol{{w}_1}, {\boldsymbol{w}_2} \rangle\).
The JLT sketch of \(\boldsymbol{x}\) is equal to \(\boldsymbol{M} \boldsymbol{x}\), where \(\boldsymbol{M}\) is a random \(k \times n\) matrix with \(k \ll n\). More specifically, the inner product of the two vectors is approximated as follows:
\({\mathsf approxIP}(\boldsymbol{w}_1, \boldsymbol{w}_2)\): \(\rhd\) \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are \(n\)-dimensional vectors.
Choose a \(k \times n\) matrix \(\boldsymbol{M}\) whose entries \(M_{i,j}\) are drawn independently from a Gaussian distribution with mean \(0\) and variance \(1\).
Output \(\frac 1 k \cdot \langle \boldsymbol{M}\boldsymbol{w}_1, \boldsymbol{M}\boldsymbol{w}_2 \rangle\). (Here, we slightly abuse the notation and treat the vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) as column vectors.)
Lemma 28. (cf. [128], Corollary 3.1) For all \(\boldsymbol{w}_1, \boldsymbol{w}_2\) such that \(\cos(\boldsymbol{w}_1, \boldsymbol{w}_2)\geq t\), the procedure \({\mathsf approxIP}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) approximates \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) up to a \(1\pm \epsilon\) approximation factor with all but negligible probability (over the choice of the JLT matrix), using JLT dimension \(k = \omega\left(\frac{\log(n)}{t^2 \cdot \epsilon^2}\right)\).
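A minimal Python sketch of \({\mathsf approxIP}\) follows; the dimensions \(n\) and \(k\) are illustrative, and in the actual protocol \(\boldsymbol{M}\) is generated and applied under threshold FHE:

```python
import numpy as np

def approx_ip(w1, w2, k, rng):
    """approxIP: estimate <w1, w2> from k-dimensional JLT sketches M @ w1, M @ w2."""
    M = rng.normal(0.0, 1.0, size=(k, len(w1)))   # entries i.i.d. N(0, 1)
    return np.dot(M @ w1, M @ w2) / k

rng = np.random.default_rng(6)
n, k = 10_000, 2_000                              # illustrative dimensions
w1, w2 = rng.random(n), rng.random(n)
w1, w2 = w1 / w1.sum(), w2 / w2.sum()             # normalized inputs
print(approx_ip(w1, w2, k, rng), np.dot(w1, w2))  # estimate vs. exact inner product
```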
What is interesting is that the approximate inner product does not reveal anything more than the inner product itself (beyond the squared norms \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\) and \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\)). In this sense, it satisfies the notion of private approximation introduced in [129]. In particular, we prove the following:
Lemma 29. The output of \(\mathsf{approxIP}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) can be simulated perfectly given only \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\), \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
The proof is found in Section [apx:pfLemma10].
Using the JLT sketch, we can design a private protocol approximating the inner product. See Protocol [fig:ip-fhe]. The protocol uses threshold FHE (e.g., [130]).
Parties \(P_1\) and \(P_2\) have inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) respectively.
The protocol proceeds as follows:
Parties set up a threshold FHE scheme.
They securely sample the \(k \times n\) matrix \(\boldsymbol{M}\) described above within the threshold FHE. In particular, they jointly generate an encrypted random seed \(\unicode{x27E6}s \unicode{x27E7}\). Using this randomness, the parties homomorphically evaluate \(\unicode{x27E6}PRG(s) \unicode{x27E7}\), where \(PRG\) is a pseudorandom generator, to obtain the encrypted JLT matrix \(\unicode{x27E6}\boldsymbol{M} \unicode{x27E7}\).
Each party \(P_b\) homomorphically evaluates \(\unicode{x27E6}\tilde{ \boldsymbol{w}}_b \unicode{x27E7}= \unicode{x27E6}\boldsymbol{M} \boldsymbol{w}_b \unicode{x27E7}\).
Party \(P_1\) sends \(\unicode{x27E6}\tilde{\boldsymbol{w}}_1 \unicode{x27E7}\) to \(P_2\).
Party \(P_2\) homomorphically evaluates \(\unicode{x27E6}\langle \tilde {\boldsymbol{w}}_1, \tilde{\boldsymbol{w}}_2 \rangle \unicode{x27E7}\) and sends it to \(P_1\).
Parties execute threshold decryption to obtain and output \(\frac 1 k \cdot \langle \tilde{\boldsymbol{w}}_1, \tilde {\boldsymbol{w}}_2 \rangle\).
Protocol : Private protocol for computing approximate inner product
Since every protocol message is a ciphertext, based on semantic security of the threshold FHE, it is easy to see that the protocol securely realizes a functionality for computing \({\mathsf approxIP}\). Based on Lemma 29, the leakage profile of the functionality is \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\), \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
Note that Protocol [prot:prodsampapx] has the following structure. In particular:
The probability that Protocol [prot:prodsampapx] samples a good index and halts in a given trial is \(p = \langle\boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
We need to repeat \(r\) trials in parallel so that the probability that all \(r\) trials fail is negligible. In other words, we should have \[(1 - p)^{r} \le e^{-p \cdot r} \le e^{-\omega(\log \kappa)}.\] This means that we should have \(r > \frac {\omega(\log \kappa)}{p}\).
Moreover, in the previous subsection, we discussed how to obtain a good estimate \(\tilde p = (1 \pm \epsilon) p\). Therefore, we should have \[r > \frac {(1+\epsilon) \cdot \omega(\log \kappa) }{\tilde p}> \frac {\omega(\log \kappa)} p .\]
In summary, by running \(\frac {(1+\epsilon) \cdot \omega(\log \kappa) }{\tilde p}\) instances in parallel, we achieve constant-round protocols for product sampling with negligible failure probability. The final protocol must perform extra steps to hide which trial the output comes from, and these changes can be made in a straightforward way.
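For illustration, the following snippet computes the repetition count for concrete parameters; the constant \(c\) standing in for the \(\omega(\log \kappa)\) term is an assumption made only for this example.

```python
import math

def num_parallel_trials(p_tilde: float, kappa: int, eps: float, c: float = 2.0) -> int:
    """Trials r with (1 - p)^r <= e^{-c * log(kappa)}; c replaces the omega(log kappa) term."""
    return math.ceil((1 + eps) * c * math.log(kappa) / p_tilde)

print(num_parallel_trials(p_tilde=0.01, kappa=128, eps=0.1))  # 1068 parallel trials
```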
Recall that one of our main motivations for this work is to instantiate a two-party version of the exponential mechanism to achieve differential privacy. We observe that for many natural loss functions (i.e., when the loss function is additive across the two parties), the exponential mechanism on two parties is essentially equivalent to product sampling. We explain this further with a concrete example in Section 4.6.1.
Suppose we want to choose a classifier minimizing the \(L_2\) error over a test dataset while preserving differential privacy of the labeled examples. Suppose there are \(n\) machine learning classifiers \((c_1, \ldots, c_n)\), and a test dataset \(D = (d_1, \ldots, d_{|D|})\) consists of \(|D|\) rows. Let \(\ell_j \in \{0,1\}\) be the label of the \(j\)-th row \(d_j\) of the dataset. For a machine learning classifier \(c_i\), we define its \(L_2\) loss function as follows:
\[f^{c_i}_{\mathsf{loss}}(D) := \sum_{j \in [|D|]}(c_i(d_j) - \ell_j)^2/|D|.\]
Now, consider a two-party federated setting in which the parties would like to perform computation on the aggregation of their local datasets. In particular, we assume party \(P_1\) (resp., party \(P_2\)) holds dataset \(D_1\) (resp., \(D_2\)) with \(|D_1| = |D_2|\). Let \(D = D_1 || D_2\).
In our mechanism, the central curator would receive input from parties \(P_1\) and \(P_2\) and choose classifier \(c_i\) with a \((\epsilon, 0)\)-DP guarantee using the exponential mechanism.
We observe that the \(L_2\) loss function \(f^{c_i}_{\mathsf{loss}}(D)\) over the entire dataset \(D\) can be computed by each party \(P_b\) first locally computing \[f^{c_i}_{\mathsf{loss}}(D_b) := \sum_{j \in [|D_b|]}(c_i(d_{b,j}) - \ell_{b,j})^2/|D|,\] and then computing \(f^{c_i}_{\mathsf{loss}}(D) = f^{c_i}_{\mathsf{loss}}(D_1) + f^{c_i}_{\mathsf{loss}}(D_2).\)
Based on the above observation, in our mechanism, each party \(P_b\) computes a vector \(\boldsymbol{v}_b\) as follows:
For \(b \in \{1,2\}\), let \(\boldsymbol{v}_b = (v_{b,1}, \ldots, v_{b, n})\), where \(v_{b,i} = e^{-\epsilon \cdot \frac{ f^{c_i}_{\mathsf{loss}}(D_b)}{2 \Delta u}}\), \(\Delta u = \frac{\epsilon}{40(\log n + \kappa)}\), and \(\kappa\) is the security parameter. Then, each party \(P_b\) computes \(\boldsymbol{w}_b := \frac{\boldsymbol{v}_{b}}{||\boldsymbol{v}_b||_1}\) (i.e., the normalization of \(\boldsymbol{v}_b\)), and sends \(\boldsymbol{w}_b\) to the central curator. Finally, the curator chooses classifier \(c_i\) with probability \(\frac{w_{1,i} \cdot w_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}\).
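The following Python sketch renders the curator's side of this mechanism; the helper names are illustrative, and the losses are shifted by their minimum before exponentiation for numerical stability, which leaves each \(\boldsymbol{w}_b\) unchanged since the normalization is scale-invariant.

```python
import numpy as np

def local_weights(losses: np.ndarray, eps: float, delta_u: float) -> np.ndarray:
    """P_b's normalized weights: w_{b,i} proportional to e^{-eps * loss_i / (2 * delta_u)}."""
    v = np.exp(-eps * (losses - losses.min()) / (2 * delta_u))  # shift: w_b is scale-invariant
    return v / v.sum()

def curator_sample(w1: np.ndarray, w2: np.ndarray, rng) -> int:
    """Choose index i with probability w_{1,i} * w_{2,i} / <w1, w2>."""
    p = w1 * w2
    return int(rng.choice(len(p), p=p / p.sum()))

n, kappa, eps = 16, 40, 0.5
delta_u = eps / (40 * (np.log(n) + kappa))
rng = np.random.default_rng(0)
loss1, loss2 = rng.random(n), rng.random(n)  # stand-ins for f_loss^{c_i}(D_b)
i = curator_sample(local_weights(loss1, eps, delta_u), local_weights(loss2, eps, delta_u), rng)
print("chosen classifier index:", i)
```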
Lemma 30. If \(|D| \ge 40 (\log n + \kappa)/\epsilon\), our mechanism provides \((\epsilon, 0)\)-DP.
Proof. Let \(q_i(D) = \frac{-f^{c_i}_{\mathsf{loss}}(D)}{2 \Delta u}\). We first show that drawing a sample from the product distribution of \(\boldsymbol{w}_1, \boldsymbol{w}_2\) is identical to running the exponential mechanism to select classifier \(c_i\) with probability \[\frac{e^{\epsilon q_i(D)} }{\sum_{j \in [n]} e^{\epsilon q_j(D)}}.\] Recall that product sampling on inputs \(\boldsymbol{w}_1, \boldsymbol{w}_2\) returns index \(i\) with probability \(\frac{{w}_{1,i} \cdot {w}_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2\rangle}\). Since \({w}_{b,i} = \frac{{v}_{b,i}}{||\boldsymbol{v}_b||_1}\), we can rewrite the previous equation as \[\frac{{w}_{1,i} \cdot {w}_{2,i}}{\langle \boldsymbol{w}_1, \boldsymbol{w}_2\rangle} = \frac{{v}_{1,i} \cdot {v}_{2,i}}{\langle \boldsymbol{v}_1, \boldsymbol{v}_2 \rangle} = \frac{e^{\epsilon \cdot q_i(D_1)}\cdot e^{\epsilon \cdot q_i(D_2)}} {\sum_{j \in [n]} e^{\epsilon \cdot q_j(D_1)} \cdot e^{\epsilon \cdot q_j(D_2)}} = \frac{e^{\epsilon \cdot q_i(D)}} {\sum_{j \in [n]} e^{\epsilon \cdot q_j(D)}}.\]
Note that the loss functions have global sensitivity \(1/|D|\), since a change to the \(j\)-th row of \(D_b\) can cause \(f^{c_i}_{\mathsf{loss}}(D_b)\) to change by \(\pm 1/|D|\). Instead of using \(1/|D|\), our mechanism uses the larger value \(\Delta u\). Since \(\Delta u\) is larger than the sensitivity whenever \(|D| \ge 40 (\log{n}+\kappa)/\epsilon\), and our mechanism is a simple exponential mechanism, \((\epsilon, 0)\)-DP holds. ◻
Although using the larger value \(\Delta u\) deteriorates the utility of the mechanism, we show that the utility is still acceptable. Applying Theorem 3.11 in [131] to our setting, we have \[\Pr\left[f_{\mathsf loss}^{c_i}(D) \ge f_{\mathsf loss}^{c_{opt}}(D) + \frac{2 \Delta u}{\epsilon} \cdot (\log n + \kappa) \right] \le e^{-\kappa}.\]
Noting that \(\Delta u = \frac{\epsilon}{40(\log n + \kappa)}\), we have \[\Pr\left[f_{\mathsf loss}^{c_i}(D) \ge f_{\mathsf loss}^{c_{opt}}(D) + 1/20 \right] \le e^{-\kappa},\] where \(f_{\mathsf loss}^{c_{opt}}(D)\) denotes the loss of the optimal classifier. This implies that, except with negligible probability, our mechanism returns a classifier that is at most 5% less accurate than the optimal one. We note that an even smaller loss in accuracy can be achieved by decreasing \(\Delta u\) and increasing the minimum size of \(D\) accordingly.
Jumping ahead, we use this larger \(\Delta u\) in order to achieve differential privacy of the approximate inner product evaluation to be described in the next section.
Based on the result of the previous subsection, we can simply run the product sampling protocol to achieve a two-party exponential mechanism without the central curator. However, there is one issue we need to address. In particular, the leakage from the previously described protocols for product sampling violates the DP guarantee of the exponential mechanism; the leakage \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\) is clearly not differentially private with respect to \(P_2\)’s input.
Thus, to instantiate the exponential mechanism, we give an alternative inner product approximation protocol that achieves differential privacy. Using this approximation, we can build a protocol that is able to sample from exactly the product distribution while additionally leaking a value \(\mathsf{leak}\) that is differentially-private and thus does not violate the DP guarantee of the exponential mechanism. We build such a protocol based on the approximate inner product using the JLT given in Section 4.5.1.
We now describe a mechanism, executed by a trusted curator, to approximate the inner product on inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) with \({w}_{b,i} = \frac{{v}_{b,i}}{||\boldsymbol{v}_b||_1}\) for \(b \in \{1,2\}\) and \(i \in [n]\) as described above. This mechanism is essentially the approxIP algorithm described in Section 4.5.1 with noise added in the exponent to guarantee differential privacy.
\(\mathsf{DP}\mbox{-}\mathsf{approxIP}(\boldsymbol{w}_1, \boldsymbol{w}_2)\):
Choose a \(k \times n\) matrix \(\boldsymbol{M}\) such that each entry \(M_{i,j}\) is chosen independently from a Gaussian distribution with mean \(0\) and variance \(1\). The dimension \(k\) (with \(k \ll n\)) is determined according to Lemma 28.
Choose a value \(x\) from the Laplace distribution \({\mathsf Lap}(1/(\Delta u \cdot |D|))\).
Output \(e^x \cdot \langle \boldsymbol{M}\boldsymbol{w}_1, \boldsymbol{M}\boldsymbol{w}_2 \rangle\).
In contrast to Section 4.5, where \(\boldsymbol{M}\) was sampled inside the FHE, here the matrix \(\boldsymbol{M}\) can be publicly sampled (e.g., through a commonly chosen random PRG seed), since DP is achieved by adding Laplace noise.
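A minimal cleartext sketch of \(\mathsf{DP}\mbox{-}\mathsf{approxIP}\) follows; we keep the \(1/k\) scaling from \(\mathsf{approxIP}\) so that the estimate is unbiased, and the function and parameter names are illustrative.

```python
import numpy as np

def dp_approx_ip(w1, w2, k: int, delta_u: float, d_size: int, public_seed: int = 0) -> float:
    """Output e^x * (1/k) * <M w1, M w2> with x ~ Lap(1/(delta_u * |D|)); M from a public seed."""
    M = np.random.default_rng(public_seed).normal(0.0, 1.0, size=(k, len(w1)))
    est = float(np.dot(M @ w1, M @ w2)) / k
    x = np.random.default_rng().laplace(scale=1.0 / (delta_u * d_size))
    return float(np.exp(x)) * est

# Example: uniform weight vector; true inner product is 1e-3.
w = np.ones(1000) / 1000
print(dp_approx_ip(w, w, k=200, delta_u=1e-3, d_size=10**6))
```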
We say that two datasets \(D\) and \(D'\) are neighboring if they differ in exactly one row.
Recall that, as in the application described in Section 4.6.1, the parties’ inputs to the product sampling are of the form \(\boldsymbol{w}_b\) where \({w}_{b,i} \propto e^{-\epsilon \cdot \frac{f^{c_i}_{\mathsf{loss}}(D_b)}{2 \Delta u}}\) for some additive loss functions \(f^{c_i}_\mathsf{loss}\).
Theorem 31. If \(|D| \ge 40 \cdot (\log n + \kappa)/\epsilon\), the mechanism \(\mathsf{DP}\mbox{-}\mathsf{approxIP}\) is \(\epsilon\)-differentially private w.r.t. database \(D_1\) (resp., \(D_2\)) when the loss functions \(f^{c_i}_{\mathsf{loss}}(D)\) have low sensitivity \(1/|D|\).
Proof. We prove differential privacy with respect to \(D_1\) (i.e., the adversary knows \(D_2\) and the differential privacy guarantee holds relative to rows of \(D_1\)). The reverse case is analogous.
Note that the change of a single row in \(D_1\) causes the values \({v}_{1,i}\) (for \(i \in [n]\)) and the value \(\|\boldsymbol{v}_1\|_1\) to change by at most a multiplicative factor of \(\alpha := e^{\pm \epsilon / (2 \Delta u \cdot |D|)}\). Thus, letting \(\boldsymbol{M}_\ell\) be the \(\ell\)-th row of the JLT matrix (which is fixed and public), for each \(\ell \in [k]\), the value \(\langle \boldsymbol{M}_\ell, \boldsymbol{w}_b \rangle\) can change by at most a multiplicative factor of \(\alpha^2\). Therefore, the dot product \(\langle \boldsymbol{M} \boldsymbol{w}_1, \boldsymbol{M} \boldsymbol{w}_2\rangle\) can change by at most a multiplicative factor of \(\alpha^2\).
Ultimately, this means that we have additive sensitivity of \(\epsilon/ (\Delta u \cdot |D|)\) in the exponent. To achieve differential privacy, we thus need to multiply the dot product estimate by \(e^x\), where \(x\) is drawn from \(\mathsf{Lap}(1/ (\Delta u \cdot |D|))\). ◻
We briefly analyze the estimation error due to the added noise. Since \(x\) is drawn from \(\mathsf{Lap}(1/(\Delta u \cdot |D|))\), the probability that \(|x| \geq \epsilon\) is at most \(e^{-\epsilon \cdot \Delta u \cdot |D|}\). When \(|D| \ge \kappa/(\epsilon \cdot \Delta u)\), this probability is at most \(e^{-\kappa}\), which is negligible.
In other words, when \(D\) is a sufficiently large dataset, with overwhelming probability, the incurred multiplicative error is \(e^{|x|} \le e^{\epsilon} < 1+2\epsilon\). Thus, differential privacy adds at most a \(1 \pm 2\epsilon\) multiplicative error on top of the error of the approximation algorithm.
We described the inner product approximation protocol as being run by a trusted curator. As is standard, we can replace this curator with a secure 2-PC evaluating the mechanism to achieve computational DP.
We now have all the necessary pieces to instantiate a sublinear communication protocol to evaluate the exponential mechanism for a database \(D\) held jointly by two parties.
We now describe the distributed exponential mechanism where \(P_b\) has input \(D_b\) and the loss functions have low sensitivity. This protocol is in the \(\mathcal{F}_{\mathsf{osample(L_1)}}\)-hybrid model.
Inputs: Party \(P_b\) has input \(D_b\)
\(P_b\) computes \(\boldsymbol{v}_b = (v_{b, 1}, \ldots, v_{b,n})\) such that \({v}_{b,i} = e^{-\epsilon \cdot \frac{f^{c_i}_{\mathsf{loss}}(D_b)}{2 \Delta u}}\), where \(\Delta u = \frac{\epsilon}{40(\log n + \kappa)}\). Let \(\boldsymbol{w}_{b}=\frac{\boldsymbol{v}_{b}}{||\boldsymbol{v}_b||_1}\).
Invoke the \(\mathcal{F}_{2PC}\) ideal functionality to evaluate \(\eta = {\mathsf{DP}\mbox{-}\mathsf{approxIP}}(\boldsymbol{w}_1, \boldsymbol{w}_2)\). Note that \(\eta\) is a \((1\pm 2\epsilon)\)-approximation of the inner product.
The parties execute the following steps \(m=\frac{(1+2\epsilon) \cdot \omega(\log{\kappa})}{\eta}\) times in parallel.
Invoke the \(\mathcal{F}_{\mathsf{osample(L_1)}}\) ideal functionality with \(P_1\) as the sender with input \(\boldsymbol{w}_1\) and \(P_2\) as the receiver. Let \(i_{1,1}^j\) and \(i_{1,2}^j\) be the output of the \(j\)th execution to \(P_1\) and \(P_2\) respectively.
Invoke the \(\mathcal{F}_{\mathsf{osample(L_1)}}\) ideal functionality with \(P_2\) as the sender with input \(\boldsymbol{w}_2\) and \(P_1\) as the receiver. Let \(i_{2,1}^j\) and \(i_{2,2}^j\) be the output of the \(j\)th execution to \(P_1\) and \(P_2\) respectively.
Invoke the \(\mathcal{F}_{2PC}\) ideal functionality for the following circuit:
Input: \((i_{1,1}^j, i_{2,1}^j, i_{1,2}^j, i_{2,2}^j)\) for \(j = 1, \ldots, m\).
Let \(i_1^j = i_{1,1}^j \oplus i_{1,2}^j\), \(i_2^j = i_{2,1}^j \oplus i_{2,2}^j\).
Find the smallest \(j\) such that \(i_1^j\) equals \(i_2^j\), and output \(i_1^j\) to both \(P_1\) and \(P_2\). If no such \(j\) exists, output \(\mathsf{abort}\).
Output: Both parties output the sampled index \(i\) or \(\mathsf{abort}\).
Protocol : Exponential Mechanism Protocol (\(\Pi_{\mathsf{EM}}\)) in the \(\{\mathcal{F}_{2PC}, \mathcal{F}_{\mathsf{osample(L_1)}}\}\)-hybrid model
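The circuit evaluated under \(\mathcal{F}_{2PC}\) in the last step admits a direct plaintext rendering (a sketch; the function name is illustrative):

```python
from typing import Optional, Sequence, Tuple

def combine_trials(shares: Sequence[Tuple[int, int, int, int]]) -> Optional[int]:
    """shares[j] = (i_{1,1}^j, i_{2,1}^j, i_{1,2}^j, i_{2,2}^j);
    return the first trial where the reconstructed indices collide, else None (abort)."""
    for i11, i21, i12, i22 in shares:
        i1 = i11 ^ i12  # reconstruct P1's sampled index from its XOR shares
        i2 = i21 ^ i22  # reconstruct P2's sampled index from its XOR shares
        if i1 == i2:
            return i1
    return None  # abort

# Trial 1: 5^3 = 6 vs 2^1 = 3 (no match); trial 2: 7^2 = 5 vs 4^1 = 5 (match) -> outputs 5.
print(combine_trials([(5, 2, 3, 1), (7, 4, 2, 1)]))
```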
We will prove the following theorem.
Theorem 32. If \(|D| \ge 40 \cdot \kappa(\log n + \kappa)/\epsilon^2\), protocol \(\Pi_{\mathsf{EM}}\) is \((2\epsilon,\mathsf{negl}(\kappa))\)-DP.
It is easy to see that this protocol runs sufficiently many product-sampling instances in parallel so that it does not output \(\mathsf{abort}\), except with negligible probability (see Section 4.5.2). Therefore, for the proof, we assume that the protocol does not output \(\mathsf{abort}\).
Proof. We first consider the case where \(P_1\) is corrupted by the adversary \(A\). Let \(\mathsf{view}_A(D)\) be \(A\)'s view of the protocol (consisting of its input, output, and transcript) in the \(\{\mathcal{F}_{2PC}, \mathcal{F}_{\mathsf{osample(L_1)}}\}\)-hybrid model. Across the invocations of \(\mathcal{F}_{\mathsf{osample(L_1)}}\), the outputs \(\{i_{1,1}^j, i_{2,1}^j\}_j\) sent to \(A\) by the ideal functionality are uniformly distributed and independent of \(D\). Thus, WLOG, we assume that \(A\)'s view consists of its input, the transcript \(\eta\), and the output \(i\). Let \(\mathsf{view}^{\mathsf{trans}}_A(D)\) (resp., \(\mathsf{view}^{\mathsf{out}}_A(D)\)) be the transcript (resp., output) contained in \(A\)'s view during a random execution of the protocol with input \(D\).
For all neighboring databases \(D\) and \(D'\), and for all \(\eta\) and \(i\), we have: \[\begin{aligned} && \Pr[\mathsf{view}^{\mathsf{trans}}_A(D) = \eta, \mathsf{view}^{\mathsf{out}}_A(D) = i] \\ && = \Pr[\mathsf{view}^{\mathsf{trans}}_A(D) = \eta] \cdot \Pr[\mathsf{view}^{\mathsf{out}}_A(D) = i | \eta] \\ && \le e^{\epsilon} \Pr[\mathsf{view}^{\mathsf{trans}}_A(D') = \eta] \cdot \Pr[\mathsf{view}^{\mathsf{out}}_A(D) = i | \eta] \\ && \le e^{\epsilon} \Pr[\mathsf{view}^{\mathsf{trans}}_A(D') = \eta] \cdot e^{\epsilon} \Pr[\mathsf{view}^{\mathsf{out}}_A(D') = i | \eta] \\ && = e^{2\epsilon} \Pr[\mathsf{view}^{\mathsf{trans}}_A(D') = \eta, \mathsf{view}^{\mathsf{out}}_A(D') = i]. \end{aligned}\]
The first inequality holds from the DP of \(\mathsf{DP}\mbox{-}\mathsf{approxIP}\), and the second inequality holds from the DP of the exponential mechanism.
The case when \(P_2\) is corrupted can be proved similarly. ◻
The min-hash sketch is a simple and well-known technique to produce an unbiased estimate of the Jaccard index [132], [133]. The Jaccard index [134] is a similarity measure between two sets \(A\) and \(B\), denoted \(J(A, B)\), defined as the number of elements in the intersection of \(A\) and \(B\) divided by the number of elements in their union. That is, \(J(A,B)=\frac{|A \cap B|}{|A \cup B|}\). The Jaccard index has seen wide application in the clustering of websites and documents [132], [135], community identification [136], DNA matching [63], and machine learning [137], [138].
Computing the Jaccard index exactly, especially when the input sets are large, can be costly. The min-hash sketch allows communication-efficient approximation [132]. The basic idea behind the min-hash sketch is to apply a random hash function \(h\) to both sets \(A\) and \(B\) and then compare the minimum hashes (denoted \(\min h(A)\), \(\min h(B)\)) in both sets. If \(\min h(A) = \min h(B)\), it means that an element in \(A \cap B\) has been hashed to the minimum value among elements in \(A \cup B\). This occurs with probability \(J(A,B)\). Thus, to get an unbiased approximation of the Jaccard index, it suffices to repeat this procedure with sufficiently many random hashes.
Due to its simplicity and efficiency, the min-hash sketch has become a popular tool to approximate the Jaccard index. Moreover, since the min-hash sketch only needs to compare the minimum hashes, it has been a key building block when maintaining privacy of the input sets is important, e.g., if the input sets represent fingerprints, DNA, or medical records.
There are two classes of solutions for privacy-preserving min-hash. The first class of solutions (e.g. [63], [64], [65], [139]) considers how to compute the min-hash and Jaccard index in a two-party setting, where the parties do not trust each other with their private inputs. The goal of these works is to design secure two-party computation protocols for computing the min-hash sketch as efficiently as possible, but they generally do not consider the privacy implications of revealing the output. The second line of work (e.g. [66], [67], [68]) considers how to make the min-hash approximation privacy-preserving by adding noise to the local min-hash sketches.
These works serve as the starting point for our study. In particular, we first present a protocol that addresses the privacy of min-hash-based approximations from two perspectives. Similar to the first class of solutions, our protocol ensures that no private information about the input is revealed beyond the Jaccard index. Additionally, in line with the second class of solutions, our protocol guarantees that even the Jaccard index output satisfies differential privacy, which is achieved through adding a small amount of noise. Next, we explore whether any variant of differential privacy can be achieved without adding noise to the protocol, which would improve its accuracy. Interestingly, we demonstrate that under specific constraints on the inputs, the resulting protocol still provides a certain level of privacy guarantees.
More formally, we define three ideal functionalities to capture flavors of min-hash. \({\mathcal{F}_\mathsf{minH}}\) computes the min-hash and then outputs both the min-hash count and the random hashes used. On the other hand, \({\mathcal{F}_\mathsf{privH}}\) computes the min-hash functionality and outputs only the min-hash count. This corresponds to a setting where the min-hash is computed by a trusted curator who does not disclose the hashes used. Finally, we define \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) which adds noise to the min-hash count computed by \({\mathcal{F}_\mathsf{minH}}\). For our first result we show that for appropriate noise levels, the \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) functionality achieves both high accuracy and differential privacy, and design a secure two-party computation of \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) that is both computation and communication-efficient. For our second result, we consider a setting in which the outputs of \({\mathcal{F}_\mathsf{minH}}\) or \({\mathcal{F}_\mathsf{privH}}\) have already been released without added noise and show that, under certain conditions on the inputs, this setting also provides privacy guarantees for individuals’ inputs.
To build a protocol for differentially-private min-hash, we observe that the min-hash count has low global sensitivity. This allows us to define a functionality \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\), parameterized by \((\epsilon, \delta)\), which adds (properly tuned) Laplace noise to the output of the min-hash (see Figure [prot:noisy] for details). We then prove the following theorem about this functionality.
Theorem 33 (Informal). \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) is \((\epsilon, \delta)\)-DP against an adversary corrupting either party.
To realize a protocol for DP estimation of the Jaccard index, we now just need to instantiate this functionality. We show how this can be done efficiently using a PSI-CA functionality in Section 5.4. In Section 5.9, we evaluate the performance when instantiating the PSI-CA protocol [140], [141] in the semi-honest setting. The resulting protocol has better accuracy compared to the prior work [48], [66]. We recommend this protocol to compute differentially-private estimates of the Jaccard index.
While ideally, parties should follow recommendations to add noise to the output of the min-hash count before releasing it (as in functionality \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\)), in practice, this may not happen. Further, there may be historical counts that have already been released without added noise. We refer to settings in which such output is released as “output leakage.”
We ask whether any privacy for an individual can be salvaged in this case. Somewhat surprisingly, we show that under certain conditions on the inputs to \({\mathcal{F}_\mathsf{minH}}\) or \({\mathcal{F}_\mathsf{privH}}\), the error of the min-hash approximation itself is sufficient to achieve (variants of) differential privacy, meaning that the presence of an individual element in one of the two input sets cannot be inferred given the output of \({\mathcal{F}_\mathsf{minH}}\) or \({\mathcal{F}_\mathsf{privH}}\). Essentially, the error of the sketch acts as noise to protect the privacy of the inputs. Similar observations that sketching algorithms inherently preserve privacy under certain input restrictions have previously been made for the Johnson-Lindenstrauss sketch [44], the LogLog sketch [45], [46], and other sketches [47].
We first consider the simpler case of the privacy of an individual once the output of \({\mathcal{F}_\mathsf{privH}}\) has been released. Recall that in this setting a set of private hashes is chosen by the functionality and these hashes are not returned as output of the functionality. Standard differential privacy in this setting requires that conditioned on knowledge of \(A\) and all but one element of \(B\) (denoted by \(x^*\)), the probability that the functionality outputs any value \(\mathsf{out}\) when \(x^* \in B\) versus when \(x^* \notin B\) differs by a factor of at most \(e^\epsilon\) with all but negligible probability.
We note that min-hash is not differentially private in this setting if \(A \cap B\) is either too large or too small. For example, if \(|A \cap B|=0\) when \(x^* \notin B\) and 1 when \(x^* \in B\), then min-hash always outputs 0 in the first case and outputs a count \(\ge 1\) with noticeable probability in the second. We prove the following theorem showing that when this is not the case the min-hash output is differentially private:
Theorem 34 (Informal). If the size of the intersection is a constant fraction of the size of \(A\) and \(B\), then the output of \({\mathcal{F}_\mathsf{privH}}\) is \((\epsilon,\delta)\)-DP for negligible \(\delta\).
We stress that this theorem crucially relies on the fact that the parties, and the adversary, do not have any information about the chosen hashes, and cannot learn the evaluation of the hashes on their own inputs. Note that for this theorem to be useful in a two-party protocol, the parties must compute the hashes under a 2-PC or FHE. This is unlikely to be done in practice. Thus, typically, the parties will locally store the hashes during the computation. To understand the privacy of this approach, we consider the case of the \({\mathcal{F}_\mathsf{minH}}\) functionality where the output leakage includes the hash functions as well as the counts.
Unfortunately, in this case there is a problem when trying to argue privacy. In the standard DP setting, we assume that the adversary knows all of the inputs (in this case, all entries in both sets \(A\) and \(B\)) except for some input \(x^*\) and wants to determine, from the output of the computation, whether \(x^*\) was in the other party’s set. If the hashes are known, then the output of min-hash is deterministic: The adversary can exactly reconstruct the min-hash execution for the case when \(x^*\) is in the set and when it is not, and then see which of these matches the output it received. Since the min-hash protocol provides a good approximation of the Jaccard index, the adversary will be able to exactly determine whether or not \(x^* \in B\) with noticeable probability.
Note that the above attack works only if the adversary knows the entirety of both sets \(A, B\) and just tries to distinguish whether \(x^* \in B\) or not. Realistically, especially when the inputs are large, the adversary would not know the entire input of the honest party. More precisely, we assume that given the adversary’s set (and even the intersection between the two sets), the honest party’s set still has sufficiently high min-entropy. With this assumption, we turn to the tool of distributional DP (DDP) [142] which allows us to analyze differential privacy when the distribution of inputs has sufficient uncertainty.
We begin with a relatively strong assumption on the amount of uncertainty the adversary has about the honest set. Specifically, we assume that every element that is not in the intersection is highly unpredictable (i.e., has a high amount of min-entropy), even conditioned on all the other set elements. Under this assumption, we prove the following theorem:
Theorem 35 (Informal). If each non-intersecting item has sufficiently high min-entropy, revealing the hash functions together with the min-hash counts (as in the \({\mathcal{F}_\mathsf{minH}}\) functionality) preserves \((\epsilon,\delta)\)-DDP for negligible \(\delta\), as long as the size of the intersection is a constant fraction of the size of \(A\) and \(B\).
Not surprisingly, the proof of this theorem (given in Section 5.6) leverages the fact that when each element has individually high min-entropy, hashing each element acts as a strong randomness extractor, thus resulting in sufficient random noise for privacy.
However, this assumption that every item has high min-entropy is quite strong. For example, consider the setting where each item in \(B\) is chosen from a polynomial-size universe. In this case, while individual items cannot have much min-entropy, the honest party’s set may still collectively have high min-entropy as long as it is large enough. Thus, for our third result, we analyze what happens under this weaker assumption that only the full honest set, instead of each individual item, has high min-entropy.
Note that in this case, we cannot apply the hash function as randomness extractor technique. This is because in order to guarantee that the randomness extractor yields output that is negligibly close to uniform, we must lose superlogarithmic in \(n\) bits of entropy from each input. However, in the case we are currently considering, each element has at most \(O(\log n)\) bits of min-entropy. Further, we in fact have no guarantee that each element has individually high min-entropy (since the elements are not necessarily independent), but only that the total min-entropy of the non-intersection items is high. Nevertheless, we show \({\mathcal{F}_\mathsf{minH}}\) still achieves DDP, by proving a new strong chain rule for min-entropy (see Section 5.8.5).
Specifically, we consider the following class of distributions \(\mathcal{C}\) over secret sets \(R\) of size \(n\):
Let \(\mathcal{U}\) be a universe of polynomial size \(n \cdot \ell\), where \(\ell = \Omega(n^3)\).
\(R\) is chosen uniformly from all subsets of \(\mathcal{U}\) of size \(n\).
In general, to relax the uniformity above, we additionally allow arbitrary leakage \(L = L(R)\) computed on \(R\), such that the length of the leakage \(L\) is at most \(|L| \leq c \cdot n \log \ell\), for a fixed constant \(c \in (0,1)\).
We consider the resulting conditional distribution \(\mathcal{D}\) on \(R\) given leakage \(L\).
Theorem 36 (Informal). Assume the set \(R\) is drawn from a distribution \(\mathcal{D} \in \mathcal{C}\). Then the min-hash protocol in the random oracle model (corresponding to functionality \({\mathcal{F}_\mathsf{minH}}\)) preserves \((\epsilon,\delta)\)-DDP for negligible \(\delta\), as long as the size of the intersection is a constant fraction of the size of \(A\) and \(B\).
Consider a distribution over sets of \(n\) elements \(R = R_1, \ldots, R_n\), where each \(R_i\) is chosen from a universe of size \(\ell \in \Omega(n)\). Note that the set \(R\) can have min-entropy \(\Omega(n \lg(\ell))\) while it can still be possible that for every \(i\), the marginal distribution over \(R_i\) has only constant min-entropy (see Example 1.1 in [143]). To deal with such situations, Skórski [144] proves a theorem showing the existence of “spoiling bits.” Namely, given \(R_1, \ldots, R_n\), some additional information known as spoiling bits can be released such that, conditioned on this information, for each \(i \in [n]\), the distribution of \(R_i\) conditioned on \(R_{<i}\), where \(R_{<i}\) denotes \((R_1, \ldots, R_{i-1})\), is nearly flat (in the sense that the min/max entropy gap is at most a small additive constant). Further, the total number of spoiling bits that are released is small.
It is not hard to use Skórski's result to show that if \(R\) starts out with sufficiently high min-entropy, then for a large fraction of indices \(i\) (those in the set \(V \subseteq [n]\)), the distribution of \(R_i\) conditioned on \(R_{<i}\) has high min-entropy of at least \(\Omega(\log(n))\), while the remaining indices (those in the set \(W = [n]\setminus V\)) may have low min-entropy.
Unfortunately, this result is very brittle in the sense that the flatness conditions hold only for this particular distribution of \(R\) conditioned on the spoiled bits. Specifically, despite the flatness condition being satisfied for this distribution, the random variables \(R_i\) are not independent of one another. Thus, if additional information is leaked on \(R_j\) after the spoiling bits are computed, then the flatness guarantees may no longer hold for \(R_i\).
In our setting, we require additional leakage \(\{\ell_i\}_{i \in W}\) on the elements \(\{R_i\}_{i \in W}\). One issue is that the set \(W\) (i.e., low min-entropy elements conditioned on the spoiling bits) is only known after the spoiling bits are computed. This leaves us with a dilemma:
Leaking \(\{\ell_i\}_{i \in W}\) additionally after the spoiling leakage can destroy the flatness property.
On the other hand, we cannot leak \(\{\ell_i\}_{i \in W}\) before computing the spoiling bits, since we don’t know the set \(W\) yet! We could leak from all the blocks \((R_1, \ldots, R_n)\), but this may deplete the entropy needed from the random variables \(\{R_i\}_{i \in V}\).
To solve this problem, we prove a new variant of the spoiling lemma that computes the spoiling bits at the same time as the additional leakage \(\ell_i\) for \(i \in W\) is computed so that the spoiling bits also contain \(\{\ell_i\}_{i \in W}\), while still maintaining the flatness condition. The types of leakage that can be captured are essentially those such that the leakage \(\ell_i\) for \(i \in W\) can be expressed as a function of \(R_i\) and the leakages \(\{\ell_j: j > i, j \in W\}\). It turns out that the leakage we need for our result has this form.
We state our theorem in general terms as we believe it may find further applications in leakage resilient cryptography. For the formal theorem statement see Theorem 58.
One known weakness of the DDP definition is the lack of a general composition theorem [142]. However, for the specific setting of our min-hash protocols, we can leverage the small output of min-hash to argue composition properties after leakage of several outputs. Specifically, suppose that the adversary executes a min-hash protocol with \((\epsilon, \delta)\)-DDP security twice with the same honest party's input both times. Since each min-hash protocol outputs a single number between \(0\) and \(k\) (i.e., \(\lg k\) bits long), when we apply Theorem 36, the leakage profile increases to a total of at most \(L + 2 \cdot \lg k\) bits. However, according to Theorem 36, as long as \(|L| + 2 \lg k \le c \cdot n \lg \ell\), each protocol execution will preserve DDP, and therefore the composition of the two protocol executions will preserve \((2 \epsilon, 2\delta)\)-DDP. In general, assuming that the initial leakage \(|L|\) is a small constant, this type of DDP composition will hold for \(O(n \cdot \frac{\lg \ell}{ \lg k })\) executions.
We note that an alternative approach to get a differentially-private estimate of the Jaccard index is via mergeable cardinality estimation sketches (e.g. [48]) to compute (an approximation of) the set intersection cardinality and use this via the inclusion-exclusion principle to compute the Jaccard index. We give a detailed comparison of error from our protocol vs. the best known cardinality estimator [48] in Section 5.9.
Differential privacy protects the privacy of individuals by limiting an adversary’s ability to learn information about an individual input from the output of a computation [1], [2]. For a good overview of differential privacy and many of the algorithms to achieve it, both in the standard curator setting and in distributed settings, we refer the reader to the book by Dwork and Roth [3].
Another direction of work has considered how to use DP to reduce the cost of secure computation, especially when DP-style guarantees suffice for the final output. [69] first proposed such an optimization for the problem of secure summation. [70] and [71] applied the differential privacy relaxation to improve the efficiency of set-intersection protocols. [72] and [145] consider graph-parallel computations and design more efficient solutions with differentially private leakage. [146] considers classic tasks like sorting, merging, and range-query data structures under a differential privacy relaxation. [73] considers a multiparty shuffle that allows differentially private leakage and shows that it suffices to achieve end-to-end differential privacy in the shuffle model of DP.
Sketching algorithms, or "sketches," are sublinear-space algorithms for approximating certain properties of large inputs or data streams. The main idea behind sketching algorithms is to generate a compact summary data structure that allows for efficient storage, merging, and processing.
Some recent works [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57] have additionally observed that sketches can often aid in achieving privacy, as the inherent loss of information in the sketch can make the sketch itself differentially private or require only a small amount of additional noise.
A line of research pertinent to our work involves constructing private sketches for set cardinality estimations [58], [59], [60], [61], [62]. Recently, [48] proposed a private mergeable sketch that can be used to estimate the size of the intersection and union of sets.
Secure approximation studies what functions can be securely approximated without revealing anything beyond the true output [74], [75]. While this notion is quite different from that of differentially private approximation that we consider here, we note that our FHE-based protocol described in Section 5.5 additionally achieves this.
Property-preserving hash (PPH) functions allow compressing large input \(x\) into a short digest \(h(x)\) such that some property \(P(x,y)\) can be computed given only \(h(x)\) and \(h(y)\). Adversarially-robust PPH [76], [77], [78], [79] aim to further guarantee that \(P(x,y)\) is correctly computed (i.e., robust) even if the inputs \(x\) and \(y\) are chosen after the hash function \(h\) is fixed. A related concept of robust sketching, e.g. [80], [81] aims to construct sketches that provide good approximations even when inputs are chosen after the randomness of the sketch is fixed.
Both of these approaches are similar to our work in that they also study the consequences of making the choice of hash (or sketch) known to the adversary. However, these works focus on robustness to adversarial inputs, while we instead focus on the privacy of the output when the adversary additionally sees the hash functions.
DP min-hash aims to make the min-hash approximation differentially private by adopting standard DP mechanisms, such as adding DP noise to the output, to hide individual items in the input sets. In particular, [66] achieves local DP (LDP) min-hash by either adding Laplace noise or using generalized randomized response to perturb the min-hash vectors. There are also earlier efforts. For example, [67] attempts to use a flawed exponential mechanism to achieve DP, leading to a faulty claim of \(\epsilon\)-DP, as pointed out in [66]. [68] correctly applies the exponential mechanism; however, this results in a large amount of noise being added to the results.
A function \(g\) is negligible, denoted \(\mathsf{negl}(\cdot)\), if for every positive integer \(c\), there is an integer \(n_c\) such that for all \(n \ge n_c\) we have \(g(n) \le 1/n^c\). Let \(\kappa\) denote the security parameter.
We model each hash function as a random oracle that maps each item to a real value in \([0, 1]\), and the output of the hash function is long enough to ensure that the probability of any two different items having a hash collision is negligible.
Let \(\mathcal{U}\) denote the universe of input elements. In this paper, we will consider two input sets \(A,B \subseteq \mathcal{U}\). Let \(n_A = |A|, n_B = |B|\). Let \(I = A \cap B\), \(n_{I} = |I|\). We will also let \(B_{+x^*} = B \cup \{x^*\}\).
Let \({\mathsf Eq}\) be an equality function; i.e., \({\mathsf Eq}(a, b) = 1\) if \(a=b\) and \(0\) otherwise. For a hash function \(h\) and a set \(A\), we let \(h(A) := \{h(a): a \in A\}.\) Let \(\mathsf{B}(m,p)\) be the binomial distribution with \(m\) trials and each trial having success probability \(p\).
We describe the basic min-hash functionality in Figure [prot:pb]. In this work, we will consider several variants and consider privacy implications.
Protocol : The Basic Min-Hash Functionality \({\mathcal{F}_\mathsf{minH}}\)
The functionality is parameterized with a random oracle \(\mathcal{O}\).
Input: \(P_1\) and \(P_2\)’s input vectors \(A = (x^A_1, \ldots, x^A_{n_A})\) and \(B = (x^B_1, \ldots, x^B_{n_B})\).
Minhash:
Randomly sample prefix \(\mathsf{pre}\), which is used to define hash functions \(h_1,h_2,\dots,h_k\), where for \(i \in [k]\), \(h_i(\cdot) := \mathcal{O}(\mathsf{pre} || i || \cdot)\).
For input \(A\), compute the min-hash vector \((u^A_1,u^A_2,\dots,u^A_k)\) as follows:
For each iteration \(j \in [k]\):
For each item \(x^A_i \in A\), compute \(y^A_{i,j} = h_j(x^A_i)\).
Compute the min-hash for iteration \(j\); that is, \(u^A_j = \min \{ y^A_{i,j} : i \in [n_A] \}.\)
Likewise, compute another min-hash vector \((u^B_1,u^B_2,\dots,u^B_k)\) for input \(B\) similarly.
Compute \(c = \sum_{j=1}^k {\mathsf Eq}(u^A_j, u^B_j).\)
Output: Return \((\mathsf{pre}, c)\) to \(P_1\) and \(P_2\).
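For concreteness, the following Python sketch renders \({\mathcal{F}_\mathsf{minH}}\) as a trusted curator, with SHA-256 standing in for the random oracle; all names and the choice \(k = 256\) are illustrative.

```python
import hashlib, secrets

def h(pre: bytes, i: int, item: str) -> float:
    """h_i(item) = O(pre || i || item), mapped to a value in [0, 1)."""
    d = hashlib.sha256(pre + i.to_bytes(4, "big") + item.encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def f_minh(A, B, k: int = 256):
    pre = secrets.token_bytes(16)  # random prefix defining h_1, ..., h_k
    uA = [min(h(pre, j, x) for x in A) for j in range(k)]
    uB = [min(h(pre, j, x) for x in B) for j in range(k)]
    c = sum(a == b for a, b in zip(uA, uB))  # c = sum_j Eq(u^A_j, u^B_j)
    return pre, c

A = {f"item{i}" for i in range(100)}
B = {f"item{i}" for i in range(50, 150)}
_, c = f_minh(A, B)
print("estimate c/k:", c / 256, "true Jaccard:", len(A & B) / len(A | B))  # both near 1/3
```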
The definitions of \((\epsilon,\delta)\)-differential privacy are given below.
Definition 37 (\((\epsilon,\delta)\)-indistinguishability). Two random variables \(X\) and \(Y\) are \((\epsilon,\delta)\)-indistinguishable (denoted as \(X \approx_{\epsilon,\delta}Y\)) if, for all events \(S\), we have \[\Pr[X \in S] \leq e^\epsilon\cdot \Pr[Y \in S] + \delta, ~~ \Pr[Y \in S] \leq e^\epsilon\cdot \Pr[X \in S] + \delta.\]
Definition 38 (Computational \((\epsilon,\delta)\)-indistinguishability). Two random variables \(X\) and \(Y\) are computationally \((\epsilon,\delta)\)-indistinguishable (denoted as \(X \stackrel{c}{\approx}_{\epsilon,\delta}Y\)) if, for any polynomial-time adversary \(\mathcal{A}\), it holds that \[\Pr[\mathcal{A}(X) = 1] \leq e^\epsilon\cdot \Pr[\mathcal{A}(Y) = 1] + \delta, ~~ \Pr[\mathcal{A}(Y) = 1] \leq e^\epsilon\cdot \Pr[\mathcal{A}(X) = 1] + \delta.\]
Definition 39 ((Computational) \((\epsilon,\delta)\)-differential privacy). Let \(X\) be an input space and \(\simeq_X\) be a relation capturing the notion of neighboring inputs. Let \(\mathcal{M}: X \rightarrow Z\) be a randomized algorithm that takes input \(x\in X\) and outputs a value over \(Z\). We say that the mechanism \(\mathcal{M}\) is \((\epsilon, \delta)\)-differentially private if the following holds: \[\forall x, x' \in X \mbox{ s.t. }x \simeq_Xx': ~ \mathcal{M}(x) \approx_{\epsilon,\delta}\mathcal{M}(x').\] The mechanism \(\mathcal{M}\) is \((\epsilon, \delta)\)-computationally differentially private if \(\forall x, x' \in X \mbox{ s.t. }x \simeq_Xx': ~ \mathcal{M}(x) \stackrel{c}{\approx}_{\epsilon,\delta}\mathcal{M}(x').\)
Definition 40. The global sensitivity of a function \(f:\mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k\) is: \[\Delta f = \max_{X, Y \in \mathbb{N}^{|\mathcal{X}|}, \Vert X - Y \Vert_1 = 1} \Vert f(X) - f(Y) \Vert_1\]
Definition 41. The Laplace Distribution (centered at 0) with scale \(b\) is the distribution with probability density function: \(Lap(x|b) = \frac 1 {2b} e^{-|x|/b}.\)
We will write \(Lap(b)\) to denote the Laplace distribution with scale \(b\). Given any function \(f:\mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k\), the Laplace mechanism adds noise drawn from the Laplace distribution; that is, given an input database \(X\), the mechanism outputs \(f(X) + (Y_1, \ldots , Y_k),\) where the \(Y_i\) are i.i.d. random variables drawn from \(Lap(\Delta f/\epsilon)\). It is known that the Laplace mechanism achieves \((\epsilon, 0)\)-differential privacy ([3], Theorem 3.6).
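As a brief illustration of the Laplace mechanism on a counting query of sensitivity \(1\) (a sketch; the parameters are illustrative):

```python
import numpy as np

def laplace_mechanism(fx: np.ndarray, sensitivity: float, eps: float) -> np.ndarray:
    """Release f(X) + (Y_1, ..., Y_k) with Y_i i.i.d. Lap(Delta_f / eps)."""
    return fx + np.random.default_rng().laplace(scale=sensitivity / eps, size=fx.shape)

# Example: a counting query (global sensitivity 1) released with eps = 0.5.
print(laplace_mechanism(np.array([42.0]), sensitivity=1.0, eps=0.5))
```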
We adapt the original definition [142] for our purpose to consider a two-party protocol that takes sets as input more explicitly. Specifically, we consider a computational indistinguishability variant for our DDP definition.
Definition 42 (View of a party in a two-party functionality). Given a two-party functionality \(\mathcal{F}\) with parties \(P_1\) and \(P_2\), let \(\mathsf{view}_{P_1}^\mathcal{F}(A, B)\) denote the view of \(P_1\) for the execution of functionality \(\mathcal{F}\) with \(A\) and \(B\) being the input of \(P_1\) and \(P_2\) respectively. In particular, \(\mathsf{view}_{P_1}^\mathcal{F}(A, B)\) consists of the following (the view of \(P_2\) is defined similarly):
The input \(A\) of \(P_1\), the private random coins of \(P_1\), and the output of the functionality.
If the functionality is in the random oracle model, we allow a semi-honest \(P_1\) to make a polynomial number of arbitrary queries to the random oracle and to add the input/output information to its view.
Definition 43 (DP and DDP of a two-party functionality). A two-party functionality \(\mathcal{F}\) is (computationally) \((\epsilon,\delta)\)-DP against an adversary corrupting \(P_1\), if for every \((A, B)\) and every \(x^* \in \mathcal{U}\), it holds that \(\mathsf{view}^\mathcal{F}_{P_1}(A, B)\) is (computationally) \((\epsilon,\delta)\)-indistinguishable from \(\mathsf{view}^\mathcal{F}_{P_1}(A, B_{+x^*})\).
Let \(\mathcal{X}\) denote a random variable for two sets over universe \(\mathcal{U}\). Let \(\mathcal{Z}\) denote the random variable measuring the additional auxiliary information known to the adversary. A two-party functionality \(\mathcal{F}\) is (computationally) \((\epsilon,\delta,\Delta)\)-DDP against an adversary corrupting \(P_1\), if for every distribution \({\cal D}\in \Delta\) on \((\mathcal{X},\mathcal{Z})\), every \((X = (A, B), Z)\) in the support of \((\mathcal{X}, \mathcal{Z})\), and every \(x^* \in \mathcal{U}\), it holds that \(\big(\mathsf{view}^\mathcal{F}_{P_1}(A, B), Z\big)\) is (computationally) \((\epsilon,\delta)\)-indistinguishable from \(\big(\mathsf{view}^\mathcal{F}_{P_1}(A, B_{+x^*}), Z\big)\). Here, \((A, B)\) and \(Z\) are sampled from \({\cal D}\), and each party may use additional randomness.
DP and DDP against an adversary corrupting \(P_2\) is defined symmetrically.
We will use this well-known inequality.
Lemma 44 ([147]). Consider a binomial distribution \(\mathsf{B}(n,p)\). We have \[\Pr_{X \sim \mathsf{B}(n,p)}[X \ge k] \le \binom{n}{k} p^k.\]
Since the noiseless min-hash functionality cannot achieve DP as discussed above, we consider a noisy variant that provides DP. We first consider the global sensitivity of \({\mathcal{F}_\mathsf{minH}}\) and use the standard Laplace mechanism to provide DP.
Let \(B = (x_1^B, \ldots, x_{n_B}^B)\) and \(B_{+x^*}= (x_1^B, \ldots, x_{n_B}^B, {x^*})\), and WLOG, we consider two neighboring inputs \((A, B) \mbox{ and } (A, B_{+x^*})\); the case in which \({x^*}\) is added into \(A\) can be shown symmetrically.
We show how changing the input sets from \(B\) to \(B_{+x^*}\) affects the final count. Let \({x^*}\) be the \((n_B+1)\)-th element of \(B_{+x^*}\). Consider iteration \(j\) of Step 2 in Figure [prot:pb]. Since we model each hash function \(h_j\) as a random oracle, \((y^B_{1, j}, \ldots, y^B_{n_B+1, j})\) will be uniformly distributed. Now, consider how the min-hash \(u_j^B\) is computed. The value \({x^*}\) from \(B_{+x^*}\) can affect the min-hash \(u_j^B\) (and thereby the final count \(c\)), only if \(y^B_{n_B+1,j}\) is smaller than \((y^B_{1, j}, \ldots, y^B_{n_B, j})\).
The probability that \(y^B_{n_B+1,j}\) will be less than all \(y^B_{i,j}\)s is at most \(1/(n_B+1)\) by a symmetry argument. Note the final output is computed as the sum of \(k\) of these trials. Let \[S_{x^*}= \left\{j \in [k]: y^B_{n_B+1,j} < \min_{i \in [n_B]} \{y^B_{i,j}\} \right\}.\] Therefore, we consider a binomial distribution \(|S_{x^*}| \sim \mathsf{B}(k,1/(n_B+1)),\) which represents how many iterations \(j\) cause \({x^*}\) to be the min-hash \(u_j^B\). In other words, \(|S_{x^*}|\) captures the sensitivity of min-hash. Therefore, given the failure probability \(\delta\), the following measure can be used as the global sensitivity: \[\mathsf{\sigma}(\delta, k, n_B) := \min \left\{s: \Pr_{h_1, \ldots, h_k}[|S_{x^*}| \ge s] \le \delta \right\}\]
Lemma 45. For any \(\{x_{i}^B\}_{i \in [n_B]}\), any \(x^* \in \mathcal{U}\), and any \(s\), we have \(\Pr[|S_{x^*}| \ge s] \le \binom{k}{s} \cdot \left(\frac{1}{n_B+1}\right)^s\); consequently, \(\mathsf{\sigma}(\delta, k, n_B) \le \min\left\{s: \binom{k}{s} \left(\frac{1}{n_B+1}\right)^s \le \delta\right\}\).
Proof. The result immediately follows from Lemma 44. ◻
According to the above lemma, asymptotically, with \(k = \Omega(\kappa)\), we have \(\sigma(\delta = \mathsf{negl}(\kappa), k, n = \Theta(k^2)) = O(\lg\lg k)\).
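The bound of Lemma 45 also gives a simple way to compute a concrete sensitivity value; the following sketch returns the smallest \(s\) whose tail bound drops below \(\delta\) (parameter choices are illustrative).

```python
from math import comb

def sigma(delta: float, k: int, n: int) -> int:
    """Smallest s with binom(k, s) * (1/(n+1))^s <= delta, upper-bounding sigma(delta, k, n)."""
    for s in range(1, k + 1):
        if comb(k, s) * (1.0 / (n + 1)) ** s <= delta:
            return s
    return k

print(sigma(delta=2**-40, k=128, n=128**2))  # prints 5 for these parameters
```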
We consider a variant \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) of \({\mathcal{F}_\mathsf{minH}}\) described in Figure [prot:noisy].
Protocol : Noisy Min-Hash Functionality \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\)
Run \({\mathcal{F}_\mathsf{minH}}\) and let \((\mathsf{pre},c)\) be the output from \({\mathcal{F}_\mathsf{minH}}\).
Sample \(\zeta^A\) and \(\zeta^B\) from the Laplace distributions \(Lap(\frac{\mathsf{\sigma}(\delta/2,k, n_A)}{\epsilon})\) and \(Lap(\frac{\mathsf{\sigma}(\delta/2,k, n_B)}{\epsilon})\), respectively. Let \({\mathsf coins}_A\) and \({\mathsf coins}_B\) be the random coins that the functionality used to sample them.
Let \(\ell_A\) and \(\ell_B\) be the minimum integers satisfying \(\Pr[|\zeta^A| > \ell_A] \le \delta/2\) and \(\Pr[|\zeta^B| > \ell_B] \le \delta /2\). If \(|\zeta^A| > \ell_A\), truncate it so that \(|\zeta^A| = \ell_A\); likewise, truncate \(\zeta^B\) if necessary.
Let \(\mathsf{r}(\cdot)\) be a rounding function. Output \(({\mathsf{pre}}, {\mathsf coins}_A, \mathsf{r}(c + \zeta^B))\) to \(P_1\) and \(({\mathsf pre}, {\mathsf coins}_B, \mathsf{r}(c + \zeta^A))\) to \(P_2\).
Theorem 46. \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) is \((\epsilon, \delta)\)-DP against an adversary corrupting either party.
Proof. By definition, \(\mathsf{\sigma}(\cdot)\) is an upper bound on the sensitivity with probability \(1-\delta/2\). For the honest party's noise (i.e., \(\zeta^A\) or \(\zeta^B\)), truncation takes place with probability at most \(\delta/2\). Therefore, by the standard Laplace mechanism ([3], Theorem 3.6), and since DP is preserved under post-processing, \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) provides \((\epsilon, \delta)\)-DP. ◻
We construct a two-party protocol that securely realizes the functionality \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). The protocol takes advantage of an ideal functionality \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\) for private set intersection cardinality (PSI-CA) [140], which computes the exact cardinality of the intersection of the two input sets, as described in Figure [func:psica].
Protocol : Functionality of Private Set Intersection Cardinality \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\)
Input: \(P_1\) has a set \(A\) and \(P_2\) has a set \(B\).
Output: Return \(|A \cap B|\) to \(P_1\) and \(\bot\) to \(P_2\).
In particular, in order to compute the noisy min-hash match counts, the parties construct two sets consisting of min-hash values and additional dummy elements and then run \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\) on these sets. To encode the Laplace noise as set elements, the protocol uses a unary encoding, which introduces some inefficiency. However, as the tail probability of the Laplace noise decreases exponentially, the unary encoding length can be bounded by a small value, and the protocol's overall efficiency is maintained. Detailed steps of the protocol are provided in Figure [pro:ph].
It is worth noting that the above task could also be implemented using a generic two-party computation (2PC) protocol. However, [140] proposed an efficient PSI-CA protocol that outperforms 2PC protocols for small input sizes (using the start-to-finish comparison including the 2PC preprocessing steps). See Section 5.9.1 for more details of this PSI-CA protocol. Since our input sets are small, we chose to present protocol \(\pi_{\mathsf NMH}\) using the PSI-CA functionality.
We will prove below that protocol \(\pi_{\mathsf NMH}\) securely realizes \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). This implies that protocol \(\pi_{\mathsf NMH}\) is also \((\epsilon, \delta)\)-computationally DP [148]. The main benefit of the protocol is that the hash computations can be performed locally, and the communication complexity of the protocol is sublinear in \(n_A\) and \(n_B\) even when the protocol implementing \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\) has linear communication complexity in its own input size.
Protocol : Two-party Noisy Min-hash Protocol \(\pi_{\mathsf NMH}^{\mathcal{O}}\)
Input: \(P_1\) and \(P_2\)'s input vectors \(A = (x^A_1, \ldots, x^A_{n_A})\) and \(B = (x^B_1, \ldots, x^B_{n_B})\).
Protocol:
\(P_1\) samples prefix \(\mathsf{pre}\) and sends it to \(P_2\). This prefix is used to define hash functions \(h_1,h_2,\dots,h_k\), where for \(i \in [k]\), \(h_i(\cdot) := \mathcal{O}(\mathsf{pre} || i || \cdot)\).
\(P_1\) computes the min-hash vector \((u^A_1,u^A_2,\dots,u^A_k)\) locally exactly as described in \({\mathcal{F}_\mathsf{minH}}\). Likewise, \(P_2\) locally computes \((u^B_1,u^B_2,\dots,u^B_k)\).
\(P_1\) (resp. \(P_2\)) samples \(\zeta^A\) (resp. \(\zeta^B\)) from the Laplace distribution \(Lap(\frac{\mathsf{\sigma}(\delta/2,k, n_A)}{\epsilon})\) (resp. \(Lap(\frac{\mathsf{\sigma}(\delta/2,k, n_B)}{\epsilon})\)). Let \({\mathsf coins}_A\) (resp. \({\mathsf coins}_B\)) be the random coins that \(P_1\) (resp. \(P_2\)) used in sampling the noise \(\zeta^A\) (resp. \(\zeta^B\)). As in \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\), the parties truncate \(\zeta^A\) and \(\zeta^B\) based on \(\ell_A\) and \(\ell_B\), if necessary.
Let \(Z^B\) be a \(2\ell_B\)-bit vector representing the unary encoding of \(\mathsf{r}(\zeta^B+\ell_B)\). That is, the first \(\mathsf{r}(\zeta^B+\ell_B)\) bits are 1’s and the remaining bits are 0’s. We let \(Z^B_j\) denote the \(j\)th bit of \(Z^B\).
\(P_1\) and \(P_2\) invoke \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\) with the following inputs:
\(P_1\)’s input: \(\{ (i, u_i^A) : i \in [k]\} \cup \{(j+k, 1) : j \in[2\ell_B]\}\)
\(P_2\)’s input: \(\{ (i, u_i^B) : i \in [k]\} \cup \{(j+k, Z^B_j) : j \in[2\ell_B]\}\)
Let \(out\) be the output to \(P_1\) from the functionality \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\). Set \(c^A = out - \ell_B\).
\(P_1\) computes \(c^{+} = c^A + \mathsf{r}(\zeta^A)\) and sends \(c^+\) to \(P_2\). \(P_2\) computes \(c^B = c^{+} - \mathsf{r}(\zeta^B)\).
Output: \(P_1\) and \(P_2\) output \((\mathsf{pre}, {\mathsf coins}_A, c^A)\) and \((\mathsf{pre}, {\mathsf coins}_B, c^B)\) respectively.
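The unary-encoding step admits a short plaintext sketch showing why \(P_1\)'s PSI-CA output minus \(\ell_B\) equals \(\mathsf{r}(c + \zeta^B)\); the set construction mirrors Steps 4 and 5 of the protocol, and all names are illustrative.

```python
def psi_ca_inputs(uA, uB, noise_rounded: int, ell_B: int):
    """Build the two PSI-CA input sets: k min-hash slots plus 2*ell_B dummy slots
    whose overlap encodes r(zeta^B) + ell_B in unary."""
    k = len(uA)
    ZB = [1] * (noise_rounded + ell_B) + [0] * (ell_B - noise_rounded)
    set1 = {(i, uA[i]) for i in range(k)} | {(k + j, 1) for j in range(2 * ell_B)}
    set2 = {(i, uB[i]) for i in range(k)} | {(k + j, ZB[j]) for j in range(2 * ell_B)}
    return set1, set2

# c = 2 matching min-hash slots; dummy overlap contributes r(zeta^B) + ell_B = 1 + 3.
uA, uB = [0.1, 0.2, 0.3, 0.4], [0.1, 0.9, 0.3, 0.8]
set1, set2 = psi_ca_inputs(uA, uB, noise_rounded=1, ell_B=3)
print(len(set1 & set2) - 3)  # r(c + zeta^B) = 2 + 1 = 3
```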
Proposition 47. Protocol \(\pi_{\mathsf NMH}^{\mathcal{O}}\) described in Figure [pro:ph] securely realizes \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\) in the semi-honest model.
Proof. First note that the protocol correctly computes \(c^A = \mathsf{r}(c + \zeta^B)\) and \(c^B = \mathsf{r}(c + \zeta^A)\) as in \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). For privacy, when \(P_1\) is corrupted, the only message to simulate is \(c^A\), the output from \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\). Since the protocol is in the \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\)-hybrid model, this message \(c^A\) can be perfectly simulated using the output from \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). The simulator can also ensure that \(\mathsf{pre}\) and \(\zeta^A\) are correctly sampled by using \(\mathsf{pre}\) and \({\mathsf coins}_A\) from \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). For a corrupted \(P_2\), the simulator first ensures that \(\mathsf{pre}\) and \(\zeta^B\) are correct by using \(\mathsf{pre}\) and \({\mathsf coins}_B\) from \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). The message \(c^+ = c^B + \mathsf{r}(\zeta^B)\) can also be perfectly simulated, since the simulator can obtain \(c^B\) from \({\mathcal{F}_\mathsf{noisy\mbox{-}minH}}\). ◻
In Figure [prot:cu], we describe the min-hash protocol \({\mathcal{F}_\mathsf{privH}}\) in the private hash setting. We show that if \(J(A, B)\) is a constant, there exist parameter regimes where \({\mathcal{F}_\mathsf{privH}}\) without noise satisfies differential privacy. Our observation is that the final count \(c\) follows a binomial distribution in the private hash setting, which can be treated as noise to obscure the sensitivity.
Protocol : Min-Hash in the Private Hash Setting \({\mathcal{F}_\mathsf{privH}}\)
\({\mathcal{F}_\mathsf{privH}}\) works exactly the same as \({\mathcal{F}_\mathsf{minH}}\) except that it outputs only the final count \(c\) (with the prefix \({\mathsf{pre}}\) hidden to the participants).
Theorem 48. For any constant \(\epsilon> 0\), if \(k = k(\epsilon, \kappa) \in \Omega(\kappa), n_A/k \in \Omega(\kappa), n_B/k \in \Omega(\kappa)\), and \(J(A, B) \in (0, 1)\) is a constant independent of \(\kappa\), then \({\mathcal{F}_\mathsf{privH}}\) is \((\epsilon,\delta)\)-DP with \(\delta \in \mathsf{negl}(\kappa)\).
Proof. WLOG, let \(B = (x_1^B, \ldots, x_{n_B}^B)\) and \(B_{+x^*}= (x_1^B, \ldots, x_{n_B}^B, {x^*})\). Let \(p = J(A, B)\) and \(s = \sigma(\delta, k, \min(n_A, n_B)) = O(\lg\lg \kappa)\). Recall the definition of \(S_{x^*}\) in Section 5.4.1 and let \(K_{x^*}= [k] \setminus S_{x^*}\). Note that for the iterations in \(K_{x^*}\), the min-hash matches (denoted \(c_{K_{x^*}}\)) for both \((A, B)\) and \((A, B_{+x^*})\) are identically distributed. This match count \(c_{K_{x^*}}\) works as additive noise. Since \(h_1, \ldots, h_k\) are private, we have \(c_{K_{x^*}} \sim \mathsf{B}(k-s, p).\) By applying Lemma 49 below, we conclude that \({\mathcal{F}_\mathsf{privH}}\) is differentially private. ◻
Lemma 49. Consider a Binomial distribution \(\mathsf{B}(n,p)\), where \(n \in \Omega(\kappa)\) and \(p \in (0,1)\) is a constant independent of \(\kappa\). Then, for any constant \(\epsilon\) and \(s = O(\lg\lg \kappa)\), there are \(a, b \in [n]\) with \(a < np < b\) such that
For any \(\ell \in [a,b], e^{-\epsilon} \le \frac{\Pr_{X\sim \mathsf{B}(n,p)}[X = \ell]}{\Pr_{X\sim \mathsf{B}(n,p)}[X+s = \ell]} \le e^{\epsilon}.\)
For any \(\ell \not \in [a,b]\), \(\Pr[\mathsf{B}(n,p)= \ell] = \mathsf{negl}(\kappa)\) and \(\Pr[\mathsf{B}(n,p)+s = \ell] = \mathsf{negl}(\kappa)\).
The proof of the lemma is found in Section 7.3.1.
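As a quick numerical sanity check of the lemma's first condition (computed in log space to avoid underflow; the parameters are illustrative):

```python
from math import exp, lgamma, log

def log_pmf(n: int, p: float, l: int) -> float:
    """log Pr[B(n, p) = l], via log-gamma for numerical stability."""
    return lgamma(n + 1) - lgamma(l + 1) - lgamma(n - l + 1) + l * log(p) + (n - l) * log(1 - p)

n, p, s, eps = 4096, 0.3, 3, 0.5
for l in (1150, 1229, 1300):  # values around np = 1228.8
    ratio = exp(log_pmf(n, p, l) - log_pmf(n, p, l - s))  # Pr[X = l] / Pr[X + s = l]
    print(l, round(ratio, 3), exp(-eps) <= ratio <= exp(eps))  # all within e^{+-eps}
```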
While \({\mathcal{F}_\mathsf{privH}}\) could be viewed as a trusted-curator model, a two-party protocol realizing it can be constructed without a trusted curator. In particular, the computation of \((u^A_1, \ldots, u^A_k)\) (including all \(n_A\) hash evaluations) can be performed locally under a threshold FHE so that only the encryptions of these values are sent to \(P_2\). Then, by computing the remaining steps under FHE and delivering the result via threshold decryption, the protocol securely realizes \({\mathcal{F}_\mathsf{privH}}\) in the semi-honest setting. We note that the resulting protocol has communication sublinear in the input size, since only the \(k\) inputs to the comparisons need to be communicated.
In this section, we show that there are parameter regimes where the public min-hash protocol \({\mathcal{F}_\mathsf{minH}}\) can satisfy DDP without adding noise. In Figure [fig:dist], we first describe the family of distributions we consider in the context of our min-hash protocol. The distribution models a situation in which the adversary, having corrupted one of the two parties, has access to that party's view and even the actual intersection. However, the adversary does not know the other party's input set (beyond the intersection).
Protocol : Distribution Family \(\Delta_\mathsf{PH}\)
Parameterized with \((n_A, n_B, n_I)\), a distribution \({\cal D}_{A, B}\) in this family samples \((A, B)\) such that
Letting \(I = A \cap B\), it holds that \(|A| = n_A, |B| = n_B\), and \(|I| = n_I\)
Output:
The inputs to the parties \(P_1\) and \(P_2\) are \(A\) and \(B\) respectively.
Give \(I\) to the adversary as the auxiliary information.
Below, we show that \({\mathcal{F}_\mathsf{minH}}\) achieves DDP under certain circumstances.
Theorem 50. For every constant \(\epsilon > 0\), consider \({\mathcal{F}_\mathsf{minH}}\) in the random oracle model with \(k = k(\epsilon, \kappa)\), where \(k \in \Omega(\kappa)\). Let \(R = B \setminus I\), each element of which has min-entropy at least \(\kappa\). Let \(n_A/k, n_B/k \in \Omega(\kappa)\), and let \(n_I/n_A \in (0, 1)\) be a constant independent of \(\kappa\). Then, \({\mathcal{F}_\mathsf{minH}}\) is computationally \((\epsilon,\delta,\Delta_\mathsf{PH})\)-DDP against an adversary corrupting \(P_1\) with \(\delta\in \mathsf{negl}(\kappa)\). DDP against an adversary corrupting \(P_2\) holds when the parameters are set symmetrically.
Theorem 51. For security parameter \(\kappa\), every constant \(\epsilon > 0\), and every constant \(\gamma \in (0,1)\), consider \({\mathcal{F}_\mathsf{minH}}\) in the random oracle model with \(k = k(\epsilon, \kappa)\), where \(k \in \Omega(\kappa \cdot \lg\lg\kappa)\). Let \(R = B \setminus I\) be a set of size \(n_R\), with \(n_R/k^2 \in \Omega(\kappa)\). Let the universe \(\mathcal{U}\) be of size \(n_R \cdot \ell\), where \(\ell = \Omega(n_R^3)\). Assume the secret set \(R\) is chosen uniformly from all subsets of \(\mathcal{U}\) of size \(n_R\), conditioned on arbitrary leakage on \(R\) of length \(L\), where \(n_R \lg \ell - L \ge \frac{8n_R}{9} \lg \ell + 2n_R\). Let \(|I| \in \Theta(n)\). Then the output of \({\mathcal{F}_\mathsf{minH}}\) achieves computational \((\epsilon, \delta, \Delta_\mathsf{PH})\)-DDP with \(\delta \in \mathsf{negl}(\kappa)\) against an adversary corrupting \(P_1\). DDP against an adversary corrupting \(P_2\) holds when the parameters are set symmetrically.
An easy way to realize \({\mathcal{F}_\mathsf{minH}}\) is to have each party locally hash their inputs using the \(k\) public hash functions and to locally compute the minimum for each iteration. The parties can then run a simple two-party computation to compute the number of times these minimums match. We note that this protocol has communication and computation that is sublinear in the input size as it only depends on the number of hash functions. By Theorems 50 and 51 this protocol achieves DDP when the conditions of either of the theorems are satisfied.
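As an illustration of this realization, the following Python sketch computes each party's min-hashes locally and tallies the matches. The salted-SHA-256 construction stands in for the \(k\) public hash functions, and the equality count is computed in the clear purely for illustration; the actual protocol would run only this final step inside a generic two-party computation, so that just the count is revealed.

```python
# A sketch of the public-hash realization: local min-hashing plus a match count.
import hashlib

def h(j: int, x: str) -> float:
    """j-th public hash, modeled as a random oracle into [0, 1)."""
    d = hashlib.sha256(f"{j}|{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def min_hashes(S, k):
    """Locally computable min-hash vector (u_1, ..., u_k) of a set S."""
    return [min(h(j, x) for x in S) for j in range(k)]

def match_count(u_A, u_B):
    """The final count c; in the real protocol this equality test runs
    inside a two-party computation, so only c is revealed."""
    return sum(a == b for a, b in zip(u_A, u_B))

k = 256
A = {f"item{i}" for i in range(1000)}
B = {f"item{i}" for i in range(500, 1500)}
c = match_count(min_hashes(A, k), min_hashes(B, k))
print(f"estimated Jaccard Index: {c / k:.3f} (true: {len(A & B) / len(A | B):.3f})")
```

Note that the communication consists of only the \(k\) min-hash values entering the secure equality count, independent of the input sizes.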
We first give the intuition of the proof. We assume that each of the non-intersecting elements has high min-entropy. WLOG, consider an adversary corrupting \(P_1\). The view of the adversary will be \[\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}}_{P_1}(A, B) := (A,c, h_1, \ldots, h_k).\] As shown above, the sensitivity can be upper-bounded by a small value \(s\).
Unlike \({\mathcal{F}_\mathsf{privH}}\), however, when we show the existence of sufficient noise from the remaining iterations, we need to take the additional leakage into consideration.
First, since the hash functions are public, iterations are no longer independent of each other as needed by the analysis in Section 5.5. We address this issue by employing the fact that each of the non-intersecting items has high min-entropy. In the random oracle model, as long as the adversary does not query hash function \(h\) on some point \(x\), \(h(x)\) is uniformly random to the adversary. Since the non-intersecting items have high min-entropy, the adversary is negligibly likely to query any of them to the hash functions, thus guaranteeing independence.
Now, to see how the remaining iterations still hide the sensitivity even with public hash functions, let \(R = B \setminus I\). For the remaining \(k-s\) iterations, the high min-entropy of each element in \(R\) will jitter the final count. In particular, consider the \(j\)th hash function \(h_j\) in the protocol (among the \(k-s\) remaining iterations) and let \[v_j^A = \min h_j(A),~~ v_j^I = \min h_j(I),~~ v_j^R = \min h_j(R).\] Suppose \(v_j^A = v_j^I\). Then, if \(v_j^R \ge v_j^I\), the min-hash \(u_j^A\) of \(A\) will be equal to the min-hash \(u_j^B\) of \(B\) (both of which are equal to \(v_j^I\)), and the final count \(c\) will be incremented due to this \(j\)th iteration. However, if \(v_j^R < v_j^I\), then \(u_j^A \ne u_j^B\), and the final count will not be incremented. In this way, the distribution of \(v_j^R\) jitters the final count. The above discussion can be formalized into the following definition.
Definition 52 (\(\theta\)-good iteration). Let \(n_R = n_B - n_I\), we define \(\mathsf{good}_\theta(h_j, A, I, n_B)\) to be true if and only if the following holds: \[\min h_j(A) = \min h_j(I), ~\mbox{and}~ \min h_j(I) \in \left[1-\left(\frac{1}{2}+\theta\right)^{1/n_R}, 1-\left(\frac{1}{2}-\theta\right)^{1/n_R}\right].\]
The second condition of the definition requires that \(\min h_j(I)\) is somewhere in the middle (parameterized by \(\theta \in \Theta(1)\)) so that the distribution of \(R\) (i.e., random \(v_j^R\)) may reduce the final count with a decent chance (and also keep the count with a decent chance). As long as \(n_I/n_A\) is a constant fraction, there are sufficiently many \(\theta\)-good iterations, although we lose some iterations. In particular, if we let \(k_g\) be the number of good iterations, we have \(k_g = \Theta(k)\).
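The predicate of Definition 52 translates directly into code. In the following sketch, `h_of` is a stand-in for the \(j\)th hash function mapping elements into \([0,1)\), as in the earlier sketch.

```python
# A direct transcription of Definition 52 (good_theta), assuming hash
# values in [0, 1); `h_of` is an illustrative stand-in for h_j.
def is_theta_good(h_of, A, I, n_B, theta):
    """True iff the intersection attains the minimum over A and that
    minimum lies in the middle band parameterized by theta."""
    n_R = n_B - len(I)
    m_A = min(h_of(x) for x in A)
    m_I = min(h_of(x) for x in I)
    lo = 1 - (0.5 + theta) ** (1 / n_R)
    hi = 1 - (0.5 - theta) ** (1 / n_R)
    return m_A == m_I and lo <= m_I <= hi
```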
With public hash functions and thereby \(\min h_j(I)\) being leaked to the adversary, it turns out that the noise from the \(k_g\) iterations follows a Poisson Binomial distribution, which is a generalization of a Binomial distribution where each trial has a different success probability. However, using the techniques of [149], we can still show that this distribution works as a good noise to hide the private data.
WLOG, we consider two neighboring inputs \((A, B) \mbox{ and } (A, B_{+x^*}).\) DDP for the case in which \({x^*}\) is added into \(A\) can be shown symmetrically. We prove the theorem by a hybrid argument. In particular, we define a slightly different ideal functionality \({\mathcal{F}_\mathsf{minH}}^{(1)}\) as follows:
Let \({\mathcal{F}_\mathsf{minH}}^{(1)}\) be the same as \({\mathcal{F}_\mathsf{minH}}\) except that for each \(x^B_i \in B\setminus A\), each element in \(\{y^B_{i,j}\}_j\) is chosen uniformly at random from \([0,1]\).
We set up the following hybrids. We will argue that for any \({x^*}\in \mathcal{U}\) and over \((A, B, I) \leftarrow\Delta_\mathsf{PH}\), it holds \[\begin{aligned} & (\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}}_{P_1}(A,B), I) \stackrel{c}{\approx} (\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}^{(1)}}_{P_1}(A,B), I) \\ & ~~~~~ \approx_{\epsilon, \delta} (\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}^{(1)}}_{P_1}(A, B_{+x^*}), I) \stackrel{c}{\approx}(\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}}_{P_1}(A, B_{+x^*}), I) \end{aligned}\] for any constant \(\epsilon> 0\) and for some \(\delta\in \mathsf{negl}(\kappa)\), as long as each element in \(B \setminus I\) has high min-entropy.
Recall that the min-entropy of each element \(x^B_i \in B\setminus A\) is at least \(\kappa\). Therefore, the probability that an adversary making at most polynomially many oracle queries ever queries such an \(x^B_i\) is \(\mathsf{negl}(\kappa)\). Conditioned on the adversary not querying any such \(x^B_i\), each \(y^B_{i,j}\) for \(j \in [k]\) is uniformly random from the adversary's point of view. The same argument shows \((\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}}_{P_1}(A,B_{+x^*}), I) \stackrel{c}{\approx} (\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}^{(1)}}_{P_1}(A,B_{+x^*}), I)\). Therefore, we are left only to show \((\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}^{(1)}}_{P_1}(A,B), I) \approx_{\epsilon,\delta}(\mathsf{view}^{{\mathcal{F}_\mathsf{minH}}^{(1)}}_{P_1}(A, B_{+x^*}), I)\).
We show \((A, I, h_1, \ldots, h_k, c) \approx_{\epsilon,\delta}(A, I, h_1, \ldots, h_k, c_{+{x^*}})\), where \(c\) is the final count from \({\mathcal{F}_\mathsf{minH}}^{(1)}(A, B)\) and \(c_{+{x^*}}\) is the final count from \({\mathcal{F}_\mathsf{minH}}^{(1)}(A, B_{+x^*})\). We show how to leverage the uncertainties of \(x^B_i \in R = B \setminus A\) so that good iterations work like the needed noise to guarantee DP.
Lemma 53. For any \(A, I, n_B\) and \(n_R = n_B - n_I\), we have \[p_\theta \stackrel{def}{=} \Pr_{h}[\mathsf{good}_\theta(h, A, I, n_B)] \ge \left(\left(\frac{1}{2}+\theta \right)^{\frac{n_A}{n_R}} - \left(\frac{1}{2}-\theta \right)^{\frac{n_A}{n_R}}\right) \cdot \frac{n_I}{n_A} ~~.\]
The proof is found in Section 7.3.2. This lemma shows that a random hash leads to a good iteration with probability \(p_\theta\), which is constant in our setting based on the assumption about \(n_A, n_I, n_R\).
Recall that \(S_{{x^*}}\) was the random variable that represents the set of iterations \(j\) such that the min-hash \(u_j^B\) comes from \({x^*}\) when \(P_2\)’s input is \(B_{+x^*}\). From Lemma 45, with overwhelming probability \(|S_{{x^*}}| = O(\lg\lg\kappa)\).
Now, fix \(A, I, {x^*}\) and \(h_1, \ldots, h_k\) and let \(G_{\theta}\) be the set of iterations \(j\) in which a \(\theta\)-good event takes place; i.e., \[G_{\theta} = \left\{ j \in [k]: \mathsf{good}_\theta(h_j, A, I, n_B) \right\}.\]
Let \(K_{\theta} = G_{\theta} \setminus S_{{x^*}}\). The following lemma shows that the \(\theta\)-good events take place \(\Theta(\kappa)\) times, with overwhelming probability.
Lemma 54. Suppose \(k = \Theta(\kappa)\), \(n_B = \Omega(\kappa^2)\), and \(p_\theta \in \Theta(1)\). Let \(s = |S_{x^*}|\). Then, we have \[\Pr_{h_1, \ldots, h_k}\left[ |K_\theta| > \frac{2}{3} (k-s) p_\theta \right ] \ge 1 - \mathsf{negl}(\kappa).\]
The proof is found in Section 7.3.3.
For a set \(W\), define \(c_W \stackrel{def}{=} \sum_{j \in W} \mathsf{Eq}(u^A_j, u^B_j)\). Let \(\overline K_\theta := [k] \setminus K_\theta\). Note that the contributions to the final output can be divided into two parts:
\(c_{K_\theta}\): The contribution from the iterations in \(K_\theta\), which contains all the \(\theta\)-good iterations such that \({x^*}\) does not hash to the minimum across \(B_{+x^*}\).
\(c_{\overline{K_\theta}}\): The contribution from all the remaining iterations.
Essentially, for any final count \(q\), we are interested in comparing the two probabilities: \[\Pr[c_{\overline K_\theta} + c_{K_\theta} = q] \mbox{~and~} \Pr[c^{{+x^*}}_{\overline K_\theta} + c^{{+x^*}}_{K_\theta} = q].\]
Following our discussion on sensitivity in Section 5.4, the difference of \(c_{\overline K_\theta}\) and \(c^{{+x^*}}_{\overline K_\theta}\) is upper-bounded by \(s = O(\lg\lg \kappa)\). Note that we have \(c_{K_\theta} = c^{{+x^*}}_{K_\theta}\) because \(j \in K_\theta\) implies \(j \not \in S_{{x^*}}\). Therefore, we only need to analyze the single distribution of \(c_{K_\theta}\) as a noise and compare the following two probabilities: \[\Pr[c_{K_\theta} = q] \mbox{~~and~} \Pr[c_{K_\theta}+s = q].\]
We have \(c_{K_\theta} = \sum_{j \in K_\theta} c_j,\) where \(c_j = {\mathsf Eq}(u_j^A, u_j^B)\). Note that since we have \(j \in K_\theta\), a \(\theta\)-good event takes place in iteration \(j\), i.e., \(\min h_j(A) = \min h_j(I)\).
Let \(\gamma_j = 1 - \min h_j(I)\). Note that the hash of each item in \(R\) is uniformly distributed in \({\mathcal{F}_\mathsf{minH}}^{(1)}\). Therefore, the probability that \(c_j = 1\) is \((\gamma_j)^{n_R}\): for the match to occur, the hash of every item in \(R\) must be at least \(\min h_j(I)\).
Let \(\eta_{-\theta} = 1/2-\theta\) and \(\eta_{+\theta}= 1/2+\theta.\) Since \(j\) is a good iteration, we have \((\gamma_j)^{n_R} \in [\eta_{-\theta}, \eta_{+\theta}].\) Therefore, letting \(p_j = (\gamma_j)^{n_R}\), we have \(c_j \sim \mathsf{Ber}(p_j)\), where \(\mathsf{Ber}\) denotes the Bernoulli distribution. Since these Bernoulli distributions are independent of each other, we can apply Lemma 55 below to conclude that \(c_{K_\theta} \approx_{\epsilon,\delta} c_{K_\theta}+s\).
For \(j \in [n]\), consider \(c_j \sim \mathsf{Ber}(p_j)\). With \(p_J = \{p_j\}_{j=1}^n,\) let \(\mathsf{PB}(n, p_J)\) denote the distribution of \(\sum_{j \in [n]} c_j\). This distribution is called an Additive Poisson Binomial distribution.
We conclude the proof by showing that the Additive Poisson Binomial distribution with appropriate parameters satisfies the following DP-like property.
Lemma 55. Consider an Additive Poisson Binomial distribution \(\mathsf{PB}(n, p_J)\), where \(n \in \Omega(\kappa)\) and for each \(p_j\), it holds that \(p_j \in [1/2 - \theta, 1/2 + \theta]\) where \(\theta \in (0, 1/2)\) is a constant independent of \(\kappa\). Then, for any constant \(\epsilon\) and \(s = \Theta(\lg\lg \kappa)\), there are \(a, b \in [n]\) such that
For any \(\ell \in [a,b], e^{-\epsilon} \le \frac{\Pr[\mathsf{PB}(n, p_J) = \ell]}{\Pr[\mathsf{PB}(n, p_J)+s = \ell]} \le e^{\epsilon}.\)
For any \(\ell \not \in [a,b]\), \(\Pr[\mathsf{PB}(n,p_J) = \ell] = \mathsf{negl}(\kappa)\) and \(\Pr[\mathsf{PB}(n,p_J)+s = \ell] = \mathsf{negl}(\kappa)\).
The proof is found in Section 7.3.4.
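As an illustration of Lemma 55, the following sketch computes the exact pmf of an Additive Poisson Binomial distribution by convolution and checks the shift-by-\(s\) ratio on its central interval. The parameters \(\theta, n, s, \epsilon\) are toy values chosen for speed, not ones dictated by the analysis.

```python
# A numeric illustration of Lemma 55 with p_j in [1/2 - theta, 1/2 + theta].
from math import log
import random

def pb_pmf(ps):
    """pmf of a Poisson Binomial: sum of independent Bernoulli(p_j),
    computed by dynamic-programming convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for ell, q in enumerate(pmf):
            nxt[ell] += q * (1 - p)
            nxt[ell + 1] += q * p
        pmf = nxt
    return pmf

random.seed(1)
theta, n, s, eps = 0.2, 2000, 4, 0.5
ps = [0.5 + random.uniform(-theta, theta) for _ in range(n)]
pmf = pb_pmf(ps)
good = [ell for ell in range(s, n + 1)
        if pmf[ell] > 0 and pmf[ell - s] > 0
        and abs(log(pmf[ell] / pmf[ell - s])) <= eps]
a, b = min(good), max(good)
tail = sum(q for ell, q in enumerate(pmf) if not a <= ell <= b)
print(f"e^±{eps} ratio holds on [{a}, {b}]; mass outside ≈ {tail:.2e}")
```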
Here, we highlight only the important parts of the proof of Theorem 51. The full proof can be found in Section 7.3.5. We show that \({\mathcal{F}_\mathsf{minH}}\) satisfies DDP even when the universe \(\mathcal{U}\), of size \(n_R \cdot \ell\), is polynomial in \(\kappa\) with \(\ell = \Omega(n_R^3)\), and the secret set \(R\) is chosen uniformly from all size-\(n_R\) subsets of \(\mathcal{U}\), conditioned on arbitrary leakage on \(R\) of length \(L\), where \(L \le n_R (\lg \ell - 3 \lg n_R - 2)\). WLOG, we assume that the adversary corrupts \(P_1\).
We set \(n_R' := n_R/3\); looking forward, it is the size of a subset \(R' \subset R\), each of whose elements has high remaining min-entropy even after leakage (that we will define in the proof) is considered.
Consider running the min-hash protocol \({\mathcal{F}_\mathsf{minH}}\) with \(k\) iterations such that \(k_g\) of them belong to \(G_\theta\). For this, we consider all the hash outputs in two different stages and define the following sets: \[H_1 = \{ h_j(A_{+{x^*}})\}_{j=1}^k, ~ H_2 = \{h_j(\mathcal{U}\setminus A_{+x^*})\}_{j=1}^k.\] Since we are in the random oracle model, each hash value is chosen uniformly at random. For our analysis, we construct the following bipartite graph \((\mathcal{X},\mathcal{Y},\mathcal{E})\), which we call the min-hash graph, based on the sets \(A, I\) and \({x^*}\) along with the hash functions as follows:
MinhashG\(_{H_1}(A, I, {x^*}, H_2)\):
Set \(\mathcal{X} = \mathcal{U}\setminus A_{+{x^*}}\). In other words, the graph considers all potential elements that could be in \(R = B\setminus I\). A distribution of \(R\) is equivalent to a distribution of how to choose \(n_R\) nodes from \(\mathcal{X}\). Note that \(H_1\) determines \(G_\theta\) (based on the hash values of \(A\) and \(I\)). We set \(\mathcal{Y}= G_\theta\). In other words, \(\mathcal{Y}\) corresponds to all the good iterations that could potentially positively contribute to the final count.
Let \(p_j = \min h_j(I)\). Use \(H_2\) to determine the set of edges: \[\mathcal{E} = \{ (i, j): (i, j) \in \mathcal{X}\times \mathcal{Y}\mbox{ and } h_j(x_i) < p_j = \min h_j(I) \}.\] In other words, existence of an edge \((i,j)\) means that if node \(i\) belongs to \(R\), iteration \(j\) will not contribute to the final count.
Output the resulting bipartite graph \((\mathcal{X}, \mathcal{Y}, \mathcal{E})\).
|         | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   |
|---------|------|------|------|------|------|------|------|------|------|------|------|
| \(h_1\) | 0.83 | 0.25 | 0.77 | 0.85 | 0.93 | 0.35 | 0.86 | 0.92 | 0.49 | 0.21 | 0.5  |
| \(h_2\) | 0.62 | 0.83 | 0.27 | 0.59 | 0.63 | 0.26 | 0.4  | 0.26 | 0.72 | 0.36 | 0.6  |
| \(h_3\) | 0.68 | 0.11 | 0.67 | 0.29 | 0.82 | 0.3  | 0.62 | 0.23 | 0.67 | 0.35 | 0.7  |
| \(h_4\) | 0.02 | 0.43 | 0.22 | 0.58 | 0.69 | 0.67 | 0.93 | 0.56 | 0.11 | 0.42 | 0.8  |
Let the universe be \(\mathcal{U}= [11]\). Let \(A = \{1, 2, 3, 4\},\) \(I = \{2, 3\}\), \({x^*}= 11\). Let the threshold range for the \(\theta\)-good iterations be \([0.2, 0.7]\). Assume that our protocol runs for 4 iterations using the hash functions defined in Table [examplehash].
Figure [fig:bigraph] shows the constructed min-hash graph. We have \(\mathcal{X}= \{5, 6, \ldots, 10\}\) and \(\mathcal{Y}= \{h_1, h_2\}\); \(h_3\) has been ruled out since \(p_3 = h_3(2) = 0.11 \not \in [0.2, 0.7]\), and \(h_4\) has been ruled out because \(\min h_4(A) \ne \min h_4(I)\). Moreover, we have \(p_1 = h_1(2) = 0.25\) and \(p_2 = h_2(3) = 0.27\). Note that \((8, h_2) \in \mathcal{E}\), because \(h_2(8) < p_2\).
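The worked example above can be reproduced mechanically. The following sketch encodes the hash table, filters the \(\theta\)-good iterations, and constructs the edge set of the min-hash graph; all values are taken verbatim from the example.

```python
# A transcription of the worked example: hash table, good-iteration
# filter, and the resulting min-hash graph (X, Y, E).
H = {  # H[j][i-1] = h_j(i) for universe U = {1, ..., 11}
    1: [0.83, 0.25, 0.77, 0.85, 0.93, 0.35, 0.86, 0.92, 0.49, 0.21, 0.5],
    2: [0.62, 0.83, 0.27, 0.59, 0.63, 0.26, 0.4, 0.26, 0.72, 0.36, 0.6],
    3: [0.68, 0.11, 0.67, 0.29, 0.82, 0.3, 0.62, 0.23, 0.67, 0.35, 0.7],
    4: [0.02, 0.43, 0.22, 0.58, 0.69, 0.67, 0.93, 0.56, 0.11, 0.42, 0.8],
}
h = lambda j, i: H[j][i - 1]
U, A, I, x_star, band = set(range(1, 12)), {1, 2, 3, 4}, {2, 3}, 11, (0.2, 0.7)

X = U - (A | {x_star})                          # candidate elements of R
Y = [j for j in H                               # theta-good iterations
     if min(h(j, i) for i in A) == min(h(j, i) for i in I)
     and band[0] <= min(h(j, i) for i in I) <= band[1]]
p = {j: min(h(j, i) for i in I) for j in Y}
E = {(i, j) for i in X for j in Y if h(j, i) < p[j]}

print(f"X = {sorted(X)}, Y = {Y}")              # X = [5..10], Y = [1, 2]
print(f"(8, h_2) in E: {(8, 2) in E}")          # True: h_2(8) = 0.26 < 0.27
```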
We fix \(H_1\) and thereby the nodes \(\mathcal{X}\) and \(\mathcal{Y}\) of the min-hash graph. In this section, as the first step, we fix subsets \(R' \subset \mathcal{X}\) and \(T \subseteq \mathcal{Y}\) and analyze the noise over the choice of \(H_2\). In other words, we treat \(H_2\) as private to the adversary. Extending this, in the next section, we will consider the actual protocol setting where the hash functions are public and then analyze the noise over a distribution of \(R'\).
The probability (over the choice of \(H_2\)) that an edge \((i,j)\) forms is exactly equal to \(p_j\). Moreover, since we are in the random oracle model, the probability that \((i,j)\) forms is independent of the probability that any other edge in the graph forms.
We are interested in the probability \(E^{R'}_{T,r}\) over the choice of \(H_2\) that the final count is reduced by exactly \(r\) due to the elements of \(R'\) over a bundle \(T\) of iterations. In the random oracle model, the probability depends only on the size of the sets \(n' = |R'|\) and \(k_b = |T|\). Therefore, we will often use the notation \(E^{n'}_{k_b,r} = E^{R'}_{T,r}\). We will sometimes even omit \(k_b\) and write \(E^{n'}_r\). Observe that \(E^{n'}_{r}\) is another way of representing an Additive Poisson Binomial distribution. That is, \(E^{n'}_{k_b,r} = \Pr[ \mathsf{PB}(k_b, p_J) = r ].\) Therefore, based on Lemma 55, we have the following:
Corollary 56. Let \(k_b \in \Omega(\kappa)\), and consider any \(H_1\) that makes \(|\mathcal{Y}| > k_b\) in the min-hash graph construction. For any \(s = O(\lg\lg\kappa)\), any constant \(\epsilon\), there are \(a, b \in [k_b]\) such that over the choice of \(H_2\), we have
For any \(r \not\in [a + s, b]\), \(E^{n_R'}_{k_b, r}\) and \(E^{n_R'}_{k_b, r-s}\) are both negligible in \(\kappa\).
For any \(r \in [a, b]\), it holds that \(e^{-\epsilon/3} \leq E^{n_R'}_{k_b, r}/E^{n_R'}_{k_b, r-s} \leq e^{\epsilon/3}\).
The above indicates that the distribution over \(r\) is amenable for use as a noise distribution in a differential privacy context.
Our main technical challenge is to show that the properties needed for differential privacy hold even when the hash functions are public.
For this, we first fix \(H_1\) and \(H_2\). Then, we consider the derived min-hash graph \(G = (\mathcal{X}, \mathcal{Y}, \mathcal{E})\). Let \({\cal D}\) be the distribution of \(R'\). For any set \(T\) of iterations of size \(k_b\) and any integer \(r\), let \(I_{R', T, r}\) be the indicator random variable that is set to \(1\) if the set \(R'\) contributes \(-r\) to the total count in the min-hash protocol. We define a random variable \(D_{T,r}\) equal to the probability that \(R'\) contributes the noise reduction \(r\) over the iterations in \(T\): \[D_{T,r}({\cal D}) := \Pr_{R' \sim {\cal D}}[ I_{R', T, r} = 1 ] = \sum_{R'} \Pr_{R' \sim {\cal D}}[R'] \cdot I_{R', T, r}.\]
Ideally, we would like to show the following:
For any fixed \(H_1\) and \(H_2\) and over distribution \({\cal D}\), it holds that \(D_{T, r}\) and \(D_{T, r-1}\) (and ultimately \(D_{T, r-s}\)) are close, except with the tail case of \(r\) whose probability weight is negligible.
The universal quantifier over \(H_1\) and \(H_2\) in the above can be slightly relaxed so that the condition holds with all but small probability over the choice of the hash functions, which can be captured by showing that \(D_{T,r}\) is close to its mean \(E_{k_b,r}^{n_R'}\) (and then applying Corollary 56).
This essentially amounts to showing that \(D_{T,r}\) is strongly concentrated around its mean. We could try to apply a Chernoff bound to show this concentration, but we cannot do so because \(I_{R'_i, T, r}\) and \(I_{R'_j, T, r}\) are not necessarily independent if \(R'_i \cap R'_j \neq \emptyset\). Therefore, we instead use Chebyshev's inequality to bound the tail, which requires \(D_{T,r}\) to have small variance. Thus, our next goal is to upper-bound \(\mathsf{Var}[D_{T,r}]\). To do so, we introduce a property of distributions \(\mathcal{D}\) over sets \(R'\) which we call the “Geometric Collision Property”. In a nutshell, this property states that the probability that two sets \(R'_1, R'_2\) drawn independently from \(\mathcal{D}\) have intersection of size \(z\) is at most \((1/\sqrt{n_R})^z\) for all \(z \in [n']\). We show that \(\mathsf{Var}[D_{T,r}]\) can be bounded for any distribution over sets \(R'\) that has this property.
Definition 57 (Geometric Collision Property). Let \(\mathcal{D}\) be a distribution over sets \(R'\) of size \(n'_R\). We say that \({\mathcal{D}}\) has the Geometric Collision Property if for all \(z \in [n'_R]\) \[\Pr_{R'_i, R'_j \sim {\mathcal{D}}} \left [|R'_i \cap R'_j| = z \right ] \leq \left ( \frac{1}{\sqrt{n_R}} \right)^z.\]
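As a sanity check of Definition 57, the following Monte Carlo sketch estimates the collision profile of the uniform distribution over size-\(n'\) subsets of a universe of size \(n' \cdot \ell\) and compares it against the geometric bound. The parameters (subset size, trial count) are illustrative, and the bound is instantiated with \(n_R \approx n'\) for simplicity.

```python
# Monte Carlo check of the Geometric Collision Property for uniform subsets.
import random

def collision_profile(n_prime, universe_size, trials=20000, seed=0):
    """Empirical Pr[|R1 ∩ R2| = z] for independent uniform draws R1, R2."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        R1 = set(rng.sample(range(universe_size), n_prime))
        R2 = set(rng.sample(range(universe_size), n_prime))
        z = len(R1 & R2)
        counts[z] = counts.get(z, 0) + 1
    return {z: c / trials for z, c in sorted(counts.items())}

n_prime = 30
ell = n_prime**3                     # ell = Omega(n'^3), as in the analysis
profile = collision_profile(n_prime, n_prime * ell)
bound = 1 / n_prime**0.5             # the geometric bound (1/sqrt(n_R))^z
for z, pr in profile.items():
    if z >= 1:
        print(f"z={z}: empirical {pr:.2e} vs bound {bound**z:.2e}")
```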
Based on this property, we can show the following lemma.
Lemma (fixed hash functions). Let \(k_b \in \Omega(\kappa)\), and consider any \(H_1\) that makes \(|\mathcal{Y}| > k_b\) in the min-hash graph construction. Let \({\cal D}\) be a distribution over sets of size \(n_R'\) with the Geometric Collision Property. For any set \(T\) of size \(k_b \in \Omega(\kappa)\), there exist \(a, b \in [k_b]\) such that with probability \(1-O(\frac{k_b \cdot \lg^3(\kappa)}{\sqrt{n_R}})\) over the choice of \(H_2\), the following holds:
For all \(r \not \in [a+s, b]\), \(D_{T,r}\) is negligible, where \(s = O(\lg\lg \kappa)\).
For all \(r \in [a, b]\), \(e^{-\epsilon/3} E^{n'_R}_{k_b,r} \leq D_{T,r} \leq e^{\epsilon/3} E^{n'_R}_{k_b,r}\).
The proof is found in Section 7.3.6.
We are not quite done yet. Using the above lemma, we can only reduce the failure probability to roughly \(1/\sqrt{n_R}\), whereas we would like it to be negligible. To achieve this, we split the “good” iterations into \(u\) bundles, where \(u = \lg\lg \kappa\) is a small super-constant number, and argue that with overwhelming probability at least one bundle serves as a good noise source. Note that the hash outputs in different bundles are independent, so the probability that all \(u\) bundles fail is roughly \((1/\sqrt{n_R})^u\), which is negligible. For this, we set the parameter \(k_b = k_g/u\), where \(k_g\) is the number of good iterations.
We conclude the proof by showing that \(R'\) indeed has the Geometric Collision Property. It is not hard to see that the uniform distribution over all sets \(R'\) of size \(n'\) drawn from a universe of size \(n' \cdot \ell\) (where \(\ell \in \Omega(n'^3)\)) satisfies the property. It would seem, therefore, that we could take this as our secret distribution and the analysis would be complete. Unfortunately, even when the distribution is over sets of size \(n'\) chosen uniformly at random from the universe, the analysis is not straightforward. The difficulty stems from the fact that the “noise” in the protocol is tied to the input itself. Therefore, if information about the input is leaked in any other part of the protocol, the noise distribution changes and may no longer satisfy the required properties. Specifically, in our case, learning the number of matches across the two parties' sets with respect to some of the hash functions leaks information about the secret set of the honest party (since the secret set affects those counts).
We first observe that our initial min-entropy in the distribution over secret sets \(\mathcal{R}\) is high (approximately \(\frac{8n}{9} \lg \ell + 2n\)) and that the entire information leaked about \(R\) from the counts of the iterations that are not \(\theta\)-good is small. We can lower-bound the remaining min-entropy in \(\mathcal{D}\), therefore, using the weak chain rule for min-entropy [150], Lemma 2.2.
If we want to use the weak chain rule to lower-bound the remaining min-entropy with all but \(2^{-\kappa}\) probability, however, we need to take a hit of \(\kappa\) in the min-entropy. Recall that each individual element in \(R\) can be viewed as being chosen from a set of size \(\ell\) and thus has min-entropy of at most \(\lg(\ell) \ll \kappa\). Thus, after applying the weak chain rule and losing more than \(\kappa\) bits of min-entropy, we can have certain elements with only constant min-entropy, implying that collisions are likely in those positions. So the weak chain rule, while leaking only a small number of bits overall, can ruin the geometric collision property. Even worse, the min-entropy definition does not rule out the case in which all elements of \(R\) (i.e., the marginal distributions over each element in \(R\)) have only constant min-entropy while the total min-entropy in \(R\) remains high.
This phenomenon has been previously observed and studied in the literature [144]. One way to deal with such a counter-intuitive situation is to actually leak a small amount of additional information, known as “spoiled” bits. This lowers the total min-entropy in \(R\), but ensures that a large fraction of blocks in \(R\) still have high min-entropy of at least \(1.5\lg(n)\). We extend the techniques of [144] to produce spoiling leakage so that the min-entropy in \(R\) stays high in our protocol. We discuss more details about the strong chain rule in the next section.
Our strong chain rule considers min-entropy where leakage functions \(\ell_1(\cdot), \ldots, \ell_n(\cdot)\) are additionally considered. We describe our theorem in a general way, and we hope that it may find future applications in leakage-resilient cryptography.
Recall that \(R\) is the set of secret items in the min-hash protocol. Here, we treat \(R\) as a sequence of block-by-block random variables \(R = (R_1,\dots,R_n)\), associated with (potentially randomized) leakage functions \(\ell_1(\cdot), \ldots, \ell_n(\cdot)\) with randomness \(\rho_1, \ldots, \rho_n\). One can think of the blocks as arriving in a streaming fashion in the order \(R_1, R_{2}, \ldots, R_n\).
Loosely speaking, the properties we require of the leakage functions are that the \(i\)-th leakage \(\ell_i\) can be computed given \((R_i, \rho_i)\), and all the outputs of \((\ell_{i+1}, \ell_{i+2}, \ldots, \ell_n)\). We also require that the total number of valid sequences of leakages from \(\ell_1(\cdot), \ldots, \ell_n(\cdot)\) should be sufficiently small (see Property 1 in Theorem 58 below).
Our theorem below states the existence of a spoiling function \(f(\cdot)\) with certain properties, as well as properties of the random variables \((R_1,\dots,R_n)\) and \((\rho_1, \ldots, \rho_n)\) conditioned on the output of the spoiling function \(f(R)\).
The properties of \((R_1,\dots,R_n)\) and \((\rho_1, \ldots, \rho_n)\) are roughly the following: (1) There exist disjoint sets \(V, W\) such that \(V \cup W = [n]\) that are determined by \(f(R)\). (2) Blocks \(\{R_i\}_{i \in V}\) have high min-entropy conditioned on \(f(R)\). (3) Blocks \(\{R_i\}_{i \in W}\) have small support size (low max-entropy) conditioned on \(f(R)\). (4) For \(i \in V\), the random strings \(\rho_i\) are uniform random and independent conditioned on \(f(R)\). (See Properties (5)-(8) in Theorem 58).
The properties of \(f(\cdot)\) are roughly the following: (1) The failure probability (outputting \(\bot\)) is small. (2) As long as the total number of valid sequences of leakages from \(\ell_1(\cdot), \ldots, \ell_n(\cdot)\) is sufficiently small, the image size of \(f\) is small. This property ensures that we do not lose too much of the total min-entropy of \(R\) by releasing \(f(R)\). (3) The leakages \(\{\ell_i(\cdot)\}_{i \in W}\) can be computed given \(f(R)\). (See Properties (2)-(4) in Theorem 58.)
The main difference between our spoiling lemma and prior ones is that our min and max entropy guarantees on \(R = (R_1,\dots,R_n) \mid f(R)\) hold even with respect to additional leakage \(\{\ell_i\}_{i \in W}\) which is included in the spoiled bits \(f(R)\).
Theorem 58 (Block structures with few bits spoiled and leakage). Let \(\mathcal{U} = U_1 \times \cdots \times U_n\) be a fixed universe and \(R = (R_1,\dots,R_n)\) be a sequence of (possibly correlated) random variables where each \(R_i\) is over \(U_i\) (and all are disjoint) and \(|U_i| = \ell\) for all \(i\). Let \(\rho_1, \ldots, \rho_n\) be a sequence of uniformly random strings over \(\{0,1\}^m\) and let \(\ell_1(\cdot), \ldots, \ell_n(\cdot)\) be leakage functions. Then, for any \(\epsilon \in (0, 1)\), any \(\delta> 0\) and any \(c \in [2^\delta, \ell/2^\delta]\), there exists a spoiling leakage function \(f(R)\) that satisfies the following properties.
A sequence \(\beta_1, \ldots, \beta_n\) is valid if for all \(i \in V\), \(\beta_i = \bot\) and for all \(i \in W\), \(\beta_i = \ell_i(R_i, \rho_i, \beta_{>i})\), where \(\beta_{>i} = (\beta_{i+1}, \ldots, \beta_n)\). We require that the number of valid sequences \(\beta_1, \ldots, \beta_n\) is at most \(B\).
It holds that \(\Pr_{R}[f(R) = \bot] \leq \epsilon n\).
\(|Im(f)| \leq B \cdot \left ( 2(\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^{n}\).
Conditioned on any \(y \in Im(f) \setminus \{\bot\}\), for all \(i \in W\), the leakage \(\ell_i(R_i, \rho_i, \beta_{>i})\) can be computed from \(y\). Here, \(\beta_j = \bot\) if \(j \in V\) and \(\beta_j = \ell_j(R_j, \rho_j, \beta_{> j})\) otherwise.
Let \(Im(f)\) denote the image of \(f\). Every \(y \in Im(f) \setminus \{\bot\}\) specifies two disjoint sets \(V\) and \(W\) such that \(V \cup W = [n]\).
Conditioned on any \(y \in Im(f) \setminus \{\bot\}\), for every \(i \in V\), every element in distribution \(R_i \mid R_{<i}\) has low probability weight, i.e., \[\begin{aligned} \forall y \in Im(f) \setminus \{\bot\}, \forall r \mbox{ s.t. } f(r) = y, \forall i \in V:~~ \Pr\left[R_i = r_i~ \Bigg |~R_{<i} = r_{<i},~ y\right] \leq \frac{2^{\delta}}{c}. \end{aligned}\]
Conditioned on any \(y \in Im(f) \setminus \{\bot\}\), for every \(i \in W\), it holds that \(R_i \mid R_{<i}\) has small support size, i.e., \[\begin{aligned} \forall y \in Im(f) \setminus \{\bot\}, \forall r \mbox{ s.t. } f(r) = y, \forall i \in W: \\ ~~\left | \left \{ r_i : \Pr[R_i = r_i \mid R_{<i} = r_{<i}, y] > 0 \right \} \right | \leq 2^{\delta} \cdot c. \end{aligned}\]
\(\{\rho_i\}_{i \in V}\) are distributed independently and uniformly at random conditioned on \(f(R)\).
The proof is found in Section 7.3.7. Typically, one would like to set \(c\) as large as possible while ensuring that the size of \(V\) remains above some threshold. The achievable tradeoffs between \(c\) and \(|V|\) are determined by the min-entropy of \(R\) before the spoiling bits \(f(R)\) are released. For our applications, we require \(c = n^{1.5}\) and \(|V| \geq n/3\). In Section 7.3.8, we show that our min-entropy assumption on \(R\) implies that this parameter setting is achievable.
We compare our noisy min-hash (NMH) protocol \(\pi_{\mathsf NMH}\) with the current state-of-the-art approaches: sketch-flip-merge (SFM) [48] and the generalized randomized response mechanism (GRR) [66]. In particular, we evaluate the trade-off between communication cost and cardinality estimation accuracy, while achieving (almost) the same level of privacy guarantee, as follows:
For a given privacy parameter \(\epsilon\) (with \(\delta\) fixed to \(2^{-40}\)), we choose the right amount of noise for our protocol and vary the number of hash functions \(k\) to measure the communication cost and estimation accuracy trade-off.
We then compare these accuracy results with the state-of-the-art protocols using the same communication and privacy parameter.
We use the relative root mean squared error (RRMSE) of the union size as our accuracy metric. This choice is primarily to ensure a fair comparison between our protocol and the SFM protocol. Further details on this matter are provided in the discussion of the SFM protocol below.
Figure [fig:mh_sfm_grr_acc] presents the resulting comparison of NMH, SFM, and GRR; we describe each baseline and the measurement setup below.
We calculate the communication of our protocol \(\pi_{\mathsf NMH}\) with \(\mathcal{F}_{{\mathsf psi\mbox{-}ca}}\) instantiated with the PSI-CA protocol described in Figure [pro:psica]. It is a variant of the protocol in [140], where \(H_2\) is applied to \(\{a'_i\}_{i \in [v]}\) and \(\{b'_j\}_{j \in [w]}\) in order to reduce the communication. The original protocol is secure under the DDH assumption in the random oracle model. Essentially the same security proof found in the original paper can be applied to show the security of this variant, when \(H_2\) is also modeled as a random oracle.
Protocol : Private Set Intersection Cardinality
Let \(G\) be a multiplicative group of order \(q\). Let \(H_1:\{0,1\}^* \to G\) and \(H_2:\{0,1\}^* \to \{0,1\}^\lambda\) be hash functions.
Input: \(P_1\) has \(C = \{c_1, \ldots, c_v\}\) and \(P_2\) has \(S = \{s_1, \ldots, s_w\}\).
\(P_1\) samples a random exponent \(R_c \leftarrow\mathbb{Z}_q\). For \(i \in [v]\), \(P_1\) computes \(a_i = H_1(c_i)^{R_c}\). \(P_1\) sends \((a_1, \ldots, a_v)\).
\(P_2\) samples \(R_s \leftarrow\mathbb{Z}_q\) and computes \((a'_1, a'_2, \ldots, a'_v) = {\mathsf shuffle}(a_1^{R_s}, \ldots, a_v^{R_s}).\) \(P_2\) also computes \((b_1, b_2, \ldots, b_w) = {\mathsf shuffle}(H_1(s_1)^{R_s}, \ldots, H_1(s_w)^{R_s}).\) \(P_2\) sends \((H_2(a'_1), \ldots, H_2(a'_v))\) and \((b_1, \ldots, b_w)\) to \(P_1\).
\(P_1\) computes \((b'_1, \ldots, b'_w) = (b_1^{R_c}, \ldots, b_w^{R_c})\). \(P_1\) outputs the following value: \[| ~\{H_2(a'_1), \ldots, H_2(a'_v)\} \cap \{H_2(b'_1), \ldots, H_2(b'_w)\}|.\]
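To illustrate the data flow of this protocol, the following sketch models group elements by their discrete logarithms modulo a prime \(q\), so that exponentiation becomes multiplication mod \(q\). This captures only the commutative-blinding structure (a real implementation would use an elliptic-curve group as costed above), and all parameters and names are illustrative.

```python
# A data-flow sketch of the PSI-CA protocol with group elements
# represented "in the exponent" (toy modeling, not a secure instantiation).
import hashlib, random

q = 2**127 - 1   # a Mersenne prime standing in for the group order

def H1(x: bytes) -> int:
    """Random-oracle hash to a 'group element', represented by its exponent."""
    return int.from_bytes(hashlib.sha256(b"H1|" + x).digest(), "big") % q

def H2(e: int) -> bytes:
    """Short (80-bit) hash applied to blinded elements to save communication."""
    return hashlib.sha256(b"H2|" + e.to_bytes(16, "big")).digest()[:10]

def psi_ca(C, S, rng=random.Random(7)):
    R_c, R_s = rng.randrange(1, q), rng.randrange(1, q)
    a = [H1(c) * R_c % q for c in C]                   # P1 -> P2: H1(c_i)^{R_c}
    a2 = [x * R_s % q for x in a]; rng.shuffle(a2)     # P2: re-blind + shuffle
    b = [H1(s) * R_s % q for s in S]; rng.shuffle(b)   # P2 -> P1: H1(s_j)^{R_s}
    left = {H2(x) for x in a2}                         # P2 sends H2(a'_i)
    right = {H2(x * R_c % q) for x in b}               # P1: b_j^{R_c}, then H2
    return len(left & right)

C = [f"u{i}".encode() for i in range(100)]
S = [f"u{i}".encode() for i in range(60, 160)]
print(psi_ca(C, S))   # expect 40: the overlap u60 ... u99
```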
We briefly sketch the security proof here, deferring the full proof to the original paper [140]. We first describe the simulator for a corrupted \(P_1\). Let \(t\) be the protocol output (i.e., the set intersection cardinality). The simulator chooses \(t\) random indices \((i_1, \ldots, i_t)\) (resp., \((j_1, \ldots, j_t)\)) from \([v]\) (resp., \([w]\)). To prepare \((b_1, \ldots, b_w)\), the simulator replaces \(H_1(s_{j_k})^{R_s}\) (for \(k \in [t]\)) with \(H_1(c_{i_k})^{R_s}\), and the remaining values \(H_1(s_h)^{R_s}\) are simulated with random group elements. Since \(H_1\) is a random oracle (i.e., for an input \(x\), we have \(H_1(x) = g^r\) for a random \(r\)), this simulation is indistinguishable under the DDH assumption. When \(P_2\) is corrupted, the first message \(\{a_i = H_1(c_i)^{R_c} : i \in [v]\}\) is simulated by random group elements; this simulation is also indistinguishable under the DDH assumption. The above PSI-CA protocol exchanges \(v + w\) elliptic curve points and \(w\) hashes, resulting in \((v + w)\cdot 256 + w\cdot 80\) bits. In protocol \(\pi_{\mathsf NMH}\), the parties run this PSI-CA protocol with \(v = w = k + 2\ell_B\).
While our main focus is on comparing the accuracy of Jaccard Index estimation, in the absence of available code we had to rely on SFM's analysis of the relative root mean squared error (RRMSE) of cardinality estimation rather than Jaccard Index estimation. This poses challenges in evaluating the accuracy of the Jaccard Index for SFM. In particular, although the Jaccard Index can be estimated as the ratio of the estimated intersection size to the estimated union size, its RRMSE cannot be directly calculated from the RRMSEs of the intersection and union sizes. This is because the two estimates are dependent; we can only conjecture that the estimate derived through division is likely to have a worse RRMSE.
In the end, giving a slight advantage to SFM, we decided to focus only on the accuracy of estimating the union size. In our case, the union size was estimated from the Jaccard Index output by the min-hash protocol together with \(n_A\) and \(n_B\). Following the approach of SFM [48], we perform \(m = 1000\) estimates to measure the accuracy in the form of the relative root mean squared error (RRMSE); that is, letting \(\hat{n}_{U,1},\dots,\hat{n}_{U,m}\) be the union size estimates and \(n_U\) be the real union size, we define \(\text{RRMSE}(\hat{n}_{U,1},\dots,\hat{n}_{U,m}; n_U)\) to be \(\frac{1}{n_U}\sqrt{\frac{1}{m}\sum_{i=1}^m(\hat{n}_{U,i}- n_{U})^2}\). To match the communication complexity, we set the sketch of SFM to be a \((B \times P)\)-bit matrix such that \(B \cdot P = 592w\) and \(P = 24\).
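For reference, the metric can be written as a small helper, together with one natural union-size estimator implied by the identities \(n_I = J \cdot n_U\) and \(n_I + n_U = n_A + n_B\), which give \(\hat{n}_U = (n_A + n_B)/(1 + \hat{J})\); we do not claim this is the exact estimator used in the reported experiments.

```python
# Self-contained helpers for the accuracy metric above; the union-size
# estimator is an illustrative choice, not necessarily the one used
# in the experiments.
from math import sqrt

def rrmse(estimates, true_value):
    """Relative root mean squared error of a list of estimates."""
    m = len(estimates)
    return sqrt(sum((e - true_value) ** 2 for e in estimates) / m) / true_value

def union_size_from_jaccard(j_hat, n_a, n_b):
    """Estimate |A ∪ B| from an estimated Jaccard Index and the set sizes."""
    return (n_a + n_b) / (1 + j_hat)
```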
We also compare our protocol with the generalized randomized response MinHash protocol in [66]. Following their guidance in experiments, we select the range of their hash function to be a single bit and let their protocol use \(592w\) hash functions to match our communication cost.
Since their actual protocol would take too long to run for large \(n\) and \(k\), we wrote code simulating the error based on their privacy and utility analysis in order to accommodate the large number of hash functions. As with the other protocols, we perform 1000 estimates to lower the variance of the errors. To align with our SFM comparison, we report the relative root mean squared error of the union size.
In Figure [fig:mh_sfm_grr_acc], we demonstrate the comparison of NMH, SFM, and GRR. We set \(n = 10^6\). The results show that our error is consistently smaller than both SFM and GRR for a reasonable range of communication costs, which corresponds to the usage of \(k \in [100, 500]\) hashes for our noisy min-hash protocol. Specifically, as we increase the number of hash functions, both our protocol and SFM achieve increased accuracy, due to larger sketch sizes that better represent the input sets. On the other hand, while GRR performs well with smaller communication, adding more communication becomes counterproductive. This is because each additional bit in their protocol corresponds to an extra hash function output, which increases the noise needed to achieve the same privacy guarantee. Finally, both GRR and our protocol exhibit spikes in the accuracy trends, corresponding to crossover points where a significant amount of additional noise is required to keep \(\delta\) from exceeding \(2^{-40}\).
As a concluding remark, it is noteworthy that both the SFM and GRR protocols disclose the entire noisy sketch, revealing aggregate information about a party's input set. We note that differential privacy does not prohibit revealing aggregate information about the inputs; rather, it mandates that individual contributions not be discernible in the output. In contrast, our min-hash protocol employs secure two-party computation and discloses no information about the input set beyond the final \((\lg k)\)-bit output. Depending on the specific use case, this observation may need to be taken into account when deciding which scheme to use.
We empirically evaluate the DDP guarantee for the noiseless min-hash protocol in the public hash setting, in which each element in the secret set has high min-entropy. As before, we set \(n_A = n_B = 10^6\). Figure [fig:mh_k_vs_eps] shows how the privacy parameter \(\epsilon\) changes with the number \(k\) of iterations. Although we demonstrate the results for \(JI \leq 0.5\), the results for \(JI > 0.5\) are similar. We omit the data points where \(\epsilon\) is greater than 5, which happens when \(k\) is small, and focus on the more meaningful \(\epsilon\) range. We observe the following:
Roughly speaking, when the number \(k\) of iterations of the min-hash protocol is reasonably large (at least 500), the noiseless min-hash protocol provides a decent level of DDP with privacy parameter \(\epsilon\in [0.5, 5]\).
Higher values of \(k\) correspond to improved privacy parameters. Note that as \(k\) grows, more iterations will be \(\theta\)-good. Since hashes of non-intersecting items work as noise in \(\theta\)-good iterations, more \(\theta\)-good iterations essentially amount to adding more noise, therefore offering a better privacy guarantee.
When \(k\) is the same, the best privacy parameter is achieved when \(JI\) is around \(0.5\). This is because the likelihood that a hash function is \(\theta\)-good is maximized when balancing the two conditions stipulated in Definition 52: (i) the hash of an intersecting item should be the minimum hash value, and (ii) the minimum hash value is neither too large nor too small.
The proliferation of fake and misleading information online has had significant impact on political discourse [151] and has resulted in violence [152]. Large services like Facebook and YouTube have begun to remove or label content that they know to be fraudulent or misleading [153], [154], through a combination of a manual process of reviewing posts/videos and automated machine learning techniques.
However, on end-to-end encrypted messaging services (EEMS), like Signal, WhatsApp, Telegram, etc., where so-called “fake news” is also shared, such review is impossible. At no point do the providers see the plain-text, unencrypted contents of messages transmitted through their systems and thus cannot identify and remove offending material. Such platforms must instead rely on their users to identify and report malicious content. Even then, identifying and removing users who repeatedly post misleading and dangerous content may still be difficult because some platforms, like Signal, also hide the path the message took, so identifying and addressing the original source of the misinformation may not be possible.
Tyagi et al. [155] introduced a first approach for overcoming this challenge, allowing an EEMS to effectively trace back an offending message to find the originator based on a user complaint. The traceback procedure also ensures that all other messages remain private and that innocent parties cannot be blamed for originating the offending messages.
While innovative, there are two notable shortcomings of Tyagi et al.’s traceback scheme. First, it requires extensive “housekeeping” on the part of the platform that scales as the number of messages in the system. Second, a single, possibly malicious, complaint can trigger a traceback and thus reveal the message contents as well as the history of prior recipients, which is counter to the goals of EEMS to maintain the privacy of users communicating through this system. One malicious user (e.g., a government agent) can reveal the source of a piece of information (e.g., a leak) that they have received, violating the privacy of the sender (e.g., the leaker) by issuing a single complaint to the EEMS. While it may be possible to apply manual review to these complaints, the scale of possible complaints could make this impractical. Several follow-on papers [156], [157] show how to achieve source-tracking for EEMS to identify the source of a message without relying on traceback of all intermediate recipients. However, these systems still allow a single complainer to trigger the source-tracking.
In this paper, we aim to resolve this conflict between privacy and the ability to identify misinformation in EEMSs by first observing that “fake news” messages are, by definition, viral and are thus received, and likely complained about, by a large number of users. Private messages, such as leaks, on the other hand, are likely to be targeted and are thus only received by a small number of users; indeed, any message received by only a few users is inherently less impactful overall and more likely deserving of privacy protections. This leads to a more nuanced approach for identifying fake news: apply a threshold approach to complaint management, whereby only viral fake news would overcome the threshold and trigger an audit.
Counting the number of complaints in a private manner is a non-trivial problem if the privacy of the EEMS' clients is to be maintained before the threshold is reached, even given available cryptographic solutions. For example, a homomorphic encryption solution (e.g., [88]) would enable checking and updating counts for each message, but the access patterns of clients checking and updating counters could reveal how many complaints a message receives even if the threshold is not reached. Oblivious RAM (ORAM) (e.g., [158], [159]) could be used to protect the access patterns, but it has high computational overhead and usually assumes that clients may share secrets and are not malicious. Private Information Retrieval (PIR) does not assume clients are trusted, but it has different scalability challenges and does not address the problem of obliviously updating a counter without revealing which message is being complained about.
We propose a different approach we call a Fuzzy Anonymous Complaint Tally System (FACTS). FACTS maintains an (approximate) counter of complaints for each message, while also ensuring that, until a threshold is exceeded, the status of these counters is kept private from the server and all users who have not received the message. FACTS builds on top of any end-to-end encrypted messaging platform, incurring only small overhead for message origination and forwarding. In particular, FACTS maintains the communication pattern of the underlying messaging system, requiring no new communication or secrets between users even for issuing complaints.
To avoid the high overheads of existing solutions, FACTS uses a novel oblivious data structure we call a collaborative counting Bloom filter (CCBF). This data structure allows us to obliviously increment and query approximate counters on millions of messages while only requiring 12MB of storage. Moreover, incrementing a counter only requires flipping one bit on the server and only uses the minimal communication of \(\log{|T|}\) bits to address a single bit in the server-stored bit vector \(T\). While the resulting counters are only approximate, we show experimentally and analytically that we are able to enforce the threshold on complaints with good accuracy, namely, below 10% error in theory, and below 3% in most realistic deployment scenarios.
The contributions of this paper are as follows:
We develop a collaborative counting Bloom filter, a new oblivious data structure for counting occurrences of a large number of distinct items.
We use this data structure to instantiate a provably-secure system, FACTS, for privacy-preserving source identification of fake news in EEMSs.
Finally, we perform experiments to show the accuracy and overhead of FACTS in realistic deployment scenarios.
FACTS is built on top of an end-to-end encrypted messaging system (EEMS). For this work, we focus on the setting of server-based EEMSs with a server \(S\) that enables (authenticated) encrypted communication between the system users. Examples of such EEMSs include Signal and WhatsApp, among many others.
To make sure that FACTS is compatible with existing encrypted messaging systems, we make the following performance requirements:
Messaging costs: Originating and forwarding messages should incur little computational overhead for both users and the server over the standard procedure in the encrypted messaging system,
Server storage: The storage overhead of the server should be small (i.e., a single table not exceeding a few MBs),
User costs and requirements: Issuing complaints requires a small amount of communication and computation from the complaining user, and no cost to other users. Moreover, complaints can not require direct communication between users or require the users to have any apriori shared secrets that are not known to the server.
Complaint throughput: Issuing complaints may be slower than standard forwarding of messages, but the system must be able to handle millions of complaints per day.
To ensure privacy of messages and complaints, FACTS requires that complaints remain hidden from the server (and colluding clients) until a threshold of complaints is reached. Additionally, FACTS ensures the integrity of the complaint process, guaranteeing the correctness of complaint counts and of the identity of the revealed originator once the threshold is reached. Specifically, FACTS satisfies the following security guarantees:
Message privacy: All messages remain end-to-end encrypted and private from the server and non-receiving clients until a threshold of complaints is reached and an audit is issued. Moreover, even after the audit, only information about the audited message is revealed.
Originator integrity: Once a threshold of complaints is reached on a message, FACTS will only identify information about the true originator of the message. In particular, no innocent party can be framed as the originator.
Complaint privacy: The server and any colluding clients who have not received a message \(x\) should have no information about the number of complaints on \(x\). In particular, the server should not be able to tell what message is being complained about.
Complaint integrity: A set of malicious clients should not be able to alter the number of complaints on any message \(x\). Specifically, they cannot block or delay complaints, and cannot (significantly) increase the number of complaints on a message \(x\) except through the legitimate complaint process.
Recall that our goal is to enable privacy-preserving counters to tally complaints on each message \(m\). This suggests an immediate solution where the server stores an encrypted counter for each message, and clients interact with the server to increment the counter and check the threshold. While implementing such counters is certainly possible using homomorphic encryption [160] or standard secure computation techniques [25], [113], [161], the problem is that the access pattern of clients' updates to counters leaks information to the server by revealing the complaint histogram. This suggests a further modification to store the counters inside an oblivious RAM (ORAM) [158] to hide such access patterns from the server. However, in our setting this would require a multi-client ORAM [85], [86], [87], which incurs significant performance penalties, including at least \(O(\log{n})\) communication overhead when there are \(n\) distinct messages. Moreover, this would require direct communication between clients to maintain their ORAM state and, additionally, would provide no security against malicious clients.
In FACTS, we take a different approach. Instead of relying on encryption to hide the counters from the server, we hide the counters in plain sight by mixing together the counters for all the messages in a way oblivious to the server. To make this possible, we relax the functionality of FACTS to only enforce approximate, rather than exact, thresholds. That is, the threshold will be triggered on a message \(x\) after \((1\pm \epsilon) t\) complaints for a small error \(\epsilon\). Making this relaxation allows us to use a sketch-based approach for counting the complaints.
To achieve this functionality obliviously, we develop a collaborative counting Bloom filter (CCBF). This data structure consists (roughly) of a collection of Bloom filters, one for each message, where the Bloom filters corresponding to different messages are mixed together to hide them from the server. Specifically, the server stores a table of \(s\) bits. A random subset of \(v\) bits (\(V_x\)) is assigned to each message \(x\) at origination; these bits will be used for tracking complaints about this message (for intuition, one can think of these bits as forming a Bloom filter for storing the set of complaints about the message). We stress that the server has no information about which bits correspond to which messages.
To complain about a message \(x\), a user who has received \(x\) can find the corresponding bit locations and will (attempt to) flip one of the bits from 0 to 1. However, allowing users to flip any bit they choose would allow malicious users to significantly accelerate complaints for a message they wish to disclose. To prevent this behavior, we restrict each client to only be able to flip (i.e., complain on) a small set \(U_C\) of locations (of size \(u\)). Thus, to complain about a message \(x\), a client first identifies the set \(V_x\) of bits corresponding to \(x\). Then, she checks how many of these bits have already been set to 1, and if this exceeds a specified threshold, notifies the server to trigger an audit. If the threshold for \(x\) is not yet exceeded, the client checks whether any of the 0 bits in \(V_x\) are in her set \(U_C\), and if there are any such bits, she flips one of them (chosen at random) from 0 to 1. Otherwise, the user still flips a random bit in her set \(U_C\), so the server cannot discern anything about the message being complained on. We prove in Section 6.4 that the actual number of complaints necessary to trigger an audit can be calculated with high precision, allowing us to (approximately) enforce the desired threshold.
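The following Python sketch captures the mechanism just described. The table size, \(v\), \(u\), and the hash-based derivation of \(V_x\) and \(U_C\) are illustrative placeholders; the real construction binds these sets to messages and clients cryptographically and runs the threshold check as specified in Section 6.4.

```python
# A minimal sketch of the collaborative counting Bloom filter mechanism.
import hashlib, random

def indices(label: bytes, count: int, table_size: int):
    """Derive `count` pseudorandom table positions from a label (illustrative)."""
    out, ctr = set(), 0
    while len(out) < count:
        d = hashlib.sha256(label + ctr.to_bytes(4, "big")).digest()
        out.add(int.from_bytes(d[:8], "big") % table_size)
        ctr += 1
    return out

class CCBF:
    def __init__(self, s, v, u):
        self.T = bytearray(s // 8)   # server-held bit table of s bits
        self.s, self.v, self.u = s, v, u

    def _get(self, i):
        return (self.T[i // 8] >> (i % 8)) & 1

    def _flip(self, i):
        self.T[i // 8] |= 1 << (i % 8)   # server's only operation: set one bit

    def complain(self, msg_id: bytes, client_id: bytes, rng=random):
        """Client-side complaint logic, run against the server's table."""
        V_x = indices(b"msg|" + msg_id, self.v, self.s)
        U_C = indices(b"cli|" + client_id, self.u, self.s)
        zeros = [i for i in V_x if not self._get(i)]
        usable = [i for i in zeros if i in U_C]
        # Flip a usable zero bit of V_x if one exists; otherwise flip a random
        # bit of U_C, so the server sees a single bit-flip either way.
        self._flip(rng.choice(usable) if usable else rng.choice(sorted(U_C)))
        return sum(self._get(i) for i in V_x)   # approximate complaint count

ccbf = CCBF(s=2**20, v=64, u=4096)   # small toy table for illustration
for c in range(200):
    cnt = ccbf.complain(b"viral-msg", f"client{c}".encode())
print(f"bits set for the message after 200 complaints: {cnt}")
```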
In order to present FACTS, it is also important to recognize what our system does not do.
First, unlike some prior work, e.g., [162], [163], FACTS does not attempt to automatically detect misinformation. Instead, it relies on users reporting it when they see it. This reliance on users has inherent benefits and limitations. While our system is not subject to the kinds of machine-generated false positives that can arise from, e.g., hash collisions [164], our model is inherently vulnerable to any sufficiently large group of dishonest users, who could trigger an audit on a benign message. This is why we suggest the possibility of a manual human review process on message contents before the service provider takes any action on an audited message; see Section 6.8.0.5.
Second, due to the approximate nature of FACTS, it works most effectively for relatively large thresholds, say in the hundreds and above. For our application to fake news detection, this is reasonable as such messages are likely to garner a large number of complaints, and indeed this was our main motivation for this paper. We leave as interesting possible future work to implement a system supporting smaller thresholds, even as small as 2, efficiently.
One additional functionality limitation is that, as is true with any application using Bloom filters, the CCBF data structure can fill up once too many complaints have been registered. To deal with this issue it is necessary to periodically reset the counters and refresh the CCBF data structure. We refer to each such refresh period as an epoch, and in the remainder of the paper only present algorithms for a single epoch.
Finally, on the security side, an important limitation is that FACTS reveals meta-data on who issues complaints (but not what message they complain on). It is important to consider what is revealed by this meta-data. By observing the timing of messages and complaints, the server can make some inferences about what messages users are sending and complaining about. For example, suppose that the server sees that \(A\) sends a message to \(B\), and then \(B\) issues a complaint. Then, it may be reasonable for the server to assume that \(A\) has sent the message which \(B\) complained about, even though this is not directly leaked by our system. Nonetheless, our definition guarantees that the server cannot be certain that this is indeed the case. We note that the messaging meta-data is already a byproduct of the underlying EEMS platform. FACTS only adds complaint meta-data to this leakage; see Section 6.8 for further discussion.
The remainder of the paper is organized as follows. In Section 6.2, we introduce some of the notation we use throughout the paper. Then, in Section 6.3 we describe the syntax and functionality of FACTS. In Section 6.4 we present and analyze our main building block, the CCBF data structure. Then, in Section 6.5, we show how to use a CCBF to instantiate FACTS. We demonstrate the accuracy and performance through experimental evaluation in Section 6.6 and then prove the security of FACTS in Section 6.7. Finally, we describe some variants of FACTS and directions for future work in Section 6.8 and present related work in Section 6.9.
We use \([n]\) to denote the set \(\{1, \ldots, n\}\). We write \(x \gets X\) to indicate that the value \(x\) is sampled uniformly at random from the set \(X\). We use \(\lambda\) to denote a statistical security parameter and \(\kappa\) to denote a computational security parameter. We also assume the existence of a hash function \(H:\{0,1\}^* \rightarrow \{0,1\}^*\) which is modeled as a random oracle. We let \({\mathsf poly}(\cdot)\) denote a polynomial function and \(\mathsf{negl}(\cdot)\) denote a negligible function.
In this section, we present the syntax for FACTS and describe how FACTS is used. We show how to instantiate FACTS in Section 6.5.
We assume that each user \(A\) has a unique identifier \(ID_A\), and that the server can authenticate these IDs. (We will abuse notation to use \(A\) to represent the user and also the id \(ID_A\)). We also assume that the server has an identifier \(ID_S\) (we will denote this by \(S\)) that can be authenticated by all users.
Additionally, we assume that the underlying end-to-end encrypted messaging system (EEMS) offers methods \(\textrm{send}(A,B,x)\) and \(\textrm{receive}(A,B,x)\) for sending and verifying a message \(x\) sent from user \(A\) to user \(B\). Moreover, we assume that this communication is encrypted and authenticated. In particular, \(\textrm{receive}\) verifies that the received message was sent by \(A\) and was not modified in transit. Importantly, we do not assume that this platform is anonymous, instead assuming that the full messaging history, i.e., who sent a message to whom and the size of that message, is available to the server.
FACTS is a tuple of protocols \(\textsf{FACTS}=(\textsf{Setup}, \textsf{SendMsg}, \textsf{RcvMsg}, \textsf{Complain}, \textsf{Audit})\). The first is used to set up FACTS, the next two are used to send and verify messages, while the last two methods are used to issue complaints and audit received messages.
\(\textsf{Setup}(c)\): This takes as input an upper bound \(c\) on the total number of users and initializes the FACTS scheme for \(c\) users.
\(\textsf{SendMsg}(A, B,\tag_x,x)\): This method is used by a user \(A\) to send a message \(x\) to another user \(B\). This may be a new message originated by \(A\) (indicated by \(\tag_x = \perp\)) or a forward of a previously received message.
\(\textsf{RcvMsg}(A,B,\tag_x, x)\): This algorithm is run by \(B\) upon receiving a message \((\tag_x, x)\) from \(A\).
This algorithm checks whether \(\tag_x\) is indeed a valid tag generated by \(A\) on message \(x\). If this is the case, then \(B\) accepts the message, otherwise he rejects the message.
\(\textsf{Complain}(C, \tag_x, x)\): This protocol is run by a user \(C\) to complain about a received message \((\tag_x,x)\).
\(\textsf{Audit}(C,\tag_x, x)\): This protocol issues an audit of a message \(x\) revealing \((\tag_x, x)\) to \(S\). This will be called by \(C\) when the number of complaints on \(x\) exceeds a pre-defined threshold (with high probability).
The following workflow demonstrates the standard usage of FACTS. To originate a new message \(x\), a user \(A\) runs the \(\textsf{SendMsg}\) protocol with the server \(S\) to create metadata \(\tag_x\). \(\textsf{SendMsg}\) then sends this metadata and the message \((\tag_x,x)\) to the receiving user \(B\) using the messaging platform's \(\textrm{send}\) method. Upon receiving a message \((\tag_x, x)\), \(B\) first locally runs \(\textsf{RcvMsg}(A,B,\tag_x,x)\) to verify that the received message and tag are valid; if this fails, he ignores the message. To forward a received message \((\tag_x,x)\), a user \(A\) runs \(\textsf{SendMsg}\) with the server \(S\) to produce metadata \(\mathsf{tag}'_x\); \(A\) then discards this metadata, and the original message \((\tag_x, x)\) is sent instead using the messaging platform's \(\textrm{send}\) method.
If a user \(B\) receives a message \((\tag_x, x)\) that he considers “fake”, he can use the \(\textsf{Complain}\) protocol to issue a new complaint on this message. After issuing a complaint, \(B\) checks whether the threshold of complaints on \(x\) has been reached. If so, he calls \(\textsf{Audit}\) to trigger an audit on the message \((\tag_x,x)\), revealing \(x\) and the originator of \(x\) to the server \(S\).
We note that users may join and leave during the execution of FACTS as long as the total number of identifiable users does not exceed \(c\).
Our system records complaints in a special data structure which we call a collaborative counting Bloom filter, or CCBF. This data structure shares some of the same basic functionality as a counting Bloom filter [97], [98] or count-min sketch [165], which is to insert elements and compute the (approximate) frequency of a given element.
Our CCBF differs from a usual count-min sketch in that each update operation is accompanied by a user id, and each user can only perform a single update for a given element. This can be thought of as a strict generalization of the normal count-min sketch operations, where the latter may be simulated by our CCBF by choosing a unique user id for each update.
The actual data structure for the CCBF is also far simpler than the 2D array of integers used for a count-min sketch; instead, we store only a single length-\(s\) bit vector \(T\). As a result, our CCBF will have the following desirable properties:
The bit-length of \(T\) scales linearly with the total number of insertions.
Each \(\mathsf{Increment}\) operation (insertion) changes exactly one bit in the underlying bit vector from 0 to 1.
The CCBF is item-oblivious, meaning that after observing an interactive update protocol, the adversary learns which user id made the update, but not which item was updated.
The downside to our CCBF is a far lower accuracy of the count operation in general compared to count-min sketches. However, we will show that, for careful parameter choices, the count operation is highly accurate within a certain range, which is precisely what is needed for the current application.
The CCBF consists of a single size-\(s\) bit vector \(T\) and two operations:
\(\mathsf{Increment}(x, C)\): Increases the count by 1 for item \(x\) according to user id \(C\).
\(\mathsf{TestCount}(x, t)\): Returns true if the number of increments performed so far for item \(x\) is probably greater than or equal to \(t\).
Note that \(\mathsf{TestCount}\) is probabilistic, in the sense that it may return false when the actual count is greater than \(t\), or true when the actual count is less than \(t\). Our construction guarantees the correctness probability is always at least \(\tfrac{1}{2}\), and our tail bounds below show the correctness probability quickly goes towards \(1\) when the actual count is much smaller or larger than \(t\).
The performance and accuracy of the CCBF is governed by three integer parameters \(s\), \(u\), and \(v\), with \(u,v\le s\), which must be set at construction time. The first, \(s\), is the fixed size of the table \(T\). Each user is randomly assigned a static set of exactly \(u\) locations in the table \(T\), i.e., a uniformly random subset of \(\{0,1,2,\ldots,s-1\}\), which we call the user set. Similarly, each possible item \(x\) is assigned a random set of exactly \(v\) bit vector locations, which we call the item set.
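The construction does not fix how these pseudorandom sets are derived; one natural instantiation, sketched below, expands a user id or item tag into distinct table indices via hash-based rejection sampling. This is our own illustration: `DefaultHasher` stands in for the random oracle \(H\), and a real deployment would use a keyed cryptographic hash instead.

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive `k` distinct locations in {0, ..., s-1} from a seed.
/// The seed would be the user id (yielding U_C with k = u) or the
/// message tag (yielding V_x with k = v).
fn derive_set(seed: &str, k: usize, s: u64) -> HashSet<u64> {
    let mut set = HashSet::with_capacity(k);
    let mut ctr: u64 = 0;
    while set.len() < k {
        let mut h = DefaultHasher::new();
        (seed, ctr).hash(&mut h);
        set.insert(h.finish() % s); // modulo bias is negligible for s << 2^64
        ctr += 1;
    }
    set
}
```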
The two CCBF operations can be implemented by a single server and any number of clients. The protocols are simple and straightforward, save for the calculation of the tipping point \(\tau\) which we present in the next subsection.
In these protocols, the size-\(s\) bit vector \(T\) is considered public or world-readable; it is known by all parties at all times. In reality, the server who actually stores \(T\) may send it to the client periodically, or whenever a client initiates an \(\mathsf{Increment}\) or \(\mathsf{TestCount}\) protocol. However, the bit vector \(T\) is only writable by the server.
The \(\mathsf{Increment}(x,C)\) protocol, outlined in [alg:increment], involves the User attempting to set a single bit from 0 to 1 within the item set for \(x\). However, the user is only allowed to write locations within their own user set. So, if there are no 0 bits in the intersection of these two index sets, the user instead changes any other arbitrary 0 bit in its own user set in order to maintain item obliviousness.
Protocol : \(\mathsf{Increment}(x,C)\)
User and server separately compute the list of \(u\) user locations for user \(C\), \(U_C \subseteq \{0,\ldots,s-1\}\).
User computes list of \(v\) item locations for item \(x\), \(V_x \subseteq \{0,\ldots,s-1\}\)
User checks each location in \(U_C\) in the table \(T\) to compute a list \(S_C = \{i\in U_C \mid T[i] = 0\}\) of settable locations for user \(C\)
If \(S_C = \emptyset\), then the user cannot proceed and calls abort.
Else if \(S_C \cap V_x \ne \emptyset\), user picks a uniformly random index \(i\gets S_C \cap V_x\) and sends index \(i\) to server.
Else user picks a random index \(i\gets S_C\) and sends index \(i\) to server.
Server checks that received index \(i\) is in the user set \(U_C\) and that \(T[i] = 0\), then sets \(T[i]\) to 1.
Since the bit vector \(T\) is considered world-readable, the only communication here is the single index \(i\) from client to server over an authenticated channel. In reality, to avoid race conditions, the server will actually send the table entry values \(T[i]\) for all \(i\in U_C\) to the user first and lock the state of the global bit vector \(T\) until receiving the single index response back from the user.
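For concreteness, the client's index-selection logic from [alg:increment] can be sketched as follows. This is a minimal illustration, not the implementation from Section 6.6: the caller is assumed to supply the relevant snapshot of \(T\) and a uniformly random value `coin` from a CSPRNG (the modular reduction is a simplification of uniform selection).

```rust
use std::collections::HashSet;

/// Client side of Increment(x, C): choose the single index to send to the
/// server. `table` is the snapshot of T, `u_c` the user set U_C, `v_x` the
/// item set V_x, and `coin` caller-supplied randomness.
/// Returns None when S_C is empty, i.e., the user must abort.
fn choose_increment_index(
    table: &[bool],
    u_c: &[usize],
    v_x: &HashSet<usize>,
    coin: u64,
) -> Option<usize> {
    // S_C: locations in U_C that are still settable (bit is 0).
    let settable: Vec<usize> = u_c.iter().copied().filter(|&i| !table[i]).collect();
    if settable.is_empty() {
        return None; // abort
    }
    // Prefer a settable location inside V_x; otherwise pick any settable
    // location, so the observed index reveals nothing about the item.
    let preferred: Vec<usize> =
        settable.iter().copied().filter(|i| v_x.contains(i)).collect();
    let pool = if preferred.is_empty() { &settable } else { &preferred };
    Some(pool[(coin % pool.len() as u64) as usize])
}
```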
The \(\mathsf{TestCount}(x,t)\) protocol is not interactive as it only requires reading the entries of \(T\). The precise computation of the tipping point \(\tau\) is detailed in the next section. Note that this computation depends only on the total number of bits set in the bit vector \(T\) as well as the parameters \(s,u,v\); therefore the computation of \(\tau\) is independent of the item \(x\) and could for example be performed once by the server and saved without violating item obliviousness.
This protocol is detailed in [alg:testcount].
Protocol : \(\mathsf{TestCount}(x,t)\)
Use parameters \(s,u,v\) and current value of \(m\) total number of bits set in \(T\), to compute the tipping point \(\tau\).
Compute list of \(v\) item locations for item \(x\), \(V_x \subseteq \{0,\ldots,s-1\}\)
Check how many bits of \(T\) are set for indices in \(V_x\). Return true if and only if this count is greater than or equal to \(\tau\).
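Since \(\mathsf{TestCount}\) only reads \(T\), the client side is a few lines; a minimal sketch, assuming the tipping point \(\tau\) has already been computed as described in the next subsection:

```rust
/// TestCount(x, t): count the set bits of T at the item locations V_x and
/// compare against the precomputed tipping point tau.
fn test_count(table: &[bool], v_x: &[usize], tau: usize) -> bool {
    let filled = v_x.iter().filter(|&&i| table[i]).count();
    filled >= tau
}
```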
The key to correctness of the protocol is a calculation of the tipping point \(\tau\), which is the expected number of 1 bits within any item set, if that item has been incremented \(t\) times. We now derive an algorithm to compute this expected value exactly, in \(O(tv)\) time and \(O(v)\) space.
Let \(s\) be the total size of the table \(T\) and \(m\le s\) be the total number of calls to \(\mathsf{Increment}\) so far. That is, \(m\) equals the number of 1 bits in \(T\). Recall that \(u,v \le s\) are the number of table entries per user and per item, respectively.
We first derive the probability that two subsets of the \(s\) slots have given-size intersection. Next we derive a recursive formula for \(\tau\) using these intersection probabilities. The nearest integer to \(\tau\) can then be efficiently computed using a simple dynamic programming strategy.
For the remainder, we use Knuth’s notation \({n}^{\underline{k}}\) to denote the falling factorial, defined by \[{n}^{\underline{k}} = \frac{n!}{(n-k)!} = n\cdot(n-1)\cdot(n-2)\cdots(n-k+1).\]
Lemma 59. Let \(k,a,b,s\) be non-negative integers with \(k \le b \le a \le s\), and suppose \(S\) and \(T\) are two subsets of a size-\(s\) set with \(|S|=a\) and \(|T|=b\), each chosen independently and uniformly over all subsets with those sizes. Then \[\label{eqn:intersect} \Pr(|S \cap T| = k) = \frac{{a}^{\underline{k}} \cdot{} {b}^{\underline{k}} \cdot{} {(s-a)}^{\underline{b-k}}} {{s}^{\underline{b}} \cdot{} k!}.\]
Proof. The number of ways to choose \(S\) and \(T\) with a size-\(k\) intersection, divided by the total number of ways to choose two size-\(a\) and size-\(b\) sets, equals \[\frac{\binom{s}{k} \cdot{} \binom{s-k}{a-k} \cdot{} \binom{s-a}{b-k}} {\binom{s}{a}\cdot{}\binom{s}{b}}.\] This simplifies to [eqn:intersect]. ◻
Because the numerator and denominator are each products of \(b+k\) single-precision integers, the value of [eqn:intersect] can be computed in \(O(b)\) time to full accuracy in machine floating-point precision.
Furthermore, equation [eqn:intersect] has the convenient property that, after altering any value \(a\), \(b\), or \(k\) by \(\pm 1\), we can update the probability with only \(O(1)\) additional computation. So, for example, one can compute the probabilities for every \(k\le b\) in the same total time \(O(b)\).
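As an illustration of this incremental-update property, the following sketch (our own transcription of Lemma 59) computes \(\Pr(|S\cap T|=k)\) for every \(k \le b\) in \(O(b)\) total operations, starting from the \(k=0\) case and applying the \(O(1)\) update for each increment of \(k\). It assumes \(a+b\le s\), the regime relevant here.

```rust
/// Pr(|S intersect T| = k) for all k = 0..=b, for independent uniformly random
/// subsets of sizes a and b of a size-s ground set (Lemma 59). Assumes a+b <= s.
fn intersection_probs(s: u64, a: u64, b: u64) -> Vec<f64> {
    let (sf, af, bf) = (s as f64, a as f64, b as f64);
    // k = 0: (s-a)^{falling b} / s^{falling b}.
    let mut pr = (0..b).fold(1.0, |acc, j| acc * (sf - af - j as f64) / (sf - j as f64));
    let mut out = vec![pr];
    for k in 1..=b {
        let kf = k as f64;
        // O(1) update from k-1 to k:
        // multiply by (a-k+1)(b-k+1) / (((s-a)-(b-k)) * k).
        pr *= (af - kf + 1.0) * (bf - kf + 1.0) / ((sf - af - (bf - kf)) * kf);
        out.push(pr);
    }
    out
}
```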
Fix an arbitrary item \(x\), and let \(w\le v\) denote the number of 0 bits of \(T\) within \(x\)’s item set. Let \(k\le m\) denote the number of \(\mathsf{Increment}\) operations performed on item \(x\) so far.
First, for convenience define \(p_w\) to be the probability that an arbitrary user is able to write to one of the \(w\) remaining unfilled slots for the message. From Lemma 59, we have \[p_w = 1 - \frac{{(s-u)}^{\underline{w}}}{{s}^{\underline{w}}}, \label{eqn:pw}\] which can be computed in \(O(w)\) time. In fact, we pre-compute all possible values of \(p_w\) with \(0\le w\le v\) in \(O(v)\) total time.
Now consider the random variable for the number of 0 bits within \(x\)’s item set after \(k\) \(\mathsf{Increment}\)s on \(x\), if the item set originally had \(w\) 0 bits. Define \(R_{w,k}\) to be the expected value of this random variable, which can be calculated recursively as follows.
If \(w=0\), then the slots are all filled, and if \(k=0\), then there are no more \(\mathsf{Increment}\)s, so the number of unfilled slots remains at \(w\). Otherwise, the first \(\mathsf{Increment}\) will fill an additional slot with probability \(p_w\), leaving \(w-1\) remaining unfilled slots, and otherwise will leave \(w\) remaining unfilled slots. This implies the following recurrence relation:
\[R_{w,k} = \begin{cases} 0,& w=0 \\ w,& k=0 \\ p_w R_{w-1,k-1} + (1 - p_w) R_{w,k-1},& w,k\ge 1 \end{cases}\]
All values of \(R_{w,t}\) with \(0\le w \le v\) can be computed in \(O(tv)\) time and \(O(v)\) space, using a straightforward dynamic programming strategy.
We now show how to compute the tipping point value \(\tau\), which is the expected number of filled item slots after \(t\) \(\mathsf{Increment}\)s on that item, by summing the \(R_{w,t}\) values over all possible values of \(w\), weighted according to the number \(m\) of other calls to \(\mathsf{Increment}\).
To this end, define \(q_w\) to be the probability that \(w\le v\) slots for a given item are unfilled after \(m\) total calls to \(\mathsf{Increment}\) for other items. Because those calls are for other, unrelated items, each one goes to a uniformly random unfilled slot over the entire size-\(s\) table \(T\). Therefore \(q_w\) is the same as the probability of a size-\(m\) set and a size-\(v\) set having intersection size exactly \(v-w\). From Lemma 59, this is \[q_w = \frac{{m}^{\underline{v-w}} \cdot{} {v}^{\underline{v-w}} \cdot{} {(s-m)}^{\underline{w}}} {{s}^{\underline{v}} \cdot{} (v-w)!}.\] We can pre-compute all values of \(q_w\) for \(0\le w\le v\) in total time \(O(v)\).
After pre-computing the values of \(p_w\), \(R_{w,t}\), and \(q_w\), we can finally express the tipping point \(\tau\) as a linear combination \[\label{eqn:r} \tau = v - \sum_{w=0}^v q_w R_{w,t},\] rounded to the nearest integer.
In total, the computation requires \(O(tv)\) time and \(O(v)\) space.
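The full computation can be sketched in Rust as follows; this is our own transcription of the recurrences above, assuming the parameter regime of Corollary 60 so that all intermediate quantities are well defined.

```rust
/// Exact tipping point tau (Section 6.4): expected number of filled item
/// slots after t Increments on an item, with m background bits already set
/// in the size-s table, user-set size u, and item-set size v.
/// O(t*v) time and O(v) space.
fn tipping_point(s: f64, u: f64, v: usize, m: f64, t: usize) -> u64 {
    // p[w] = 1 - (s-u)^{falling w} / s^{falling w}.
    let mut p = vec![0.0f64; v + 1];
    let mut ratio = 1.0;
    for w in 1..=v {
        ratio *= (s - u - (w as f64 - 1.0)) / (s - (w as f64 - 1.0));
        p[w] = 1.0 - ratio;
    }
    // r[w] holds R_{w,k}; initialized to R_{w,0} = w and updated in place
    // for k = 1..t (descending w so r[w-1] is still at level k-1).
    let mut r: Vec<f64> = (0..=v).map(|w| w as f64).collect();
    for _ in 1..=t {
        for w in (1..=v).rev() {
            r[w] = p[w] * r[w - 1] + (1.0 - p[w]) * r[w];
        }
    }
    // q[w]: probability exactly w item slots are unfilled after the m
    // background Increments; computed from w = v downward via O(1) updates.
    let mut q = vec![0.0f64; v + 1];
    q[v] = (0..v).fold(1.0, |acc, j| acc * (s - m - j as f64) / (s - j as f64));
    for w in (0..v).rev() {
        let k = (v - w) as f64; // intersection size v - w
        q[w] = q[w + 1] * (m - k + 1.0) * (v as f64 - k + 1.0)
            / ((s - m - w as f64) * k);
    }
    // tau = v - sum_w q_w * R_{w,t}, rounded to the nearest integer.
    let tau: f64 = v as f64 - (0..=v).map(|w| q[w] * r[w]).sum::<f64>();
    tau.round() as u64
}
```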
Next, we prove lower and upper bounds on the probability of filling a single additional item slot during an \(\mathsf{Increment}\) operation ([lem:lowerp] and [lem:upperp], respectively). The proofs, which are intricate but not especially surprising, can be found in Section 7.4.1.
In order to make our scheme practically realizable, we state and prove explicit rather than asymptotic results, with all constants specified. These constants in themselves are not particularly meaningful; rather, they represent the tightest values which worked with our proof techniques and the parameter ranges we deemed reasonable for the application in mind.
Lemma ([lem:lowerp]). Let \(x\) be an item such that at most \(\tau\) of \(x\)’s item slots are filled. If the CCBF parameters \(s,u,v\) satisfy \(v \ge 7.042652\,\tau\) and \(u \ge 0.5184846\,\tfrac{s}{\tau}\), then the probability that a call to \(\mathsf{Increment}(x,C)\) fills in one more of \(x\)’s item slots is at least \(0.956414\).
Lemma ([lem:upperp]). Let \(x\) be any item. If the CCBF parameters \(s,u,v\) satisfy \(371 \le v \le 0.00386\,s\) and \(u \le 3.65151\,\tfrac{s}{v}\), then the probability that a call to \(\mathsf{Increment}(x,C)\) fills in one more of \(x\)’s item slots is at most \(0.974876\).
Now we use the probability upper bound to prove an upper bound on the tipping point \(\tau\).
Lemma ([lem:uppertau]). Let \(s,u,v\) be CCBF parameters that satisfy the conditions of [lem:upperp], and suppose \(m,t\) are integers such that \(s \ge 96\,m\) and \(v \le 7.409\,t\). Then the tipping point \(\tau\), for threshold \(t\) and with \(m\) total set bits in the table \(T\), is at most \(1.0520553\,t\).
We can now state our main theorems on the accuracy of the CCBF data structure. Consider a call to the predicate function \(\mathsf{TestCount}(x,t)\), which attempts to determine whether the number of prior \(\mathsf{Increment}{}\) calls with the same item \(x\) is at least \(t\). Our exact computation of the tipping point \(\tau\) shows that this function always returns the correct answer with probability at least 50%. But of course, so would a random coin flip!
Let \(k\) be the actual number of calls to \(\mathsf{Increment}(x,C)\) that have occurred. Then two kinds of errors can occur: a false positive if \(\mathsf{TestCount}(x,t)\) returns true but \(k<t\), and a false negative if \(\mathsf{TestCount}(x,t)\) returns false when \(k\ge t\). Intuitively, both errors occur with higher likelihood when the true count \(k\) is close to \(t\). Our main theorem captures and quantifies this intuition, saying that, ignoring low-order terms, \(\mathsf{TestCount}\) is accurate to within a 10% margin of error with high probability.
Theorem ([thm:fp]). Let \(n\) be an upper bound on the total number of calls to \(\mathsf{Increment}\), and \(t\) be a desired threshold for \(\mathsf{TestCount}\). Suppose the parameters \(s,u,v\) for a CCBF data structure satisfy the conditions of [lem:lowerp], and furthermore that \(v \le 8\,t\). If the actual number of calls to \(\mathsf{Increment}(x,C)\) is at most \(t - 2.1\sqrt{\lambda t}\), then the probability that \(\mathsf{TestCount}(x,t)\) gives a false positive is at most \(2^{-\lambda}\).
Theorem ([thm:fn]). Let \(n\) be an upper bound on the total number of calls to \(\mathsf{Increment}\), and \(t\) be a desired threshold for \(\mathsf{TestCount}\). Suppose the parameters \(s,u,v\) for a CCBF data structure satisfy the conditions of [lem:lowerp] and [lem:uppertau]. If the actual number of calls to \(\mathsf{Increment}(x,C)\) is at least \[\label{eqn:fn} 1.1 t + 0.4 \lambda + 0.7\sqrt{\lambda t},\] then the probability that \(\mathsf{TestCount}(x,t)\) gives a false negative is at most \(2^{-\lambda}\).
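To make these bounds concrete (an illustrative calculation with \(\lambda = 40\), not taken from the paper): for \(t=1000\) we have \(\sqrt{\lambda t} = 200\), so [thm:fp] bounds the false-positive probability by \(2^{-40}\) whenever the true count is at most \(1000 - 2.1\cdot 200 = 580\), and [thm:fn] bounds the false-negative probability by \(2^{-40}\) whenever the true count is at least \(1.1\cdot 1000 + 0.4\cdot 40 + 0.7\cdot 200 = 1256\). Only in the window between these two counts may the outcome go either way; as \(t\) grows relative to \(\lambda\), the \(\sqrt{\lambda t}\) terms become lower order and the window narrows toward the 10% margin described above.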
We can easily summarize the various conditions on the parameters as follows:
Corollary 60. Let \(n\) be a limit on the total number of calls to \(\mathsf{Increment}\), and \(t\) be a desired threshold satisfying \(50 \le t \le \tfrac{n}{20}\). Then by setting the parameters of a CCBF according to \(s=96n\), \(v=7.409t\), and \(u=47.31\tfrac{n}{t}\), any call to \(\mathsf{TestCount}(x,t)\) will satisfy the high accuracy assurances of [thm:fp] and [thm:fn].
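These settings transcribe directly into a small parameter-selection helper (our own sketch, matching the values used in the experiments of Section 6.6):

```rust
/// CCBF parameters per Corollary 60: n is the per-epoch complaint cap and
/// t the audit threshold, with 50 <= t <= n/20 assumed.
fn ccbf_params(n: u64, t: u64) -> (u64, u64, u64) {
    let s = 96 * n;                                        // table size, in bits
    let v = (7.409 * t as f64).round() as u64;             // item-set size
    let u = (47.31 * n as f64 / t as f64).round() as u64;  // user-set size
    (s, v, u)
}
// ccbf_params(1_000_000, 1000) = (96_000_000, 7_409, 47_310): a 12MB table,
// matching the experimental setup reported in Section 6.6.
```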
We are now ready to present our construction of FACTS. This construction is based on the collaborative counting Bloom filter (CCBF) data structure presented in Section 6.4 to obliviously count the number of complaints on each message. It uses an underlying EEMS for sending end-to-end encrypted messages between users.
The setup procedure for FACTS first sets up the underlying end-to-end encrypted messaging system (EEMS). For simplicity, we assume that there is a fixed number \(c\) of users using the system. Setup generates all necessary keys for the server \(S\) and all \(c\) users and distributes the keys. We note that if the messaging system is already set up, FACTS can simply leverage this for communication. Additionally, the server initializes an empty CCBF data structure.
We now describe how FACTS originates, forwards, and verifies messages. We start our description with an auxiliary protocol \(\textsf{Originate}(A,x)\) between a user \(A\) and the server \(S\) to originate a new message \(x\). This protocol is used to create an origination tag \(\tag_x\) containing information about the message and originator. This tag binds the originator’s identity \(A\) to the message \(x\) to enable recovery upon an audit, while keeping \(A\) private from receiving users, and keeping the message \(x\) private from the server \(S\).
Roughly, this protocol works by having \(S\) produce a signature on (a hash of) the message together with the originator’s identity. Due to the use of the hash, \(S\) produces this signature without learning anything about the message, while the fact that \(S\) includes the originator’s identity in this signature prevents a malicious originator from including the wrong identity in the message. Moreover, since the tag is bound to the message, this prevents a replay attack where an adversary reuses tags across messages to change the identity of the originator.
To originate a message \(x\), the originator \(A\) chooses a random salt \(r \gets \{0,1\}^\kappa\), computes a salted hash \(h = H(r || x)\), and sends \(h\) to \(S\).
\(S\) computes an encryption of the sender’s identity, \(e \gets {\mathsf Enc}_{PK_S}(A)\), and produces signature \(\sigma= \mathop{\mathrm{\textsf{Sig}}}_{SK_S}(h || e)\). \(S\) sends the tuple \((e, \sigma)\) to \(A\).
\(A\) outputs \(\mathsf{tag}_x = (r,e,\sigma)\).
Protocol : \(\textsf{Originate}(A,x)\)
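As a concrete sketch of \(\textsf{Originate}\), using the primitives from our implementation in Section 6.6 (SHA-3 for \(H\), Ed25519 via the ring crate for \(\mathop{\mathrm{\textsf{Sig}}}\)): the helper `enc_under_server_key`, standing in for \({\mathsf Enc}_{PK_S}\), is a hypothetical placeholder and must be replaced by real encryption (e.g., ChaCha20-Poly1305 under a key held only by \(S\)).

```rust
use sha3::{Digest, Sha3_256};
use ring::signature::Ed25519KeyPair;

/// Client side: the salted hash h = H(r || x) sent to the server.
fn originate_client_hash(r: &[u8; 32], x: &[u8]) -> [u8; 32] {
    let mut hasher = Sha3_256::new();
    hasher.update(r);
    hasher.update(x);
    hasher.finalize().into()
}

/// Hypothetical placeholder for Enc_{PK_S}. NOT encryption; illustration only.
fn enc_under_server_key(id: &[u8]) -> Vec<u8> {
    id.to_vec()
}

/// Server side: encrypt the originator id and sign h || e.
fn originate_server(sk: &Ed25519KeyPair, h: &[u8], originator_id: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let e = enc_under_server_key(originator_id);
    let mut to_sign = h.to_vec();
    to_sign.extend_from_slice(&e);
    let sigma = sk.sign(&to_sign).as_ref().to_vec();
    (e, sigma) // the client then outputs tag_x = (r, e, sigma)
}
```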
Next, we describe the \(\textsf{SendMsg}\) protocol which makes use of the \(\textsf{Originate}\) protocol to send a message \(x\) between clients \(A\) and \(B\) while preserving (encrypted) information about the originator of \(x\). \(x\) can either be a newly originated message or a forward of a previously received message. In either case, \(\textsf{SendMsg}\) runs the \(\textsf{Originate}\) protocol to produce a new tag on the message \(x\). In the case of a new message, this new tag \(\mathsf{tag}_x\) is sent along with the message, while in the case of a forward, the new tag is discarded and the message is forwarded along with its original tag instead.
If \(\mathsf{tag}_x=\perp\), then \(x\) is a new message \(A\) wants to originate. \(A\) runs \(\mathsf{tag}_x \gets \textsf{Originate}(A,x)\).
If \(\mathsf{tag}_x\neq \perp\), then \(x\) is a message that \(A\) wants to forward. \(A\) runs \(\mathsf{tag}'_x \gets \textsf{Originate}(A,x)\) and discards the output.
\(A\) sends \((\mathsf{tag}_x,x)\) to \(B\) using the E2E messaging platform’s \(\textrm{send}\) protocol.
Protocol : \(\textsf{SendMsg}(A,B,\mathsf{tag}_x,x)\)
\(\textsf{RcvMsg}\) is a non-interactive algorithm that allows a receiving user to verify the tag, \(\tag_x\), affiliated with a message \(x\). Specifically, the receiver \(B\) verifies the server’s signature included in \(\tag_x\) to make sure that the tag indeed corresponds to \(x\) and that the originator id has not been modified. Importantly, \(B\) can perform this verification without learning the identity of the originator since the tag contains an encryption of this identity (this ciphertext is what is verified by \(B\)).
Parse \(\mathsf{tag}_x\) as \(\mathsf{tag}_x=(r,e,\sigma)\)
Compute \(h=H(r||x)\)
Run \(\mathop{\mathrm{\textsf{Ver}}}_{PK_S}(\sigma,(h||e))\) to check that \(\sigma\) is a valid signature by the server on \((h||e)\). If not, then discard the received message.
Protocol : \(\textsf{RcvMsg}(A,B,\mathsf{tag}_x,x)\)
We now describe how FACTS allows users to complain about received messages and to trigger an audit once enough complaints are registered on a message. For these methods we make extensive use of a CCBF data structure for (approximately) counting complaints and detecting when a threshold of complaints has been reached.
The \(\textsf{Complain}\) protocol is used by a receiving user to issue a complaint on a received message \((\tag_x, x)\). We assume that prior to issuing a complaint the user verifies that \(\tag_x\) is valid using the \(\textsf{RcvMsg}\) protocol, and thus will only consider the case of valid tags. To issue a complaint on \((\tag_x,x)\), the user \(C\) calls CCBF.\(\mathsf{Increment}(\tag_x,C)\). As described in Section 6.4, this runs a protocol with the server in which the user (eventually) sends the location of a bit to flip to 1 to increment the CCBF count for the message \(x\). To prevent malicious adversaries from flooding FACTS with complaints, we enforce a limit of \(L\) complaints per user per epoch. Note that since the server knows the identities of complaining users, he can easily enforce this restriction.
Two important observations are in order here. First, we use \(\tag_x\) rather than the message \(x\) as the item to increment in the CCBF. The reason for this is that the tag is unpredictable to an adversary who has not received the message \(x\) through FACTS (even if the adversary knows \(x\)). Second, we note that the CCBF.\(\mathsf{Increment}\) procedure is inherently sequential. It requires that the CCBF table \(T\) be locked for the duration of the \(\mathsf{Increment}\) call to prevent race conditions and to maintain obliviousness (see Section 6.4 for discussion). This means that only one user can run this procedure at a time. Thus, we focus on making this procedure as cheap as possible to minimize the impact of this bottleneck. In case multiple clients call \(\textsf{Complain}\) at overlapping times, the server can queue these complaints and process them one at a time.
Parse \(\mathsf{tag}_x\) as \(\mathsf{tag}_x=(r,e,\sigma)\)
Call CCBF.\(\mathsf{Increment}(\mathsf{tag}_x, C)\)
Protocol : \(\textsf{Complain}(C, \mathsf{tag}_x,x)\)
The \(\textsf{Audit}\) protocol checks whether a threshold of complaints has been reached for a given message \(x\) and, if so, triggers an audit of this message. This protocol works by using the CCBF.\(\mathsf{TestCount}\) protocol to check whether the threshold \(t\) of complaints has been reached on this message. If this returns True, then the user simply sends \((\tag_x,x)\) to the server who first checks the validity of the tag, and then if it’s valid, decrypts the corresponding part of the tag to recover the identity of the message originator.
An important observation is that the CCBF.\(\mathsf{TestCount}\) operation is read-only and thus does not need to block. Thus, unlike the \(\textsf{Complain}\) command, many clients can execute the \(\textsf{Audit}\) command in parallel.
Parse \(\mathsf{tag}_x\) as \(\mathsf{tag}_x=(r,e,\sigma)\)
Call CCBF.\(\mathsf{TestCount}(\mathsf{tag}_x,t)\).
If \(\mathsf{TestCount}\) returns True, \(C\) sends \((\mathsf{tag}_x,x)\) to \(S\)
\(S\) verifies that the tag is valid by checking that \(\sigma\) is a valid signature on \(h||e\), where \(h=H(r||x)\).
If so, \(S\) recovers the identity (\(A\)) of the originator by computing \(A={\mathsf Dec}_{SK_S}(e)\).
Protocol : \(\textsf{Audit}(C, \mathsf{tag}_x,x)\)
We note that \(\textsf{Audit}\) allows the server to learn the message \(x\) and the originator \(A\). We do not specify what the server does upon learning this information, as that is specific to a particular use of FACTS. One possible option is for the server to review \(x\) to see if it is truly a malicious message, and if so, block the user \(A\) from sending further messages. However, this decision is orthogonal to the FACTS scheme and we do not prescribe a particular action here.
In this section, we empirically evaluate the accuracy and performance of FACTS. We perform two sets of experiments. The first measures the error, in terms of the number of complaints above or below the threshold, as a function of the total number of complaints. The second measures the performance overhead for messaging and complaints as a function of the threshold.
For our experiments, we set the maximum number of complaints per epoch to \(n=1,000,000\). If we consider an epoch of one day, this amounts to approximately 11.6 complaints per second. To understand the accuracy and efficiency of FACTS, we measure them for a range of thresholds \(100 \le t \le 1000\). With these fixed, we set the remaining parameters according to Corollary 60. In particular, we set the server’s storage to \(s=96n\) bits, i.e., 12MB. The user set size \(u\) varies from (approximately) 47,000 to 470,000 bits, while the message set size \(v\) goes from (approximately) 740 to 7400.

Figure [fig:Accuracy]: The (left) Mean and (right) Relative Standard Deviation of Experimental Explicit Threshold vs. The Number of Background Complaints.
Figure 5: Average complaint time as a function of the threshold with \(n=1,000,000\) complaints per epoch. Complaint time is measured as the average of 100 samples and has a variance of less than 5ms. Network latency shows the minimum latency required to transmit 3 sequential messages over the network, a lower bound on complaint time.
To measure the accuracy of FACTS, we observe the actual number of complaints necessary to cause an audit on a single message as a function of the background noise (i.e., total complaints on other messages). We calculate both the mean and the standard deviation of this value to capture the accuracy and stability of the complaint mechanism. To get a statistically meaningful estimate of these, our experiments run 1000 iterations of each parameter configuration.
The results of our experiments are presented in Figure [fig:Accuracy]. The left side of this figure shows the mean number of complaints to trigger an audit for a given threshold \(t\). As can be seen from the error bars, the absolute error in the number of complaints is quite small, with a maximum deviation of about 10 complaints at a threshold of 1000. Not surprisingly, we see that this error increases as the background noise increases, but the mean number of complaints remains remarkably steady at the desired value. The right side of Figure [fig:Accuracy] shows the relative standard deviation of the number of complaints as a function of background noise. From this graph we can see that the relative error is only a few percent, with a maximum relative error of about 3.5%. Not surprisingly, the threshold-100 measurement incurs the highest relative error, because the noise is a much higher ratio when compared to the threshold. These experiments suggest that FACTS achieves good accuracy for a wide variety of thresholds and background noise levels.
Our next set of experiments measures the performance overhead of FACTS as a function of the threshold to start an audit. Specifically, we measure the overhead of sending a message using FACTS, and the cost of issuing a complaint. We note that for the message sending cost, we do not measure the cost of the EEMS communication, instead only measuring the added overhead due to FACTS.
For these experiments, we implemented both the client and server in the Rust programming language. We used SHA-3 as the hash function; for encryption and signatures we used the ring library’s [166] implementations of ChaCha20-Poly1305 and Ed25519, respectively. To instantiate the CCBF, we used the bitvec library [167], which allows memory to be bit-addressed rather than byte-addressed, giving us a quick, compact way to store the CCBF data structure.
To simulate network overhead, we implemented a simple web server and client, which communicated over a (simulated) 8Mbps network with a latency of 80ms, using TLS 1.3. Since we are only measuring the overheads of FACTS over the underlying EEMS, our measurements did not include the time to send the message over the EEMS, nor the time to establish the TLS connection. All experiments were run on a 4.7GHz Intel Core i7 with 16GB of RAM, with a sample size of 100 for each metric. As in the accuracy experiments, we set \(n=1,000,000\) and varied the threshold from 100 to 1000, with the remaining parameters determined by Corollary 60.
For our measurement of message origination, we looked at the cost of originating and sending a message of size 2MB. Creating and sending such a message with the encrypted hash and identity took 98ms, indicating that the major bottleneck in this process is the 80ms network latency. When a user wishes to forward a message, they still call \(\textsf{Originate}(A,x)\) and then forward the original message, whereas in a plain EEMS this would just require a forward. Thus, the overhead of FACTS on a forward is slightly less than 100ms.
Figure 5 shows our measurements of the time to issue a complaint as a function of the audit threshold. The time for this is dominated by the time to retrieve the user set (i.e., the bits that the user can write) from the server. Since the size of this set \(u=O(n/t)\), this time grows inversely with the threshold \(t\). Thus, as the threshold increases, the total complaint time decreases very quickly, going down to essentially just the network latency when \(t\ge 400\).
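As a rough sanity check on these numbers (our own back-of-the-envelope estimate): with the Corollary 60 parameters, at \(t=100\) the user set is \(u \approx 473{,}000\) bits, i.e., roughly 59KB, which takes about 59ms to transfer at 8Mbps on top of the 80ms latency; at \(t=1000\), \(u \approx 47{,}000\) bits transfers in under 6ms, so the latency dominates, consistent with Figure 5.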
These experiments show that both the (added) cost of sending messages and the cost of complaints (for sufficiently large \(t\)) are dominated by the networking costs. Thus, as long as the latency of the network is reasonable, FACTS can scale to millions of complaints per day.
In this section we analyze the security of FACTS. We provide security definitions capturing the privacy and integrity guarantees provided by FACTS and prove that our protocols described in Section 6.5 achieve these definitions.
We consider two different types of adversaries against FACTS. The first is an honest-but-curious server \(S\). Such a server may also collude with some of the users. However, all such users, as well as the server, will follow the protocol. This adversary class models what the FACTS server learns in running the system, so we want to limit what the server learns. However, we have to assume that the server acts honestly, as a malicious server can fully break the integrity and availability of FACTS. For example, since the server holds the signing key used to bind originators to messages, a malicious server could arbitrarily assign originators by forging these signatures.
We also consider a second type of adversary controlling a group of malicious users who do not collude with the server. Such users may want to violate the confidentiality of FACTS by learning extra information about messages or complaints, beyond what they learn through the messages they validly receive. Or, they may want to break the integrity of the complaint and audit mechanism of FACTS to blame innocent parties for audited messages, or to delay or speed-up the auditing of targeted messages. This models an external adversary, say a malicious company or government, who may want to distribute fake information without being audited or may want to block certain information or users from the system.
We begin by looking at the privacy guarantees provided by FACTS.
We first give a definition for privacy against a semi-honest server who may also collude with some semi-honest users. In this setting we aim to argue that unless a message is audited or is received by an adversarial user, the server learns no information about the message or the complaints on the message. In particular, the server should not be able to tell whether any message is a new message or a forward and how many, if any, complaints this message may have. In fact, the only thing that the server learns is the metadata of who is sending messages to whom and who is issuing complaints, but not anything more.
Specifically, we propose a real-or-random style definition to capture privacy against the server. This definition captures the fact that the view of the server (and colluding users), until a message is audited or received by a colluding user, consists of random values and thus is independent of the messages and complaints.
Concretely, we define the following game between an adversary \(\mathcal{A}\) controlling the server (and possibly some colluding users) and a challenger.
\(\mathsf{Game}_\mathsf{EEMS}^{\mathsf{server\mbox{-}privacy}}(\mathcal{A})\):
The challenger runs \(\textsf{Setup}(c)\) to set up the EEMS with \(c\) users. He hands all keys corresponding to corrupted parties to \(\mathcal{A}\)
\(\mathcal{A}\) chooses a sequence of messages \(((\mathrm{send}, A_0,B_0,\mathsf{tag}_{x_0}, x_0),\ldots,(\mathrm{send}, A_\ell,B_\ell,\mathsf{tag}_{x_\ell}, x_{\ell}))\) and a sequence of complaints \(((\mathrm{complain}, C_0, \mathsf{tag}_{x^c_0}, x^c_0), \ldots, (\mathrm{complain}, C_{\ell'}, \mathsf{tag}_{x^c_{\ell'}}, x^c_{\ell'}))\), and interleaves them arbitrarily. We require that none of the sending users (\(A_i\)), receiving users (\(B_i\)), or complainers (\(C_i\)) are controlled by \(\mathcal{A}\).
The challenger chooses \(b \gets \{0,1\}\) and does the following:
If \(b=0\), Run the \(\textsf{SendMsg}\) and \(\textsf{Complain}\) protocols with inputs supplied by \(\mathcal{A}\), giving \(\mathcal{A}\) the resulting server view.
If \(b=1\),
for each \(\textsf{SendMsg}\) command, choose \(r \gets \{0,1\}^\kappa\) and send this to \(S\). Choose \(x' \gets \{0,1\}^{|x|+|\tag_x|}\) and send \(x'\) from \(A_i\) to \(B_i\) using EEMS.\(\textrm{send}\).
The challenger maintains a set \(\mathrm{USED}\subseteq [s]\). For each \(\textsf{Complain}\) command, the challenger chooses \(ind \gets [s] \setminus \mathrm{USED}\), sends \(ind\) from \(C_i\) to \(S\), and adds \(ind\) to \(\mathrm{USED}\).
\(\mathcal{A}\) outputs a bit \(b'\)
We say that \(\mathcal{A}\) has advantage \[\mathsf{Adv}_\mathsf{EEMS}^{\mathsf{server\mbox{-}privacy}}(\mathcal{A})= |\Pr[b=b']-1/2|.\]
Definition 61 (Privacy vs. Server). A FACTS scheme is private against a semi-honest server if the adversary has a negligible advantage in the game above: \(\mathsf{Adv}_\mathsf{EEMS}^{\mathsf{server\mbox{-}privacy}}(\mathcal{A})\le \mathsf{negl}(\kappa)\).
Theorem 62. FACTS is private against a semi-honest server.
Proof. First, consider the server’s view on a \(\textsf{SendMsg}\) command. This view consists of a message \(h = H(r||x)\) for \(r \gets\{0,1\}^\kappa\) and the leakage from EEMS.\(\textrm{send}\), i.e., the identities \(A\) and \(B\), as well as \(|(\tag_x,x)|\). Since the challenger uses the same sender, receiver, and message length, the only thing left to prove is that \(h\) is indistinguishable from random. Since \(r\) is chosen uniformly at random and \(H\) is a random oracle, \(H(r||x)\) is uniformly random to \(\mathcal{A}\) unless \(\mathcal{A}\) queries \(H(r||x)\). However, since \(\mathcal{A}\) makes at most \({\mathsf poly}(\kappa)\) queries to \(H\), the probability that he makes this query is at most \({\mathsf poly}(\kappa)/2^{\kappa} \le \mathsf{negl}(\kappa)\).
Next, we consider the \(\textsf{Complain}\) commands. The server’s view on a complaint consists of the complainer’s ID \(C\) and an index in the CCBF to flip to 1. In a real execution of \(\textsf{Complain}\), this index is chosen at random from the set \(S_C \cap V_x\) where \(S_C = \{i\in U_C \mid T[i] = 0\}\) and \(V_x\) is the list of item locations for \(x\). However, since \(U_C\) and \(V_x\) are chosen at random, we can equivalently sample a random 0-index in the bit vector \(T\) and then choose \(U_C\) and \(V_x\) conditioned on them containing this location. Hence the location sent to the server is uniformly random unless \(\mathcal{A}\) makes the corresponding \(H\) query, which happens with only \(\mathsf{negl}(\kappa)\) probability. ◻
The above theorem states that, beyond the meta-data of who sent a message to whom and who has sent complaints and when, FACTS reveals no information about messages and complaints to a semi-honest server until an audit occurs (or a malicious user receives a message). Moreover, the view of the server is completely random when conditioned on the meta-data. Now, suppose that a message \(x\) is audited (or is received by an adversary-controlled user). When this happens, the adversary learns the tag and message \((\tag_x,x)\). This enables \(\mathcal{A}\) to learn the identity of the originator (by decrypting it from \(\tag_x\)) and to learn the entire history of this message, i.e., the transmission and complaint history of \(x\). However, since the server’s view of all other messages is indistinguishable from independent random strings (modulo the meta-data), the adversary does not learn anything more about these messages as a result of an audit on \(x\).
We now proceed to analyze security of our protocol against (possibly malicious) users that are not colluding with the server. This models the case of a third party adversary that tries to learn information about the messages and complaints in FACTS. Here, we no longer assume that a message \(x\) is never received by a malicious user and thus we cannot use a real-or-random style definition as before. Instead, we argue that a user cannot distinguish a new message from a forwarded message unless another corrupted user has previously seen that message. This also shows that a malicious user cannot learn the identity of the message originator. Since users do not receive any communication on complaints, we only consider message privacy here.
Concretely, we define the following game between an adversary \(\mathcal{A}\) controlling a set of users, and a challenger.
\(\mathsf{Game}_\mathsf{EEMS}^{\mathsf{user\mbox{-}privacy}}(\mathcal{A})\):
The challenger runs \(\textsf{Setup}(c)\) to set up the EEMS with \(c\) users and gives all key material for the corrupted users to \(\mathcal{A}\). Let \(B \in \mathcal{A}\) be a user controlled by the adversary.
\(\mathcal{A}\) chooses messages \(x, x'\) s.t. \(|x|= |x'|\) and honest users \(O, A \notin \mathcal{A}\)
The challenger chooses \(b \gets \{0,1\}\) and does the following:
If \(b=0\), the challenger runs \(\textsf{SendMsg}(O,A,\perp,x')\) and \(\textsf{SendMsg}(A,B,\perp,x)\) with \(\mathcal{A}\) receiving the view of \(B\).
If \(b=1\), the challenger runs \(\textsf{SendMsg}(O,A,\perp,x)\) and \(\textsf{SendMsg}(A,B,\tag_x, x)\) (where \(\tag_x\) is the tag received by \(A\) from \(O\)).
\(\mathcal{A}\) outputs a bit \(b'\)
We say that \(\mathcal{A}\) has advantage \[\mathsf{Adv}_\mathsf{EEMS}^{\mathsf{user\mbox{-}privacy}}(\mathcal{A})= |\Pr[b=b']-1/2|.\]
Definition 63 (User privacy). A FACTS scheme achieves privacy against malicious users if the adversary has a negligible advantage in the game above: \(\mathsf{Adv}_\mathsf{EEMS}^{\mathsf{user\mbox{-}privacy}}(\mathcal{A})\le \mathsf{negl}(\kappa)\).
Theorem 64. FACTS achieves privacy against malicious users.
Proof. The view of \(B\) on an execution of \(\textsf{SendMsg}(\cdot,B,\tag_x,x)\) consists of the received message and tag \((\tag_x, x)\) where \(\tag_x = (r,e,\sigma)\). Since \(e\) is a semantically secure encryption of the identity of the originator, \(\mathcal{A}\) cannot distinguish between the case when \(e = {\mathsf Enc}(A)\) (when \(b=0\)) and the case when \(e = {\mathsf Enc}(O)\) (when \(b=1\)) except with advantage negligible in \(\kappa\). Additionally, since \(\tag_x\) is generated identically both when \(b=0\) and \(b=1\) except for this change in \(e\), this means that \(\tag_x\) does not help \(\mathcal{A}\) distinguish between these two cases. ◻
We now turn to the integrity guarantees provided by FACTS. We aim for a few different notions of integrity to show that malicious users cannot interfere with the complaint and audit process. First, no adversary controlling a subset of the users should be able to frame an honest user as the originator of an audited message he did not originate. Second, an adversary controlling a subset of the users should not be able to significantly delay the audit of a malicious message. In particular, such an adversary should not be able to prevent a malicious message sent by one of his users from being audited. Finally, an adversary controlling a small set of users should not be able to significantly speed up the auditing of a targeted message. In particular, such an adversary should not be able to cause an audit without complaints from some honest users.
We begin by defining the following game between a challenger and an adversary \(\mathcal{A}\) controlling a subset of the users to capture the inability of an adversary to forge a valid tag that it has not seen before.
\(\mathsf{Game}_\mathsf{EEMS}^{\mathsf{unforgeability}}(\mathcal{A})\):
The challenger runs \(\textsf{Setup}(c)\) to set up the EEMS with \(c\) users and gives all key material for the corrupted users to \(\mathcal{A}\).
\(\mathcal{A}\) requests \(\textsf{SendMsg}\) operations on messages of its choice both from honest and corrupted users. (\(\mathcal{A}\) is given the view of corrupted users in all these executions consisting of \((\mathsf{tag}_x,x)\).)
\(\mathcal{A}\) outputs a tag, message pair \((\mathsf{tag}_y, y)\)
We say that \(\mathcal{A}\) WINS if \(\mathsf{tag}_y\) is a valid tag for message \(y\) with originator \(O \notin \mathcal{A}\), and there has not been a prior command \(\textsf{SendMsg}(O,\cdot,\perp,y)\).
Definition 65 (No framing). We say that a FACTS scheme disallows framing if for any PPT \(\mathcal{A}\), \(\mathcal{A}\) WINS in the above game with probability at most \(\mathsf{negl}(\kappa)\).
Theorem 66. The FACTS scheme is unforgeable.
Proof. A valid tag \(\mathsf{tag}_y\) with originator \(O\) consists of \(\mathsf{tag}_y = (r,e,\sigma)\) where \(r\) is a random seed s.t. \(H(r||y) = h\), \(e = {\mathsf Enc}_{PK_S}(O)\), and \(\sigma= \mathop{\mathrm{\textsf{Sig}}}_{SK_S}(h || e)\). Thus, to frame \(O\), \(\mathcal{A}\) needs to produce a valid signature on \(h||{\mathsf Enc}(O)\). \(\mathcal{A}\) can observe tags from polynomially many messages originated by \(O\), but except with probability negligible in \(\kappa\), none of them will have the same value \(h\). Thus, by the unforgeability of \(\mathop{\mathrm{\textsf{Sig}}}\), \(\mathcal{A}\) cannot produce the necessary signature except with probability negligible in \(\kappa\). ◻
Next, we give a definition that captures the ability of an adversary controlling a subset of the users to delay the audit of a particular message. Our goal is to show that the adversary cannot protect a malicious message from being audited.
Specifically, we define the following game,
\(\mathsf{Game}_\mathsf{EEMS}^{\mathsf{no-delay}}(\mathcal{A})\):
The challenger runs \(\textsf{Setup}\) to set up the EEMS with \(c\) users and gives all key material for the corrupted users to \(\mathcal{A}\).
\(\mathcal{A}\) issues a single \(\textsf{SendMsg}(A,B,\perp,x)\) command with \(A \in \mathcal{A}\) to produce \(\tag_x\)
\(\mathcal{A}\) outputs a list of \(\textsf{Complain}\) commands with at most \(n\) total complaints, of which at least \(\ell\) are complaints on \(\tag_x\).
The challenger runs the specified complaint commands, and then runs \(\textsf{Audit}(A,\tag_x,x)\)
We say that \(\mathcal{A}\) WINS if this audit is not successful (i.e., the audit threshold is not reached).
Definition 67 (No delay). We say that a FACTS scheme is \(\ell\)-audit delay resilient for integer \(\ell < n\) if for any PPT \(\mathcal{A}\), \(\mathcal{A}\) WINS in the above game with probability at most \(\mathsf{negl}(\kappa)\).
Theorem 68. The FACTS scheme is \(\ell\)-audit delay resilient for any \(\ell \ge 1.1t + 0.4\lambda + 0.7\sqrt{\lambda t}\).
Proof. This follows immediately from Theorem [thm:fn]. ◻
Next, we define the following game to capture the ability of a small number of malicious users to cause the audit of some message. Importantly, this definition also captures the case where malicious users try to audit an honest message (on which there are no complaints by honest users). Specifically, the following game is between an adversary \(\mathcal{A}\) corrupting at most \(\ell\) users and a challenger
\(\mathsf{Game}_\mathsf{EEMS}^{\mathsf{no-speedup}}(\mathcal{A})\):
The challenger runs \(\textsf{Setup}\) to set up the EEMS with \(c\) users and gives all key material for the \(\ell\) corrupted users to \(\mathcal{A}\).
The challenger runs a single \(\textsf{SendMsg}(A,B,\perp,x)\) command for \(A \notin \mathcal{A}\) and \(B \in \mathcal{A}\).
\(\mathcal{A}\) may issue at most \(L\) \(\textsf{Complain}\) commands for each user he controls.
The challenger runs the specified \(\textsf{Complain}\) commands, and then runs \(\textsf{Audit}(\cdot, \mathsf{tag}_x,x)\).
We say that \(\mathcal{A}\) WINS if this audit is successful.
Definition 69 (No speed up). We say that a FACTS scheme is \(\ell\)-party audit speed-up resilient if for any PPT \(\mathcal{A}\) controlling at most \(\ell\) users, \(\mathcal{A}\) WINS in the above game with probability at most \(\mathsf{negl}(\kappa)\).
Theorem 70. The FACTS scheme is \(\ell\)-party audit speed-up resilient for \(\ell \le (t - 2.1\sqrt{\lambda t})/L\).
Proof. This follows immediately from Theorem [thm:fp], because each user in \(\mathcal{A}\) makes at most \(L\) complaints. ◻
In this section we describe several optimizations or enhancements to the basic FACTS protocol.
Recall that the FACTS system we presented reveals two things to the server (or an auditor) after the threshold of complaints has been reached: the user id of the message’s originator, and the contents of the message itself. Indeed, one of our motivations was to avoid revealing the entire path or tree of message forwarding as in prior work [155].
However, in some environments, even this may be too much to release to a service provider that could be, for instance, compromised or influenced by an oppressive regime. An advantage of FACTS is that the system for tallying complaints actually does not require this information in order to function properly. Here we briefly sketch simple modifications to the scheme to achieve this additional hiding, with a note of caution that we have not analyzed the formal security under these variants.
Hiding the originator’s identity entails omitting the encrypted user id from the origination protocol. To do this, the server’s signature \(\sigma\) of the message hash and sender identity should be replaced with a blind signature of the message hash only. In this way, a later audit which reveals the (unblinded) signature will not reveal anything about the originator’s identity. The disadvantage of course would be that there is no way for the system to identify and penalize users who regularly submit fake news to the platform.
To hide the message contents, these would simply be omitted from what is sent to the auditor once the threshold is reached. In this case, the notion of “audit” may be understood to be simply confirming that some message (with the given hash) has passed the threshold of complaints, and publishing the hash to all users as potentially fake news that has received a large number of complaints. The client software could easily be configured to flag such messages as they are received afterwards, without ever revealing to the server any contents or recipients of such message. Here the disadvantage is obviously that no third-party auditing or fact-checking is possible, raising the possibility of false positives in which messages are flagged.
The FACTS system and underlying CCBF data structure assume a global limit \(n\) on the number of complaints per epoch, but do not require any per-user limit besides the natural limit of \(u\), the size of the user set.
However, there is some potential for abuse by users who issue many complaints in a single epoch: they may attempt to “attack” another known message by issuing multiple complaints that set bits in that message’s item set; they may collude with others and attempt to go over the total per-epoch limit of \(n\) complaints; or they may simply attempt a denial-of-service attack to prevent other complaints from being issued.
A simple solution to these problems is to apply a limit \(L \ll u\) on the maximum number of complaints per user per epoch. This is easy for the server to apply, since users are authenticated during the Complain protocol. More nuanced limits based on a user’s reputation or longevity on the platform could also be applied.
Users with a small “quota” of allowed complaints per epoch could even be encouraged to participate initially in the complaint process by forwarding questionable content to a trusted, reputable user on the system, who would then presumably apply their own judgment and possibly issue a complaint in turn. This idea is aligned with many existing content moderation settings on (unencrypted) social media platforms.
As described, the FACTS system resets all counters at the end of a single epoch. However, this may mean that if a “fake news” message is first detected towards the end of an epoch, the complaints for this message may get split between the current and next epochs and thus fail to trigger an audit in either epoch.
A potential solution to this problem is to always run two epochs concurrently, where each epoch lasts for time \(d\) and the epochs’ start times are \(d/2\) apart. Users complain in both of the epochs, and an audit occurs if the number of complaints in either epoch exceeds the threshold. This way, regardless of when a “fake news” message is first detected, there will be an epoch with at least \(d/2\) time left to accumulate complaints. Since we assume that fake news messages are ones that are received and complained on by many users, and that users are likely to complain shortly after first receiving a message, this provides enough time for a threshold of complaints to be reached.
The most significant performance bottleneck in FACTS is the necessary global lock on the table \(T\) while a single user is waiting to download their user set \(U_C\) and reply with their complaint index. Even though the communication size is quite small for practical settings, the inherent latency across global communications networks may impose a challenge.
For example, if many complaining users have a round-trip latency of more than 200ms, then the global complaint rate among all users cannot be higher than 5 complaints per second, or some 432,000 complaints per day, regardless of any parameter settings or chosen epoch length.
One possible solution for a large-scale platform facing this issue would be to allow multiple local complaint servers, each with their own CCBF table \(T\), to independently operate and accumulate complaints per message. This makes sense, as most targeted misinformation content is local to a given country or region, and it would still be possible for each regional server to share audited message information with the others in order to prevent the spread of viral false content between regions.
While many messaging and social media platforms currently employ their own “in-house” teams for content moderation, there have been some attempts at separating the role of the server from that of auditor.
From a protocol standpoint, we can imagine a separate Server and Auditor: the former is semi-honest, handles the encrypted messaging system, and maintains the public CCBF table \(T\). The Auditor is fully honest and non-colluding, but computationally limited; intuitively, the third-party Auditor should only be involved once a message has passed the desired threshold of complaints.
The FACTS system supports this option easily, without the need for any additional cryptographic setup during origination. Because the CCBF table \(T\) is globally shared among all users as well as the Auditor, any complaining user who computes \(\mathsf{TestCount}{}\) on their own and sees that the probabilistic threshold has been surpassed can then forward their complaint (i.e., the opened message) directly to the Auditor. Being fully honest, the Auditor may hold a copy of the decryption key from origination and use this to determine what kind of action may be necessary (such as suspending the originating user’s account, flagging the message, etc.).
While it doesn’t appear idea imposes any additional interesting challenges from a cryptographic standpoint, it could be useful for some kinds of messaging platforms.
Our FACTS system is certainly no more private than the underlying EEMS which is being used to actually pass messages between users. In our analysis, we explicitly assumed that the EEMS leaks metadata on the sender and recipient of each message, but not the contents.
However, some existing EEMS attempt to also obscure this metadata in transmitting messages, so that the server does not learn both sender and recipient of any message. This can trivially be accomplished by foregoing a central server and doing peer-to-peer communication (note that FACTS may still be useful as a central complaint repository); or using more sophisticated cryptography to hide metadata [168], [169], [170].
Of particular interest to us is the recently deployed sealed sender mechanism on the popular Signal platform [171]. The goal in this case is to obscure the sender, but not the recipient, from the server handling the actual message transmission. We note that this concept plays particularly well with FACTS: the additional leakage in our protocol, namely the identity of each complaining user, can presumably be correlated via timings with the receipt of some message, but this is exactly what is already revealed under sealed sender! Both systems thus still hide message sender and originator identities (at least until an audit is performed).
However, note that recent work [172] has shown that some timing attacks are still possible under sealed sender, and the same attacks would apply just as well to FACTS. But the solutions proposed in [172] might also be deployed alongside FACTS to prevent such leakage; we leave the investigation of this question for future work.
The most common approach today for reporting malicious messages in encrypted messaging systems is message franking [82], [83], [84]. Message franking allows a recipient to prove the identity of the sender of a malicious message. However, message franking is focused on identifying the last sender of a message, whereas we are interested in identifying the originator. Moreover, message franking does not provide any threshold-type guarantees to prevent unmasking of senders given only one (or a few) complaints.
Oblivious Random-Access Memory (ORAM) [158], [173], [174] allows a client to obliviously access encrypted memory stored on a server without leaking the access pattern to the server. The standard ORAM definition assumes a single user with full control over the database. While some important progress has been made on multi-client ORAM protocols [85], [86], [87], these solutions are still not scalable to millions of malicious users as would be needed for our application.
Like CCBF, oblivious counters [88], [89] build counters that can be stored and incremented without revealing the value of the counter. However, these techniques focus on exact counting and do not provide efficient ways of storing large numbers of counters, as needed for our application. More generally, oblivious data structures, e.g., [90], [91], [92], construct higher-level structures such as heaps and trees to enable oblivious operations over encrypted data. However, these largely focus on higher-level applications and do not provide the compression achieved by CCBF.
CCBF can be viewed as a small data structure (a sketch) for storing the counts of complaints on a large set of messages. There has indeed been a lot of recent interest (e.g., [40], [41], [42], [43], [175], [176], [177]) in private sketching algorithms for cardinality estimation, frequency measurement, and other approximations. However, these works generally focus on a multi-party setting, with multiple parties running secure computation to evaluate the statistic in question. Since our goal was to restrict ourselves to user-server communication only, such techniques do not seem applicable to our setting.
Lemma 71 (4.1). Consider a Bloom filter with false positive rate \(\frac{1}{m}\), where \(m\) is an arbitrary positive integer. Suppose at most \(m\) \(\mathsf BF.Check\) operations are performed in the BF. Then, for any \(\delta > 0\), we have: \[\Pr [ \mbox{\rm \# false positives} \ge 1+\delta] \le {\frac {e^\delta} {(1+\delta)^{(1+\delta)}}}.\]
Proof. Let \(\alpha_i\) be the \(i\)th item that is checked through \(\mathsf BF.Check\). That is, we consider a sequence \[{\mathsf BF.Check}(\alpha_1), \ldots, {\mathsf BF.Check}(\alpha_{m}),\] where \(\alpha_i\) is an arbitrary item. Since we wish to upper bound the false positives (i.e., we don't care about true positives), it suffices to consider the case that for every \(i\), \(\alpha_i \not \in {\mathsf BF}\) (i.e., \(\alpha_i\) has not been inserted in the BF), as this maximizes the number of possible false positives.
Let \(X_1, \ldots, X_m\) be independent Bernoulli random variables with \(\Pr[X_i = 1] = 1/m\). Since the BF false positive rate is assumed to be \(1/m\), we have for all \(i\), \[\Pr[{\mathsf BF.Check}(\alpha_i)=1] = \Pr[\mbox{query $i$ is a false positive}] \le 1/m.\] Thus, we can bound the number of false positives by \(\sum_{i=1}^m X_i\).
Now, let \(\mu := {\bf Exp}[\sum X_i] = m \cdot \frac 1 m = 1\). By applying the Chernoff bound with \(\mu = 1\), we have: \[\Pr\left[\sum_{i=1}^m X_i \ge 1 + \delta \right] \le {\frac {e^\delta} {(1+\delta)^{(1+\delta)}}} .\] ◻
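As a quick numerical sanity check of the lemma (the parameters here are illustrative, not from the text), one can simulate the \(m\) independent \(\mathsf{Bernoulli}(1/m)\) trials and compare the empirical tail with the Chernoff expression:

```python
# Empirically compare Pr[#false positives >= 1 + delta] for m Bernoulli(1/m)
# checks against the Chernoff bound e^delta / (1+delta)^(1+delta) of Lemma 71.
import math
import random

def empirical_tail(m: int, delta: float, trials: int = 20_000) -> float:
    hits = 0
    for _ in range(trials):
        false_positives = sum(random.random() < 1 / m for _ in range(m))
        hits += false_positives >= 1 + delta
    return hits / trials

def chernoff_bound(delta: float) -> float:
    return math.exp(delta) / (1 + delta) ** (1 + delta)

m, delta = 256, 4.0
print(empirical_tail(m, delta), "<=", chernoff_bound(delta))  # ~0.004 <= ~0.017
```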
Theorem 72 (6.3). Given an FHE scheme, and a \((n, s, c, f_p)\)-CODE scheme over domain \(D\) in the random oracle model, the construction in Algorithm [alg:ssCODE] yields a \((\ell, f_p)\)-secure search scheme for records in domain \(D\) in the random oracle model, where \(\ell = \frac{c(s)\cdot \ell_c}{s}\), \(\ell_c\) is the length of an FHE ciphertext with plaintext space \(D\), and \(s\) is the number of matching records.
Proof. We begin by proving that the adversary cannot distinguish between two different queries. The adversary chooses a database \(x\) and two queries \(q^0, q^1\), with the promise that \(s = \sum_{i=1}^n q^0(x_i) = \sum_{i=1}^n q^1(x_i)\).
The entire view of the adversary during the experiment can be reconstructed efficiently given (1) the encrypted database \(\unicode{x27E6}x \unicode{x27E7}\), (2) the encrypted query \(\unicode{x27E6}q \unicode{x27E7}\), (3) the decrypted value of \(s\).
We note that the CODE scheme may return either more than \(s\) values to the client (in case of a false positive) or less than \(s\) values (in case decoding fails), but both of these occur with probability at most \(\mathsf{negl}(\kappa)\) and thus we can ignore them in the following.
Since the value of \(s\) is the same for \(q^0\) and \(q^1\), the only thing that changes in the view of the adversary when switching from \(b = 0\) to \(b=1\) is the encrypted query \(\unicode{x27E6}q^b \unicode{x27E7}\). Therefore, the adversary guesses \(b\) with negligible advantage by the IND-CPA security of the FHE scheme.
The proof that the adversary cannot distinguish between the same query applied to two different databases follows nearly identically. ◻
We assume that readers are familiar with security notions of standard cryptographic primitives [178] and formal definitions of a protocol securely realizing an ideal functionality (cf. [179]).
The ideal functionality works as follows:
\(\mathcal{F}_{2PC}\): Ideal functionality for evaluating two-party circuits.
The functionality has the following parameter:
Two party binary circuits \(C_1(\cdot, \cdot)\) and \(C_2(\cdot, \cdot)\).
The functionality proceeds as follows:
Receive inputs \(x_1\) and \(x_2\) from \(P_1\) and \(P_2\) respectively.
Send \(C_1(x_1, x_2)\) to \(P_1\) and \(C_2(x_1, x_2)\) to \(P_2\).
It is well known that Yao's protocol securely realizes \(\mathcal{F}_{2PC}\) in the semi-honest security setting with a constant number of rounds and \(O(|C_1| + |C_2|)\) communication [180].
We say that two vectors \(\boldsymbol{d} = (d_1, d_2, \ldots)\) and \(\boldsymbol{d}' = (d_1', d_2', \ldots)\) are neighboring if they have the same length and there exists exactly one index \(i\) s.t. \(d_{i} \neq d'_{i}\).
Definition 73 (Differential privacy with a trusted curator [181], [182]). A mechanism \(\mathcal{M}\) satisfies \((\epsilon,\delta)\)-differential privacy if for all neighboring data sets \(\boldsymbol{d}\) and \(\boldsymbol{d}'\), and all sets \(S \subseteq Range(\mathcal{M})\) \[\Pr[\mathcal{M}(\boldsymbol{d}) \in S] \le e^\epsilon \cdot \Pr[\mathcal{M}(\boldsymbol{d}') \in S] + \delta\]
Our presentation here follows similar definitions given in prior work [43], [183], [184]. For a two-party protocol \(\Pi\) and an input \(({\boldsymbol{d}}_1, {\boldsymbol{d}}_2)\), we let \(\Pi({\boldsymbol{d}}_1, {\boldsymbol{d}}_2)\) denote the execution of \(\Pi\) on this input. For an adversary \(\mathcal{A}\) (corrupting either \(P_1\) or \(P_2\)), we let \(\mathsf{view}^\Pi_A({\boldsymbol{d}}_1, {\boldsymbol{d}}_2)\) be the view of \(\mathcal{A}\) in the protocol (consisting of its input, its random tape, the protocol transcript, and the output).
Definition 74. Let \(\epsilon>0\) and \(0 \le \delta< 1\). A (randomized) protocol \(\Pi\) preserves computational two-party \((\epsilon,\delta)\)-Differential Privacy, if for any PPT distinguisher \(\cal D\), for any PPT adversary \(\mathcal{A}\), and for all neighboring inputs \({\boldsymbol{d}}:= {\boldsymbol{d}}_1 \| {\boldsymbol{d}}_2\) and \({\boldsymbol{d}}' := {\boldsymbol{d}}_1' \| {\boldsymbol{d}}_2'\), there exists a negligible function \(\mathsf{negl}(\cdot)\) such that, \[\Pr[{\cal D}(\mathsf{view}^\Pi_A({\boldsymbol{d}}_1, {\boldsymbol{d}}_2),1^\kappa)=1] \le e^\epsilon\cdot \Pr[{\cal D}(\mathsf{view}^\Pi_A({\boldsymbol{d}}_1', {\boldsymbol{d}}_2'),1^\kappa)=1] + \delta + \mathsf{negl}(\kappa)\]
We can securely realize \(\mathcal{F}_{\mathsf{biasCoin}}\) by executing \(\mathcal{F}_{2PC}\) for the following circuit \(C_{\mathsf{coinflip}}\). Since we just execute \(\mathcal{F}_{2PC}\) with a circuit, security of the protocol is immediate.
\(C_{\mathsf{coinflip}}( \|\boldsymbol{w}_1\|_1, \{r_{1,j}\}_{j=1}^\kappa, b_1, \|\boldsymbol{w}_2\|_1, \{ r_{2,j} \}_{j=1}^\kappa, b_2)\) \(\rhd\) \(r_j, b_j\) are random bits.
\(P_1\)'s input is \((\|\boldsymbol{w}_1\|_1, \{r_{1,j}\}, b_1)\) and \(P_2\)'s input is \((\|\boldsymbol{w}_2\|_1, \{r_{2,j}\}, b_2)\). We require \(\|\boldsymbol{w}_1\|_1, \|\boldsymbol{w}_2\|_1, \{r_{1,j}\}, \{r_{2,j}\} \in \{0,1\}^\kappa\), and \(b_1, b_2 \in \{0,1\}\).
Let \(s_1 = \|\boldsymbol{w}_1\|_1\) and \(s_2 = \|\boldsymbol{w}_2\|_1\). Let \(s = s_1 + s_2\). Compute \({\mathsf mask}= 0^{\kappa-h}1^h\) such that \(s \& {\mathsf mask}= s\) and \(s | {\mathsf mask}= {\mathsf mask}\) where \(\&\) (resp., \(|\)) denotes bitwise AND (resp., bitwise OR) operation. Note that there is a single \(h\) satisfying the above conditions, i.e., the effective bit-length \(h\) of \(s\) with \(2^{h-1} \le s < 2^h\). This computation can be done by checking all possible candidates of \(h\) one by one in \(O(\kappa)\) steps.
For \(j = 1, \ldots, \kappa\), let \(r_j = (r_{1,j} \oplus r_{2,j}) \;\&\; {\mathsf mask}\). Note that it holds that \(r_j < 2^h\).
Find the first \(j^*\) such that \(r_{j^*} < s\). If there is no such \(j^*\), output an error.
Compute \(b = b_1 \oplus b_2\).
If \(r_{j^*} < s_1\), output \(b\) to both \(P_1\) and \(P_2\). Otherwise, output \(b\) to \(P_1\) and \(b \oplus 1\) to \(P_2\).
Note that \(\Pr[r_j \ge s] = 1 - s/2^{h} \le 1/2\), since \(s \ge 2^{h-1}\). Therefore, with \(\kappa\) repetitions, we find a good \(j^*\) with probability at least \(1-2^{-\kappa}\). Finally, we have \(\Pr[r_{j^*} < s_1 \mid r_{j^*} < s] = \frac {s_1}{s} = p\).
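To illustrate the logic inside \(C_{\mathsf{coinflip}}\), the following plaintext sketch (insecure by construction: it omits the garbled evaluation and the XOR-sharing of inputs and outputs) implements the masking and rejection steps and yields a coin with bias \(s_1/s\):

```python
# Plaintext sketch of the C_coinflip logic: mask kappa-bit randomness down to
# the effective bit-length h of s = s1 + s2, reject r >= s, and output 1
# exactly when r < s1, i.e., with probability s1/s.
import random

def biased_coin_plaintext(s1: int, s2: int, kappa: int = 128) -> int:
    s = s1 + s2
    h = s.bit_length()                    # unique h with 2^(h-1) <= s < 2^h
    mask = (1 << h) - 1                   # 0^(kappa-h) 1^h
    for _ in range(kappa):                # kappa repetitions suffice whp
        r = random.getrandbits(kappa) & mask   # uniform in [0, 2^h)
        if r < s:                         # accepted with probability s/2^h >= 1/2
            return 1 if r < s1 else 0
    raise RuntimeError("all trials rejected (probability <= 2^-kappa)")

s1, s2 = 3, 5
freq = sum(biased_coin_plaintext(s1, s2) for _ in range(10_000)) / 10_000
print(freq, "~", s1 / (s1 + s2))          # ~0.375
```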
We describe the simulator \({\mathsf{Sim}}\) in the \(\{\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{2PC} \}\)-hybrid model for the case that Party 1 is corrupted. The simulator and proof of security are analogous in the case that Party 2 is corrupted.
\({\mathsf{Sim}}\) receives as input \(\boldsymbol{w}_1\), the output \(i^*\), and \(||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2\). \({\mathsf{Sim}}\) samples \(r^*\) from a geometric distribution with success probability \(p = ||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\).
\({\mathsf{Sim}}\) invokes Party 1 on input \(\boldsymbol{w}_1\). For \(i \in [r^*-1]\): Party 1 sends its input to the first three invocations of \(\mathcal{F}_{\mathsf{osample(L_1)}}\), and \({\mathsf{Sim}}\) returns three random values in \(\mathbb{Z}_n\); Party 1 sends its input to the second three invocations of \(\mathcal{F}_{\mathsf{osample(L_1)}}\), and \({\mathsf{Sim}}\) again returns three random values in \(\mathbb{Z}_n\); Party 1 sends its input to the \(\mathcal{F}_{2PC}\) functionality, and \({\mathsf{Sim}}\) returns \(\bot\). For \(i = r^*\), \({\mathsf{Sim}}\) behaves identically, except that when Party 1 sends its input to the \(\mathcal{F}_{2PC}\) functionality, \({\mathsf{Sim}}\) returns \(i^*\).
It is clear that the view of Party 1 is identical in the ideal and real world, assuming that \({\mathsf{Sim}}\) samples the first succeeding round, \(r^*\), from the correct distribution. In the following, we argue that this is indeed the case.
As was shown in the correctness analysis, if the protocol has not already halted before round \(r\), then the probability of halting (and outputting some valid index) in round \(r\) is: \[||\boldsymbol{w}_1||^2 + 2 \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle + ||\boldsymbol{w}_2||^2 = ||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2.\] Since \(r^*\) is defined as the round in which the protocol halts, the distribution of \(r^*\) is exactly the distribution of the number of Bernoulli trials (with success probability \(p = ||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\)) needed to get one success, i.e., a geometric distribution with success probability \(p = ||\boldsymbol{w}_1 + \boldsymbol{w}_2||_2^2\), which is exactly what \({\mathsf{Sim}}\) samples from.
We first give the simulation of the sender. The simulator proceeds as follows:
Send \(\boldsymbol{w}\) to \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\) and receive \(\pi\) as the output. Place \(\pi\) on the random tape of the sender.
Simulate the output of \(\mathcal{F}_{2PC}\) by sending a random \(r_1\) to the sender.
Simulate the key generation protocol honestly.
Send a random encryption for \(\unicode{x27E6}r_2 \unicode{x27E7}\) in Step 4. Due to the semantic security of the underlying FHE scheme, the simulation is indistinguishable. Next, we give the simulation of the receiver.
The simulator receives output \(i \oplus\pi\) from \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\).
The simulator plays the role of the functionality \(\mathcal{F}_{2PC}\), sending a random \(r_2\) as the output to the receiver.
The simulator runs the key generation protocol honestly, and stores the threshold decryption key of the sender.
In Step 6, the simulator computes \(c := \unicode{x27E6}i \oplus\pi \unicode{x27E7}\) and sends it to the receiver.
The simulator runs the threshold decryption protocol honestly. The simulation is perfect.
We describe the simulator \({\mathsf{Sim}}\) in the \(\{\mathcal{F}_{L_1}^{ss}, \mathcal{F}_{2PC}\}\)-hybrid model for the case that Party 1 is corrupted. The simulator and proof of security are analogous in the case that Party 2 is corrupted.
\({\mathsf{Sim}}\) receives as input \(\boldsymbol{w}_1\) and the output \(i^*\). \({\mathsf{Sim}}\) invokes Party 1 on input \(\boldsymbol{w}_1\). For \(j \in [B]\), the simulator works as follows:
Upon Party 1 sending its input to \(\mathcal{F}_{L_1}^{ss}\), \({\mathsf{Sim}}\) returns a uniformly random share \(r\).
In place of the encryption of \(w_{2,i_j}\) from Party 2, \({\mathsf{Sim}}\) sends Party 1 an FHE ciphertext encrypting 0.
Upon Party 1 sending its input to \(\mathcal{F}_{2PC}\), \({\mathsf{Sim}}\) returns to Party 1 an FHE ciphertext encrypting 0.
The only differences in the view of Party 1 in the ideal and hybrid worlds, are that (1) In the hybrid world it gets a secret share of \(i_j\), whereas in the ideal world it gets a uniformly random value; (2) In the hybrid world it gets an encryption of \(w_{2,i_j}\) from Party 2, whereas in the ideal world it gets an encryption of 0; (3) In the hybrid world it gets encryptions of \(i_j\) or \(0\) from the ideal functionality \(\mathcal{F}_{2PC}\), whereas in the ideal world it always gets encryptions of \(0\).
Receiving uniformly random values instead of correct secret shares does not affect the view of Party 1, since the additive secret sharing used has perfect secrecy. Further, switching from encryptions of \(w_{2,i_j}\) and \(i_j\) to encryptions of \(0\) is indistinguishable due to the semantic security of the threshold FHE scheme. Thus, the view of Party 1 is computationally indistinguishable in the hybrid world and the ideal world.
This concludes the proof of security of the \(L_2\) sampling protocol.
We want to show that \[\begin{aligned} \Pr_{D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] &= \frac{(w_{1,i} + w_{2,i})^p}{||\boldsymbol{w}_1 + \boldsymbol{w}_2||_p^p}\\ &= \frac{(w_{1,i} + w_{2,i})^p}{c \cdot (||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p)}\\ &\leq \frac{2^{p-1} (w^p_{1,i} + w^p_{2,i})}{c \cdot (||\boldsymbol{w}_1||_p^p + ||\boldsymbol{w}_2||_p^p)}\\ &= 2^{p-1}/c \cdot \Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i]. \end{aligned}\] The inequality holds due to Jensen's inequality with the convex function \(f(x) = x^p\): \[\begin{aligned} 1/2^p \cdot(w_{1,i} + w_{2,i})^p & = f(1/2 \cdot w_{1,i} + 1/2 \cdot w_{2,i}) \\ & \le 1/2 \cdot f(w_{1,i}) + 1/2 \cdot f(w_{2,i}) \\ & = 1/2 \, (w_{1,i}^p + w_{2,i}^p). \end{aligned}\] This completes the proof of Lemma 21.
Note that \(\Pi_{L_p}\) simply performs rejection sampling in a distributed setting, where sampling from \(D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) and computing the probabilities are done in a distributed manner. By a standard rejection-sampling argument, as long as for all \(i \in [n]\), \[\label{eq:rejection_2} \Pr_{D_{L_p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i] \leq 2^p/c \cdot \Pr_{D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)}[i],\] then \(\Pi_{L_p}\) samples from exactly the correct distribution, and the number of samples required from \(D_{\mathsf{ignore}, p}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) in protocol \(\Pi_{L_p}\) follows a geometric distribution with success probability \(c/2^p\). Thus, if condition ([eq:rejection_2]) is met, the protocol samples exactly correctly and completes in an expected \(2^p/c \leq 2^p \in \tilde{O}(1)\) number of rounds (since \(c \geq 1\) and \(p \in O(1)\)). Further, condition ([eq:rejection_2]) is met due to Lemma 21. Finally, each round has \(\tilde{O}(1)\) communication, since \(\Pi_{\mathsf{ignore}}\) has communication \(\tilde{O}(1)\) (by Lemma 18) and since, in addition, only a constant number of length-\(\tilde{O}(1)\) values are exchanged in each round. Combining the above, we have that \(\Pi_{L_p}\) has expected communication \(\tilde{O}(1)\) and worst-case (with all but negligible probability) communication \(\tilde{O}(1)\).
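In the clear, the rejection-sampling loop at the heart of \(\Pi_{L_p}\) can be sketched as follows (a centralized, insecure Python mock; in the actual protocol, both the proposal draw and the acceptance test are computed obliviously between the parties):

```python
# Rejection sampling for D_{L_p}: propose i from D_{ignore,p}, which weights
# coordinate i by w1_i^p + w2_i^p, and accept with probability
# (w1_i + w2_i)^p / (2^p * (w1_i^p + w2_i^p)) <= 1. Accepted samples follow
# (w1_i + w2_i)^p / ||w1 + w2||_p^p exactly; expected rounds are O(2^p).
import random

def sample_ignore_p(w1, w2, p):
    weights = [a ** p + b ** p for a, b in zip(w1, w2)]
    return random.choices(range(len(w1)), weights=weights)[0]

def sample_Lp(w1, w2, p):
    while True:
        i = sample_ignore_p(w1, w2, p)
        accept = (w1[i] + w2[i]) ** p / (2 ** p * (w1[i] ** p + w2[i] ** p))
        if random.random() < accept:
            return i
```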
We will show that the joint distribution over the \(i\)-th entries of \(\tilde{\boldsymbol{w}_1} := \boldsymbol{M} \boldsymbol{w}_1 = (\tilde w_{1,1}, \ldots, \tilde w_{1,k})\), \(\tilde{\boldsymbol{w}_2} = \boldsymbol{M} \boldsymbol{w}_2 = (\tilde w_{2,1}, \ldots, \tilde w_{2,k})\) can be sampled perfectly, given only \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\), \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
Due to independence of each of the coordinates of \(\tilde{\boldsymbol{w}_1}, \tilde{\boldsymbol{w}_2}\), this immediately implies that the entire \(\mathsf{approxIP}(\boldsymbol{w}_1, \boldsymbol{w}_2)\) can be simulated perfectly given only \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\), \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\).
We begin by noting that \[\begin{aligned} \tilde{w}_{1,i} &=& w_{1,1} M_{i,1} + w_{1,2} M_{i,2} + \cdots + w_{1,n} M_{i,n} \\ \tilde{w}_{2,i} &=& w_{2,1} M_{i,1} + w_{2,2} M_{i,2} + \cdots + w_{2,n} M_{i,n} \end{aligned}\]
In the following, we show how to jointly sample \((\tilde{w}_{1,i}, \tilde{w}_{2,i})\).
Step 1: We begin by sampling from the marginal distribution over the first element of the tuple \(\tilde{w}_{1,i}\). Note that \(\tilde{w}_{1,i}\) is distributed exactly as a Gaussian random variable with mean \(0\) and variance \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\). Thus, we can perfectly sample from the marginal distribution over \(\tilde{w}_{1,i}\) given only \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\). Let \(z\) be the resulting sample.
Step 2: We would now like to sample from the conditional distribution \(\tilde{w}_{2,i}\), conditioned on \(\tilde{w}_{1,i} = z\).
First, the conditional distribution of \(M_{i,1}, \ldots, M_{i,n}\) conditioned on \(w_{1,1} M_{i,1} + w_{1,2} M_{i,2} + \cdots + w_{1,n} M_{i,n} = z\) is defined by the multivariate Gaussian distribution with the following mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\) (see Corollary 7 in [185]): \[\boldsymbol{\mu} = \frac{z}{\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle} \cdot \boldsymbol{w}_1 , ~~~~~ \boldsymbol{\Sigma} = \boldsymbol{I} - \frac{\boldsymbol{w}_1\boldsymbol{w}_1^T}{\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle}\] Now, \(\tilde{w}_{2,i}\) is a linear combination of the variables \(M_{i,1}, \ldots, M_{i,n}\) with coefficients \(w_{2,1}, \ldots, w_{2,n}\). Therefore, \(\tilde{w}_{2,i}\) is distributed as a univariate Gaussian with mean \(\mu'\) and variance \(\sigma'\) as follows (see [186] for example). \[\mu' = \langle \boldsymbol{w}_2, \boldsymbol{\mu} \rangle = \frac{z \langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle}{\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle}\] and \[\sigma' = \boldsymbol{w}_2^T\boldsymbol{\Sigma}\boldsymbol{w}_2 = \boldsymbol{w}_2^T\boldsymbol{w}_2 - \frac{\boldsymbol{w}_2^T\boldsymbol{w}_1\boldsymbol{w}_1^T\boldsymbol{w}_2}{\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle} = \langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle - \frac{(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle)^2}{\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle}\] Note that the mean and variance depend only on \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\), \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\), so we can sample from this distribution given only those values. Let \(y\) be the resulting sample.
Step 3: Output \((z, y)\).
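These three steps translate directly into the following sampling procedure (a sketch assuming numpy; `ip11`, `ip22`, `ip12` stand for \(\langle \boldsymbol{w}_1, \boldsymbol{w}_1 \rangle\), \(\langle \boldsymbol{w}_2, \boldsymbol{w}_2 \rangle\), and \(\langle \boldsymbol{w}_1, \boldsymbol{w}_2 \rangle\)):

```python
# Jointly sample (z, y) distributed as (<w1, M_i>, <w2, M_i>) for a row M_i of
# i.i.d. N(0,1) entries, using only the three inner products.
import numpy as np

def sample_pair(ip11: float, ip22: float, ip12: float):
    z = np.random.normal(0.0, np.sqrt(ip11))   # Step 1: marginal of <w1, M_i>
    mu = z * ip12 / ip11                       # Step 2: conditional mean
    var = ip22 - ip12 ** 2 / ip11              #         conditional variance
    y = np.random.normal(mu, np.sqrt(max(var, 0.0)))
    return z, y                                # Step 3
```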
We implement oblivious sampling using a 1-out-of-\(m\) OT scheme. In particular, the receiver, as an OT receiver, chooses a random index from \([m]\), and the sender, as an OT sender, prepares an \(m\)-dimensional input vector that encodes the \(L_1\) distribution of \(\boldsymbol{w}\) in a way that we will describe soon.
With this approach, each element from the prepared OT input vector will be chosen uniformly with probability \(\frac 1 m\). Therefore, the size \(m\) affects the level of precision of the sampling. In particular, we set \(\mu := 1/m\) as a precision unit, and we assume the following:
For each \(i \in [n]\), it holds that \(\frac{w_{i}}{\|\boldsymbol{w}\|_1}\) is a multiple of \(\mu\). If the input vector \(\boldsymbol{w}\) is not consistent with the above requirement, one can round it by using the following function \({\mathsf{rounding}}_{\mu}(\boldsymbol{w})\):
\({\mathsf{rounding}}_\mu(\boldsymbol{w})\)
Let \(\boldsymbol{w} = (w_1, \ldots, w_n)\). For \(i = 1, \ldots, n\), compute \(w_i' = {\mathsf{trunc}}_{\mu}(w_i)\). Here, for any real number \(x\) with \(x \in [0,1]\), denote \({\mathsf{trunc}}_{\mu}(x) = \tilde x \cdot \mu\) where \(\tilde x\) is an integer that minimizes \(\Delta := x - \tilde x \cdot \mu\) subject to \(\Delta \ge 0\). Typically, we have \(\mu = 2^{-q}\) for a certain positive integer \(q\), and \({\mathsf{trunc}}_{\mu}(x)\) simply truncates the lower-order bits in the binary representation of \(x\).
Let \(\boldsymbol{w}' = (w_1', \ldots, w_n')\).
Repeat the following until \(L_1\) norm of \(\boldsymbol{w}'\) becomes 1.
Find \(j = \arg\max_{i \in [n]} (w_i - w_i')\), and increase \(w'_j\) by \(\mu\).
Output \(\boldsymbol{w}'\).
The above algorithm ensures that for all \(i\), it holds that \(|w'_i - w_i| < \mu\); that is, \(\boldsymbol{w}'\) is a good approximation of \(\boldsymbol{w}\), with each element having an additive error of at most \(\mu\). To see why, note that in step 1, some \(w'_i\)s get truncated, leading to a small difference \(w_i - w_i' < \mu\). In step 2, since the truncated mass is added back to the elements in decreasing order of the difference \(w_i - w_i'\), only some of the truncated \(w'_i\)s are updated to \(w'_i + \mu\) (which is still within \(\mu\) of \(w_i\)) until the \(L_1\) norm of \(\boldsymbol{w}'\) becomes 1.
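For concreteness, a direct Python transcription of \({\mathsf{rounding}}_\mu\) might look as follows (illustrative; it replaces the repeated \(\arg\max\) of step 2 with an equivalent one-shot greedy pass over the largest truncation errors, and ignores floating-point subtleties):

```python
# rounding_mu: truncate each normalized weight down to a multiple of mu = 1/m,
# then hand the lost mass back, mu at a time, to the coordinates with the
# largest truncation error until the L1 norm is exactly 1 again.
def rounding_mu(w, m):
    mu = 1.0 / m
    total = sum(w)
    x = [wi / total for wi in w]               # normalize to L1 norm 1
    xp = [int(xi / mu) * mu for xi in x]       # truncate to multiples of mu
    deficit = round((1.0 - sum(xp)) / mu)      # number of mu-units still missing
    # Each increment drops that coordinate's error below zero, so repeated
    # argmax is the same as taking the top-`deficit` errors once.
    order = sorted(range(len(x)), key=lambda i: x[i] - xp[i], reverse=True)
    for i in order[:deficit]:
        xp[i] += mu
    return xp
```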
In situations where low precision is acceptable, this OT-based solution could be more efficient. However, if one needs a higher level of precision, we recommend the FHE-based solution described in the next subsection. We also observe that by using an OT protocol with \(O(\log m)\) communication ([187], Theorem 2.2), we expect to be able to support fairly large values of \(m\). The bottleneck for larger \(m\) is likely to be storage, and the computation time needed for the OT.
Inputs: The sender has input \(\boldsymbol{w}\). We require every \(w_{i}/\|\boldsymbol{w}\|_1\) is a multiple of \(\mu := \frac 1 m\).
The sender computes the following:
Given \(\boldsymbol{w}\), the sender prepares an \(m\)-dimensional input vector as follows:
For \(i = 1, \ldots, n\), do:
Let \(k_i = \frac{w_i}{\|\boldsymbol{w}\|_1} \cdot m\). Insert \(k_i\) copies of the index \(i\) into the \(m\)-dimensional vector \(\boldsymbol{v}\); that is, there should be \(k_i\) slots (out of \(m\)) whose value is \(i\) in the \(m\)-dimensional vector \(\boldsymbol{v}\). Note that for each \(i\), the fraction of the slots containing the index \(i\) in \(\boldsymbol{v}\) is \(\frac{k_i}{m} = \frac{w_i}{\|\boldsymbol{w}\|_1}\).
The sender chooses a pad \(\pi\) uniformly at random.
Let \(\boldsymbol{v} = (v_{1}, \ldots, v_{m})\). The sender shuffles \(\boldsymbol{v}\) and blinds it by updating \(v_{i} := v_{i} \oplus\pi\).
Execute an OT protocol where the sender is the OT sender with input \(\boldsymbol{v}\) and the receiver is the OT receiver with a randomly chosen number from \([m]\). Let \(u\) be the output to the OT receiver.
Output \(\pi\) to the sender and output \(u\) to the receiver.
Protocol: Oblivious sampling protocol realizing \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\) based on \(1\)-out-of-\(m\) OT
With the assumption about the level of precision of the input vector \(\boldsymbol{w}\), we can implement oblivious sampling. See Protocol [fig:osample-ot].
Due to the security of the OT protocol, the OT sender does not learn the OT receiver's choice. However, the OT receiver does learn the sampled index, which leaks information about the data array; we hide this information from the OT receiver by having the OT sender shuffle the input vector.
At the end of the OT protocol, the sender and receiver hold the sampled index \(i\) in secret-shared form; that is, the sender holds \(\pi\) and the receiver holds \(\pi \oplus i\). Note that the sender re-uses the same \(\pi\) across all inputs in order to fix its share independently of the receiver's index. Security holds even with the re-use of this value, because the receiver learns only a single element.
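The encoding and blinding steps can be mocked up in a few lines (the OT is replaced by a direct, insecure array lookup, purely to illustrate why a uniform choice over the \(m\) slots samples from the \(L_1\) distribution):

```python
# Insecure mock of Protocol [fig:osample-ot]: index i fills a w_i/||w||_1
# fraction of the m slots, so a uniform slot choice samples i with exactly
# its L1 probability; shuffling and XOR-blinding with pi leaves the receiver
# holding only the share u = i XOR pi.
import random

def osample_L1_mock(w, m):
    total = sum(w)
    v = []
    for i, wi in enumerate(w):                 # k_i = (w_i/||w||_1) * m copies
        v.extend([i] * round(wi / total * m))
    assert len(v) == m                         # needs each w_i/||w||_1 to be a multiple of 1/m
    pad_bits = max(1, (len(w) - 1).bit_length())
    pi = random.getrandbits(pad_bits)          # sender's random pad (ceil(lg n) bits)
    random.shuffle(v)                          # hide which slot holds which index
    table = [vi ^ pi for vi in v]              # blind every slot with pi
    u = table[random.randrange(m)]             # the "OT": receiver's uniform choice
    return pi, u                               # shares: pi XOR u = sampled index

pi, u = osample_L1_mock([1, 1, 2], m=8)
print("sampled index:", pi ^ u)
```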
We will prove the following theorem.
Theorem 75. Protocol [fig:osample-ot] securely realizes \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\) in the semi-honest security model.
Proof. We first give the simulation of the sender. This is trivial in the OT-hybrid model, as the sender receives no messages in this protocol. The simulator submits the sender's input to \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\), and receives \(\pi\) as output. It places \(\pi\) on the sender's random tape, accepts the sender's input to the OT functionality, and terminates.
Next, we give the simulation of the receiver. The simulator submits input to \(\mathcal{F}_{{\mathsf{osample(L_1)}}}\), and receives output \(u\). Upon receiving the OT choice from the adversary, it feeds \(u\) to the adversary as the OT output. The simulation is perfect, since the view of the adversary contains nothing more than \(u\). ◻
We give a more formal description of the functionality \(\mathcal{F}_{L_1}^{ss}\).
\(\mathcal{F}_{L_1}^{ss}\): Ideal functionality for two-party \(L_1\) sampling
The functionality has the following parameter:
\(n \in \mathbb{N}\). The dimension of the input weight vectors \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\).
The functionality proceeds as follows:
Receive inputs \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) from \(P_1\) and \(P_2\) respectively.
Sample \(i \in [n]\) with probability \(\frac{w_{1,i} + w_{2,i}}{ \|\boldsymbol{w}_1 + \boldsymbol{w}_2\|_1 }\).
Choose a random number \(\pi\) consisting of \(\lceil \log n \rceil\) bits.
Send \(\pi\) to \(P_1\) and \(i \oplus\pi\) to \(P_2\).
We describe a protocol securely realizing \(\mathcal{F}_{L_1}^{ss}\) in the \((\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{\mathsf{biasCoin}})\)-hybrid.
Inputs: Party \(P_b\) has input \(\boldsymbol{w}_b\).
Execute \(\mathcal{F}_{\mathsf{osample(L_1)}}\) with \(P_1\) as a sender with input \(\boldsymbol{w}_1\) and \(P_2\) as a receiver. Let \(\langle i_1 \rangle\) be the secret share of the output index.
Execute \(\mathcal{F}_{\mathsf{osample(L_1)}}\) with \(P_2\) as a sender with input \(\boldsymbol{w}_2\) and \(P_1\) as a receiver. Let \(\langle i_2 \rangle\) be the secret share of the output index.
Execute \(\mathcal{F}_{\mathsf{biasCoin}}\), where \(P_1\) has input \(\|\boldsymbol{w}_1\|_1\) and \(P_2\) has input \(\|\boldsymbol{w}_2\|_1\). Let \(\langle b \rangle\) be the secret share of the output bit. In addition, \(P_1\) chooses a random string \(\pi\) of \(\lceil \log n \rceil\) bits.
Execute \(\mathcal{F}_{2PC}\) for the following circuit:
Input: \(\langle i_1 \rangle, \langle i_2 \rangle, \langle b \rangle, \pi\).
Compute \(i = i_1 \cdot (1-b) + i_2 \cdot b\).
Output \(\pi\) to \(P_1\) and \(\pi \oplus i\) to \(P_2\).
Protocol: Protocol securely realizing \(\mathcal{F}_{L_1}^{ss}\) in the \((\mathcal{F}_{\mathsf{osample(L_1)}}, \mathcal{F}_{\mathsf{biasCoin}})\)-hybrid.
Security of the protocol can be shown similarly to Theorem 7.
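In the clear, the sampling logic realized by this protocol reduces to picking party \(b\) with probability proportional to \(\|\boldsymbol{w}_b\|_1\) and then taking that party's local \(L_1\) sample; the following sketch makes the correctness computation explicit:

```python
# Plaintext equivalent of the F_L1^ss sampling logic: Pr[i] =
# s1/(s1+s2) * w1_i/s1 + s2/(s1+s2) * w2_i/s2 = (w1_i + w2_i)/||w1 + w2||_1.
import random

def sample_L1_joint(w1, w2):
    s1, s2 = sum(w1), sum(w2)
    if random.random() < s1 / (s1 + s2):       # role of F_biasCoin (b = 0)
        return random.choices(range(len(w1)), weights=w1)[0]   # P1's local sample
    return random.choices(range(len(w2)), weights=w2)[0]       # P2's local sample
```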
Let \(c= p/(1-p)\). We need to find \(a < np\) and \(b > np\) satisfying the following conditions. \[\begin{aligned} \frac{\Pr_{\mathsf{B}(n,p)}[a]}{\Pr_{\mathsf{B}(n,p)+s}[a]} = \frac{ \binom{n}{a} p^a (1-p)^{n-a}} { \binom{n}{a-s} p^{a-s} (1-p)^{n-a+s}} < \left ( \frac{ n-a} {a -s} \right)^s \cdot c^s \le e^\epsilon. \end{aligned}\] \[\begin{aligned} \frac{\Pr_{\mathsf{B}(n,p)}[b]}{\Pr_{\mathsf{B}(n,p)+s}[b]} = \frac{ \binom{n}{b} p^{b} (1-p)^{n-b}} { \binom{n}{b-s} p^{b-s} (1-p)^{n-b+s}} > \left ( \frac{ n-b} {b} \right)^s \cdot c^s \ge e^{-\epsilon}. \end{aligned}\]
We set \(a = \frac{np + s(1-p)\cdot e^{\epsilon/s}}{e^{\epsilon/s}\cdot (1-p) + p}\) and \(b = \frac{e^{\epsilon/s} \cdot np}{(1-p) + e^{\epsilon/s} \cdot p}\). Note these values satisfy the above inequalities.
To show the second requirement of the tail bound, it suffices to show that \(\Pr_{\mathsf{B}(n,p)}[X \le a+s] = \mathsf{negl}(\kappa)\); the other case holds similarly.
Let \(\mu := np \in \Theta(\kappa)\) and let \(d = 1- (a+s)/\mu\). By applying the Chernoff bound, we have \[\Pr_{X \leftarrow\mathsf{B}(n,p)}[X \le a+s] = \Pr[ X \le (1-d) \mu ] \le \exp(-d^2\mu/2).\] We will show that we have \(d = \Omega(\frac{1}{\lg\lg \kappa})\), which implies that with \(\mu \in \Theta(\kappa)\), the above probability is negligible in \(\kappa\). Let \(t = s(1-p)\cdot e^{\epsilon/s}\) and \(u = e^{\epsilon/s}\cdot (1-p) + p\). Then, we have \(a = \frac{\mu + t}{u}\). Note that we have \(\epsilon/s \cdot (1-p) + 1 \le u \le e\). We have the following: \[\begin{aligned} d = \frac{\mu - a - s}{\mu} = \left(1 - \frac{1}{u}\right) - \frac{t/u + s}{\mu} \ge \frac{\epsilon/s \cdot (1-p)}{e} - \frac{t/u + s}{\mu} = \Omega\left(\frac{1}{\lg\lg \kappa}\right) - \tilde O(1/\kappa). \end{aligned}\]
Let \(a = \frac{c}{c+e} \cdot (n+e\epsilon) < np\) and \(b = \frac{c}{c+e^{-1}} \cdot n > np\). Observe that \(\frac{n-a}{a-s} \cdot c = e\) and \(\frac{n-b}{b} \cdot c = e^{-1}\). Therefore, given \(s \le \epsilon\), the above inequalities hold. The tail bounds specified as the second condition of the lemma can be shown using the Chernoff bound, since \(np-a \in \Theta(np) = \Theta(\kappa)\), as does \(b-np\).
Let \(t = \frac{1}{n_R}\), and let \(\mathsf{low} = 1-\left(\frac{1}{2}+\theta\right)^t\) and \(\mathsf{high} = 1-\left(\frac{1}{2}-\theta\right)^t\). Then, we have: \[\begin{aligned} &\Pr_{h}[\mathsf{good}_\theta(h, A, I, n_B)] \\ & = \Pr_{h} \left[\left(\min h(I) \geq\mathsf{low}\right) \land \left(\min h(I) = \min h(A)\right)\right] \\ & ~~~~ - \Pr_{h} \left[\left(\min h(I) \geq \mathsf{high}\right) \land \left(\min h(I) = \min h(A)\right)\right]\\ & = \Pr_{h} \left[\min h(A) \geq\mathsf{low}\right] \cdot \Pr\left[ \min h(I) = \min h(A) \mid \min h(A) \geq\mathsf{low}\right] \\ & ~~~~ - \Pr_{h} \left[\min h(A) \geq \mathsf{high}\right] \cdot \Pr\left[ \min h(I) = \min h(A) \mid \min h(A) \geq \mathsf{high}\right] \\ & \ge \left(\frac{1}{2}+\theta\right)^{t \cdot n_A} \cdot \frac{n_I}{n_A} - \left(\frac{1}{2}-\theta\right)^{t \cdot n_A} \cdot \frac{n_I}{n_A} \end{aligned}\]
We can lower-bound \(|K_\theta|\) by \(\mathsf{B}(k-s, p_\theta)\). Recall \(s \in O(\lg\lg\kappa)\). Let \(\mu = (k-s) p_\theta = \Omega(\kappa).\) Applying the Chernoff bound, we have \(\Pr\left[ |K_\theta| \le (1 - 1/3) \mu \right] \le \exp(-\mu/18) \le \mathsf{negl}(\kappa). \quad\blacksquare\)
We first review the hockey-stick divergence [188], [189], [190]. The hockey-stick divergence between two probability measures \(X, Y\) over \(Z\) is defined as: \[\mathsf{D}^\mathsf{hs}_{\alpha}\left({X}, {Y}\right) = \sup_{S \subseteq Z}(X(S) - \alpha Y(S)) = \sum_{z \in Z} [X(z) - \alpha Y(z)]_+,\] where \(\alpha \geq 1\) and \([x]_+ = \max\{x,0\}\). We observe that the following holds directly from the definition of the hockey-stick divergence.
Corollary 76. For any probability measures \(X,Y\) over \(Z\) and for any \(\epsilon, \delta\), it holds \[X \approx_{\epsilon,\delta}Y \mbox{ if and only if } \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({X}, {Y}\right) \leq \delta\mbox{ and } \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({Y}, {X}\right) \leq \delta.\]
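For finite distributions, both the divergence and the \((\epsilon,\delta)\)-closeness test of Corollary 76 are directly computable; a small helper (assuming distributions represented as dictionaries over a finite \(Z\)) is:

```python
# Hockey-stick divergence D^hs_alpha(X, Y) = sum_z max(X(z) - alpha*Y(z), 0),
# and the two-sided (eps, delta) check from Corollary 76.
import math

def hockey_stick(X: dict, Y: dict, alpha: float) -> float:
    support = set(X) | set(Y)
    return sum(max(X.get(z, 0.0) - alpha * Y.get(z, 0.0), 0.0) for z in support)

def close_eps_delta(X: dict, Y: dict, eps: float, delta: float) -> bool:
    a = math.exp(eps)
    return hockey_stick(X, Y, a) <= delta and hockey_stick(Y, X, a) <= delta
```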
Let \({\eta_{-\theta}}= 1/2 - \theta\) and \({\eta_{+\theta}}= 1/2 + \theta\). For brevity, we let \(C\) denote \(\mathsf{PB}(n,p_J)\). For any distribution \({\cal D}\), let \(P_{\cal D}\) denote the probability measure with respect to \({\cal D}\). We first show that for any \(\epsilon>0\), it holds that \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{C}}, {P_{C + 1}}\right)\) is at most \[\begin{aligned} \max\left(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, {\eta_{+\theta}})}}, {P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, {\eta_{+\theta}})+1}}\right), \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, \eta_{-\theta})}}, {P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, \eta_{-\theta})+1}}\right) \right). \end{aligned}\]
We start by showing that an upper bound on the hockey-stick divergence is attained at extreme points. We rely on the results in [149]. Although they use the Rényi divergence, their results are general enough to apply to any \(f\)-divergence.
Lemma 77. ([149], Lemma 3.5) \[\begin{aligned} \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{C}}, {P_{C + 1}}\right) \leq \max_{j \in [n]} \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(j, \eta_{-\theta}) + \mathsf{B}(n-j,\eta_{+\theta})}}, {P_{\mathsf{B}(j, \eta_{-\theta}) + \mathsf{B}(n-j,\eta_{+\theta})+1}}\right).\label{eqa:twobinom} \end{aligned}\]
Next, we apply data processing inequality to simplify ([eqa:twobinom]) from the above lemma.
Lemma 78. ([149], Lemma 3.6) ([eqa:twobinom]) is upper bounded by \[\begin{aligned} \max \left( \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, {\eta_{+\theta}})}}, {P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, {\eta_{+\theta}})+1}}\right), \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, \eta_{-\theta})}}, {P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, \eta_{-\theta})+1}}\right)\right). \end{aligned}\]
We extend the above to upper bound the hockey-stick divergence between probability measures differed by an integer amount greater than 1, i.e., \(P_{C}\) and \(P_{C+s}\) for \(s>1\).
Corollary 79. For any \(\epsilon>0\), it holds that \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{C}}, {P_{C + s}}\right)\) is at most \[\begin{aligned} \max \left( \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, {\eta_{+\theta}})}}, {P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, {\eta_{+\theta}})+s}}\right), \mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, \eta_{-\theta})}}, {P_{\mathsf{B}(\lceil\frac{n}{2}\rceil, \eta_{-\theta})+s}}\right) \right). \end{aligned}\]
Finally, to give a bound on the divergence, we can apply Lemma 49 to argue that the binomial distribution hides the small sensitivity. Specifically, as \(\lceil\frac{n}{2}\rceil \in \Theta(\kappa)\) and \(s =\lg\lg\kappa\), we can claim \((\epsilon, \delta)\)-DDP with \(\delta= \mathsf{negl}(\kappa)\).
Similarly, it holds that \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({P_{C+s}}, {P_{C}}\right) \le \mathsf{negl}(\kappa)\). ◻
We keep the same definition of a \(\theta\)-good iteration, except we set the exponent to \(1/n_R'\), instead of \(1/n_R\), and we also require \(\theta \le 1/10\). In particular,
\(\min h(A) = \min h(I)\) and \(\min h(I) \in \left[1-\left(\frac{1}{2}+\theta\right)^t, 1-\left(\frac{1}{2}-\theta\right)^t\right]\) with \(\boxed{t = \frac{1}{n_R'}}\).
The total number of iterations in the min-hash protocol \(\pi_{\mathsf NMH}\) is \(k = \Omega(\kappa \cdot \lg\lg\kappa)\). We require that \(n_R/k^2 = \Omega(\kappa)\).
Using Lemma 53, with all but negligible probability, at least \(\Omega(\kappa \cdot \lg\lg\kappa)\) iterations are \(\theta\)-good. Recall that \(G_\theta\) denotes the set of \(\theta\)-good iterations, and \(K_\theta = G_\theta \setminus S_{x^*}\). We set \(k_g = |K_\theta|\). We further divide these \(k_g\) iterations into \(u = \lg\lg \kappa\) bundles, each of which is of size \(k_b = \Omega(\kappa)\). Those bundles are denoted by \(K_{\theta, 1}, \ldots, K_{\theta,u}\). We also let \(K_{bad} := \overline{K_{\theta}}\).
Let \(out^+_{bad}\) be the protocol’s match count for sets \(A, B_{+x^*}\) w.r.t. the hash functions in \(K_{bad}\): \[out^+_{bad} := \left | \{ j \in K_{bad} : \min h_j(A) = \min h_j(B_{+x^*}) \} \right |.\] Likewise, let \(out_{bad}\) be the number of matches for sets \(A\) and \(B\) (instead of \(B_{+x^*}\)) in iterations in \(K_{bad}\). Similarly, for \(i \in [u]\), we let \(out^+_{i}\) and \(out_i\) denote the output for the \(i\)-th bundle, with or without \({x^*}\) respectively. Note that \(out^+_{i} = out_i\), since we ruled out \(S_{x^*}\) from \(K_\theta\). Note that the final output of the min-hash protocol for input \(B_{+x^*}\) is equal to \(out^+_{bad} + \sum_{i=1}^u out_i\); the final output for input \(B\) is \(out_{bad} + \sum_{i=1}^u out_i\). Let \[\boldsymbol{out} = out^+_{bad} || out_{bad} || out_1 || \cdots || out_u.\] We also consider the output with the \(i\)th bundle missing; that is, for \(i \in [u]\) let \[\boldsymbol{out}_{-i} = out^+_{bad} || out_{bad} || out_1 || \cdots || out_{i-1} || out_{i+1} || \cdots || out_u.\]
Since \(|K_{bad}|\) and \(|K_{\theta,i}|\) are at most \(k \in poly(\kappa)\), the total number of bits in \(\boldsymbol{out}\) is at most \[2\lg |K_{bad}| + \sum_{i=1}^u \lg |K_{\theta, i}| \leq ( 2 + \lg\lg \kappa) \lg(poly(\kappa)) \le \kappa.\]
The original distribution on the secret set \(R\) is the uniform distribution over all sets of size \(n_R\) with each element chosen from a universe \(\mathcal{U}\). The universe \(\mathcal{U}\) has size \(\ell \cdot n_R\) with \(\ell \ge 4(n_R)^3\).
Now choose, uniformly at random, a partition \(\{U_1, \ldots, U_{n_R}\}\) of \(\mathcal{U}\) where each \(|U_j| = \ell\) such that the element in the \(j\)th slot of \(R\) belongs to \(U_j\). These universes \(\{U_1, \ldots, U_{n_R}\}\) are leaked in the analysis.
Let \({\cal D}\) denote the original distribution over the set \(R\), but conditioned on the leaked information \(\{ U_1, \ldots, U_{n_R} \}\). The distribution \({\cal D}\) is equivalent to a distribution over streams of \(n_R\) elements, where the element in the \(i\)-th slot is chosen uniformly at random from \(U_i\). Therefore, \({\cal D}\) has min-entropy \(n_R \lg \ell\).
We additionally consider arbitrary leakage \(f(R)\) of length \(L\) such that \[n_R \lg \ell - L \ge \frac{8n_R}{9} \lg \ell + 2n_R.\]
For a fixed set \(Z \subseteq R\), in a min-hash graph, we say that a set of iterations in the \(i\)th bundle \(K_{\theta, i}\) is available with respect to \(Z\) if there are no edges from \(Z\) to that set. In other words, no elements in \(Z\) contribute to the final count reduction for any of those iterations; in this sense, those iterations are still available for the count reduction by elements other than those in \(Z\). More formally, consider a graph \(G \leftarrow{\bf MinhashG}_{H_1}(A, I, {x^*}, H_2)\) and let \(G = (\mathcal{X}, \mathcal{Y}, \mathcal{E})\); we define \[\mathsf{Avail}_G(K_{\theta,i}, Z) := \{ j \in K_{\theta, i}: \forall z \in Z: (z, j) \not \in \mathcal{E}\}.\]
We now describe an experiment to check if the \(i\)th bundle of iterations is good in the sense that given the fixed hash, the distribution \({\cal D}\) (after the leakage) satisfies the DP-like property conditions specified in Lemma [cor:fixed_oracle_prob]. Roughly speaking, Lemma [cor:fixed_oracle_prob] shows that a bundle will be good with a high probability.
Process IsAGoodBundle\((i, \boldsymbol{out}_{-i}, {\cal D}, A, I, {x^*}, H_1, H_2)\):
Consider \(G \leftarrow{\bf MinhashG}_{H_1}(A, I, {x^*}, H_2)\).
Let \({\cal D}_{1,i} := {\cal D}\mid \boldsymbol{out}_{-i}\). In other words, \({\cal D}_{1,i}\) is the distribution \({\cal D}\) on \(R\), but conditioned on the output vector \(\boldsymbol{out}_{-i}\). If \({\cal D}_{1,i}\) has min-entropy less than \(n_R \lg(\ell) - L - 2\kappa\) then output \(\mathsf{FAIL}_{1,i}\) and terminate.
Check if there is a leakage function \(f_G(R)\) which leaks \(V = \{j_1, \ldots, j_{n'_R} \}\) and \(T = \mathsf{Avail}_G(K_{\theta,i}, R \setminus R')\) such that there exists a distribution with the Geometric Collision Property over sets \(R' = \{x_j \in R : j \in V\}\). If there is no such distribution, output \(\mathsf{FAIL}_{2,i}\) and terminate. Let \({\cal D}_{2,i} := {\cal D}_{1,i} \mid f_G(R)\).
If it holds \(|T| \le \frac{1}{10} | K_{\theta,i}|\), output \(\mathsf{FAIL}_{3,i}\) and terminate. Let \(k_v = |T|\).
Compute \(D_{T, r}({\cal D}_{2,i})\) and check if \(D_{T,r}\) satisfies the conditions given in Lemma [cor:fixed_oracle_prob]. Output \(\mathsf{FAIL}_{4,i}\) and terminate, if the above check fails.
Output SUCCESS.
We claim that \(\mathsf{FAIL}_{1,i}\) takes place with a negligible probability. By applying [150], Lemma 2.2, the average min-entropy of \({\cal D}|\boldsymbol{out}_{-i}\) is at least \(n_R\lg \ell - L - \kappa\), which implies that the min-entropy of \({\cal D}| \boldsymbol{out}_{-i}\) is at least \(n_R\lg \ell - L - 2\kappa \ge \frac{8n_R}{9} \lg \ell + n_R\) with probability \(1 - 2^{-\kappa}\) (assuming that \(n_R \ge 2\kappa\)).
Lemma 80. The experiment outputs \(\mathsf{FAIL}_{2,i}\) with a negligible probability.
We give the proof later in Section 7.3.8.
We show that \(\mathsf{FAIL}_{3,i}\) occurs with negligible probability. Let \(n = n_R\) and \(n' = n'_R\) for brevity of notation. Recall that \(n' = n/3\). Let \(X_j\) be an indicator variable that represents whether there is an edge from any of the \(n-n'\) nodes of \(R \setminus R'\) to iteration \(j\). Therefore, we have \[\Pr_{H_2}[|T| = r] = \Pr_{H_2} \left[ \sum_{j=1}^{k_b} X_j = k_b - r \right].\] Recall that \(p_j \le 1-({\eta_{-\theta}})^{1/n'}\) and \(\Pr[X_j = 1] = 1 - (1 - p_j)^{n-n'} \le 1 - ({\eta_{-\theta}})^\frac{n-n'}{n'} = 1 - ({\eta_{-\theta}})^2 \le 1 - (2/5)^2\). Therefore, we have \[m:= {\bf E}\left[ \sum_{j=1}^{k_b} X_j \right] \le k_b \cdot (1 - ({\eta_{-\theta}})^2) \le 0.84 k_b.\] Using the Chernoff bound, and since \(k_b \in \Omega(\kappa)\), we have \[\Pr_{H_2}\left [|T| \le \frac{k_b}{10} \right] = \Pr_{H_2} \left[ \sum_{j=1}^{k_b} X_j \ge \frac{9}{10}k_b \right] \le \exp\left(-\frac{(0.9 k_b - m)^2}{2 m}\right) = \exp(-\Omega(\kappa)). \quad\blacksquare\]
By Lemma [cor:fixed_oracle_prob], for all \(i \in [u]\), conditioned on \(\mathsf{FAIL}_{1,i}\), \(\mathsf{FAIL}_{2,i}\), \(\mathsf{FAIL}_{3,i}\) not occurring, let \[p_4 := \Pr_{H_1, H_2}[\textbf{IsAGoodBundle}(i, \boldsymbol{out}_{-i}, {\cal D}, A, I, {x^*}, H_1, H_2) =\ \mathsf{FAIL}_{4,i}].\] Then, we have \(p_4 \in O(k_v \lg^3(\kappa)/(n_R)^{0.5})\).
Observe that conditioned on \(\mathsf{FAIL}_{1,i}\), \(\mathsf{FAIL}_{2,i}\), \(\mathsf{FAIL}_{3,i}\) not occurring, the process outputs \(\mathsf{FAIL}_{4,i}\) independently of \((\boldsymbol{out}_{-i}, R')\), since the hash values in \(H_2\) for any iteration are chosen independently of those for the other iterations. Using the above, since \(k_v /\sqrt{n_R} = O(1/\sqrt{\kappa})\), we have the following:
The experiment IsAGoodBundle outputs SUCCESS for at least one bundle with probability \(1 - p_4^u = 1 - \mathsf{negl}(\kappa)\).
We define a noise distribution \(\Phi\) and give an analysis of the hockey stick divergence of \(\Phi(r)\) and \(\Phi(r-\lg\lg(\kappa))\).
Definition 81 (Noise distribution \(\Phi\)). We define \(\Phi(r)\) as follows:
Choose \(H_1\) and \(H_2\) randomly.
Let \(i^* \in [u]\) be the index of a bundle for which IsAGoodBundle outputs SUCCESS.
For \(r \in [0, k_v]\), output \(D_{T, r}({\cal D}_{2,i^*})\), where \(T = \mathsf{Avail}_G(K_{\theta, i^*}, R \setminus R')\).
For \(r \not \in [0, k_v]\), \(\Phi(r) := 0\).
Lemma 82. The hockey-stick divergences \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({\Phi(r)}, {\Phi(r-\lg\lg(\kappa))}\right)\) and \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({\Phi(r-\lg\lg(\kappa))}, {\Phi(r)}\right)\) are both negligible in \(\kappa\).
Proof. For brevity, for any \(r\), denote \(D_r := D_{T, r}({\cal D}_{2,i^*})\). Conditioned on IsAGoodBundle outputting SUCCESS with input \(\boldsymbol{out}_{-{i^*}}\), we have \(a\) and \(b\) such that for \(r \in [a + \lg\lg \kappa, b]\), \[e^{-\epsilon} \le \frac{e^{-\epsilon/3}E^{n'}_{k_v, r}}{e^{\epsilon/3}E^{n'}_{k_v, r-\lg\lg(\kappa)}} \le \frac{D_r}{D_{r-\lg\lg(\kappa)}} \le \frac{e^{\epsilon/3}E^{n'}_{k_v, r}}{e^{-\epsilon/3}E^{n'}_{k_v, r-\lg\lg(\kappa)}} \le e^{\epsilon}.\] The first and last inequalities are from Corollary 56. The second and third inequalities are from the condition that the process outputs SUCCESS. The hockey-stick divergence \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({\Phi(r)}, {\Phi(r-\lg\lg(\kappa))}\right)\) is therefore at most \[\sum_{r \not \in [a+\lg\lg\kappa, b]} D_r \le k_v \cdot \mathsf{negl}(\kappa) = \mathsf{negl}(\kappa).\] Similarly, \(\mathsf{D}^\mathsf{hs}_{e^\epsilon}\left({\Phi(r-\lg\lg(\kappa))}, {\Phi(r)}\right)\) is also \(\mathsf{negl}(\kappa)\). ◻
Let \(c\) be the final count produced by running protocol \(\pi_{\mathsf NMH}\). We consider the probabilities \[\Pr_{H_1, H_2,\mathcal{D}}[ c \mid B_{+x^*}] \mbox{~~and~~} \Pr_{H_1, H_2, \mathcal{D}}[c \mid B].\]
We consider only runs of the protocol that yield \(c\) and for which there exists some \(i^* \in [u]\) such that the process \({\bf IsAGoodBundle}\) returns SUCCESS given \(\boldsymbol{out}_{-i^*}\) as input. We have just argued that such an \(i^*\) exists with all but negligible probability.
Further, we consider only runs of the protocol for which \(out^+_{bad} - out_{bad} \leq s = \lg\lg(\kappa)\). By Lemma 45, this also occurs with all but negligible probability. We will also leak \(k_v = |\mathsf{Avail}(K_{\theta, i^*}, R \setminus R')|\).
Conditioned on the above events, by the definition of the distribution \(\Phi\), the value \(out_{i^*}\) contributes \((k_v - r)\) to the final count \(c\) with probability \(p = \Phi(r)\). Recall that every iteration \(j\) in \(K_{\theta,i^*}\) is good, which means \(\min h_j(A) = \min h_j(I)\), potentially contributing to the output.
Therefore, assuming none of the bad events occurs (which happens with overwhelming probability), by applying Lemma 82, the probability that the ratio of the probabilities of a given output \(out\) for \(B_{+x^*}\) and for \(B\) is not contained in \([e^{-\epsilon}, e^{\epsilon}]\) is \(\mathsf{negl}(\kappa)\), and we therefore conclude that the protocol satisfies DDP security.
When considering the probability of \(D_{T,r}({\cal D})\) and \(I_{R', T, r}\) over the choice of \(H_2\), the identity of \(T\) does not matter, only its size \(k_b = |T|\). Therefore, in this case, we simply write \(D_{k_b,r}({\cal D})\) and \(I_{R', k_b, r}\). Moreover, when it is clear from the context, we sometimes omit \(k_b\) and \({\cal D}\) and write \(E^{R'}_{r} = E^{n'_R}_r\), \(I_{R',r} = I_{R', k_b, r}\), and \(D_r = D_{k_b,r}({\cal D})\).
We first show the following lemma holds.
Lemma 83. Let \({\cal D}\) be a distribution over sets of size \(n_R'\) with geometric collision property. Fix \(H_1\) and consider \(k_b, \theta, a, b\) specified in Lemma 56 with the same requirements. Then, we have the following:
If \(r \not \in [a+s, b]\), we have \(\Pr_{H_2}\left[ D_{k_b,r}({\cal D}) \le \mathsf{negl}(\kappa) \right] \ge 1 - \mathsf{negl}(\kappa)\).
If \(r \in [a,b]\), then we have
\[\Pr_{H_2}\left[ e^{-\epsilon/3} E^{n'_R}_{k_b,r} \leq D_{k_b,r}({\cal D}) \leq e^{\epsilon/3} E^{n'_R}_{k_b,r} \right] \ge 1-(e^{\epsilon/3}-1)^{-2} \cdot \frac{16 \lg^3(\kappa)}{\sqrt{n_R}}.\]
Then, Lemma [cor:fixed_oracle_prob] follows by taking a union bound over different cases of \(r \in [k_b]\).
We also define \(\rho(R') := \Pr_{R' \sim \tilde{\mathcal{D}}}[R']\).
We first consider Case (1). By applying Case (1) of Corollary 56, we have \(E^{n_R'}_{r} \in \mathsf{negl}(\kappa)\). Given \(E^{n_R'}_{r} \in \mathsf{negl}(\kappa)\), we show \[\Pr_{H_2}\left[ D_r({\cal D}) \leq \mathsf{negl}(\kappa) \right] \ge 1 - \mathsf{negl}(\kappa).\]
Recall that \(D_r({\cal D}) = \sum_{R'} \rho(R') \cdot I_{R',r}\). Assume towards a contradiction that the negation of the statement holds. This means there are polynomials \(p\) and \(q\), and a collection \(\mathsf Heavy\) of \(R'\)s, such that \[\Pr_{H_2}\left[ \sum_{R' \in {\mathsf Heavy}} \rho(R') \cdot I_{R', r} \ge 1/p(\kappa) \right] \ge 1/q(\kappa).\]
The above implies that \(\sum_{R' \in {\mathsf Heavy}} \rho(R') \ge 1/p(\kappa)\). Now, since \({\cal D}\) and \(H_2\) are independent, we have \(\sum_{R' \in {\mathsf Heavy}} \rho(R') \Pr_{H_2}\left[I_{R', r} \right] \ge \frac{1}{p(\kappa) q(\kappa)}.\) However, considering that \(\Pr_{H_2}\left[I_{R', r}\right] = E^{n_R'}_{r}\), which is negligible, the above is a contradiction.
We will bound \(D_r = \sum_{R'} \rho(R') \cdot I_{R', r}\) using Chebyshev's inequality. For this, we need to bound the variance of \(D_r\).
We start with showing the following lemma, which will allow us to ignore the tail when we bound the variance. Below, the value \(z\) will correspond to the size of the intersection of the two sets \(R'_i\) and \(R'_j\).
Lemma 84. Fix \(H_1\). Consider a graph \(G \leftarrow{\bf MinhashG}_{H_1}(A, I, {x^*}, H_2)\). Consider any set \(T\) of iterations in \(G\) such that \(|T| = k_b\). Let \(Z\) be a set of left nodes in \(G\) such that \(|Z| \le n'_R\). Let \(z = |Z|\). Consider the probability (over the choice of \(H_2\)) that \(Z\) has more than \(z \lg\lg \kappa\) outgoing edges in \(G\). This probability is negligible in \(\kappa\).
Proof. Let \(p = 1 - ({\eta_{-\theta}})^{1/n'_R}\). We first show that \(p \le 1/n'_R.\) Recall \(\theta \le 1/10\), which implies \(e^{-1} \le 1/2 - \theta = {\eta_{-\theta}}\). Therefore, we have \((1-1/n_R')^{n_R'} \le e^{-1} \le {\eta_{-\theta}}\), so \(1-1/n_R' \le ({\eta_{-\theta}})^{1/n_R'}\). Therefore, we have \(p = 1 - ({\eta_{-\theta}})^{1/n'_R} \le 1/n'_R\).
Let \({\mathsf{Edges}}(Z, T)\) be the set of edges from \(Z\) to \(T\). Over the choice of \(H_2\), the probability that each pair in \(Z \times T\) forms an edge is at most \(p\). Therefore, we can simply use a binomial distribution to bound the probability. In particular, with \(t = \lg\lg \kappa\) we have \[\begin{aligned} \Pr_{H_2}\big[ |{\mathsf{Edges}}(Z,T)| \ge zt\big] \le \Pr \big[ \mathsf{B}(z k_b, p) \ge zt \big] \le \binom{z k_b}{zt} \cdot p^{zt} \le \binom{z k_b}{zt} \cdot \left(\frac{1}{n_R'}\right)^{zt} \le \left(\frac{e k_b}{t n'_R}\right)^{zt} \end{aligned}\] Since \(n'_R\) is much larger than \(k_b\), the above probability becomes negligible in \(\kappa\). ◻
Now we prove the following lemma towards bounding the variance of \(D_r\).
Lemma 85. Fix \(H_1\). We set the parameters for \(k_b, a\) and \(b\) as stated in Lemma [cor:fixed_oracle_prob]. Let \(R'_i, R'_j\) be sets of nodes on the left of size \(n'_R\) such that with \(|R'_i \cap R'_j| = z\). Let \(\zeta = z \lg\lg\kappa\). Then for all \(a \leq r \leq b\), we have \[\begin{aligned} \Pr_{H_2}[I_{R'_i,r} \wedge I_{R'_j, r}] = \mathop{\mathrm{\mathbb{E}}}_{H_2}[I_{R'_i,r} \cdot I_{R'_j, r}] \leq \left( 1 + \frac{\zeta \cdot (e^{\zeta \epsilon/3}+1)}{{\eta_{-\theta}}^{z k_b/n'_R} } \right) \left(E^{n'_R}_r\right)^2 \end{aligned}\]
Proof. Fix \(R'_i, R'_j\) with \(|R'_i \cap R'_j| = z\). Let \(Z = R'_i \cap R'_j\) and \(X = R'_i - Z\). Then, we have \[\begin{aligned} \Pr_{H_2}[I_{R'_i,r} \wedge I_{R'_j, r}] & = \sum_{m=0}^r \Pr[ I_{X,m} \wedge I_{Z, r-m} \wedge I_{R'_j, r} ]\\ & \leq \sum_{m=0}^{r-\zeta} \Pr[I_{Z, r-m}] + \sum_{m=r-\zeta+1}^r \Pr[ I_{X,m} \wedge I_{R'_j, r} ] \\ & = \sum_{m = \zeta}^{r} \Pr[I_{Z, m}] + \sum_{m=r-\zeta+1}^r \Pr[ I_{X,m} ] \cdot \Pr[I_{R'_j, r} ] \\ &\le \mathsf{negl}(\kappa) + \sum_{m=r-\zeta+1}^r \Pr[I_{X, m}] \cdot \Pr[I_{R'_j, r}]\\ &= \mathsf{negl}(\kappa) + E^{n'_R}_r\cdot \sum_{m=r-\zeta+1}^r \Pr[I_{X, m}] . \end{aligned}\] The second inequality holds due to Lemma 84.
It is left to bound \(\Pr[I_{X, m}]\) for \(m \in (r-\zeta, r]\). We observe that \(\Pr_{H_2}[I_{X, m}] = \Pr[I_{R'_i, m} |~ I_{Z, 0}]\). In other words, the event that \(X\) contributes to noise pattern \(m\) is equivalent to the event that \(R'_i\) contributes to \(m\) conditioned on the intersection having no contribution. Therefore, we have \[\Pr_{H_2}[I_{X, m}] = \frac{\Pr[I_{R'_i, m} \wedge I_{Z, 0}]}{ \Pr[I_{Z,0}] } \le \frac{\Pr[I_{R'_i, m}]}{ {\eta_{-\theta}}^{z k_b/n'_R} } = \frac{E^{n'_R}_{m}}{ {\eta_{-\theta}}^{z k_b/n'_R} }.\]
We now bound \(E_m^{n'_R}\) for \(m \in (r-\zeta, r]\). Let \(m^* := \arg\max_{m} \{E^{n'_R}_m: m \in (r-\zeta, r] \}\). Using Corollary 56, we have \(E^{n'_R}_{m^*} \le (e^{\epsilon/3})^{\zeta} \cdot E^{n'_R}_{r} + \mathsf{negl}(\kappa)\). Therefore, we have \[\begin{aligned} \Pr_{H_2}[I_{R'_i,r} \wedge I_{R'_j, r}] &\le \mathsf{negl}(\kappa) + E^{n'_R}_r \cdot \sum_{m=r-\zeta+1}^{r} \Pr[I_{X, m}] ~\le~ \mathsf{negl}(\kappa) + \zeta \cdot E^{n'_R}_r \cdot \Pr[I_{X, m^*}] \\ &= \mathsf{negl}(\kappa) + \zeta \cdot E^{n'_R}_r \cdot \frac{E^{n'_R}_{m^*}}{ {\eta_{-\theta}}^{z k_b/n'_R} } = \mathsf{negl}(\kappa) + \zeta \cdot E^{n'_R}_r \cdot \frac{e^{\zeta \epsilon/3} E^{n'_R}_{r} + \mathsf{negl}(\kappa)}{ {\eta_{-\theta}}^{z k_b/n'_R} } \\ &\le \left( 1 + \frac{\zeta \cdot (e^{\zeta \epsilon/3}+1)}{{\eta_{-\theta}}^{z k_b/n'_R} } \right) \left(E^{n'_R}_r\right)^2 \end{aligned}\] ◻
We set the parameters for \(H_1, k_b, a\) and \(b\) as stated in Lemma [cor:fixed_oracle_prob]. Let \(\mathcal{D}\) be a distribution with the geometric collision property. Then, we show that for every \(a \leq r \leq b\), we have \[\mathsf{Var}_{H_2} [D_r] \leq \frac{16 \lg^3(\kappa)}{\sqrt{n_R}} \left (E^{n'_R}_{k_b, r} \right )^2.\]
Consider any \(r \in [a, b]\). Recall that \(D_r := \sum_{R' \in \mathsf{Supp}(\tilde{\mathcal{D}})} \rho(R') \cdot I_{R', r}\).
\[\begin{aligned} \mathsf{Var}_{H_2}[D_r] &= \sum_{R'_i, R'_j} \rho(R'_i) \cdot \rho(R'_j) \cdot ( \mathbb{E}[I_{R'_i, r } \cdot I_{R'_j, r} ] - \mathbb{E}[I_{R'_i, r }] \cdot \mathbb{E}[I_{R'_j, r}] ) \\ &\leq \sum_{R'_i, R'_j : |R'_i \cap R'_j| \ge 1} \rho(R'_i) \cdot \rho(R'_j) \cdot \mathbb{E}[I_{R'_i,r} \cdot I_{R'_j, r}]\\ &= \sum_{z=1}^{n'_R} \Pr_{R'_i, R'_j \sim {\cal D}} [|R'_i \cap R'_j| = z] \cdot \mathbb{E}[I_{R'_i,r} \cdot I_{R'_j, r}] \\ &\le \sum_{z=1}^{n'_R} \left(\frac{1}{\sqrt{n_R}}\right)^z \cdot \left( 1 + \frac{\zeta \cdot (e^{\zeta \epsilon/3} + 1)}{{\eta_{-\theta}}^{ z k_b/n'_R} } \right) \cdot \left(E^{n'_R}_r\right)^2 \\ &\le \sum_{z=1}^{n'_R} \left(\frac{1}{\sqrt{n_R}}\right)^z \cdot \left( \zeta \cdot \frac{e^\zeta + 2}{(2/5)^{\zeta/3} } \right) \cdot \left(E^{n'_R}_r\right)^2 \le \sum_{z=1}^{n'_R} \left(\frac{1}{\sqrt{n_R}}\right)^z \cdot 8^{\zeta+1} \cdot \left(E^{n'_R}_r\right)^2 \end{aligned}\] The first inequality holds because if \(R'_i\) and \(R'_j\) are disjoint, then \(I_{R'_i,r}\) and \(I_{R'_j, r}\) are independent over the choice of \(H_2\), and the corresponding terms cancel. The second inequality is due to the geometric collision property of \({\cal D}\) and Lemma 85. The third inequality holds with \(\epsilon\leq 3\) since \(\theta < 1/10\) and \(k_b\) is much smaller than \(n'_R\). Finally, since \(\zeta = z \lg\lg\kappa\), we have \(8^{\zeta+1} = 8 \cdot (\lg^3 \kappa)^z\), and therefore \(\mathsf{Var}_{H_2}[D_r] \le 8 \cdot \left(E^{n'_R}_r\right)^2 \cdot \sum_{z=1}^{n'_R} \left( \frac{\lg^3\kappa }{\sqrt{n_R}} \right)^z \leq \frac{16 \lg^3 \kappa}{\sqrt{n_R}} {\left (E^{n'_R}_{r} \right )}^2\). ◻
Finally, by Chebyshev, we have that for all \(a \leq r \leq b\), \[\begin{aligned} \label{eq:chebyshev} \Pr_{H_2}\left [D_r \notin [e^{-\epsilon/3} (E^{n'_R}_{k_b, r}), e^{\epsilon/3} (E^{n'_R}_{k_b, r})] \right ] & \leq \Pr\left [|D_r -E^{n'_R}_{k_b, r}| \geq (1-e^{-\epsilon/3}) \cdot E^{n'_R}_{k_b, r} \right ] \\ & \leq \frac{\mathsf{Var}[D_r]}{(1-e^{-\epsilon/3})^{2} \cdot (E^{n'_R}_{k_b, r})^2} \leq \frac{16 \lg^3(\kappa)}{(1-e^{-\epsilon/3})^{2}\sqrt{n_R}}. \end{aligned}\]
Fortunately, a stronger version of the chain rule is known to hold for a special leakage pattern, i.e., when elements are conditioned in order [144]. Very roughly speaking, for every \(i\), the min-entropy of \(R_i \mid (R_1, \ldots, R_{i-1})\) is essentially the same as the min-entropy of \((R_1, \ldots, R_i)\) minus the min-entropy of \((R_1, \ldots, R_{i-1})\), at the cost of an additional small leakage, called a spoiling leakage.
They achieve this by grouping possible sequences with a similar distributional characteristic into the same cluster. Then, in every cluster, the distribution of sequences conditioned on that cluster will be essentially flat. Now, the spoiling leakage corresponds to the cluster identifier. By making every cluster contain sufficiently many sequences (leading to sufficient min-entropy due to flatness), the total number of clusters can be small (leading to a short spoiling leakage).
For brevity, in this section, we omit the subscript from \(n_R\), i.e., we denote \(n = n_R\). For any sequence of random variables \(R = R_1, \dots, R_n\) (for the secret input \(R\)), we denote \(R_{<i} = R_1,\dots,R_{i-1}\) and \(R_{\leq i} = R_1,\dots,R_{i}\). Likewise, we extend such subscript notations and use \(R_{>i}\) and \(R_{\ge i}\). We use lower case \(r = r_1,\dots,r_n\) to denote the actual set/sequence.
We first adapt the result in [144] into our setting. Then, we argue that a sufficient number of elements still have high min-entropy, even conditioned on the previous elements. Finally, we show that these high min-entropy (conditioned) elements provide the geometric collision property.
Theorem 86 (Block structures with few bits spoiled in our setting). We consider a min-hash graph \(G = (\mathcal{X}, \mathcal{Y}, \mathcal{E})\) constructed from \({\bf MinhashG}_{H_1}(A, I, {x^*}, H_2)\), while focusing on a single bundle \(K_{\theta,*}\) of iterations.
Let \(\mathcal{U} = U_1 \times \cdots \times U_n\) be a fixed universe and \(R = (R_1,\dots,R_n)\) be a sequence of (possibly correlated) random variables where each \(R_i\) is over \(U_i\) (and all \(U_i\) are disjoint) with \(|U_i| = \ell\) for all \(i\). Then, for any \(\epsilon \in (0, 1)\) and any \(\delta> 0\), there exists a spoiling leakage function \(f = f_G(R)\) that satisfies the following properties.
1. It holds that \(\Pr_{R}[f(R) = \bot] \leq \epsilon n\).

2. Let \(Im(f)\) be the image of \(f\). Every \(y \in Im(f) \setminus \{\bot\}\) specifies two disjoint sets \(V\) and \(W\) such that \(V \cup W = [n]\).

3. Conditioned on any \(y \in Im(f) \setminus \{\bot\}\), for every \(i \in V\), every element in the distribution \(R_i \mid R_{<i}\) has low probability weight, i.e., \[\begin{aligned} &\forall y \in Im(f) \setminus \{\bot\}, \forall r \mbox{ s.t. } f(r) = y, \forall i \in V: ~~ \Pr\left[R_i = r_i~ \Bigg |~R_{<i} = r_{<i},~ y\right] \leq \frac{2^{\delta}}{n^{1.5}}. \end{aligned}\]

4. Conditioned on any \(y \in Im(f) \setminus \{\bot\}\), for every \(i \in W\), the distribution \(R_i \mid R_{<i}\) has small support size, i.e., \[\begin{aligned} \forall y \in Im(f) \setminus \{\bot\}, \forall r \mbox{ s.t. } f(r) = y, \forall i \in W: \\ \left | \left \{ r_i : \Pr[R_i = r_i \mid R_{<i} = r_{<i}, y] > 0 \right \} \right | \leq 2^{\delta} \cdot n^{1.5}. \end{aligned}\]

5. \(|Im(f)| \leq n \cdot (2e)^{n/2} \cdot \frac{(n+k_b)!}{n!} \cdot \left ( 2(\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^{n}\).

6. \(\mathsf{Avail}_G(K_{\theta,*}, R_W)\) can be computed from \(f(R)\), where \(R_W := \{R_i : i \in W\}\).
By following the general idea of [144], we will build clusters, and the spoiling leakage will be the cluster identifier. However, we will slightly change the way we build clusters.
Throughout our proof, we let \(\Pr[r_i]\) denote \(\Pr[R_i = r_i]\) for brevity, whenever the referenced random variable is clear. Before forming the clusters, we first exclude all sequences \(r \in \mathcal{U}= U_1 \times\dots\times U_n\) having a very small probability \(\Pr_{R}[R_i = r_i \mid R_{<i} = r_{<i}] < \epsilon/\ell\) for some \(i \in [n]\), and only consider the remaining \(\mathcal{U}' \subset \mathcal{U}\). Specifically, we let \(f(r) = \bot\) for all \(r \notin \mathcal{U}'\). As we will see later, this probability lower bound is vital for upper-bounding \(|Im(f)|\).
Claim 87. Let \(\mathcal{U}'\) be the set containing all the sequences \(r\) such that \(\Pr_{R}[R_i = r_i \mid R_{<i} = r_{<i}] \geq \epsilon/\ell\) for all \(i \in [n]\). Then, we have \(\Pr_R[R \in \mathcal{U}'] \ge 1 - \epsilon n\).
Proof. For each \(i \in [n]\), and any \(r_{<i} \in U_1 \times\dots\times U_{i-1}\), we have \[\sum_{u \in U_i: \Pr_{R}[R_i = u \mid R_{<i} = r_{<i}] < \epsilon/\ell} \Pr_{R}[R_i = u \mid R_{<i} = r_{<i}] < \sum_{u \in U_i} \epsilon/\ell =\epsilon.\] Therefore, using a union bound across all \(i\in [n]\), we have \(\Pr[r \not \in \mathcal{U}'] \leq \epsilon\cdot n\). ◻
For each \(r \in \mathcal{U}'\), we describe how to compute \(f(r) = (f_1(r), f_2(r),\) \(\ldots, f_n(r))\), which will serve as the cluster identifier. Let \(\mathsf{r}(a)\) denote a rounding function that rounds \(a\) to the closest multiple of \(\delta/2\). We say \(a \approx_{\mathsf{r}} a'\) if \(\mathsf{r}(a) = \mathsf{r}(a')\).
For each \(r\), do the following:

1. Let \(f_{>n}(r) = \bot\) for any \(r\), and initialize \(W = \emptyset\).

2. For \(i = n,\dots,1\), do the following:

   a. Let \(\mathsf{sp}^1_i(r)\) denote the surprise of the \(i\)th element of \(r\). More formally, \[\mathsf{sp}^1_i(r) = -\lg \Pr_{R}[R_i = r_i \mid R_{<i} = r_{<i}, ~f_{>i}(R) = f_{>i}(r)].\] This surprise measure represents how rare and surprising the event \(r_i\) is, conditioned on \(r_{<i}\) and \(f_{>i}(r)\). In a sense, we will group sequences with similar surprises into a cluster.

   b. Let \(\mathsf{sp}^2_i(r)\) denote the surprise, in aggregate, of the sequences with a similar surprise level: \[\mathsf{sp}^2_i(r) = -\lg \Pr_{R}[\mathsf{sp}^1_i(R) \approx_{\mathsf{r}} \mathsf{sp}^1_i(r) \mid R_{<i} = r_{<i}, ~f_{>i}(R) = f_{>i}(r)].\] Note that \(\mathsf{sp}^1_i(r) \ge \mathsf{sp}^2_i(r)\), since the sequence \(r\) itself attains surprise \(\mathsf{sp}^1_i(r)\), and possibly more sequences approximately share this surprise level. Note also that \(\mathsf{sp}^2_i(r)\) is a deterministic function of \(\mathsf{sp}^1_i(r)\), \(r_{<i}\), and \(f_{>i}(r)\).

   c. If \(\mathsf{r}(\mathsf{sp}^1_i(r)) - \mathsf{r}(\mathsf{sp}^2_i(r)) \geq 1.5\lg(n)\), then let \(f_i(r) = (\mathsf{r}(\mathsf{sp}^1_i(r)), \mathsf{true})\).

   d. Otherwise, let \(f_i(r) = (\mathsf{r}(\mathsf{sp}^1_i(r)), \mathsf{false}, H_i)\) and add \(i\) to \(W\). Here, \(H_i\) is defined as \(N(\{r_i\})\setminus N(r_W)\), where \(N\) refers to the neighbors (restricted to \(K_{\theta,*}\)) of the input set of nodes in \(G\). In other words, \(H_i\) contains the iterations newly covered by element \(r_i\); any iterations previously covered by \(r_W\) are excluded from \(H_i\). In this way, we reduce the length of the cluster identifier.

3. Set \(f(r) = f_1(r),\dots,f_n(r)\). Set \(V = [n] \setminus W\).
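To make the clustering procedure concrete, the following is a minimal Python sketch of the cluster-identifier computation for a small, explicitly given joint distribution. The representation (a dict from length-\(n\) tuples to probabilities) and the names `spoiling_leakage` and `round_to` are ours, and the graph-dependent side information \(H_i\) is omitted, so this illustrates only the surprise-based part of \(f\).

```python
import math
from collections import defaultdict

def round_to(a, step):
    """The rounding map r(.): round a to the nearest multiple of step."""
    return round(a / step) * step

def spoiling_leakage(dist, delta=1.0):
    """Sketch of the cluster identifier f(r) = (f_1(r), ..., f_n(r)).

    `dist` maps length-n tuples to probabilities. The H_i side information
    (which requires the min-hash graph) is omitted: f_i(r) here is only
    (r(sp1_i(r)), in_V), recording whether slot i lands in V or in W.
    """
    n = len(next(iter(dist)))
    thresh = 1.5 * math.log2(n)
    f_tail = {r: () for r in dist}   # f_{>i}(r); initially f_{>n}(r) is empty
    f = {r: [None] * n for r in dist}
    for i in range(n - 1, -1, -1):   # i = n, ..., 1 (0-indexed here)
        # Group sequences by the conditioning event (r_{<i}, f_{>i}(r)).
        groups = defaultdict(list)
        for r, p in dist.items():
            groups[(r[:i], f_tail[r])].append((r, p))
        for members in groups.values():
            total = sum(p for _, p in members)
            # Pr[R_i = u | r_{<i}, f_{>i}] for each symbol u in this group.
            mass_i = defaultdict(float)
            for r, p in members:
                mass_i[r[i]] += p
            sp1 = {r: -math.log2(mass_i[r[i]] / total) for r, _ in members}
            # Pr[sp1_i(R) ~ sp1_i(r) | same conditioning], by rounded surprise.
            mass_sp = defaultdict(float)
            for r, p in members:
                mass_sp[round_to(sp1[r], delta / 2)] += p
            for r, _ in members:
                key = round_to(sp1[r], delta / 2)
                sp2 = -math.log2(mass_sp[key] / total)
                in_V = key - round_to(sp2, delta / 2) >= thresh
                f[r][i] = (key, in_V)
        for r in dist:               # the next round conditions on f_{>i-1}
            f_tail[r] = tuple(f[r][i:])
    return {r: tuple(fr) for r, fr in f.items()}
```

Sequences mapped to the same identifier form one cluster; conditioned on a cluster, the distribution of each block is nearly flat, which is exactly the property exploited below.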
Condition 2 follows from how \(V\) is computed in step 3. We now show that condition 3 holds. In particular, \(\forall y \in Im(f) \setminus \{\bot\}, \forall r \mbox{ s.t. } f(r) = y, \forall i \in V\), we have \[\begin{aligned} \Pr[r_i \mid r_{<i}, y] & = \Pr[r_i \mid r_{<i}, y_{\geq i}] = \frac{\Pr[r_i \land r_{<i} \land y_{\geq i}]}{\Pr[r_{<i} \land y_{\geq i}]} = \frac{\Pr[r_i \land r_{<i} \land y_{> i} ]}{\Pr[r_{<i} \land y_{>i} ]\Pr[y_i \mid r_{<i} \land y_{>i} ] } \end{aligned}\]
The first equality is due to \(y_{< i}\) being a deterministic function of \(r_{< i}\) and \(y_{\geq i}\). Similarly, the numerator of the final fraction is due to \(y_{i}\) being a deterministic function of \(r_{\leq i}\) and \(y_{>i}\). Moreover, \(y_{i,2}\) (i.e., \(\mathsf{true}\)) can be deterministically computed from \(y_{i,1}\) (i.e., \(\mathsf{r}(\mathsf{sp}^1_i(r))\)), \(r_{<i}\), and \(y_{>i}\). Therefore, the above is equal to \[\label{eq:sp1_sp2} \frac{\Pr[r_i \land r_{<i} \land y_{> i} ]}{\Pr[y_{i,1} \mid r_{<i} \land y_{>i}] \Pr[r_{<i} \land y_{>i} ]} = \frac{\Pr[r_i \mid r_{<i} \land y_{>i} ]}{\Pr[y_{i,1} \mid r_{<i} \land y_{>i} ]} =\frac{2^{-\mathsf{sp}^1_i(r)}}{2^{-\mathsf{sp}^2_i(r)}} \leq \frac{2^\delta}{n^{1.5}}.\]
The last inequality holds since \(i \in V\) implies \(\mathsf{r}(\mathsf{sp}^1_i(r)) - \mathsf{r}(\mathsf{sp}^2_i(r)) \ge 1.5\lg(n)\).
For \(r, y, i\) as quantified in the theorem statement, we have \[\begin{aligned} & \left | \left \{ r_i : \Pr[R_i = r_i \mid R_{<i} = r_{<i}, y] > 0 \right \} \right | \\&= \left | \left \{ r_i : \Pr[R_i = r_i \land R_{<i} = r_{<i} \land y ] > 0 \right \} \right |\\ &= \left | \left \{ r_i : \Pr[R_i = r_i \land R_{<i} = r_{<i} \land y_{i,1} \land y_{i,2} \land y_{>i} ] > 0 \right \} \right |\\ &\leq \left | \left \{ r_i : \Pr[R_i = r_i \mid R_{<i} = r_{<i}, y_{i,1}, y_{i,2}, y_{>i} ] > 0 \right \} \right | \end{aligned}\] By a similar argument as above, for all \(r_i\) s.t. \(\Pr[R_i = r_i \mid R_{<i} = r_{<i}, y_{i,1}, y_{i,2}, y_{>i} ] > 0,\) it holds that \[\Pr[R_i = r_i \mid R_{<i} = r_{<i}, y_{i,1}, y_{i,2}, y_{>i} ] = \frac{\Pr[r_i \mid r_{<i} \land y_{>i} ]}{\Pr[y_{i,1} \mid r_{<i} \land y_{>i} ]} = \frac{2^{-\mathsf{sp}^1_i(r)}}{2^{-\mathsf{sp}^2_i(r)}}\geq \frac{2^{-\delta}}{n^{1.5}},\] where the inequality holds since \(i \in W\) implies \(\mathsf{r}(\mathsf{sp}^1_i(r)) - \mathsf{r}(\mathsf{sp}^2_i(r)) < 1.5\lg(n)\). This means that \(\left | \left \{ r_i : \Pr[R_i = r_i \mid R_{<i} = r_{<i}, y_{\ge i} ] > 0 \right \} \right | \leq 2^\delta \cdot n^{1.5}.\)
To bound \(|Im(f)|\), we first bound the number of possible values of \(y_{i,1}\). Recall that \(\Pr_{R}[R_i = r_i \mid R_{<i} = r_{<i}] \geq \epsilon/\ell\) for all \(i \in [n]\) and \(r \in \mathcal{U}'\). Therefore, \(\Pr_{R}[R_i = r_i \mid R_{<i} = r_{<i}, y] \geq \epsilon/\ell\) for all \(i \in [n]\) and all \(r\) such that \(f(r) = y\).
Therefore, for all \(r\in \mathcal{U}'\) and \(i \in [n]\), we have \(\mathsf{sp}^1_i(r) \le \lg (\ell) + \lg (1/\epsilon)\), which implies that \(y_{i,1}\) has at most \(2(\lg(\ell) + \lg (1/\epsilon))/\delta\) different possibilities.
To upper bound the number of possibilities for the remaining parts, it suffices to bound the number of choices for a set \(W\) of size \(m\), as well as the number of possibilities for the \(H_i\)'s in each slot \(i \in W\). Clearly, the former is \(\binom{n}{m}\). For the latter, note that each iteration appears at most once over all \(m\) slots. Therefore, the problem becomes assigning \(k_b\) different iterations to \(m+1\) positions (with some positions possibly empty), where an iteration is assigned to the \((m+1)\)-th position if it never appears in any slot of \(W\). This is the well-known stars-and-bars problem with \(m+1\) variables summing to \(k_b\), which has \(\binom{m+k_b}{m}\) possibilities. Since there are \(k_b!\) orderings of the \(k_b\) iterations, the upper bound is \(\binom{m+k_b}{m} \cdot k_b!\). We have:
\[\begin{aligned} |Im(f)| &\leq \left (2 (\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^n \left ( \sum_{m=0}^n {n \choose m} {m + k_b \choose m} \cdot k_b! \right )\\ &= \left (2 (\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^n \left ( \sum_{m=0}^n {n \choose m}\frac{(m+k_b)!}{m!} \right )\\ &\leq n \cdot {n \choose n/2} \cdot \frac{(n+k_b)!}{n!} \cdot \left ( 2(\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^{n}\\ &\leq n \cdot (2e)^{n/2} \cdot \frac{(n+k_b)!}{n!} \cdot \left ( 2(\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^{n}. \end{aligned}\]
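As a quick sanity check on the stars-and-bars step in the derivation above, the following snippet (ours, for illustration) brute-forces the number of ways to split \(k_b\) indistinguishable iterations across \(m+1\) positions and compares it with \(\binom{m+k_b}{m}\).

```python
from itertools import product
from math import comb

def stars_and_bars_check(k_b, m):
    """Count solutions of x_0 + ... + x_m = k_b with x_i >= 0 by brute force;
    stars and bars says this equals C(m + k_b, m)."""
    brute = sum(1 for xs in product(range(k_b + 1), repeat=m + 1)
                if sum(xs) == k_b)
    assert brute == comb(m + k_b, m)
    return brute

# Example: stars_and_bars_check(4, 3) returns 35 == comb(7, 3).
```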
Finally, condition 6 follows from the definition of the clustering procedure. In particular, \(H_W = \bigcup_{i \in W} H_i\) contains all the iterations that \(r_W\) covers. The available set can be computed by \(K_{\theta,*} \setminus H_W\). This concludes our proof.
It can be seen that in the above proof, the only properties we used of the additional leakage \(H_i\) are that, for \(i \in W\), \(H_i\) depends only on \(R_i, y_{>i}\), and that the number of choices for the output of the sequence of leakages \([H_i]_{i \in W}\) is bounded by some \(B\). Theorem 58, stated in Section 5.8.5, is a restatement of Theorem 86 with respect to any such leakage function.
Note that the leakage functions \(\ell_i\) specified above can model leakage with respect to a random oracle \(h\), by letting \(\rho_i = h(R_i)\).
For brevity, we denote \({\cal D}= {\cal D}_{2,i}\) and \({\cal D}_\mathsf{leak}= {\cal D}_{1,i}\) in the experiment \(\mathsf{IsAGoodBundle}\). We show that \({\cal D}\) has the geometric collision property. In other words, we would like to show that when \(R\) is chosen uniformly at random from the universe \(\mathcal{U}\), the distribution of these \(n_R = n_B - n_I\) elements has the geometric collision property even with the leakage.
Towards this goal, by applying Theorem 86 to this distribution, we show that even with the leakage, there are at least \(n_R/3\) elements that preserve enough min-entropy. We then show how these elements with sufficient min-entropy yield the geometric collision property.
Remark 88 (Getting rid of tiny parts). Similar to [144], Remark 2, we can further require that each cluster have probability that is “not too small”. We define a new leakage function \(f'\) by substituting \(\epsilon/2\) for the \(\epsilon\) in the above theorem, and additionally letting \(f'(r) = \bot\) for all \(r\) with \(f(r) = y\) such that \(\Pr_R[f(R) =y] < \epsilon n/(2|Im(f)|)\) (the total probability of such \(r\) is at most \(\epsilon n/2\)). We obtain the following: \(f'\) satisfies all conditions in Theorem 86, and additionally, \(\forall y \in Im(f')\), we have \(\Pr_R[f'(R) =y] \geq \epsilon n / (2 |Im(f)|)\).
Using Theorem 86, setting \(\ell \geq 4n^3\) and assuming sufficient min-entropy of \(R\), we first show that more than a \(1/3\) fraction of the blocks have min-entropy at least \(1.5\lg(n)\), even upon leaking the outcome of \(f'\) and all previous blocks.
First notice that with all but \(\epsilon n\) probability, \(f'(R) \neq \bot\); it thus suffices to let \(\epsilon= 2^{-\kappa}\). Then, by setting \(\delta= 1\), we have \[\begin{aligned} \lg(|Im(f)|) &\leq \lg\left( n \cdot (2e)^{n/2} \cdot (n+{k_b})^{k_b} \cdot \left ( 2(\lg(\ell) + \lg (1/\epsilon)) /\delta \right )^{n} \right)\\ &= \left(\lg(n) + \frac{n}{2} \lg(2e) \right)+ {k_b} \cdot \lg (n+{k_b}) + n \cdot (1+\lg(\lg(\ell) + \kappa))\\ & < 3n/2 + 2 k_b \lg n + n (2 + \lg \kappa) < 0.5 n \lg n \end{aligned}\] for sufficiently large \(n\) with \(k_b = \Omega(\kappa)\) and \(n/k_b^2 = \Omega(\kappa)\).
Combining the above with Remark 88, for every \(y\in Im(f')\setminus\{\perp\}\) we have \(\Pr_{{\cal D}_\mathsf{leak}}[f'(R) =y] \geq \epsilon n / (2 \cdot 2^{0.5 n \lg (n)})\). Moreover, for every \(r\) such that \(f'(r) = y\), we have \[\begin{aligned} \Pr_{{\cal D}}[r] = \Pr_{{\cal D}_\mathsf{leak}}[r \mid y] = \frac{\Pr_{{\cal D}_\mathsf{leak}}[r \wedge y]}{\Pr_{{\cal D}_\mathsf{leak}}[y]} \le \frac{2^{-(\frac{8n}{9} \lg \ell+n)} }{(\epsilon n/2) \cdot 2^{-0.5n \lg n}} = 2^{-(\frac{8}{9} \lg \ell - 0.5 \lg n + 1) \cdot n} \cdot (2/\epsilon n), \label{eq:min-ent-leak_2} \end{aligned}\] which implies that \({\cal D}\) has min-entropy at least \((\frac{8}{9} \lg \ell - 0.5 \lg n + 1) \cdot n - \lg (2/\epsilon n)\). We show that the following holds: the min-entropy of at least \(n' = n/3\) blocks, conditioned on the outcome of all prior blocks as well as \(y\), is at least \(\lg(n^{1.5})\).
Towards a contradiction, assume otherwise. Let \(V\) be the set of blocks with min-entropy at least \(\lg(n^{1.5})\) and let \(W\) be the set of blocks with min-entropy less than \(\lg(n^{1.5})\) (as defined in Theorem 86). We will show that if \(|V| < n/3\), there exists a point \(r\) in the support of \({\cal D}\) such that \(\Pr_{{\cal D}}[r] > 2^{-(\frac{8}{9} \lg \ell - 0.5 \lg n + 1) \cdot n} \cdot (2/\epsilon n)\), contradicting the min-entropy of \({\cal D}\).
First, find any value \(r^*_V\) such that \(\Pr_{{\cal D}}[R_V = r^*_V] \geq \frac{1}{\ell^{|V|}}\). Note that \(r^*_V\) must exist since the support size of \(R_V\) is at most \(\ell^{|V|}\). Let \({\mathsf Supp}_W(r^*_V) = \{r: r_V = r^*_V \wedge \Pr_{{\cal D}}[R = r] > 0 \}\). Then, we have \(\Pr_{{\cal D}}[R \in {\mathsf Supp}_W(r^*_V)] = \Pr[R_V = r^*_V] \geq \frac{1}{\ell^{|V|}}\).
Second, we show that \(|{\mathsf Supp}_W(r^*_V)| \le (2 \cdot n^{1.5})^{|W|}\). Consider any \(r \in {\mathsf Supp}_W(r^*_V)\). Applying the fourth condition of Theorem 86 with \(\delta= 1\), conditioned on any \(y\in Im(f')\setminus\{\perp\}\), for any \(i \in W\) and any fixing of \(R_{< i} = r_{< i}\), the number of elements in the support of \(R_i \mid r_{< i}\) is at most \(2 \cdot n^{1.5}\). This implies that \(|{\mathsf Supp}_W(r^*_V)|\) is at most \((2 \cdot n^{1.5})^{|W|}\), since the positions in \(V\) are fixed to \(r^*_V\).
Based on the above two arguments, by an averaging argument, there must be some \(r^* \in {\mathsf Supp}_W(r^*_V)\) for which \(\Pr_{{\cal D}}[R = r^*] \geq \frac{1}{\ell^{|V|}} \cdot \frac{1}{(2 \cdot n^{1.5})^{|W|}}.\) Therefore, we have \[\begin{aligned} -\lg \Pr_{{\cal D}}[r^*] & \le |V| \lg (\ell) + |W| \lg (2 n^{1.5}) = |V| \lg \ell + |W| + 1.5 (n-|V|) \lg n \\ & \leq n + |V| \lg (\ell/n) + 1.5 n \lg n \leq n + (n/3) \lg (\ell/n) + 1.5 n \lg n \\ & = n + (n/3) \lg (\ell) - (n/3) \lg n + 1.5 n \lg n, \end{aligned}\] where the second-to-last inequality uses the assumption \(|V| < n/3\).
To reach a contradiction with ([eq:min-ent-leak_2]), we require that \[n + (n/3) \lg (\ell) - (n/3) \lg n + 1.5 n \lg n \leq \left(\frac{8}{9} \lg \ell - 0.5 \lg n + 1 \right) \cdot n - \lg (2/\epsilon n).\] The above is implied by \(\frac{5}{3} n \lg n \leq \frac{5}{9} n \lg \ell - \lg (2/\epsilon n)\).
When \(\ell \geq 4n^3\), the above is implied by \(\frac{5}{3} n \lg n \leq \frac{5}{3} n \lg n + \frac{10}{9} n - \lg (2/\epsilon n)\), which is true for \(n \geq \lg(1/\epsilon) = \kappa\). Thus, we reach a contradiction with ([eq:min-ent-leak_2]), and we conclude that \(|V| \geq n/3\).
Note that we can equivalently view \(R'\) in the support of \({\cal D}\) as a set of size \(n'\), or as a stream of elements of length \(n'\), where the element in the \(i\)-th block (for \(i \in [n']\)) comes from universe \(U_{i}\), and \(\{U_{1}, \ldots, U_{n'}\}\) are mutually disjoint. Taking the second view, given \(R',S'\) in the support of \({\cal D}\), we have that \(|R' \cap S'| = z\) if and only if there exists some set \(Z \subseteq [n']\) of size \(z\) such that (1) the ordered set of elements in the blocks of \(R'\) indexed by \(Z\) (denoted \(R'_{Z}\)) is equal to the ordered set of elements in the blocks of \(S'\) indexed by \(Z\) (denoted \(S'_{Z}\)) and (2) the set of elements in the blocks of \(R'\) indexed by \([n'] \setminus Z\) (denoted \(R'_{ \overline{Z}}\)) and the set of elements in the blocks of \(S'\) indexed by \([n'] \setminus Z\) (denoted \(S'_{ \overline{Z}}\)) are disjoint.
We are now ready to analyze the probability that \(|R' \cap S'| = z\) for \(R', S'\) drawn from \({\cal D}\), and for \(z \in [n']\): \[\begin{aligned} \Pr_{R', S' \leftarrow {\cal D}} [|R' \cap S'| = z] &= \sum_{Z \subseteq [n'], |Z| = z} \Pr_{R', S' \leftarrow {\cal D}} \left [ \left ( R'_{Z} = S'_{Z} \right ) \wedge \left ( R'_{\overline{Z}} \cap S'_{\overline{Z}} = \emptyset \right ) \right]\\ &\leq \sum_{Z \subseteq [n'], |Z| = z} \Pr_{R', S' \leftarrow {\cal D}} [R'_{Z} = S'_{Z}] ~~\leq \sum_{Z \subseteq [n'], |Z| = z} \left ( \frac{1}{n^{1.5}} \right )^z \end{aligned}\] The second inequality holds since each element in the stream has min-entropy at least \(\lg (n^{1.5})\). Therefore, we have \(\Pr_{R', S' \leftarrow {\cal D}} [|R' \cap S'| = z] \le {{n/3} \choose z} \cdot \left ( \frac{1}{n^{1.5}} \right )^z \leq \left ( \frac{1}{n^{0.5}} \right )^z.\)
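The last step above only uses \(\binom{n/3}{z} \le n^z\); the following small check (ours, not part of the proof) confirms the resulting bound numerically for a concrete \(n\).

```python
from math import comb

def collision_bound_check(n):
    """Check that C(n/3, z) * n^(-1.5 z) <= n^(-0.5 z) for every z,
    the final inequality of the geometric collision property."""
    n_prime = n // 3
    for z in range(1, n_prime + 1):
        assert comb(n_prime, z) * n ** (-1.5 * z) <= n ** (-0.5 * z)

collision_bound_check(30)  # holds for every z in [1, 10]
```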
We now complete the proofs of the lemmas and theorems in Section 6.4.3.
We start with the following standard way to approximate numbers near 1 with exponentials.
Lemma 89. For any real constant \(\alpha \in (0,1)\) and any real \(x\) with \(0<x\le\alpha\), we have \[\exp\left(-\tfrac{1}{\alpha}\ln\tfrac{1}{1-\alpha}\cdot{}x\right) \le 1-x < \exp(-x).\]
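The sandwich is easy to verify numerically; below is a small check of our own, with a tiny tolerance for floating-point rounding at the endpoint \(x = \alpha\), where the left inequality is tight.

```python
import math

def lemma89_check(alpha, steps=1000):
    """Check exp(-(1/alpha) ln(1/(1-alpha)) x) <= 1 - x < exp(-x) on (0, alpha]."""
    c = math.log(1 / (1 - alpha)) / alpha
    for k in range(1, steps + 1):
        x = alpha * k / steps
        assert math.exp(-c * x) <= 1 - x + 1e-12  # tight at x = alpha
        assert 1 - x < math.exp(-x)
    return True

lemma89_check(0.5)
```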
We also re-state this straightforward consequence of the Hoeffding/Chernoff bound on the sum of random variables:
Lemma 90. Let \(X_1,\ldots,X_n\) be independent Poisson trials, and write \(Y=\sum_i X_i\) for their sum. If \(\mathbb{E}[Y] = \mu\), then for any \(\delta>0\), each of \(\Pr(Y \ge \mu + \delta)\) and \(\Pr(Y \le \mu - \delta)\) is at most \(\exp(-2\delta^2/n)\).
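As a sanity check, a small Monte Carlo experiment (ours; the parameters are arbitrary) shows the empirical upper tail falling well under the stated bound.

```python
import math
import random

def hoeffding_check(n=200, p=0.3, delta=10.0, trials=20000, seed=1):
    """Empirical upper-tail frequency of a Binomial(n, p) sum versus the
    exp(-2 delta^2 / n) bound of Lemma 90."""
    random.seed(seed)
    mu = n * p
    hits = sum(
        sum(random.random() < p for _ in range(n)) >= mu + delta
        for _ in range(trials)
    )
    return hits / trials, math.exp(-2 * delta ** 2 / n)

# e.g. hoeffding_check() returns roughly (0.07, 0.37): tail <= bound.
```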
We now recall and prove the building-block lemmas from Section 6.4.3.
Proof. From [eqn:pw], we know this probability is exactly \(p_w = 1 - \frac{{(s-u)}^{\underline{w}}}{{s}^{\underline{w}}},\) where \(w = v-\tau\) is the number of unfilled slots remaining. Using Lemma 89, we have \(\frac{{(s-u)}^{\underline{w}}}{{s}^{\underline{w}}} \le \left(1-\frac{u}{s}\right)^w \le \exp(-uw/s)\), which means that \[p_w \ge 1 - \exp\left(-\tfrac{u(v-\tau)}{s}\right) = 1 - \exp\left(-\tfrac{u\tau}{s}\cdot{}\left(\tfrac{v}{\tau}-1\right)\right).\] Applying the two lower bounds on \(\frac{u\tau}{s}\) and \(\tfrac{v}{\tau}\) from the lemma statement yields the claimed result. ◻
Proof. Using [eqn:pw] again, the probability is exactly \(p_w = 1 - \frac{{(s-u)}^{\underline{w}}}{{s}^{\underline{w}}},\) where now \(w\le v\) is the number of unfilled slots for item \(x\). Then \[\tfrac{{(s-u)}^{\underline{w}}}{{s}^{\underline{w}}} \ge \tfrac{{(s-u)}^{\underline{v}}}{{s}^{\underline{v}}} \ge \left(\tfrac{s-u-v+1}{s-v+1}\right)^v > \left(1 - \tfrac{u}{s-v}\right)^v.\] Using the upper bounds on \(\frac{v}{s}\) and \(u\) from the lemma statement, we have \[p_w < 1 - \left(1 - \tfrac{u}{s-v}\right)^v \le 1 - \left(1 - \tfrac{3.66567}{v}\right)^v.\] Finally, the lower bound on \(v\) from the lemma statement shows \(3.66567/v \le 0.00989\), and so we can use the lower exponential bound of Lemma 89 to obtain the stated result. ◻
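Both proofs manipulate the same exact expression from [eqn:pw]; as a small reference sketch (our naming, under the reading that \(u\) uniformly placed distinct marks must hit at least one of \(w\) designated slots out of \(s\)), it reads:

```python
from math import prod

def falling(a, w):
    """Falling factorial: a (a - 1) ... (a - w + 1)."""
    return prod(a - j for j in range(w))

def p_w(u, s, w):
    """Exact probability from [eqn:pw] that at least one of u uniformly
    placed, distinct marks lands among w designated slots out of s:
    p_w = 1 - (s - u)^(falling w) / s^(falling w)."""
    return 1 - falling(s - u, w) / falling(s, w)

# The two proofs above pin p_w between exponentials, e.g. for valid
# parameters: 1 - math.exp(-u * w / s) <= p_w(u, s, w).
```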
Proof. The tipping point \(\tau\) is the expected number of slots filled in the table if \(t\) of the \(m\) total calls to \(\mathsf{Increment}\) were actually made on this particular item.
We can divide the calls to \(\mathsf{Increment}\) into two groups: the \(t\) calls for item \(x\), and the \(m-t\) calls for other items. The expected number of slots within \(x\)'s item set filled by the first group is at most \(0.974876t\), from [lem:upperp].
For the second group, the calls to \(\mathsf{Increment}\) on unrelated items are distributed uniformly at random among all table indices, so their expected fraction within this item set is the same as their overall fraction in the table. Therefore, the expected number of slots filled by calls to \(\mathsf{Increment}\) on other items is at most \[\frac{(m-t)v}{s} < \frac{mv}{s} \le \frac{7.409}{96} t.\]
By linearity of expectation, we can sum these two to obtain an upper bound on the total expected tipping point as given in the lemma statement. ◻
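Numerically, summing these two contributions essentially recovers the tipping-point constant used in the false-negative proof below; the following one-liner (our arithmetic, assuming the stated per-call bound and load ratio) makes that explicit.

```python
# Upper bound on the tipping point tau_t, as a multiple of t:
# per-call on-item fill rate plus the spill-over from unrelated calls.
print(0.974876 + 7.409 / 96)  # ~ 1.052053, close to the 1.0520553 t used below
```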
Now we can proceed to the proofs of the main theorems on the accuracy of the CCBF.
Proof. Let \(\tau_t\) be the tipping point for the actual number \(m\le n\) of total set bits in the table \(T\) and for the given threshold \(t\). Consider random variables \(X_1,\ldots,X_v\) for the \(v\) slots assigned to item \(x\), where each \(X_i\) is 0 or 1 depending on whether the corresponding slot in table \(T\) is 0 or 1. We want to know the probability that the sum of the \(X_i\)'s is at least \(\tau_t\), which is what would cause \(\mathsf{TestCount}(x,t)\) to produce a false positive.
Let \(k = t - 2.1\sqrt{\lambda t}\) be the actual number of calls to \(\mathsf{Increment}\) on item \(x\), and write \(\tau_k\) for the tipping point at threshold \(k\). By definition and the exact calculations for \(\tau_k\) outlined earlier, we know that \(\mathbb{E}[\sum X_i] = \tau_k\).
The difference between these two tipping points, \(\tau_t - \tau_k\), is the expected number of extra slots filled by \(t-k\) calls to \(\mathsf{Increment}\), which from [lem:lowerp] is at least \[0.956414(t-k) = 0.956414 \cdot 2.1 \sqrt{\lambda t} \ge \sqrt{\tfrac{\lambda v}{2}},\] where in the last step we used the upper bound on \(v\) from the assumptions of the theorem.
The variables \(X_i\) are not independent, but they are negatively correlated: whenever one slot is filled, it only decreases the likelihood that another is filled, intuitively because there are then fewer chances to fill the other slot. Therefore, we can apply the Hoeffding bound in this direction (Lemma 90) to conclude that \[\Pr\left(\sum X_i \ge \tau_k + \sqrt{\tfrac{\lambda v}{2}}\right) \le \exp(-\lambda),\] as required. ◻
Proof. Writing \(k\) for the actual number of complaints given in [eqn:fn], we need a tail bound on the probability that, after \(k\) calls to \(\mathsf{Increment}\) on the same item \(x\), there are still fewer than \(1.0520553t\) slots of \(x\)'s item set filled in, where the latter constant comes from applying the upper bound on the tipping point from [lem:uppertau].
For this, we need a lower bound on the expected number of bits set after \(k\) calls to \(\mathsf{Increment}\) on item \(x\); from [lem:lowerp], this is at least \(0.956414k\).
Now we can apply the Hoeffding bound (Lemma 90), with \(\mu = 0.956414k\) and \(\mu+\delta=1.0520553t\ge\tau\), to see that the probability that fewer than \(\tau\) bits of \(x\)'s item set are flipped is at most \[\begin{aligned} &\exp\left(-2(1.0520553t-0.956414k)^2/k\right) \\ \le& \exp\left(-2\left(0.38\lambda + 0.66\sqrt{\lambda t}\right)^2/k\right)\\ \le& \exp\left(-\frac{0.28\lambda^2 + \lambda\sqrt{\lambda t} + 0.87\lambda t}{0.4\lambda + 0.7\sqrt{\lambda t} + 1.1t}\right) \\ \le& \exp(-0.7\lambda) \le 2^{-\lambda}. \end{aligned}\] ◻
The following timeline outlines the completion of the remaining research items and the preparation of the final dissertation. As noted, the works on sublinear secure protocols are largely complete. The schedule focuses on the execution of the follow-up work on FACTS.
| Period | Key Objectives & Milestones |
|---|---|
| Completed – Present | • Completed research and publication for secure search/sampling (sublinear-communication secure MPC). • Completed research and publication for privacy-preserving set similarity via Min-Hash. • Completed design, analysis, and evaluation of FACTS (accountability tracking on end-to-end encrypted messaging systems). • Follow-up work to enable spreader tracking via new protocol designs. |
| Month 1 – Month 2 | Line 3 implementation: • Analyze the system load our new design can handle and identify bottlenecks, if any. • Estimate the actual cost of the newly designed FACTS for spreaders with two non-colluding servers. |
| Month 1 – Month 3 | Line 3 security and privacy analysis: • Prove security in the malicious two-party setting. • Compare performance against the original FACTS baseline. • Milestone: submit the paper to [TBD]. |
| Month 3 – Month 4 | Thesis writing (Part I): • Update the introduction and literature review based on feedback and further research. • Integrate all finished lines of work and write the full chapters of thesis research results. |
| Month 5 | Thesis writing (Part II) and defense: • Finish and integrate the results of the FACTS follow-up work on spreader tracking into the final dissertation. • Final review with advisor. • Milestone: thesis defense. |

Table: Proposed timeline for the completion of the Ph.D. thesis.
As detailed in previous sections, the research objectives for our first line of work (Secure Search, Sampling, and Similarity) and the second line of work (the threshold traceback for originator tracking via FACTS) have been successfully met. The results have been published and evaluated.
Therefore, this section presents a preliminary approach solely for our proposed follow-up work on FACTS for messaging accountability, targeting the spreader rather than the originator, using efficient secure two-party computation with malicious security. The next steps are experiments and cryptographic analysis. The goal of the experiments is to demonstrate that, with state-of-the-art two-party computation building blocks such as oblivious sorting, our design is practical and efficient at the scale of users and messages on today's end-to-end encrypted messaging systems. The goal of the cryptographic analysis is to deliver rigorous security and privacy guarantees while keeping the protocol practical and affordable.
Unlike the hash-based structures (CCBF) used in the original FACTS system, our proposed follow-up design is run by two non-colluding servers with malicious security. A primary concern is whether these secure two-party computations introduce prohibitive data transfer or latency, which in turn would limit the design's ability to handle large numbers of users and messages.
We benchmarked all major building blocks on two standard CloudLab servers with typical network bandwidth and latency within the United States. Table 1 shows the major communication-intensive two-party operations and their inter-server communication costs.
This follow-up work focuses on the actual spreader of reported messages instead of the originator. A fundamental change is that we rely on two non-colluding servers to make the design practical. Strong security guarantees often come at the cost of increased communication and computation overhead. So far, the main concern is that the bandwidth and latency between the two servers would limit how many complaints we can handle in each epoch.
Since our design is sequential across components and fully parallel within each component, to assess feasibility we micro-benchmarked each building block using state-of-the-art protocols at 1 million reports per epoch.
| Protocol | Total communication | Communication rounds |
|---|---|---|
| Poseidon Hash | 21.4GB | 400 |
| Schnorr Signature Verification | 1.1TB | 2250 |
| Oblivious Sorting | 1.01TB | 2610 |
| Oblivious Deduplication | 82.3GB | 60 |
Table: Total communication and rounds for each component in the follow-up FACTS for spreaders, with one million complaints per epoch.
These benchmarks indicate that while the proposed tools for this follow-up work are heavier in communication than the original FACTS primitives, they remain efficient at the scale of popular E2EE messaging systems such as Signal, supporting the feasibility of our design. We are also evaluating end-to-end server runtime to measure the actual cost of our design.
This proposal outlines a comprehensive research agenda addressing the tension between data privacy and utility, culminating in a rigorous examination of accountability in end-to-end encrypted systems.
Our first line of work has established efficient, privacy-preserving primitives for Secure Search, Secure Sampling, and Privacy-Preserving Set Similarity via Min-Hash, demonstrating that sublinear communication is achievable without compromising security and privacy guarantees. Our second line of work, FACTS and its ongoing follow-up, focuses on a more specific application: it enables threshold tracebacks of the originator or spreader of reported messages on E2EE messaging systems, proving that accountability mechanisms can coexist with encryption while the cost remains sublinear, depending only on the number of complaints rather than the total number of messages.
Together, these three lines of work will contribute a unified framework for scalable, privacy-preserving protocols and applications, providing both theoretical advancements and practical tools.