Expanding the attack surface: Robust profiling attacks threaten the privacy of sparse behavioral data

Behavioral data, collected from our daily interactions with technology, have driven scientific advances. Yet, the collection and sharing of this data raise legitimate privacy concerns, as individuals can often be reidentified. Current identification attacks, however, require auxiliary information to roughly match the information available in the dataset, limiting their applicability. We here propose an entropy-based profiling model to learn time-persistent profiles. Using auxiliary information about a single target collected over a nonoverlapping time period, we show that individuals are correctly identified 79% of the time in a large location dataset of 0.5 million individuals and 65.2% for a grocery shopping dataset of 85,000 individuals. We further show that accuracy only slowly decreases over time and that the model is robust to state-of-the-art noise addition. Our results show that much more auxiliary information than previously believed can be used to identify individuals, challenging deidentification practices and what currently constitutes legally anonymous data.


INTRODUCTION
Over 22 billion connected devices, from smartphones and wearables to Internet of Things devices, passively collect fine-grained behavioral data about our lives (1). The location of a mobile phone is, for instance, collected up to 14,000 times a day (2), while a car generates up to 25 gigabytes of data every hour (3). These data are widely used. Location data, for example, are used by banks to detect fraudulent behavior (4) and predict the likelihood of loan repayment (5). They are also used by governments to monitor employment (6), quickly respond to natural disasters (7), and recently to respond to the coronavirus disease 2019 (COVID-19) pandemic (8). Last, researchers have used location data to better understand the spread of infectious diseases (9-11) or segregation in cities (12). While extremely useful, behavioral data are also extremely personal and sensitive (13), as shown by the Cambridge Analytica affair (14) and Edward Snowden's revelations (15). In recent surveys, over 80% of Americans (16) and 80% of Britons (17) have expressed concerns over how their data are used and shared.
Finding a balance between using behavioral data for good and protecting people's privacy often relies on anonymizing the data. Once anonymized, behavioral data fall outside the scope of data protection laws and can be freely used and shared. In the European Union's General Data Protection Regulation (18) (GDPR, recital 26), data are considered anonymized when "rendered anonymous in such a manner that the data subject is not or no longer identifiable. [...] To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments." Similar definitions are found in privacy laws around the world, e.g., in the California Consumer Privacy Act section 1798.140(h) (19) and in new bills currently under examination across the United States [e.g., Washington Senate Bill 5062 section 101 (20), Massachusetts Bill HD.3847 (21), and Virginia House Bill 2307 section 59.1-571 (22)].
Matching attacks, however, require the attacker to have access to auxiliary information about the target that is also available in the dataset. For example, Gov. Weld was identified by his date of birth, gender, and zip code (26). These pieces of information were both public and available in the anonymized medical dataset. Assuming that the same information is both available in the dataset and as auxiliary information is, in general, a reasonable requirement for traditional tabular data. For behavioral data, however, this means that auxiliary information about the target and data points in the dataset have to be collected not only over the same period of time but also roughly at the same times. This is a strong requirement that can substantially limit the availability of matching auxiliary information, in particular when the data are sparse (45, 48). This has led some to question the practical risks posed by data matching for behavioral data and, ultimately, whether data protection laws should apply to pseudonymized behavioral datasets (48-50).
Here, we present a profiling attack against sparse behavioral data capable of leveraging fully nonmatching auxiliary information, enabling the attacker to use a wide range of auxiliary information, including publicly available information. Once trained, our entropy-based model correctly identifies 79% of individuals in a location dataset of 0.5 million people and 93% within a set of 10 candidates. Similarly, on a grocery shopping dataset of 85,000 individuals (51), our model correctly identifies 65% of individuals and 74% within a set of 10 candidates. Using a meta-classifier, our model reaches an area under the receiver operating characteristic curve (AUROC) of 0.91 and is well calibrated. Our results hold even when (i) the time gap between the dataset and the auxiliary information increases, (ii) state-of-the-art noise is added to the dataset, and (iii) the dataset is large. Together, our results relax a strong requirement of current matching attacks and show that much more auxiliary information than previously thought might be available to reidentify individuals in behavioral datasets. This has broad implications for what constitutes anonymous data in today's world, challenges current deidentification practices, and emphasizes the need to develop and deploy modern privacy engineering solutions.

RESULTS
We consider a population I_data of N individuals interacting with a service over a time period T_data = [t_data, t′_data). The service collects a behavioral dataset D_data = {y_i | i ∈ I_data} where, for each individual i ∈ I_data, y_i = ((t_i,1, x_i,1), …, (t_i,n_i, x_i,n_i)) is a trace of data points. Each x belongs to a space X (e.g., a physical location), and points are time-ordered (t_i,1 ≤ ⋯ ≤ t_i,n_i, with t_data ≤ t_i,1 and t_i,n_i < t′_data). An attacker holds auxiliary information a = ((t_a,1, x_a,1), …, (t_a,n_a, x_a,n_a)) about a target individual j, which they hope to use to identify j in D_data (Fig. 1A). The auxiliary information is recorded over a time interval T_aux = [t_aux, t′_aux) (i.e., t_aux ≤ t_a,1 ≤ ⋯ ≤ t_a,n_a < t′_aux) disjoint from T_data (i.e., t′_data ≤ t_aux or t′_aux ≤ t_data). We here assume that j ∈ I_data and consider the general case in Discussion.

Profiling model for sparse data
We consider a space of profiles S and a map Φ from raw traces to profiles. Profiles aim to capture information about an individual that is both specific to that individual and stable over time. Formally, with few assumptions about the behavior of individuals, Φ performs nonparametric density estimation of q random variables extracted from each trace (e.g., location or time of the day). The space of profiles is S = ∏_{k=1,…,q} S_k, where each S_k is a probability simplex whose dimension a_k corresponds to the kth variable (see Materials and Methods).
We propose an asymmetric dissimilarity function d on S to compare the profiles of individuals in the dataset with the auxiliary information available to the attacker

d_{α,λ}(X, Y) = Σ_{k=1}^{q} α_k d_{λ_k}(X_k, Y_k)   (1)

with

d_{λ_k}(X_k, Y_k) = H(λ_k X_k + (1 − λ_k) Y_k) − λ_k H(X_k) − (1 − λ_k) H(Y_k)   (2)

The model parameters α ∈ ℝ₊^q and λ ∈ (0,1)^q are shared across individuals. H is the information entropy function, and the terms λ_k X_k + (1 − λ_k) Y_k are convex combinations (see Materials and Methods). These convex combinations yield the nonnegative gaps of concavity of H in Eq. 2. The gaps are then combined linearly in Eq. 1, with α controlling their respective weights. Gaps capture the amount of statistical uncertainty that would be introduced by mixing profiles X and Y. Mixing profiles from a single individual is expected to introduce less uncertainty than mixing profiles from distinct individuals, leading to smaller values of d (see the Supplementary Materials).
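As an illustration, the dissimilarity can be sketched as follows, under the reconstruction above (per-variable entropy gaps of concavity, combined with weights α). The profiles, weights, and mixing parameters below are toy values, not the trained ones.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution (zero bins ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gap(x, y, lam):
    """Nonnegative gap of concavity of H for the mixture lam*x + (1-lam)*y (Eq. 2)."""
    return entropy(lam * x + (1 - lam) * y) - lam * entropy(x) - (1 - lam) * entropy(y)

def dissimilarity(X, Y, alpha, lam):
    """Eq. 1: alpha-weighted sum of per-variable entropy gaps between profiles X and Y."""
    return sum(a * gap(x, y, l) for x, y, a, l in zip(X, Y, alpha, lam))

# Two toy profiles, each with q = 2 variables (normalized histograms).
X = [np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.5])]
Y = [np.array([0.1, 0.2, 0.7]), np.array([0.5, 0.5])]
alpha, lam = [1.0, 1.0], [0.5, 0.5]

assert abs(dissimilarity(X, X, alpha, lam)) < 1e-12  # identical profiles: zero gap
assert dissimilarity(X, Y, alpha, lam) > 0           # distinct profiles: positive gap
```

Because H is concave, each gap is nonnegative and vanishes exactly when the two histograms coincide, which is what makes d usable as a dissimilarity.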
Using the divergence d, traces in the dataset D_data are ranked according to their similarity with the auxiliary information a. In particular, our model finds the most similar trace σ(a) = arg min_{y ∈ D_data} d(Φ(y), Φ(a)). Once σ(a) is found, a meta-classifier using the "second-over-first" score (41, 45), the ratio of the dissimilarity of the second most similar trace σ_2(a) to that of σ(a), estimates the likelihood p̂ of σ(a) to be correct (see Materials and Methods).

We train our model using a contrastive loss function, a well-known approach in image representation learning (52-54). Training traces are split over two disjoint subintervals of T_data into Y_1 and Y_2 (see Materials and Methods). Traces in Y_1 can be viewed as anchors, compared to positive and negative examples in Y_2. Formally, let 𝒫 = {(Φ(y_1), Φ(y_2)) | y_1 ∈ Y_1, y_2 ∈ Y_2, y_1 ≡ y_2} be the set of matched training profile pairs computed from Y_1 and Y_2, where y_1 ≡ y_2 indicates two traces originating from the same individual. The loss combines two terms, L_+ and L_−. The terms of L_+ are nonzero for couples (X_1, X_2) ∈ 𝒫 where the model correctly finds X_2 to be the most similar profile for X_1 among all profiles from Y_2 (positive examples). Reciprocally, the terms of L_− are nonzero on couples where X_1 is incorrectly identified (negative examples). Training minimizes L_− while maximizing L_+ according to a balancing metaparameter γ > 0 (see Materials and Methods and Fig. 1, B and C).
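The ranking and scoring step can be sketched as follows. For simplicity, this toy version uses a fixed entropy-gap dissimilarity (λ = 0.5, i.e., Jensen-Shannon style) rather than the trained divergence, and random Dirichlet histograms as stand-ins for real candidate profiles; training is omitted.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (zero bins ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gap(x, y, lam=0.5):
    # Entropy gap of concavity, used here as a fixed stand-in dissimilarity.
    return entropy(lam * x + (1 - lam) * y) - lam * entropy(x) - (1 - lam) * entropy(y)

def identify(aux_profile, candidates):
    """Rank candidate profiles by dissimilarity to the auxiliary profile and
    return (index of best match, second-over-first score d2/d1)."""
    d = np.array([gap(aux_profile, c) for c in candidates])
    order = np.argsort(d)
    return int(order[0]), float(d[order[1]] / d[order[0]])

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(20), size=100)                 # 100 candidate profiles
aux = 0.95 * candidates[42] + 0.05 * rng.dirichlet(np.ones(20))   # noisy view of no. 42

best, score = identify(aux, candidates)
assert best == 42    # the true individual ranks first
assert score > 1.0   # the margin between first and second candidate
```

A large second-over-first score indicates a clear-cut match, which is what the meta-classifier turns into a calibrated likelihood.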

Empirical evaluation
We use a large-scale location dataset collected over 24 consecutive weeks for 0.5 million people through Call Detail Records (CDRs). For every interaction (call or text), CDRs typically contain the pseudonyms of the sender, the recipient, an hourly timestamp, the type of interaction (call or text), the duration of the call, as well as the approximate location of both the sender and the recipient. More specifically, for each party, the approximate location refers to the antenna the party was connected to when the interaction occurred. To keep our model general, we here only use the location and hourly timestamp information. On average, traces contain 50.70 data points per week. The dataset D_data and the auxiliary information a are fully disjoint and recorded respectively over the weeks T_data = [1,11) and T_aux = [11,16). The remaining weeks (i.e., [16,24]) are used in the next section to study the impact of the time gap g = t_aux − t′_data between T_data and T_aux on the model accuracy.
We also validate the generality of our model by applying it to a grocery shopping dataset provided by Instacart (51). The dataset contains the ordered list of shopping baskets purchased online by each customer during a year. Each data point is a transaction corresponding to the basket of items purchased by a customer at once. The dataset contains approximately 100,000 individuals with at least 10 recorded transactions throughout the year. Recorded transactions include the purchased quantities of each product, the aisle where each product is stored within the shop (such as vegetables and meat), as well as the day of the week and hour of the day when the transaction happened. As this shopping dataset does not contain location information, we apply our model to profile individuals according to what they typically buy (see the Supplementary Materials). Similarly to the setup we use for the location dataset, the auxiliary information used by the model here is also fully nonoverlapping (see the Supplementary Materials).
Figure 2A shows that our model has a π_1 = 79% chance of correctly finding a target out of N = 0.5 million individuals in the location dataset. The attacker would furthermore have a π_10 = 93% chance of correctly finding the target in a set of 10 candidates and a π_50 = 97.5% chance in a set of 50 candidates, both out of 0.5 million individuals.
Figure 2B shows that our model has a π_1 = 65% chance of correctly finding a target out of N = 85,000 individuals in the grocery shopping dataset. The attacker would furthermore have a π_10 = 74% chance of correctly finding the target in a set of 10 candidates and a π_50 = 80% chance in a set of 50 candidates, both out of 85,000 individuals. While a complete analysis of why individuals might be less identifiable in shopping data than location data is beyond the scope of this work, we offer some hypotheses in Discussion.
Figure 2 (C and D) shows that the meta-classifier accurately predicts whether the right individual has been found by our model. It achieves a high AUC [area under the receiver operating characteristic (ROC) curve] of 0.91 for the location dataset (0.94 for the grocery shopping dataset), and the estimated likelihood p̂ of the right individual being found is well calibrated. This ensures that an individual found by our model and given a high probability by the meta-classifier is likely to be the right person. For instance, for p̂ > 0.95 (respectively 0.9 and 0.99), the empirical likelihood for σ(a) to be incorrect (false discovery rate) is 4.85% for the location dataset (respectively 10.4% and 1.0%; see inset in Fig. 2C).
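The false discovery rate behind the insets can be sketched as follows, with made-up meta-classifier scores (not the paper's outputs): the FDR at a threshold is simply the fraction of confident matches that are in fact wrong.

```python
import numpy as np

def fdr_at_threshold(p_hat, correct, tau):
    """False discovery rate among predictions scored above tau by the
    meta-classifier: the fraction of confident matches that are wrong."""
    confident = p_hat > tau
    if not confident.any():
        return 0.0
    return float((~correct[confident]).mean())

# Toy meta-classifier output: estimated likelihoods and ground-truth correctness.
p_hat   = np.array([0.99, 0.97, 0.96, 0.80, 0.60, 0.98])
correct = np.array([True, True, False, False, True, True])

assert fdr_at_threshold(p_hat, correct, 0.95) == 0.25  # 1 wrong out of 4 confident
```

Good calibration means this empirical FDR stays close to 1 − τ for every threshold τ, as reported for p̂ > 0.9, 0.95, and 0.99 above.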

Time persistence of location profiles
Our behavior is likely to change over time, as we change jobs or partners, move houses, or favor new shops (fig. S4). To evaluate the robustness of profiles against the natural drift of human behavior over time (55), we compare the performance of our model when the auxiliary information is collected after a time gap g = t_aux − t′_data. Starting from g = 0, i.e., the previous configuration where the auxiliary information and the dataset D_data are collected over the disjoint consecutive periods T_data = [1,11) and T_aux = [11,16), we increase the time gap g and consider auxiliary information collected over T_aux,g = [11 + g, 16 + g).
This experiment shows (fig. S5) that the location profiles built by our model are time persistent, with accuracy π_1,g decreasing slowly over time. For each added week in the time gap, we estimate that π_1,g decreases by 0.93 percentage point on average [±0.29, standard error of the difference (SED), linear fit R² = 0.996]. The AUC of the meta-classifier similarly decreases slowly over time (AUC_g=0 = 0.91 down only to AUC_g=9 = 0.90).
To understand why some individuals are more identifiable than others in the location dataset, we compute a handful of summary statistics for each individual and use them in a post hoc analysis, with individuals split into two groups according to their respective identification rate as g increases (see Materials and Methods). We found (fig. S5) that individuals who are more identifiable visit more unique locations (median 30 versus 21, P < 10⁻¹⁵), that their traces contain more geographical information (geographical entropy of traces: 2.8 versus 2.2 bits of information, P < 10⁻¹⁵), that they spend most of their time within a smaller geographical region [radius of gyration (56): 19.8 versus 21.8 km, P < 10⁻¹⁵], and that they live in more densely populated areas (area of the primary Voronoi cell: 7.2 versus 9.6 km², P < 10⁻¹⁵). These differences suggest that the lifestyle of an individual affects their identifiability in profiling attacks.

Robustness to noise addition
Noise addition has long been used as a mechanism to prevent identification. Geo-indistinguishability, a technique inspired by Differential Privacy (57), has become a popular noise addition mechanism for high-dimensional location data. It has, for instance, been implemented by browser apps such as Location Guard (58) and Geoprivacy (59).

DISCUSSION
In this work, we show for the first time how profiling attacks are possible at large scale against sparse behavioral datasets. Profiling attacks relax a strong requirement of matching attacks, especially against behavioral data: the need for auxiliary information to be recorded not only over the same time period but also roughly at the same times. Our attack significantly expands the attack surface by making a much wider range of auxiliary information usable for reidentification, even against noisy datasets. Further research on profiling attacks is likely to lead to even more powerful models. Our results emphasize the need to account for profiling attacks when evaluating what constitutes anonymous data, for instance, under the European Union Article 29 Working Party's linkability criterion (66) interpreting the GDPR. Technically, our results emphasize the need for formal privacy guarantees and technical privacy engineering solutions enabling the truly anonymous use of behavioral data.

Scalability of the attack to larger datasets
The size of a dataset is likely to affect the likelihood of a person being identified in it, with accuracy likely to be lower in larger datasets on average (46, 47). We here study the accuracy of our attack as a function of N, the number of individuals in the dataset D_data, assuming everything else is kept equal (see the Supplementary Materials).
Figure 4 (A and B) shows that the accuracy π_1 of our attack only decreases slowly with N for both the location and grocery shopping datasets. The first derivative ∂π_1/∂N converges to 0 rapidly as N increases. Last, a simple logarithmic fit (R² = 0.999) shows the decrease of π_1 to behave as a third-order polynomial in log(N). Together, these findings strongly suggest that the accuracy of our attack would remain high in most practical settings.
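The fitting procedure can be illustrated as follows, with made-up accuracy values (not the paper's measurements): a third-order polynomial in log(N) fitted by least squares, and a check that the derivative with respect to N itself flattens out quickly.

```python
import numpy as np

# Illustrative accuracy values at increasing dataset sizes N (not the paper's numbers).
N = np.array([1e3, 5e3, 1e4, 5e4, 1e5, 5e5])
acc = np.array([0.905, 0.874, 0.861, 0.830, 0.817, 0.790])

# Third-order polynomial fit in log(N), as described in the text.
coeffs = np.polyfit(np.log(N), acc, deg=3)
fitted = np.polyval(coeffs, np.log(N))
r2 = 1.0 - np.sum((acc - fitted) ** 2) / np.sum((acc - acc.mean()) ** 2)
assert r2 > 0.99

# The numerical derivative with respect to N shrinks fast as N grows.
grad = np.gradient(acc, N)
assert abs(grad[-1]) < abs(grad[0])
```

Because accuracy moves on a log(N) scale, each additional order of magnitude in population size costs only a few percentage points, which is what makes the attack practical at scale.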

Removing the membership assumption
We have, throughout the article, worked under the assumption that the attacker knew the target to be in the dataset D_data. While we believe that this assumption is reasonable in many practical settings, there are situations where it might not hold. We thus extend the meta-classifier to the case where the attacker is unsure whether the target is in the dataset.
The meta-classifier is adapted using a prior P on the probability of the target to be in the dataset. In the main text, scores were calibrated using two sets of traces obtained over two disjoint periods of time from the same individuals. Here, the prior P is used during the calibration phase as a leave-one-out parameter: each calibration trace in the first set has its counterpart in the second set virtually removed with probability 1 − P (see Materials and Methods). P is chosen by the attacker on a case-by-case basis depending on the information available to them. For instance, P could be the sampling rate for sampled data or the market share of a company in the country of interest.

Figure 4 (C and D) shows the performance of the attack for different values of the prior P. For each prior, after calibration, we perform the attack on targets from a new set of individuals I_aux such that P = J(I_aux, I_data), the Jaccard index between both sets (67). The performance of our meta-classifier only decreases slightly with P as new targets, not contained in the dataset, are introduced. In particular, for the location dataset, the AUC decreases from AUC_P=1 = 0.91 (main text) to AUC_P=0.9 = 0.89, AUC_P=0.75 = 0.89, and AUC_P=0.5 = 0.88. For predictions σ(a) estimated to be correct with p̂ > 0.95, the likelihood for σ(a) to be incorrect is 4.6% for P = 0.9 (respectively 4.3% for P = 0.75 and 3.3% for P = 0.5; see inset in Fig. 4C).
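The prior-based calibration step and the Jaccard-index construction can be sketched as follows; the pair structure and the removal step are simplified stand-ins for the procedure described in Materials and Methods.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index between two sets of individuals."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def drop_counterparts(pairs, P, rng):
    """Calibration with a membership prior P: each calibration trace keeps its
    counterpart in the second set with probability P, simulating targets that
    are absent from the dataset."""
    return [(x, (y if rng.random() < P else None)) for x, y in pairs]

rng = np.random.default_rng(0)
pairs = [(i, i) for i in range(10_000)]
kept = sum(y is not None for _, y in drop_counterparts(pairs, P=0.75, rng=rng))
assert abs(kept / 10_000 - 0.75) < 0.02             # ~75% of counterparts survive

assert jaccard({1, 2, 3, 4}, {3, 4, 5, 6}) == 1/3   # |{3,4}| / |{1,...,6}|
```

Setting P to, e.g., a company's market share lets the attacker calibrate the meta-classifier for the realistic case where some targets are simply not in the data.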

Comparison with previous works on location data
While most previous works on location data have investigated attacks where matching auxiliary information is available to an attacker (34, 37, 45, 68-70), a few attacks using nonoverlapping auxiliary information have been proposed and evaluated on small-scale datasets (from 100 to 50,000 individuals) (71-73). These attacks are based either on Markov chains (71, 72) or on histograms (73). We reimplement and compare our work to six of these methods: four based on histograms, using the Jensen-Shannon (JS) divergence (73), Bhattacharyya (Bhat) distance (74), L1 distance (75), and cosine distance (76), and two based on Markov chains (71, 72). We compare these methods to our approach in three scenarios: (i) no noise added to the dataset, (ii) small amounts of noise added to the dataset (r = 200 m), and (iii) very large amounts of noise added to the dataset (r = 2000 m). Our method outperforms all six previous methods by at least 13 percentage points in each scenario (see table S3). More specifically, we outperform the state of the art by 13.6 percentage points in scenario 1, by 16.3 in scenario 2, and by a striking 26.9 in scenario 3.
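For reference, the four histogram-based baseline measures can be sketched as follows; these are the standard textbook definitions, while the baselines' full pipelines are described in the cited works.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits."""
    def H(x):
        x = x[x > 0]
        return float(-np.sum(x * np.log2(x)))
    m = 0.5 * (p + q)
    return H(m) - 0.5 * H(p) - 0.5 * H(q)

def bhattacharyya(p, q):
    """Bhattacharyya distance between two histograms."""
    return float(-np.log(np.sum(np.sqrt(p * q))))

def l1(p, q):
    """L1 distance (total variation up to a factor of 2)."""
    return float(np.abs(p - q).sum())

def cosine_distance(p, q):
    return float(1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

for d in (js_divergence, bhattacharyya, l1, cosine_distance):
    assert abs(d(p, p)) < 1e-12  # zero for identical histograms
    assert d(p, q) > 0           # positive for distinct histograms
```

All four are symmetric, unlike the trained asymmetric divergence d_{α,λ} used by our model.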
Fig. 4. (A) Model accuracy π_1 averaged over 10 runs for the location dataset (see table S8). π_1 decreases slowly with population size (log fit, R² = 0.999). Inset: The first derivative converges to zero as N increases, confirming that the decrease of π_1 is convex and slow. (B) Model accuracy π_1 averaged over 10 runs for the grocery shopping dataset (see table S9). π_1 decreases slowly with population size (log fit, R² = 0.99). (C) The attack is robust to a relaxation of the membership assumption (location dataset). The prior P is the Jaccard index between I_aux and I_data. Inset: FDR for predictions σ(a) estimated to be correct with p̂ > 0.95. The likelihood for σ(a) to be incorrect is 4.6% for P = 0.9 (respectively 4.3% for P = 0.75 and 3.3% for P = 0.5), showing that the method is well calibrated even when the membership assumption is relaxed (see table S6). (D) The attack is robust to a relaxation of the membership assumption (grocery shopping dataset). The ROC curves for the various priors P are indistinguishable. Inset: FDR for predictions σ(a) estimated to be correct with p̂ > 0.95. The likelihood for σ(a) to be incorrect is 4.22% for P = 0.9 (respectively 4.06% for P = 0.75 and 4.27% for P = 0.5), showing that the method is well calibrated even when the membership assumption is relaxed (see table S7).
Among the baselines, histogram-based methods perform better than Markov-based methods (see table S3).
Deep Learning methods have been developed recently to extract representations from raw sequential data, e.g., Contrastive Predictive Coding (77), Recurrent Attention Models (78), and Autoencoder architectures with Recurrent Neural Networks (79) and Autoregressive models (80). These representations might, in future work, replace the simple profiles we used here (see Materials and Methods), although questions remain on the applicability of these methods to sparse data. Similarly, specialized models for behavioral datasets are likely to be developed in the future. Both will ultimately further increase the scope and accuracy of profiling attacks and the risk they pose to our privacy.

Limitation-auxiliary information
Throughout this work, we evaluate the potential of a person to be identified in a location dataset using fully nonoverlapping auxiliary information coming from the same modality. In some cases, an attacker might try to identify someone using auxiliary information coming from other modalities, including publicly available information, such as social media posts, or privately collected information, such as the WiFi connection data used in the recent reidentification and subsequent outing of a U.S. priest (81). For ethical, legal (82), and contractual reasons, we did not attempt to identify individuals in our dataset using auxiliary information coming from other modalities. Unless the auxiliary information comes from a modality independent from the one used to collect the dataset (e.g., a credit card only used for expenses abroad with a mobile phone dataset recorded only in the country), we expect our model to perform well and our results to qualitatively hold.

Limitation-noise addition mechanism
We here consider geo-indistinguishability, a local noise addition mechanism that has traditionally been used for location data (58, 59). A range of other mechanisms could be considered. For instance, one could decide to report the same obfuscated location every time an individual is in a given real location. One could also consider global mechanisms such as k-anonymity (83). While some of these mechanisms might prove more effective against our attack, something we leave for future work, their impact on the downstream utility of the dataset has to be carefully considered. In particular, the biases introduced by nontruthful methods are generally considered problematic, and global methods such as k-anonymity have been shown to strongly affect utility (83). We are skeptical that behavioral data can be anonymized at the individual level while retaining general utility. Instead, we believe modern privacy engineering methods, such as query-based systems (84, 85), and formal guarantees, such as Differential Privacy, to be the way forward when it comes to safely releasing behavioral data.
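For concreteness, a minimal sketch of the planar Laplace mechanism underlying geo-indistinguishability: the angle is uniform and the radius follows Gamma(2, 1/ε), which yields the standard noise density proportional to exp(−εr). The ε value below is an illustrative choice, not the paper's setting.

```python
import numpy as np

def planar_laplace(points, epsilon, rng):
    """Perturb 2D points with planar Laplace noise (geo-indistinguishability):
    uniform angle, radius drawn from Gamma(2, 1/epsilon), so the noise density
    is proportional to exp(-epsilon * r)."""
    n = len(points)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = rng.gamma(shape=2.0, scale=1.0 / epsilon, size=n)
    return points + np.column_stack((r * np.cos(theta), r * np.sin(theta)))

rng = np.random.default_rng(0)
pts = np.zeros((50_000, 2))
eps = 1.0 / 200.0                      # noise scale 1/epsilon = 200 (meters, illustrative)
noisy = planar_laplace(pts, eps, rng)
radii = np.linalg.norm(noisy - pts, axis=1)

assert abs(radii.mean() - 2.0 / eps) / (2.0 / eps) < 0.05  # E[radius] = 2/epsilon
```

Because the expected displacement is 2/ε, choosing ε trades privacy against the spatial resolution left in the released data.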

Discrepancies between location and grocery shopping datasets
While a complete analysis of why individuals might be less identifiable in shopping data than location data is beyond the scope of this work, we offer some hypotheses below. First, the grocery shopping dataset used here is much sparser than the location dataset (0.52 data points per week on average for the grocery shopping dataset versus 50.70 data points per week on average for the location dataset). This is likely to affect the computation of profiles by density estimation, making them less accurate. Second, grocery shopping data points might be less identifiable than location data points. For instance, groceries online are mostly purchased from the first category page displayed by retailers (86), which could reduce the diversity of shopping baskets across individuals. Third, shopping patterns might be less stable over time than location patterns. Previous works have shown human mobility to be fairly predictable, especially with regard to home and work locations (55, 87, 88). On the other hand, customers seem to shop for groceries online only as a complement to traditional stores (89), with many situational factors influencing the loyalty of customers and their purchasing habits over time (90).
Last, shopping patterns have been used to train recommender systems. These systems learn to predict future purchases through collaborative filtering, deducing future purchases from what other individuals have purchased in the past. However, while recommender systems learn that customers who bought X will also likely be interested in Y, our model learns the singularities of the shopping habits of an individual to then identify the specific list of their purchases in the past.

(Displaced figure caption fragment) (B) Our model outperforms the baselines for any amount of added noise, e.g., by 27.6 percentage points for r = 3000 m. π_1 also visually decreases more rapidly for baselines than for our model.

MATERIALS AND METHODS
Our model is an open-set inductive classifier based on nearest neighbor classification (1-NN) (91). Open-set means that classes, i.e., the identities of the individuals in the dataset, are disjoint between training, validation, and testing, thus providing the attacker with a model readily applicable to new individuals. Previous studies have shown that 1-NN performance can be improved by learning the model's distance (92, 93).
Our methodological contribution can be summarized in three points: (i) We propose an abstract space as input of the model, the space of profiles, and a method to map raw data into that space as collections of histograms. (ii) Within the space of profiles, we propose a supervised learning method similar to recent works in distance learning (92, 93) to learn a new divergence to compare profiles. (iii) From the divergence values, we propose a method to estimate the likelihood of auxiliary information to be correctly classified. Our framework, which we will now describe, makes no parametric assumption about the distribution of the data.

Formalism
Behavioral data are individual-level temporal data containing discrete events characterizing the behavior of each individual. We model the generation of these discrete events as point processes, a general nonparametric model for point pattern analysis (94). Formally, for each individual, we consider a trace y as a realization of a point process Y on T × X, where T ⊂ ℝ₊ is a time interval over which the data are recorded and X = ∏_k X_k is a multidimensional space with each X_k either discrete or an interval on the real line. For instance, in the main text, X is the (single-dimensional) finite set of the indexed geographical regions around each antenna. We further assume that these point processes are invariant, as random elements whose values are point patterns, under week translation over time. This is a modeling assumption that works well enough to capture the weekly patterns followed by individuals' behavior, aside from holidays and life-changing events (55, 87, 88). Under this assumption, Y has similar distributions over all T′ × X, where T′ ⊂ T is a 1-week time interval.

Mapping traces to profiles
Using a map Φ from raw traces to the space of profiles S, we compute profiles aiming to capture the recurrent patterns of an individual while reducing the microvariations observed in the data. Profiles are collections of density estimates corresponding to variables obtained from each point process Y_i (see the Supplementary Materials). Formally, for each individual i, the profile Φ(Y_i) = Z_i = (Z_i,k)_{k=1,…,q} is a collection of random variables Z_i,k taking values on their respective probability simplex S_k (with S = ∏_{k=1,…,q} S_k). Here, we choose Z_i,k to be a histogram obtained using the random counting measure N_i associated with Y_i on a collection B_k of Borel sets of T × X, with N_i(B) = #(B ∩ Y_i) the random variable counting the number of events of Y_i in B, for any B ∈ B_k. Aiming for profiles to be time persistent and robust to added noise, we consider collections of Borel sets B_k corresponding to aggregating events time-wise (over T) and value-wise (over X) (see the Supplementary Materials). Although beyond the scope of this work, other density estimation methods, e.g., kernel density estimators, could be used for the variables Z_i,k.
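A minimal sketch of the histogram construction, assuming two illustrative variables (visited location and hour-of-week, the latter reflecting the assumed weekly invariance); the paper's actual aggregation choices are described in the Supplementary Materials.

```python
import numpy as np

# A toy trace: (timestamp in hours since epoch, location id) pairs.
trace = [(5, 3), (29, 3), (53, 3), (31, 7), (100, 1)]

N_LOCATIONS = 10
HOURS_PER_WEEK = 24 * 7

def profile(trace):
    """Map a raw trace to a profile: one normalized histogram per variable.
    Here q = 2 variables: visited location (value-wise aggregation) and
    hour-of-week (time-wise aggregation, folding all weeks together)."""
    loc_hist = np.zeros(N_LOCATIONS)
    how_hist = np.zeros(HOURS_PER_WEEK)
    for t, x in trace:
        loc_hist[x] += 1
        how_hist[t % HOURS_PER_WEEK] += 1
    return [loc_hist / loc_hist.sum(), how_hist / how_hist.sum()]

Z = profile(trace)
assert all(abs(h.sum() - 1.0) < 1e-12 for h in Z)  # each variable is a distribution
assert Z[0][3] == 0.6                              # location 3 seen in 3 of 5 events
```

Folding timestamps modulo one week is what makes the profile a density estimate under the week-translation invariance assumed above.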

Divergence
Our model learns how important each random variable Z_•,k is for profiles to be identifiable and weights these variables accordingly. Each variable is valued on a probability simplex of up to a few thousand dimensions. The weights α ∈ ℝ₊^q and λ ∈ [0,1]^q are, by design, shared across individuals for the model to be inductive. This allows the model to be applied to individuals that are not seen during training and validation.
We define the model divergence d_{α,λ} = Σ_{k=1}^q α_k d_{λ_k} as a linear combination, weighted by α, of pairwise subdivergences of the variables Z_i,k on their respective simplexes. More specifically, the subdivergence d_{λ_k} compares two histograms X_k and Y_k through the gap of concavity of the entropy H, d_{λ_k}(X_k, Y_k) = H(λ_k X_k + (1 − λ_k) Y_k) − λ_k H(X_k) − (1 − λ_k) H(Y_k), which is nonnegative by concavity of H and zero when X_k = Y_k.

Empirical setup
Figure S1 shows how we split the location dataset to train, validate, test, and calibrate our model. In particular, the dataset is split into training sets Y_1 and Y_2, validation sets Y_3 and Y_4, testing sets D_data and D_aux, and score calibration sets Y_A and Y_B. For testing, traces are collected over T_data = [t_data, t′_data) from individuals in I_data. Auxiliary information is fully nonoverlapping, collected over T_aux = [t_aux, t′_aux) with t′_data ≤ t_aux from targets in I_aux such that the Jaccard index between I_data and I_aux is equal to the attacker's prior P. For training, traces are collected over a split of T_data into two consecutive time intervals T_1 = [t_0, t_1) (with t_0 = t_data) and T_2 = [t_1, t_2) (with t_2 − t_1 = t_1 − t_0) from individuals in I_train. Training individuals I_train are disjoint from I_data and I_aux. For validation, traces are collected over a split of T_data into two other consecutive time intervals T_3 = [t_2, t_3) and T_4 = [t_3, t_4) (with t_4 = t′_data and t_4 − t_3 = t_3 − t_2) disjoint from T_1 and T_2. Individuals I_valid used for validation are disjoint from I_data, I_aux, and I_train. Last, to calibrate the scores, traces are recorded over another split of T_data into T_A = [t_A, t_B) and T_B = [t_B, t_4) (with t_4 − t_B = t_B − t_A = t′_aux − t_aux, i.e., 5 weeks here) from individuals in I_data.
We select an 80/20 split for the time, with the 10 weeks of T_data split into 4 and 4 weeks for T_1 and T_2, and 1 and 1 week for T_3 and T_4, all disjoint. We used a small number of individuals for training (10,000) and validation (1000) to illustrate the strength of our model even when tested orders of magnitude above its training size (N = 0.5 million). Individuals were kept strictly separate between training, validation, and testing to prevent overfitting and to show the inductive strength of our model.
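The disjoint splits described above can be sketched as follows. This is a minimal illustration, not the authors' code: integer IDs stand in for real individuals, week indices stand in for real timestamps, and the function name `make_splits` is ours.

```python
# Minimal sketch of the disjoint individual-level and time-level splits
# described above. Integer IDs stand in for real individuals; week indices
# stand in for real timestamps over the 10 weeks of T_data.

def make_splits(individuals, n_train=10_000, n_valid=1_000):
    """Split individuals into disjoint training, validation, and test pools."""
    i_train = individuals[:n_train]                       # I_train
    i_valid = individuals[n_train:n_train + n_valid]      # I_valid
    i_data = individuals[n_train + n_valid:]              # tested pool I_data
    return i_train, i_valid, i_data

# Time intervals, in weeks, over T_data = weeks [0, 10):
T1, T2 = (0, 4), (4, 8)    # training: two consecutive 4-week intervals
T3, T4 = (8, 9), (9, 10)   # validation: two consecutive 1-week intervals
TA, TB = (0, 5), (5, 10)   # calibration: two 5-week intervals, individuals in I_data
```

Keeping the three individual pools disjoint is what makes the reported accuracy a test of the model's inductive strength rather than of memorization.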

Fig. 1.
Fig. 1. Representation of the attack and effect of training. (A) Using auxiliary information 𝒟_aux about the target (top: recorded over T_aux), the attacker attempts to identify the target in the dataset 𝒟_data (bottom: recorded over T_data). Traces are processed by the model in three steps: (i) time-persistent profiles are computed, (ii) the dissimilarities between the auxiliary information and the profiles are computed with the divergence d, and (iii) potential candidates are ranked and the meta-classifier estimates the likelihood π̂ that the best candidate is the target. (B and C) Representation of profiles built by the model before (B) and after (C) training using t-distributed stochastic neighbor embedding (t-SNE) on d (97). Each point is a profile computed from 1 week of location data for a person. The training procedure here improves the ability of our model to distinguish the profiles of a single individual from the profiles of other individuals.
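The candidate-ranking steps (ii) and (iii) can be sketched as a toy pipeline. This is an illustration only: an L1 distance stands in for the learned divergence d, the meta-classifier is omitted, and all names (`rank_candidates`, `divergence`) are ours, not the authors' API.

```python
import numpy as np

# Toy sketch of attack steps (ii)-(iii): profiles are given as arrays and
# `divergence` stands in for the learned divergence d. The meta-classifier
# step, which would score the best candidate, is omitted here.

def rank_candidates(aux_profile, dataset_profiles, divergence):
    """Rank dataset individuals by dissimilarity to the auxiliary profile."""
    scores = np.array([divergence(aux_profile, p) for p in dataset_profiles])
    order = np.argsort(scores)  # most similar candidate first
    return order, scores[order]

# Usage with an L1 stand-in for d: the target's own profile (index 2) has
# zero dissimilarity to itself and therefore ranks first.
rng = np.random.default_rng(0)
profiles = rng.random((5, 8))
order, scores = rank_candidates(profiles[2], profiles,
                                lambda a, b: float(np.abs(a - b).sum()))
```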

Fig. 2.
Fig. 2. The model identifies the correct individual with high probability. (A) Likelihood π_m to find a target in the location dataset within the top m candidates selected by our model out of N = 0.5 million. An attacker has a π_1 = 79% chance of correctly identifying the target in the location dataset, with π_m increasing rapidly with m. (B) Likelihood π_m to find a target in the grocery shopping dataset within the top m candidates selected by our model out of N = 85,000 individuals. An attacker has a π_1 = 65% chance of correctly identifying the target in the grocery shopping dataset, with π_m increasing rapidly with m. (C) The meta-classifier accurately evaluates whether the individual found by our model in the location dataset is correct (ROC curve, AUC = 0.91). Inset: false discovery rate (FDR) for traces with estimated likelihood π̂ above a given threshold. For individuals predicted to be correctly found in the location dataset with π̂ > 0.95, 4.81% are actually incorrect, showing the method to be well calibrated (see table S4). (D) The meta-classifier accurately evaluates whether the individual found by our model in the grocery shopping dataset is correct (ROC curve, AUC = 0.94). Inset: FDR for traces with estimated likelihood π̂ above a given threshold. For individuals predicted to be correctly found in the grocery shopping dataset with π̂ > 0.95, 4.22% are actually incorrect (see table S5). TPR, true positive rate; FPR, false positive rate.
Geo-indistinguishability is achieved by adding, to each data point, independent spatial noise sampled from a bidimensional Laplace distribution with mean radius r̄ = 2/ϵ for a given parameter ϵ. Typical values of ϵ used in the literature range from 0.023 m⁻¹ (r̄ = 100 m) to 0.0034 m⁻¹ (r̄ = 600 m) (60-65). Knowing the noisy location of a data point thus only reveals a 95% confidence region about its real location, with radius ranging from r_95 = 237 m (r̄ = 100 m) to r_95 = 1432 m (r̄ = 600 m). Figure 3 shows that the accuracy of our model on the location dataset only decreases to 78% when small amounts of noise are added (r̄ = 100 m, r_95 = 237 m) and to 71% for large amounts of noise (r̄ = 600 m, r_95 = 1432 m). This shows that our model is robust to even the large amounts of spatial Laplace noise addition used in the literature and industry. Even the addition of very large amounts of noise (r̄ > 2000 m) only decreases the accuracy of the model to slightly below 60%. This decrease is, however, also likely to strongly affect the utility of the data: r̄ = 2000 m indeed means r_95 = 4744 m (see the probability density functions in the inset of Fig. 3A). For comparison, the average area of a zip code in New York City corresponds to a circular region of radius 1300 m.
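This noise mechanism can be sketched numerically. A planar Laplace noise with density D_ϵ(r) = ϵ²re^(−ϵr) for the radius is equivalent to drawing r from a Gamma(shape = 2, scale = 1/ϵ) distribution with a uniform direction, giving a mean radius of 2/ϵ; the function name below and the sampling route via `numpy` are our illustrative choices, not the paper's implementation.

```python
import numpy as np

# Sketch of bidimensional (planar) Laplace noise for geo-indistinguishability.
# The noise radius r has density D_eps(r) = eps^2 * r * exp(-eps * r), i.e. a
# Gamma(shape=2, scale=1/eps) distribution, so the mean radius is 2/eps.

def planar_laplace_noise(eps, size, rng):
    r = rng.gamma(shape=2.0, scale=1.0 / eps, size=size)   # noise radius (m)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=size)       # uniform direction
    return r * np.cos(theta), r * np.sin(theta)            # (dx, dy) offsets

rng = np.random.default_rng(0)
eps = 0.02  # 1/m, i.e. mean radius 2/eps = 100 m
dx, dy = planar_laplace_noise(eps, 100_000, rng)
radii = np.hypot(dx, dy)
# Empirically, the mean of `radii` is close to 100 m, and its 95th percentile
# is close to 237 m, matching the r_95 = 237 m quoted in the text.
```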

Fig. 3.
Fig. 3. The model is robust to noise addition. (A) π_1 when locations in 𝒟_data are perturbed with Laplacian noise. Standard amounts of noise (average radius r̄ < 600 m, r_95 < 1432 m) only decrease π_1 by 7 percentage points. For large amounts of noise, π_1 only slowly decreases further, e.g., π_1 = 59% for r̄ = 2000 m (r_95 = 4744 m). Inset: probability density function D_ϵ(r) = ϵ²re^(−ϵr)𝟙_{r>0} of the noise radius r. The mean radius is r̄ = 2/ϵ. (B) The predictive power of the meta-classifier, captured by the ROC curves, only decreases slowly as the amount of added noise increases. The AUC decreases from 0.91 without noise down to, at most, 0.83 for r̄ = 4000 m.

Fig. 4.
Fig. 4. Robustness of the attack to larger datasets and relaxation of the membership assumption. (A) Model accuracy π_1 averaged over 10 runs for the location dataset (see table S8). π_1 decreases slowly with population size (log fit, R² = 0.999). Inset: the first derivative

Fig. 5.
Fig. 5. The model outperforms previous work on location data. (A) Our model outperforms the baselines at all scales on the location dataset. Accuracies π_1 are averaged over 10 runs (see table S8). (B) Our model outperforms the baselines for any amount of added noise, e.g., by 27.6 percentage points for r̄ = 3000 m. π_1 also visually decreases more rapidly for the baselines than for our model.
The subdivergence d_{β_k} compares the convex combination M_{i,i′,k} = β_k Z_{i,k} + (1 − β_k) Z_{i′,k} of the histograms Z_{i,k} and Z_{i′,k}, via the entropy H on Δ_k (M_{i,i′,k} ∈ Δ_k, as probability simplexes are stable by convex combination), to the convex combination h_{i,i′,k} = β_k H(Z_{i,k}) + (1 − β_k) H(Z_{i′,k}) of the entropies taken separately: d_{β_k}(Z_{i,k}, Z_{i′,k}) = H(M_{i,i′,k}) − h_{i,i′,k}. Because of the concavity of H, for all k, the subdivergence d_{β_k} is valued in ℝ_+.
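Under the definitions above, each subdivergence is the Jensen gap between the entropy of the mixture and the mixture of the entropies, and the model divergence is their α-weighted sum. A minimal numerical sketch, with illustrative variable names of our choosing:

```python
import numpy as np

# Sketch of the entropy-based divergence: each subdivergence is the Jensen
# gap H(M) - h between the entropy of the mixture and the mixed entropies;
# the model divergence is the alpha-weighted sum over the q variables.

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                                   # convention: 0 * log 0 = 0
    return -np.sum(nz * np.log(nz))

def subdivergence(p, q, beta):
    m = beta * p + (1.0 - beta) * q                 # mixture, still on the simplex
    h = beta * entropy(p) + (1.0 - beta) * entropy(q)
    return entropy(m) - h                           # >= 0 by concavity of H

def model_divergence(zi, zj, alpha, beta):
    """d_{alpha,beta}: weighted sum of per-variable subdivergences."""
    return sum(a * subdivergence(p, q, b)
               for p, q, a, b in zip(zi, zj, alpha, beta))

p, q = np.array([0.5, 0.5]), np.array([1.0, 0.0])
d = subdivergence(p, q, 0.5)  # with beta = 0.5 this is the Jensen-Shannon divergence
# d > 0 here, and the subdivergence is 0 whenever the two histograms coincide.
```

For β_k = 0.5, the subdivergence reduces to the classic Jensen-Shannon divergence; learning β_k lets the model weight the target's histogram and the candidate's histogram asymmetrically.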