<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Honesty is the Best Policy: On the Accuracy of Apple Privacy Labels Compared to Apps' Privacy Policies</title></titleStmt>
			<publicationStmt>
				<publisher>PoPETS</publisher>
				<date>10/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10577928</idno>
					<idno type="doi">10.56553/popets-2024-0111</idno>
					<title level='j'>Proceedings on Privacy Enhancing Technologies</title>
<idno>2299-0984</idno>
<biblScope unit="volume">2024</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Mir Masood Ali</author><author>David G Balash</author><author>Monica Kodwani</author><author>Chris Kanich</author><author>Adam J Aviv</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Apple introduced privacy labels in Dec. 2020 as a way for developers to report the privacy behaviors of their apps. While Apple does not validate labels, they also require developers to provide a privacy policy, which offers an important comparison point. In this paper, we fine-tuned BERT-based language models to extract privacy policy features for 474,669 apps on the iOS App Store, comparing the output to the privacy labels. We identify discrepancies between the policies and the labels, particularly as they relate to data collected linked to users. We find that 228K apps' privacy policies may indicate more data collection linked to users than what is reported in the privacy labels. More alarming, a large number (97%) of the apps with a Data Not Collected privacy label have a privacy policy indicating otherwise. We provide insights into potential sources for discrepancies, including the use of templates and confusion around Apple's definitions and requirements. These results suggest that significant work is still needed to help developers more accurately label their apps. Our system can be incorporated as a first-order check to inform developers when privacy labels are possibly misapplied.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Privacy policies are ubiquitous and required in many settings <ref type="bibr">[35]</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref><ref type="bibr">[64]</ref>, and for better or worse, are an important tool for communicating about the behavior of systems. Natural language policies have many shortcomings and are full of technical details and jargon that significantly impact their usability as a tool to inform users clearly about behaviors and data management practices <ref type="bibr">[28,</ref><ref type="bibr">58]</ref>. Privacy nutrition labels, or privacy labels, offer an alternative that both simplifies and standardizes the communication of privacy behavior, similar to food nutrition labels <ref type="bibr">[20,</ref><ref type="bibr">51]</ref>. In December 2020, Apple began requiring privacy labels <ref type="bibr">[31]</ref> for all new and updated apps in the App Store. Apple's privacy labels ask developers to self-label (without verification) the data collection and sharing practices of their apps, the purposes, the types of data, and whether that data is linked to user identities (see Figure <ref type="figure">1</ref> for more details). Essentially, privacy labels standardize the presentation of privacy behavior described in the privacy policy's natural language text.</p><p>In this paper, we answer the question: How do privacy labels compare to the behavior described in the privacy policies?</p><p>We conducted a large-scale analysis of the Apple App Store by reviewing 474,669 apps' privacy policies and privacy labels using a validated implementation of PrivBERT <ref type="bibr">[70]</ref>, a transformer-based privacy policy language model. 
We fine-tuned PrivBERT with the OPP-115 corpus and mapped its features to Apple's privacy labels to identify discrepancies between the reported behavior of apps based on their labels compared to their privacy policies.</p><p>We find that there are large differences between privacy labels and privacy policies. Most prominently, according to our analysis of the privacy policies, nearly 228K more apps may be performing some amount of data linking than the number of apps that reported similar data collection in the labels. More alarming, 97% of apps that report no data collection in their privacy label have statements in their privacy policy to the contrary. In many cases, labels diverge from the privacy policy regarding the kinds of data collected, particularly around app functionality and analytics or "other" functionality not prescribed by a privacy label.</p><p>We also compared free and paid apps. While paid apps use fewer privacy labels compared to free apps, the policies tell a different story: only 4% of paid apps report collecting data that is linked to users, but the policies suggest that 76% of paid apps perform such collection. We further analyzed privacy-relevant data practices that are not covered by privacy labels. We found that most apps (76%) had a self-assigned content rating of 4+ on the App Store to indicate age appropriateness and enforce parental controls. Of these apps, only 50% had a policy in place to handle data collected from children. Our case study further reveals that such a policy may simply disclaim responsibility for collecting and handling data from users under 13 years of age. We also employ a similarity metric and identify that 65% of evaluated apps potentially use templates, providing insight into a possible source of discrepancies. 
We further analyzed the network traffic from 30 apps, showing that their data collection practices diverged from those declared in privacy labels and privacy policies.</p><p>Our analysis indicates that privacy labels are likely misapplied in great numbers, even considering that classifiers are imperfect for analyzing privacy policies. More guidance for developers would go a long way toward improving the accuracy of privacy labels. Still, there are also more concerning misapplications that could and should be addressed more broadly, such as the collection of data used to track users and apps falsely reporting that they do not collect any data. In these cases, the privacy policies are often explicit about this behavior, and the absence of a corresponding entry in the privacy label could lead to misunderstandings of the risks associated with using these apps and potentially violate Apple's App Store policies. First-level checks of the privacy policies when apps are submitted to the App Store could go a long way in highlighting and correcting some of the more common and egregious privacy label inaccuracies. In this work, we make the following contributions.</p><p>&#8226; We build and validate a hierarchical framework that uses fine-tuned transformer models to extract multiple features from privacy policies. &#8226; We develop and validate a mapping between features extracted from classifiers and App Store privacy labels. &#8226; We collect and analyze the privacy labels of 474,669 apps against their policies and find large differences in their reported practices. &#8226; We use a similarity metric to compare policies against templates and find that their use might indicate a likely source of observed discrepancies. We also present examples from a case study of traffic collected from 30 apps and show evidence of discrepancies. &#8226; We publicly release our code and dataset of app metadata and privacy policies to facilitate further research. 
The artifact associated with this paper can be accessed at <ref type="url">https://github.com/masood/2024-pets-privacy-labels-policies</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND AND RELATED WORK</head><p>Anatomy of a privacy label. The Apple privacy labels are similar in style and content to the "Privacy Facts" label developed by Kelley et al. <ref type="bibr">[53]</ref>. The structure of a label is hierarchical (see Figure <ref type="figure">1</ref>) and describes data collection practices under four levels:</p><p>(1) Privacy Type: Describes how the app handles collected data, which includes data collected for tracking users (with third parties), data collected and linked to users' identities, and data collected but aggregated/anonymized. An app's privacy label may contain a combination of one, two, or all three types. An app may also report that data is not collected, which is mutually exclusive with the other types. (2) Purpose: Discloses the intended reason for the data collection, e.g., for advertising, analytics, personalization.</p><p>(3) Data Category: Reports at a high level the category under which collected data falls. (4) Data Type: Granular information that describes the data collected under the Data Category.</p><p>Privacy nutrition labels. Privacy nutrition labels have been studied from a variety of perspectives <ref type="bibr">[16, 27, 32, 33, 51-53, 68, 71]</ref>, but Apple's privacy label is the first wide-scale deployment <ref type="bibr">[31]</ref>. In an exploratory study, Li et al. <ref type="bibr">[56]</ref> observed the adoption of iOS privacy labels on the App Store and found that very few developers voluntarily created privacy labels. Balash et al. <ref type="bibr">[15]</ref> performed a 66-week analysis of the privacy label adoption on the Apple App Store and identified a steady increase in the number of apps with privacy labels and likely under-reporting by developers forced to provide a label on a version update.</p><p>Zhang et al. 
<ref type="bibr">[81]</ref> conducted an in-depth interview study to determine the usability of iOS privacy labels from a user perspective. Most users found the privacy labels helpful despite misunderstandings that included unfamiliar terms and a confusing structure. Garg et al. <ref type="bibr">[40]</ref> discovered that privacy label disclosures of sensitive information reduce app demand, and thus, the accuracy of the labels is important to help users make informed choices.</p><p>Gardner et al. <ref type="bibr">[39]</ref> developed a tool to assist developers by prompting them while coding functionality that would potentially require a privacy label. Li et al. <ref type="bibr">[55]</ref> studied developers' creation of Apple's privacy nutrition labels and conducted semi-structured interviews. They found that errors and misunderstandings were prevalent in the privacy label generation process. These errors included under-reporting linked data, third-party data use, and missing data types. We observe the same errors when comparing privacy labels against privacy policies. The "knowledge blindspots" and misinterpretations of Apple's definitions that Li et al. describe likely lead to many of the misapplications we identified.</p><p>Privacy behavior of mobile apps. Numerous studies have measured the privacy behaviors of mobile applications <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">18,</ref><ref type="bibr">19,</ref><ref type="bibr">21,</ref><ref type="bibr">61,</ref><ref type="bibr">69,</ref><ref type="bibr">80,</ref><ref type="bibr">82,</ref><ref type="bibr">83]</ref>. One of the first approaches to automatically identify problems in privacy policies was PPChecker <ref type="bibr">[80]</ref>, which combined an NLP analysis of privacy policy text with bytecode analysis. Andow et al. <ref type="bibr">[8]</ref> developed PolicyLint to identify contradictions within an individual policy. Andow et al. 
<ref type="bibr">[9]</ref> also created PoliCheck, which considers third-party versus first-party entity access to personal data for an entity-sensitive consistency check. Bui et al. <ref type="bibr">[19]</ref> extended PoliCheck to develop PurPliance, which checks if data, entity, and purpose are equivalent to those extracted from data flows. In this paper, we choose Polisis <ref type="bibr">[46]</ref> as the policy analysis tool as it produces output similar to the privacy labels.</p><p>[Figure 2 (measurement workflow; image not reproduced). Stages shown: (1) Perform Crawl of the App Store: App Metadata and Privacy Labels, Policy Links. (2) Privacy Policies: Collect Policy HTML, Extract Readable Text, Filter and Clean Extracted Text. (3) Classify Policies and Extract Label: Split Policy into Segments, Segment and Attribute Classifiers, Parse Data Collection Practices, Create Privacy Labels. (4) Compare and Evaluate: Privacy Label Created from Privacy Policy against Privacy Label Declared on the App Store.]</p><p>Zimmeck et al. <ref type="bibr">[82]</ref> evaluated 1,035,853 Android apps using the Mobile App Privacy System (MAPS), a pipeline based on code analysis and supervised machine learning classifiers, to identify potential non-compliance with privacy standards. Kollnig et al. <ref type="bibr">[54]</ref> analyzed 1,759 iOS apps using a combination of code analysis and network traffic monitoring, and they found that 80% of the apps that claimed not to collect any data in the privacy labels contained at least one tracker library. We find that this discrepancy probably exists at scale. Xiao et al. <ref type="bibr">[79]</ref> analyzed 5,102 apps (&#8764;1% of our dataset) by checking the privacy labels against actual data flows and focused on two levels of labels, Purposes and Data Types. They discovered that 67% of those apps failed to accurately disclose their data collection practices, particularly around the use of User ID, Device ID, and Location data. 
Our results complement their findings: the collection of unique identifiers in an identifiable manner, as mentioned in privacy policies, is not reflected in the privacy labels. Further, our work analyzes apps at a much larger scale and covers Privacy Types, Purposes, and Data Categories.</p><p>Apple's deviations from recommendations. Although derived from Kelley et al.'s <ref type="bibr">[53]</ref> work, Apple's implementation deviates from its recommendations. While Kelley et al. noted, "presenting [labels] clearly and simply we could affect user decisions," Apple embeds the nutrition label far down the App Store page, requiring interested users to scroll through details, so users may not see the labels before deciding to install an app. Additionally, Apple's labels do not give users choices or allow them to compare labels between apps. Further, recent user studies have found the labels to be confusing for developers <ref type="bibr">[55]</ref>, showing the possibility that developers misapply labels. Finally, Kelley et al. highlighted the need for the labels to be accurate and noted, "users believe this information is correct, is being verified, and will assume they misunderstand something before they would believe the displays are incorrect." Since Apple's privacy labels are not vetted and are not trustworthy, this points to a serious concern about providing disinformation to end users. These factors further highlight the necessity to verify and demonstrate the discrepancies we present.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">MEASUREMENT WORKFLOW</head><p>In Figure <ref type="figure">2</ref>, we present the primary measurement workflow described in detail below. During all scans, we followed best practices of limiting the number of requests and respecting 403 Errors by using exponential back-offs.</p><p>1 Crawling the App Store. We began by parsing the XML site map from Apple's App Store, which lists all apps currently published on the store, and then crawled each URL, parsing the privacy labels and associated metadata, such as the app name, version, size, type, user rating, genre, content rating, release date, seller name, and price. Notably, the metadata includes a link to the privacy policy. We also parsed the extended privacy label details, such as the purposes and data types, by performing an additional GET request to the Apple Catalog API <ref type="bibr">[12]</ref>. In January 2024, there were 1.2M apps on the App Store. Of them, 995K apps had a privacy label, and we identified 993K apps with links to 669K unique policies (note that some apps link to the same policy).</p><p>[Figure 3 (classification framework; image not reproduced). A Segment Classifier identifies First Party Collection/Use and Third Party Collection/Sharing segments, which are passed to Attribute Classifiers: Does/Does Not (does; does not); Identifiability (identifiable; aggregated/anonymized); Purpose (basic service/feature; adtl. service/feature; advertising; marketing; analytics/research; personalization/customization; service operation and security; legal requirement; merger/acquisition; other purpose); Information Type (financial; health; contact; location; demographic; personal identifier; user online activities; user profile; social media data; cookies and tracking elements; computer information; ip address and device ids; survey data; generic personal information; other; unspecified; ...).]</p><p>2 Collecting Privacy Policies. We extracted the HTML for each policy using a Python script. 
We leveraged the readability library <ref type="bibr">[60,</ref><ref type="bibr">73]</ref>, a standalone version of the Firefox browser reader mode. The library employs a complex set of heuristics to extract relevant text from web pages <ref type="bibr">[72]</ref>, leaving us with de-cluttered HTML that we divided into segments based on the &lt;p&gt; tags. We then used a wrapper library on Google's language detection to discard non-English policies <ref type="bibr">[63]</ref>. When policies included lists where each list entry was not self-contained, we merged these lists into the preceding text to provide relevant context. We scanned short lists, i.e., where each list item was composed of &lt;20 words, and merged them into the preceding paragraph, thereby treating the entire list as a single segment. We then eliminated segments comprising &lt;20 words. After cleaning, the classifiers individually processed each segment and mapped it back to the original policy. After excluding links that returned response errors, the readability library successfully extracted relevant text from 286,717 policies, which we classified in the next stage.</p><p>3 Classifying Policies and Extracting Labels. We analyzed policies with a similar approach to Polisis <ref type="bibr">[46]</ref>, an NLP framework that classifies data collection behavior from privacy policy text. Unfortunately, the prior published Polisis implementation is proprietary, and on reaching out, the authors informed us that their website can only take up to 30K policies. We completely reimplemented the classification framework to the same standards as prior work. We replaced their CNN-based approach with a state-of-the-art language model to improve classifier performance. We used PrivBERT <ref type="bibr">[70]</ref>, a transformer-based privacy policy language model, which was developed by pre-training the RoBERTa-BASE model <ref type="bibr">[57]</ref> on 1M privacy policies. 
We fine-tuned PrivBERT on the OPP-115 corpus <ref type="bibr">[78]</ref>. We present an overview of the framework structure in Figure <ref type="figure">3</ref>. In Table <ref type="table">5</ref> in the Appendix, we show that the PrivBERT classifiers perform better than CNNs. We provide more details about training and evaluating the models in Appendix A.</p><p>We first passed each segment through the Segment Classifier to extract the high-level data practice. We passed any segments addressing First Party Collection/Use or Third Party Collection/Sharing through six Attribute Classifiers (Does/Does Not, Identifiability, Purpose, Personal Information Type, Action First Party, and Action Third Party) to extract annotations relevant to privacy labels. We used the Action First-Party attribute to filter out any segments explicitly addressing collection on websites (and not mobile apps). We used the Action Third-Party attribute to eliminate instances wherein the third party only 'sees' and does not collect data. We successfully detected segments addressing data collection in the policies of 474,669 apps (&#119899; = 280,767 policies), which we then used to create privacy labels.</p><p>4 Compare and Evaluate. The taxonomy of policy labeling does not always have a one-to-one mapping with Apple's privacy labels. So, we developed a grounded strategy based on qualitative coding to convert outputs from classifiers into equivalent privacy labels. Three researchers independently coded the conversions and then discussed them to reach an agreement on the mappings between OPP-115 and privacy labels. 
The coders completed three matching tasks:</p><p>&#8226; First, the coders determined which data practices found by the Segment Classifier (e.g., First Party Collection/Use or Third Party Collection/Sharing), when combined with attribute classifier outputs such as "Identifiable," "Aggregated/Anonymized," "Does," or "Does Not," match an appropriate Apple privacy label type, such as Data Linked to You or Data Not Collected. For example, when the framework identifies a segment with a data practice of "First Party Collection/Use" and the data is "Identifiable," that segment is associated with the Apple privacy label type Data Linked to You. &#8226; Next, the coders matched the output of the Purpose Attribute Classifier against Apple's privacy label purposes. For example, a framework output of "Basic Services/Features" gets mapped to App Functionality for privacy label purposes. &#8226; Finally, the coders matched the outputs of the Personal Information Type Attribute Classifier to the data categories provided in Apple's privacy label. For example, Polisis may identify that a segment discusses "Contact," which then maps to the privacy label data category of Contact Info. The combination of these three matching tasks provides a single privacy label entry for an app, according to the privacy policy, describing the privacy type (e.g., Data Linked to You), the purpose (e.g., App Functionality), and the data category collected (e.g., Contact Info). Table <ref type="table">1</ref> lists the privacy label entries derived directly from segment annotations created using the Polisis framework. The coding process also revealed additional, inferred privacy labels from Polisis classification that included a combination of classifications and keywords relevant for Data Used to Track You and the remaining Data Categories. Table <ref type="table">2</ref> shows the inferred privacy labels. 
We further verified the mapping by randomly sampling labels generated from classifier outputs. In the Appendix, we present our evaluation in Table <ref type="table">4</ref>.</p></div>
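<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the Compare and Evaluate step, the following is a minimal sketch of how classifier annotations could be converted into privacy label entries and compared against declared labels. It is not the released implementation. The names SegmentAnnotation, to_label_entry, policy_privacy_types, and compare are hypothetical, and the mapping dictionaries contain only the example pairs mentioned in the text; the complete mappings are those of Tables 1 and 2.</p>
<p>
```python
from dataclasses import dataclass

# Assumption: illustrative subsets of the mappings, limited to the examples
# given in the text. The full mappings live in the paper's Tables 1 and 2.
PRIVACY_TYPE = {
    ("First Party Collection/Use", "Identifiable"): "Data Linked to You",
    ("First Party Collection/Use", "Aggregated/Anonymized"): "Data Not Linked to You",
}
PURPOSE = {"Basic Services/Features": "App Functionality"}
DATA_CATEGORY = {"Contact": "Contact Info"}


@dataclass(frozen=True)
class SegmentAnnotation:
    """Hypothetical container for the classifier outputs of one policy segment."""
    practice: str         # Segment Classifier output
    identifiability: str  # Identifiability Attribute Classifier output
    purpose: str          # Purpose Attribute Classifier output
    info_type: str        # Personal Information Type Attribute Classifier output


def to_label_entry(annotation):
    """Return (privacy type, purpose, data category), or None if unmapped."""
    privacy_type = PRIVACY_TYPE.get((annotation.practice, annotation.identifiability))
    purpose = PURPOSE.get(annotation.purpose)
    category = DATA_CATEGORY.get(annotation.info_type)
    if privacy_type is None or purpose is None or category is None:
        return None
    return (privacy_type, purpose, category)


def policy_privacy_types(annotations):
    """Collect the privacy types that a policy's segments map to."""
    entries = (to_label_entry(a) for a in annotations)
    return {entry[0] for entry in entries if entry is not None}


def compare(policy_types, label_types):
    """Split privacy types into the Policy Only / Both / Label Only groups."""
    return {
        "Policy Only": policy_types.difference(label_types),
        "Both": policy_types.intersection(label_types),
        "Label Only": label_types.difference(policy_types),
    }


annotations = [
    SegmentAnnotation(
        practice="First Party Collection/Use",
        identifiability="Identifiable",
        purpose="Basic Services/Features",
        info_type="Contact",
    )
]
result = compare(policy_privacy_types(annotations), {"Data Not Collected"})
```
</p>
<p>In this toy example, the policy segment maps to Data Linked to You while the declared label is Data Not Collected, so the privacy type falls into the Policy Only group and the declared type into Label Only, which is the kind of discrepancy the comparison is designed to surface.</p></div>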
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">LIMITATIONS</head><p>Before proceeding, it is essential to note the limitations of our approach in comparing the privacy labels with the privacy policies.</p><p>Ground truth. Foremost, we note that neither the labels nor the policies can provide comprehensive ground truth of app behavior, and even static and dynamic analysis have limitations. Here, we report only on observed discrepancies between the policies and the labels, but validating which is more in line with app behavior is beyond the scope of this paper. However, as these discrepancies</p><p>Classifier Predictions. The outputs of language models introduce uncertainty that propagates further when combined. As a result of these inaccuracies, we can only report on the presence of statements addressing data collection practices in privacy policies and differences when compared with privacy labels. However, the reported discrepancies are much larger than the associated uncertainties. Additionally, our framework analyzes privacy policies on a per-paragraph/per-segment basis, so it cannot detect explanations of app behaviors that span multiple segments.</p><p>Train/Test Dataset. Without an updated corpus with equivalent robustness, we fine-tuned language models on the OPP-115 corpus <ref type="bibr">[78]</ref>, an extensive dataset comprising manual annotations of 23k fine-grained data practices gathered from multiple graduate-level law students. However, the dataset includes old privacy policies that the researchers collected before the introduction of present-day privacy laws. We identify the limitations introduced by this dataset and recognize the need for an updated dataset. Additionally, specific annotations in the OPP-115 corpus do not directly map to the Apple privacy label taxonomy. As such, the independent annotators used a grounded approach to develop an inferential mapping to address this limitation (see Table <ref type="table">2</ref>). 
Finally, in Table <ref type="table">5</ref>, we manually evaluate classifier performance on new policies by randomly sampling segments from our dataset of app policies.</p><p>Information Extraction. Privacy policies comprise varying formats, reducing the amount of information we can gather from our framework. As previously highlighted, our per-segment approach misses information that spans multiple, non-contiguous segments.</p><p>Next, policies present information in various media formats (e.g., images) that we do not include in our analysis. Finally, many privacy policies contain links to third parties' privacy policies. We did not analyze the transitive closure of all privacy policies as part of this work. Apple's policy is for privacy labels to include all collection and tracking mechanisms, including third-party practices. Because we did not follow these links, our analysis is a lower bound of data collection performed within an app, particularly related to third parties.</p><p>Figure 4: An overview of apps declaring data collection with corresponding Privacy Types within their privacy policies (top) and on the App Store via privacy labels (bottom). The denominator is the total apps that we analyzed, i.e., 474,669 apps. Please note that the privacy types, except for Data Not Collected, are not mutually exclusive. [Image not reproduced. Axis: Privacy Type Ratio. Legend: Policy Only, Both, Label Only. Values as extracted, in the order Data Used to Track You, Data Linked to You, Data Not Linked to You, Data Not Collected: privacy policy 37%, 83%, 35%, 1%; privacy label 22%, 40%, 45%, 36%.]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RESULTS</head><p>In this section, we directly compare developers' reported privacy labels to the output of language models following the hierarchical structure of the privacy labels (see Figure <ref type="figure">1</ref>).</p><p>Privacy Types. We first consider the top level of privacy labels, the privacy types: Data Used to Track You, Data Linked to You, Data Not Linked to You, and Data Not Collected. We are primarily concerned with determining the number of apps with such a privacy type and whether we can also find that privacy type in the policies. Figure <ref type="figure">4</ref> and Table <ref type="table">3</ref> provide a snapshot of the overlap of privacy types extracted from privacy policies and the privacy types declared in the privacy labels for the app on the App Store. As a helpful reminder while reading the numbers reported in this table, three of the privacy types, Data Used to Track You, Data Linked to You, and Data Not Linked to You, are not mutually exclusive. Apps may collect data linked to the user and aggregated/anonymized (i.e., not linked to the user), and they may also collect data to track the user.</p><p>The Data Linked to You privacy type indicates that the app collects data linked to users, i.e., in an identifiable manner. Of the 190,965 apps that indicated such collection on the App Store, our framework identified the same practice in the policies of 88% (&#119899; = 168,121) (Fig. <ref type="figure">4</ref>; lower half; yellow bar; hatches). More concerning, we observed an additional 228,539 apps that reported this practice in their policies but did not report it on the App Store (Fig. <ref type="figure">4</ref>; top half; yellow bar; stripes). We identified that 41% (&#119899; = 88,172) of the apps whose privacy labels stated that they collected data in an aggregated/anonymized manner, i.e., had the Data Not Linked to You privacy type, also said so in their policies (Fig. 
<ref type="figure">4</ref>; lower half; blue bar; hatches). Of the remaining 59% (&#119899; = 127,020) of apps that had the Data Not Linked to You privacy type in their label but did not have a corresponding policy segment (Fig. <ref type="figure">4</ref>; lower half; blue bar; stripes), 76% (&#119899; = 97,029) instead included segments in their privacy policy that indicated that they collect data linked to users (Table <ref type="table">3</ref>; row 2; col 3). This difference may result from apps not stating their aggregation practices in the same segment of the policy that addresses data collection. Even after factoring in uncertainty, there is a large gap between the practices declared in privacy labels and privacy policies.</p><p>Perhaps more problematic are apps that report they do not collect any data. Recall that the Data Not Collected privacy type is mutually exclusive, i.e., developers only added this label to apps that claim not to collect any data from users. While 36% (&#119899; = 172,924) of the apps that we analyzed indicated in their privacy label that they did not collect any data, only 3% (&#119899; = 4,359) of these apps made similar statements in their policies (Fig. <ref type="figure">4</ref>; lower half; green bar). More surprisingly, 84% (&#119899; = 173,441) of these apps stated in their policies that they collected data linked to users (Table <ref type="table">3</ref>; row 4; col 2).</p><p>Finally, of the 108,937 apps that stated on the App Store that they collected data to track users, our framework also reported similar practices in the privacy policies of 49% (&#119899; = 53,359) (Fig. <ref type="figure">4</ref>; bottom half; red bar; hatches). We identified an additional 123,675 apps that reported this practice in their policies but did not declare it on the App Store (Fig. <ref type="figure">4</ref>; top half; red bar; stripes). 
Recall that the framework infers this privacy type; we therefore capture only part of the user tracking that apps engage in and present a lower bound on mislabeling. Our identification of apps that fail to report data collected for tracking indicates that many apps are under-reporting their tracking practices.</p><p>Takeaways. Developers are very likely under-reporting their collection of identifiable data on the App Store. Most apps that indicate on the App Store that they do not collect any data state otherwise in their privacy policies.</p><p>Purposes. We look at how apps claim to use the data they collect. Figure <ref type="figure">5</ref> presents a snapshot of the purposes associated with data collection, as identified from privacy labels and privacy policies. As a reminder, apps may collect both linked and not linked (anonymized) data. Additionally, apps may collect data for multiple purposes. For example, an app may collect your Location in an anonymized manner to personalize user experience (Product Personalization) and in an identifiable manner to help advertisers and agencies tailor the advertisements they display (Third Party Advertising).</p><p>We find greater agreement between privacy labels and privacy policies for apps that collect data for App Functionality and Analytics. Of the 161,587 apps that indicated in their privacy label that they collect data linked to users for App Functionality, 81% (&#119899; = 130,108) also included a corresponding statement in their privacy policy. Similarly, of the 105,729 apps that stated in their privacy label that they collect data linked to users for Analytics, 68% (&#119899; = 71,883) also included a corresponding statement in their privacy policy (Fig. <ref type="figure">5</ref>; bottom half; left plot; yellow bars 1 &amp; 2; hatches).</p><p>We find notable discrepancies in developers' reporting of Third-Party Advertising in their privacy policies and on the App Store. 
Considering data collection that is linked to users (Fig. <ref type="figure">5</ref>; left plot; bar 4), 139,765 apps exclusively declare this purpose in their privacy policies (top half) and do not report this practice on the App Store. This figure is concerning because it is a lower bound: privacy policies often link to third-party policies instead of including the details themselves. The results indicate that developers focus on their app's data collection practices when filling out privacy labels without considering third parties. We further highlight the problem of incomplete labeling with examples in &#167;6 and Appendix B.</p><p>Finally, we find that while 366,840 (77%) apps stated in their privacy policies that they collected data in an identifiable manner for a purpose that does not fit into any of the options that Apple provides in their privacy label, only 17,487 (5%) of these apps also addressed this on the App Store (Fig. <ref type="figure">5</ref>; left plot; yellow bar 6). It appears that developers are less forthcoming about declaring data collection in their privacy labels for purposes beyond Apple's taxonomy, making limited use of the catch-all: Other Purposes.</p><p>Takeaways. Developers are more likely to declare data collection for App Functionality and Analytics in either privacy labels or privacy policies. Developers are also less likely to declare data collection in their privacy labels for purposes beyond Apple's taxonomy, i.e., Other Purposes.</p><p>Data Categories. We additionally analyze the data categories collected by apps as stated in their privacy labels and policies.</p><p>Figure 5: The ratios of the six purposes for the Data Linked to You and Data Not Linked to You privacy types. The denominator is the number of apps with the designated privacy type either in their privacy label or their privacy policy, i.e., 419,504 apps with a Data Linked to You label and 294,391 with a Data Not Linked to You label. Note that the privacy types shown here are not mutually exclusive. Two other privacy types are not shown here: the Data Used to Track You privacy type refers to collection for the purpose of tracking, while the Data Not Collected privacy type refers to the absence of any data collection.</p><p>We find that apps are more likely to declare in either their privacy policies or their privacy labels that they collect Contact Info (&#119899; = 273,351; 65%) and Identifiers (&#119899; = 320,607; 76%) linked to users (Fig. <ref type="figure">6</ref>; middle plot; yellow bars 2 &amp; 5). Apps that collect data to track users are more likely to use Browsing History (46%; &#119899; = 106,816), Identifiers (71%; &#119899; = 164,732), and Usage Data (65%; &#119899; = 150,651) (Fig. <ref type="figure">6</ref>; upper plot; red bars 1, 5, &amp; 7). Our findings are in line with previous work that showed tracking activities target users with cookies and tracking pixels (Identifiers) and monitor their browsing practices across sites and services (Browsing History and Usage Data) <ref type="bibr">[5,</ref><ref type="bibr">34]</ref>.</p><p>However, we find that apps that state in their privacy policy that they collect Browsing History (i.e., how users browse the Internet outside of the app) and Sensitive Info (such as racial/ethnic data, sexual orientation, etc.) linked to users are less likely to declare this collection in their privacy labels (Fig. <ref type="figure">6</ref>; middle plot; top half; yellow bars 1 &amp; 13). Surprisingly, of the 212,121 apps that stated in their privacy policy that they collect Browsing History linked to users, only 658 (0.3%) declared this practice in their privacy labels. While 96,837 apps indicated in their privacy policy that they collect some form of Sensitive Info, only 2% (&#119899; = 2,144) of these apps also declared this collection in their privacy labels. 
Of notable concern, we find 22,171 apps and 11,710 apps mislabeling their collection of Identifiers and Contact Info respectively as being linked to users when their policies indicate that they use collected data to track users (see Table <ref type="table">7</ref>).</p><p>Takeaways. Developers most commonly state that they collect Identifiers and Contact Info that are linked to users. Developers that state in their privacy policies that they collect Browsing History or Sensitive Info linked to users are less likely to declare this collection in their privacy labels. Apps that track users are more likely to use Browsing History, Identifiers, and Usage Data, which is in line with prior findings about tracking practices.</p><p>Free vs. Paid Apps. The App Store has four pricing models: free apps, free apps with in-app purchases, paid apps, and paid apps with in-app purchases. Interestingly, when only observing privacy labels (Fig. <ref type="figure">7</ref>; all plots; bottom half), it would appear that paid apps have better privacy behaviors than their free counterparts. However, the altruism of paid apps compared to free apps disappears when considering the privacy policies (the top half of Figure <ref type="figure">7</ref>). The privacy policy analysis better aligns with the observations of Han et al. <ref type="bibr">[44,</ref><ref type="bibr">45]</ref>, who compared free and paid apps in the Android Play Store based on the inclusion of third-party advertising software, finding no differences between free and paid apps.</p><p>We find that paid apps show the largest discrepancies, potentially under-reporting data collection practices in their privacy labels compared to their privacy policies. While the privacy policies suggest that 75% (&#119899; = 21,330) of paid apps collect data linked to users, only 4% (&#119899; = 1,145) of paid apps have a privacy label of this type (Fig. 
<ref type="figure">7</ref>; second plot; yellow bar 3). More concerning, while the privacy policies of 21% (&#119899; = 6,118) of paid apps report collecting data to track users, only 2% (&#119899; = 643) of paid apps report this practice on the App Store (Fig. <ref type="figure">7</ref>; first plot; red bar 3).</p><p>Content Rating. Developers provide a Content Rating as part of the app metadata to indicate the age appropriateness of their apps. These ratings are reviewed by Apple <ref type="bibr">[10]</ref> and used to enforce parental control features that restrict children from accessing the app <ref type="bibr">[11]</ref>. We find that most apps have a 4+ content rating on the App Store (81%; &#119899; = 419,762), while fewer apps have 9+ (3%; &#119899; = 16,687), 12+ (9%; &#119899; = 46,737), or 17+ (13%; &#119899; = 69,309) content ratings. Since privacy labels do not indicate the app's data practices specific to children, users must review the privacy policy to learn this information. 
Given parental control settings, an app with a 4+, 9+, or 12+ rating could be used by minors, although they may not be the intended or target audience for the app. However, when an app specifically targets children, it is subject to additional regulations that may require parental consent.</p><p>Figure 6: The ratios of data categories against privacy types. The denominator is the number of apps with the designated privacy type either in their privacy label or their privacy policy, i.e., 232,648 apps with Data Used to Track You, 419,504 apps with Data Linked to You, and 294,391 apps with Data Not Linked to You. The three privacy types shown here are not mutually exclusive. (Plot panels: Data Used to Track You, Data Linked to You, Data Not Linked to You; axis: Privacy Type Ratio for the privacy policy and the privacy label; data categories: Browsing History, Contact Info, Financial Info, Health &amp; Fitness, Identifiers, Location, Usage Data, User Content, Diagnostics, Contacts, Purchases, Search History, Sensitive Info; series: Policy Only, Both, Label Only.)</p><p>
We fine-tuned language models to identify policy segments that address International/Specific Audiences and to further identify whether the segment addresses Children, and then compared this output to the content rating. Only 50% (&#119899; = 179,168) of apps with a 4+ content rating also included a privacy policy segment that addresses data practices specific to children (Fig. <ref type="figure">8</ref>; all plots; left-most bar). We were more likely to find similar policy segments for apps with different content ratings that can also be accessed by children, 9+ (65%; &#119899; = 10,118) and 12+ (59%; &#119899; = 22,293).</p><p>We further looked at content ratings for different privacy types associated with data collection. Considering apps with a 4+ content rating, roughly half had a policy explicitly addressing children across privacy types. While 20% (&#119899; = 74,320), 37% (&#119899; = 134,076), and 44% (&#119899; = 159,512) of the apps with a 4+ content rating declare in their privacy label that they collect data used to track users, linked to users, and not linked to users respectively, only 58% (&#119899; = 43,536), 51% (&#119899; = 68,715), and 54% (&#119899; = 86,743) of those apps also addressed children in their privacy policies (Fig. <ref type="figure">8</ref>; plots 1, 2, &amp; 3; bottom half; left-most bars; white overlay indicates addressing children).</p><p>While adding a 4+ content rating may help developers reach a wider audience, we found that only half of these apps consider data practices specific to children in their privacy policy. Additionally, even when apps address data collection from children in their privacy policies, these segments may absolve the developer of any responsibility. For example, ChowNow <ref type="bibr">[24]</ref> is an app platform used by 3,182 different apps of local restaurants to receive online orders for takeout and delivery. 
ChowNow adds a content rating of 4+ to its apps on the App Store, making them accessible to children. Recall that developers choose a content rating according to Apple's guidelines <ref type="bibr">[10]</ref>; Apple does not assign this value. However, ChowNow's privacy policy absolves the company of the responsibility of dealing with data collected from children.</p><p>Figure 7: The ratios of app costs for each of the four privacy types. The denominator is the number of apps with the designated app cost that have a privacy label. Free apps are more likely than paid apps to collect data, including data used to track and linked to users. Please note that privacy types shown here are not mutually exclusive. (Plot panels: Data Used to Track You, Data Linked to You, Data Not Linked to You, Data Not Collected; axis: Privacy Type Ratio for the privacy policy and the privacy label; app costs: Free, Free In-App, Paid, Paid In-App; series: Policy Only, Both, Label Only.)</p><p>Figure 8: The ratios of the content ratings for each of the four privacy types, with an overlay (white bar) indicating the ratio of apps that also include a segment in their privacy policy, where they address privacy practices specific to children who engage with their services. The denominator is the number of apps with the designated content rating that have a privacy label. Please note that privacy types shown here are not mutually exclusive. (Plot panels: Data Used to Track You, Data Linked to You, Data Not Linked to You, Data Not Collected; axis: Privacy Type Ratio for the privacy policy and the privacy label; content ratings: 4+, 9+, 12+, 17+; series: Address Children, Policy Only, Both, Label Only.)</p><p>We acknowledge that our findings do not imply that the evaluated apps violate COPPA <ref type="bibr">[36]</ref>, which, for example, allows PII collection with specific restrictions (e.g., geolocation) provided that developers do not use data for targeting/profiling of minors and that they obtain informed consent from a parent or legal guardian. We highlight the lack of declaration of data practices in privacy policies, especially when considered optional, and the need to ensure transparency across platforms. 
Additionally, third-party libraries offer options to help applications comply with COPPA regulations, but prior work has shown that they are often misconfigured <ref type="bibr">[67]</ref>.</p><p>Figure 9: An overview of the privacy types associated with data collection on the App Store, from privacy labels and privacy policies, specific to apps whose policies are similar to templates. The denominator is the total number of such apps, i.e., 300,535 apps. Please note that the privacy types, except for Data Not Collected, are not mutually exclusive. (Plot bars: Data Used to Track You, Data Linked to You, Data Not Linked to You, Data Not Collected; axis: Privacy Type Ratio for the privacy policy and the privacy label; series: Policy Only, Both, Label Only.)</p><p>App Genre. We present an overview of our findings by app genre in Figure <ref type="figure">11</ref> in Appendix D. We find that Games apps are most likely to collect data used to track users (60%) and linked to users (59%) (Fig. <ref type="figure">11</ref>; plots 1 &amp; 2; bar 8). Notably, while 83% of apps associated with the Stickers genre stated on the App Store that they do not collect any data, our analysis found that 66% of these apps collected data linked to users (Fig. <ref type="figure">11</ref>; plots 2 &amp; 4; bar 23). Apps under the Stickers genre are mostly lightweight apps made by smaller developers. They tend to have a 4+ content rating to reach a larger audience. They can include a few ad spaces and analytics libraries. Our intuition is that individual developers may not be aware of the data collection from third-party analytics and advertising libraries.</p><p>App Popularity. Since the App Store does not reveal the number of downloads for an app, we instead rely on the number of user ratings as a proxy for app popularity. 
To better represent their disclosures, we bin rating counts within the same order of magnitude in a single category and present our findings in Figure <ref type="figure">10</ref> in Appendix D. We find that with increased popularity, apps are more likely to declare data collection linked to users and used to track users. Our findings suggest that popular apps are likely to be more thorough in their declaration of data collection practices because they receive more scrutiny.</p><p>Privacy Policy Templates. Templates offer a valuable solution for creating privacy policies, as they provide a ready-made framework for organizations to establish clear guidelines regarding handling user data. These pre-designed templates serve as a starting point that developers can customize to meet specific requirements and legal obligations. By utilizing templates, businesses can save time and effort by avoiding the need to create privacy policies from scratch. Additionally, templates help ensure compliance with privacy regulations by incorporating standard clauses and disclosures, ensuring that the privacy policy aligns with applicable laws such as GDPR or CCPA. However, it is essential for organizations to carefully review and tailor the template's content to accurately reflect their unique practices, guaranteeing transparency in communicating their privacy practices to users.</p><p>We evaluated the policies in our dataset to identify the use of templates. We searched for privacy policy templates and generators and gathered a list of services. We then visited each service and signed up, if required. We collected a set of 15 privacy policy templates, which we cleaned and divided into individual sentences. We represented the text in both the templates and the policies using in-domain word embeddings derived from privacy policies shared by Harkous et al. <ref type="bibr">[46]</ref>. For each policy in our dataset, we conducted a comprehensive sentence-level comparison. 
We compared each sentence in a policy against every sentence in a template. We employed the cosine similarity metric to measure the semantic resemblance between two sentences. We deemed sentences similar if their cosine similarity exceeded a threshold of 0.8. We established a criterion to determine if a policy was derived from a template: if over half of the sentences in a policy were similar to over half of the sentences in the template, we identified the policy as template-like.</p><p>We find that the privacy policies of 65% (&#119899; = 306,404) of apps potentially use templates. We looked at the privacy labels these apps have declared on the App Store. Considering privacy types, 23%, 45%, 46%, and 31% of these apps declare the Data Used to Track You, Data Linked to You, Data Not Linked to You, and Data Not Collected privacy types, respectively, in their labels on the App Store (Fig. <ref type="figure">9</ref>; bottom half; all bars). These findings align with those for all evaluated apps (see Figure <ref type="figure">4</ref>). A majority of evaluated apps use template-like privacy policies. The use of templates possibly affects the discrepancies between the declaration of data collection practices in privacy labels and privacy policies. Templates are often produced by generators, which offer significant value by ensuring developers thoroughly consider various data collection and sharing practices. Using these generators is similar to creating privacy labels on the App Store. However, it is essential to recognize that templates are not one-size-fits-all solutions. Developers must review and tailor policies derived from templates to accurately reflect individual apps' unique data collection practices. By carefully reviewing and customizing policies, developers can ensure the accuracy of their disclosures.</p></div>
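<div xmlns="http://www.tei-c.org/ns/1.0"><p>The template-matching criterion described in the Privacy Policy Templates paragraphs can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes that each sentence has already been turned into a fixed-length vector (for example, by averaging word embeddings, which is an assumption here), and it reads the criterion as requiring that more than half of the policy sentences match some template sentence and that more than half of the template sentences are matched by some policy sentence. The names cosine, is_template_like, sim_threshold, and share are hypothetical.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def is_template_like(policy_vecs, template_vecs, sim_threshold=0.8, share=0.5):
    """Compare every policy sentence vector with every template sentence vector.

    Assumed reading of the criterion: the policy is template-like when more
    than `share` of its sentences are similar to some template sentence AND
    more than `share` of the template sentences are similar to some policy
    sentence, where "similar" means cosine similarity above `sim_threshold`.
    """
    if not policy_vecs or not template_vecs:
        return False
    matched_policy = set()
    matched_template = set()
    for i, p in enumerate(policy_vecs):
        for j, t in enumerate(template_vecs):
            if cosine(p, t) > sim_threshold:
                matched_policy.add(i)
                matched_template.add(j)
    return (len(matched_policy) > share * len(policy_vecs)
            and len(matched_template) > share * len(template_vecs))
```

<p>Under this reading, a policy is compared against each of the 15 templates in turn and counts as template-like if the function returns True for at least one of them. The pairwise loop is quadratic in the number of sentences, so an analysis at the scale of the App Store would likely need a vectorized or indexed variant.</p></div>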
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CASE STUDIES</head><p>Without Apple verifying privacy labels (and policies), their contents may not wholly clarify actual app practices. We present case studies of app behavior to shed light on the potential disparities between stated data collection practices and real-world app behavior. We compare network requests captured from app usage to the behavior developers report in labels and policies.</p><p>We used an iPhone running iOS 17.3.1 (released Feb 2024) with a man-in-the-middle (MiTM) proxy <ref type="bibr">[26]</ref> to gather outgoing traffic to determine domains that apps accessed. We evaluated each app in the following manner: (1) We installed the app directly from the App Store. (2) We established a connection between the iPhone and the proxy. (3) Upon opening the app, the proxy captured and stored any outgoing requests made by the app. (4) After closing the app and terminating the proxy connection, we deleted the app before evaluating the next app in the sequence.</p><p>We included 39 apps in the analysis, split between (a) 24 apps that declare data collection for advertising purposes in their privacy policies but not on their privacy labels and (b) 15 apps that declare a "Data Used to Track You" privacy type in their label on the App Store, but for which we could not infer such a practice from their privacy policies. We then compared the domains in the captured network requests against EasyList, EasyPrivacy, and WhoTracks.Me to identify trackers <ref type="bibr">[4,</ref><ref type="bibr">41,</ref><ref type="bibr">50]</ref>. We provide an overview of our findings in Table <ref type="table">6</ref> in Appendix C. The analysis presented in this study is an exploratory case study of 39 apps' network behavior. 
It should not be considered representative of the practices of all apps on the App Store.</p><p>The evaluated apps contact numerous tracking domains, with Facebook and Google being the most prominent. Further, developers often do not include analytics libraries within their purview of tracking, but guides from these libraries show that their practices are more nuanced <ref type="bibr">[14,</ref><ref type="bibr">43]</ref>. Additionally, inconsistencies between privacy disclosures and network traffic persist across different app categories. When privacy policies mention third-party libraries, they refer to third-party policies, resulting in incomplete inferences from an automated approach like the one presented in this work. We elaborate on potential explanations for our observations below.</p><p>Policy Reuse. Developers with multiple apps on the App Store reuse the privacy policies linked with individual apps. While this practice may result from using generic templates for some developers, organizations can also reuse these templates with multiple services. For example, different developer accounts publish Lexington Law and CreditRepair (#1 &amp; #2 in Table <ref type="table">6</ref>), and the apps link to different privacy policies on the App Store. However, their privacy labels and privacy policies are identical. They are subsidiaries of the same organization, PGX Holdings Inc., and reuse declaration statements even if these statements apply to those subsidiaries. Developers must update templates so that they accurately describe each app's data collection practices, which in turn supports accurate privacy labels.</p><p>Understanding Third Party Collection. When applications state in their privacy policies that they do not share data with third parties except to provide certain services (not including targeted advertising), it is possible that developers do not clearly understand or parse the nuances of data collection and sharing performed by integrated third parties. 
For example, PayPal, Crumbl, and Discord (#3, #9, #12 in Table <ref type="table">6</ref>) have policies covering data collection and sharing from third parties. To their credit, third-party libraries provide guidelines and disclosure links for developers to review before filling out their privacy labels and privacy policies (e.g., <ref type="bibr">[14,</ref><ref type="bibr">43,</ref><ref type="bibr">59,</ref><ref type="bibr">74]</ref>). However, these guides include multiple caveats that can further complicate developers' understanding, requiring developers to interpret them against their own use cases and translate them into Apple's data collection definitions and requirements.</p><p>Understanding App Store requirements. Apple requires that developers declare all data collected in the app, including the practices of third-party partners, except for certain scenarios wherein disclosure is deemed optional <ref type="bibr">[30]</ref>. While apps like Venmo, Southwest Airlines, Open Table, and Indeed (#1, #4, #6, #11 in Table <ref type="table">6</ref>) fill their privacy labels with multiple data categories under the Data Linked to You and Data Not Linked to You privacy types, they fail to do the same while declaring Data Used to Track You. Their privacy policies include statements highlighting third-party data collection and sharing for advertising and measurement purposes, indicating the developers' understanding of such activity. However, despite the App Store requiring the disclosure of all data collection practices, the developers' interpretation of optional caveats may affect their creation of privacy labels. For example, the period tracking app, Maya (#24 in Table <ref type="table">6</ref>), declared the sharing of Usage Data for tracking users, but the third-party libraries that it uses additionally collect and use identifiers and device information to track users <ref type="bibr">[43,</ref><ref type="bibr">59]</ref>.</p><p>Understanding Apple's Definition of Tracking. 
Apple details practices that it considers to fall under Tracking, along with examples and caveats <ref type="bibr">[30]</ref>. However, a recent study by Li et al. <ref type="bibr">[55]</ref> showed that developers find it difficult to understand this definition and to correctly identify data linked to users and data used to track users. Apps like Axolochi, WebMD, and Food Network Magazine (#19, #21, and #22 in Table <ref type="table">6</ref>) acknowledge the use of tracking technologies in their privacy policies. However, the absence of similar declarations in their privacy labels can stem from confusion around Apple's definition of tracking.</p><p>Next, we present possible reasons for discrepancies for apps that have a Data Used to Track You privacy type in their App Store label but for which it proved challenging to automatically capture tracking practices from their privacy policies.</p><p>Non-exhaustive Policies. The privacy policies of Shake Shack, Kika Keyboard, Photo Prints CVS, Everpix, and FloatMe (#25, #26, #27, #28, #29 in Table <ref type="table">6</ref>) mention third party collection and sharing in terms of legal compliance and mergers/acquisitions. These privacy policies do not comprehensively cover all practices and data collection scenarios, making it difficult to identify such practices without ground truth.</p><p>Unclear Policy Statements. Even when developers declare third-party data collection and sharing in their privacy policies, such declarations are not explicit or clear enough to enable automatic detection and inference. 
The policies of Buffalo Wild Wings, The General Auto Insurance App, and Conservative News (#30, #31, #32 in Table <ref type="table">6</ref>) include statements about sharing information with "non-affiliated third parties", "vendors", and "third party code and libraries", but do not make explicit the specific data categories collected and the use of this data for tracking, advertising, or advertising measurement.</p><p>Complex Formats. Being free-form documents, privacy policies do not need to be presented in standard, machine-parsable formats. While developers provide correct links to their policies on the App Store, we can only access the content of the policy behind one or more further links, as is the case with apps like McDonalds and Episode (#35, #36 in Table <ref type="table">6</ref>). Additionally, the policy for BrainBoom (#33 in Table <ref type="table">6</ref>) presents information in mixed formats, i.e., text and images, further complicating our ability to identify all practices. Finally, apps like JCPenney, Dosh, and CDL Prep Test (#37, #38, #39 in Table <ref type="table">6</ref>) provide incorrect or broken links on the App Store, resulting in the extraction of incorrect content from automated crawls.</p></div>
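<div xmlns="http://www.tei-c.org/ns/1.0"><p>The domain comparison used in these case studies (matching the hosts of captured requests against tracker lists) can be approximated with the following minimal sketch. It is an illustration under stated assumptions, not the pipeline used in this work: it assumes the block lists have already been reduced to a plain set of tracker domains, it ignores the path and option rules that lists such as EasyList also contain, and the function names host_of, matches_tracker, and tracker_hosts are hypothetical.</p>

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def matches_tracker(host, tracker_domains):
    """True if the host itself or any of its parent domains is a listed tracker."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in tracker_domains
               for i in range(len(labels)))


def tracker_hosts(request_urls, tracker_domains):
    """Sorted, de-duplicated hosts from the captured requests that match the list."""
    hosts = {host_of(u) for u in request_urls}
    return sorted(h for h in hosts if h and matches_tracker(h, tracker_domains))
```

<p>Matching on parent domains means that a request to a subdomain of a listed tracker is also flagged. A domain-only check of this kind would miss list rules that depend on the URL path, so its output is best read as a rough first pass.</p></div>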
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">DISCUSSION AND CONCLUSIONS</head><p>We analyzed 474,669 apps on the App Store, comparing the practices reported in privacy policies to those reported in privacy labels by performing automated NLP classification of the privacy policies. We find that most apps are likely under-reporting data collection practices in their privacy labels compared to their privacy policies. We find that almost all (97%) apps that indicate in their privacy labels that they do not collect any data engage in some form of data collection according to their privacy policy. Additionally, the privacy labels of 84% of paid apps indicate that they do not collect any data. In contrast, privacy policies suggest that the actual number may be closer to only 6.4% paid apps. Privacy policy analysis also reveals additional information about data practices not captured in privacy labels, including that most apps (81%) selected a 4+ content rating, but only 50% of these apps mention data collected from children in their privacy policies.</p><p>Ethics. The analysis and findings we present are based on publicly available data. We only mention popular apps (determined from rating counts) associated with large companies or developed by services with numerous associated apps. We reached out to Apple and shared our paper before publication. We encourage communication from developers and researchers to make use of our code and data to verify privacy labels.</p><p>In the remainder of this section, we discuss some of the implications of this analysis, such as the ground truth of privacy behavior when considering privacy labels or privacy policies. We also consider what factors likely lead to the misapplication of labels and recommendations for improving the current state.</p><p>Privacy Behavior Ground Truth. Since Apple's labels are not validated, we considered the privacy policies a reasonable reference point of comparison. 
However, it is not easy to know the actual ground truth of privacy behavior, even if we fully dynamically and statically analyze every app. In this paper, we compare privacy labels against privacy policies as a point of comparison of the declaration of data practices across platforms. Privacy policies do not serve as ground truth for actual app behavior. While there are limitations to the approach we take in analyzing privacy policies using classifiers, the NLP methods of extracting information from free-form text get us closer to a viable understanding of data collection practices than the privacy labels, as currently used. We believe that this is the case for two reasons. First, classifier outputs introduce uncertainties that stem from the fact that policies are analyzed on a per-segment basis, so discussions of data aggregation or anonymization that occur in one segment, separate from the data that is collected, might appear as data linking when the data is, in fact, not linked. However, even with these statements, the app's behavior remains ambiguous according to the privacy policy regarding which specific data categories are aggregated or anonymized. Apps could often link data based on unique identifiers stated in other policy segments. Our observations suggest that developers mislabel many apps even after considering uncertainties from classifier outputs. Second, there are also significant cases of under-reporting from classifiers due to how Apple links to privacy policies and the use of secondary privacy policies from third-party libraries. Many privacy policies link to other policies that we did not analyze. The App Store links also point to the developers' and not the specific apps' privacy policies. These policies usually address all services provided by the developer. 
For example, Subsplash <ref type="bibr">[2]</ref> and ChowNow <ref type="bibr">[24]</ref> affect thousands of apps, and it is unknown how the eventual customer uses that data and if policies reflect such scenarios.</p><p>Takeaway. We need improved notions of ground truth, which can dynamically identify data collection within apps at scale. However, even with their shortcomings, privacy policies provide a first-level check to identify discrepancies in privacy labels.</p><p>Source of Confusion Around Privacy Labels. It may also be that the processes for generating a privacy policy, including legal staff, are quite different from those selecting the labels, leaving the onus on the development team to make an accurate submission to the App Store. This split in responsibilities could cause confusion about the kinds of data covered by the privacy label (as compared to what is in the policy) and about what Apple would consider linked or not linked to users. For example, a recent study by Li et al. <ref type="bibr">[55]</ref> showed that developers find it difficult to correctly identify data linked to users and data used to track users. Our results suggest a large mismatch for both linked and not linked data with respect to Purposes: App Functionality and Analytics appear particularly confusing, especially when apps collect unique identifiers, as does the collection of other kinds of data that should map to the Other Purposes category.</p><p>Takeaway. We argue that inaccurate labels are not necessarily the developers' fault but that better guidance and education are required to help them match app practices to labels.</p><p>Divergent Incentive Models. Privacy policies have become a standard and accepted part of notice and consent laws, and failure to provide an accurate and comprehensive privacy policy could lead to serious legal consequences. 
Companies are well incentivized to provide broad privacy policies that give legal cover for their data collection practices and protect them from any jeopardy, including by hiring lawyers and other policy experts to craft and review them. Given their length and legal jargon, research shows that privacy policies are neither well understood <ref type="bibr">[66]</ref> nor actively reviewed by most users <ref type="bibr">[49]</ref>. In contrast, privacy labels are now forward-facing and published directly on the App Store, so users can review them without following any links. Recent results by Garg et al. <ref type="bibr">[40]</ref> have even suggested that privacy labels can reduce app demand in cases of collecting sensitive information. The incentive for privacy labels may be economic rather than legal, and these diverging incentive models may help explain some of the large differences we observed between privacy policies and privacy labels. This setup may change, and it is reasonable to consider that privacy labels should face the same regulatory scrutiny as privacy policies due to their role. One could also argue that Apple can expand privacy labels to include more explicit details about data collection behaviors, some of which may indeed be crucial to users for making meaningful and informed decisions about whether to install an app on their computing devices. However, we need balance, as adding too much information contradicts the goal of privacy labels: to provide a succinct and readable description of the app behavior without needing to read the privacy policy.</p><p>Takeaway. Unfortunately, privacy labels appear to suffer from the transparency paradox <ref type="bibr">[62]</ref>: the inherent conflict between the transparency of textual meaning and the transparency of data-handling practices.</p><p>Improved NLP Models for Privacy Labels. 
Classification approaches <ref type="bibr">[46,</ref><ref type="bibr">78,</ref><ref type="bibr">82]</ref> offer much promise in helping to verify additional labeling of apps, like privacy labels. However, these approaches have several shortcomings, as researchers did not design them for this task. Foremost, the analysis process is on a per-segment basis, which is helpful in inferring practices that policies completely describe in individual segments. However, policies often describe a practice in parts spread across multiple segments, which automated frameworks do not correctly capture. This shortcoming is partly due to the models' design and training data (OPP-115 dataset <ref type="bibr">[78]</ref>), which researchers labeled on a per-segment basis. Additionally, given that services, including Google in Android <ref type="bibr">[42]</ref>, are adopting privacy labels more broadly, it may be time to update the models and training data to reflect privacy labels as the outcome. For example, the OPP-115 dataset could be re-annotated with privacy labels, forming the basis for new NLP models and more reliable tools to assist developers, researchers, and regulators.</p><p>Takeaway. The community needs new datasets that align with the taxonomies used by Apple and Google. We also need stronger NLP approaches that can consider cross-segment contexts in privacy policies and thus comprehensively extract the nuances of data collection practices highlighted within the free-form text.</p><p>Regulation and Legal Compliance. Apple requires developers to create a single privacy label for all regions and all users of an app. The App Store does not allow developers to explicitly comply with region-specific (GDPR, CCPA) and age-specific (COPPA) laws. Instead, it encourages developers to create a single, universal label that is either too extensive or too sparse; neither version accurately represents a user's experience. 
Further, in the absence of vetting from Apple, the responsibility for accuracy solely lies with app developers. The existing structure of the ecosystem helps the App Store appear to care about user privacy but absolves Apple of responsibility for inaccuracy and disinformation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Recommendations for Apple.</head><p>With recent studies highlighting that privacy labels are hard to understand <ref type="bibr">[55,</ref><ref type="bibr">81]</ref>, Apple could reconsider the taxonomy and descriptions of privacy labels. Additionally, Apple's lack of obvious vetting or regulation of the privacy labels may not incentivize the creation of accurate labels, particularly without any feedback to developers. Our imperfect framework can provide a first-level check for developers to consider more comprehensive arrays of labels for their apps. With Apple imposing a short embargo to review new apps before posting to the store, the platform could also incorporate some form of policy-based analysis into the review process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A REIMPLEMENTING AND TRAINING THE CLASSIFICATION FRAMEWORK</head><p>Hierarchical Structure. Our implementation of the framework closely follows that used for Polisis <ref type="bibr">[46]</ref>, which in turn relies on the OPP-115 corpus <ref type="bibr">[78]</ref>. It comprises a hierarchical, multi-level set of classifiers. The framework takes a paragraph-length segment of text as input and passes it to a Segment Classifier to first determine one or more high-level data practices addressed in the segment. Examples of these data practices include First Party Collection/Use, Data Security, and International/Specific Audiences. The framework further passes the segment through multiple Attribute Classifiers, each of which determines one or more attribute values relevant to the data practice determined by the Segment Classifier. For example, if the segment addresses First Party Collection/Use, the Does/Does Not Attribute Classifier determines if the policy claims to engage in data collection, the Identifiability Attribute Classifier determines if the data collection can be linked to the user, the Purpose Attribute Classifier determines the stated reason for data collection, and the Personal Information Type Attribute Classifier determines the data categories addressed in the segment. The framework classifies one segment of the policy at a time, and the data practices addressed in the entire policy are determined by collating results from all segments. An overview of this structure is provided in Figure <ref type="figure">3</ref>.</p><p>Training Dataset. The Online Privacy Policies (OPP-115) dataset, created by Wilson et al. <ref type="bibr">[78]</ref>, is an annotated dataset of 115 privacy policies. Each policy is divided into paragraph-length segments and manually annotated by law school students. 
Each segment was annotated at two levels: first, the annotator chose one or more high-level data practices that the segment addresses (e.g., First Party Collection/Use, Third Party Collection/Sharing); then, depending on the initial selections, they annotated segments with multiple attribute-value pairs (e.g., information_type: financial, purpose: advertising, etc.). Overall, the task covered 10 data practices and 20 associated attributes, with 138 distinct values across attributes. We developed one classifier to determine high-level data practices addressed in a segment, followed by a classifier each for the different attributes associated with the identified data practice.</p><p>Train-Test Split. For each attribute, we collected all segments that had a relevant annotation for the attribute in the OPP-115 dataset. We then performed a separate 80-20 train-test split for each collection of segments belonging to an attribute. In this aspect, we differed from Harkous et al. <ref type="bibr">[46]</ref>, who instead set 65 of the 115 policies aside for training and used relevant segments from these 65 policies to train all attribute classifiers, a choice that would have resulted in varied amounts of training data being used for each attribute.</p><p>Evaluation Metrics. The authors of PrivBERT <ref type="bibr">[70]</ref> presented an example of fine-tuning a segment classifier using the OPP-115 corpus, in which they manually tuned the hyperparameters used to train the model. We followed a similar approach to develop each classifier. Table <ref type="table">5</ref> presents the evaluation reports for each classifier's precision, recall, and F1 scores on an unseen test set. Following the practice highlighted by Harkous et al. <ref type="bibr">[46]</ref>, we evaluate each classifier's ability to detect both the presence and absence of an attribute in a given text segment. 
Since the OPP-115 corpus is old, we additionally manually evaluated classifier outputs on randomly sampled segments of Apple App Store policies, which we also report in Table <ref type="table">5</ref>. For each attribute, we randomly sampled 25 segments for which the classifier reported the presence of an attribute and also sampled 25 segments for which it reported the absence of an attribute. In this manner, we cover 50 segments each for 35 attributes across privacy policies. Table <ref type="table">5</ref> also compares the classification reports for our implementation of the Polisis CNN-based approach against the performance of the fine-tuned BERT-based models. Finally, to verify our mapping of classifier outputs to privacy label attributes, two researchers randomly sampled and manually evaluated the outputs for 25 instances of each label output (see Table <ref type="table">4</ref>).</p></div>
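<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate the Train-Test Split paragraph above, the per-attribute 80-20 split can be sketched as follows. This is a minimal sketch under stated assumptions rather than the exact code used for the study: the function name split_per_attribute, the (segment_id, attributes) input layout, and the per-attribute seeding scheme are hypothetical and chosen only for this example.</p>

```python
import random


def split_per_attribute(segments, train_frac=0.8, seed=0):
    """Return {attribute: (train_ids, test_ids)}, split separately per attribute.

    `segments` is a list of (segment_id, attributes) pairs, where `attributes`
    is the set of attribute names annotated on that segment. This layout is a
    hypothetical stand-in for the OPP-115 annotations.
    """
    # Collect, for each attribute, every segment that carries an annotation for it.
    by_attribute = {}
    for segment_id, attributes in segments:
        for attribute in attributes:
            by_attribute.setdefault(attribute, []).append(segment_id)

    splits = {}
    for attribute, ids in sorted(by_attribute.items()):
        # A separate, reproducible shuffle for each attribute's collection.
        rng = random.Random(f"{seed}:{attribute}")
        shuffled = list(ids)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        splits[attribute] = (shuffled[:cut], shuffled[cut:])
    return splits


# Toy input: 15 segments annotated with "purpose", 5 of them also with "identifiability".
segments = [(i, {"purpose"}) for i in range(10)]
segments += [(i, {"purpose", "identifiability"}) for i in range(10, 15)]
for attribute, (train, test) in split_per_attribute(segments).items():
    print(attribute, len(train), len(test))
```

<p>Because each attribute is split independently, every attribute classifier sees roughly 80% of its own annotated segments, and a given segment may fall in the training set for one attribute and the test set for another. A policy-level split, as used by Harkous et al., would instead fix the set of training policies first and let the amount of data per attribute vary.</p></div>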
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B CASE STUDIES OF PRIVACY POLICIES</head><p>To further illustrate the differences between policies and labels, we present a few interesting examples of popular apps and their privacy policies.</p><p>Subsplash. A platform that develops and integrates multiple church services, including donations, memberships, and services, Subsplash <ref type="bibr">[1]</ref> is used by 8,015 apps of local churches on the App Store (examples, <ref type="bibr">[3,</ref><ref type="bibr">13]</ref>).</p><p>All of the hosted apps link to the same privacy policy <ref type="bibr">[2]</ref> and share the same privacy label, i.e., a Data Not Linked to You label, which states that the app collects Usage Data for Analytics, and Diagnostics data for App Functionality. Recall that the Data Not Linked to You privacy type indicates that the data that is collected is aggregated or anonymized. Subsplash's policy states that they collect Contact Info, Financial Info, and Purchases, none of which are included in their privacy label. A snippet from their policy is provided below.</p><p>When you interact with Subsplash, we may collect personal information relevant to the situation, such as your name, mailing address, number, email address, and contact preferences; your credit card information and information about the Subsplash products you own, such as their serial numbers and date of purchase; and information relating to a support or service issue.</p><p>The apps additionally collect Location and Contacts, as stated in different segments of the policy, but these are not included in the apps' privacy label.</p><p>At the same time, there are some examples of the structure of the privacy policy that may lead Polisis classifiers to under- or over-represent some behaviors. One example is the treatment of anonymization of data. A single segment highlights anonymization but does not specify which data types are anonymized.</p><p>
Subsplash may use aggregated and anonymized forms of personal information for a variety of purposes, including, but not limited to, analyzing usage trends, fraud detection, and development of new Services.</p><p>As a result, Polisis is unable to match the data collection practice to anonymous linking and would classify most of the data collected by the app as linked rather than not linked. At the same time, since the policy is unclear on this point, it is difficult to fully know the data practices and if the labels are correct on this matter.</p><p>Another example involves the format of Subsplash's privacy policy which includes some data collection practices in varied visual formats, i.e., a table that includes different categories of data, examples of data types, and a column that states whether or not the stated data is collected. However, this table is implemented using &lt;div&gt; tags around each cell. The readability library interprets each of the cells as a separate paragraph, and makes it difficult to interpret the data presented here, potentially under-reporting some behavior as the segments are less complete.</p><p>ChowNow. ChowNow <ref type="bibr">[24]</ref> is an app platform used by 3,182 different apps of local restaurants to receive online orders for takeout and delivery (examples, <ref type="bibr">[22,</ref><ref type="bibr">23,</ref><ref type="bibr">65]</ref>).</p><p>All apps using the ChowNow platform link to the same privacy policy <ref type="bibr">[25]</ref> and apply the same privacy label. The label indicates that all data collection is not linked, indicating that the collected data is aggregated or anonymized. However, ChowNow's privacy policy states that they use contact information to manage user accounts and inform users about products through "electronic marketing communications". They also state that they use billing information, including card numbers, expiration date, security code, and billing address to process orders. 
Neither of these services can be provided in an anonymized manner, but the privacy labels lack a Data Linked to You category.</p><p>ChowNow's privacy policy also states that they share information with advertisers, but their label does not include a Data Used to Track You label. Additionally, the information that they share is mentioned as Other Information, making it difficult for the Polisis framework to identify the data categories shared with third-party services. The relevant snippet is provided below.</p><p>We share Other Information about your activity in connection with your use of the Services with third-party advertisers and remarketers for the purpose of tailoring, analyzing, managing, reporting, and optimizing advertising you see on the Platforms, the Websites, the Apps, and elsewhere.</p><p>ChowNow adds a content rating of 4+ to its apps on the App Store, making them accessible to children. Recall that developers choose a content rating according to Apple's guidelines <ref type="bibr">[10]</ref>; this value is not assigned by Apple. However, ChowNow's privacy policy absolves the company of the responsibility of dealing with data collected from children, instead placing the burden of preventing such data collection on parents, guardians, and the children themselves. The relevant snippet is provided below.</p><p>We do not knowingly collect personal information from children under the age of 13 through the Services. If you are under 13, please do not give us any personal information. We encourage parents and legal guardians to monitor their children's Internet usage and to help enforce our Privacy Policy by instructing their children to never provide us personal information without their permission. If you have reason to believe that a child under the age of 13 has provided personal information to us, please contact us, and we will endeavor to delete that information from our databases.</p><p>Walmart. 
A popular shopping and grocery delivery app with 6.6M user ratings, Walmart <ref type="bibr">[75]</ref> provides a large number of privacy labels on the App Store, which includes an extensive list of data categories across three privacy types, Data Used to Track You, Data Linked to You, and Data Not Collected.</p><p>Apple's description of sensitive information covers a list of example data types that are considered sensitive, providing a general overview of possible values. Walmart's privacy label does not state that it collects Sensitive Info, which users may expect from a shopping and grocery delivery app. However, Walmart states in their privacy policy that they collect (i) demographic data, (ii) background &amp; criminal information, and (iii) audio, visual and other sensory information, all of which Apple may consider sensitive information.</p><p>Credit Karma. A popular finance app with 5.4M user ratings on the App Store, Credit Karma <ref type="bibr">[29]</ref> does not use a Data Used to Track You label on the App Store despite stating in their policy that they share personal information with "other companies, lawyers, credit bureaus, agents, government agencies, and card associations in connection with issues related to fraud, credit, defaults, or debt collection".</p><p>We also observed that multiple privacy policies, including others previously mentioned in this section, ask users to refer to the policies of third party providers that they use within their services. An example snippet from Credit Karma's policy is provided below.</p><p>We may use third party API services, such as YouTube and Twilio, for certain product features. If you choose to use those features, you acknowledge and agree that you are also bound by the third party's privacy policy, such as Google's Privacy Policy for YouTube API services. 
You may manage your YouTube API data by visiting Google's security settings page at <ref type="url">https://security.google.com/settings/security/permissions</ref>. For more information about Twilio's privacy practices, please visit <ref type="url">https://www.twilio.com/legal/privacy</ref>.</p><p>This practice not only increases the burden of gathering additional information for users, but it also makes it difficult for Polisis to infer potentially missing information included in these additional external policies. As a result, the analysis of Credit Karma and similar apps may be a lower bound of the true privacy-related behavior.</p><p>Aldi. A popular grocery store in the United States, Aldi, has an app available on the App Store, which is ranked #59 in the Shopping category <ref type="bibr">[6]</ref>. The app offers a wide range of features, enabling users to conveniently order groceries, schedule deliveries or pickups, and make secure payments for their purchases. According to their privacy policy <ref type="bibr">[7]</ref>, Aldi collects (1) payment information (such as credit or debit card or EBT number, security code, expiration date and billing address); (2) shopping list and purchase history information. It is worth noting, however, that their privacy label on the App Store does not include corresponding entries highlighting their collection of Financial Info and Purchase History.</p><p>Axolochi. A popular application under the Games category, Axolochi is ranked #78 in the Trivia sub-category <ref type="bibr">[48]</ref>. The app's privacy policy <ref type="bibr">[47]</ref> states the automatic collection of various identifiers, such as a unique user ID, IP address, device IDs, hardware or operating system-based identifiers, and identifiers assigned to user accounts. Surprisingly, the app's privacy label on the App Store does not include the Identifiers data category.</p><p>Furthermore, Axolochi offers in-app purchases for users. 
According to their privacy policy, when users make in-app purchases, the app collects ZIP or postal codes along with "the amount of the transaction and records of purchases" made by the user. However, it is worth noting that the privacy label on the App Store does not feature corresponding entries for Physical Address or Purchase History. This discrepancy may limit the visibility and transparency of the app's data practices, potentially leaving users with incomplete information regarding the collection and usage of their personal data within the app.</p><p>WebMD. A widely known health-related service, WebMD hosts a flagship symptom checker app on the App Store <ref type="bibr">[77]</ref>. Their privacy policy <ref type="bibr">[76]</ref> explicitly mentions the collection of information from third-party vendors for targeted advertising purposes.</p><p>Our ad network vendors use technologies to collect information about your activities on the WebMD Sites and in our flagship WebMD App to provide you cookie-based targeted advertising on our WebMD Sites and on third party websites based upon your browsing activity and your interests.</p><p>Surprisingly, the app does not include a specific privacy type entry for Data Used to Track You in their privacy label. This absence in the privacy label highlights an instance of inconsistency in the declaration of data collection practices across disclosures.</p><p>Pregnancy Tracker. The pregnancy tracking app developed by Fitness Labs has concerning discrepancies between its privacy label on the App Store <ref type="bibr">[38]</ref> and its privacy policy <ref type="bibr">[17]</ref>. The app's privacy label only includes a Data Not Linked to You privacy type, mentioning the collection of Usage Data and Diagnostics data categories. However, the privacy policy reveals a much broader scope of data collection. 
The policy states that they may collect personal information such as name, address, email address, phone numbers, payment information (credit or debit card), and other demographic information that can identify individuals or enable contact. The relevant snippet is provided below.</p><p>We may collect information about you such as: personal information including, for example, your name; home or business address; e-mail address; telephone, wireless or fax number; short message service or text message address or other wireless device address; instant messaging address; credit or debit card or other payment information; demographic information or other information that may identify you as an individual or allow online or offline contact with you as an individual.</p><p>Unfortunately, the app's privacy label fails to include the Data Linked to You privacy type or indicate the collection of multiple data categories, including Identifiers, Financial Information, Contact Information, and Sensitive Information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C NETWORK TRAFFIC COLLECTION</head><p>We provide an overview of the analysis of 39 apps in Table <ref type="table">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D ADDITIONAL TABLES AND FIGURES</head><p>We include additional tables and figures here. Figure <ref type="figure">10</ref> provides an overview of our findings based on apps' popularity. Figure <ref type="figure">11</ref> presents our findings based on app genres. Table <ref type="table">7</ref> details overlaps and discrepancies in disclosures across data categories in privacy labels and policies.</p><p>Figure caption: The ratios of app store genres for each of the four Privacy Types. The denominator is the number of apps with the designated app store genre that have a privacy label.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proceedings on Privacy Enhancing Technologies 2024(4) Mir Masood Ali, David G. Balash, Monica Kodwani, Chris Kanich, and Adam J. Aviv</p></note>
		</body>
		</text>
</TEI>
