<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>User and Recommender Behavior Over Time: Contextualizing Activity, Effectiveness, Diversity, and Fairness in Book Recommendation</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/12/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10654844</idno>
					<idno type="doi">10.1145/3708319.3733710</idno>
					
					<author>Samira Vaez_Barenji</author><author>Sushobhan Parajuli</author><author>Michael D Ekstrand</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Data is an essential resource for studying recommender systems. While there has been significant work on improving and evaluating state-of-the-art models and measuring various properties of recommender system outputs, less attention has been given to the data itself, particularly how data has changed over time. Such documentation and analysis provide guidance and context for designing and evaluating recommender systems, particularly for evaluation designs making use of time (e.g., temporal splitting). In this paper, we present a temporal explanatory analysis of the UCSD Book Graph dataset scraped from Goodreads, a social reading and recommendation platform active since 2006. We measure the book interaction data using a set of activity, diversity, and fairness metrics; we then train a set of collaborative filtering algorithms on rolling training windows to observe how the same measures evolve over time in the recommendations. Additionally, we explore whether the introduction of algorithmic recommendations in 2011 was followed by observable changes in user or recommender system behavior.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Datasets are crucial to building, evaluating, and studying recommender systems and other information access systems (IAS). However, the documentation and descriptive analysis accompanying most datasets is slim, focused primarily on high-level descriptive statistics (e.g., the numbers of users, items, and interactions). As recommender systems research has paid increasing attention to time and sequence in training and evaluation data, both through techniques like sequential recommendation <ref type="bibr">[4]</ref> and the growing consensus towards global temporal splitting for evaluating even traditional recommender systems <ref type="bibr">[25]</ref>, it is important to understand not only the static characteristics of a dataset but also how those characteristics change over the history of interactions recorded in the dataset. Temporal understanding is also valuable for documenting and contextualizing changes in user or system behavior, both for traditional evaluation and to support research on evolving concerns regarding fairness, isolation, and other social impacts.</p><p>In this paper, we present a longitudinal explanatory analysis of the UCSD Book Graph <ref type="bibr">[39,</ref><ref type="bibr">40]</ref>, a dataset containing book metadata, reviews, ratings, and interactions from the Goodreads social reading and recommendation platform, spanning more than 10 years from early 2007 to 2017. This analysis describes how the data has evolved over time in its volume as well as genre diversity and gender balance in user interactions, and the effectiveness, diversity, and fairness of collaborative filtering models trained and evaluated at different points in its history.</p><p>For the first several years, Goodreads' recommendations were purely social (direct, personal recommendations between users in the social graph). 
In 2011, Goodreads introduced algorithmic recommendations, generating recommendations for users after they rated at least 20 items on a five-star scale <ref type="bibr">[1]</ref>. We further examine whether the introduction of this recommender system is associated with measurable changes in user or recommender behavior on the dimensions we analyze.</p><p>Describing data, user, and model changes over time is important for several reasons: i) guiding the design and evaluation of experimental setups or understanding the impact of design decisions (e.g., the effect of selecting different time points for splitting); ii) revealing how different types of bias evolve over time, both to directly understand fairness implications of the data as well as to provide temporal context for further fairness investigations; iii) assessing whether model performance and fairness measures remain consistent across time. While some prior studies have conducted exploratory analyses of recommendation datasets in different domains <ref type="bibr">[26,</ref><ref type="bibr">37]</ref>, there remains a gap in integrating longitudinal analyses of real-world recommendation datasets into the fairness literature. This paper addresses this gap through the following contributions:</p><p>(1) Conducting a longitudinal exploratory analysis of user activity on Goodreads to understand how activity, fairness, and diversity metrics of user interactions evolve over time.</p><p>(2) Comparing four collaborative filtering algorithms trained on successive time windows to examine how their behavior changes over time in relation to the same set of concerns.</p><p>(3) Investigating whether the introduction of a recommender system on Goodreads is associated with measurable changes in the system's behavior.</p><p>The temporal changes in data and evaluation metrics we document show that temporal recommender system evaluations are sensitive to the time window at which they are performed. 
Therefore, evaluating a system at a single point in time does not necessarily reflect its broader behavior. Some periods may reflect peak performance, while others reveal reduced effectiveness or increased disparities. Without a temporal perspective, evaluations risk overlooking how biases evolve as both the user base and content shift. Understanding system behavior over time is therefore essential for capturing its ability to meet users' needs and for identifying when and how bias emerges or is mitigated. It also supports more meaningful metric interpretation and the development of emerging best practices.</p><p>In the remainder of this paper, we survey related work on dataset description, bias, and longitudinal analysis (&#167;2), detail the data, algorithms, and experimental setup (&#167;3), present our results and observations (&#167;4), discuss our findings and their limitations (&#167;5), and conclude with implications for recommender systems research on fairness and other topics (&#167;6).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and Related Work</head><p>Describing Datasets. There is a growing call for detailed documentation of AI training and evaluation data, motivated by concerns about transparency, accountability, and the potential for unintended harms <ref type="bibr">[18,</ref><ref type="bibr">19,</ref><ref type="bibr">28,</ref><ref type="bibr">32]</ref>. Poorly documented datasets can obscure hidden biases, mislead users about appropriate applications, or lead to flawed evaluations <ref type="bibr">[30,</ref><ref type="bibr">31]</ref>. To address these issues, several documentation frameworks have been proposed. Gebru et al. <ref type="bibr">[18]</ref> introduce datasheets to describe a dataset's context, content, and recommended uses, similar to documentation practices in engineering. Pushkarna et al. <ref type="bibr">[32]</ref> propose data cards, which are short summaries that explain how the dataset was created and how it should be used. Nesvijevskaia <ref type="bibr">[28]</ref> presents the databook, a standardized framework for dynamically documenting algorithm design throughout data science projects.</p><p>In the context of recommender systems, the widely used MovieLens datasets have detailed historical documentation <ref type="bibr">[21]</ref>, although it was published only after the platform had been operating for over 15 years. Such documentation is rare for other datasets. Comprehensive data documentation has been identified as essential to enable reproducible results in AI research <ref type="bibr">[20]</ref>. In this paper, we examine the UCSD Book Graph scraped from Goodreads. The dataset is briefly summarized on its website<ref type="foot">1</ref> and in the authors' papers <ref type="bibr">[39,</ref><ref type="bibr">40]</ref>. 
<note place="foot" n="1"><ref type="url">https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html</ref></note>Ekstrand and Kluver <ref type="bibr">[13]</ref> augmented the Book Graph with publicly available book and author metadata to study fairness towards authors' gender identities; they provided a brief description of the distributions of author gender data. However, changes to the dataset's size, composition, and distributions over time have not yet been documented.</p><p>Fairness and Social Impacts. User-generated data on social media platforms and review sites often reflect societal biases. These biases can enter datasets through uneven participation, visibility, or representation, and may be further amplified during data processing <ref type="bibr">[30]</ref>. When such data are used to train recommender systems, they can reinforce existing patterns of popularity or exclusion and lead to biased system behavior <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">43]</ref>.</p><p>Research on fairness in recommender systems seeks to document, measure, and mitigate these biases <ref type="bibr">[8,</ref><ref type="bibr">12,</ref><ref type="bibr">41]</ref>. Ekstrand and Kluver <ref type="bibr">[13]</ref> used the Book Graph we examine in this paper to study author gender representation in user profiles and recommender system outputs. Although much of the recommender systems fairness research has focused on static evaluations <ref type="bibr">[7]</ref>, recent studies highlight the importance of understanding how fairness outcomes evolve as models are retrained on new user feedback <ref type="bibr">[15,</ref><ref type="bibr">16]</ref>. Much of this work, however, is prospective, examining how fairness changes as models are trained and retrained going forward from a static dataset. 
Our present study complements this direction of research by studying and evaluating how user interactions and model behavior evolve historically over time in the Book Graph.</p><p>Longitudinal Analysis. Recommender systems can have long-term effects on user behavior and content exposure <ref type="bibr">[17]</ref>. Over time, they may reinforce existing patterns, leading to reduced diversity and unequal visibility across groups <ref type="bibr">[14]</ref>. Some studies have shown that short-term engagement signals do not always align with long-term outcomes, like user satisfaction or return visits <ref type="bibr">[42]</ref>. Longitudinal studies, which track user interactions over extended periods, are essential to understanding long-term effects. Many longitudinal studies of recommender systems are prospective and examine how systems and user behavior evolve going forward. Some, however, are historical, such as the filter bubble analysis of Nguyen et al. <ref type="bibr">[29]</ref> and temporal evaluations of recommender system accuracy <ref type="bibr">[5,</ref><ref type="bibr">24]</ref>. In this study, we conduct a historical longitudinal analysis to understand how the underlying data in a real-world recommendation platform has changed over time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data and Methods</head><p>To document the evolution of book interaction data and its influence on collaborative filtering outputs, we measure the volume and diversity of user interactions across successive time windows, as well as the results of training and evaluating collaborative filtering models at different points in the dataset's temporal span. This section details the specific methods and data preparation used in our analysis. <ref type="foot">2</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>The UCSD Book Graph <ref type="bibr">[39,</ref><ref type="bibr">40]</ref> is a dataset scraped in 2018 from Goodreads, a social reading and recommendation platform. It spans over a decade of book metadata, reviews, ratings, and interactions of users with public profiles, covering the period from January 2007 to November 2017. Ekstrand and Kluver <ref type="bibr">[13]</ref> augmented the Book Graph with public author data from library sources and OpenLibrary through the PIReT/INERTIAL Book Data Tools<ref type="foot">foot_1</ref>, providing further information such as author gender identities. The original Book Graph includes interactions of approximately 876K users with 1.52M books. It spans various user activities and is organized into the following components:</p><p>&#8226; User actions, where a user adds a book to one of their "shelves";</p><p>&#8226; User ratings, provided on a 5-star scale;</p><p>&#8226; Book metadata, including title, genres, author names, and other attributes.</p><p>For our analysis, we count Goodreads works as items for user interaction and recommendation. A work is the fundamental unit of browsing and cataloging in Goodreads, and represents a particular literary work including its various editions, printings, translations, etc.; individual editions are represented as books. Using version 3.0 of the Book Data Tools, we aggregated and deduplicated these individual actions to form an interaction matrix representing each observed user-work pair with a single record. Each record includes the first and last timestamps of any interaction between the user and that work, whether through shelving or rating, along with the last rating the user provided for that work. 
For this study, we used the last interaction timestamp, as the first timestamps had some erroneous values that made them unreliable for analysis.</p><p>To support the analysis of genre diversity we used the book genre information supplied with the Book Graph (and extracted from users' genre-related shelf names) to derive a distribution over genres for each work. The book graph genre records report the number of times each genre shelf has been applied to a book by users. We normalized these counts to obtain &#119875; (&#119892; | &#119887;), the probability that book &#119887; belongs to genre &#119892;. The book graph also contains textual reviews, but we did not consider those in this analysis. Table <ref type="table">1</ref> presents a statistical summary of the dataset, including the number of unique users, books, and authors, as well as the average rating in the ratings data. The gender-specific columns report these same statistics separately for male and female authors based on available gender data. For all experiments and analyses, we excluded authors with ambiguous or missing gender labels (using the same logic as Ekstrand and Kluver <ref type="bibr">[13]</ref>).</p></div>
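<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the genre normalization described above, the following minimal Python sketch converts the raw genre-shelf counts of a single book into the distribution P(g | b). The function name <code>genre_distribution</code> and the example counts are hypothetical and are not taken from the Book Data Tools; returning an empty distribution for a book without genre shelves is likewise an assumption.</p>
<eg xml:space="preserve">
```python
def genre_distribution(shelf_counts):
    """Normalize the genre-shelf counts of one book into P(g | b).

    shelf_counts maps a genre name to the number of times users applied
    that genre shelf to the book (hypothetical input layout).
    """
    total = sum(shelf_counts.values())
    if total == 0:
        # Assumption: a book with no genre shelves gets no distribution.
        return {}
    return {genre: count / total for genre, count in shelf_counts.items()}


# Hypothetical counts for a single book, for illustration only.
counts = {"fantasy": 30, "young-adult": 15, "romance": 5}
dist = genre_distribution(counts)
print(dist)
```
</eg>
<p>Because each book-level distribution sums to one, books with many genre shelvings and books with few contribute equally when distributions are later aggregated per user or per recommendation list.</p></div>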
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Experimental Setup</head><p>For our experiments, we used the subset of the actions data ranging from January 1, 2007, to October 31, 2017. We begin by analyzing the interaction data within this period on a monthly basis to examine its temporal trends.</p><p>To generate and evaluate recommendations, we created successive 2-month windows of test data beginning on Jan. 1, 2009, training the recommender system on the preceding 2 years of interactions and generating top-100 recommendation lists for each user in the test window who also has interactions in the training window. This setup allows us to analyze how model performance evolves over time.</p><p>We used LensKit <ref type="bibr">[10]</ref> to generate and evaluate recommendations from collaborative filtering models trained on these successive training sets. Following our prior work using this dataset <ref type="bibr">[13,</ref><ref type="bibr">34]</ref>, we used three implicit-feedback collaborative filtering algorithms: item-based &#119896;-NN (ItemKNN <ref type="bibr">[9]</ref>), implicit-feedback ALS matrix factorization (ImplicitMF <ref type="bibr">[36]</ref>), and Bayesian Personalized Ranking (BPR <ref type="bibr">[35]</ref>); we omit user-based &#119896;-NN due to its computational cost. We also considered non-personalized most-popular recommendations (MostPop). This selection helps ensure consistency and alignment with our previous studies, extending that line of inquiry through a longitudinal lens.</p><p>[Figure 1 panels: (a) Unique entities; (b) Interaction records.]</p></div>
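<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rolling evaluation design can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the helper names <code>add_months</code> and <code>rolling_windows</code> are hypothetical, windows are treated as half-open date intervals aligned to the first of the month, and only complete 2-month test windows ending on or before October 31, 2017 are kept.</p>
<eg xml:space="preserve">
```python
from datetime import date


def add_months(first_of_month, months):
    """Shift a first-of-month date by a whole number of months."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def rolling_windows(first_test_start, data_end_exclusive,
                    test_months=2, train_months=24):
    """Return (train_start, test_start, test_end) triples.

    Training covers [train_start, test_start) and testing covers
    [test_start, test_end); only complete test windows are kept.
    """
    windows = []
    test_start = first_test_start
    test_end = add_months(test_start, test_months)
    while data_end_exclusive >= test_end:
        train_start = add_months(test_start, -train_months)
        windows.append((train_start, test_start, test_end))
        test_start = test_end
        test_end = add_months(test_start, test_months)
    return windows


# First test window starts Jan. 1, 2009; data runs through Oct. 31, 2017,
# expressed here as the exclusive bound Nov. 1, 2017.
windows = rolling_windows(date(2009, 1, 1), date(2017, 11, 1))
print(len(windows), windows[0], windows[-1])
```
</eg>
<p>Under these assumptions the sketch produces 53 test windows, the first of which is trained on interactions from January 2007 through December 2008. Each triple can then be used to filter interactions by timestamp before training and evaluating a model.</p></div>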
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>In this section, we present the results of our analysis on both the interaction data and the recommendation outputs using standard evaluation metrics. We examine how structural and fairness-related properties of the data, such as genre diversity and author gender representation, evolved over time in the data, and how these same properties are reflected in the generated recommendations. This analysis also allows us to assess whether any measurable changes are associated with the introduction of Goodreads' recommender system in 2011. In each figure, the recommender introduction is marked with a triangle on the &#119909;-axis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data</head><p>We first explore the temporal patterns in the user interaction data to understand how its structure, diversity, and fairness properties evolved over time, regardless of the recommender outputs. All measurements are on a monthly basis. Data size and growth over time. Figure <ref type="figure">1</ref> shows the monthly activity levels in the interaction data. Figure <ref type="figure">1a</ref> shows the monthly volume of unique users, books, and authors in the data. All three entities show increasing activity over time, with the sharpest increases occurring after 2011. Figure <ref type="figure">1b</ref> shows the number of records (user-book interactions) and the number of "first" interactions (the first time a user interacts with a particular book). Both lines almost completely overlap and steadily increase as expected, reflecting the platform's growth over the ten-year period. While there are several fluctuations in monthly user activity, the highest peak occurs around 2016.</p><p>Genre diversity. To quantify genre diversity in the interaction data, we derive user-specific genre distributions by aggregating the genre distributions of the books each user has interacted with during each month. For a given user, the probability assigned to each genre reflects the cumulative probability mass of that genre across all books in their interaction history, based on the book-level distributions &#119875; (&#119892; | &#119887;) defined in &#167;3. We then compute the Shannon entropy of this user-level distribution to measure the genre diversity in their interactions, with higher entropy indicating a more diverse selection. The final metric is reported as the average entropy across all users.</p><p>Figure <ref type="figure">2</ref> shows the average genre entropy across user interactions. 
Genre diversity rises initially and then holds relatively stable, with some fluctuations, until about 2012, after which it decreases. Overall, the pattern suggests that user interaction with genres remained relatively consistent over time, with a modest decline in later years.</p><p>Individual fairness and popularity bias. We use the Gini index to measure the distribution of interactions across individual items and authors in the data. Higher Gini values indicate greater inequality, i.e., a limited number of books or authors receiving most of the interactions.</p><p>Figure <ref type="figure">3</ref> shows that book-level inequality is consistently high and relatively stable, especially after 2012, with author-level inequality even higher. These results suggest persistent bias in user engagement toward popular books and authors.</p><p>Author gender representation. Finally, we examine author gender representation in user interactions by computing the monthly proportion of books interacted with that are written by female authors (only considering books for which the first author's gender identity is known).</p><p>As shown in Figure <ref type="figure">4</ref>, the proportion of female authors rises sharply from around 30% in the early years to just over 50% by 2011, after which it remains relatively stable through 2017 with a small jump around 2014. This is consistent with the analysis of Thelwall and Kousha <ref type="bibr">[37]</ref> suggesting that Goodreads exhibits a female-author bias, which in turn aligns with the hypothesis that female authors were more commercially successful than male authors due to being more frequently rated online <ref type="bibr">[38]</ref>, though further evidence would be needed to confirm this.</p></div>
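<div xmlns="http://www.tei-c.org/ns/1.0"><p>The data-side measures described in this subsection can be sketched in a few lines of Python. This is an illustrative sketch, not the study's implementation: the function names are hypothetical, the entropy uses a base-2 logarithm (the text does not state the base), and the Gini index uses the standard sorted-values formula.</p>
<eg xml:space="preserve">
```python
import math


def user_genre_distribution(book_dists):
    """Sum book-level P(g | b) over a user's books and renormalize."""
    mass = {}
    for dist in book_dists:
        for genre, p in dist.items():
            mass[genre] = mass.get(genre, 0.0) + p
    total = sum(mass.values())
    if total == 0:
        return {}
    return {genre: m / total for genre, m in mass.items()}


def shannon_entropy(dist):
    """Shannon entropy in bits (base 2 is an assumption)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def gini(values):
    """Gini index of non-negative values such as interaction counts."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * ranked_sum / (n * total) - (n + 1) / n


# Hypothetical user with two books in one month.
profile = user_genre_distribution([
    {"fantasy": 1.0},
    {"fantasy": 0.5, "romance": 0.5},
])
print(profile, shannon_entropy(profile))

# Hypothetical monthly interaction counts for four books.
print(gini([0, 0, 0, 10]))
```
</eg>
<p>The monthly genre diversity value would then be the mean of <code>shannon_entropy</code> over all users active in that month, and the monthly Gini would be computed over per-book or per-author interaction counts.</p></div>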
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Recommendations</head><p>We now turn to the evaluation of recommendations generated by our selected collaborative filters for each time window. All measures are computed on top-100 recommendation lists for all users in a 2-month test window. We evaluate these recommendation lists using the same set of metrics applied in Section 4.1, alongside ranking performance measures and rank-biased versions of entropy and Gini.</p><p>To account for user attention bias toward higher-ranked items in a recommendation list, we apply a position-based exposure weighting model to entropy and accuracy metrics in recommendations. Specifically, we use the geometric cascade browsing model <ref type="bibr">[27,</ref><ref type="bibr">33]</ref>, defining the exposure weight &#119908; &#119894; for position &#119894; in the ranked list as:</p><p><formula notation="tex">w_i = \gamma^{i-1}</formula> (1)</p><p>where 0 &lt; &#120574; &lt; 1 is a tunable patience parameter, which we set to LensKit's default of 0.85 in our experiments.</p><p>Coverage and accuracy. Figure <ref type="figure">5</ref> shows the number of unique items and authors that appear in the top-100 recommendation lists for each algorithm over time. ItemKNN and BPR recommend substantially more unique items and authors compared to MostPop and ImplicitMF, which is a consistent pattern over time. This suggests that the former algorithms provide broader coverage across both items and authors.</p><p>We also report RBP <ref type="bibr">[27]</ref>, NDCG <ref type="bibr">[22]</ref>, and Mean Reciprocal Rank as ranking performance metrics over time (Figures <ref type="figure">6a</ref>, <ref type="figure">6b</ref>, <ref type="figure">6c</ref>). ImplicitMF achieves the highest performance scores in all three metrics, followed by ItemKNN, BPR, and MostPop. The ordering of systems is mostly consistent over time, but there are noticeable fluctuations in performance. 
The performance gap between the best-performing models (ImplicitMF and ItemKNN) and the worst-performing models (MostPop and BPR) also appears to widen over time. The low scores of MostPop suggest that personalized models have performed better in capturing meaningful variations in user taste.</p><p>Genre diversity. For each recommendation list, we compute a weighted probability distribution over genres. The probability of genre &#119892; in a user's list &#119871; is defined as:</p><p><formula notation="tex">P(g \mid L) = \frac{\sum_{i} w_i \, P(g \mid L(i))}{\sum_{i} w_i}</formula></p><p>where &#119908; &#119894; represents the exposure weight of position &#119894; and &#119875; (&#119892; | &#119871;(&#119894;)) is the probability that the book in position &#119894; belongs to genre &#119892; (see Section 3.1). We then compute entropy over the resulting genre distribution to obtain the genre entropy of a list, and average across lists to obtain an overall diversity measure of mean rank-biased genre entropy. As with profile entropy, larger entropy values indicate more diverse recommendation lists (i.e., more uniform distribution across genres).</p><p>Figure <ref type="figure">7</ref> shows the mean genre entropy across recommendation lists. Perhaps surprisingly, MostPop yields the highest entropy. We see at least two possible reasons for this. First, since MostPop recommends a fixed set of widely popular books, these items are more likely to have user-generated genre tags, possibly resulting in higher measured entropy. In contrast, the other algorithms recommend a broader set of less popular items (as seen in Figure <ref type="figure">5</ref>), which may include books with sparse or missing genre data that are excluded from the entropy calculation. Second, the most popular books themselves may be more diverse in genre, for example, including one highly popular book from each genre category.</p><p>BPR consistently exhibits the lowest genre diversity, likely due to its narrower focus on user-specific patterns in recommending items. 
Overall, all algorithms show relatively stable or slightly declining entropy over time, with more variation in MostPop. This suggests that personalized recommendation models are less influenced by changes in the genre distribution of the most popular books in any particular time period.</p><p>Individual fairness and popularity bias. We compute the Gini index (Gini@k) over the exposure weights &#119908; &#119894; , which estimate the exposure assigned to an item or author at position &#119894; in the ranked recommendation lists. This effectively measures the allocation of "user attention" as a resource, accounting for the fact that users are likely to give more of their attention to books at the top of a recommendation list.</p><p>Figure <ref type="figure">8</ref> shows the Gini index for author-level (left) and item-level (right) exposure. As expected, MostPop displays the highest inequality in exposure, with both item- and author-level Gini values consistently close to 1.0. ItemKNN and BPR produce more balanced exposure, especially in earlier years. The ordering of ItemKNN and ImplicitMF we observe here differs from that found in movie recommendation by Ekstrand et al. <ref type="bibr">[11]</ref>, indicating that models' relative fairness properties may differ between domains or datasets.</p><p>All models show a gradual increase in Gini values over time and maintain relatively high levels of exposure inequality. This indicates a growing exposure bias toward a smaller set of items and authors, even in more personalized algorithms. The high Gini values are also influenced by the domain of items: for consistent comparison across both models and time windows, we computed Gini over all items in the dataset, with the unrecommended items receiving zero exposure.</p><p>Author gender representation. Figure <ref type="figure">9</ref> shows the proportion of books in the recommendations that have female authors. 
ItemKNN shows the highest and most steady increase in female author representation over time. ImplicitMF also increases gradually, while BPR stays more stable with a smaller rise. MostPop starts with the lowest proportion and increases slowly until a sharp jump near the end. All models show upward trends, but the speed and pattern of growth are different. The overall trend reflects the underlying patterns seen in the interaction data discussed in Section 4.1.</p></div>
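<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rank-biased measures used in this subsection can be sketched as follows. This is a hedged illustration, not the LensKit implementation or the authors' code: it assumes the geometric cascade weight takes the common form of patience raised to the power (i - 1) for position i, that books without genre data are skipped when forming a list's genre distribution, and that all function names are hypothetical.</p>
<eg xml:space="preserve">
```python
def exposure_weights(n, patience=0.85):
    """Geometric cascade weights patience**(i - 1) for positions 1..n."""
    return [patience ** i for i in range(n)]


def list_genre_distribution(ranked_genre_dists, patience=0.85):
    """Exposure-weighted genre distribution of one ranked list.

    ranked_genre_dists holds one dict (genre to P(g | b)) per position;
    an empty dict marks a book without genre data, which is skipped.
    """
    weights = exposure_weights(len(ranked_genre_dists), patience)
    mass = {}
    for w, dist in zip(weights, ranked_genre_dists):
        for genre, p in dist.items():
            mass[genre] = mass.get(genre, 0.0) + w * p
    total = sum(mass.values())
    if total == 0:
        return {}
    return {genre: m / total for genre, m in mass.items()}


def item_exposure(ranked_lists, patience=0.85):
    """Total exposure each item receives across all users' lists."""
    exposure = {}
    for items in ranked_lists:
        weights = exposure_weights(len(items), patience)
        for w, item in zip(weights, items):
            exposure[item] = exposure.get(item, 0.0) + w
    return exposure


def exposure_gini(exposure, catalog_size):
    """Gini index of exposure over the whole catalog (zeros included)."""
    xs = sorted(exposure.values())
    xs = [0.0] * (catalog_size - len(xs)) + xs
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * ranked_sum / (n * total) - (n + 1) / n


# Hypothetical two-user example with a four-item catalog.
lists = [["x", "y"], ["x", "z"]]
exposure = item_exposure(lists, patience=0.5)
print(exposure, exposure_gini(exposure, catalog_size=4))
```
</eg>
<p>Padding with zeros for unrecommended items mirrors the choice described above of computing Gini over all items in the dataset, which is one reason the reported values sit close to 1.0 for algorithms that recommend only a small part of the catalog.</p></div>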
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Effects of Goodreads Recommendation</head><p>To evaluate whether the introduction of Goodreads' recommender system in 2011 is associated with measurable changes in system behavior, we analyzed temporal trends in both the interaction data and the outputs of our trained models. Overall, we did not observe clear or immediate changes in diversity, fairness, or effectiveness metrics around 2011 that can be directly associated with the launch of the recommender system. However, there are some patterns that are worth further investigation. One such pattern is a gradual decrease in genre entropy in the interaction data after 2012, which may suggest an increasing influence of recommendation or narrowing user exploration. While the models differ in how they reflect underlying data trends, ImplicitMF and ItemKNN show signs of stabilization in both genre entropy and exposure Gini metrics following the system's introduction. We also did not see identifiable changes in recommender effectiveness associated with the introduction of algorithmic recommendation; increased effectiveness on offline evaluation metrics may be a sign of user interaction behavior becoming more predictable (due to the influence of algorithmic recommendations), but we do not see clear evidence of such an effect.</p><p>Overall, the observed trends appear to reflect broader, ongoing developments in the system, such as the accumulation of user interaction history, growth of the item catalog, and evolving user behavior, rather than the effects of a single intervention. Existing-system bias (bias in item exposure due to the recommender system being active while data is collected) is a significant problem in offline recommender system evaluation <ref type="bibr">[6,</ref><ref type="bibr">7,</ref><ref type="bibr">23]</ref>. 
Goodreads' lack of algorithmic recommendations for its first 4-5 years of operation provides an opportunity to detect potential existing-system biases. Our analyses were not able to detect specific problems attributable to the addition of a recommender system in this particular dataset. Algorithmic recommendations have significantly less prominence in the Goodreads user experience than they do in many other platforms, so they may have less of an impact on user behavior.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>We observed different trends in diversity and fairness measures across both the interaction data and the generated recommendations. We were particularly interested in understanding how aspects of the data that affect bias, fairness, diversity, and recommender performance changed over the history of the platform.</p><p>While user activity and the item corpus have increased over time, the popularity concentration in the interaction data (as measured by the Gini index) remains relatively stable, especially after 2012. This suggests that increased user engagement is not solely concentrated on the most popular books, but that users have had a relatively steady spread of interest in both more- and less-popular books over time. Moreover, although the interaction Gini stabilized after 2012, the exposure Gini for models like BPR and ImplicitMF continues to rise. This raises the question of whether there are other patterns in user interactions that lead to increased concentration in model outputs despite stable patterns in the underlying data. Future research should investigate this possibility to better understand the drivers of popularity bias in collaborative filtering outputs.</p><p>For genre diversity, the models largely mirror the gradual changes seen in the interaction data. However, there is a slight difference around 2012: while genre entropy in user interactions peaks around that time, most algorithms show a decline. One possible explanation is that changes in user behavior took time to influence model training data, leading to a delayed rise in entropy in the recommendations. 
Most of the models begin showing upward trends after this point, which may reflect a lagged response to shifts in genre diversity in the data.</p><p>Our findings, in general, and particularly the changing patterns in recommender system effectiveness metrics, indicate that evaluation outcomes depend on the choice of training and test time windows. The selection of time windows is therefore important for the comparative evaluation of recommender systems, and results may be time-specific and potentially difficult to generalize. While we saw only a few inversions of recommendation model performance rankings, we did observe a shift in the performance gap between models as time progresses in the dataset. This will influence the effect sizes reported in experiments on different temporal test windows. Temporal, data-centered audits of recommender systems can uncover evolving patterns in performance, fairness, and diversity that may not be visible in static, snapshot evaluations.</p><p>Our analysis does have several limitations. Some of our results are affected by sparse or incomplete metadata, such as missing genre tags or author information, which limit the accuracy of certain measures. The UCSD Book Graph also only contains public profiles that were still active in 2017 or 2018, so data from private or deleted accounts is not available. We also do not know what specific recommendation techniques Goodreads deployed in 2011 (or later), so we cannot model the performance changes in their recommendation models or directly test for their influence on the data. Finally, our results are observational and correlational, so we cannot establish causality between any of our metrics or between the introduction of the Goodreads recommender system and any user behavior changes. 
Nevertheless, we believe such descriptive and exploratory analyses are valuable for understanding the contours of the data that the recommender systems and user modeling research communities use to train and evaluate models and to study the behavior and social impacts of humans and algorithms on social data. Our findings also point to the need for further exploration of underlying drivers such as user engagement, metadata quality, and patterns that our metrics could not detect but that nonetheless influence recommendation models over time.</p></div>
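<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the two concentration and diversity measures discussed in this section concrete, the following is a minimal sketch of a Gini index over per-item interaction counts and a Shannon entropy over genre labels. It is illustrative only and is not taken from the released analysis code: the function names <code>gini</code> and <code>entropy</code>, the toy inputs, and the choice of base-2 logarithms are all assumptions.</p>
<eg><![CDATA[
```python
import math
from collections import Counter


def gini(counts):
    """Gini index of non-negative counts: 0 is perfectly even, values near 1 are concentrated."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending-sorted values (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of the labels."""
    freq = Counter(labels)
    total = sum(freq.values())
    return -sum((c / total) * math.log2(c / total) for c in freq.values())


# Hypothetical toy data: interactions per book, and one genre label per interaction.
interactions_per_book = [0, 0, 0, 4]
genres = ["fantasy", "fantasy", "romance", "romance"]

print(gini(interactions_per_book))
print(entropy(genres))
```
]]></eg>
<p>Under the same assumptions, applying <code>gini</code> to the number of times each item appears in a model's recommendation lists gives a rough stand-in for the exposure concentration described above, and computing both measures separately for each time window yields trend lines that can be compared between interactions and recommendations.</p></div>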
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>Documenting and understanding datasets is important for designing experiments and contextualizing analyses and findings. The evolution of recommender systems datasets over time is often overlooked, but it is crucial for informing time-based experimental design decisions and provides context both for interpreting experimental results and for understanding historical changes when performing simulation studies of potential future dynamics <ref type="bibr">[6,</ref><ref type="bibr">15,</ref><ref type="bibr">16]</ref>.</p><p>We provided such an analysis for the UCSD Book Graph, documenting both the evolution of the dataset itself and the behavior of collaborative filtering models retrained at different time points. We found increasing representation of female authors, decreasing genre diversity, and slightly increasing concentration of interactions on popular users and items over the history of the dataset. We did not observe any clear impacts of the introduction of algorithmic recommendations on the behaviors of either users or collaborative filtering recommender models.</p><p>Our analysis lays the retrospective groundwork for future work on the temporal dynamics of recommender system behavior, including both traditional accuracy concerns and social impact considerations such as diversity and fairness. It also highlights the potential impact of the choice of splitting time on recommender experimental results. We encourage groups preparing and releasing new datasets for recommender systems research to include temporal analyses along with the descriptive statistics they publish with their datasets.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>The code is available at https://zenodo.org/records/15333266</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>https://bookdata.inertial.science</p></note>
		</body>
		</text>
</TEI>
