<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>From Newborn to Impact: Bias-Aware Citation Prediction</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>03/01/2026</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10668115</idno>
					<idno type="doi"></idno>
					
					<author>M Lu</author><author>M Wu</author><author>J Xu</author><author>W Li</author><author>F Liu</author><author>Y Ding</author><author>Y Sun</author><author>J Lu</author><author>Y Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[As a key to accessing research impact, citation dynamics underpins research evaluation, scholarly recommendation, and the study of knowledge diffusion. Citation prediction is particularly critical for newborn papers, where early assessment must be performed without citation signals and under highly long-tailed distributions. We identify two key research gaps: (i) insufficient modeling of implicit factors of scientific impact, leading to reliance on coarse proxies; and (ii) a lack of bias-aware learning that can deliver stable predictions on lowly cited papers. We address these gaps by proposing a Bias-Aware Citation Prediction Framework, which combines multi-agent feature extraction with robust graph representation learning. First, a multi-agent × graph co-learning module derives fine-grained, interpretable signals, such as reproducibility, collaboration network, and text quality, from metadata and external resources, and fuses them with heterogeneous-network embeddings to provide rich supervision even in the absence of early citation signals. Second, we incorporate a set of robust mechanisms: a two-stage forward process that routes explicit factors through an intermediate exposure estimate, GroupDRO to optimize worst-case group risk across environments, and a regularization head that performs what-if analyses on controllable factors under monotonicity and smoothness constraints. Comprehensive experiments on two real-world datasets demonstrate the effectiveness of our proposed model. Specifically, our model achieves around a 13% reduction in error metrics (MALE and RMSLE) and a notable 5.5% improvement in the ranking metric (NDCG) over the baseline methods. The code can be found at https://github.com/Maekfei/BA-Cite.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Citation dynamics, as a key to accessing research impact, is crucial for research evaluation, scholarly recommendation, and the study of knowledge diffusion. Citation prediction has thus become a particularly significant task for identifying scientific innovation from newborn papers. However, citation data are highly biased. Implicit factors such as reproducibility are not well reflected in modeling, while overly correlated explicit factors lead to shortcuts, e.g., top venue = high citations <ref type="bibr">[14,</ref><ref type="bibr">26]</ref>. Thus, accurately and robustly predicting citations in such a cold-start scenario remains a pressing challenge. As citation behaviors unfold across large-scale web-based scholarly platforms, addressing this challenge also contributes to robust and generalizable web mining of academic content and information diffusion.</p><p>Previous work on citation prediction falls largely into two categories. (1) Early-cascade models use initial citation dynamics as predictors. DeepCas <ref type="bibr">[15]</ref> treats the citation cascade of a paper as a sequence, generating diffusion pathways through random walks and modeling them with BiGRU and attention to capture early spread signals. SI-HDGNN <ref type="bibr">[35]</ref> further embeds these processes in heterogeneous dynamic academic graphs, combining multi-relational structures with early citation sequences to forecast long-term scientific impact. However, these models require years of waiting for early citations to accumulate, making them ineffective in the cold-start stage when timely prediction is most needed. (2) Metadata-driven models leverage the author, institution, venue, and related descriptors. For example, HINTS <ref type="bibr">[12]</ref> models dynamic heterogeneous information networks via graph neural networks (GNNs) to capture temporal evolution in citation time series. 
Cluster-Aware Text-Enhanced HGNN <ref type="bibr">[36]</ref> integrates signals with cluster-level and textual features to improve prediction. However, citation distributions are highly long-tailed <ref type="bibr">[12,</ref><ref type="bibr">20]</ref>, and these models perform poorly on lowly cited papers. Moreover, they tend to overfit correlations with explicit factors while overlooking deeper implicit factors that may have comparably higher potential to influence citation behaviors in real-world scientific activities, leading to significant performance degradation under distribution shifts.</p><p>Despite some promising solutions, existing methods cannot achieve accurate and robust citation prediction for particularly lowly cited papers due to the following two research gaps: Gap 1: Insufficient attention to implicit factors, resulting in reliance on strong correlations with explicit factors and poor generalization across domains. Current models rely heavily on factors such as author reputation and venue prestige, which correlate with citations but cannot comprehensively cover all decisive determinants of citation behaviors. Lacking fine-grained representations of implicit factors such as topic hotness, reproducibility, and collaboration structure, models would default to superficial correlations, which ultimately degrade performance when the data distribution shifts. Gap 2: A lack of bias-aware models that remain robust on lowly cited papers. In this work, bias refers to the group-level prediction disparity induced jointly by the long-tailed citation distribution, feature sparsity in cold-start settings, and the empirical risk minimization (ERM) objective that minimizes average risk <ref type="bibr">[7,</ref><ref type="bibr">17]</ref>. This structural bias causes models to systematically underperform on low-citation subgroups. 
Most existing methods overlook the issue, since minimizing overall error inherently biases training toward highly cited papers that dominate the loss. As a result, lowly cited papers remain underrepresented and their citation dynamics are poorly predicted, undermining the early evaluation of underrepresented elements in the research community, e.g., early career researchers and emerging research directions. Based on these gaps, we pose our core question:</p><p>How can we design bias-aware models that deliver stable predictions on low-citation papers while revealing the underlying drivers of scientific impact?</p><p>To bridge these two significant gaps, we propose a Bias-Aware Citation Prediction Framework, termed BA-Cite, which combines fine-grained feature extraction with robust GNN learning. Specifically, to tackle Gap 1, we design a multi-source informed graph learning framework that jointly models agent-derived implicit factors and heterogeneous graph structures. Six agents automatically extract fine-grained features, including reproducibility, text quality, collaboration network, topical hotness, venue prestige and role-aware author reputation, from metadata and external resources. Instead of serving as isolated inputs, these signals are fused with graph-based paper embeddings and propagated through a two-stage predictor, allowing the model to integrate implicit factors with graph context. In this way, the framework reduces reliance on explicit proxies and achieves stronger generalization. To overcome Gap 2, we center learning on a two-stage predictor: the model first estimates an intermediate exposure variable from graph embeddings and agent-derived features, and then predicts citations based on both the exposure estimate and the remaining features, while excluding superficial correlates from direct inputs. 
Robust learning objectives are attached to the second stage's outputs: (i) Group Distributionally Robust Optimization (GroupDRO) minimizes the worst-group risk on the prediction loss, countering head-dominated bias; and (ii) a regularization module performs what-if interventions on controllable factors, recomputing exposure and predictions while enforcing monotonicity and smoothness constraints. These objectives are jointly optimized and back-propagated through both stages and the graph encoder, shaping the pipeline toward bias-resistant and consistent behavior. We conduct systematic evaluations on two large-scale academic datasets, AMiner and OpenAlex. Experimental results demonstrate that, compared with the state-of-the-art baselines, our framework achieves around a 13% reduction in error metrics (MALE and RMSLE) and a notable 5.5% improvement in the ranking metric (NDCG).</p><p>The main contributions of this paper are highlighted as follows:</p><p>(1) Empirical finding. We identify that prior citation predictors overfit explicit signals, leading to degradation on long-tail papers and under distribution shifts. (2) Multi-source Informed Graph Learning Framework. We propose a collaborative framework that combines agent-based fine-grained implicit feature extraction with graph representation learning, enabling robust prediction even without early citation information. (3) Bias-Aware GNN Learning. We design a robust mechanism that integrates Stage-A/Stage-B modeling, GroupDRO, and a regularization module, thereby suppressing superficial correlations, highlighting true correlations, and enhancing both accuracy and robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Research on scientific impact prediction and citation-network modeling falls into three directions: early-cascade modeling, metadata-driven graph learning, and dynamic or contrastive graph approaches.</p><p>Early-cascade models frame citation accumulation as a diffusion process. DeepCas <ref type="bibr">[15]</ref> captures early propagation via random-walk citation paths and BiGRU attention, while SI-HDGNN <ref type="bibr">[35]</ref> embeds such cascades into heterogeneous academic graphs. Despite their effectiveness with sufficient citations, these models perform poorly in cold-start settings where early prediction is crucial.</p><p>Metadata-driven graph models leverage structural and textual metadata such as authors, venues, and topics. HINTS <ref type="bibr">[12]</ref> encodes dynamic heterogeneous networks with R-GCN and GRU, while CATE-HGNN <ref type="bibr">[36]</ref> and HLM-Cite <ref type="bibr">[6]</ref> enrich semantics via clustering and pretrained language models. However, they remain sensitive to long-tail imbalance and often overfit superficial correlations.</p><p>Dynamic and contrastive frameworks model evolving citation contexts more explicitly. H2CGL <ref type="bibr">[8]</ref> builds hierarchical heterogeneous graphs with citation-aware GIN, relation-aware GAT, and contrastive learning to integrate structural and temporal cues. Related work such as TGN-TRec <ref type="bibr">[27]</ref>, NETEVOLVE <ref type="bibr">[19]</ref>, and ResearchTown <ref type="bibr">[39]</ref> explores dynamic or agent-based paradigms emphasizing interpretability and network evolution. 
More recently, From Words to Worth <ref type="bibr">[41]</ref> proposes newborn article impact prediction, showing that fine-tuned LLMs can infer normalized impact (TNCSISP) from titles and abstracts alone, achieving competitive performance without citation history or external metadata.</p><p>LLM-based semantic feature extraction. Recent studies leverage large language models (LLMs) to extract high-level semantic features <ref type="bibr">[3,</ref><ref type="bibr">16,</ref><ref type="bibr">33,</ref><ref type="bibr">34,</ref><ref type="bibr">38]</ref> from titles, abstracts, and metadata for scholarly impact prediction. LLMs provide contextualized representations and latent signals such as novelty or interdisciplinarity, which are incorporated into downstream graph models as auxiliary embeddings <ref type="bibr">[2,</ref><ref type="bibr">11,</ref><ref type="bibr">37]</ref>, especially under cold-start settings. However, these features are typically treated as static enhancements, without explicitly modeling their interaction with dynamic graph structures or citation bias. Despite these advances, few studies examine the mechanisms and bias dynamics driving citation disparities. We address this gap with a unified multi-source graph learning framework, enabling robust citation prediction under cold-start and distribution shifts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>In this section, we present our framework, BA-Cite, illustrated in Fig. <ref type="figure">1</ref>. It consists of two parts: (i) agent-based fine-grained feature extraction, where multiple agents derive implicit factors from metadata and external resources; and (ii) graph learning on dynamic heterogeneous networks, where a GNN encoder integrates these signals through a two-stage forward process with bias- and robustness-oriented objectives. In the following parts, we first outline the motivation, then describe the functions of individual agents, and finally detail the three GNN modules that incorporate agent-derived features into bias-aware representation learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Motivation</head><p>3.1.1 Addressing Bias in Long-Tailed Citation Prediction. In citation prediction, the inherently long-tailed distribution creates a persistent imbalance: a few highly cited papers dominate the learning process, while the majority receive limited attention. This skew causes models to produce inaccurate predictions, often overestimating lowly cited papers and failing to generalize under distribution shifts, such as when predicting citations for evaluating new venues or identifying emerging research topics. Such instability weakens predictive reliability and limits the model's ability to capture real scientific impact across domains and time; for example, it can lead to biased predictions that underestimate the contributions of early-career researchers and non-mainstream research. Addressing bias under long-tailed and shifting distributions is therefore essential not only for improving prediction robustness but also for promoting fairness and inclusiveness in scientific assessment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2">Capturing Core Factors Driving Citation Dynamics with Agents and GNN(s).</head><p>Existing citation prediction models often rely on handcrafted metadata or early citation signals, which are unavailable for newborn papers in cold-start settings and difficult to model manually. Agents, by contrast, can autonomously mine and reason over multi-source information (such as reproducibility, topic hotness, and collaboration network) to derive fine-grained, high-level semantic factors that are otherwise implicit or sparsely encoded in metadata. However, semantic cues alone cannot capture the structural and temporal dependencies that shape citation dynamics. While agents excel at extracting implicit knowledge, GNNs effectively model heterogeneous academic networks. Combining them enables semantic knowledge to be structurally grounded, ensuring both predictive accuracy and robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Fine-grained feature extraction</head><p>To enrich paper representations, we design six domain-specific agents that automatically derive implicit features from metadata and external resources. Each agent focuses on a distinct dimension of citation dynamics, leveraging domain heuristics and multi-source knowledge to uncover fine-grained semantic cues that are difficult to encode manually. Together, these agents capture complementary aspects such as author reputation, venue prestige, collaboration patterns, reproducibility, topic hotness, and text quality, thereby expanding the feature space and enabling more balanced and generalizable representations. Given a paper $i$, the outputs are concatenated into a unified feature vector $\mathbf{f}_i = [A, V, R, C, H, Q]$, where each component corresponds to one agent's extracted factor. Below we describe the reason for choosing these factors and how each agent extracts them from metadata and external sources.</p><p>Role-aware Author Reputation (A). Readers tend to cite works by reputable scholars or rising researchers, yet an author's position within the byline also matters: first authors indicate primary contribution, last authors reflect senior leadership, while middle authors exert weaker influence <ref type="bibr">[21]</ref>. Extraction process. We assess author reputation by partitioning the author list into three roles: first author, last author, and other coauthors. For each role, we retrieve metadata such as institutional affiliation, publication count, and total citations; institutional prestige is further enriched via external sources (e.g., Wikipedia). These signals are aggregated into a continuous score on a 1-5 scale. During training, the scores of different roles are assigned different weights to reflect their varying influence on citation outcomes.</p><p>Venue Prestige (V). 
Prestigious venues act as credibility signals, making their publications more visible and trusted, hence more likely to be cited <ref type="bibr">[1]</ref>. Extraction process. We assess venue prestige by matching the venue name against external rankings (China Computer Federation Recommended Rankings (CCF) and Computing Research and Education Association of Australasia (CORE)) using both exact and fuzzy matching. The agent outputs a score on a 1-5 scale.</p></div>
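<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the exact-then-fuzzy venue matching described above, the following is a minimal sketch rather than the agent's actual implementation. The ranking table, its scores, the fallback score, and the similarity cutoff are all hypothetical; the mapping from CCF/CORE tiers to the 1-5 scale is not specified in the text.</p><p><![CDATA[
```python
import difflib

# Hypothetical ranking table: venue names and 1-5 scores are illustrative only.
VENUE_SCORES = {
    "the web conference": 5,
    "international conference on machine learning": 5,
    "scientometrics": 3,
}
DEFAULT_SCORE = 1  # assumed fallback when no ranking entry matches


def normalize(name):
    """Lowercase and collapse whitespace so exact matching is less brittle."""
    return " ".join(name.lower().split())


def venue_prestige(name, cutoff=0.8):
    """Return a 1-5 prestige score using exact, then fuzzy, name matching."""
    key = normalize(name)
    if key in VENUE_SCORES:  # exact match on the normalized name
        return VENUE_SCORES[key]
    # Fuzzy match: closest known venue whose similarity ratio is at least `cutoff`.
    close = difflib.get_close_matches(key, list(VENUE_SCORES), n=1, cutoff=cutoff)
    return VENUE_SCORES[close[0]] if close else DEFAULT_SCORE
```
]]></p><p>A call such as venue_prestige("Scientometric") falls through the exact lookup and is resolved by the fuzzy step, while an unlisted venue receives the assumed default score.</p></div>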
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reproducibility (R).</head><p>Open-source code or data enhances transparency, and reuse, driving credibility and long-term impact <ref type="bibr">[23]</ref>. Extraction process. We assess reproducibility by scanning the content for open-source indicators (e.g., GitHub/GitLab links). If links are detected, we verify whether the repository contains code or datasets. The agent outputs a binary score (0/1).</p><p>Collaboration Network (C). Broad and diverse collaborations, especially across institutions or countries, increase attention and citation potential through higher credibility and dissemination <ref type="bibr">[32]</ref>. Extraction process. We assess collaboration characteristics, including team size, institutional diversity, and cross-country collaboration. Institutional metadata are complemented with external lookups (e.g., Wikipedia) to estimate prestige and geographic dispersion. The agent outputs a composite score on a 1-5 scale. Topic Hotness (H). Work in trending or growing areas gains citations faster by aligning with the community's interests <ref type="bibr">[31]</ref>. Extraction process. We assess topical hotness using the paper's keywords. For each keyword, we count the number of papers in the previous year; the mean count across keywords is used as the hotness score. The agent outputs a continuous value.</p><p>Text Quality (Q). Clear, well-structured titles and abstracts improve readability, directly influencing citation outcomes <ref type="bibr">[13]</ref>. Extraction process. We assess text quality by prompting an LLM with the paper's title and abstract, together with best-paper exemplars as references. The LLM evaluates structural clarity and professional expression and produces a score on a 1-5 scale.</p><p>Output. The six features are concatenated into f and injected as attributes of the paper node in the heterogeneous academic graph. 
These enriched node features are then propagated through the GNN encoder, enabling the model to jointly capture structural patterns and implicit semantic signals.</p></div>
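<div xmlns="http://www.tei-c.org/ns/1.0"><p>The simpler agent computations above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released code: the function names, the URL pattern, and the ordering of the concatenated vector are hypothetical, and the link detector covers only the first step of the reproducibility check (the repository-content verification is omitted).</p><p><![CDATA[
```python
import re
from statistics import mean

# Assumed pattern for open-source indicators (GitHub/GitLab links).
REPO_LINK = re.compile(r"https?://(?:www\.)?(?:github|gitlab)\.com/\S+", re.IGNORECASE)


def has_repo_link(text):
    """First step of the reproducibility check: detect a GitHub/GitLab URL."""
    return REPO_LINK.search(text) is not None


def topic_hotness(keywords, prev_year_counts):
    """Mean number of previous-year papers per keyword (0.0 if no keywords)."""
    if not keywords:
        return 0.0
    return mean(prev_year_counts.get(k, 0) for k in keywords)


def build_feature_vector(scores):
    """Concatenate the six agent outputs in a fixed (assumed) order."""
    order = ["author", "venue", "reproducibility", "collaboration", "hotness", "quality"]
    return [float(scores[name]) for name in order]
```
]]></p><p>For instance, a paper with keywords whose previous-year counts are 300 and 100 receives a hotness value of 200. In the full pipeline the author score is split into three role-specific values; a single value is used here for brevity.</p></div>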
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Bias-Aware GNN Learning Modules</head><p>With agent-derived features injected as paper-node attributes in the heterogeneous graph, the GNN encoder propagates them together with structural and relational information to form enriched paper representations. A key challenge is that conventional metadata-based models often overfit explicit correlations such as venue, which reflect distributional skew rather than true importance. This shortcut undermines prediction on high-impact papers outside top venues and widens gaps on lowly cited cases. To address these issues, we introduce three complementary modules: (i) a two-stage predictor that first estimates an intermediate exposure variable from graph embeddings and agent-derived features, and then predicts future citations based on both the exposure estimate and remaining signals; (ii) a GroupDRO objective applied to the prediction loss, which minimizes worst-group risk and alleviates head-dominated bias; and (iii) a counterfactual intervention head that perturbs controllable factors and regularizes predictions through monotonicity and smoothness constraints. Together, these modules reduce overreliance on explicit venue cues, leverage implicit drivers of scientific impact, and yield more robust citation predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Two-Stage Forward Strategy for Fair Representation.</head><p>Venue serves as the most salient shortcut feature in citation prediction: it is strongly correlated with citation counts but does not constitute a true determinant <ref type="bibr">[5,</ref><ref type="bibr">12,</ref><ref type="bibr">29]</ref>. As shown in Fig. <ref type="figure">2</ref>, higher-ranked venues typically exhibit higher average citation counts, but individual papers still show wide variation in citations. In computer science, this distributional pattern often arises because top venues cluster popular topics, attract well-known researchers, and include a larger proportion of papers that release open-source code. A robust predictor should therefore capture the joint influence of multiple factors on future citations rather than being misled by a single explicit signal. To this end, we isolate venue effects by enforcing the influence pathway $V \to E \to Y$, where $E$ denotes early exposure and $Y$ denotes future citations. A heterogeneous graph encoder first produces paper embeddings; Stage A estimates $E$ using venue (among other features), while Stage B predicts $Y$ without direct venue input, letting $V$ influence $Y$ only through $\hat{E}$.</p><p>Stage A: Exposure Estimation ($V \to E$). Stage A estimates the latent early exposure variable $\hat{E}$ from both the graph encoder and metadata features. The encoder operates on the full heterogeneous graph that includes venue nodes, so the paper embedding $\mathbf{z}$ already incorporates venue effects. Together with the venue-inclusive feature vector $\mathbf{f}^{(+V)} = [V, R, C, H, Q, T, A_1, A_2, A_3]$, which includes venue prestige ($V$), reproducibility ($R$), collaboration network ($C$), topic hotness ($H$), text quality ($Q$), publication year ($T$), and differentiated author reputations for first ($A_1$), last ($A_2$), and other co-authors ($A_3$), these signals are passed to a feed-forward head to produce $\hat{E}$.</p><p>Stage B: Venue-Excluded Prediction ($E \to Y$). 
Stage B predicts the final citation count using a simplified graph where venue nodes and edges are removed, yielding a venue-excluded paper embedding. The input to the predictor is the concatenation of this embedding, the venue-excluded feature vector $\mathbf{f}^{(-V)}$, and the Stage A estimate $\hat{E}$. In this way, venue influences the outcome only indirectly via exposure, enforcing the influence path $V \to E \to Y$.</p><p>Log-MSE Output and Prediction Loss. Instead of a negative-binomial likelihood, we adopt a mean squared error (MSE) objective after applying a logarithmic transformation to citation counts in order to mitigate the heavy-tailed distribution. Specifically, the predictor outputs $\hat{y}_i$ from the concatenated input $[\mathbf{z}_i; \mathbf{f}_i^{(-V)}; \hat{E}_i]$, and the loss is defined as</p><formula notation="TeX">\mathcal{L}_{\mathrm{pred}} = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{y}_i - \log(1 + y_i) \big)^2 \qquad (1)</formula><p>where $y_i$ denotes the observed citation count of paper $i$.</p><p>Module Discussion. In this module, we restructure the prediction pipeline into two stages:</p><p>Stage B does not directly observe $V$, so the only pathway is $V \to E \to Y$. From an information-theoretic view, $I(V; Y \mid \hat{E}) &lt; I(V; Y)$, meaning the shortcut influence of $V$ on $Y$ is strictly reduced and the model is forced to rely more on other implicit drivers.</p></div>
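<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the data flow of the two-stage predictor concrete, the sketch below replaces the HGT encoder and feed-forward heads with single linear units. It is a toy illustration under that assumption, not the model itself; all function names and the head parameterization are hypothetical. What it does show is the routing: venue information enters only Stage A, and Stage B receives it solely through the exposure estimate, with a log-MSE loss on the Stage B output.</p><p><![CDATA[
```python
import math


def linear(weights, bias, x):
    """Single linear unit standing in for a feed-forward head (assumption)."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def two_stage_forward(z_full, z_no_venue, f_with_venue, f_no_venue, head_a, head_b):
    """Stage A sees venue information; Stage B sees venue only through e_hat."""
    # Stage A: exposure estimate from the venue-inclusive embedding and features.
    e_hat = linear(head_a[0], head_a[1], z_full + f_with_venue)
    # Stage B: log-citation estimate from venue-excluded inputs plus e_hat.
    y_hat = linear(head_b[0], head_b[1], z_no_venue + f_no_venue + [e_hat])
    return e_hat, y_hat


def log_mse(y_hat, y_true):
    """Mean squared error between predictions and log(1 + citations)."""
    return sum((p - math.log1p(y)) ** 2 for p, y in zip(y_hat, y_true)) / len(y_true)
```
]]></p><p>Because the target is log(1 + y), a prediction of log(10) for a paper with 9 citations incurs zero loss, and large citation counts contribute far less to the loss than they would on the raw scale.</p></div>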
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Environment-Aware Optimization for Robust Generalization.</head><p>We apply GroupDRO <ref type="bibr">[25]</ref> to the Stage-B prediction loss to prioritize the worst environment and improve cross-environment generalization. To mitigate the venue-dominated shortcut, we partition the training data into two environments by venue tier, $\mathcal{E} = \{\mathrm{low}, \mathrm{high}\}$. For environment $e \in \mathcal{E}$ with index set $\mathcal{D}_e$, the group risk is defined as the mean loss on samples from that environment:</p><formula notation="TeX">L_e(\theta) = \frac{1}{|\mathcal{D}_e|} \sum_{i \in \mathcal{D}_e} \ell\big(\hat{y}_\theta(x_i),\, y_i\big)</formula><p>where $\hat{y}_\theta(x_i)$ denotes the model prediction for paper $i$, $y_i$ is the observed citation count, and $\ell(\cdot)$ is the Stage-B prediction loss defined in Eq. 1 (Log-MSE on $\log(1 + y_i)$).</p><p>GroupDRO objective. We optimize a worst-group risk via adversarial reweighting:</p><formula notation="TeX">\min_{\theta} \; \max_{q \in \Delta_2} \; \sum_{e \in \mathcal{E}} q_e \, L_e(\theta)</formula><p>where $q = (q_{\mathrm{low}}, q_{\mathrm{high}})$ denotes nonnegative environment weights lying in the probability simplex $\Delta_2$. The inner maximization allocates more weight to the environment with larger risk, forcing the model to improve performance on the worst-performing group. Here $L_e$ is the average loss of environment $e$, $\bar{L}$ is the mean loss across environments, and $\sigma_L$ is their standard deviation (the population standard deviation of group risks). With step size $\eta &gt; 0$ and small constant $\epsilon$,</p><p>where $q_e^{(t)}$ is the current weight of environment $e$. Higher-than-average loss increases the weight, while lower loss decreases it. Finally, weights are clamped and renormalized:</p><p>Here the bounds $[q_{\min}, q_{\max}]$ prevent degenerate values, and $q_e^{(t+1)}$ denotes the updated weight for environment $e$.</p><p>Module Discussion. This environment-aware optimization prevents the model from collapsing onto dominant groups shaped by highly cited or high-prestige papers. By forcing improvements on underrepresented environments, GroupDRO enhances fairness, promoting robust generalization across diverse citation contexts.</p></div>
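<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to realize the environment reweighting described above is sketched below. The exact update rule is not recoverable from the text, so the multiplicative (exponentiated) step on standardized group losses is an assumption, as are the function and argument names. The step size 0.1 and the clipping range [0.1, 0.9] follow the implementation details reported later in the paper.</p><p><![CDATA[
```python
import math


def update_group_weights(weights, losses, eta=0.1, eps=1e-8, w_min=0.1, w_max=0.9):
    """One reweighting step over environments (assumed exponentiated update)."""
    envs = list(weights)
    mean_loss = sum(losses[e] for e in envs) / len(envs)
    # Population standard deviation of the group losses.
    std_loss = math.sqrt(sum((losses[e] - mean_loss) ** 2 for e in envs) / len(envs))
    # Above-average loss raises the weight, below-average loss lowers it.
    raw = {
        e: weights[e] * math.exp(eta * (losses[e] - mean_loss) / (std_loss + eps))
        for e in envs
    }
    # Clamp to avoid degenerate weights, then renormalize to sum to one.
    clamped = {e: min(max(raw[e], w_min), w_max) for e in envs}
    total = sum(clamped.values())
    return {e: clamped[e] / total for e in envs}


def robust_loss(weights, losses):
    """Weighted sum of group losses used as the training objective."""
    return sum(weights[e] * losses[e] for e in weights)
```
]]></p><p>Starting from equal weights, an environment whose loss exceeds the mean gains weight after one step, so the weighted objective shifts toward the worse-performing group; when all group losses are equal the weights stay unchanged.</p></div>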
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3">Regularization for Reasonable and Stable Prediction.</head><p>To provide actionable "what-if" estimates, we augment Stage B with a regularization head. This module turns abstract features into interpretable regularization effects: it quantifies the predicted citation change if only a controllable factor were improved (e.g., toggling $R$: $0 \to 1$), while all other attributes remain fixed and the induced change in early exposure is propagated consistently. This yields predictions that are both actionable and constrained to be directionally reasonable and stable.</p><p>For a controllable factor $v$ (e.g., reproducibility $R$), let $\mathbf{s} = [\mathbf{z}; \mathbf{f}^{(-V)}; \hat{E}]$ be the Stage B input constructed from the observed features, and let $\mathbf{s}^{(v\uparrow)}$ be the same input after setting $v$ to a high value (keeping all other features fixed) and recomputing $\hat{E}$ under this change. The per-factor counterfactual effect is defined as</p><formula notation="TeX">\Delta_v = \hat{y}\big(\mathbf{s}^{(v\uparrow)}\big) - \hat{y}(\mathbf{s})</formula><p>where $\hat{y}(\cdot)$ is the Stage B predictor and the early-exposure estimate in $\mathbf{s}^{(v\uparrow)}$ is</p><p>Monotonicity and smoothness regularization. Let $d_v \in \{+1, -1\}$ denote the expected direction of improvement (typically $d_v = +1$), and let $\tau_v$ be a threshold that marks the "low" region of $v$ (e.g., $\tau_v = 0$). We enforce that raising $v$ should not hurt citations for low-value cases, and keep effects calibrated via a smoothness penalty:</p><p>Aggregating over controllable factors $\mathcal{V}$, the total regularizer is</p><p>Module Discussion. One major source of bias in citation prediction arises from the limited and coarse metadata used in traditional models, which makes explicit factors dominate the learning process. Within our regularization module, the model leverages fine-grained, implicitly derived features to regularize representation learning, thereby mitigating shortcut reliance. 
By distributing explanatory power across multiple implicit drivers (such as collaboration patterns, topic dynamics, and text quality), the model becomes less dependent on any single explicit factor and achieves more balanced generalization across varying citation environments.</p></div>
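<div xmlns="http://www.tei-c.org/ns/1.0"><p>The what-if regularization can be illustrated with a small sketch. The precise penalty forms are not recoverable from the text, so the hinge-style monotonicity term and the quadratic smoothness term below are assumptions, and all names are hypothetical. The raised input is assumed to already contain the recomputed exposure estimate.</p><p><![CDATA[
```python
def counterfactual_effect(predict, s_observed, s_raised):
    """Predicted change when one controllable factor is raised.

    `s_raised` is assumed to include the recomputed exposure estimate.
    """
    return predict(s_raised) - predict(s_observed)


def monotonicity_penalty(effect, value, direction=1, threshold=0.0):
    """Hinge penalty, active only for low-value cases with a wrong-signed effect."""
    if value > threshold:
        return 0.0  # the factor is not in the "low" region, so no constraint applies
    return max(0.0, -direction * effect)


def smoothness_penalty(effect, scale=1.0):
    """Assumed quadratic penalty that discourages implausibly large effects."""
    return scale * effect ** 2


def total_regularizer(effects, values, smooth_scale=0.1):
    """Sum both penalties over all controllable factors (keys of `effects`)."""
    return sum(
        monotonicity_penalty(effects[v], values[v]) + smoothness_penalty(effects[v], smooth_scale)
        for v in effects
    )
```
]]></p><p>With this form, a paper without released code whose predicted citations drop when reproducibility is toggled on incurs a positive monotonicity penalty, whereas a positive effect is penalized only through the (small) smoothness term.</p></div>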
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.4">Objective.</head><p>Our training objective focuses on environment-aware risk (GroupDRO) and the proposed sensitivity regularization:</p><p>We also employ two lightweight auxiliaries on Stage B, an exposure calibration loss on $\hat{E}$ and an adversarial venue-invariance loss, which are reported in ablations and described in Appendix B, but omitted here for brevity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we conduct extensive experiments to answer the following four research questions:</p><p>RQ1: How does BA-Cite perform compared with other models in terms of predictive accuracy and ranking quality? RQ2: How does each component contribute to the overall performance of BA-Cite?</p><p>RQ3: Does BA-Cite achieve robust prediction across different data distributions after bias mitigation?</p><p>RQ4: How do different hyper-parameters affect BA-Cite?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>4.1.1 Datasets. We select two widely used publicly available academic datasets, Aminer <ref type="bibr">[28]</ref> and OpenAlex <ref type="bibr">[22]</ref>, to verify the effectiveness of our proposed framework. Both datasets focus on the computer science domain and contain heterogeneous nodes, as well as temporal relations such as cites, writes, and has_topic. The temporal coverage ranges from 2010 to 2025.</p><p>For each paper, the prediction target is its future citation number within the next five years (starting from the second year) after publication. We split the data by year, using papers published during 2010-2018 for training, 2019 for validation, and 2020 for testing.</p><p>Within each split, we randomly sample 10,000 papers for training, 1,000 for validation, and 1,000 for testing. To reduce randomness, we repeat the sampling process three times and conduct five runs with different random seeds for each sample. The reported results include the mean and standard deviation across all runs.</p><p>4.1.2 Baselines. We compare our framework with representative methods from four categories, covering classical GNNs, sequential models, large language models, and metadata-based neural models. Graph Neural Network-based methods.</p><p>&#8226; GAT (ICLR'18) <ref type="bibr">[30]</ref>: models citation relations using multi-head graph attention. &#8226; HINTS (WWW'21) <ref type="bibr">[12]</ref>: encodes temporal heterogeneous information networks for citation time-series prediction. &#8226; DyGFormer (NeurIPS'23) <ref type="bibr">[40]</ref>: applies transformer-style temporal encoding for dynamic graphs.</p><p>Sequence-based methods.</p><p>&#8226; BiLSTM-Meta (Scientometrics'21) <ref type="bibr">[18]</ref>: captures citation sequences via bidirectional recurrent modeling. 
&#8226; DeepCas (WWW'17) <ref type="bibr">[15]</ref>: learns citation cascade representations through random walks and BiGRU-based attention. &#8226; SI-HDGNN (KBS'22) <ref type="bibr">[35]</ref>: builds heterogeneous dynamic academic networks for impact propagation.</p><p>Large Language Model-based methods.</p><p>&#8226; GPT-4o (OpenAI'24) <ref type="bibr">[9]</ref>: leverages LLM reasoning and knowledge for citation impact estimation. &#8226; Llama-3.1-405B (Meta'24) <ref type="bibr">[4]</ref>: employs open-source LLM embeddings for academic impact inference. &#8226; NAIP (AAAI'25) <ref type="bibr">[41]</ref>: formulates newborn article impact prediction by fine-tuning large language models on title-abstract pairs with the TNCSISP metric, enabling content-only impact estimation without external metadata.</p><p>Metadata-based Neural Methods.</p><p>&#8226; BP-NN (J. Informetrics'20) <ref type="bibr">[24]</ref>: A four-layer feed-forward neural network that predicts five-year citation counts.</p><p>These baselines represent diverse paradigms in scientific impact prediction, from early cascade modeling to metadata-driven, dynamic, and LLM-enhanced frameworks, providing a comprehensive comparison foundation. For all feature-dependent baselines, we supply the fine-grained semantic features extracted in Section 3.2 to ensure consistent and enriched input representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3">Implementation Details.</head><p>We implement the counterfactual two-stage HGT in PyTorch. The heterogeneous encoder uses two HGT layers (hidden size 128, 4 attention heads, dropout 0.4). Node features follow our schema: paper nodes carry an 8-dimensional feature vector, while author, venue, and topic nodes are initialized with 1-dimensional metadata features. For counterfactual learning, we treat reproducibility and content quality as actionable variables and apply monotonicity regularization so that increasing either one does not decrease the predicted citations. We also employ adversarial training and an auxiliary loss. We adopt GroupDRO over 2 environments with step size 0.1 and group-weight clipping to [0.1, 0.9]. Environments are constructed by thresholding venue prestige at 0.8. Optimization uses AdamW (lr = $10^{-3}$, weight decay = $10^{-4}$) with a 10-epoch warm-up followed by cosine decay.</p></div>
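<div xmlns="http://www.tei-c.org/ns/1.0"><p>The GroupDRO settings above (2 environments, step size 0.1, weights clipped to [0.1, 0.9], prestige threshold 0.8) can be illustrated with a small sketch of the group-weight update. It assumes the common exponentiated-gradient form of GroupDRO and assumes that clipping is applied after normalization; the function names and the rule mapping prestige to an environment index are hypothetical and may differ from the released code.</p>

```python
import math


def update_group_weights(weights, group_losses, step_size=0.1, lo=0.1, hi=0.9):
    """One exponentiated-gradient step on the group weights, then clip and renormalize.

    Groups with higher loss receive more weight. The clip-then-renormalize
    order is an assumption of this sketch.
    """
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    clipped = [min(max(s / total, lo), hi) for s in scaled]
    norm = sum(clipped)
    return [c / norm for c in clipped]


def robust_loss(weights, group_losses):
    """Weighted sum of per-group losses, the quantity the model would minimize."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


def assign_environment(venue_prestige, threshold=0.8):
    """Environment index: 1 for high-prestige venues, 0 otherwise (assumed rule)."""
    return int(venue_prestige >= threshold)
```

<p>With two groups, clipping to [0.1, 0.9] keeps either environment from being ignored entirely, even when one group dominates the loss for many steps.</p></div>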
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Overall Performance (RQ1)</head><p>Table <ref type="table">1</ref> summarizes the overall results on the Aminer and OpenAlex datasets. BA-Cite achieves the best performance across all metrics. On Aminer, it reduces MALE by 2.7% and RMSLE by 9.3% relative to the best baseline, while improving NDCG@10 and NDCG@20 by 2% and 4%, respectively. On OpenAlex, the gains are larger: MALE drops by 20.3% and RMSLE by 20.1%, and NDCG@10 and NDCG@20 rise by 12% and 4%. These consistent gains indicate that integrating agent-derived fine-grained features with dynamic heterogeneous graph learning improves both accuracy and ranking quality. Among the baselines, GNN-based models generally perform better than sequence- or metadata-based ones, highlighting the importance of structural and temporal modeling. LLM-based methods achieve competitive ranking performance, which may stem from pretrained knowledge rather than structural understanding. Overall, BA-Cite provides balanced and stable improvements, confirming its advantage in cold-start and long-tailed citation prediction scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Ablation Study (RQ2)</head><p>We examine the effect of each module in BA-Cite on Aminer and OpenAlex (Table <ref type="table">2</ref>). Removing fine-grained feature extraction (w/o  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Analysis of Robustness (RQ3)</head><p>As shown in Fig. <ref type="figure">4</ref>, we further evaluate whether BA-Cite maintains stable performance after bias mitigation under varying data distributions. Following the partitioning strategy in <ref type="bibr">[12]</ref>, we divide papers into lowly cited and highly cited subsets by citation count. Among the three compared models, BA-Cite achieves the best performance on both the low- and high-citation subsets, which indicates its robustness to distributional variation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Analysis of Parameter Sensitivity (RQ4)</head><p>As shown in Fig. <ref type="figure">3</ref>, we analyze the sensitivity of four loss weights. (1) Main-objective weight. On Aminer, errors decrease as the weight grows and are lowest at 1.2-1.5, indicating that a stronger main objective improves fit under mild bias. On OpenAlex, MALE and RMSLE increase with the weight, so smaller values (0.5-0.8) are preferable to preserve capacity for debiasing. (2) Adversarial weight. A small-to-moderate adversarial strength works best: Aminer achieves its lowest errors around 0.05, while OpenAlex shows a trade-off, with MALE lowest near 0.05 and RMSLE lowest near 0.2, suggesting a value in the range 0.05-0.2. Larger values introduce instability without gains. (3) Regularization weight. Both datasets exhibit a U-shaped trend: mild regularization helps (optimum around 0.1-0.2), whereas overly weak or overly strong settings (e.g., 0.01 or 0.5) degrade performance by allowing redundancy or over-constraining the embeddings, respectively. (4) Fairness weight. Fairness regularization should be conservative. Aminer reaches minimum errors around 0.05, while OpenAlex favors very small values. Strong regularization degrades accuracy on both datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We present BA-Cite, a bias-aware citation prediction framework combining multi-agent semantic extraction with dynamic heterogeneous graph learning. By modeling author, venue, and topic dynamics, BA-Cite provides a strong semantic basis for citation reasoning. Its two-stage optimization with GroupDRO enhances robustness and mitigates overfitting to high-prestige environments. Experiments on Aminer and OpenAlex show consistent improvements over strong baselines, with stability confirmed by ablation and sensitivity analyses. BA-Cite generalizes well under distribution shifts, supporting real-world scholarly impact evaluation. Future work includes impact explanation, cross-domain transfer, and reinforcement learning-based adaptive bias mitigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RMSLE</head><p>Ranking quality. For papers published in the same year, we compare the predicted ranking against the ground-truth ranking and report NDCG@10 and NDCG@20.</p></div>
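<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the three metric families reported in the experiments (MALE, RMSLE, and NDCG@k) can be computed as in the sketch below. It assumes the usual log(1 + x) transform for the error metrics and uses raw true citation counts as NDCG gains; both choices, along with the function names, are assumptions and may differ from the exact definitions used in the paper.</p>

```python
import math


def male(y_true, y_pred):
    """Mean absolute log error, using log(1 + x) so zero counts are defined."""
    errors = [abs(math.log1p(p) - math.log1p(t)) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def rmsle(y_true, y_pred):
    """Root mean squared log error on the same log(1 + x) scale."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(y_true, y_pred, k=10):
    """NDCG@k: rank papers by prediction, score with true counts as gains (assumed)."""
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    ranked_gains = [y_true[i] for i in order[:k]]
    ideal_gains = sorted(y_true, reverse=True)[:k]
    best = dcg(ideal_gains)
    return dcg(ranked_gains) / best if best > 0 else 0.0
```

<p>To follow the same-year comparison described above, ndcg_at_k would be called once per publication-year cohort.</p></div>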
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Auxiliary Losses for Stage B</head><p>To further stabilize optimization and mitigate residual bias, we incorporate two lightweight auxiliary objectives at Stage B: (i) exposure calibration loss and (ii) adversarial venue-invariance loss. Both are designed to regularize the learned citation representations without introducing additional parameters or inference overhead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Exposure Calibration Loss</head><p>Empirical studies have shown that citation counts are strongly correlated with exposure factors (such as publication venue, collaboration size, or open-source visibility), which may distort predictive learning. To prevent the model from over-amplifying these factors, we impose an auxiliary calibration constraint on the predicted exposure score.</p><p>The constraint is defined against the empirical exposure distribution estimated from the training data. It penalizes deviations between the predicted and empirical exposure distributions, ensuring that the model's intermediate exposure estimation remains statistically consistent and well-calibrated.</p></div>
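<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to make such a calibration term concrete is a divergence between binned exposure distributions. The form below is an illustrative assumption only: the choice of KL divergence, the binning over exposure values, and all symbols (the empirical bin probabilities $p^{*}_{b}$ and the predicted bin probabilities $\hat{p}_{b}$) are introduced for this example and are not taken from the paper.</p>

```latex
\mathcal{L}_{\mathrm{exp}}
  = D_{\mathrm{KL}}\left( p^{*} \,\|\, \hat{p} \right)
  = \sum_{b} p^{*}_{b} \, \log \frac{p^{*}_{b}}{\hat{p}_{b}}
```

<p>Under this assumed form, the term is zero when the predicted exposure histogram matches the empirical one and grows as the two diverge, which matches the described behavior of penalizing deviations between the two distributions.</p></div>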
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Adversarial Venue-Invariance Loss</head><p>Venue prestige is one of the most dominant shortcut features in citation prediction. To enhance robustness against venue bias, we introduce an adversarial objective that enforces venue-invariant latent representations. A discriminator is trained to predict the venue label from the Stage B feature embedding, while the main encoder attempts to fool it.</p><p>This min-max game discourages the encoder from encoding venue-specific artifacts, resulting in a fairer and more transferable citation representation.</p></div>
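<div xmlns="http://www.tei-c.org/ns/1.0"><p>A min-max game of this kind is commonly written in the domain-adversarial form sketched below. The notation is assumed for illustration and is not the paper's: $h_{\theta}$ is the encoder producing the Stage B embedding, $D_{\phi}$ the venue discriminator, $v$ the venue label, $\mathcal{L}_{\mathrm{CE}}$ a cross-entropy loss, and $\lambda_{\mathrm{adv}}$ the adversarial weight (set to 0.05 in Appendix B.3).</p>

```latex
\min_{\theta} \; \max_{\phi} \;\;
  \mathcal{L}_{\mathrm{main}}(\theta)
  \; - \; \lambda_{\mathrm{adv}} \,
  \mathcal{L}_{\mathrm{CE}}\big( D_{\phi}( h_{\theta}(x) ),\, v \big)
```

<p>In this assumed form, the inner maximization over $\phi$ trains the discriminator to lower its classification loss, while the outer minimization over $\theta$ rewards the encoder for raising that loss, so venue information is pushed out of the embedding.</p></div>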
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Implementation Note</head><p>Both auxiliaries are assigned small weights (0.1 for exposure calibration and 0.05 for the adversarial loss) relative to the primary Stage B objective. They are active only during training and disabled during inference. Ablation results in Table <ref type="table">2</ref> confirm that incorporating these auxiliaries improves fairness and stability, particularly under long-tailed or domain-shifted scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Dataset Diagnostics and Descriptive Statistics</head><p>This appendix reports basic diagnostics of the evaluation splits used in our experiments, including overall scale and central tendency of citation counts, temporal coverage, authorship statistics, and venue/document-type compositions. These summaries help contextualize the long-tailed nature of citations and the salience of venue as a shortcut factor discussed in the main text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Global Characteristics</head><p>Table <ref type="table">3</ref> summarizes dataset-level statistics. We report the number of papers, citation summary statistics (mean, median, and maximum), year range, average number of authors, and the highest venue tier present in each split.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Venue Tier Distribution</head><p>Table <ref type="table">4</ref> reports the venue-tier composition for each split. Percentages are computed over all papers in the split. </p></div></body>
		</text>
</TEI>
