

Search for: All records

Award ID contains: 1838145


  1. An important means for disseminating information on social media platforms is including URLs that point to external sources in user posts. On Twitter, we estimate that about 21% of the daily stream of English-language tweets contain URLs. We notice that NLP tools make little attempt to understand the relationship between the content of the URL and the text surrounding it in a tweet. In this work, we study the structure of tweets with URLs relative to the content of the Web documents pointed to by the URLs. We identify several segment classes that may appear in a tweet with URLs, such as the title of a Web page and the user's original content. Our goals in this paper are to introduce, define, and analyze the segmentation problem of tweets with URLs; to develop an effective algorithm to solve it; and to show that our solution can benefit sentiment analysis on Twitter. We also show that the problem is an instance of the block edit distance problem and is thus NP-hard.
  2. Free, publicly-accessible full text available June 9, 2025
  3. Free, publicly-accessible full text available May 13, 2025
  4. An important task for Information Extraction from Microblogs is Named Entity Recognition (NER), which extracts mentions of real-world entities from microblog messages along with meta-information such as entity type for better entity characterization. Many microblog NER systems have rightly sought to prioritize modeling the non-literary nature of microblog text. These systems are trained on offline static datasets and extract a combination of surface-level features (orthographic, lexical, and semantic) from individual messages for noisy text modeling and entity extraction. But given the constantly evolving nature of microblog streams, detecting all entity mentions from such varying yet limited context in short messages remains a difficult problem to generalize. In this paper, we propose the NER Globalizer pipeline, which is better suited for NER on microblog streams. It characterizes the isolated message processing by existing NER systems as modeling local contextual embeddings, where learned knowledge from the immediate context of a message is used to suggest seed entity candidates. Additionally, it recognizes that messages within a microblog stream are topically related and often repeat mentions of the same entity. This suggests building NER systems that go beyond localized processing. By leveraging occurrence mining, the proposed system therefore follows up traditional NER modeling by extracting additional mentions of seed entity candidates that were previously missed. Candidate mentions are separated into well-defined clusters, which are then used to generate a pooled global embedding drawn from the collective context of the candidate within a stream. The global embeddings are used to separate false positives from entities whose mentions are produced in the final NER output. Our experiments show that the proposed NER system exhibits superior effectiveness on multiple NER datasets, with an average Macro F1 improvement of 47.04% over the best NER baseline, while adding only a small computational overhead.
  5. As the content on the Internet continues to grow, many new dynamically changing and heterogeneous sources of data constantly emerge. A conventional search engine cannot crawl and index at the same pace as the expansion of the Internet. Moreover, a large portion of the data on the Internet is not accessible to traditional search engines. Distributed Information Retrieval (DIR) is a viable solution to this problem, as it integrates multiple shards (resources) and provides unified access to them. Resource selection is a key component of DIR systems. There is a rich body of literature on resource selection approaches for DIR. A key limitation of the existing approaches is that they primarily use term-based statistical features and do not generally model resource-query and resource-resource relationships. In this paper, we propose a graph neural network (GNN) based approach to learning-to-rank that is capable of modeling resource-query and resource-resource relationships. Specifically, we utilize a pre-trained language model (PTLM) to obtain semantic information from queries and resources. Then, we explicitly build a heterogeneous graph to preserve structural information of query-resource relationships and employ a GNN to extract structural information. In addition, the heterogeneous graph is enriched with resource-resource edges to further enhance the ranking accuracy. Extensive experiments on benchmark datasets show that our proposed approach is highly effective in resource selection. Our method outperforms the state of the art by 6.4% to 42% on various performance metrics.
  6. Web records are structured data on a Web page that embeds records retrieved from an underlying database according to some templates. Mining data records on the Web enables the integration of data from multiple Web sites for providing value-added services. Most existing works on Web record extraction make two key assumptions: (1) records are retrieved from databases with uniform schemas and (2) records are displayed in a linear structure on a Web page. These assumptions no longer hold on the modern Web. A Web page may present records of diverse entity types with different schemas and organize records hierarchically, in nested structures, to show richer relationships among records. In this paper, we revisit these assumptions and modify them to reflect Web pages on the modern Web. Based on the reformulated assumptions, we introduce the concept of an invariant in Web data records and propose Miria (Mining record invariant), a bottom-up, recursive approach to constructing Web records from the invariants. The proposed approach is both effective and efficient, consistently outperforming the state-of-the-art Web record extraction methods on modern Web pages.
  7. Producing high-quality labeled data is a challenge in any supervised learning problem, where in many cases human involvement is necessary to ensure label quality. However, human annotations are not flawless, especially for a challenging problem. In nontrivial problems, high disagreement among annotators results in noisy labels, which affect the performance of any machine learning model. In this work, we consider three noise reduction strategies to improve label quality in the Article-Comment Alignment Problem (ACAP), where the main task is to classify article-comment pairs according to their relevancy level. The first strategy for reducing labeling disagreement utilizes annotators' background knowledge during the label aggregation step. The second strategy utilizes user disagreement during the training process. In the third and final strategy, we ask annotators to perform corrections and relabel the examples with noisy labels. We deploy these strategies and compare them to a resampling strategy for addressing class imbalance, another common supervised learning challenge. These alternatives were evaluated on ACAP, a multiclass text pairs classification problem with highly imbalanced data, where one of the classes represents at most 15% of the dataset's entire population. Our results provide evidence that the considered strategies can reduce disagreement between annotators. However, data quality improvement is insufficient to enhance classification accuracy in the article-comment alignment problem, which exhibits a high class imbalance. The model performance is enhanced for the same problem by addressing the imbalance issue with a weight loss-based class distribution resampling. We show that allowing the model to pay more attention to the minority class during the training process in the presence of noisy examples improves the test accuracy by 3%.
  8. Many online news outlets, forums, and blogs provide a rich stream of publications and user comments. This rich body of data is a valuable source of information for researchers, journalists, and policymakers. However, the ever-increasing production and user engagement rate make it difficult to analyze this data without automated tools. This work presents MultiLayerET, a method to unify the representation of entities and topics in articles and comments. In MultiLayerET, articles' content and associated comments are parsed into a multilayer graph consisting of heterogeneous nodes representing named entities and news topics. The nodes within this graph have attributed edges denoting weight, i.e., the strength of the connection between the two nodes, time, i.e., the co-occurrence contemporaneity of two nodes, and sentiment, i.e., the opinion (in aggregate) of an entity toward a topic. Such information helps in analyzing articles and their comments. We infer the edges connecting two nodes using information mined from the textual data. The multilayer representation gives an advantage over a single-layer representation since it integrates articles and comments via shared topics and entities, providing richer signal points about emerging events. MultiLayerET can be applied to different downstream tasks, such as detecting media bias and misinformation. To explore the efficacy of the proposed method, we apply MultiLayerET to a body of data gathered from six representative online news outlets. We show that with MultiLayerET, the classification F1 score of a media bias prediction model improves by 36%, and that of a state-of-the-art fake news detection model improves by 4%. 
  9. With the increase in the volume of daily online news items, it is increasingly difficult for readers to identify news articles relevant to their interests. Thus, effective recommendation systems are critical for a good news consumption experience. Existing news recommendation methods usually rely on the news click history to model user interest. However, there are other signals about user behaviors, such as user commenting activity, which have not been used before. We propose a recommendation algorithm that predicts articles a user may be interested in, given her historical sequential commenting behavior on news articles. We show that, when following this sequential user behavior, the news recommendation problem falls into the class of session-based recommendation. The techniques in this class seek to model users' sequential and temporal behaviors. While we seek to follow the general directions in this space, we face unique challenges specific to news in modeling temporal dynamics, e.g., users' interests shift over time, users comment irregularly on articles, and articles are perishable items with limited lifespans. We propose a recency-regularized neural attentive framework for session-based news recommendation. The proposed method is able to capture the temporal dynamics of both users and news articles while maintaining interpretability. We design a lag-aware attention and a recency regularization to model the time effect of news articles and comments. We conduct extensive empirical studies on three real-world news datasets to demonstrate the effectiveness of our method.
  10. Social media is the ultimate challenge for many natural language processing tools. The constant emergence of linguistic constructs challenges even the most sophisticated NLP tools. Predicting word embeddings for out-of-vocabulary words is one of those challenges. Word embedding models only include terms that occur a sufficient number of times in their training corpora. Word embedding vector models are unable to directly provide any useful information about a word not in their vocabularies. We propose a fast method for predicting vectors for out-of-vocabulary terms that makes use of the surrounding terms of the unknown term and the hidden context layer of the word2vec model. We propose this method as a strong baseline in the sense that 1) while it does not surpass all state-of-the-art methods, it surpasses several techniques for vector prediction on benchmark tasks, 2) even when it underperforms, the margin is very small, retaining competitive performance in downstream tasks, and 3) it is inexpensive to compute, requiring no additional training stage. We also show that our technique can be incorporated into existing methods to achieve a new state of the art on the word vector prediction problem.
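To make the idea in item 10 more concrete, the following is a minimal sketch of predicting a vector for an out-of-vocabulary term from its surrounding terms. It is a simplification and not the authors' method: the abstract describes using the hidden context layer of a word2vec model, whereas this sketch assumes a plain dictionary of word vectors and simply averages the vectors of in-vocabulary neighbours. The function name `predict_oov_vector`, the `window` parameter, and the toy embeddings are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]

# Toy 2-dimensional embeddings, invented purely for illustration.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "new": [1.0, 1.0],
    "phone": [0.0, 1.0],
}


def predict_oov_vector(
    tokens: Sequence[str],
    index: int,
    embeddings: Dict[str, Vector],
    window: int = 2,
) -> Optional[Vector]:
    """Estimate a vector for tokens[index] as the mean of the vectors of
    in-vocabulary tokens at most `window` positions away.

    Returns None when no neighbour has a known vector.
    Assumption: simple averaging stands in for a learned context layer.
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    context = [
        embeddings[tok]
        for pos, tok in enumerate(tokens[lo:hi], start=lo)
        if pos != index and tok in embeddings
    ]
    if not context:
        return None
    dim = len(context[0])
    return [sum(vec[d] for vec in context) / len(context) for d in range(dim)]


# "smartfone" is not in the vocabulary; its neighbours "new" and "phone" are.
sentence = ["my", "new", "smartfone", "phone"]
predicted = predict_oov_vector(sentence, 2, TOY_EMBEDDINGS, window=1)
```

Like the approach described in the abstract, a sketch of this kind needs no additional training stage, since it only reads vectors that already exist.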