skip to main content


Title: Data Analytics for Air Travel Data: A Survey and New Perspectives
From the start, the airline industry has remarkably connected countries all over the world through rapid long-distance transportation, helping people overcome geographic barriers. Consequently, this has ushered in substantial economic growth, both nationally and internationally. The airline industry produces vast amounts of data, capturing a diverse set of information about their operations, including data related to passengers, freight, flights, and much more. Analyzing air travel data can advance the understanding of airline market dynamics, allowing companies to provide customized, efficient, and safe transportation services. Due to big data challenges in such a complex environment, the benefits of drawing insights from the air travel data in the airline industry have not yet been fully explored. This article aims to survey various components and corresponding proposed data analysis methodologies that have been identified as essential to the inner workings of the airline industry. We introduce existing data sources commonly used in the papers surveyed and summarize their availability. Finally, we discuss several potential research directions to better harness airline data in the future. We anticipate this study to be used as a comprehensive reference for both members of the airline industry and academic scholars with an interest in airline research.  more » « less
Award ID(s):
1952089
NSF-PAR ID:
10332952
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
ACM Computing Surveys
Volume:
54
Issue:
8
ISSN:
0360-0300
Page Range / eLocation ID:
1 to 35
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Background Human movement is one of the forces that drive the spatial spread of infectious diseases. To date, reducing and tracking human movement during the COVID-19 pandemic has proven effective in limiting the spread of the virus. Existing methods for monitoring and modeling the spatial spread of infectious diseases rely on various data sources as proxies of human movement, such as airline travel data, mobile phone data, and banknote tracking. However, intrinsic limitations of these data sources prevent us from systematic monitoring and analyses of human movement on different spatial scales (from local to global). Objective Big data from social media such as geotagged tweets have been widely used in human mobility studies, yet more research is needed to validate the capabilities and limitations of using such data for studying human movement at different geographic scales (eg, from local to global) in the context of global infectious disease transmission. This study aims to develop a novel data-driven public health approach using big data from Twitter coupled with other human mobility data sources and artificial intelligence to monitor and analyze human movement at different spatial scales (from global to regional to local). Methods We will first develop a database with optimized spatiotemporal indexing to store and manage the multisource data sets collected in this project. This database will be connected to our in-house Hadoop computing cluster for efficient big data computing and analytics. We will then develop innovative data models, predictive models, and computing algorithms to effectively extract and analyze human movement patterns using geotagged big data from Twitter and other human mobility data sources, with the goal of enhancing situational awareness and risk prediction in public health emergency response and disease surveillance systems. Results This project was funded as of May 2020. We have started the data collection, processing, and analysis for the project. Conclusions Research findings can help government officials, public health managers, emergency responders, and researchers answer critical questions during the pandemic regarding the current and future infectious risk of a state, county, or community and the effectiveness of social/physical distancing practices in curtailing the spread of the virus. International Registered Report Identifier (IRRID) DERR1-10.2196/24432 
    more » « less
  2. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging. But there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5 % of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80 % of those infected either are asymptomatic or have mild symptoms; a finding that implies that demand for advanced medical services might apply to only 20 % of the total infected. Of patients infected with Covid-19, about 15 % have severe illness and 5 % have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25 % to as high as 3.0 % (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (> 14 %) and those with coexisting conditions (10 % for those with cardiovascular disease and 7 % for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1 %. Public health efforts depend heavily on predicting how diseases such as those caused by Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development a model of Corona spread by using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al. Modeling Ebola Spread and Using HPCC/KEL System. In: Big Data Technologies and Applications 2016 (pp. 347-385). Springer, Cham) to successfully model Corona spread, we will obtain new results, and help in reducing the number of Corona patients. We closely collaborated with LexisNexis, which is a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a multi-level view of the pandemic with the informative virus spreading indicators in a timely manner. The system embeds a classical epidemiological model known as SIR and spreading indicators based on causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingesting and tracking of various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents the Covid-19 data on a daily, weekly, and cumulative basis up to global-level and down to the county-level. It also provides statistical analysis for each level such as new cases per 100,000 population. The primary analysis such as Contagion Risk and Infection State is based on causal model with a seven-day sliding window. Our work has been released as a publicly available website to the world and attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems, which is briefly described in the paper. 
    more » « less
  3. Understanding the characteristics of air-traffic delays and disruptions is critical for developing ways to mitigate their significant economic and environmental impacts. Conventional delay-performance metrics reflect only the magnitude of incurred flight delays at airports; in this work, we show that it is also important to characterize the spatial distribution of delays across a network of airports. We analyze graph-supported signals, leveraging techniques from spectral theory and graph-signal processing to compute analytical and simulation-driven bounds for identifying outliers in spatial distribution. We then apply these methods to the case of airport-delay networks and demonstrate the applicability of our methods by analyzing U.S. airport delays from 2008 through 2017. We also perform an airline-specific analysis, deriving insights into the delay dynamics of individual airline subnetworks. Through our analysis, we highlight key differences in delay dynamics between different types of disruptions, ranging from nor’easters and hurricanes to airport outages. We also examine delay interactions between airline subnetworks and the system-wide network and compile an inventory of outlier days that could guide future aviation operations and research. In doing so, we demonstrate how our approach can provide operational insights in an air-transportation setting. Our analysis provides a complementary metric to conventional aviation-delay benchmarks and aids airlines, traffic-flow managers, and transportation-system planners in quantifying off-nominal system performance. 
    more » « less
  4. Imputing missing data is a critical task in data-driven intelligent transportation systems. During recent decades there has been a considerable investment in developing various types of sensors and smart systems, including stationary devices (e.g., loop detectors) and floating vehicles equipped with global positioning system (GPS) trackers to collect large-scale traffic data. However, collected data may not include observations from all road segments in a traffic network for different reasons, including sensor failure, transmission error, and because GPS-equipped vehicles may not always travel through all road segments. The first step toward developing real-time traffic monitoring and disruption prediction models is to estimate missing values through a systematic data imputation process. Many of the existing data imputation methods are based on matrix completion techniques that utilize the inherent spatiotemporal characteristics of traffic data. However, these methods may not fully capture the clustered structure of the data. This paper addresses this issue by developing a novel data imputation method using PARATUCK2 decomposition. The proposed method captures both spatial and temporal information of traffic data and constructs a low-dimensional and clustered representation of traffic patterns. The identified spatiotemporal clusters are used to recover network traffic profiles and estimate missing values. The proposed method is implemented using traffic data in the road network of Manhattan in New York City. The performance of the proposed method is evaluated in comparison with two state-of-the-art benchmark methods. The outcomes indicate that the proposed method outperforms the existing state-of-the-art imputation methods in complex and large-scale traffic networks.

     
    more » « less
  5. Rapid urbanization has posed significant burden on urban transportation infrastructures. In today's cities, both private and public transits have clear limitations to fulfill passengers' needs for quality of experience (QoE): Public transits operate along fixed routes with long wait time and total transit time; Private transits, such as taxis, private shuttles and ride-hailing services, provide point-to-point transits with high trip fare. In this paper, we propose CityLines, a transformative urban transit system, employing hybrid hub-and-spoke transit model with shared shuttles. Analogous to Airlines services, the proposed CityLines system routes urban trips among spokes through a few hubs or direct paths, with travel time as short as private transits and fare as low as public transits. CityLines allows both point-to-point connection to improve the passenger QoE, and hub-and-spoke connection to reduce the system operation cost. To evaluate the performance of CityLines, we conduct extensive data-driven experiments using one-month real-world trip demand data (from taxis, buses and subway trains) collected from Shenzhen, China. The results demonstrate that CityLines reduces 12.5%-44% average travel time, and aggregates 8.5%-32.6% more trips with ride-sharing over other implementation baselines. 
    more » « less