skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, October 10 until 2:00 AM ET on Friday, October 11 due to maintenance. We apologize for the inconvenience.


Title: Improving Process Discovery Results by Filtering Out Outliers from Event Logs with Hidden Markov Models
Process Mining is a technique for extracting process models from event logs. Event logs contain abundant explicit information related to events, such as the timestamp and the actions that trigger the event. Much of the existing process mining research has focused on discovering the process models behind these event logs. However, Process Mining relies on the assumption that these event logs contain accurate representations of an ideal set of processes. These ideal sets of processes imply that the information contained within the log represents what is really happening in a given environment. However, many of these event logs might contain noisy, infrequent, missing, or false process information that is generally classified as outliers. Extending beyond process discovery, there are many research efforts towards cleaning the event logs to deal with these outliers. In this paper, we present an approach that uses hidden Markov models to filter out outliers from event logs prior to applying any process discovery algorithms. Our proposed filtering approach can detect outlier behavior, and consequently, help process discovery algorithms return models that better reflect the real processes within an organization. Furthermore, we show that this filtering method outperforms two commonly used filtering approaches, namely the Matrix Filter approach and the Anomaly Free Automation approach for both artificial event logs and real-life event logs.  more » « less
Award ID(s):
1952247 1952225
NSF-PAR ID:
10311278
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
2021 IEEE 23rd Conference on Business Informatics (CBI)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Process mining is a technique for extracting process models from event logs. Event logs contain abundant information related to an event such as the timestamp of the event, the actions that triggers the event, etc. Much of existing process mining research has been focused on discoveries of process models behind event logs. How to uncover the timing constraints from event logs that are associated with the discovered process models is not well-studied. In this paper, we present an approach that extends existing process mining techniques to not only mine but also integrate timing constraints with process models discovered and constructed by existing process mining algorithms. The approach contains three major steps, i.e., first, for a given process model constructed by an existing process mining algorithm and represented as a workflow net, extract a time dependent set for each transition in the workflow net model. Second, based on the time dependent sets, develop an algorithm to extract timing constraints from event logs for each transition in the model. Third, extend the original workflow net into a time Petri net where the discovered timing constraints are associated with their corresponding transitions. A real-life road traffic fine management process scenario is used as a case study to show how timing constraints in the fine management process can be discovered from event logs with our approach. 
    more » « less
  2. Event logs contain abundant information, such as activity names, time stamps, activity executors, etc. However, much of existing trace clustering research has been focused on applying activity names to assist process scenarios discovery. In addition, many existing trace clustering algorithms commonly used in the literature, such as k-means clustering approach, require prior knowledge about the number of process scenarios existed in the log, which sometimes are not known aprior. This paper presents a two-phase approach that obtains timing information from event logs and uses the information to assist process scenario discoveries without requiring any prior knowledge about process scenarios. We use five real-life event logs to compare the performance of the proposed two-phase approach for process scenario discoveries with the commonly used k-means clustering approach in terms of model’s harmonic mean of the weighted average fitness and precision, i.e., the F1 score. The experiment data shows that (1) the process scenario models obtained with the additional timing information have both higher fitness and precision scores than the models obtained without the timing information; (2) the two-phase approach not only removes the need for prior information related to k, but also results in a comparable F1 score compared to the optimal k-means approach with the optimal k obtained through exhaustive search. 
    more » « less
  3. Obtaining an accurate specification of the access control policy enforced by an application is essential in ensuring that it meets our security/privacy expectations. This is especially important as many of real-world applications handle a large amount and variety of data objects that may have different applicable policies. We investigate the problem of automated learning of access control policies from web applications. The existing research on mining access control policies has mainly focused on developing algorithms for inferring correct and concise policies from low-level authorization information. However, little has been done in terms of systematically gathering the low-level authorization data and applications' data models that are prerequisite to such a mining process. In this paper, we propose a novel black-box approach to inferring those prerequisites and discuss our initial observations on employing such a framework in learning policies from real-world web applications. 
    more » « less
  4. The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from the web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving quality of search-related components, e.g., ranking, query suggestion, and recommendation. While researches were done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate performance of the proposed parallel log-mining framework. 
    more » « less
  5. null (Ed.)
    The growing availability of digital trace data has generated unprecedented opportunities for analyzing, explaining, and predicting the dynamics of process change. While research on process organization studies theorizes about process and change, and research on process mining rigorously measures and models business processes, there has so far been limited research that measures and theorizes about process dynamics. This gap represents an opportunity for new information systems research. This research note lays the foundation for such an endeavor by demonstrating the use of process mining for diachronic analysis of process dynamics. We detail the definitions, assumptions, and mechanics of an approach that is based on representing processes as weighted, directed graphs. Using this representation, we offer a precise definition of process dynamics that focuses attention on describing and measuring changes in process structure over time. We analyze process structure over two years at four dermatology clinics. Our analysis reveals process changes that were invisible to the medical staff in the clinics. This approach offers empirical insights that are relevant to many theoretical perspectives on process dynamics. 
    more » « less