Process Mining is a technique for extracting process models
from event logs. Event logs contain abundant explicit information related
to events, such as the timestamp and the actions that trigger the
event. Much of the existing process mining research has focused on
discovering the process models behind these event logs. However, process
mining relies on the assumption that event logs accurately represent an
ideal set of processes, i.e., that the information contained in a log
reflects what is really happening in a given environment. In practice,
many event logs contain noisy, infrequent, missing, or false process
information, generally classified as outliers. Extending beyond process
discovery, there is a growing body of research on cleaning event logs to
deal with these outliers. In this paper, we present an
approach that uses hidden Markov models to filter out outliers from event
logs prior to applying any process discovery algorithms. Our proposed
filtering approach can detect outlier behavior, and consequently, help
process discovery algorithms return models that better reflect the real
processes within an organization. Furthermore, we show that this filtering
method outperforms two commonly used filtering approaches, namely the
Matrix Filter approach and the Anomaly-Free Automaton approach, on
both artificial and real-life event logs.
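As a rough illustration of the filtering idea (not the paper's actual hidden Markov model formulation), the sketch below trains a simpler first-order Markov chain on the log, scores each trace by its average transition log-likelihood, and discards low-likelihood traces as outliers. The function names, the smoothing constant, and the threshold value are all illustrative assumptions.

```python
from collections import defaultdict
import math

def train_transition_probs(traces, smoothing=0.01):
    """Estimate first-order transition probabilities from event traces.

    A simplified stand-in for the paper's hidden Markov model: here the
    states are observed directly, so this is an ordinary Markov chain."""
    counts = defaultdict(lambda: defaultdict(float))
    alphabet = set()
    for trace in traces:
        padded = ["<s>"] + list(trace) + ["</s>"]
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1.0
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smoothing * len(alphabet)
        probs[a] = {b: (counts[a].get(b, 0.0) + smoothing) / total
                    for b in alphabet}
    return probs

def trace_log_likelihood(trace, probs):
    """Average per-transition log-likelihood of a trace under the model."""
    padded = ["<s>"] + list(trace) + ["</s>"]
    ll = 0.0
    for a, b in zip(padded, padded[1:]):
        ll += math.log(probs.get(a, {}).get(b, 1e-9))
    return ll / (len(padded) - 1)

def filter_outliers(traces, threshold=-0.3):
    """Keep traces whose average log-likelihood exceeds the threshold.

    The threshold is an illustrative value; in practice it would be
    tuned per log."""
    probs = train_transition_probs(traces)
    return [t for t in traces
            if trace_log_likelihood(t, probs) > threshold]
```

A rare transition (here `a -> x`) drags the trace's average log-likelihood far below that of the frequent behavior, so a single threshold separates the two.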
Mining Timing Constraints from Event Logs for Process Model
Process mining is a technique for extracting process models
from event logs. Event logs contain abundant information related to an
event, such as its timestamp and the actions that trigger it. Much of
the existing process mining research has focused on discovering the
process models behind event logs. However, how to uncover from event
logs the timing constraints associated with the discovered process
models is not well studied. In this paper, we present an approach
that extends existing process mining techniques to not only mine but
also integrate timing constraints with process models discovered and
constructed by existing process mining algorithms. The approach consists
of three major steps: first, for a given process model constructed by
an existing process mining algorithm and represented as a workflow net,
extract a time dependent set for each transition in the workflow net model.
Second, based on the time dependent sets, develop an algorithm to extract
timing constraints from event logs for each transition in the model. Third,
extend the original workflow net into a time Petri net where the discovered
timing constraints are associated with their corresponding transitions. A
real-life road traffic fine management process scenario is used as a case
study to show how timing constraints in the fine management process
can be discovered from event logs with our approach.
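The second step, extracting timing constraints per transition, can be pictured with a minimal sketch (not the paper's algorithm, which works over time-dependent sets of a workflow net): for each activity, collect the delays observed since the immediately preceding event in the same trace, and keep the min/max as an interval constraint. The log format and function name are assumptions.

```python
from collections import defaultdict

def mine_timing_constraints(log):
    """For each activity, record the (min, max) delay observed since the
    immediately preceding event in the same trace.

    `log` is a list of traces; each trace is a list of
    (activity, timestamp) pairs sorted by timestamp. This is a
    simplified stand-in for extraction over time-dependent sets."""
    delays = defaultdict(list)
    for trace in log:
        for (prev_act, prev_ts), (act, ts) in zip(trace, trace[1:]):
            delays[act].append(ts - prev_ts)
    return {act: (min(ds), max(ds)) for act, ds in delays.items()}
```

In a time Petri net, such a (min, max) pair would become the static firing interval attached to the corresponding transition.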
- Award ID(s):
- 1952247
- PAR ID:
- 10311282
- Date Published:
- Journal Name:
- 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Event logs contain abundant information, such as activity names, time stamps, activity executors, etc. However, much of the existing trace clustering research has focused on applying activity names to assist process scenario discovery. In addition, many trace clustering algorithms commonly used in the literature, such as the k-means clustering approach, require prior knowledge about the number of process scenarios in the log, which is sometimes not known a priori. This paper presents a two-phase approach that obtains timing information from event logs and uses it to assist process scenario discovery without requiring any prior knowledge about the process scenarios. We use five real-life event logs to compare the performance of the proposed two-phase approach with the commonly used k-means clustering approach in terms of the models' harmonic mean of the weighted average fitness and precision, i.e., the F1 score. The experiment data show that (1) the process scenario models obtained with the additional timing information have both higher fitness and precision scores than the models obtained without it; (2) the two-phase approach not only removes the need for prior information about k, but also yields an F1 score comparable to that of the optimal k-means approach, with the optimal k obtained through exhaustive search.
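One way to group traces by timing without fixing k in advance (a hypothetical sketch, not the paper's two-phase algorithm) is to sort trace durations and split wherever the jump between consecutive values is much larger than the typical jump; the number of groups then falls out of the data. The gap factor is an assumed tuning parameter.

```python
def group_by_duration(durations, gap_factor=3.0):
    """Split sorted durations into groups wherever the jump between
    consecutive values exceeds gap_factor times the median jump.
    Returns a list of groups; no prior number of groups is needed."""
    xs = sorted(durations)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    if not gaps:
        return [xs]
    median_gap = sorted(gaps)[len(gaps) // 2]
    groups, current = [], [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > gap_factor * max(median_gap, 1e-9):
            groups.append(current)
            current = []
        current.append(cur)
    groups.append(current)
    return groups
```

Each resulting duration group would then seed one candidate process scenario, to be refined in a second phase.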
-
Anonymization of event logs facilitates process mining while protecting sensitive information of process stakeholders. Existing techniques, however, focus on the privatization of the control-flow. Other process perspectives, such as roles, resources, and objects, are neglected or subject to randomization, which breaks the dependencies between the perspectives. Hence, existing techniques are not suited for advanced process mining tasks, e.g., social network mining or predictive monitoring. To address this gap, we propose PMDG, a framework to ensure privacy for multi-perspective process mining through data generalization. It provides group-based privacy guarantees for an event log, while preserving the characteristic dependencies between the control-flow and further process perspectives. Unlike existing privatization techniques that rely on data suppression or noise insertion, PMDG adopts data generalization: a technique where the activities and attribute values referenced in events are generalized into more abstract ones, to obtain equivalence classes that are sufficiently large from a privacy point of view. We demonstrate empirically that PMDG outperforms state-of-the-art anonymization techniques when mining handovers and predicting outcomes.
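The core generalization idea can be sketched as follows (a toy illustration in the spirit of data generalization, not the actual PMDG algorithm): rare activity labels are replaced by their parent in an assumed generalization hierarchy until every label's equivalence class reaches a minimum size k.

```python
from collections import Counter

def generalize_events(events, hierarchy, k):
    """Replace each activity with its parent in `hierarchy` until every
    label occurs at least k times, giving a group-based size guarantee.

    `hierarchy` maps a label to its more abstract parent; labels with
    no parent are left as-is once no further generalization helps."""
    labels = list(events)
    while True:
        counts = Counter(labels)
        rare = {a for a, c in counts.items() if c < k}
        if not rare:
            return labels
        new = [hierarchy.get(a, a) if a in rare else a for a in labels]
        if new == labels:  # no further generalization possible
            return labels
        labels = new
```

Because labels are coarsened rather than suppressed or perturbed, co-occurrence dependencies with other event attributes survive in generalized form.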
-
The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. While research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
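The session-reconstruction step mentioned above is commonly done with a timeout heuristic; the single-machine sketch below (an illustration, not the paper's Spark-based framework, and with an assumed record format and timeout) starts a new session whenever a user's gap between consecutive requests exceeds a timeout.

```python
from collections import defaultdict

def reconstruct_sessions(records, timeout=1800):
    """Group (user, timestamp, url) records into per-user sessions: a new
    session starts when the gap between a user's consecutive requests
    exceeds `timeout` seconds (30 minutes is a common default)."""
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))
    sessions = []
    for user, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions
```

In a distributed setting, the grouping by user is where a partitioner matters: skewed users produce skewed partitions, which is the imbalance problem the abstract's logPartitioner targets.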
-
Decision diagrams are a well-established data structure for reachability set generation and model checking of high-level models such as Petri nets, due to their versatility and the availability of efficient algorithms for their construction. Using a decision diagram to represent the transition relation of each event of the high-level model, the saturation algorithm can be used to construct a decision diagram representing all states reachable from an initial set of states, via the occurrence of zero or more events. A difficulty arises in practice for models whose state variable bounds are unknown, as the transition relations cannot be constructed before the bounds are known. Previously, on-the-fly approaches have constructed the transition relations along with the reachability set during the saturation procedure. This can affect performance, as the transition relation decision diagrams must be rebuilt, and compute-table entries may need to be discarded, as the size of each state variable increases. In this paper, we introduce a different approach based on an implicit and unchanging representation for the transition relations, thereby avoiding the need to reconstruct the transition relations and discard compute-table entries. We modify the saturation algorithm to use this new representation, and demonstrate its effectiveness with experiments on several benchmark models.
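For intuition, the set that saturation computes symbolically is the ordinary fixed point of applying every event's transition relation until no new states appear. The explicit-state sketch below shows only that fixed point (decision diagrams and the saturation order, the abstract's actual contribution, are not modeled here); states and events are assumed toy placeholders.

```python
def reachable_states(initial, events):
    """Explicit fixed-point reachability: repeatedly apply each event's
    successor function to the frontier until no new states appear.

    `events` is a list of functions mapping a state to a list of
    successor states (an empty list when the event is disabled).
    Saturation computes the same set symbolically over decision
    diagrams, in a locality-aware order."""
    seen = set(initial)
    frontier = set(initial)
    while frontier:
        nxt = set()
        for state in frontier:
            for fire in events:
                for succ in fire(state):
                    if succ not in seen:
                        seen.add(succ)
                        nxt.add(succ)
        frontier = nxt
    return seen
```

A bounded counter (increment disabled at the bound, decrement disabled at zero) stands in for a one-place Petri net.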