Monday, 5 June 2023 Nikolaus AUGSTEN: Data Management Meets Process Mining - Scaling Trace Clustering to Large Data
This time, exceptionally on Wednesday, 7 June 2023, at 14:30, a lecture will be held
as part of the MONDAY SEMINAR OF COMPUTER SCIENCE AND INFORMATICS
of the Departments of Information Sciences and Technologies at UP FAMNIT and UP IAM.
TIME/PLACE: 7 June 2023 at 14:30 in FAMNIT-VP3.
----------------------------------------------
LECTURER: Nikolaus AUGSTEN
----------------------------------------------
Nikolaus Augsten is a full professor in computer science at the University of Salzburg, where he heads the Database Group. He received his PhD from Aalborg University, Denmark, in 2008. His research deals with all aspects of data management. The focus is on queries over complex objects and massive data collections, data cleaning and integration, indexing techniques, query processing and optimization, distributed data management, and numerical computations in databases. His research is triggered by problems that arise in concrete applications, for example, process mining, digital humanities, or cognitive neuroscience. The results of his research have been published in the most prestigious outlets of the database field; for his work on top-k queries over tree data he received the ICDE 2010 Best Paper Award. He has served as a PC member or referee for all major database conferences and as an associate editor of the VLDB Journal.
---------------------------------------------------------------------------------------------------------------------------
TITLE: Data Management Meets Process Mining - Scaling Trace Clustering to Large Data
---------------------------------------------------------------------------------------------------------------------------
ABSTRACT:
With the broad adoption of process mining techniques in industry, process mining tools now face massive data volumes with process logs that can store hundreds of millions of activities. Process mining queries often require fast response times since the user interacts with a dashboard to gain business insights. To analyze business processes, the so-called traces of the processes are inspected. A trace is a sequence of activities observed in the process log. To facilitate the analysis, similar traces should be grouped into clusters using the well-known DBSCAN algorithm. Unfortunately, current trace clustering approaches do not scale to large collections of traces, neither in terms of runtime nor in terms of memory usage.
We present two novel techniques that solve the scalability issue of the trace clustering problem. (1) TwoL is a new, highly effective similarity index for traces. Compared to previous techniques, TwoL optimizes a cost function to gracefully adapt to different data distributions. (2) Spread is the first linear-space DBSCAN algorithm that can process data points in any user-defined order. This is required to leverage the full potential of trace indices. Previous approaches must precompute and materialize all neighborhoods of the data points, which requires quadratic space. With respect to the state of the art in trace clustering, our techniques reduce the memory complexity and achieve speedups of more than an order of magnitude.
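For readers unfamiliar with the setting, the following is a minimal sketch of DBSCAN applied to traces. It is textbook DBSCAN, not the TwoL index or the Spread algorithm from the talk, and it makes no linear-space guarantee. It assumes edit distance as the trace similarity measure and uses a linear scan where a trace index would normally be queried. The function names, the sample traces, and the parameters `eps` and `min_pts` are illustrative assumptions.

```python
from collections import deque

NOISE = -1  # label for traces that belong to no cluster


def edit_distance(a, b):
    """Levenshtein distance between two traces (sequences of activity labels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def neighbors(traces, i, eps):
    """Indices of all traces within distance eps of traces[i], including i.

    Assumption: a plain linear scan stands in for a similarity index.
    """
    return [j for j, t in enumerate(traces)
            if edit_distance(traces[i], t) <= eps]


def dbscan(traces, eps, min_pts):
    """Textbook DBSCAN; each neighborhood is computed on demand."""
    labels = [None] * len(traces)
    cluster = 0
    for i in range(len(traces)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(traces, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster            # i is a core point: start a cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(traces, j, eps)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)   # j is also a core point: expand
        cluster += 1
    return labels


# Hypothetical traces, made up for illustration only.
traces = [
    ("register", "check", "approve", "pay"),
    ("register", "check", "approve", "pay"),
    ("register", "check", "reject"),
    ("register", "check", "approve", "pay", "archive"),
    ("login", "browse", "logout"),
    ("login", "browse", "browse", "logout"),
    ("login", "logout"),
]
print(dbscan(traces, eps=1, min_pts=3))  # [0, 0, -1, 0, 1, 1, 1]
```

In this sketch every call to `neighbors` compares one trace against the whole collection, which is the step a trace index is meant to speed up.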
The seminar will be held in English, starting at 14:30 in lecture room FAMNIT-VP3.
You are cordially invited!