This document presents a method for executing federated graph pattern queries on dispersed and heterogeneous raw log data by dynamically constructing virtual knowledge graphs (VKGs). The approach extracts only relevant log messages on demand, integrates log events into a common graph, federates queries across endpoints, and links results to background knowledge. The architecture includes modules for log parsing and query processing, and a prototype implementation demonstrates the approach for security analytics use cases. An evaluation analyzes how query execution time depends on factors such as the number of extracted log lines and the number of queried hosts.
1. Virtual Knowledge Graphs for Federated Log Analysis
Kabul Kurniawan (WU Wien, Uni Wien)
Andreas Ekelhart (WU Wien)
Elmar Kiesling (WU Wien)
Dietmar Winkler (TU Wien)
Gerald Quirchmayr (Uni Wien)
A Min Tjoa (TU Wien)
This work was funded by the Austrian Science Fund (FWF) and netidee SCIENCE under grant P30437-N31, as well as the
Austrian Research Promotion Agency FFG under grant 877389 (OBARIS).
Vienna, August 17 – 20, 2021
2. Motivation
Mar 9 12:00:45 Client02 systemd-logind[1201]: New seat seat0.
Mar 9 12:00:45 Client02 systemd-logind[1201]: Watching system buttons on /dev/input/event0 (Power Button)
Mar 9 12:00:45 Client02 systemd-logind[1201]: Watching system buttons on /dev/input/event3 (AT Translated Set 2 keyboard)
Mar 9 12:00:45 Client02 systemd-logind[1201]: Watching system buttons on /dev/input/event1 (AT Translated Set 2 keyboard)
Mar 9 12:00:45 Client02 sshd[1281]: Server listening on 0.0.0.0 port 22.
Mar 9 12:00:45 Client02 sshd[1281]: Server listening on :: port 22.
Mar 9 12:10:50 Client02 sshd[2124]: Accepted password for jhalley from 185.81.215.145 port 52410 ssh2
Mar 9 12:10:50 Client02 sshd[2124]: pam_unix(sshd:session): session opened for user jhalley by (uid=0)
Mar 9 12:10:50 Client02 systemd-logind[1201]: New session 1 of user jhalley.
Mar 9 12:10:50 Client02 systemd: pam_unix(systemd-user:session): session opened for user jhalley by (uid=0)
Mar 9 12:15:34 Client02 sshd[2555]: Did not receive identification string from 51.68.71.229 port 38508
Mar 9 12:15:48 Client02 sudo: jhalley : TTY=pts/0 ; PWD=/home/jhalley ; USER=root ; COMMAND=/usr/bin/apt-get update
Mar 9 12:15:48 Client02 sudo: pam_unix(sudo:session): session opened for user root by jhalley(uid=0)
Mar 9 12:15:57 Client02 sudo: pam_unix(sudo:session): session closed for user root
Mar 9 12:15:59 Client02 sudo: jhalley : TTY=pts/0 ; PWD=/home/jhalley ; USER=root ; COMMAND=/usr/bin/apt-get install xfce4
Mar 9 12:15:59 Client02 sudo: pam_unix(sudo:session): session opened for user root by jhalley(uid=0)
Mar 9 12:17:01 Client02 CRON[4546]: pam_unix(cron:session): session opened for user root by (uid=0)
Mar 9 12:17:01 Client02 CRON[4546]: pam_unix(cron:session): session closed for user root
Mar 9 12:18:06 Client02 groupadd[6959]: group added to /etc/group: name=rtkit, GID=115
Mar 9 12:18:06 Client02 groupadd[6959]: group added to /etc/gshadow: name=rtkit
Mar 9 12:18:06 Client02 groupadd[6959]: new group: name=rtkit, GID=115
Mar 9 12:18:06 Client02 useradd[6963]: new user: name=rtkit, UID=111, GID=115, home=/proc, shell=/usr/sbin/nologin
Mar 9 12:18:06 Client02 usermod[6969]: change user 'rtkit' password
Mar 9 12:18:06 Client02 chage[6974]: changed password expiry for rtkit
Mar 9 12:18:06 Client02 chfn[6977]: changed user 'rtkit' information
Mar 9 12:18:11 Client02 useradd[7149]: new user: name=usbmux, UID=112, GID=46, home=/var/lib/usbmux, shell=/usr/sbin/nologin
Mar 9 12:18:11 Client02 usermod[7155]: change user 'usbmux' password
Mar 9 12:18:11 Client02 chage[7160]: changed password expiry for usbmux
Mar 9 12:18:11 Client02 chfn[7163]: changed user 'usbmux' information
Mar 9 12:18:24 Client02 groupadd[7508]: group added to /etc/group: name=pulse, GID=116
Mar 9 12:18:24 Client02 groupadd[7508]: group added to /etc/gshadow: name=pulse
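The raw lines on this slide follow the classic syslog layout (timestamp, host, process with optional PID, free-text message). As a rough illustration of the parsing effort that any log analysis has to spend before such lines can be queried, the following Python sketch turns one line into a structured record. The function name `parse_syslog_line`, the dictionary keys, and the default year are assumptions made for this example (syslog timestamps carry no year); this is not the parser used in the prototype.

```python
import re
from datetime import datetime

# Classic syslog layout: "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>".
# The "[pid]" part is optional; "sudo:" and "systemd:" lines carry none.
SYSLOG_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<process>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<message>.*)$"
)

# Message pattern for successful SSH logins, as seen in the sample above.
SSH_ACCEPTED = re.compile(
    r"^Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+) port (?P<port>\d+)"
)


def parse_syslog_line(line, year=2021):
    """Return a dict describing one syslog line, or None if it does not match.

    The year is an assumption supplied by the caller; the log itself omits it.
    """
    match = SYSLOG_LINE.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(
        f"{year} {match['month']} {match['day']} {match['time']}",
        "%Y %b %d %H:%M:%S",
    )
    record = {
        "timestamp": timestamp,
        "host": match["host"],
        "process": match["process"],
        "pid": int(match["pid"]) if match["pid"] else None,
        "message": match["message"],
    }
    # Enrich SSH login events with the fields hidden in the free-text message.
    login = SSH_ACCEPTED.match(record["message"])
    if login is not None:
        record["login"] = {
            "user": login["user"],
            "ip": login["ip"],
            "port": int(login["port"]),
        }
    return record
```

Applied to the "Accepted password for jhalley" line above, the record exposes the user name and source IP address as separate fields, which a plain text search over the raw file cannot offer.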
4. Existing Solutions
▪ Centralized Log Management
▪ Ingests log sources from multiple endpoints, then parses and indexes them in a central database for analysis. [Kotenko et al., 2013]
▪ Bandwidth-intensive and computationally demanding. [Grimaila et al., 2012], [Guillermo, 2013]
▪ Decentralized Log Analysis
▪ Partly shifts the computational workload (log pre-processing and analysis) to the log-producing hosts. [Grimaila et al., 2012]
▪ Used primarily for correlation and alerting rather than for querying dispersed log data. [Krugel et al., 2001]
▪ Continuously ingesting all log data may consume substantial resources on the local endpoints.
▪ Current solutions lack semantic relations between entities [Oliner et al., 2012], which makes it difficult to:
▪ Integrate partial and isolated views on system states.
▪ Contextualize, link, and query log data.
5. Requirements
R1. Resource-efficiency
▪ Avoid unnecessary log processing and minimize resource requirements (storage space and network bandwidth).
R2. Aggregation and integration over multiple endpoints
▪ Concurrently execute queries across federated endpoints and deliver the combined results.
R3. Contextualization & Background-Linking
▪ Ability to contextualize, integrate and link to background knowledge.
R4. Standards-based query language
▪ Use of an expressive, standardized query language.
6. Proposed Approach
Virtual Knowledge Graph (VKG) for Federated Log Analysis
A method to execute federated, graph pattern-based queries on dispersed, heterogeneous raw log data by dynamically constructing virtual knowledge graphs.
We introduce a method that:
▪ Extracts only potentially relevant log messages, and only on demand.
▪ Integrates the dispersed log events into a common graph.
▪ Federates graph pattern-based queries across endpoints.
▪ Links the results to background knowledge.
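The four steps of the method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype's implementation: the `RAW_LOGS` dictionary stands in for log files that stay on each host, the `Client01` line is made up for the example, the `ex:` names are a hypothetical vocabulary, and a thread pool stands in for network calls to remote endpoints.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

# Stand-in for raw log files residing on each host (assumption for this sketch).
# The Client01 line is invented; the Client02 lines are taken from the sample log.
RAW_LOGS = {
    "Client01": [
        "Mar 9 12:11:02 Client01 sshd[901]: Accepted password for jhalley from 185.81.215.145 port 52411 ssh2",
    ],
    "Client02": [
        "Mar 9 12:10:50 Client02 sshd[2124]: Accepted password for jhalley from 185.81.215.145 port 52410 ssh2",
        "Mar 9 12:15:34 Client02 sshd[2555]: Did not receive identification string from 51.68.71.229 port 38508",
        "Mar 9 12:17:01 Client02 CRON[4546]: pam_unix(cron:session): session opened for user root by (uid=0)",
    ],
}

# Hypothetical background knowledge about the hosts.
BACKGROUND = {
    ("ex:Client01", "ex:locatedIn", "ex:OfficeNetwork"),
    ("ex:Client02", "ex:locatedIn", "ex:OfficeNetwork"),
}


def extract_on_demand(host, start, end, keyword, year=2021):
    """Step 1: scan one host's raw lines and keep only those the query needs."""
    triples = []
    for number, line in enumerate(RAW_LOGS[host]):
        month, day, clock, _host, message = line.split(None, 4)
        timestamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        if start <= timestamp <= end and keyword in message:
            event = f"ex:{host}/event{number}"
            # Step 2: express the event in a common graph vocabulary.
            triples.append((event, "ex:timestamp", timestamp.isoformat()))
            triples.append((event, "ex:originatesFrom", f"ex:{host}"))
            triples.append((event, "ex:message", message))
    return triples


def federated_query(hosts, start, end, keyword):
    """Step 3: run the extraction on all hosts concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        parts = list(pool.map(lambda h: extract_on_demand(h, start, end, keyword), hosts))
    graph = set()
    for part in parts:
        graph.update(part)
    # Step 4: link the extracted events to background knowledge.
    return graph | BACKGROUND
```

Because the time window and keyword are applied while scanning, lines that cannot contribute to the answer never leave the host, which is the resource-efficiency argument behind on-demand extraction.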
7. Virtual Knowledge Graph Concept
▪ Data Virtualization (V)
▪ No actual data sources are exposed.
▪ The integrated data is not materialized.
▪ Graph Representation (G)
▪ Nodes: representations of objects and data values.
▪ Edges: relations between nodes.
▪ Domain Knowledge (K)
▪ Concept and property hierarchies.
▪ Domain and range of properties.
Source: Guohui Xiao et al., 2019
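To make the G and K parts of the concept concrete, the sketch below stores a tiny graph as subject-predicate-object triples and uses a concept hierarchy plus domain and range declarations to answer a type query. All `ex:` names, the hierarchy, and the function names are hypothetical and chosen only for this example. Inferred types are computed at query time and never written back, which loosely mirrors the "no materialization" idea of the V part.

```python
# Graph (G): nodes connected by labeled edges, stored as triples.
GRAPH = {
    ("ex:event1", "rdf:type", "ex:SshLogin"),
    ("ex:event1", "ex:hasUser", "ex:jhalley"),
    ("ex:event2", "rdf:type", "ex:SudoCommand"),
    ("ex:event2", "ex:hasUser", "ex:jhalley"),
}

# Domain knowledge (K): a concept hierarchy (hypothetical vocabulary).
SUBCLASS_OF = {
    "ex:SshLogin": "ex:AuthenticationEvent",
    "ex:SudoCommand": "ex:PrivilegeEvent",
    "ex:AuthenticationEvent": "ex:LogEvent",
    "ex:PrivilegeEvent": "ex:LogEvent",
}

# Domain knowledge (K): domain and range of properties.
DOMAIN = {"ex:hasUser": "ex:LogEvent"}
RANGE = {"ex:hasUser": "ex:User"}


def superclasses(cls):
    """Return cls followed by all of its ancestors in the concept hierarchy."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def instances_of(cls, graph):
    """Find all nodes of a class, inferring types at query time."""
    found = set()
    for subject, predicate, obj in graph:
        # Explicit type, lifted through the concept hierarchy.
        if predicate == "rdf:type" and cls in superclasses(obj):
            found.add(subject)
        # The subject of a property belongs to the property's domain.
        if predicate in DOMAIN and cls in superclasses(DOMAIN[predicate]):
            found.add(subject)
        # The object of a property belongs to the property's range.
        if predicate in RANGE and cls in superclasses(RANGE[predicate]):
            found.add(obj)
    return found
```

A query for `ex:LogEvent` returns both events although neither is typed that way explicitly, and `ex:jhalley` is recognized as an `ex:User` purely from the range of `ex:hasUser`.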
21. Evaluation
Evaluation Setup:
• Machines: Microsoft Azure virtual machines, with a Linux host (2.59 GHz vCPU, 16 GB RAM) and a Windows host for log analysis (2.90 GHz CPU, 16 GB RAM).
• Dataset: AIT log dataset (V1.1), which simulates six days of user access across multiple web servers.
• We split large log files into smaller files.
• We report the average time over five runs for each experiment.
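The "average over five runs" protocol can be reproduced with a small timing helper. The following Python sketch is an assumption about how such a measurement could be taken, not the evaluation harness used for the reported numbers; `average_runtime` and its `run_query` callback are hypothetical names.

```python
import statistics
import time


def average_runtime(run_query, runs=5):
    """Execute run_query several times and return (mean seconds, all durations)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic clock suited to short intervals
        run_query()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), durations
```

Keeping the individual durations alongside the mean makes it easy to spot outliers, for example a slow first run caused by cold caches.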
23. Evaluation – Multiple Hosts
Experiment (time frame): query execution time in a federated setting for different time frames.
Evaluation Setup:
• Machines: Microsoft Azure virtual machines with seven hosts (4 Windows and 3 Linux, 2.59 GHz vCPU, 16 GB RAM) and a Windows host for log analysis (2.90 GHz CPU, 16 GB RAM).
• Dataset: Apache logs from the AIT log dataset (V1.1).
• Hosts 1 to 4 store the data from the original four servers in the dataset.
• We report the average time over five runs for each experiment.
24. Conclusion
▪ A novel approach for federated log analysis based on virtual knowledge graphs.
▪ A prototype and vocabularies, demonstrated on security analytics use cases.
▪ Evaluation: the log processing time is primarily a function of the number of extracted (relevant) log lines and the number of queried hosts.
Limitations:
▪ The query parameters should restrict the number of extracted log lines.
▪ Not a replacement for existing SIEM systems.
Future Work:
▪ Improved query analysis (e.g., automatic selection of hosts and background knowledge).
▪ Streaming-based virtual knowledge graphs for log monitoring.
25. References
▪ A. Oliner, A. Ganapathi, and W. Xu. 2012. Advances and Challenges in Log Analysis. Communications of the ACM 55, 2 (2012).
▪ Igor Kotenko, Olga Polubelova, Andrey Chechulin, and Igor Saenko. 2013. Design and Implementation of a Hybrid Ontological-Relational Data Repository for SIEM Systems. Future Internet 5, 3 (July 2013), 355–375. https://doi.org/10.3390/fi5030355
▪ Christopher Krügel, Thomas Toth, and Clemens Kerer. 2002. Decentralized Event Correlation for Intrusion Detection. In Information Security and Cryptology — ICISC 2001, Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, and Kwangjo Kim (Eds.), Vol. 2288. Springer Berlin Heidelberg, Berlin, Heidelberg, 114–131. https://doi.org/10.1007/3-540-45861-1_10
▪ Guillermo Suárez de Tangil and Esther Palomar. 2013. Advances in Security Information Management: Perceptions and Outcomes. NovaScience Publishers, Incorporated, Commack, NY, USA.
▪ Michael R Grimaila, Justin Myers, Robert F Mills, and Gilbert Peterson. 2012. Design and Analysis of a Dynamically Configured Log-based Distributed Security Event Detection Methodology. The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology 9, 3 (July 2012), 219–241. https://doi.org/10.1177/1548512911399303