DETECTING PROBLEMS IN INDUSTRIAL NETWORKS THROUGH CONTINUOUS MONITORING
Jan Seidl 1
Marcelo Ayres Branquinho 2
SUMMARY
Automation networks host a range of real-time applications and data, making continuous monitoring of the quality of service necessary. QoS (Quality of Service) parameters address priorities, bandwidth allocation and network latency control. There are several QoS parameters that characterize a computer network and can be used for monitoring purposes.
Each SCADA network, in a healthy state, presents a specific QoS profile that rarely changes, given the repetitive nature of IACS operations. Continuous monitoring of the QoS parameters of an automation network may anticipate problems such as malware contamination and failures of equipment like switches and routers. It is very important to be aware of these changes in behavior in order to receive alerts and handle them promptly, avoiding incidents that could compromise the operation of the network and be financially or environmentally costly.
In addition to monitoring network traffic, it is also necessary to monitor resource consumption on critical servers: CPU, memory, storage capacity and hard disk failures, among others.
This work aims to establish a method by which SCADA security professionals can differentiate and qualify any problems that may be occurring, through continuous monitoring of the automation network's performance parameters, taking a more behavioral approach than current signature-based ones.
We present a series of tests conducted in our laboratories to measure the performance parameters of a simulated automation network using a small SCADA network sandbox. First we measured the normal operating parameters of the network and collected its main graphs with the appropriate tools. In a second step we launched several attacks against the simulated automation network, collecting the network's operating parameters and main graphs throughout.
At the conclusion of the work we compare the graphs of the network in a healthy state with the graphs of the network under the security incidents described above. We detail how the network parameters were affected by each kind of incident and present a table showing how the main parameters of an automation network were affected by the attacks.
Keywords: Monitoring, SCADA, Security, Malware, Attacks.
1 CTO at TI Safe Segurança da Informação Ltda, Brazil (http://br.linkedin.com/in/janseidl)
2 CEO at TI Safe Segurança da Informação Ltda, Brazil (http://br.linkedin.com/in/marcelobranquinho)
1 ABOUT AUTOMATION NETWORK MONITORING
Automation network monitoring is the term used to describe a system that continuously monitors an automation network and notifies the network administrator when a device fails or an outage occurs. Notification is normally delivered through messaging systems (usually e-mail and SNMP traps). Network monitoring is normally performed with dedicated software applications and/or commercial tools. The ping command, for example, is a simple type of network monitoring tool.
1.1 ASSETS TO BE MONITORED
To monitor a control system we first need to know exactly what devices exist on the network and how they communicate with each other. Almost every piece of networked hardware in an industrial plant can be monitored. From SCADA servers and supervisory/control stations to PLCs, numerous items can be monitored to help us prevent and quickly respond to incidents.
2 PREPARING THE MONITORING ENVIRONMENT
Monitoring data can be intensive and frequent. Given that, it is strongly recommended to create a separate network segment exclusively for monitoring. This prevents monitoring data from interfering with legitimate control/supervisory traffic at the network level and helps isolate that traffic from sniffing and other attacks.
The appropriate number of servers must be set up according to the number of assets and locations to be monitored. Most solutions can operate in high-availability and high-performance clustered modes.
It's important to determine each asset's processing and network capacity in order to decide whether an agent-based approach can be used or whether passive monitoring (ping or SNMP) must be used instead.
Keep in mind that monitoring solutions usually pair with a database back end, and for performance reasons databases should never share their data hard disk with another application.
Writing up an industrial traffic matrix is also recommended, in order to visually determine which assets need to communicate with which other assets and with which function codes, so the monitoring triggers can be tuned.
Below is an example of an industrial traffic matrix:

Source       | Destination | Function Codes
192.168.1.15 | 192.168.1.1 | 3, 16
192.168.1.18 | 192.168.1.1 | 3

Table 1: Sample industrial traffic flow matrix
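The matrix maps directly onto monitoring logic as an allow-list of (source, destination, function code) tuples. A minimal Python sketch, using the addresses and function codes from Table 1 (the `is_allowed` helper is ours, for illustration):

```python
# Allow-list derived from Table 1: (source, destination) -> permitted function codes
ALLOWED_FLOWS = {
    ("192.168.1.15", "192.168.1.1"): {3, 16},
    ("192.168.1.18", "192.168.1.1"): {3},
}

def is_allowed(src: str, dst: str, func_code: int) -> bool:
    """Return True when this flow and function code appear in the traffic matrix."""
    return func_code in ALLOWED_FLOWS.get((src, dst), set())
```

A monitoring trigger would then fire on any observed transaction for which `is_allowed` returns False.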
2.1 MONITORING THE MACHINE HEALTH
Machine health monitoring can help prevent issues that could interrupt a program's operation and disrupt supervisory or control activities. With active performance monitoring, issues can be predicted and resolved before they cause an outage.
Commonly monitored items include free/used disk space (applications may crash if they cannot write temporary files), disk I/O (may indicate low memory [paging] or data extraction), logged-on users, number of failed login attempts (may indicate system compromise), number of incoming and outgoing connections, incoming/outgoing packet rates (may indicate data extraction, illegal connections or even malware), and CPU and memory usage (may indicate worms/rootkits).
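As a sketch of what one of these checks looks like when scripted outside a monitoring agent, free disk space can be measured with Python's standard library alone (the 10% threshold below is an arbitrary example, not a recommendation from this paper):

```python
import shutil

def disk_free_percent(path: str = "/") -> float:
    """Percentage of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def low_disk_trigger(path: str = "/", threshold_pct: float = 10.0) -> bool:
    """Fire when free space drops below the threshold percentage."""
    return disk_free_percent(path) < threshold_pct
```

In practice a monitoring agent such as Zabbix exposes equivalent built-in items, so a custom script like this is only needed for checks the agent doesn't cover.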
2.2 MONITORING OPERATING SYSTEM ERRORS
Error monitoring can be very useful for anticipating hardware failures. Since industrial plants require a scheduled stop to perform hardware maintenance, it is better to act within that window than in a hurry after the hardware fails in production. Errors that can indicate impending hardware failure include memory commit/allocation errors, disk read/write errors, CPU temperature and fan speeds, disk temperatures, memory temperature, and so on.
2.3 MONITORING PROCESSES
Key processes can also be monitored to verify they haven't crashed, so the crew can be alerted the moment one crashes, or the process can be restarted automatically (this must be used with extreme caution because it may cause inconsistencies).
Known suspicious process names and ports can also be monitored, such as RDP, HTTP/HTTPS, TeamViewer, "cmd.exe" and Windows PowerShell processes, triggering an alert when present, since they could indicate unauthorized remote access.
2.4 MONITORING HIGH AVAILABILITY
Communication link states can be checked to detect when the plant's network enters a contingency state. The monitoring agent can perform automated tasks if needed.
2.5 MONITORING MODBUS TRAFFIC
You can set up a host to act as a network sniffer, mirroring all Modbus traffic to that host's switch port. A simple Modbus sniffer can be built using pure Python and Scapy in order to dump out function codes, sources and destinations.
With Modbus monitoring you can create alerts on disallowed function codes, tag values and sources, and also obtain a graphical representation of the frequency of commands sent and received.
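The decoding core of such a sniffer is small. Below is a minimal sketch of a Modbus/TCP frame parser, following the standard framing (a 7-byte MBAP header, then the function code and data); hooking it into Scapy's `sniff()` callback and the alerting logic is omitted:

```python
import struct

def parse_modbus_tcp(payload: bytes) -> dict:
    """Decode a Modbus/TCP ADU: MBAP header (transaction id, protocol id,
    length, unit id), then function code and data."""
    if len(payload) < 8:
        raise ValueError("frame too short for MBAP header + function code")
    trans_id, proto_id, length, unit_id = struct.unpack(">HHHB", payload[:7])
    return {
        "transaction_id": trans_id,
        "protocol_id": proto_id,              # always 0 for Modbus
        "unit_id": unit_id,
        "func_code": payload[7],
        "data": payload[8:6 + length].hex(),  # `length` counts unit id + PDU bytes
    }
```

For example, a read-holding-registers request frame carrying the PDU `0330000064` decodes to function code 3, with the remaining bytes naming the starting register and the register count.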
2.6 PLC SNMP TRAPS MONITORING
Some PLCs offer SNMP (Simple Network Management Protocol) monitoring, exposing items like network I/O, discarded packets, unknown protocols received, network errors, allocation tables (useful against ARP poisoning) and fragmentation. These indicators (especially the error counters) can promptly tell if something is happening.
2.7 PLC ICMP (PING) MONITORING
For PLCs that do not support SNMP monitoring, simple ping monitoring can be used to detect device connectivity, and also response times that could indicate a device overload due to a DoS attack.
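Where ICMP is blocked, or the check must run unprivileged, the same liveness test can be approximated with a plain TCP connect to the Modbus port. This substitution is ours, for illustration only; the tests in this paper use ICMP ping:

```python
import socket

def is_reachable(host: str, port: int = 502, timeout: float = 1.0) -> bool:
    """Consider the device online when a TCP connection to its Modbus
    port succeeds within the timeout; callers fire a trigger on False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```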
2.8 DISTRIBUTED MONITORING
Zabbix (open source software) can be configured for distributed monitoring. That means every automation plant can have its own monitoring station reporting data up to a central station. This can be very useful, as you can have self-regulated distributed monitoring stations reporting to the company's main office monitoring station.
2.9 CHECKING FREQUENCY
Depending on the load a machine carries, items can be configured to be polled at a defined interval. Servers with lighter load can have more frequent checks (every 15 or 30 seconds) and servers with higher load can have more spaced checks (1 minute or more) to preserve the machine's computational power and bandwidth.
2.10 ALERTING
Besides monitoring and plotting data, e-mail, SMS or Jabber (the original name of the Extensible Messaging and Presence Protocol, XMPP, the open technology for instant messaging and presence) alerts can be configured to notify the response team. With a little effort, alerts can also trigger sounds or any other alerting method.
2.11 ESCALATIONS
Alerts can also be configured to escalate to other people in case the primary response team takes too long to resolve the issue. If a trigger remains active after a configured time, e-mail alerts can be automatically sent to the main office's response team, or even to the manufacturer's response team, gradually climbing the hierarchy tree until the problem is solved.
2.12 ITEM GROUP ALERTING
Items that share the same role can be grouped together, making alerting more targeted. You can have all database-related items trigger alerts for the database team, SCADA-related items for the automation team, and so on. Escalation can be applied here too, so higher support levels are contacted if the first team cannot solve the issue in time.
3 THE TEST BED
"Test bed" is the name given to a structured test platform for running experiments in a safe and controlled manner. The structure used for this work is composed of elements that emulate the behavior of a real automation network and represent a replica of real-world industrial processes. Due to factors specific to SCADA environments, such as the criticality of real-time systems and the need for uninterrupted availability, test beds are ideal platforms for observing system behavior and analyzing the components of control systems.
3.1 THE TEST ENVIRONMENT
The test structure in the TI Safe laboratory includes the field equipment, consisting of a Wago 741-800 PLC and hardware simulating an industrial natural gas plant (the Tofino SCADA Security Simulator), a physical Windows 7 station acting as the supervisory station, a monitoring server (Debian Linux 6) and a Modbus traffic sniffer server (Debian Linux 6 with a Python script + Scapy), the latter two being virtual machines.
Picture 1: The SCADA Security Simulator used for the tests
3.1.1 THE TEST NETWORK
The configured network has no segmentation by subnetting, routing or VLANs. All connected equipment is on the same "flat" network within the same IP address range (192.168.1.0/24).
Diagram 1: The test network
3.1.2 THE ATTACKER MACHINE
The attacker machine is an HP laptop directly connected to the switch (not shown in the diagram above), running Kali Linux 1.0 from a Live-CD. Below is the list of software used in the tests:
Software / Tool | Description              | Attack                               | Author
Hping3          | ICMP flood tool          | Network Layer 3 denial of service    | http://www.hping.org/
T50             | Flood tool               | Network Layer 3 denial of service    | https://github.com/merces/t50
Meterpreter     | Remote access shell      | Remote compromise, malware infection | http://www.metasploit.com/
Arpspoof        | ARP poison/spoofing tool | ARP poison                           | http://arpspoof.sourceforge.net/
Pymodbus        | Modbus python library    | Unauthorized modbus traffic          | https://github.com/bashwork/pymodbus

Table 2: List of software used on tests
3.1.3 THE MONITORING SERVER
The monitoring server is built on top of a Debian Linux machine running the open source monitoring solution Zabbix 2.0.6, with MySQL 5.1 as the data back end. Ideally, in a production environment, these monitoring back ends would be split across servers for performance and isolation reasons.
Diagram 2: Monitoring data flow. Arrow indicates if data is remotely collected or sent by
agent on asset.
Data is collected either actively by Zabbix agents or passively by ICMP and SNMP queries. The collected data is then fed directly to Zabbix, where it is normalized, graphed and stored.
SNMP MIBs (Management Information Bases) on the PLC were enumerated with the snmpwalk tool and converted to a Zabbix template with the zload_snmpwalk Perl script (https://www.zabbix.com/wiki/howto/monitor/snmp/zload_snmpwalk).
3.1.4 THE NETWORK SNIFFER
The network sniffer consists of a simple Linux installation connected to a specific switch port configured to receive a mirror of every other port carrying Modbus traffic. The sniffer itself was crafted in-house in Python on top of Scapy (http://www.secdev.org/projects/scapy/), a packet manipulation program.
The sniffer is able to decode modbus traffic and output information in the following format:
{'request': '0330000064', 'unit_id': 1, 'src_ip': '192.168.1.15', 'dst_ip':
'192.168.1.1', 'response':
'03c80006005000160012019000120050005000160012001201901130cd00cd00cd00cd00cd
00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00c
d00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00cd00
cd00cd008801004008110000012010480000800000024000000200000000200000000000200
410000450200000000200000020004002000000002000408000000040800200000020000000
000800cd0010800000cd0000440000', 'func_code': 3}
The sniffer consists of 3 modules, fired off as forked instances in a multiprocess paradigm: the Sniffer, the Workers and the Publisher. They were built this way to let the operations run isolated, without blocking each other. The main process is the sniffer, built on top of Scapy. This module feeds an IPC queue that is consumed by the workers (100 instances), which add the transactions to the pool for summarization. The publisher takes the summary, reports it to Zabbix for monitoring and graphing, and then flushes it so the cycle begins again. This way, alerts can be set up if invalid function codes are detected. The sniffer could also be reprogrammed to output more data, such as register-write values.
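The fan-out described above can be compressed into a few lines with Python's `multiprocessing` module. This is not the production sniffer, only an illustration of the queue-fed worker pattern, with the publisher reduced to a Counter of function codes:

```python
from collections import Counter
from multiprocessing import Process, Queue

def worker(in_q: Queue, out_q: Queue) -> None:
    """Consume decoded transactions until a None sentinel arrives,
    forwarding each function code for summarization."""
    for txn in iter(in_q.get, None):
        out_q.put(txn["func_code"])

def summarize(transactions, n_workers: int = 4) -> Counter:
    """Fan transactions out to worker processes, then tally function codes."""
    in_q, out_q = Queue(), Queue()
    workers = [Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for txn in transactions:
        in_q.put(txn)
    for _ in workers:            # one shutdown sentinel per worker
        in_q.put(None)
    for w in workers:
        w.join()
    # every transaction yielded exactly one result, so drain that many
    return Counter(out_q.get() for _ in range(len(transactions)))
```

In the real sniffer the input queue is fed by the Scapy capture loop, and the tally is what the publisher periodically ships to Zabbix before flushing.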
3.2 ATTACKS PERFORMED
The following attacks were performed against the simulated industrial network assets:
Attack                                | Attack Vector                                                                         | Affected Assets
Communications interception           | ARP poison                                                                            | PLC, Supervisory Stations
PLC Denial of Service                 | Layer 3 network flood, 0day                                                           | PLC
Supervisory station malware infection | Modbus malware, Meterpreter shell backdoor                                            | Supervisory Stations, Network
Supervisory station compromise        | Meterpreter shell backdoor                                                            | Supervisory Station
Unauthorized remote logon             | Enabling remote desktop on machine, accessing it from another machine on the network  | Supervisory Station
Unauthorized modbus traffic           | Sending commands from attacker machine                                                | PLC

Table 3: Attacks performed.
3.3 TEST RESULTS
As testing usually involves some information gathering, we started by issuing a simple scan with nmap:
$ nmap -sV 192.168.1.1
Without any rate limiting, three triggers fired: Abnormal incoming traffic, Abnormal outgoing traffic and TCP connection number change.
Picture 2: Triggers fired by nmap scan
Picture 3: Peaks generated by the nmap scan on TX/RX, along with some TCP RSTs
3.3.1 COMMUNICATIONS INTERCEPTION
Our ARP table checking script has done its job and reported the changed MAC address for
192.168.1.1 (PLC) when we poisoned the Supervisory Station with arpspoof.
Picture 4: Trigger for ARP changes
The following UserParameter entry was added to the Supervisory Station's Zabbix agent
configuration file, for ARP testing:
UserParameter=arpcheck[*],ping -n 1 -w 1 $1 > NUL & for /f "tokens=2" %i in ('arp -a | findstr /r "$1>"') do @echo %i
Items can then be created, like "arpcheck[192.168.1.1]", to get the MAC address for this IP from the ARP table. A trigger is created to fire on changes to this value.
3.3.2 PLC DENIAL OF SERVICE
The ICMP flood attack was very noisy, as expected. Several triggers fired during the attack, and graphs clearly showed abnormal peaks.
Picture 5: Green line shows a 30 Mbps peak from the flood
Picture 6: Errors caused by the network overflow
SNMP data isn't available while the PLC is under a DoS attack, because the SNMP client cannot connect to the device to collect data; those triggers will clear after the attack has ceased and are considered collateral triggers.
Picture 7: Triggers fired after DOS attack
To get alerted as soon as the device comes under a denial-of-service attack, we periodically ping the PLC. If the PLC doesn't respond within a certain timeout, it is considered offline and a trigger is issued.
Picture 8: Trigger from PLC ping fail
3.3.3 SUPERVISORY STATION MALWARE INFECTION
As ICS network traffic is mostly homogeneous, we can set pretty tight thresholds on network input and output variance. Most remote access tools (RATs) don't bother limiting their speed and will most likely try to communicate as fast as they can.
Picture 9: Network traffic hops on meterpreter session
Picture 10: Triggers from abnormal network traffic
The meterpreter session creates a little noise while downloading its stage, but when we start issuing commands (like 'ls', 'ps', 'migrate' and some scripts) the graph departs sharply from that nearly flat line, showing that someone is doing something there.
Network input and output is one of the best places to detect malware outbreaks early, since worms are usually noisy as they try to phone home or spread across the network.
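The "tight threshold" idea reduces to a simple deviation test against the healthy baseline. A sketch (the 3-sigma constant is an illustrative choice, not a value tuned on our test bed):

```python
from statistics import mean, stdev

def traffic_anomaly(baseline, new_value, k: float = 3.0) -> bool:
    """Fire when a throughput reading deviates more than k standard
    deviations from the healthy baseline samples."""
    mu = mean(baseline)
    sigma = max(stdev(baseline), 1e-9)  # guard against a perfectly flat baseline
    return abs(new_value - mu) > k * sigma
```

Zabbix can express the same idea declaratively, using trigger functions over an item's history instead of custom code.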
3.3.4 SUPERVISORY STATION COMPROMISE
We created a monitoring item for the number of running "cmd.exe" processes in order to be notified, every 30 seconds, if a cmd.exe is open on any supervisory station. Normally no shells should be open unless the system is under some kind of maintenance.
Picture 11: Trigger notifies about new shell open
Windows PowerShell is also monitored, as it is even less commonly used on a regular basis on supervisory stations.
Picture 12: Trigger notifies about new Windows Powershell open
When a system is marked as "under maintenance", triggers are suppressed, so you may open as many shells as you want.
If you need to run scheduled batch jobs, you can add time ranges where it's allowed to have
one or more “cmd.exe” processes running.
3.3.5 UNAUTHORIZED REMOTE LOGON
By monitoring Windows Event Log we can determine whether a new session is created. Our
trigger caught it right away.
Picture 13: Trigger for new sessions created on Windows station
3.3.6 UNAUTHORIZED MODBUS TRAFFIC
The unauthorized traffic caused subtle but noticeable changes in the TX/RX graph. If the ICS
network can keep a steady pace, variation thresholds can be tuned to detect anomalous traffic.
The peak within the area marked in blue is where the unauthorized commands were issued.
Note the increase in TCP connections count and traffic during this period.
Picture 14: Peaks generated by Modbus traffic from attacker machine
Picture 15: Triggers triggered by abnormal traffic generated by issuing unauthorized modbus
traffic and modbus data extraction
The network sniffer also gives us some good visualization of Modbus Function codes. Take
function-code 3 (Read Multiple Registers) for example. The Supervisory Station polls it every
N seconds to update the supervisory software. The graph is pretty constant as shown below.
Picture 16: Regular Function-Code 3 Modbus Traffic to PLC
As soon as the attacker starts sending Modbus function code 3 to the PLC in order to enumerate tags, the graph shows spikes (highlighted below) that blow the whistle on the enumeration.
Picture 17: Peaks after Modbus probing
The first peak is due to manual probing of individual tags via a command-line Modbus client. The second (larger) one is due to the "enum.sh" script, which tries to read tags from a supplied range. As the normal communication is steady, this subtle change also fires a trigger.
Picture 18: Trigger fired by unauthorized Modbus traffic
4 CONCLUSION
The homogeneity of the cyclical behavior of industrial networks and servers allows us to establish, with little effort, the parameters of a 'healthy' network. This characteristic is rarely found in IT networks, given the nature of their use, which makes monitoring them with the same level of accuracy unfeasible without a massive occurrence of false positives.
Network and servers analysis and monitoring applications are critical for the detection of
unusual network traffic, performing network and control systems management, and assisting
in responding to security incidents. This type of software addresses a general need for security
in control systems rather than specific vulnerabilities.
Behavior monitoring can achieve more tangible results than monitoring for known patterns, called 'signatures'.
When applied to an automation network, monitoring software can be used to establish a baseline of normal network traffic, a task that helps facilitate incident response and risk assessment. Establishing traffic baselines by analyzing packets on the control system network is required for the detection of anomalous traffic by analyzing the differences.
Once irregular network traffic has been captured and analyzed by the monitoring software, the security team can use the data dumps to assess what is really happening on the network. The anomalous traffic is compared to the baseline traffic to provide important information about which servers (or equipment) are generating the anomalous traffic, which ports and services may be involved, and which network protocols are being used. Packet dumps can be used to determine whether the traffic is due to network errors, system configuration, or a compromised system.
After the normal activity of the network has been demarcated, triggers can be configured for parameters outside these ranges, which can indicate a compromise of the assets in question. Based on these triggers, alarms (including audible ones) can be configured.
For industrial automation environments, with their unusual protocols, there are few
commercial tools available for purchase, and the customization of an open source tool that fits
the monitoring needs should be considered.
REFERENCES ON THE INTERNET
1. http://www.tisafe.com/en/solucoes/governanca-industrial/
2. http://www.tofinosecurity.com/
3. https://www.zabbix.org
4. https://www.zabbix.com/documentation/2.0/manual/config/items/itemtypes/zabbix_agent/win_keys
5. https://www.zabbix.com/forum/showthread.php?t=10679
6. https://www.zabbix.com/wiki/howto/monitor/snmp/zload_snmpwalk
7. http://technet.microsoft.com/en-us/library/dd941635(v=ws.10).aspx
8. http://technet.microsoft.com/en-us/library/cc732459(v=ws.10).aspx
9. http://www.kali.org
10. http://docs.python.org/2/library/multiprocessing.html
11. https://www.zabbix.com/forum/showthread.php?p=90132