3. Obligatory Definition
"Big data is a term for data sets that are
so large or complex that traditional
data processing applications are
inadequate."
https://en.wikipedia.org/wiki/Big_data
Big data – with spheres containing pictures of the main themes.
Just as an intro to what we are going to cover: today our talk on big data is brought to you by Leo Tolstoy, Donald Rumsfeld, 2001: A Space Odyssey, Star Wars and the Eiffel Tower.
A fast-paced, non-sales-led overview of how we approach large-scale log analysis at 7E as part of incident response.
This being a conference on Big Data, and as my talk is at least the sixth of the day, I thought we might need to actually define what big data is!
For me, the key points are highlighted in bold: we are talking about large and/or complex data sets where traditional approaches to data processing are no longer adequate.
In this talk we are going to cover how large-scale log analysis can quickly generate data sets that are so large that current approaches are inadequate and explore how a blended approach of technology and human problem solving is required to find the needle in the haystack.
What we are not talking about is replacing people with algorithms, machine learning or magic. A key point is that big data is not a panacea for all ills within the security world; but when leveraged in the right way, with skilled technical resources, the use of big data will enable our community to make sense of the ever-evolving and complicated threat landscape that we now face.
So, this talk is based on real world experience of dealing with large amounts of log data. For context, we are going to use an incident that the 7E team worked on that involved a new exploit with no known attack signatures.
So, in the words of Donald Rumsfeld: “because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.”
If it was a known known, then we could expect IPS/WAF platforms to provide protection and alerting, and thus analysis of those alerts.
However, in the event of a breach that doesn’t match known signatures, we are moving into the realm of known unknowns and having to search the available logs for evidence of malicious activity.
I’m not even going to get into a discussion about unknown unknowns… So for now, let’s get back to logs: specifically Apache log files, and lots of them.
What do we actually mean when we talk about large and complex logs?
An incident engagement involving an enterprise-grade three-tier architecture, split over multiple load balancers serving traffic to eight application servers and supporting backend databases.
A compromise of the main application layer. As part of the incident response we obtained 3TB of log files covering the previous three months of log history for the eight servers.
All logs were split by host and further split through log rotation.
Today, 3TB of data doesn’t sound like much, especially as you can now buy a 3TB external drive for £70.
So what is 3TB of log data in real terms?
Total number of words = 339,388,313,412.
Total number of lines = 26,800,595,927.
Enough words to fill 577,891 copies of Leo Tolstoy’s War and Peace.
Which, if you stacked those books, would be the height of 117 Eiffel Towers.
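As a back-of-the-envelope sanity check of those comparisons (the per-copy word count and book thickness here are illustrative assumptions, not figures from the engagement):

```python
# Back-of-the-envelope check of the slide's figures.
total_words = 339_388_313_412           # words across the 3TB of logs
words_per_copy = 587_287                # commonly cited War and Peace word count (assumption)

copies = total_words // words_per_copy  # number of complete copies the logs would fill
stack_height_m = copies * 0.061         # assume ~6.1 cm spine per hardback copy
eiffel_towers = stack_height_m / 300    # the Eiffel Tower is roughly 300 m tall

print(copies)         # ~577,891 copies
print(eiffel_towers)  # on the order of 117 towers
```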
Clearly that amount of data was going to require more than Notepad or a spreadsheet!
And as any discerning Jedi knows, you don’t just rely on technology – the failure of the Death Star is a prime case in point. We need a blend of human skills supported by technology.
So we built an easily extensible Python controller, using a RESTful JSON-based API to take advantage of Elasticsearch and Logstash.
Combined with Kibana to provide a graphical interface for manual interrogation of log data.
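As a rough illustration of the kind of JSON query such a controller might send over Elasticsearch’s REST search API (the index name, field names, time window and endpoint below are illustrative assumptions, not details of the actual platform):

```python
import json

# Illustrative query: pull all Apache access-log entries from one client IP
# over the retained 90-day window. "apache-logs" and "clientip" are assumed
# names, and 203.0.113.42 is a documentation-range placeholder address.
suspect_ip = "203.0.113.42"

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"clientip": suspect_ip}},
                {"range": {"@timestamp": {"gte": "now-90d"}}},
            ]
        }
    },
    "sort": [{"@timestamp": {"order": "asc"}}],
    "size": 1000,
}

# The controller would POST this body to an endpoint such as
# http://localhost:9200/apache-logs/_search
body = json.dumps(query)
print(body)
```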
Key to large scale log analysis:
Ability to extract ‘known knowns’, using verified attack signatures.
Automatically generate evidence files.
Access to raw logs for manual analysis and verification, both via the command line and visually through Kibana.
Ability to update signatures with new attack patterns and repeat analysis.
Timelines, suspect IPs.
All traffic from a suspect IP is extracted for further targeted analysis.
The human is in full control of the analysis, including the choice of signatures.
Finding new attacks that automated tools alone would not have found!
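The ‘known knowns’ extraction step above can be sketched as a signature sweep over raw log lines. This is a minimal illustration only: the two regexes are hypothetical examples, not the verified signatures the platform actually uses.

```python
import re

# Hypothetical 'known known' signatures (illustrative only, not verified
# IDS rules): a SQL-injection marker and a directory-traversal marker.
SIGNATURES = {
    "sqli": re.compile(r"union(\s|%20)+select", re.IGNORECASE),
    "traversal": re.compile(r"\.\./\.\./"),
}

def extract_evidence(log_lines):
    """Return (signature_name, line) pairs for every matching log entry,
    suitable for writing straight out to an evidence file."""
    evidence = []
    for line in log_lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                evidence.append((name, line))
    return evidence

# Two fabricated Apache-style entries: one malicious, one benign.
sample = [
    '203.0.113.9 - - [10/Oct/2015:13:55:36] "GET /?q=1%20UNION%20SELECT%20pw HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Oct/2015:13:55:37] "GET /index.html HTTP/1.1" 200 1024',
]
hits = extract_evidence(sample)
```

The same loop structure makes the “update signatures and repeat” step cheap: add a pattern to the dictionary and re-run the sweep.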
In our reference case:
Identification of an exploit kit being used to compromise the enterprise.
Established the timeline of the attack, which reached further back than initially thought.
Identification of compromised hosts and backdoor scripts in use by the attackers, allowing for wider internal search of systems for evidence of compromise.
As a second example:
Manual analysis of one million lines of log data for evidence of compromise through a new exploit took around four man-days. Using the platform, it took 2.5 seconds to identify the nine relevant entries within the logs showing a successful compromise of the server.
Positioning slide to talk about the need to be able to drive tools and interrogate the data. Do not be driven by the output of the tool!
Also talk about being able to do critical analysis – asking the “so what?” question.
Use the example of an analyst finding Shellshock.
The skill set of the front line needs to increase; it should not be restricted to research.