There are four key issues to overcome if you want to tame Big Data: volume, variety, velocity and veracity. You have to be able to deal with lots and lots of all kinds of data, moving really quickly.
Big Data Analytics has a huge impact on how we plan CERN’s overall technology strategy as well as specific strategies for High-Energy Physics analysis. We want to profit from our data investment and extract the knowledge. This has to be done in a proactive, predictive and intelligent way.
This presentation shows you how we use Big Data Analytics to improve the operation of the Large Hadron Collider. See also: http://alexloth.com/2012/06/03/challenges-big-data-analytics-high-energy-physics/
Big Data Analytics in High-Energy Physics
1. DB
CERN
CH-1211 Geneva 23
Switzerland
www.cern.ch
Big Data Analytics in High-Energy Physics
Alexander Loth
CERN
23 May 2012
2. CERN
• CERN is the European Organization for Nuclear Research
• Founded in 1954 by 12 countries for fundamental physics
• Today: the global effort of 21 member states
– About 1 billion CHF yearly budget
– 3300 employees
• Supporting the research activities of ~10000 scientists from 110+ nationalities
3. Fundamental Research at CERN
• Why do particles have mass?
• Why is there no antimatter left in the universe?
• What was the state of the universe just after the Big Bang?
4. CERN Accelerator Complex
5. Potential of Big Data Analytics
• Stage 1: DATA COLLECTION AND STORAGE (Data Integration, Data Merging, ETL), built on the control and monitoring systems
• Stage 2: INFORMATION RETRIEVAL (Queries, Statistics, Analysis)
• Stage 3: KNOWLEDGE GENERATION (Predictions, Reporting, Visualization)
• Stage 4: WISDOM (Decisions)
• Goal: reduce and predict faults and corrective interventions; increase the availability and operations efficiency
• Approach: PROACTIVE, PREDICTIVE, INTELLIGENT
6. What about Business Intelligence?
• Traditional BI: GBs to TBs of data; operational; structured; repetitive
• Big Data Analytics: TBs to EBs of data; external + operational; un-/semi-structured; ad hoc
7. Challenges of Big Data Analytics
• VOLUME: scale of data. In 2011 humankind created 1200 EB of information. At CERN: 22 PB/year, peaking at 20 GB/s, with writing spread across 80 tape drives.
• VELOCITY: analysis of streaming data. Worldwide digital content will double every 18 months.
• VARIETY: different forms of data. 80% of data is unstructured.
• VERACITY: uncertainty of data. Poor data quality costs $3.1 trillion a year.
Sources: The Economist, Gartner, IDC, McKinsey
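To put the CERN figures above into perspective, here is a rough back-of-the-envelope check in Python (a sketch only; it assumes decimal units, 1 PB = 10^6 GB, and takes the numbers straight from the slide):

# Back-of-the-envelope check of the rates quoted on this slide.
SECONDS_PER_YEAR = 365 * 24 * 3600

yearly_pb = 22        # data recorded per year, PB
peak_gb_s = 20        # peak write rate, GB/s
tape_drives = 80      # tape drives sharing the peak load

avg_gb_s = yearly_pb * 1e6 / SECONDS_PER_YEAR    # 1 PB = 1e6 GB (decimal units)
per_drive_mb_s = peak_gb_s * 1000 / tape_drives  # MB/s per drive at peak

print("average ingest rate : %.2f GB/s" % avg_gb_s)        # ~0.70 GB/s
print("peak load per drive : %.0f MB/s" % per_drive_mb_s)  # 250 MB/s

So the peak rate is roughly 30 times the yearly average, which is why the writes are spread across so many drives.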
8. Big Data Analytics Use Cases
9. Why use Hadoop at CERN?
• System should manage and heal itself
– Automatically and transparently route around failure
– Speculatively execute redundant tasks if certain nodes are detected to be slow
• Performance should scale linearly
– Proportional change in capacity with resource change
• Computing should move to data
– Lower latency, lower bandwidth
• Simple core that is modular and extensible
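As a hedged illustration of the programming model behind these bullets, here is a minimal Hadoop Streaming job in Python (the classic word count). The script name and paths are assumptions made for the example; the point is that map and reduce tasks are shipped to the nodes holding the HDFS blocks, and failed or slow tasks are simply re-executed or speculatively duplicated by the framework.

#!/usr/bin/env python3
# wc.py: minimal Hadoop Streaming sketch (hypothetical example).
# Submitted roughly like this (the streaming jar path depends on the installation):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/text -output /data/wordcounts \
#       -mapper "wc.py map" -reducer "wc.py reduce" -file wc.py
import sys

def mapper():
    # Runs on the node that stores the input split: "computing moves to data".
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)                 # emit key<TAB>count

def reducer():
    # Streaming delivers mapper output sorted by key, so one pass with a
    # running total per key is enough.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

If one of these tasks dies or straggles, the framework restarts or duplicates it on another node; adding nodes adds both storage and compute, which is where the near-linear scaling comes from.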
10. Hadoop Clusters at CERN
• CASTOR Cluster with ~10 servers
– ~100 GB of logs per day
– >100 TB of logs in total
• ATLAS Cluster with ~20 servers
– Event index catalogue for experimental data in the Grid
• Monitoring Cluster with ~10 servers
– Log events from the CERN Computer Cluster
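As an illustration of the kind of job that could run on such a log cluster, the mapper below turns each CASTOR-style log line into a (day, severity) count that a summing reducer (such as the one in the word-count sketch above) can aggregate into error volumes per day. The assumed line layout is invented for the example and is not CASTOR's actual log format.

#!/usr/bin/env python3
# Hypothetical Hadoop Streaming mapper over daemon logs:
# emit "YYYY-MM-DD SEVERITY<TAB>1" per line for a summing reducer.
import sys

def parse(line):
    # assumed layout: "2012-05-20T10:01:02+02:00 host daemon: LVL=Error MSG=..."
    fields = line.split()
    if len(fields) < 4 or "=" not in fields[3]:
        return None                        # skip lines that do not match the assumption
    day = fields[0][:10]                   # keep only the date part of the timestamp
    severity = fields[3].split("=", 1)[1]  # e.g. Error, Warn, Info
    return day, severity

for line in sys.stdin:
    parsed = parse(line)
    if parsed:
        print("%s %s\t1" % parsed)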
11. Meta data from Physics Events (1)
• Meta data are created upon recording of a physics event
• Example 1: Event Information
– Run number, Event number
– Timestamp
– Luminosity block number
– Trigger that selected the event, etc.
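A minimal sketch of what one such event-index record might look like and how a compact lookup key could be derived from it; all field names and values below are illustrative assumptions, not the actual ATLAS event index schema.

# Hypothetical event-index record built from the metadata listed above.
event = {
    "run_number": 191628,
    "event_number": 72834921,
    "timestamp": "2011-10-25T14:03:11Z",
    "lumi_block": 221,
    "trigger": "EF_mu18_medium",        # trigger chain that selected the event (example value)
    "guid": "FILE-GUID-PLACEHOLDER",    # points to the file holding the full event record
}

def index_key(rec):
    # A fixed-width composite key keeps records sorted by run and event number,
    # which makes point lookups and range scans cheap in a key-value store.
    return "%08d-%012d" % (rec["run_number"], rec["event_number"])

print(index_key(event))   # -> 00191628-000072834921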
12. Meta data from Physics Events (2)
• Meta data are created upon recording of a physics event
• Example 2: Tape Storage Event Log
– On which tape is my file stored?
– Is there a copy on disk?
– List all events for a given tape or drive
– Was the tape repacked?
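A toy sketch of answering questions like these from such a log; the CSV layout, tape IDs and file names are invented for illustration.

# Hypothetical tape-event log: timestamp, action, tape, file, status.
import csv, io

TAPE_LOG = io.StringIO(
    "2012-05-20T10:01:02,write,T02931,lhcb.raw.00123,ok\n"
    "2012-05-21T08:12:45,repack,T02931,lhcb.raw.00123,ok\n"
    "2012-05-21T08:13:02,write,T04477,lhcb.raw.00123,ok\n"
)

def events_for_file(log, filename):
    # Return every tape event that touched the given file, oldest first.
    return [row for row in csv.reader(log) if row[3] == filename]

for ts, action, tape, name, status in events_for_file(TAPE_LOG, "lhcb.raw.00123"):
    print(ts, action, tape, status)
# The most recent 'write' tells us where the file lives now (T04477 after the repack).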
13. Questions?
Editor's notes
What I’m going to tell you today is how we use Big Data Analytics to improve the operation of the Large Hadron Collider.
My name is Alexander Loth, and I have been working at CERN for the last three years.
Big Data Analytics has a huge impact on how we plan CERN’s overall technology strategy as well as specific strategies for High-Energy Physics analysis.
Just a few words about CERN. CERN is the European Organization for Nuclear Research. It was founded a few years after the Second World War for peaceful research on fundamental physics…
So we do fundamental research:
- Why do we have mass? 50 years ago Peter Higgs proposed the Higgs mechanism, which seems to be the answer.
- At the first moment of the universe the same amount of matter and antimatter was present. We are clearly matter. So what happened to the antimatter?
What were the properties of the universe right after the Big Bang?
In the photo you can see a part of the LHC. The LHC is the biggest and most complex machine ever built.
The LHC is part of the CERN accelerator complex.
The particles start at the booster and are accelerated further until they reach the LHC ring.
If you have read Dan Brown’s book Angels & Demons, you should have a look at the AD (Antiproton Decelerator).
In order to process the huge amount of data gathered by the LHC experiments, we need to apply Big Data Analytics.
We want to profit from our data investment and extract the knowledge. This has to be done in a proactive, predictive and intelligent way.
Big Data Analytics will save massive costs if we can:
Reduce and predict faults and corrective interventions
Increase the availability and operations efficiency
If you ask yourself how Big Data Analytics differs from Business Intelligence…
This brings us to the specific challenges of Big Data Analytics shown on the next slide.
VOLUME / VARIETY / VELOCITY / VERACITY
Data is exploding because it is coming from so many sources, continuously. Systems, sensors, and more. But the amount of data isn’t the only issue.
There are 4 key issues to overcome if you want to tame big data – volume, variety, velocity and veracity.
You have to be able to deal with lots and lots of all kinds of data, moving really quickly. Today, most of this data is passing you by. You blink and it’s gone.
Next year: over 88 PB stored in total on 55,000 tapes, plus 14 PB stored on disks.
At CERN we are problem-driven people. This slide shows the technologies currently applied for Big Data Analytics at CERN.
As you can see, we always choose the technology that fits best, so in plenty of cases we still rely on Oracle and even store tons of raw data on tape.
However, more and more use cases for Hadoop pop up, for instance for monitoring and meta data.
… furthermore, the Hadoop Distributed File System (HDFS), which is also used massively by Facebook, is self-healing, high-bandwidth clustered storage.
It is reliable, redundant and optimized for huge amounts of data.