Big Data is a new term for datasets that cannot be managed with current methodologies or data mining software tools because of their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary because of the volume, variability, and velocity of such data.
2. Motivation
• BIG DATA is an OPEN SOURCE Software Revolution
• BIG DATA Analytics 2.0
• What is happening right now
• Why do we need new tools?
• Improve decision making:
• Measure and react in REAL-TIME
2 7/6/2013
3. Real Time Decision Making
Companies need to know:
• what is happening right now, in real time, to be able to
– react
– anticipate and detect new business opportunities.
4. Big Data 6 Vs
• Volume
• Variety
• Velocity
• Value
• Variability
• Veracity
5. Controversy of Big Data
• All data is BIG now
• Hype to sell Hadoop-based systems
• Ethical concerns about accessibility
• Limited access to Big Data creates new digital divides
6. Controversy of Big Data
• Statistical Significance:
– When the number of variables grows, the number of spurious correlations also grows
– Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
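The statistical-significance point can be illustrated with a small simulation. The sketch below is not from the talk: it draws a random target series and a pool of unrelated random variables, then reports the strongest absolute correlation found among the first `n_variables` of them. The function names, the sample size of 20, and the fixed seed are assumptions chosen for illustration only.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def strongest_spurious_correlation(n_variables, n_samples=20, seed=0):
    """Largest |correlation| between a random target and n_variables
    unrelated random variables (hypothetical helper for illustration).

    With a fixed seed the target and the first k candidates are the same
    on every call, so the result can only stay equal or grow as
    n_variables grows.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(k, round(strongest_spurious_correlation(k), 3))
```

None of the candidate variables has any real relationship with the target, yet searching more of them turns up stronger apparent correlations, which is the effect the butter-production example points at.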
7. Need for Big Data
• McKinsey Global Institute (MGI) Report on Big Data, 2011
8. Need for Big Data
• McKinsey Global Institute (MGI) Report on Big Data, 2011
9. More data or better models?
Xavier Amatriain
Netflix Research/Engineering Director
http://recsys.acm.org/more-data-or-better-models/
10. Future Challenges for Big Data
• Evaluation
• Time evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data
17. What is SAMOA?
• NEW Software framework for mining distributed data streams
• Big Data mining for evolving streams in REAL-TIME
18. Big Data Stream Mining
BIG DATA Streams
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time
• Process elements from a data stream in only one pass
• Approximation algorithms
– Small error rate with high probability
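As one concrete instance of a one-pass approximation algorithm, the sketch below implements a Count-Min sketch for estimating item frequencies in a stream. It is an illustration only and is not taken from SAMOA or MOA; the class name, the default parameters, and the use of Python's built-in `hash` with random salts (a stand-in for the pairwise-independent hash functions assumed by the standard analysis) are all assumptions.

```python
import math
import random


class CountMinSketch:
    """Approximate frequency counter that sees each element once.

    Uses width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)).
    Memory is fixed by these two parameters, not by the stream length.
    """

    def __init__(self, epsilon=0.01, delta=0.01, seed=0):
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        # One random salt per row; salted built-in hashing is an
        # assumption standing in for pairwise-independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # number of elements seen so far

    def _columns(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item, count=1):
        """Process one stream element (single pass, no storage of the item)."""
        self.total += count
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        """Estimated frequency; never below the true count."""
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))


if __name__ == "__main__":
    sketch = CountMinSketch(epsilon=0.01, delta=0.01)
    for i in range(10_000):
        sketch.add(i % 50)  # each of 50 items appears 200 times
    print(sketch.estimate(7), "estimated occurrences of item 7")
```

Under the standard analysis, the estimate never undercounts and exceeds the true count by more than epsilon times the stream length with probability at most delta, which is the "small error rate with high probability" trade-off named above.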
19. Big Data Stream Mining
Distributed BIG DATA
• BIG DATA Analytics 2.0
– Apache S4 (Yahoo!, 2010)
– Storm (Twitter, 2011)
Machine Learning
• Distributed
– Batch: Hadoop, Mahout
– Stream: S4, Storm, SAMOA
• Non Distributed
– Batch: R, WEKA,…
– Stream: MOA
20. SAMOA Architecture
Use S4, Storm, or other distributed stream processing platform
Use MOA, or other streaming machine learning library
Easy to extend through PACKAGES
[Diagram: SAMOA on S4, Storm, …; SAMOA components: Classifier Methods, Clustering Methods, Frequent Pattern Mining]
21. Thanks!
http://samoa-project.net/
G. De Francisci Morales. SAMOA: A Platform for Mining Big Data Streams. Keynote Talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and Mining of Social Streams @WWW, Rio De Janeiro, 2013.