Big Data Analytics V2
Marko Grobelnik
Jozef Stefan Institute, Slovenia
Sao Carlos, July 15th 2013
• Big-Data in numbers
• Big-Data Definitions
• Motivation
• State of Market
• Techniques
• Tools
• Data Science
• Concluding remarks
http://www.go-gulf.com/blog/online-time/
• “Big-data” is similar to “Small-data”, but bigger
• …but bigger data requires different approaches:
◦ techniques, tools, architectures
• …with the aim of solving new problems
◦ …or old problems in a better way.
• Volume – challenging to load and process (how to index, retrieve)
• Variety – different data types and degrees of structure (how to query semi-structured data)
• Velocity – real-time processing influenced by the rate of data arrival
From “Understanding Big Data” by IBM
• 1. Volume (lots of data = “Tonnabytes”)
• 2. Variety (complexity, curse of dimensionality)
• 3. Velocity (rate of data and information flow)
• 4. Veracity (verifying inference-based models from comprehensive data collections)
• 5. Variability
• 6. Venue (location)
• 7. Vocabulary (semantics)
Comparing volume of “big data” and “data mining” queries
…adding “web 2.0” to “big data” and “data mining” queries volume
Big-Data
• Key enablers for the appearance and growth of “Big Data” are:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data
Source: WikiBon report on “Big Data Vendor Revenue and Market Forecast 2012-2017”, 2013
• …when the operations on data are complex:
◦ …e.g. simple counting is not a complex problem
◦ Modeling and reasoning with data of different kinds can get extremely complex
• Good news about big-data:
◦ Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)…
◦ …as long as we deal with the scale
• Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube
[Figure: the data cube, labeled Scalability, Streaming, Context, Quality, Usage]
• A risk with “Big-Data mining” is that an analyst can “discover” patterns that are meaningless
• Statisticians call it Bonferroni's principle:
◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
Example:
• We want to find (unrelated) people who have stayed at the same hotel on the same day at least twice
◦ 10^9 people being tracked.
◦ 1000 days.
◦ Each person stays in a hotel 1% of the time (1 day out of 100)
◦ Hotels hold 100 people (so 10^5 hotels).
◦ If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
• Expected number of “suspicious” pairs of people:
◦ 250,000
◦ … too many combinations to check – we need some additional evidence to find “suspicious” pairs of people in a more efficient way
Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
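The 250,000 figure can be checked with a few lines of arithmetic. The sketch below assumes, as the example does, that people choose days and hotels independently and uniformly at random; the variable names are illustrative and not taken from the book.

```python
from math import comb

# Parameters from the hotel example above.
n_people = 10**9
n_days = 1000
p_hotel_visit = 0.01      # each person is in some hotel on 1% of days
n_hotels = 10**5

# Probability that two given people are in the SAME hotel on one given day:
# both must be in a hotel (0.01 * 0.01) and pick the same one (1 / 10^5).
p_same_hotel_one_day = p_hotel_visit**2 / n_hotels

# Probability that this happens on two given (different) days.
p_same_hotel_two_days = p_same_hotel_one_day**2

# Expected number of "suspicious" events = pairs of people * pairs of days * probability.
expected_pairs = comb(n_people, 2) * comb(n_days, 2) * p_same_hotel_two_days

print(f"{expected_pairs:,.0f}")
```

The exact binomial coefficients give a value just under 250,000; the slide's round number comes from approximating the number of pairs as n^2/2.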
• Smart sampling of data
◦ …reducing the original data while not losing the statistical properties of data
• Finding similar items
◦ …efficient multidimensional indexing
• Incremental updating of the models
◦ (vs. building models from scratch)
◦ …crucial for streaming data
• Distributed linear algebra
◦ …dealing with large sparse matrices
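As one possible illustration of sampling a stream without storing it, the sketch below uses reservoir sampling (Algorithm R). This is a technique chosen here for illustration; the slides do not name a specific sampling method, and the function name and `seed` parameter are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

The stream is read once and only k items are held in memory, which is why this style of sampling suits data that arrives faster than it can be stored.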
• On top of the previous operations we perform the usual data mining/machine learning/statistics operators:
◦ Supervised learning (classification, regression, …)
◦ Unsupervised learning (clustering, different types of decompositions, …)
◦ …
• …we are just more careful about which algorithms we choose (typically linear or sub-linear versions)
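A small example of a linear-time, incrementally updated statistic is a running mean and variance computed in one pass (Welford's method). It is offered only as a sketch of the "update instead of rebuild" idea; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance, updated one observation at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 here when fewer than two points were seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.variance)
```

Each update costs constant time and memory, so the whole pass is linear in the number of observations and never needs the full data set in memory.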
• An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Leskovec, Ullman: Mining of Massive Datasets”
• Downloadable from: http://infolab.stanford.edu/~ullman/mmds.html
• Where is processing hosted?
◦ Distributed Servers / Cloud (e.g. Amazon EC2)
• Where is data stored?
◦ Distributed Storage (e.g. Amazon S3)
• What is the programming model?
◦ Distributed Processing (e.g. MapReduce)
• How is data stored & indexed?
◦ High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
◦ Analytic / Semantic Processing (e.g. R, OWLIM)
• Computing and storage are typically hosted transparently on cloud infrastructures
◦ …providing scale, flexibility and high fault tolerance
• Distributed Servers
◦ Amazon EC2, Google App Engine, Elastic Beanstalk, Heroku
• Distributed Storage
◦ Amazon S3, Hadoop Distributed File System
• Distributed processing of Big-Data requires non-standard programming models
◦ …beyond single machines or traditional parallel programming models (like MPI)
◦ …the aim is to simplify complex programming tasks
• The most popular programming model is the MapReduce approach
• Implementations of MapReduce and related tools
◦ Hadoop (http://hadoop.apache.org/), Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
• The key idea of the MapReduce approach:
◦ A target problem needs to be parallelizable
◦ First, the problem gets split into a set of smaller problems (Map step)
◦ Next, the smaller problems are solved in parallel
◦ Finally, the solutions to the smaller problems are combined into a solution of the original problem (Reduce step)
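The steps above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code; the function names (`map_step`, `shuffle`, `reduce_step`) are chosen here for readability, and in a real framework the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def map_step(document: str) -> List[Tuple[str, int]]:
    """Map: turn one document into (word, 1) pairs, independently of other documents."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key so each key can be reduced on its own."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the partial counts for one word."""
    return key, sum(values)

def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for doc in documents for pair in map_step(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data mining"])
print(counts)
```

Because each `map_step` call only looks at its own document and each `reduce_step` call only looks at one key, both stages can be distributed without the workers having to coordinate.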
• NoSQL databases have in common that they:
◦ Support large amounts of data
◦ Have a mostly non-SQL interface
◦ Operate on distributed infrastructures (e.g. Hadoop)
◦ Are based on key-value pairs (no predefined schema)
◦ …are flexible and fast
• Implementations
◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper…
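To make the "key-value pairs, no predefined schema" point concrete, here is a toy in-memory store. It is only a sketch of the data model: the `KeyValueStore` class and the sample keys are hypothetical and do not mirror the API of any of the systems listed above, which add distribution, persistence and replication on top.

```python
from typing import Any, Dict, Optional

class KeyValueStore:
    """Toy in-memory key-value store: values are arbitrary documents."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Returns None for unknown keys instead of raising.
        return self._data.get(key)

store = KeyValueStore()
# Two records under the same key prefix with different fields: no schema is enforced.
store.put("user:1", {"name": "Ana", "languages": ["sl", "en"]})
store.put("user:2", {"name": "Bo", "followers": 1200})

print(store.get("user:1"))
print(store.get("user:2"))
```

The trade-off is that the application, not the database, has to cope with records that lack a given field.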
• Interdisciplinary field using techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.
• Data science is a novel term that is often used interchangeably with competitive intelligence or business analytics, although it is becoming more common.
• Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.
http://en.wikipedia.org/wiki/Data_science
http://blog.revolutionanalytics.com/data-science/
Analyzing the Analyzers
An Introspective Survey of Data Scientists and Their Work
By Harlan Harris, Sean Murphy, Marck Vaisman
Publisher: O'Reilly Media
Released: June 2013
An Introduction to Data Science
Jeffrey Stanton, Syracuse University School of Information Studies
Downloadable from http://jsresearch.net/wiki/projects/teachdatascience
Released: February 2013
• Big-Data is everywhere, we are just not used to dealing with it
• The “Big-Data” hype is very recent
◦ …and it still seems to be growing
◦ …evident lack of experts to build Big-Data apps
• Can we do “Big-Data” without big investment?
◦ …yes – many open source tools, and computing machinery is cheap (to buy or to rent)
◦ …the key is knowledge of how to deal with data
◦ …data is either free (e.g. Wikipedia) or available to buy (e.g. Twitter)
• Analysis of big social networks
• Recommender services on a big scale
• How to profile people from web data?
• Complex data visualization
• How to track information across languages?
• Putting deep semantics into the picture
