OPEN SOURCE TOOLS FOR BIG DATA
Helsinki 19.9.2017
Teemu Heikkilä
Emblica
EMBLICA
We’re a very small company of 5
people
We’re into Data Engineering,
DevOps and ML
We’re hiring!
Let’s start with something simple
first
What is “big data”?
Just a buzzword…
…but it still has a meaning.
1TB of data, is it BIG?
Volume Velocity Variety
We are not really at Facebook scale,
but is it worth talking about big data
tools?
Answer: Yes!
Question: Why?
Because what works with petabytes of
data, almost certainly works with
gigabytes
Helsinki City bike station usage
17M rows of JSON
You will get:
Fault tolerance, reliability, scalability and working
models for processing any amount of data
…but that doesn't necessarily mean
you need fancy frameworks
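As a rough illustration of the point above, a few million rows of line-delimited JSON can often be aggregated with nothing but the standard library. This is a minimal sketch: the `station` field name and the sample rows are assumptions made for the example, not the actual schema of the city bike data.

```python
import io
import json
from collections import Counter


def count_rows_per_station(lines):
    """Count rows per station while streaming, one JSON document per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        counts[row["station"]] += 1  # "station" is a hypothetical field name
    return counts


# Stand-in for a large file; in practice pass an open file object instead.
sample = io.StringIO(
    '{"station": "Kamppi", "bikes": 3}\n'
    '{"station": "Kaivopuisto", "bikes": 7}\n'
    '{"station": "Kamppi", "bikes": 2}\n'
)
station_counts = count_rows_per_station(sample)
print(station_counts.most_common(1))
```

Because the rows are read one at a time, memory use depends on the number of distinct stations rather than on the size of the file.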
History of data processing
with free software
HISTORY OF HADOOP
1997: START
Doug Cutting started to develop the first version of Lucene.
2001: OPEN SOURCED
Cutting open sourced Lucene and it was moved under the Apache Software Foundation. Mike Cafarella joined Cutting to start Apache Nutch, a project to index the whole internet.
2003: GFS
Google published a whitepaper about solving the storage problems of web indexing. Cafarella and Cutting implemented the whitepaper as part of the Nutch project.
2006: HADOOP
Cutting, by then at Yahoo!, moved the NDFS and MapReduce related codebase under a new project called Hadoop.
NOW
Ideas (whitepapers) -> FOSS implementations
DFS -> HDFS
MR -> Hadoop MR
BigTable -> HBase
Dynamo -> Cassandra
(Notable) formats of Big data
ACTIVITY DATA
Clickstreams
App usage
Application specific usage
Music listening
Video streaming
Money usage
Credit cards
Transactions
SENSOR DATA
Locations
Spatial data
Sensor metrics
IoT devices
Industrial and consumer
Time series
UNSTRUCTURED DATA
Machine logs,
Unstructured text,
natural language
Sound, Photos, Video
Use cases
What are you using those fancy logos for?
CASE 1: EVENT SOURCING SQL-DATABASES
A working legacy system that used a
MySQL database as its real-time data
store.
No historical data was ever saved.
Delete means delete
Update means update
We could touch the legacy code to
save the changes
But we don’t have to
Maxwell’s daemon
Reads the MySQL replication binary log
Produces a stream of JSON-formatted changes
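To make the change stream concrete, here is a small sketch that reads events shaped like Maxwell's JSON output (`database`, `table`, `type`, `ts`, `data`, plus `old` on updates) and lists the history of one row. The `shop` database, the `user` table and the rows themselves are invented for illustration.

```python
import json

# Invented sample events in the shape Maxwell emits (one JSON document per change).
raw_events = [
    '{"database": "shop", "table": "user", "type": "insert", "ts": 1505800000,'
    ' "data": {"id": 1, "email": "a@example.com"}}',
    '{"database": "shop", "table": "user", "type": "update", "ts": 1505800060,'
    ' "data": {"id": 1, "email": "b@example.com"}, "old": {"email": "a@example.com"}}',
    '{"database": "shop", "table": "user", "type": "delete", "ts": 1505800120,'
    ' "data": {"id": 1, "email": "b@example.com"}}',
]


def row_history(events, table, row_id):
    """Return (timestamp, change type, row values) for one row, oldest first."""
    history = []
    for raw in events:
        event = json.loads(raw)
        if event["table"] == table and event["data"].get("id") == row_id:
            history.append((event["ts"], event["type"], event["data"]))
    return history


history = row_history(raw_events, "user", 1)
for ts, change_type, data in history:
    print(ts, change_type, data)
```

The row is gone from MySQL after the delete, yet every earlier version of it is still available from the stream.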
?
KAFKA - DISTRIBUTED APPEND-ONLY LOG
Kafka was originally developed at
LinkedIn and open sourced in 2011
Distributed, append-only log
A great tool for reliably delivering
millions of arbitrarily formatted
messages
Scales by partitioning and adding new
nodes
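The idea of a partitioned append-only log can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's client API: the `PartitionedLog` class is hypothetical, and CRC32 stands in for whatever hash a real partitioner uses.

```python
import zlib


class PartitionedLog:
    """Toy in-memory append-only log split into partitions (not Kafka's API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message; the same key always maps to the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        """Read every message in a partition starting from an offset."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
first = log.append("user-1", "created")
second = log.append("user-1", "renamed")
log.append("user-2", "created")

partition, offset = first
print(log.read(partition, offset))
```

Messages are only ever appended, so a reader needs to remember nothing but its offset per partition, and ordering is guaranteed within a partition rather than across the whole log.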
KAPPA ARCHITECTURE
(c) Ch.ko123 / CC BY 4.0
(c) Apache Spark
+ Fast writes (queue/log)
+ Fast reads (in-memory)
- Latency
- Reliable event delivery is essential
MATERIALIZING EVENT SOURCES
Change stream
Change stream
Change stream
Materialized
‘User’-table
Materialized
‘Resource’-table
Materialized
‘Usage’-table
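Materializing a table from its change stream amounts to folding the events, in order, into a keyed collection. The sketch below assumes events carry `table`, `type` and `data` fields and that rows are keyed by `id`; the sample rows are invented.

```python
def materialize(events, table, key="id"):
    """Fold a change stream into the current rows of one table."""
    rows = {}
    for event in events:
        if event["table"] != table:
            continue
        row = event["data"]
        if event["type"] in ("insert", "update"):
            rows[row[key]] = row
        elif event["type"] == "delete":
            rows.pop(row[key], None)
    return rows


change_stream = [
    {"table": "user", "type": "insert", "data": {"id": 1, "name": "Ada"}},
    {"table": "user", "type": "insert", "data": {"id": 2, "name": "Bob"}},
    {"table": "user", "type": "update", "data": {"id": 1, "name": "Ada L."}},
    {"table": "user", "type": "delete", "data": {"id": 2, "name": "Bob"}},
]

user_table = materialize(change_stream, "user")
print(user_table)

# Replaying only a prefix of the stream gives the table as it was at that point.
earlier = materialize(change_stream[:2], "user")
```

Since the stream is the source of truth, the same events can be replayed later to build a differently shaped table without touching the legacy database.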
APACHE SPARK
Originally developed at the University
of California, Berkeley's AMPLab
General large-scale data processing
framework
Builds on the MapReduce model but
keeps intermediate results in memory
instead of writing them to slow disks
as Hadoop MapReduce does
(c) Ch.ko123 / CC BY 4.0
Supports lots of different data
sources
Programming APIs for Scala, Java and
Python
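The map, shuffle and reduce steps, and the benefit of keeping an intermediate result in memory, can be illustrated without a cluster. This is plain Python, not Spark's API; the function names are made up for the sketch.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data tools", "open source tools for big data"]

# The intermediate result stays in memory, so two follow-up computations
# reuse it without re-reading or recomputing the input.
grouped = shuffle_phase(map_phase(lines))
word_counts = reduce_phase(grouped)
distinct_words = len(grouped)

print(word_counts["big"], distinct_words)
```

In a disk-based MapReduce chain each of those follow-up computations would start by reading the previous job's output back from storage.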
EKS-STACK
Elasticsearch is based on Lucene, but
it’s more than just a search engine: it
can provide real-time analytics even
for end users. It’s usually used to
store the aggregated data
Kibana is a great tool for developers
and for internal use to discover and
analyze the data stored in ES
Spark is used to process the events,
produce the needed aggregates and
ingest data into Elasticsearch so it can
be queried
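The ingest step can be pictured as turning aggregates into the newline-delimited body that Elasticsearch's `_bulk` endpoint accepts: an action line followed by a document line for each record. The sketch only builds the payload and sends nothing; the index name, document ids and numbers are invented.

```python
import json


def to_bulk_payload(index, aggregates):
    """Build a newline-delimited body for Elasticsearch's _bulk endpoint."""
    lines = []
    for doc_id, doc in aggregates.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical aggregates, as a processing job might produce them.
aggregates = {
    "kamppi-2017-09-19": {"station": "Kamppi", "day": "2017-09-19", "events": 1200},
    "toolo-2017-09-19": {"station": "Toolo", "day": "2017-09-19", "events": 800},
}

payload = to_bulk_payload("station-usage", aggregates)
print(payload)
```

Using a deterministic `_id` per aggregate means that re-running the job overwrites earlier results instead of duplicating them.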
Screenshot by elastic.co
CASE 2: EVERY ANALYTICS PIPELINE EVER
User agent -> Events -> Event Collector -> Processing -> Analytics
Event source
(demo)
What are we sampling?
[Diagram: State + Event -> New State (Session)]
New session: started 07:17:09, duration 0s, OPEN
Existing session: started 07:17:09, duration 5s, OPEN
Existing session: started 07:17:09, duration 10s, OPEN
Existing session: started 07:17:09, duration 14s, paused 07:17:23, CLOSED
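The session output above follows a simple rule: the previous state plus one event gives the new state. Here is a minimal sketch of that update function. The event shape (`time`, `type`) and the `play`, `heartbeat` and `pause` names are assumptions, chosen so that the times match the demo.

```python
from datetime import datetime


def update_session(state, event):
    """Combine the previous session state and one event into the new state."""
    ts = datetime.strptime(event["time"], "%H:%M:%S")
    if state is None or state["status"] == "CLOSED":
        # No open session: this event starts a new one.
        state = {"started": ts, "duration": 0, "paused": None, "status": "OPEN"}
    else:
        state = dict(state)  # never mutate the previous state
        state["duration"] = int((ts - state["started"]).total_seconds())
    if event["type"] == "pause":
        state["paused"] = ts
        state["status"] = "CLOSED"
    return state


# Hypothetical event shape; the times mirror the demo output above.
events = [
    {"time": "07:17:09", "type": "play"},
    {"time": "07:17:14", "type": "heartbeat"},
    {"time": "07:17:19", "type": "heartbeat"},
    {"time": "07:17:23", "type": "pause"},
]

state = None
durations = []
for event in events:
    state = update_session(state, event)
    durations.append(state["duration"])
    print(state["started"].strftime("%H:%M:%S"), f'{state["duration"]}s', state["status"])
```

A stream processor would keep one such state per user or device and apply the same function to every incoming event.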
You can find me at:
@theikkilap
teemu@emblica.fi
https://emblica.fi
Any questions?
Thanks!
Icons from Font Awesome project
