SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
Jacek Kruszelnicki, Numatica Corporation 
E-mail: j a c e k@numatica.com (remove spaces) 
Phone: 781 756 8064 
Scalable Software. Real-time Big Data Analytics. 
Big Data, Simple and Fast: 
1 
Addressing the Shortcomings of Hadoop
Scalable Software. Real-time Big Data Analytics. 
Jacek: 
 20+ years of shipping enterprise software 
 Biased towards simplicity/design elegance. Vendor–neutral. 
 Author, mentor & conference speaker 
 M.S. Computer Science, focused on business value of IT 
Jacek Kruszelnicki 
President & CEO, Numatica Corporation 
Presenter 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
2 
Focus: 
 Distributed, High-throughput, Low-latency software. 
 Real-time (“Fast”) Big Data. Services: 
 IT Strategy and Planning 
 Software Architecture & Design 
 Proof of Concept, Quick Start programs 
 Software Development 
 Trainings, Seminars
Scalable Software. Real-time Big Data Analytics. 
Agenda 
 Hadoop is not a universal, or inexpensive, Big Data stack 
 Technical requirements for a flexible Big/Fast Data stack 
 Solutions thought to be alternatives to Hadoop 
 Why In-Memory Data Grids are a good fit for Big/Fast Data 
 How Hazelcast meets the Big/Fast Data requirements 
 Focus on architectures, but demo/code samples provided 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
3 
“Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction.” - E.F. Schumacher
Scalable Software. Real-time Big Data Analytics. 
Not on Agenda 
 Big Data Tutorial 
 Hadoop or any other technology tutorial 
 In-depth overview of Big Data market 
 Big Data Analytics -> Business Insights 
 CAP Theorem discussion 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
4
Scalable Software. Real-time Big Data Analytics. 
Operational Intelligence: 
 Analyze stream of business activities and external stimuli on-the fly 
 React to them (preferably) instantaneously 
 Real time data stream processing is critical, otherwise business value is lost. Real-time Big (Fast) Data Analytics examples: 
 Dynamic pricing (e-commerce) 
 High-frequency trading 
 Network security threats 
 Credit card fraud prevention 
 Factory floor data collection, RFID 
 Mobile infrastructure, machine to machine (M2M) applications 
 Prescriptive or Location-based applications 
 Real-time dashboards, alerts, and reports 
Fast Data - Business Drivers 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
5
Scalable Software. Real-time Big Data Analytics. 
Big Data – The 3 Vs Re-arranged 
Volume 
Variety 
Velocity 
Common definition: Data sets too large and complex to process using standard DB tools and data processing applications (in short: “Will not fit into MS Excel”) 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
 Structured, semi-structured 
 At-Rest, In-Motion 
 Variety of formats 
 Millions of events per second 
 Need to re-run analytics frequently (Dashboards, etc) 
 Frequently changing data (at rest or in- motion) 
 Stale data provides little business value 
 Also need to do off-line processing (machine learning) 
6 
 Few Googles and Facebooks out there 
 Typical data < 100 TB
Scalable Software. Real-time Big Data Analytics. 
Big...and Fast 
 Rapidly changing, massive stream(s) of data (events) 
 Multiple sources, formats 
 Needs to be processed in-flight 
 Events persisted or not, depending on business value 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
7 
Data In-Motion: 
Data At-Rest: 
 Data already persisted, change notification 
 Multiple formats 
 Re-ingest, re-process 
Software stacks to support “In-Motion” and “At-Rest” Big Data are needed.
Scalable Software. Real-time Big Data Analytics. 
”Big Elephant in the room...” 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
8 
 Currently the large scale data analysis tool of choice 
 De facto standard, almost synonymous with Big Data 
 “Nobody ever got fired for using Hadoop” Problems: Very complex/expensive to deploy and run -> high TCO MapReduce slow, batch-oriented, “intrusive” No stream processing No support for in-memory processing Closely coupled with HDFS (third-party solutions of varying quality) SQL-on-Hadoop limited and slow http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/ http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html/ 
The Hadoop Stack:
Scalable Software. Real-time Big Data Analytics. 
Hadoop... and the kitchen sink 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
9 
Not shown: 
JobTracker. TaskTracker,... 
Avro (Serialization) 
Chukwa(logs, incremental) 
EMR 
BigTop 
Spark 
Impala (vs Hive) 
19 shown, up to 24 
Violates first rule of distributed programming: DO NOT DISTRIBUTE (unnecessarily).
Scalable Software. Real-time Big Data Analytics. 
Hadoop 2.0 - High TCO remains 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
10 
Improved: 
YARN, MapReduce separated 
Removed Name Node/ Job Tracker as SPOF? 
Unchanged: 
More complexity 
Low-level abstractions 
Still Master/Slave
Scalable Software. Real-time Big Data Analytics. 
Hadoop 2.0 - High TCO remains 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
11 
Hadoop-based stack example from the Web. Very Complicated: 
Berkeley Big-data 
Analytics Stack 
(BDAS)
Scalable Software. Real-time Big Data Analytics. 
Hadoop Disillusionment Phase? 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
12 
Hadoop hyped as a universal, disruptive big data solution 
MR is on its way out at Google. 
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/ 
Data Scientists: 76 % felt Hadoop is too slow, too much effort to program http://www.cio.com/article/2449814/big-data/data-scientists-frustrated-by-data-variety-find-hadoop-limiting.html
Scalable Software. Real-time Big Data Analytics. 
Some Hadoop Alternatives 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
13 
Hadoop “extensions” 
MPP DBs 
Still needs many of Hadoop modules: HDFS, YARN, Pig, Hive, HBase, JobTracker.TaskTracker,... 
 Spark 
 Shark (Spark on Hive) 
 Storm 
 Kafka 
 Teradata 
 Vertica 
 Greenplum 
Proprietary, complex, VERY expensive, query language not Turing complete 
NoSQL DBs 
Column: Cassandra, Hbase, etc. 
Document/K-V: MongoDB, CouchDB, Riak, etc. 
In-Memory: VoltDB, etc. 
No ACID/Referential integrity, 
No triggers, foreign keys 
Mongo/Couch ->JSON-centered, Cassandra - complex 
Query language proprietary, subset of SQL 
VoltDB – precoded stored procs, no ad-hoc 
Not Turing complete 
In-Memory Data Grids 
 Coherence (commercial 
 Hazelcast (OSS) 
 Terracotta (commercial) 
 Gemfire (commercial) 
 GridGain (OSS/commercial) 
 Gigaspaces XAP (OSS)
Scalable Software. Real-time Big Data Analytics. 
Big+Fast Alternatives 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
14 
Storm/Kafka 
 Intrusive (introduces exotic abstractions): Streams, Spouts, Bolts, Tasks, Workers, Stream Groups, Topologies 
 Low-level: Shell scripting, Clojure code here and there 
 Complex Admin: Everything requires ZooKeeper, Nimbus is a SPOF 
https://news.ycombinator.com/item?id=8416455
Scalable Software. Real-time Big Data Analytics. 
Big+Fast Alternatives part 2 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
15 
Lambda Architecture example from the Web. Lots of moving parts:
Scalable Software. Real-time Big Data Analytics. 
Big, Fast Stack - Technical Requirements 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
16 
Successful Big + Fast Data stack should have: 
Architectural simplicity and elegance [lowers Total Cost of Ownership (TCO)] 
Low Transactional and Data Latency [response <1s, “fresh” or real-time data] 
High-throughput, overall up- and out- Scalability 
High-throughput Data Stream/Complex Event Processing 
SQL-like querying, ACID, Txs (including XA) 
Distributed Execution Framework 
High Availability and Fault Tolerance. 
Stateless, decentralized, elastic cluster management
Scalable Software. Real-time Big Data Analytics. 
In-Memory Grids for Big And Fast Data 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
17 
 Technology moving forward (6x sequential, 100K x random): 
 RAM is the new disk (ns) 
 DISK is the new tape (ms) 
 Databases (RDBMS, NoSQL) are not enough 
 SQL or SQL-like languages are not Turing complete 
 Object-oriented/functional/parallel programming abstractions needed 
 In-memory data is volatile/perishable? 
 In-memory data can be persisted, if so desired 
 No need to achieve archival durability. 
 Most Big Data brought in already stored somewhere else 
 Stream data may not be relevant for long, typically limited retention time
Scalable Software. Real-time Big Data Analytics. 
Why Hazelcast? 
The most intellectually elegant distributed in-memory data/compute grid. 
Distributed Execution Framework (extension of Java’s Executor Service) 
Distributed Queries (SQL/Predicate), Data affinity (execution on specific node execution) 
Elastic cluster management Java API included 
Auto-discovery of nodes and re-balancing 
Minimalism as design aesthetic: 
Non-intrusive, 
No dependencies 
3.1MB single jar library. 
Apache License 2, commercial extensions + support 
Implements Java APIs (Map, List, Set, Queue, Lock) in a distributed manner 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
18 
Peer-peer architecture, no single point of failure
Scalable Software. Real-time Big Data Analytics. 
Hazelcast vs Hadoop 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
19 
Java8 + Hazelcast (single 3.1 MB jar) Java + Hadoop 
Pluggable persistence (HDFS, MapR, RDBMS) HDFS 
MapReduce MapReduce 
Data manipulation & querying Hive (not OLTP) 
In-memory parallel processing Spark 
Stream processing Storm 
Messaging Kafka 
Scalable and Elastic ZooKeeper 
Cluster Management ZooKeeper, YARN, Mesos 
Elastic, simple (automatic cluster re-balancing) N/A
Scalable Software. Real-time Big Data Analytics. 
RAM currently maxes out at ~ 640GB/server (256GB?) 
RAM still more expensive than SSDs and HDDs 
Garbage collection 
Considerations 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
20 
Cost and capacity limitations will disappear over time 
Off-heap memory and specialized JVMs (Azul, etc.)
Scalable Software. Real-time Big Data Analytics. 
From Technologies to Platforms 
 Backing Storage: RDBMS, NoSQL, HDFS GlusterFs, XFS 
 Enterprise Message Store(s) 
 Multi-Source Data Harvesting, Ingestion, Transformation 
 Data Abstraction, Modeling, Querying, Visualization 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
21 
In-Memory Data/Compute Grids not enough:
Scalable Software. Real-time Big Data Analytics. 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
22 
The Demo: Clickstream Analytics + Intrusion Detection 
Traffic Generator 
GeoLocation Map (5M entries) 
UserActionQueue 
GeoDataLoader 
JSON/Http 
RDBMS (PostgreSQL) 
InsightMagic 
Node 1 (JVM) 
UserAction QueueProcessor 
JSON/Http (“fire and forget”) 
Backing Storage 
(optional) 
UserActionHistory Map (incudes location) 
“At Rest” Analytics Client 
“timestamp > now() - 30 minutes” 
1. take() 
“In-Motion” Analytics Client 
“countryCode=‘CN’” 
.csv files 
“201.35.217.36” 
“Columbus,OH, US” 
A. Existing Data 
2. resolveGeo() 
3. putIfAbsent() 
Features: 
SOA Architecture 
Data Ingestion 
Complex Event Processing 
Dimension, Fact Maps 
In-Motion, At-Rest Analytics 
Backing Storage 
ACID + Txs available 
B. Real-time Events 
Analytics 
& Visualization 
Data Abstraction 
& Virtualization 
Messaging should be 
a separate cluster 
Copyright (C) 2014 Numatica Corporation
Scalable Software. Real-time Big Data Analytics. 
Q & A 
Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 
23 
“Intellectuals solve problems. Geniuses prevent them.” 
-- Albert Einstein 
More questions? Feel free to contact me at: 
Jacek Kruszelnicki, Numatica Corporation 
E-mail: j a c e k@numatica.com (remove spaces) 
Phone: 781 756 8064
Appendix 
24
25

Más contenido relacionado

La actualidad más candente

Infinspan: In-memory data grid meets NoSQL
Infinspan: In-memory data grid meets NoSQLInfinspan: In-memory data grid meets NoSQL
Infinspan: In-memory data grid meets NoSQLManik Surtani
 
Web session replication with Hazelcast
Web session replication with HazelcastWeb session replication with Hazelcast
Web session replication with HazelcastEmrah Kocaman
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! Uri Cohen
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
 
Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Ajay Kumar Uppal
 
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...Cloudera, Inc.
 
Choosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot ApproachChoosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot ApproachDATAVERSITY
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Gridsjlorenzocima
 
Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databasesguestdfd1ec
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014EDB
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopDataWorks Summit
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0EDB
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax
 
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...SL Corporation
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopHortonworks
 
Understanding the IBM Power Systems Advantage
Understanding the IBM Power Systems AdvantageUnderstanding the IBM Power Systems Advantage
Understanding the IBM Power Systems AdvantageIBM Power Systems
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16Kangaroot
 

La actualidad más candente (20)

Infinspan: In-memory data grid meets NoSQL
Infinspan: In-memory data grid meets NoSQLInfinspan: In-memory data grid meets NoSQL
Infinspan: In-memory data grid meets NoSQL
 
Web session replication with Hazelcast
Web session replication with HazelcastWeb session replication with Hazelcast
Web session replication with Hazelcast
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified!
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2
 
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
 
Choosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot ApproachChoosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot Approach
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Grids
 
Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databases
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing Hadoop
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
 
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
Understanding the IBM Power Systems Advantage
Understanding the IBM Power Systems AdvantageUnderstanding the IBM Power Systems Advantage
Understanding the IBM Power Systems Advantage
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16
 

Similar a Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop

Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big DataDataWorks Summit
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Hortonworks
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deckKeithETD_CTO
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017Joshua Patterson
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavSwapnil (Neil) Jadhav
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big DataNetApp
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 

Similar a Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop (20)

Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big Data
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Final deck
Final deckFinal deck
Final deck
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big Data
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 

Más de Hazelcast

Hazelcast 3.6 Roadmap Preview
Hazelcast 3.6 Roadmap PreviewHazelcast 3.6 Roadmap Preview
Hazelcast 3.6 Roadmap PreviewHazelcast
 
Time to Make the Move to In-Memory Data Grids
Time to Make the Move to In-Memory Data GridsTime to Make the Move to In-Memory Data Grids
Time to Make the Move to In-Memory Data GridsHazelcast
 
The Power of the JVM: Applied Polyglot Projects with Java and JavaScript
The Power of the JVM: Applied Polyglot Projects with Java and JavaScriptThe Power of the JVM: Applied Polyglot Projects with Java and JavaScript
The Power of the JVM: Applied Polyglot Projects with Java and JavaScriptHazelcast
 
JCache - It's finally here
JCache -  It's finally hereJCache -  It's finally here
JCache - It's finally hereHazelcast
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentHazelcast
 
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorganShared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorganHazelcast
 
Applying Real-time SQL Changes in your Hazelcast Data Grid
Applying Real-time SQL Changes in your Hazelcast Data GridApplying Real-time SQL Changes in your Hazelcast Data Grid
Applying Real-time SQL Changes in your Hazelcast Data GridHazelcast
 
WAN Replication: Hazelcast Enterprise Lightning Talk
WAN Replication: Hazelcast Enterprise Lightning TalkWAN Replication: Hazelcast Enterprise Lightning Talk
WAN Replication: Hazelcast Enterprise Lightning TalkHazelcast
 
JAAS Security Suite: Hazelcast Enterprise Lightning Talk
JAAS Security Suite: Hazelcast Enterprise Lightning TalkJAAS Security Suite: Hazelcast Enterprise Lightning Talk
JAAS Security Suite: Hazelcast Enterprise Lightning TalkHazelcast
 
Hazelcast for Terracotta Users
Hazelcast for Terracotta UsersHazelcast for Terracotta Users
Hazelcast for Terracotta UsersHazelcast
 
Extreme Network Performance with Hazelcast on Torusware
Extreme Network Performance with Hazelcast on ToruswareExtreme Network Performance with Hazelcast on Torusware
Extreme Network Performance with Hazelcast on ToruswareHazelcast
 
JAXLondon - Squeezing Performance of IMDGs
JAXLondon - Squeezing Performance of IMDGsJAXLondon - Squeezing Performance of IMDGs
JAXLondon - Squeezing Performance of IMDGsHazelcast
 
OrientDB & Hazelcast: In-Memory Distributed Graph Database
 OrientDB & Hazelcast: In-Memory Distributed Graph Database OrientDB & Hazelcast: In-Memory Distributed Graph Database
OrientDB & Hazelcast: In-Memory Distributed Graph DatabaseHazelcast
 
How to Use HazelcastMQ for Flexible Messaging and More
 How to Use HazelcastMQ for Flexible Messaging and More How to Use HazelcastMQ for Flexible Messaging and More
How to Use HazelcastMQ for Flexible Messaging and MoreHazelcast
 
Devoxx UK 2014 High Performance In-Memory Java with Open Source
Devoxx UK 2014   High Performance In-Memory Java with Open SourceDevoxx UK 2014   High Performance In-Memory Java with Open Source
Devoxx UK 2014 High Performance In-Memory Java with Open SourceHazelcast
 
JSR107 State of the Union JavaOne 2013
JSR107  State of the Union JavaOne 2013JSR107  State of the Union JavaOne 2013
JSR107 State of the Union JavaOne 2013Hazelcast
 
Jfokus - Hazlecast
Jfokus - HazlecastJfokus - Hazlecast
Jfokus - HazlecastHazelcast
 
In-memory No SQL- GIDS2014
In-memory No SQL- GIDS2014In-memory No SQL- GIDS2014
In-memory No SQL- GIDS2014Hazelcast
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesHazelcast
 
How to Speed up your Database
How to Speed up your DatabaseHow to Speed up your Database
How to Speed up your DatabaseHazelcast
 

Más de Hazelcast (20)

Hazelcast 3.6 Roadmap Preview
Hazelcast 3.6 Roadmap PreviewHazelcast 3.6 Roadmap Preview
Hazelcast 3.6 Roadmap Preview
 
Time to Make the Move to In-Memory Data Grids
Time to Make the Move to In-Memory Data GridsTime to Make the Move to In-Memory Data Grids
Time to Make the Move to In-Memory Data Grids
 
The Power of the JVM: Applied Polyglot Projects with Java and JavaScript
The Power of the JVM: Applied Polyglot Projects with Java and JavaScriptThe Power of the JVM: Applied Polyglot Projects with Java and JavaScript
The Power of the JVM: Applied Polyglot Projects with Java and JavaScript
 
JCache - It's finally here
JCache -  It's finally hereJCache -  It's finally here
JCache - It's finally here
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
 
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorganShared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
Shared Memory Performance: Beyond TCP/IP with Ben Cotton, JPMorgan
 
Applying Real-time SQL Changes in your Hazelcast Data Grid
Applying Real-time SQL Changes in your Hazelcast Data GridApplying Real-time SQL Changes in your Hazelcast Data Grid
Applying Real-time SQL Changes in your Hazelcast Data Grid
 
WAN Replication: Hazelcast Enterprise Lightning Talk
WAN Replication: Hazelcast Enterprise Lightning TalkWAN Replication: Hazelcast Enterprise Lightning Talk
WAN Replication: Hazelcast Enterprise Lightning Talk
 
JAAS Security Suite: Hazelcast Enterprise Lightning Talk
JAAS Security Suite: Hazelcast Enterprise Lightning TalkJAAS Security Suite: Hazelcast Enterprise Lightning Talk
JAAS Security Suite: Hazelcast Enterprise Lightning Talk
 
Hazelcast for Terracotta Users
Hazelcast for Terracotta UsersHazelcast for Terracotta Users
Hazelcast for Terracotta Users
 
Extreme Network Performance with Hazelcast on Torusware
Extreme Network Performance with Hazelcast on ToruswareExtreme Network Performance with Hazelcast on Torusware
Extreme Network Performance with Hazelcast on Torusware
 
JAXLondon - Squeezing Performance of IMDGs
JAXLondon - Squeezing Performance of IMDGsJAXLondon - Squeezing Performance of IMDGs
JAXLondon - Squeezing Performance of IMDGs
 
OrientDB & Hazelcast: In-Memory Distributed Graph Database
 OrientDB & Hazelcast: In-Memory Distributed Graph Database OrientDB & Hazelcast: In-Memory Distributed Graph Database
OrientDB & Hazelcast: In-Memory Distributed Graph Database
 
How to Use HazelcastMQ for Flexible Messaging and More
 How to Use HazelcastMQ for Flexible Messaging and More How to Use HazelcastMQ for Flexible Messaging and More
How to Use HazelcastMQ for Flexible Messaging and More
 
Devoxx UK 2014 High Performance In-Memory Java with Open Source
Devoxx UK 2014   High Performance In-Memory Java with Open SourceDevoxx UK 2014   High Performance In-Memory Java with Open Source
Devoxx UK 2014 High Performance In-Memory Java with Open Source
 
JSR107 State of the Union JavaOne 2013
JSR107  State of the Union JavaOne 2013JSR107  State of the Union JavaOne 2013
JSR107 State of the Union JavaOne 2013
 
Jfokus - Hazlecast
Jfokus - HazlecastJfokus - Hazlecast
Jfokus - Hazlecast
 
In-memory No SQL- GIDS2014
In-memory No SQL- GIDS2014In-memory No SQL- GIDS2014
In-memory No SQL- GIDS2014
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
 
How to Speed up your Database
How to Speed up your DatabaseHow to Speed up your Database
How to Speed up your Database
 

Último

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Último (20)

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop

  • 1. Jacek Kruszelnicki, Numatica Corporation E-mail: j a c e k@numatica.com (remove spaces) Phone: 781 756 8064 Scalable Software. Real-time Big Data Analytics. Big Data, Simple and Fast: 1 Addressing the Shortcomings of Hadoop
  • 2. Scalable Software. Real-time Big Data Analytics. Jacek:  20+ years of shipping enterprise software  Biased towards simplicity/design elegance. Vendor–neutral.  Author, mentor & conference speaker  M.S. Computer Science, focused on business value of IT Jacek Kruszelnicki President & CEO, Numatica Corporation Presenter Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 2 Focus:  Distributed, High-throughput, Low-latency software.  Real-time (“Fast”) Big Data. Services:  IT Strategy and Planning  Software Architecture & Design  Proof of Concept, Quick Start programs  Software Development  Trainings, Seminars
  • 3. Scalable Software. Real-time Big Data Analytics. Agenda  Hadoop is not a universal, or inexpensive, Big Data stack  Technical requirements for a flexible Big/Fast Data stack  Solutions thought to be alternatives to Hadoop  Why In-Memory Data Grids are a good fit for Big/Fast Data  How Hazelcast meets the Big/Fast Data requirements  Focus on architectures, but demo/code samples provided Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 3 “Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction.” - E.F. Schumacher
  • 4. Scalable Software. Real-time Big Data Analytics. Not on Agenda  Big Data Tutorial  Hadoop or any other technology tutorial  In-depth overview of Big Data market  Big Data Analytics -> Business Insights  CAP Theorem discussion Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 4
  • 5. Scalable Software. Real-time Big Data Analytics. Operational Intelligence:  Analyze stream of business activities and external stimuli on-the fly  React to them (preferably) instantaneously  Real time data stream processing is critical, otherwise business value is lost. Real-time Big (Fast) Data Analytics examples:  Dynamic pricing (e-commerce)  High-frequency trading  Network security threats  Credit card fraud prevention  Factory floor data collection, RFID  Mobile infrastructure, machine to machine (M2M) applications  Prescriptive or Location-based applications  Real-time dashboards, alerts, and reports Fast Data - Business Drivers Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 5
  • 6. Scalable Software. Real-time Big Data Analytics. Big Data – The 3 Vs Re-arranged Volume Variety Velocity Common definition: Data sets too large and complex to process using standard DB tools and data processing applications (in short: “Will not fit into MS Excel”) Copyright (C) 2014 Numatica Corporation. All Rights Reserved.  Structured, semi-structured  At-Rest, In-Motion  Variety of formats  Millions of events per second  Need to re-run analytics frequently (Dashboards, etc)  Frequently changing data (at rest or in- motion)  Stale data provides little business value  Also need to do off-line processing (machine learning) 6  Few Googles and Facebooks out there  Typical data < 100 TB
  • 7. Scalable Software. Real-time Big Data Analytics. Big...and Fast  Rapidly changing, massive stream(s) of data (events)  Multiple sources, formats  Needs to be processed in-flight  Events persisted or not, depending on business value Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 7 Data In-Motion: Data At-Rest:  Data already persisted, change notification  Multiple formats  Re-ingest, re-process Software stacks to support “In-Motion” and “At-Rest” Big Data are needed.
  • 8. Scalable Software. Real-time Big Data Analytics. ”Big Elephant in the room...” Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 8  Currently the large scale data analysis tool of choice  De facto standard, almost synonymous with Big Data  “Nobody ever got fired for using Hadoop” Problems: Very complex/expensive to deploy and run -> high TCO MapReduce slow, batch-oriented, “intrusive” No stream processing No support for in-memory processing Closely coupled with HDFS (third-party solutions of varying quality) SQL-on-Hadoop limited and slow http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/ http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html/ The Hadoop Stack:
  • 9. Scalable Software. Real-time Big Data Analytics. Hadoop... and the kitchen sink Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 9 Not shown: JobTracker. TaskTracker,... Avro (Serialization) Chukwa(logs, incremental) EMR BigTop Spark Impala (vs Hive) 19 shown, up to 24 Violates first rule of distributed programming: DO NOT DISTRIBUTE (unnecessarily).
  • 10. Scalable Software. Real-time Big Data Analytics. Hadoop 2.0 - High TCO remains Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 10 Improved: YARN, MapReduce separated Removed Name Node/ Job Tracker as SPOF? Unchanged: More complexity Low-level abstractions Still Master/Slave
  • 11. Scalable Software. Real-time Big Data Analytics. Hadoop 2.0 - High TCO remains Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 11 Hadoop-based stack example from the Web. Very Complicated: Berkeley Big-data Analytics Stack (BDAS)
  • 12. Scalable Software. Real-time Big Data Analytics. Hadoop Disillusionment Phase? Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 12 Hadoop hyped as a universal, disruptive big data solution MR is on its way out at Google. http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/ Data Scientists: 76 % felt Hadoop is too slow, too much effort to program http://www.cio.com/article/2449814/big-data/data-scientists-frustrated-by-data-variety-find-hadoop-limiting.html
  • 13. Scalable Software. Real-time Big Data Analytics. Some Hadoop Alternatives Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 13 Hadoop “extensions” MPP DBs Still needs many of Hadoop modules: HDFS, YARN, Pig, Hive, HBase, JobTracker.TaskTracker,...  Spark  Shark (Spark on Hive)  Storm  Kafka  Teradata  Vertica  Greenplum Proprietary, complex, VERY expensive, query language not Turing complete NoSQL DBs Column: Cassandra, Hbase, etc. Document/K-V: MongoDB, CouchDB, Riak, etc. In-Memory: VoltDB, etc. No ACID/Referential integrity, No triggers, foreign keys Mongo/Couch ->JSON-centered, Cassandra - complex Query language proprietary, subset of SQL VoltDB – precoded stored procs, no ad-hoc Not Turing complete In-Memory Data Grids  Coherence (commercial  Hazelcast (OSS)  Terracotta (commercial)  Gemfire (commercial)  GridGain (OSS/commercial)  Gigaspaces XAP (OSS)
  • 14. Scalable Software. Real-time Big Data Analytics. Big+Fast Alternatives Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 14 Storm/Kafka  Intrusive (introduces exotic abstractions): Streams, Spouts, Bolts, Tasks, Workers, Stream Groups, Topologies  Low-level: Shell scripting, Clojure code here and there  Complex Admin: Everything requires ZooKeeper, Nimbus is a SPOF https://news.ycombinator.com/item?id=8416455
  • 15. Scalable Software. Real-time Big Data Analytics. Big+Fast Alternatives part 2 Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 15 Lambda Architecture example from the Web. Lots of moving parts:
  • 16. Scalable Software. Real-time Big Data Analytics. Big, Fast Stack - Technical Requirements Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 16 Successful Big + Fast Data stack should have: Architectural simplicity and elegance [lowers Total Cost of Ownership (TCO)] Low Transactional and Data Latency [response <1s, “fresh” or real-time data] High-throughput, overall up- and out- Scalability High-throughput Data Stream/Complex Event Processing SQL-like querying, ACID, Txs (including XA) Distributed Execution Framework High Availability and Fault Tolerance. Stateless, decentralized, elastic cluster management
  • 17. Scalable Software. Real-time Big Data Analytics. In-Memory Grids for Big And Fast Data Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 17  Technology moving forward (6x sequential, 100K x random):  RAM is the new disk (ns)  DISK is the new tape (ms)  Databases (RDBMS, NoSQL) are not enough  SQL or SQL-like languages are not Turing complete  Object-oriented/functional/parallel programming abstractions needed  In-memory data is volatile/perishable?  In-memory data can be persisted, if so desired  No need to achieve archival durability.  Most Big Data brought in already stored somewhere else  Stream data may not be relevant for long, typically limited retention time
  • 18. Scalable Software. Real-time Big Data Analytics. Why Hazelcast? The most intellectually elegant distributed in-memory data/compute grid. Distributed Execution Framework (extension of Java’s Executor Service) Distributed Queries (SQL/Predicate), Data affinity (execution on specific node execution) Elastic cluster management Java API included Auto-discovery of nodes and re-balancing Minimalism as design aesthetic: Non-intrusive, No dependencies 3.1MB single jar library. Apache License 2, commercial extensions + support Implements Java APIs (Map, List, Set, Queue, Lock) in a distributed manner Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 18 Peer-peer architecture, no single point of failure
  • 19. Scalable Software. Real-time Big Data Analytics. Hazelcast vs Hadoop Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 19 Java8 + Hazelcast (single 3.1 MB jar) Java + Hadoop Pluggable persistence (HDFS, MapR, RDBMS) HDFS MapReduce MapReduce Data manipulation & querying Hive (not OLTP) In-memory parallel processing Spark Stream processing Storm Messaging Kafka Scalable and Elastic ZooKeeper Cluster Management ZooKeeper, YARN, Mesos Elastic, simple (automatic cluster re-balancing) N/A
  • 20. Scalable Software. Real-time Big Data Analytics. RAM currently maxes out at ~ 640GB/server (256GB?) RAM still more expensive than SSDs and HDDs Garbage collection Considerations Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 20 Cost and capacity limitations will disappear over time Off-heap memory and specialized JVMs (Azul, etc.)
  • 21. Scalable Software. Real-time Big Data Analytics. From Technologies to Platforms  Backing Storage: RDBMS, NoSQL, HDFS GlusterFs, XFS  Enterprise Message Store(s)  Multi-Source Data Harvesting, Ingestion, Transformation  Data Abstraction, Modeling, Querying, Visualization Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 21 In-Memory Data/Compute Grids not enough:
  • 22. Scalable Software. Real-time Big Data Analytics. Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 22 The Demo: Clickstream Analytics + Intrusion Detection Traffic Generator GeoLocation Map (5M entries) UserActionQueue GeoDataLoader JSON/Http RDBMS (PostgreSQL) InsightMagic Node 1 (JVM) UserAction QueueProcessor JSON/Http (“fire and forget”) Backing Storage (optional) UserActionHistory Map (incudes location) “At Rest” Analytics Client “timestamp > now() - 30 minutes” 1. take() “In-Motion” Analytics Client “countryCode=‘CN’” .csv files “201.35.217.36” “Columbus,OH, US” A. Existing Data 2. resolveGeo() 3. putIfAbsent() Features: SOA Architecture Data Ingestion Complex Event Processing Dimension, Fact Maps In-Motion, At-Rest Analytics Backing Storage ACID + Txs available B. Real-time Events Analytics & Visualization Data Abstraction & Virtualization Messaging should be a separate cluster Copyright (C) 2014 Numatica Corporation
  • 23. Scalable Software. Real-time Big Data Analytics. Q & A Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 23 “Intellectuals solve problems. Geniuses prevent them.” -- Albert Einstein More questions? Feel free to contact me at: Jacek Kruszelnicki, Numatica Corporation E-mail: j a c e k@numatica.com (remove spaces) Phone: 781 756 8064
  • 25. 25