2. • Who we are and what we are doing
• Big Data era
• BSP (Bulk synchronous parallel)
• Apache Giraph
• Storm
• Our use case
• Conclusion
Topics today
Starschema
Experience and Innovation
3. Continuous growth
25 FTE plus external resources,
over $1.5 million EBIT
Open source projects
Share the knowledge with the public.
Open source projects in the ETL and data
warehousing fields.
Founded in 2006
The company was founded by private
owners with a decade of BI and data
warehouse background
R&D
Cooperation with Obuda University,
NKE, EU co-funded technology
research and development
COMPANY Data
Facts about Starschema
4. Big Data era: The rise of Hadoop
Google              | Year of white paper | Apache    | Year
GFS                 | 2003                | HDFS      | 2007
MapReduce           | 2004                | Hadoop MR | 2007
BigTable            | 2006                | HBase     | 2007
Chubby Lock Service | 2006                | ZooKeeper | 2007
Pregel              | 2009                | Giraph    | 2011
Dremel              | 2010                | Drill     | 2012 ?
Which is next? (Curator, Falcon, MRQL, etc.)
5. • Leslie Valiant - article in Nov. 1990
• Supersteps
• Data stored in local memory
• Asynchronous data processing
• Barrier sync
• Optimal load balancing (more logical processes
than physical processors, random allocation of
processes)
• Solution differences (protocols, buffer
management, routing strategies)
• No deadlocks or race conditions
(since there are no circular dependencies)
• Use cases
BSP (Bulk synchronous parallel): What is it? What is it good for?
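The superstep and barrier ideas above can be illustrated with a small sketch. It is not taken from the talk: it assumes one Python thread per logical process, uses `threading.Barrier` as the barrier sync, and the function name `bsp_prefix_sum` is made up for this example. Each superstep has a send phase and a receive phase, and no process moves on until all of them have reached the barrier.

```python
import threading


def bsp_prefix_sum(values):
    """Inclusive prefix sum in BSP style, one logical process per element."""
    n = len(values)
    if n == 0:
        return []
    local = list(values)        # local[i] stands for the private memory of process i
    inbox = [None] * n          # inbox[i] holds the message delivered to process i
    barrier = threading.Barrier(n)

    def process(i):
        step = 1
        while step < n:
            # Superstep, send phase: pass the local value `step` positions to the right.
            if i + step < n:
                inbox[i + step] = local[i]
            barrier.wait()      # barrier sync: every message of this superstep is delivered
            # Superstep, receive phase: fold the received message into local memory.
            if inbox[i] is not None:
                local[i] += inbox[i]
                inbox[i] = None
            barrier.wait()      # nobody sends again until everyone has read
            step *= 2

    threads = [threading.Thread(target=process, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local


print(bsp_prefix_sum([1, 2, 3, 4, 5]))
```

Because every process waits at the same barriers in the same order, no process ever waits on another in a cycle, which is the property the deadlock bullet refers to.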
7. Apache Giraph
• A loose implementation of Pregel
• Avery Ching: "We can't use it at Yahoo, that's too bad"
• Developed at Yahoo
• Runs on existing MapReduce infrastructure
• Netty-based communication instead of Hadoop RPC
• In-memory
• Fault tolerant
• Internal state is saved at user-defined intervals
• Master/slave architecture
What is it?
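To make the Pregel model behind Giraph more concrete, here is a minimal single-machine sketch of vertex-centric computation. It is plain Python, not the Giraph API: the function name `pregel_max` and the dictionary-based graph are assumptions for illustration. Every vertex propagates the largest value it has seen, goes quiet when nothing changes, and is woken up again by incoming messages.

```python
def pregel_max(graph, values):
    """Propagate the maximum value through a graph, Pregel style.

    graph:  vertex -> list of out-neighbours (every neighbour must also be a key)
    values: vertex -> initial number
    """
    values = dict(values)
    active = set(graph)                     # every vertex starts active
    inbox = {v: [] for v in graph}
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            # compute(): runs once per active vertex in each superstep
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for neighbour in graph[v]:
                    outbox[neighbour].append(new_value)
            # otherwise the vertex sends nothing, i.e. it votes to halt
        # Barrier: messages sent in this superstep are only visible in the next one.
        inbox = outbox
        active = {v for v in graph if inbox[v]}   # a message reactivates a vertex
        superstep += 1
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1}))
```

The computation stops when no vertex has pending messages. In a real deployment the vertices are partitioned across workers and the inbox swap is a cluster-wide barrier.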
8. Storm
• Storm is a free and open source distributed real-time
computation system
• Developed at BackType, open-sourced by Twitter in 2011
• Guaranteed data processing
• Horizontal scalability
• Fault tolerance
• ZeroMQ for message passing
• Processing of unbounded sequences of tuples
• Groupings
What is it?
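Groupings decide which task of a downstream bolt receives each tuple. The sketch below is plain Python rather than the Storm API, and the helper names are hypothetical. It contrasts a fields grouping, where tuples with the same key always reach the same task, with a shuffle grouping, where tuples are spread evenly. Round-robin stands in for Storm's random distribution here so that the example stays deterministic.

```python
import zlib
from itertools import cycle


def fields_grouping(tuple_, field, num_tasks):
    """Route by a hash of one field, so equal keys always land on the same task."""
    key = str(tuple_[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def shuffle_grouping(num_tasks):
    """Return a router that spreads tuples evenly over the tasks (round-robin stand-in)."""
    tasks = cycle(range(num_tasks))
    return lambda tuple_: next(tasks)


stream = [{"word": w} for w in ["storm", "giraph", "storm", "hadoop", "storm"]]

by_field = [fields_grouping(t, "word", 3) for t in stream]
shuffle = shuffle_grouping(3)
by_shuffle = [shuffle(t) for t in stream]

print(by_field)    # every "storm" tuple gets the same task index
print(by_shuffle)  # task indices cycle regardless of tuple content
```

A fields grouping is what a per-key aggregation such as a word count needs, while a shuffle grouping suits stateless steps such as parsing or cleaning.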
9. Storm
What is it for?
• Analyze, clean, normalize
• Real-time calculation
• Real-time ETL
• Failure detection from log files
• Machine data analysis
• IT early-warning systems, security and fraud detection
• Traffic information, DoS attacks
• Stream processing - Continuous computation -
Distributed RPC
10. Our use case
• Real-time calculation
• Error detection
• Horizontal scalability
• Fast implementation
• High-availability
• Error prediction
POC: Processing machine data from sensors
Requirements
11. Our use case part 2
• Chosen tool: Storm
• One spout for each sensor
• Dynamic addition and removal of spouts
• Error detection based on statistical calculations
• ~200 lines of code
• HA capability of Storm
POC: Processing machine data from sensors
Solution:
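The slides do not say which statistics the POC used, so the following is only one plausible sketch of "error detection based on statistical calculations". It assumes a rolling window per sensor and flags a reading that lies more than a fixed number of standard deviations from the recent mean. The class name `SensorMonitor`, the function `on_reading`, the window size and the threshold are all hypothetical. It is plain Python, not Storm code.

```python
from collections import deque
from statistics import mean, pstdev


class SensorMonitor:
    """Rolling z-score check for one sensor (an assumed detection method)."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.readings = deque(maxlen=window)   # recent readings treated as normal
        self.threshold = threshold
        self.min_samples = min_samples

    def process(self, value):
        """Return True if value deviates strongly from the recent window."""
        is_error = False
        if len(self.readings) >= self.min_samples:
            mu = mean(self.readings)
            sigma = pstdev(self.readings)
            if sigma > 0:
                is_error = abs(value - mu) > self.threshold * sigma
        if not is_error:
            self.readings.append(value)        # keep outliers out of the baseline
        return is_error


monitors = {}   # sensor id -> monitor, created on first sight of a sensor


def on_reading(sensor_id, value):
    """Dispatch a reading to the monitor of its sensor, adding sensors dynamically."""
    if sensor_id not in monitors:
        monitors[sensor_id] = SensorMonitor()
    return monitors[sensor_id].process(value)


for reading in [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1]:
    print(reading, on_reading("sensor-1", reading))
```

Keeping one small state object per sensor mirrors the "one spout for each sensor" layout: sensors can be added or dropped without touching the state of the others.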
12. Conclusion
• Extend existing infrastructure
• Answers to new questions
• Re-think old problems
• New solutions, new features
• Happy customers/users
• $$$