This document discusses Apache Druid, an open-source distributed real-time analytics database. It summarizes Druid's evolution, architecture, use cases, and how companies use it. The document outlines Druid's ability to handle large, high-dimensional datasets with sub-second queries and discusses its core components like segments for efficient storage and parallelism. It concludes by inviting the reader to join the Druid community.
2. Who Am I
Rommel Garcia
Director, Field Engineering @Imply
Author: Virtualizing Hadoop
10+ years: distributed systems, big data, security, cloud, GPU
3. Agenda
• Evolution of analytic platforms
• Yet, decision makers want more
• The technical challenges
• Apache Druid: The Genesis
• Architecture
• Real-time Use Cases
• Powered by Druid
• Join the community!
5. Yet, decision makers want more
Still have problems to solve:
• can’t get data fast enough
• interacting with data instantly is tough
• large amounts of data to slice and dice and drill down into
• need to make decisions now
6. The technical challenges
• Scale: when data is large, we need a lot of servers
• Speed: aiming for sub-second response time
• Complexity: too many fine-grained combinations to precompute
• High dimensionality: 10s or 100s of dimensions
• Concurrency: many users and tenants
• Freshness: load from streams
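A rough back-of-the-envelope sketch of why the Complexity and High dimensionality points above rule out precomputing every rollup: with d dimensions there are 2^d possible GROUP BY dimension sets. The function names below are hypothetical helpers for illustration only and are not part of Druid.

```python
from math import comb


def rollup_count(num_dimensions: int) -> int:
    """All distinct GROUP BY dimension sets, including the empty grand total."""
    return 2 ** num_dimensions


def rollup_count_up_to(num_dimensions: int, max_group_by: int) -> int:
    """GROUP BY dimension sets that use at most max_group_by dimensions."""
    return sum(comb(num_dimensions, k) for k in range(max_group_by + 1))


if __name__ == "__main__":
    # Illustrative dimension counts in the "10s or 100s" range.
    for d in (10, 30, 100):
        print(f"{d} dimensions: {rollup_count(d)} possible rollups, "
              f"{rollup_count_up_to(d, 3)} using at most 3 dimensions")
```

Even limiting queries to three dimensions at a time leaves over 160,000 combinations at 100 dimensions, and that is before counting the distinct values within each dimension.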
7. Apache Druid: The Genesis
Vadim Ogievetsky, Gian Merlino, Fangjin Yang
12. Join the community
Druid community site (current): http://druid.io/
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started