1. FlumeBase Study
Nov. 29, 2011
Willis Gong
Big Data Engineering Team
Hanborq Inc.
2. Application scenario
• Originating tier
– Automatically reconfigured as fan-out when a flow is pulled from a stream
– Uses agentBESink to forward events to FB’s ‘collectorSource’
• FlumeBase:
– Is actually a physical Flume node created with the FlumeNode(…) constructor
– Presents two types of logical nodes
• Source adapting node
– One node per stream: reuses and de-multiplexes the stream into flows
– Input formats: delimited, regex, Avro
• Output node
– One node per named ‘flow’
– Emits Avro records
– Must be manually re-routed to the appropriate sink
• FB can also read from a local file source
3. Flumebase Server
• Stream: shares events from the same Flume node; created by the SQL statement
“CREATE STREAM …”; composed of zero or more flows
• Flow: each ‘SELECT’ statement produces a flow
• rtsqlmultisink
– Input side: reuses events from the same collectorSource
– Output side: effectively unused (must be manually replaced)
• rtsqlsink: wraps Flume events and drives them into the FlumeBase flow pipeline
• rtsqlsource: emits the Avro records produced by the FlumeBase flow pipeline
• FlumeBase flow pipeline: the main thread
– Processes operations from the shell
– Manages the flow lifecycle (create, deploy, event-feed, terminate)
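The lifecycle steps above can be sketched as follows; this is a minimal illustration only, and the class and method names are hypothetical, not the real FlumeBase API:

```python
# Minimal sketch of the flow lifecycle described above
# (create -> deploy -> event-feed -> terminate).
# All names here are hypothetical, not actual FlumeBase classes.

class Flow:
    def __init__(self, query):
        self.query = query
        self.state = "created"
        self.results = []

    def deploy(self):
        # Wire the flow between its input (rtsqlsink) and output (rtsqlsource).
        self.state = "deployed"

    def feed_event(self, event):
        # The pipeline drives each incoming event through the flow.
        assert self.state == "deployed"
        self.results.append(event)

    def terminate(self):
        self.state = "terminated"


flow = Flow("SELECT * FROM logs")   # create
flow.deploy()                       # deploy
flow.feed_event({"msg": "hello"})   # event-feed
flow.terminate()                    # terminate
```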
4. Flumebase flow pipeline
• Flow:
– Is a graph of flow elements
– Takes input from rtsqlsink and produces output to rtsqlsource
• Flow element:
– Each carries out one piece of a SQL query, e.g.:
• projection, aggregation, filter, join, etc.
– Driven by the pipeline: takes an event and produces output
– Output behavior varies by implementation:
• output to the next stage’s queue, or
• output as the flow’s final result, or
• cache and output later (for aggregation)
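The element graph above can be sketched as a chain where each element takes an event and either forwards output downstream or emits a final result; the class names are illustrative, not actual FlumeBase code:

```python
# Sketch of a flow as a chain of flow elements. Each element processes
# an event and sends output to the next stage's input, or appends it to
# the flow's final results when there is no downstream element.

class FlowElement:
    def __init__(self, downstream=None):
        self.downstream = downstream

    def emit(self, record, results):
        # Next stage's queue, or the flow's final result.
        if self.downstream:
            self.downstream.take_event(record, results)
        else:
            results.append(record)

class FilterElement(FlowElement):
    # Implements a WHERE-style predicate.
    def __init__(self, predicate, downstream=None):
        super().__init__(downstream)
        self.predicate = predicate

    def take_event(self, event, results):
        if self.predicate(event):
            self.emit(event, results)

class ProjectElement(FlowElement):
    # Implements a SELECT-column projection.
    def __init__(self, columns, downstream=None):
        super().__init__(downstream)
        self.columns = columns

    def take_event(self, event, results):
        self.emit({c: event[c] for c in self.columns}, results)

# Roughly: SELECT host, level FROM ... WHERE level > 2
flow = FilterElement(lambda e: e["level"] > 2,
                     ProjectElement(["host", "level"]))
out = []
flow.take_event({"host": "a", "level": 3, "msg": "x"}, out)
flow.take_event({"host": "b", "level": 1, "msg": "y"}, out)
# out == [{"host": "a", "level": 3}]
```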
5. The aggregation flow element
• Operates on ‘window’
– Defined by a relative range of time
– Further divided into smaller time slots (customizable slot width)
• Aggregation is first done per slot, then summarized over all slots when the window finishes
– An event falls into a particular slot according to its timestamp
• The timestamp is either a specified column in the record or the local sampling time
• Two threads:
– the main thread drives in-window events
– an eviction thread watches for when to close a window
• Outputs one record containing the results of all aggregation functions once a window is closed
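The slot mechanism above can be sketched as follows; the slot width and the particular aggregates (COUNT/SUM/AVG) are illustrative choices, not FlumeBase defaults:

```python
# Sketch of per-slot windowed aggregation: events are bucketed into
# slots by timestamp, partial aggregates are kept per slot, and one
# summary record is produced when the window closes.

from collections import defaultdict

SLOT_WIDTH = 10  # seconds per slot (customizable)

slots = defaultdict(lambda: {"count": 0, "sum": 0})

def feed(timestamp, value):
    slot = timestamp // SLOT_WIDTH      # which slot the event falls into
    slots[slot]["count"] += 1
    slots[slot]["sum"] += value

def close_window():
    # Summarize over all slots once the window is finished.
    total = sum(s["count"] for s in slots.values())
    sum_v = sum(s["sum"] for s in slots.values())
    record = {"count": total, "sum": sum_v,
              "avg": sum_v / total if total else None}
    slots.clear()
    return record

feed(3, 10); feed(7, 20)   # both land in slot 0
feed(14, 30)               # slot 1
result = close_window()
# result == {"count": 3, "sum": 60, "avg": 20.0}
```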
6. Features
• Compared with ordinary sql
– No primary index
• Does not detect whether a record is a duplicate
– The window concept
• Compared with ordinary flume node
– FlumeBase logical nodes come with particular sources & sinks – rtsqlxxx
– FlumeBase logical nodes cannot be initiated by the Flume master – only by the FB shell
7. Features
• SQL
– CREATE STREAM stream_name (col_name data_type [, ...])
FROM [LOCAL] {FILE | NODE | SOURCE} input_spec
[EVENT FORMAT format_spec
[PROPERTIES (key = val, …)]]
– SELECT select_expr, select_expr ... FROM stream_reference
[ JOIN stream_reference ON join_expr OVER range_expr, JOIN ... ]
[ WHERE where_condition ]
[ GROUP BY column_list ] [ OVER range_expr ] [ HAVING
having_condition ]
[ WINDOW window_name AS ( range_expr ), WINDOW ... ]
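As a concrete illustration, the grammar above could be instantiated as below; the stream name, columns, source spec, property names, and the range_expr syntax are hypothetical guesses from the grammar, not taken from FlumeBase’s documentation:

```sql
-- Hypothetical instantiation of the CREATE STREAM grammar above.
CREATE STREAM logs (host STRING, level INT, msg STRING)
FROM NODE 'webapp-logs'
EVENT FORMAT 'delimited' PROPERTIES ('delimiter' = ',');

-- Hypothetical SELECT producing a flow: count events per host
-- over a 30-second window.
SELECT host, COUNT(1) AS hits
FROM logs
WHERE level > 2
GROUP BY host OVER RANGE INTERVAL 30 SECONDS PRECEDING;
```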
8. Possible issues
• Aggregation
– Currently the FB window is not timeline-aligned
• may need to be aligned to second, minute, or hour boundaries
– FB does not support DISTINCT
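The alignment fix suggested above can be sketched in a few lines: snap each window’s start to a whole multiple of the window width instead of letting it float relative to event arrival.

```python
# Sketch of timeline-aligned window boundaries: the start of the window
# containing `timestamp` is snapped to a multiple of `width` seconds
# since the epoch (e.g. whole seconds, minutes, or hours).

def aligned_window_start(timestamp, width):
    return timestamp - (timestamp % width)

MINUTE = 60
HOUR = 3600
assert aligned_window_start(125, MINUTE) == 120    # 02:05 -> 02:00
assert aligned_window_start(3735, HOUR) == 3600    # snapped to the hour
```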
• Deployment
– Current usage: deploy Flume → start the FB shell → create streams/flows → manually re-route the FB output logical node
• manually change the sink for rtsqlsource
– It would be better if FB streams/flows were created automatically from Flume configuration – tighter integration with Flume
• Code maturity is in doubt
– Seems to be based on flume-0.9.3
– Does not work directly on CDH3 u1 & u2
– According to GitHub: little activity
• No updates within about half a year
• Very few issues and discussions; open issues unresolved
• One contributor – the author