This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, data sources/sinks, and its work unit processing.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
Apache Gobblin
1. What Is Apache Gobblin?
● A big data integration framework
● To simplify integration issues like
– Data ingestion
– Replication
– Organization
– Lifecycle management
● For streaming and batch
● An Apache incubator project
2. Gobblin Execution Modes
● Gobblin has a number of execution modes
● Standalone
– Run on a single box / JVM / embedded mode
● Map Reduce
– Run as a map reduce application
● YARN / Mesos (proposed?)
– Run on a cluster via a scheduler, supports HA
● Cloud
– Run on AWS / Azure, supports HA
6. Gobblin Architecture
● A Gobblin job is built on a set of pluggable constructs
● Which are extensible
● A job is a set of tasks created from work units
● Each work unit serves as a container at runtime
● Tasks are executed by the Gobblin runtime
– On the chosen deployment, e.g. MapReduce
● The runtime handles scheduling, error handling, etc.
● Utilities handle metadata, state, metrics, etc.
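The source → work unit → task flow above can be sketched in plain Java. This is a hypothetical illustration only; the names (MiniSource, MiniWorkUnit, MiniTask, MiniRuntime) are invented for this sketch and are not the real Gobblin classes.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the flow described above: a source partitions work
// into work units, and the runtime turns each work unit into a task and
// executes it. All names here are illustrative, not Gobblin's actual API.
public class MiniRuntime {

    // A work unit is a runtime container describing one slice of work.
    record MiniWorkUnit(String partition) {}

    // A source partitions the dataset into work units.
    interface MiniSource {
        List<MiniWorkUnit> getWorkUnits();
    }

    // A task is created from a work unit and executed by the runtime.
    record MiniTask(MiniWorkUnit unit) {
        String run() { return "processed " + unit.partition(); }
    }

    public static void main(String[] args) {
        MiniSource source = () -> List.of(
            new MiniWorkUnit("partition-0"),
            new MiniWorkUnit("partition-1"));

        List<String> results = new ArrayList<>();
        // The runtime creates one task per work unit and runs it; the real
        // framework also handles scheduling, retries, and error handling.
        for (MiniWorkUnit wu : source.getWorkUnits()) {
            results.add(new MiniTask(wu).run());
        }
        System.out.println(results);
    }
}
```

In the real framework the deployment mode (standalone, MapReduce, etc.) decides where these tasks physically run; the construct boundaries stay the same.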
8. Gobblin Job
● Optionally acquire a job lock (to stop the next job instance)
● Create source instance
● From source work units create tasks
● Launch and run tasks
● Publish data if OK to do so
● Persist the job/task states into the state store
● Clean up temporary work data
● Release the job lock (optional)
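The lifecycle above can be sketched with a plain ReentrantLock standing in for Gobblin's optional job lock. This is an assumption-laden sketch; the class and method names are invented for illustration, not the real Gobblin API.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// A hypothetical sketch of the job lifecycle listed above. A ReentrantLock
// plays the role of the optional job lock that stops a second instance of
// the same job from starting while one is still running.
public class JobLifecycle {
    private static final ReentrantLock JOB_LOCK = new ReentrantLock();

    static List<String> runJob(List<String> workUnits) {
        if (!JOB_LOCK.tryLock()) {              // optional: block a second instance
            throw new IllegalStateException("previous job instance still running");
        }
        try {
            // create tasks from the source's work units and run them
            List<String> taskResults = workUnits.stream()
                .map(wu -> "done:" + wu)
                .toList();
            // ...publish data if OK, persist job/task state, clean up temp data
            return taskResults;
        } finally {
            JOB_LOCK.unlock();                  // release the job lock
        }
    }

    public static void main(String[] args) {
        System.out.println(runJob(List.of("wu-1", "wu-2")));
    }
}
```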
10. Gobblin Constructs
● Source partitions data into work units
● Source creates work unit data extractors
● Converter converts schema and data records
● Quality checker checks row and task level data
● Fork operator allows the flow to branch into multiple streams
● Writer sends data records to the sink
● Publisher publishes the job's output data
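The converter → quality checker → writer chain above can be illustrated as a simple record pipeline. These are simplified stand-ins written for this sketch, not Gobblin's actual converter or policy classes.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// An illustrative pipeline mirroring the constructs above: a converter
// transforms each record, a row-level quality check filters records, and
// the surviving records are collected for the writer/sink.
public class MiniPipeline {
    static List<String> run(List<String> records,
                            Function<String, String> converter,
                            Predicate<String> rowQualityCheck) {
        return records.stream()
            .map(converter)              // Converter: schema/record conversion
            .filter(rowQualityCheck)     // Quality checker: row-level policy
            .toList();                   // Writer: records headed for the sink
    }

    public static void main(String[] args) {
        List<String> out = run(List.of("a", "", "b"),
            String::toUpperCase,         // convert each record to upper case
            r -> !r.isEmpty());          // drop empty rows
        System.out.println(out);         // [A, B]
    }
}
```

A fork operator would sit between the converter and the writers, branching one input stream into several output streams, each with its own writer.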
11. Gobblin Job Configuration
● Gobblin jobs are configured via configuration files
● May be named .pull or .job, plus a .properties file
● Source properties file defines
– Connection / converter / quality / publisher
● Job file defines
– Name / group / description / schedule
– Extraction properties
– Source properties
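A job file along the lines described above might look like the fragment below. The property keys are drawn from Gobblin's getting-started examples; exact keys and class names vary by version, so treat this as a sketch rather than a definitive configuration.

```properties
# Job identity and schedule
job.name=PullFromExampleSource
job.group=Examples
job.description=A sketch of a Gobblin job configuration file

# Source: partitions data and creates extractors
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource

# Converter: schema and data record conversion
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

# Extraction properties
extract.namespace=org.apache.gobblin.example.wikipedia

# Writer and publisher: where records go and how they are published
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```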
13. Available Books
● See “Big Data Made Easy”
– Apress, Jan 2015
● See “Mastering Apache Spark”
– Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress, Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
14. Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration