Gobblin @ NerdWallet
By Akshay Nanavati and Eric Ogren
akshay@nerdwallet.com eric@nerdwallet.com
Agenda
● Introduction to NerdWallet
● Gobblin @ NerdWallet Today
● Initial Pain Points & Learnings
● Contributions (Present and Future)
● Future Use Cases & Requests
2
What Is NerdWallet?
● Started in 2009. 275+ employees
● Highly profitable. Series A funding Feb 2015.
● We want to bring clarity to life’s financial decisions.
3
NerdWallet Tech Stack
● Front-End
● Services Tier
● Data Analytics
● Data Systems & Platforms
4
Data Types @ NerdWallet
● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
○ Synced to Redshift periodically
● Consumer Identity Data (Postgres: medium reads, medium writes)
● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes) ?
● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
5
Gobblin @ NW Today
● Running in standalone mode
● Ingests user tracking and operational log data
● Tracking Data:
○ ~10 Kafka topics - 1 per event & schema type
○ Hourly Gobblin jobs pull from Kafka and write to a date-partitioned directory in S3
○ Events are already serialized as protobuf in each Kafka topic
○ Around 100 events/second
● Log Ingestion (Operational Data):
○ Extracts data from AWS logs sitting in S3
○ Parses log lines and serializes them to protobuf
○ Writes the serialized protobuf files back to S3 and eventually into Redshift
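To make the "date-partitioned directory in S3" idea concrete, here is a minimal sketch of how an hourly run could derive its output prefix. The bucket name, the topic name, the `partition_prefix` helper, and the `topic/YYYY/MM/DD/HH` layout are all assumptions for illustration; the slides do not say which layout NerdWallet actually uses.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, topic: str, ts: datetime) -> str:
    """Build a date-partitioned key prefix for one hourly run.

    The layout (topic/YYYY/MM/DD/HH) is a hypothetical example,
    not the layout described in the talk.
    """
    ts = ts.astimezone(timezone.utc)  # partition on UTC so hours never overlap
    return f"{base.rstrip('/')}/{topic}/{ts:%Y/%m/%d/%H}/"


# Hypothetical bucket and topic names.
run_time = datetime(2015, 11, 5, 14, 30, tzinfo=timezone.utc)
print(partition_prefix("s3://example-bucket/tracking", "page_view", run_time))
# prints: s3://example-bucket/tracking/page_view/2015/11/05/14/
```

Partitioning on UTC keeps each hourly job writing to exactly one directory, which makes reruns and downstream loads easier to reason about.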
6
Tracking Pipeline
7
Learnings: Deploying Gobblin w/Internal Code
● Have a repo of internal Gobblin modules (this is where we compile everything)
● Modified the build script to link the Gobblin project to our gobblin-modules project
● Use Jenkins to compile Gobblin on the remote machine
● Maintain a separate repository of .pull files that we can sync with our stage and production environments
8
Current Contributions
● Simple Data Writer
○ class gobblin.writer.SimpleDataWriter
○ Writes each binary record as raw bytes, with no regard to encoding
○ Optionally prepends each record with its size, or appends a char delimiter to each record (e.g. \n for string data)
● Kafka Simple Extractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
○ Extracts binary data from Kafka as an array of bytes without any serde
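The two framing options mentioned for the Simple Data Writer (size prefix vs. trailing delimiter) can be sketched as follows. This is an illustration of the general technique only, not Gobblin's implementation: `write_records` and `read_length_prefixed` are hypothetical helpers, and the 4-byte big-endian length header is an assumption.

```python
import io
import struct


def write_records(out, records, prepend_size=True, delimiter=None):
    """Write raw byte records, optionally size-prefixed and/or delimited."""
    for rec in records:
        if prepend_size:
            # Assumption: 4-byte big-endian unsigned length header.
            out.write(struct.pack(">I", len(rec)))
        out.write(rec)
        if delimiter is not None:
            out.write(delimiter)


def read_length_prefixed(inp):
    """Yield records from a stream written with prepend_size=True."""
    while True:
        header = inp.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length header")
        (size,) = struct.unpack(">I", header)
        body = inp.read(size)
        if len(body) < size:
            raise ValueError("truncated record body")
        yield body


# Binary payloads may contain any byte, including b"\n".
records = [b"\x08\x96\x01", b"hello\nworld"]
buf = io.BytesIO()
write_records(buf, records)
buf.seek(0)
print(list(read_length_prefixed(buf)) == records)
```

A size prefix is the safer choice for serialized protobuf, because the payload can contain the delimiter byte; a trailing `\n` is only unambiguous for data, such as single-line strings, that never contains it.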
9
Future Contributions
● Gobblin Dashboards
● S3 Source & Extractor
○ Given an S3 bucket, extract all files matching a regex
■ Leverages FileBasedExtractor
■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
● S3 Publisher
○ Publishes files to S3
○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone mode, this is not an issue for us
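The "extract all files matching a regex" behaviour planned for the S3 source comes down to filtering a bucket listing. A minimal sketch, assuming the object keys have already been listed by some S3 client; `matching_keys`, the keys, and the pattern are hypothetical and not part of Gobblin.

```python
import re


def matching_keys(keys, pattern):
    """Return the keys whose full name matches the regular expression."""
    regex = re.compile(pattern)
    # fullmatch avoids accidental hits on a substring of a longer key.
    return [key for key in keys if regex.fullmatch(key)]


# Hypothetical bucket listing.
keys = [
    "logs/elb/2015/11/05/part-0001.log",
    "logs/elb/2015/11/05/_SUCCESS",
    "logs/cloudfront/2015/11/05/part-0001.log",
]
print(matching_keys(keys, r"logs/elb/\d{4}/\d{2}/\d{2}/.*\.log"))
# prints: ['logs/elb/2015/11/05/part-0001.log']
```

Matching the whole key rather than searching within it is a deliberate choice here: marker files and look-alike prefixes are excluded unless the pattern explicitly allows them.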
10
Future: Dashboards
11
Gobblin @ NW tomorrow
● More data types
○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
○ Offer data from our site: MySQL => S3 (batch and incremental)
○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
○ Salesforce Data
● Integration with Airflow DAGs
● Integration with data cleansing & entity matching frameworks
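The difference between the "batch" and "incremental" pulls listed above is usually a high watermark: each run extracts only rows changed since the previous run and records the newest change it saw. A minimal sketch of that idea over in-memory rows; the `updated_at` column and the `incremental_batch` helper are assumptions, and a real job would push the filter down into the MySQL or Postgres query.

```python
def incremental_batch(rows, last_watermark):
    """Return rows newer than the watermark, plus the advanced watermark."""
    new_rows = [row for row in rows if row["updated_at"] > last_watermark]
    # If nothing changed, keep the previous watermark.
    new_watermark = max(
        (row["updated_at"] for row in new_rows), default=last_watermark
    )
    return new_rows, new_watermark


# Hypothetical table contents; updated_at is an integer timestamp here.
table = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 250},
    {"id": 3, "updated_at": 300},
]
batch, watermark = incremental_batch(table, last_watermark=200)
print([row["id"] for row in batch], watermark)
# prints: [2, 3] 300
```

A batch pull is the same call with a watermark lower than every row, so both modes can share one code path.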
12
Early Adoption Pain Points & Solutions
● Best practices around ingestion w/ transformation steps
● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
● Best practices around scheduler integration - Quartz (built-in) vs ETL schedulers
● Backwards-incompatible changes forced us to run migrations when upgrading versions
● No changelogs & tagged releases
13
Things we would like to see/add in future
● Abstract out Avro specific code
● Best practices for scheduler integration (can contribute for Airflow)
● Clustering without requiring Hadoop & YARN
● Metadata support (job X produced files Y,Z)
● Release notes & tags :)
● The build & unit test process is very bloated
○ Hard to differentiate warnings/stack traces vs legitimate build issues
○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines (port collisions)
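The metadata request in the list above ("job X produced files Y,Z") could be met by a small per-run manifest. A minimal sketch of what such a record might hold; the `JobManifest` class, its field names, and the example values are hypothetical and not an existing Gobblin feature.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class JobManifest:
    """Hypothetical record of which files one job run produced."""

    job_name: str
    run_id: str
    output_files: list = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys keeps the output stable, which helps when diffing runs.
        return json.dumps(asdict(self), sort_keys=True)


manifest = JobManifest(
    job_name="tracking-hourly",  # hypothetical job name
    run_id="2015-11-05T14",
    output_files=["page_view/part-0001", "page_view/part-0002"],
)
print(manifest.to_json())
```

Writing one such record next to the published files would let a downstream loader or scheduler discover exactly what a run emitted without listing the bucket.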
14
Thanks!
Questions??
15

%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2
 

Gobblin @ NerdWallet (Nov 2015)

  • 1. Gobblin @ NerdWallet By Akshay Nanavati and Eric Ogren akshay@nerdwallet.com eric@nerdwallet.com
  • 2. Agenda
    ● Introduction to NerdWallet
    ● Gobblin @ NerdWallet Today
    ● Initial Pain Points & Learnings
    ● Contributions (Present and Future)
    ● Future Use Cases & Requests
  • 3. What Is NerdWallet?
    ● Started in 2009. 275+ employees
    ● Highly profitable. Series A funding Feb 2015.
    ● We want to bring clarity to life’s financial decisions.
  • 4. NerdWallet Tech Stack (diagram; layers shown: Front-End, Services Tier, Data Analytics, Data Systems & Platforms)
  • 5. Data Types @ NerdWallet
    ● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
      ○ Synced to Redshift periodically
    ● Consumer Identity Data (Postgres: medium reads, medium writes)
    ● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
    ● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes) ?
    ● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
    ● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
  • 6. Gobblin @ NW Today
    ● Running in standalone mode
    ● Ingests user tracking and operational log data
    ● Tracking Data:
      ○ ~10 Kafka topics - 1 per event & schema type
      ○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
      ○ Events are already serialized as protobuf in each Kafka topic
      ○ Around 100 events/second
    ● Log Ingestion (Operational Data):
      ○ Extracts data from AWS logs sitting in S3
      ○ Parses log lines and serializes them to protobuf
      ○ Writes the serialized protobuf files back to S3 and eventually into Redshift
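The date-partitioned layout mentioned for the tracking data can be illustrated with a small Java sketch. This is not NerdWallet's or Gobblin's code: the class name `PartitionPath`, the bucket prefix, the topic name, and the `yyyy/MM/dd/HH` layout (hour granularity, UTC) are all assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps a topic and a pull time to a partition directory.
final class PartitionPath {
    // Assumed layout: <prefix>/<topic>/yyyy/MM/dd/HH, always computed in UTC.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    static String partitionFor(String prefix, String topic, Instant time) {
        return prefix + "/" + topic + "/" + HOURLY.format(time);
    }

    public static void main(String[] args) {
        // Bucket and topic names are placeholders.
        String path = partitionFor("s3://example-bucket/tracking", "page_view",
                Instant.parse("2015-11-05T13:45:00Z"));
        System.out.println(path); // s3://example-bucket/tracking/page_view/2015/11/05/13
    }
}
```

Fixing the time zone in the formatter keeps the directory for a given hour the same regardless of where the job runs.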
  • 8. Learnings: Deploying Gobblin w/Internal Code
    ● Have a repo of internal Gobblin modules (this is where we compile everything)
    ● Modified the build script to link the gobblin project to our gobblin-modules project
    ● Use Jenkins to compile Gobblin on the remote machine
    ● Maintain a separate repository with .pull files that we can sync with our stage and production environments
  • 9. Current Contributions
    ● Simple Data Writer
      ○ class gobblin.writer.SimpleDataWriter
      ○ Writes each binary record as raw bytes, with no regard to encoding
      ○ Optionally prepends each record with its size, or appends a char delimiter to each record (e.g. \n for string data)
    ● Kafka Simple Extractor
      ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
      ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
      ○ Extracts binary data from Kafka as an array of bytes without any serde
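The two record-framing options described for the Simple Data Writer can be sketched as follows. This is an illustration of the idea only, not the `gobblin.writer.SimpleDataWriter` implementation: the class `RecordFraming`, its method signature, and the 4-byte big-endian length prefix are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of size-prefixed and delimiter-terminated record framing.
final class RecordFraming {
    // prependSize: write the record length before each record (assumed 4-byte big-endian int).
    // delimiter: if non-null, write this byte after each record.
    static byte[] frame(byte[][] records, boolean prependSize, Byte delimiter) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] record : records) {
            if (prependSize) {
                byte[] size = ByteBuffer.allocate(4).putInt(record.length).array();
                out.write(size, 0, size.length);
            }
            out.write(record, 0, record.length); // bytes are copied untouched, no encoding applied
            if (delimiter != null) {
                out.write(delimiter.byteValue());
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[][] records = {
            "ab".getBytes(StandardCharsets.UTF_8),
            "c".getBytes(StandardCharsets.UTF_8)
        };
        // Newline-delimited: suits string data.
        System.out.println(frame(records, false, (byte) '\n').length); // 5
        // Size-prefixed: suits binary payloads such as protobuf, which may contain any byte.
        System.out.println(frame(records, true, null).length); // 11
    }
}
```

A size prefix is the safer choice for serialized protobuf, since a delimiter byte can legitimately occur inside a binary record.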
  • 10. Future Contributions
    ● Gobblin Dashboards
    ● S3 Source & Extractor
      ○ Given an S3 bucket, extract all files matching a regex
        ■ Leverages FileBasedExtractor
        ■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
    ● S3 Publisher
      ○ Publishes files to S3
      ○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone mode this is not an issue for us
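The "extract all files matching a regex" step of the planned S3 source can be sketched in plain Java. The sketch deliberately leaves out the S3 listing call and the FileBasedExtractor integration; the class `KeyFilter`, the key names, and the pattern are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: select object keys from a bucket listing by regex.
final class KeyFilter {
    // Returns the keys whose entire name matches the regex, in listing order.
    static List<String> matching(List<String> keys, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return keys.stream()
                .filter(key -> pattern.matcher(key).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder keys standing in for the result of listing a bucket.
        List<String> keys = Arrays.asList(
                "logs/2015-11-05/a.gz",
                "logs/2015-11-05/b.gz",
                "logs/2015-11-05/README",
                "logs/2015-11-06/c.gz");
        System.out.println(matching(keys, "logs/2015-11-05/.*\\.gz"));
    }
}
```

Matching against the whole key, rather than searching for a substring, avoids picking up files whose names merely contain the pattern.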
  • 12. Gobblin @ NW tomorrow
    ● More data types
      ○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
      ○ Offer data from our site: MySQL => S3 (batch and incremental)
      ○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
      ○ Salesforce Data
    ● Integration with Airflow DAGs
    ● Integration with data cleansing & entity matching frameworks
  • 13. Early Adoption Pain Points & Solutions
    ● Best practices for ingestion w/ transformation steps
    ● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
    ● Best practices around scheduler integration - Quartz (built-in) vs ETL schedulers
    ● Backwards-incompatible changes forced us to do migrations when upgrading versions
    ● No changelogs & tagged releases
  • 14. Things we would like to see/add in future
    ● Abstract out Avro-specific code
    ● Best practices for scheduler integration (can contribute for Airflow)
    ● Clustering without requiring Hadoop & YARN
    ● Metadata support (job X produced files Y,Z)
    ● Release notes & tags :)
    ● The build & unit test process is very bloated
      ○ Hard to differentiate warnings/stack traces vs legitimate build issues
      ○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines (port collisions)