THE EVOLUTION
OF HADOOP AT
STRIPE
colin marc @colinmarc
ABOUT STRIPE
• payments for the web
• based in SF
• last time I checked, ~75 people
(stripe.com/about)
• main product is an API
WITH US,
DATA WAS AN
AFTERTHOUGHT
A LOT OF OUR
DATA IS IN MONGO
• MongoDB is a fantastic application
database
• uses BSON - like JSON, but has a
binary representation
• MongoDB is schemaless, but has
indexed queries and other
features that are nice for
applications
APPLICATION DBS
SUCK FOR ANALYSIS
• well, sometimes. relational
databases are OK
• MongoDB is awful (for this)
• no joins
• scans are painful
• no declarative query language
SOLUTION:
PUT THE DATA
SOMEWHERE
ELSE
V1:
TSV + IMPALA
• threw together a Hadoop cluster
on the developer boxes
• “nightly” script dumped models to
TSV files in HDFS
• janky script output the schema
from our models
• query from Impala
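
A minimal sketch of what a V1-style dump could look like, assuming models are available as plain dictionaries. The records, `infer_schema`, and `dump_tsv` are hypothetical names for illustration, not the script from the talk.

```python
import csv
import io

# Hypothetical model records standing in for application documents.
records = [
    {"_id": "ch_1", "amount": 1000, "currency": "usd", "paid": True},
    {"_id": "ch_2", "amount": 250, "currency": "eur"},
]


def infer_schema(rows):
    """Collect column names in first-seen order, with the type name of the first value seen."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


def dump_tsv(rows, schema):
    """Render rows as TSV text; fields missing from a row become empty strings."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(schema.keys())
    for row in rows:
        writer.writerow([row.get(col, "") for col in schema])
    return out.getvalue()


schema = infer_schema(records)
tsv = dump_tsv(records, schema)
print(schema)
print(tsv)
```

Even in this toy version the schema is guessed from whatever values happen to appear first, and every value ends up as text in the TSV.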
ASIDE: IMPALA IS
PRETTY COOL
• developed by Cloudera
• absurdly fast queries over HDFS
• SQL is great
• most of our questions are ad-hoc
A NICE
EXPERIMENT, BUT...
• schema translation is hard
• SLOW SLOW SLOW
• TSV is not a great format
• script never runs
• not production data
V2:
MONGO -> HBASE
• Impala can query HBase, I think?
• @nelhage wrote MoSQL - let’s do
the same thing, but put the data in
HBase!
• translating from one k/v store to
another is easier
ZEROWING
http://github.com/stripe/zerowing
FIRST, SNAPSHOT
• using Mongo-Hadoop, map over
your MongoDB database
• HFileOutputFormat,
completeBulkLoad
THEN, STREAM
• tail the MongoDB oplog, like a
replica set member
• replicate inserts/updates/deletes
by _id
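
The streaming step can be sketched roughly as below. This is not ZeroWing's code: the entries are a simplified, hand-written imitation of oplog records (real ones carry timestamps, namespaces, and more update operators), and a plain dictionary stands in for the HBase table.

```python
# Simplified oplog-style entries (hypothetical data).
# "i" = insert, "u" = update, "d" = delete.
oplog = [
    {"op": "i", "o": {"_id": "ch_1", "amount": 1000}},
    {"op": "i", "o": {"_id": "ch_2", "amount": 250}},
    {"op": "u", "o2": {"_id": "ch_1"}, "o": {"$set": {"paid": True}}},
    {"op": "d", "o": {"_id": "ch_2"}},
]


def apply_entry(store, entry):
    """Apply one entry to a dict keyed by _id (a stand-in for the replica table)."""
    op = entry["op"]
    if op == "i":
        doc = entry["o"]
        store[doc["_id"]] = dict(doc)
    elif op == "u":
        _id = entry["o2"]["_id"]
        update = entry["o"]
        if "$set" in update:
            # Partial update: merge the changed fields into the stored row.
            store.setdefault(_id, {"_id": _id}).update(update["$set"])
        else:
            # Treat anything else as a full-document replacement.
            store[_id] = {"_id": _id, **update}
    elif op == "d":
        store.pop(entry["o"]["_id"], None)
    # Other entry types (no-ops, commands) are ignored in this sketch.


store = {}
for entry in oplog:
    apply_entry(store, entry)
print(store)
```

Keying every operation by `_id` is what makes replay straightforward: each entry touches exactly one row in the target store.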
HAVING DATA
IN HDFS IS A
GREAT
THEN, QUERY IT
WITH IMPALA...UM
• wait, impala can’t actually query
HBase effectively
• 30-40x slower over the same
data
• limiting factor is HBase scan
speed, I think
LOST IN
TRANSLATION
• our schema problem is still there!
• BSON is typed, but HBase is just
strings
• nested hashes still don’t work
• lists???
• what is the canonical schema?
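
To make the translation problem concrete, here is a small sketch of flattening a document into string-valued columns. The document and the dotted column naming are assumptions for illustration, not the layout used in the talk.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted column names; every value becomes a string."""
    columns = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            columns.update(flatten(value, prefix=name + "."))
        else:
            columns[name] = str(value)
    return columns


# Hypothetical document with a nested hash and a list.
doc = {
    "_id": "ch_1",
    "amount": 1000,
    "card": {"last4": "4242", "exp_year": 2015},
    "refunds": [100, 50],
}
columns = flatten(doc)
print(columns)
```

After flattening, the integer `amount` and the string `card.last4` are indistinguishable by type, and the list has no natural column representation at all, only its printed form.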
V3:
PARQUET + THRIFT
• instead of storing k/v pairs, just
store the raw BSON blobs
• write your MR jobs against HBase
if you want up-to-date data
• also periodically dump out
Parquet files
• use thrift definitions to manage
schema
USING THRIFT AS
SCHEMA
• thrift is a nice way to define what
fields we expect to be in the
BSON

• in most cases, we can do the
translation automatically
• decode on the backend, instead of
during replication
• no information loss
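
A rough sketch of decoding on the backend against a declared schema. To stay within the standard library, JSON stands in for BSON and a dictionary of expected types stands in for a thrift struct; `CHARGE_SCHEMA` and `decode` are hypothetical names.

```python
import json

# Hypothetical schema: field name -> expected Python type.
# In the talk this role is played by a thrift definition.
CHARGE_SCHEMA = {"_id": str, "amount": int, "currency": str, "paid": bool}


def decode(blob, schema):
    """Decode a stored blob, keeping only the fields the schema expects."""
    raw = json.loads(blob)
    record = {}
    for field, expected in schema.items():
        value = raw.get(field)
        if value is not None and not isinstance(value, expected):
            raise TypeError(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
        record[field] = value  # fields absent from the blob stay None
    return record


# The stored blob keeps everything, including fields the schema does not know about.
blob = json.dumps({"_id": "ch_1", "amount": 1000, "currency": "usd", "legacy_flag": 1})
record = decode(blob, CHARGE_SCHEMA)
print(record)
```

Because the blob is stored untouched, a field missing from the schema today can still be recovered later by updating the schema and decoding again.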
GENERATE THRIFT
DEFINITIONS?
• thrift still isn’t the canonical
schema for our application - that
exists in our ODM

• wrote a quick ruby script to
generate thrift definitions from
our application models
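
The generator in the talk was a Ruby script; the sketch below shows the same idea in Python under assumed inputs. The model description, the type mapping, and `to_thrift_struct` are all hypothetical.

```python
# Hypothetical model description: field name -> type name as an ODM might report it.
CHARGE_MODEL = {
    "id": "String",
    "amount": "Integer",
    "currency": "String",
    "paid": "Boolean",
}

# Assumed mapping from model types to Thrift base types.
THRIFT_TYPES = {
    "String": "string",
    "Integer": "i64",
    "Boolean": "bool",
    "Float": "double",
}


def to_thrift_struct(name, fields):
    """Render a Thrift struct definition with one optional field per model field."""
    lines = [f"struct {name} {{"]
    for index, (field, model_type) in enumerate(fields.items(), start=1):
        lines.append(f"  {index}: optional {THRIFT_TYPES[model_type]} {field};")
    lines.append("}")
    return "\n".join(lines)


idl = to_thrift_struct("Charge", CHARGE_MODEL)
print(idl)
```

One caveat with this approach: field ids here come from field order, so reordering or removing a model field would renumber the struct. A real generator would likely need to pin ids somewhere.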
PARQUET <3
THRIFT
• columnar, read-optimized
• with a little bit of glue, serialize
any basic thrift struct easily
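
As a toy illustration of why a columnar layout is read-optimized, the sketch below pivots row records into per-field lists. It is not Parquet's actual on-disk format (which adds encoding, compression, and support for optional and nested fields), and it assumes every row has every field.

```python
# Hypothetical row-oriented records.
rows = [
    {"_id": "ch_1", "amount": 1000, "currency": "usd"},
    {"_id": "ch_2", "amount": 250, "currency": "eur"},
    {"_id": "ch_3", "amount": 400, "currency": "usd"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only has to read that field's column.
total = sum(columns["amount"])
print(total)
```

A query that sums `amount` never touches `_id` or `currency`, which is the property that makes scans over wide records cheap.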
IMPALA <3
PARQUET
• more glue can automatically
import parquet files into Impala
• Impala and parquet are designed
to work well with each other
• nested structs don’t work yet =(
SCALDING <3
PARQUET
• we use scalding for a lot of
MapReduce stuff
• added ParquetSource to scalding
to make this easy (source and
sink)
THIS WORKS FOR
ANY DATA
• use thrift to define an intermediate
or derived data type, and you get,
for free:

• serialization using parquet
• easy MR jobs with scalding
• ad-hoc querying with Impala
OVERVIEW
[diagram: Application Land (MongoDB) -> ZeroWing -> Hadoop Land (HBase, Hadoop MR, Parquet snapshots, Impala)]
QUESTIONS?
• meeeee: @colinmarc
• Stripe: stripe.com
• we’re hiring! stripe.com/jobs
• ZeroWing: github.com/stripe/zerowing
• Impala: github.com/cloudera/impala
• Parquet: parquet.github.com
