Our Hadoop Journey
Chris Curtin
Head of Technical Research
Atlanta Hadoop Users Group July 2013
About Me
• 20+ years in technology
• Head of Technical Research at Silverpop (12+ years at Silverpop)
• Built a SaaS platform before the term ‘SaaS’ was being used
• Prior to Silverpop: real-time control systems, factory automation
and warehouse management
• Always looking for technologies and algorithms to help with our
challenges
• Car nut
2
Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com – Go to Careers under About
3
About Silverpop
• Founded in late 1999, Atlanta-based, with offices in London, Germany,
and Irvine, California
• Digital Marketing Technology provider, unifying marketing
automation, email, mobile and social
• Track billions of contact events, execute on those events, send
billions of emails
• Clients are in marketing departments
4
Challenge from the business
• Engage allows clients to define their own database schema for
contact records
• No two clients’ schemas are the same
• Schemas often change weekly/monthly
• Contact records are ‘point in time’
• Users want to report on the value of a contact record at the time the
activity occurred
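The ‘point in time’ requirement can be pictured with a small sketch. This is not Silverpop’s implementation; the class `AttributeHistory` and its methods are hypothetical names, and it assumes each attribute change is stored with the time it took effect. A sorted map then answers "what was the value when the event happened?".

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical history of one attribute of one contact (illustration only). */
class AttributeHistory {
    // Time the change took effect (epoch millis) -> value that became current.
    private final TreeMap<Long, String> changes = new TreeMap<>();

    void record(long effectiveAt, String value) {
        changes.put(effectiveAt, value);
    }

    /** Returns the value in effect at eventTime, or null if none was known yet. */
    String valueAt(long eventTime) {
        // floorEntry finds the latest change at or before the event.
        Map.Entry<Long, String> entry = changes.floorEntry(eventTime);
        return entry == null ? null : entry.getValue();
    }
}
```

A contact recorded as Silver at time 100 and Gold at time 200 would report Silver for an email opened at time 150, which is the behavior the last bullet asks for.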
5
Example
• How well did my marketing campaign to my loyalty clients do last
quarter?
• Easy question, hard answer
– Contact’s ‘level’ changes throughout the year (Silver to Gold)
– Some piece of data wasn’t known at the time of the email send, but is
now
– What do you want to pivot on? Level? Age? Source Code? Time in
database?
6
Technical solutions
• Traditional Data warehouse
• Queries against OLTP or OLAP stores
• Customer-specific databases
7
Hadoop
• Started working on R&D project in 2008
• First raw map/reduce
• Some Pig
• Some Hive/HBase
• (and several start-ups long since dead …)
• Flexible schema caused problems with most of them
8
First ‘real’ application
• Pivot reports against flexible schemas
• Per contact, not aggregate
• Let the user select any communication(s), see what user attributes
are available to use as pivots
• Pivot data is at time of communication, not current values (slow
moving data)
• Could be against a few thousand events, to billions
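As a rough illustration of pivoting against a flexible schema, the sketch below treats each event as a map of attribute name to the value captured at the time of the communication, and counts events by a pivot field chosen at run time. `PivotReport`, `countBy`, and the `(unknown)` bucket are assumptions made for this example, not part of the original application.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: counts events per value of a run-time pivot field. */
class PivotReport {
    /**
     * Each event is a snapshot of the contact's attributes as they were
     * when the communication happened. Schemas may differ per event.
     */
    static Map<String, Integer> countBy(String pivotField,
                                        List<Map<String, String>> events) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> snapshot : events) {
            String value = snapshot.get(pivotField);
            if (value == null) {
                // Field missing from this contact's schema at that time.
                value = "(unknown)";
            }
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the pivot field is just a string, nothing in the code has to change when a client adds or renames columns.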
9
First ‘real’ challenges
• Flexible schema meant HBase, Hive etc. wouldn’t work easily
• Flexible schema meant Pig scripts were difficult to maintain (even
when generated on the fly)
• Need to coordinate multiple steps OUTSIDE of the Hadoop
process
• UI
• Resource Allocation and control
10
Cascading
• Answered a number of problems
• Allowed integration with other platforms, even between M/R jobs
– MySQL to find list of supported columns
– HDFS to find actual files on disk
– JMS for job sourcing/status updates (not implemented)
11
Cascading Dynamic Schema Solution
• Allows the definition of schema at run time
• Allows definition of steps at run time
– One report may have 10 mailings, another 10,000
– 10,000 mailings can’t be run in parallel, so programmatically create
temporary results
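The idea of creating steps programmatically can be sketched without Cascading itself. The plain-Java example below only plans the work: it splits a list of mailing ids into batches, emits one intermediate step per batch, and ends with a merge step. `StepPlanner`, the batch size, and the `tmp/part-N` naming are hypothetical; the real flows were built with Cascading, whose API is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: plans intermediate steps for a large set of mailings. */
class StepPlanner {
    /** Splits items into consecutive batches of at most batchSize. */
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }

    /** One step per batch writing a temporary result, then a final merge. */
    static List<String> plan(List<String> mailingIds, int batchSize) {
        List<String> steps = new ArrayList<>();
        List<List<String>> groups = batches(mailingIds, batchSize);
        for (int i = 0; i < groups.size(); i++) {
            steps.add("batch-" + i + ": process " + groups.get(i).size()
                    + " mailings -> tmp/part-" + i);
        }
        steps.add("merge: combine " + groups.size() + " temporary results");
        return steps;
    }
}
```

A report with 10 mailings and a batch size of 4 yields three batch steps and one merge; a report with 10,000 mailings uses the same code and simply produces more steps.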
12
Sample Cascading Code
13
Client Response
• Either got it immediately or didn’t see the need for something this
flexible
• Found a reason to talk to others in the organization to find other
pivot fields
• Most common use case: behaviors based on Source Code
• Turned out to be a weekly/monthly report not a day-to-day tool
• Some used it for ad hoc analysis, but in order to build a requirement
for their BI teams
14
Profiling Application
• Retention is a big theme in marketing
• Looking at a single mailing, ad buy, etc. shows aggregates for
that slice of time, but those can be misleading:
– Is the 20% who opened that email the same 20% as last week?
– For people in my database for 6 months, how often do they interact
with my marketing?
– What is a typical interaction rate for my database?
– How many times on average does a contact interact with me in a
month? Who is outside of that rate?
• Instead of looking across communications, we now needed to look at
each contact
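The per-contact questions above boil down to grouping events by contact rather than by communication. A minimal sketch, assuming each event carries only a contact id and that "outside of that rate" means differing from the average by more than a chosen tolerance (both assumptions, as are the names `ContactProfile` and `outsideRate`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: per-contact interaction counts and simple outliers. */
class ContactProfile {
    /** Counts interactions per contact from a flat list of event contact ids. */
    static Map<String, Integer> interactionsPerContact(List<String> eventContactIds) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String contactId : eventContactIds) {
            counts.merge(contactId, 1, Integer::sum);
        }
        return counts;
    }

    /** Average interactions per contact; 0.0 when there are no contacts. */
    static double average(Map<String, Integer> counts) {
        if (counts.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return (double) total / counts.size();
    }

    /** Contacts whose count differs from the average by more than tolerance. */
    static List<String> outsideRate(Map<String, Integer> counts, double tolerance) {
        double avg = average(counts);
        List<String> outliers = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (Math.abs(entry.getValue() - avg) > tolerance) {
                outliers.add(entry.getKey());
            }
        }
        return outliers;
    }
}
```

Contacts with no events at all never appear in the event list, so a real report would also need the full contact list to count them as zero.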
15
New technical challenges
• Previous report could be broken into specific steps to reduce
volume of events before ‘heavy’ math was done
• New report needs to look at all events together
• Quickly overwhelmed scheduler
16
Hadoop Challenges
• No schema – external store of mappings
• No appending in HDFS – a daily integration could be 10 million rows
for a communication, or 5
• ‘lots of small files’ – thousands of clients with thousands of
communications means millions of files
• ETL from Oracle meant concatenating files weekly to keep count
down
• Single point of failure (NameNode) took a long time to recover
• Non-batch processes, how to schedule jobs on demand?
• Hadoop Job History – memory vs. concurrent job tradeoffs
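Concatenating many small files into one larger file is easy to sketch when the files are reachable through ordinary file paths, for example in a local staging directory before upload. That is an assumption of this example; writing into plain HDFS would go through the HDFS API instead, and `FileConcatenator` is a hypothetical name.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Illustration only: merges small files to keep the file count down. */
class FileConcatenator {
    /**
     * Appends each input file, in order, to target (created if missing).
     * Returns the number of bytes written.
     */
    static long concatenate(List<Path> inputs, Path target) throws IOException {
        long total = 0;
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path input : inputs) {
                // Files.copy streams the file and reports bytes copied.
                total += Files.copy(input, out);
            }
        }
        return total;
    }
}
```

The sketch does not delete the inputs; a real job would remove or archive them only after the merged file is verified.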
17
MapR
• Eventually settled on MapR M3
– Large number of files was main driver
– NFS mount is nice feature
– Cascading works
• Not without issues
– Found several bugs around Volumes in HDFS and log retention that
we had to work around (later fixed)
– Can’t copy between volumes using HDFS commands
– More complicated for operations to manage (had a CLDB failure that
took a day to recover, mostly us trying to figure out what to do.)
18
Misc. Technical Information
• Fair Scheduler
– Our scheduling logic knows how many queues and controls how many
jobs can be submitted at the same time
• MapR ExpressLane is useful for small jobs
– Our scheduler knows it is a small job so lets MapR take it
• MapR’s NFS mount is great
– Write directly to it from Java apps instead of HDFS API
– Concatenating daily files is a simple Java app now
– (Still don’t append to files in HDFS, but could)
• Nagios for monitoring
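The scheduling logic in the first bullet, which knows the queues and limits how many jobs are submitted at once, might look roughly like the sketch below. It is a guess at the shape of such a throttle, not the actual code: `SubmissionThrottle` and its methods are hypothetical, and it assumes queues are registered once at start-up before any jobs are submitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

/** Illustration only: caps concurrent job submissions per queue. */
class SubmissionThrottle {
    // Queue name -> permits for jobs that may run at the same time.
    private final Map<String, Semaphore> slotsByQueue = new HashMap<>();

    /** Registers a queue; assumed to be called only during start-up. */
    void addQueue(String queue, int maxConcurrentJobs) {
        slotsByQueue.put(queue, new Semaphore(maxConcurrentJobs));
    }

    /** Returns true if a slot was free and is now held by the caller. */
    boolean tryStart(String queue) {
        return slotsFor(queue).tryAcquire();
    }

    /** Releases the slot once the job has completed or failed. */
    void finished(String queue) {
        slotsFor(queue).release();
    }

    private Semaphore slotsFor(String queue) {
        Semaphore slots = slotsByQueue.get(queue);
        if (slots == null) {
            throw new IllegalArgumentException("unknown queue: " + queue);
        }
        return slots;
    }
}
```

When `tryStart` returns false the caller keeps the request in its own backlog and retries later, so the cluster never sees more jobs than the queue is configured for.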
19
Cluster details
• 5 nodes
– 1 admin, 4 workers
– 8 core Xeon 16 GB
– 5TB usable per box assigned to MapR
• Had 9 nodes, reduced to 5
– Cluster was mostly idle due to users’ submission patterns (heavy on
Tuesdays and the 7th day of the month)
– Delay to end users was minimal when we reduced the number of
machines
20
Closing the loop
• Next logical step was for clients to ask to target the contacts
• The volume of data didn’t make that easily possible
• Integrating from Hadoop back to Oracle became an ETL project
– Export from Oracle was single dump, import would be a job per client.
• Automation of reports (and emailing results) was the second most
requested feature
• Lots of support required to know what to do with the results
– No easy ‘go do this when you see this in the reports’
21
Current Status
• Dozens of monthly users
• Some optimizations to toss data early in the import step for clients
not using the tool
• Packaging and pricing is vexing the product marketing team
• Runs lights out unless the ETL process breaks
22
Business Challenges
• Lots of cool ideas we came up with, even implemented a few
• But end users didn’t know what to do with the data
• ‘SaaS-ifying’ is proving difficult
– Multi-tenancy resource management is not available
– How to price? End report may have 20 rows but processed 1BN rows
to get there
• If I hear ‘do you do big data’ one more time …
23
Things we are watching
• Real-time tools on top of Hadoop (Drill, Impala)
• Storm inside of YARN
• Storm in general
• Integration of Kafka, Storm, Drill/Impala, Hadoop & MongoDB
24
Information
• Slides: http://www.slideshare.net/chriscurtin
• Me: ccurtin@silverpop.com @ChrisCurtin on twitter
25
Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com/marketing-company/careers/open-positions.html
26
