Scalding 
YARN Webinar Series 
September 18, 2014 
Page 1 © Hortonworks Inc. 2014 
Ajay Singh, Director - Hortonworks 
Jonathan Coveney, Senior Software Engineer - Twitter
Agenda 
Introduction: Ajay Singh, Hortonworks 
Modern Data Architecture and how Cascading and Scalding fit in 
Scalding: Jonathan Coveney, Twitter 
Why Scalding? 
Core Concepts and Limitations 
Scalding at Twitter 
Resources 
Speakers 
Ajay Singh is Hortonworks Director of Technical 
Channels and leads the strategic alliances with partners 
from a technology standpoint such as driving alignment 
on roadmaps, product certifications and demos. Ajay is 
dedicated to building, scaling and delivering exceptional 
go-to-market solutions with partners. 
Jonathan Coveney currently works at Twitter, where he 
has spent a lot of time maintaining and updating Scalding; 
in the past, he has worked extensively on Apache Pig. He 
is deeply interested in functional programming, as well as 
developing usable, scalable APIs for data processing at 
scale.
A Modern Data Architecture 
[Diagram: 
• Sources: existing sources (CRM, ERP, clickstream, logs) and emerging sources (sensor, sentiment, geo, unstructured) 
• Data system: repositories (RDBMS, EDW, MPP) alongside Enterprise Hadoop (governance & integration, security, operations, data access, data management) 
• Applications: business analytics, custom applications, packaged applications 
• Dev & data tools: build & test 
• Operational tools: manage & monitor] 
HDP 2.1: Enterprise Hadoop 
[Diagram: HDP 2.1, Hortonworks Data Platform 
• Governance & integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS) 
• Data access: batch (MapReduce), script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), others (in-memory analytics, ISV engines), Cascading 
• YARN: data operating system 
• Data management: HDFS (Hadoop Distributed File System), nodes 1 to N 
• Security: authentication, authorization, accounting, data protection; storage: HDFS; resources: YARN; access: Hive, …; pipeline: Falcon; cluster: Knox 
• Operations: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie) 
• Deployment choice: Linux, Windows, on-premise, cloud] 
Cascading SDK 
HDP Integrates and delivers Cascading SDK 
• Collection of tools, documentation, libraries, 
tutorials and example projects 
• Key Benefits 
• Simplified Development 
• Multi Language Support 
• Reuse existing skills and tools 
• Native YARN Integration 
Hortonworks delivers Enterprise support 
• Backed by Concurrent 
Hortonworks and Concurrent Advance Enterprise Data Application 
Development on Hadoop 
HDP Integration of Cascading SDK 
• Write once and deploy on your fabric of 
choice 
• Integration with data processing layer allows 
Cascading to take advantage of advances in 
interactive applications 
• Sep 17th - Cascading 3.0 WIP Now Supports 
Apache Tez 
– http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/ 
[Diagram (columns labeled CURRENT and WIP): 
• Presentation & application: Java (Cascading), Scala (Scalding), SQL (Lingual), ML (Pattern), shown over each processing engine 
• Batch data processing: MapReduce 
• Interactive data processing: Tez 
• Efficient cluster resource management & shared services (YARN)] 
Enable both existing and new applications to provide value to the organization 
Cascading.org Scalding Resources 
Scalding Resources on Cascading.org 
• Videos and Tutorials 
• Mailing List 
• Newsletter 
Cascading 3.0 WIP With Tez Support 
• https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez 
Scalding Training Debuts This Fall 
• In-person, 1-day class with labs 
• Email: info@cascading.io 
Jonathan Coveney 
Twitter 
@jco
Why Scalding? 
Writing raw map reduce is difficult! 
● Scalding is 
o Less verbose 
o Less error prone (type checking!) 
o Easier to evolve 
o Performant enough 
But what about Hive and Pig? 
● Really good for certain things 
o Excellent for quick, ad-hoc work 
o Easy to understand 
o Can leverage existing knowledge (i.e., SQL) 
● Not always the best for maintainability 
o Composition isn’t great 
o Testing is difficult 
o Type safety is lacking 
So… Cascading? 
● Still pretty verbose! 
● But you can use normal Java tools 
o Maven 
o JUnit 
o IDEs 
● Handles the low level details for you 
● A good target for higher level languages 
Scalding 
● Concise, expressive syntax 
● Testable 
● Abstractable 
● Composable 
Because it’s in a full-featured, functional language! 
But Scala is scary! 
● Scalding doesn’t force you to use more complicated 
features 
● Can just write less-verbose Java if desired 
● Functional programming is an important paradigm in general, 
and especially for big data 
Learning new things is good for your brain :) 
Example Scalding job 
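import com.twitter.scalding._

class Webinar(args: Args) extends Job(args) { 
  import TDsl._ 

  // Read each line, split it on whitespace, and count occurrences per word.
  TextLine(args("input")) 
    .flatMap { _.split("\\s+") } 
    .map { w => (w, 1L) } 
    .group 
    .sum 
    .write(TypedTsv[(String, Long)](args("output"))) 
} 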
“Hadoop is a system for counting words” -Oscar Boykin, @posco 
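To make the group-then-sum step of the job above concrete, here is a minimal in-memory sketch using only Scala's standard collections. It is an analogy, not Scalding code: the object name `WordCountSketch` and the method `count` are hypothetical, and nothing here runs on Hadoop.

```scala
object WordCountSketch {
  // In-memory analogue of the word count job: split each line on whitespace,
  // pair every word with 1L, group the pairs by word, and sum the counts.
  def count(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+").toSeq)
      .filter(_.nonEmpty) // splitting an empty line yields one empty token
      .map(w => (w, 1L))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

The shape is the same as the Scalding job: flatMap, map, group, sum. The difference is that Scalding plans those steps as map and reduce phases over distributed data.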
Core concepts 
● Source 
o How to read or write data 
● TypedPipe[T] 
o A distributed list of T 
o Kind of like a Seq[T] in Scala’s collections library 
● Grouped[K, T] 
o A grouping on K 
o Represents transition to reduce phase 
Word Co-Occurrence 
TextLine(args("input")) 
  .flatMap { line => 
    val words = line.split("\\s+") 
    for (w1 <- words; w2 <- words if (w1 != w2)) yield (w1, Map(w2 -> 1L)) 
  }.group[String, Map[String, Long]] 
  .sum 
  .flatMap { case (word, wordMap) => wordMap.map { 
    case (otherWord, count) => (word, otherWord, count) 
  }}.write(TypedTsv[(String, String, Long)](args("output"))) 
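The interesting step in the co-occurrence job is summing values of type Map[String, Long]: maps for the same key word are merged, and counts for the same neighbour are added. A minimal in-memory sketch of that behaviour, using plain Scala collections, might look like the following. The names `CoOccurrenceSketch`, `merge`, and `coOccurrences` are hypothetical and only illustrate the idea.

```scala
object CoOccurrenceSketch {
  // Merge two count maps by adding the values of shared keys. This mimics
  // what summing Map[String, Long] values does in the job above.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  def coOccurrences(lines: Seq[String]): Set[(String, String, Long)] = {
    // Emit (word, Map(otherWord -> 1)) for every pair of distinct words on a line.
    val pairs = for {
      line <- lines
      words = line.split("\\s+").filter(_.nonEmpty).toSeq
      w1 <- words
      w2 <- words if w1 != w2
    } yield (w1, Map(w2 -> 1L))

    pairs
      .groupBy(_._1)
      .toSeq
      .flatMap { case (word, ps) =>
        // Sum the per-word maps, then flatten back to (word, other, count).
        ps.map(_._2)
          .reduce((a, b) => merge(a, b))
          .map { case (other, n) => (word, other, n) }
      }
      .toSet
  }
}
```

For the input lines "a b" and "a b c", the pair (a, b) appears on both lines, so the result contains ("a", "b", 2).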
Important concepts 
Scalding leverages a lot of Scala idioms, as well as 
concepts from functional programming 
● map 
o a 1 to 1 mapping for every piece of data 
● flatMap 
o a 1 to 0 or more mapping for every piece of data 
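The difference between the two is easiest to see on a small local collection. The sketch below uses Scala's standard Seq rather than a TypedPipe, and the object name `MapVsFlatMap` is made up for illustration.

```scala
object MapVsFlatMap {
  val lines: Seq[String] = Seq("hello world", "", "scalding")

  // map: exactly one output element for every input element.
  val lengths: Seq[Int] = lines.map(_.length)

  // flatMap: zero or more output elements for every input element.
  // The empty line contributes nothing; the first line contributes two words.
  val words: Seq[String] =
    lines.flatMap(_.split("\\s+").filter(_.nonEmpty).toSeq)
}
```

Three input lines always give three lengths, but the number of words depends on the content of each line.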
Important concepts (continued) 
● Typeclasses 
o The separation of computation from data types 
o Think Java’s Comparator (but way more powerful) 
o These are what power .sum 
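As a rough illustration of the typeclass idea, the sketch below defines a tiny "how to add two values" typeclass and one generic sum built on it. This is a simplified stand-in written for this example, not Algebird's actual definitions: `Combine`, `longCombine`, `mapCombine`, and `sumAll` are all hypothetical names.

```scala
object TypeclassSketch {
  // Hypothetical, simplified typeclass: knows how to combine two values of T.
  // The data type T itself needs no knowledge of this trait.
  trait Combine[T] { def plus(a: T, b: T): T }

  implicit val longCombine: Combine[Long] = new Combine[Long] {
    def plus(a: Long, b: Long): Long = a + b
  }

  // An instance for maps is derived from the instance for the value type.
  implicit def mapCombine[K, V](implicit v: Combine[V]): Combine[Map[K, V]] =
    new Combine[Map[K, V]] {
      def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
        b.foldLeft(a) { case (acc, (k, x)) =>
          acc.updated(k, acc.get(k).map(v.plus(_, x)).getOrElse(x))
        }
    }

  // One generic sum that works for any T with a Combine instance in scope.
  // Returns None for an empty input because Combine has no zero element.
  def sumAll[T](xs: Seq[T])(implicit c: Combine[T]): Option[T] =
    xs.reduceOption((a, b) => c.plus(a, b))
}
```

The same `sumAll` call handles Long counts and nested maps of counts, which is the property that lets a single `.sum` serve both the word count and the co-occurrence examples.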
Limitations 
Scalding’s limitations are MapReduce’s limitations 
● Bad at iterative jobs 
● Lots of checkpointing, serialization, sorting 
However... 
● Cascading on Tez could help! 
o in progress as part of Cascading 3.0 
● So could Cascading on Spark! 
The cutting edge 
● REPL support 
● Executor[T] 
o Decoupling TypedPipes from specifics of the execution 
engine 
o Makes Iterative algorithms much easier to express 
● Macros 
o Allowing easier use of case classes 
o Closure analysis? 
Scalding at Twitter 
● Thousands of users 
o Engineers AND data scientists 
● Many thousands of jobs every day 
o ETL 
o Recommendations 
o Email 
o Time series analysis 
When you use Twitter, you’re using features powered by 
Scalding! 
Useful practices 
● A standardized “Job” subclass with company specific 
information 
o Want the common case to be as simple as possible 
o Especially should configure serialization for users 
● Separate data from functions on data 
o At Twitter, this means Thrift for data, and various Scala 
functions operating on that data 
o Decouples the specification of some data from the derived 
data people want based on it 
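The second practice can be sketched in a few lines. In the example below a plain case class stands in for a Thrift-generated record, and the derived values live in a separate object of functions. Everything here is hypothetical: `PageView`, its fields, and `PageViewFunctions` are invented for illustration and are not Twitter's actual schema or code.

```scala
// Stand-in for a Thrift-generated record: data only, no behavior.
final case class PageView(userId: Long, url: String, timestampMs: Long)

// Derived values are plain functions kept apart from the data definition,
// so new derivations can be added without touching the record itself.
object PageViewFunctions {
  private val MillisPerHour = 3600000L

  // Which hour (counted from the epoch) a view falls into.
  def hourBucket(v: PageView): Long = v.timestampMs / MillisPerHour

  // Number of views per user.
  def viewsPerUser(views: Seq[PageView]): Map[Long, Int] =
    views.groupBy(_.userId).map { case (user, vs) => user -> vs.size }
}
```

Because the functions take the record as an argument, the same record definition can be shared across jobs that each want different derived data.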
Q&A 
Contribute! 
● Scalding 
● Algebird 
o Math inspired aggregators (.sum uses it) 
● Bijection 
o Conversion and serialization made fun 
● Summingbird 
o Abstraction for batch and online map/reduce (see resources for more) 
More resources 
Scalding/Algebird 
• Oscar Boykin: Algebra for Scalable Analytics 
• Avi Bryant: Add ALL the Things 
• Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce 
You might also be interested in… 
• Summingbird! Streaming real-time and batch analytics, unified and made 
beautiful 
• Oscar Boykin: Introduction to Summingbird 
• Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin: 
Summingbird, A Framework for Integrating Batch and Online MapReduce 
Computations 
Next Webinar – Oct 2 - Spark 
Writing applications to Hadoop and YARN using Spark 
• October 2nd at 9am Pacific Time 
• Register 
Find all webinars 
• Hortonworks.com/webinars 
Find past recorded webinars 
• Hortonworks.com/webinars/#library 
Thank you! 

YARN webinar series: Using Scalding to write applications to Hadoop and YARN

  • 1. Scalding
    YARN Webinar Series, September 18, 2014
    Page 1 © Hortonworks Inc. 2014
    Ajay Singh, Director - Hortonworks
    Jonathan Coveney, Senior Software Engineer - Twitter
  • 2. Agenda
    Introduction: Ajay Singh, Hortonworks
      Modern Data Architecture and how Cascading and Scalding fit in
    Scalding: Jonathan Coveney, Twitter
      Why Scalding?
      Core Concepts and Limitations
      Scalding at Twitter
      Resources
  • 3. Speakers
    Ajay Singh is Hortonworks' Director of Technical Channels and leads the strategic alliances with partners from a technology standpoint, such as driving alignment on roadmaps, product certifications and demos. Ajay is dedicated to building, scaling and delivering exceptional go-to-market solutions with partners.
    Jonathan Coveney currently works at Twitter, where he has spent a lot of time maintaining and updating Scalding; in the past, he has worked extensively on Apache Pig. He is deeply interested in functional programming, as well as developing usable, scalable APIs for data processing at scale.
  • 4. A Modern Data Architecture
    [Diagram labels]
    Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
    Data System: Repositories (RDBMS, EDW, MPP); Enterprise Hadoop (Data Management, Data Access, Governance & Integration, Security, Operations)
    Applications: Business Analytics, Custom Applications, Packaged Applications
    Dev & Data Tools: Build & Test
    Operational Tools: Manage & Monitor
  • 5. HDP 2.1: Enterprise Hadoop
    HDP 2.1, Hortonworks Data Platform
    [Diagram labels]
    Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
    Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines), Cascading
    Data Management: YARN: Data Operating System; HDFS (Hadoop Distributed File System)
    Security: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
    Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
    Deployment Choice: Linux, Windows, On-Premise, Cloud
  • 6. Cascading SDK
    HDP integrates and delivers the Cascading SDK
      • Collection of tools, documentation, libraries, tutorials and example projects
      • Key benefits
        • Simplified development
        • Multi-language support
        • Reuse existing skills and tools
        • Native YARN integration
    Hortonworks delivers enterprise support
      • Backed by Concurrent
    Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop
  • 7. HDP Integration of Cascading SDK
      • Write once and deploy on your fabric of choice
      • Integration with the data processing layer allows Cascading to take advantage of advances in interactive applications
      • Sep 17th: Cascading 3.0 WIP Now Supports Apache Tez
        – http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/
    [Diagram labels]
    Presentation & Application: Java (Cascading), Scala (Scalding), SQL (Lingual), ML (Pattern)
    Batch Data Processing: MapReduce (current)
    Interactive Data Processing: Tez (WIP)
    Efficient Cluster Resource Management & Shared Services (YARN)
    Enable both existing and new applications to provide value to the organization
  • 8. Cascading.org Scalding Resources
    Scalding resources on Cascading.org
      • Videos and tutorials
      • Mailing list
      • Newsletter
    Cascading 3.0 WIP with Tez support
      • https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez
    Scalding training debuts this fall
      • In-person, 1-day class with labs
      • Email: info@cascading.io
  • 9. Jonathan Coveney, Twitter, @jco
  • 10. Why Scalding?
    Writing raw MapReduce is difficult!
    ● Scalding is
      o Less verbose
      o Less error prone (type checking!)
      o Easier to evolve
      o Performant enough
  • 11. But what about Hive and Pig?
    ● Really good for certain things
      o Excellent for quick, ad-hoc work
      o Easy to understand
      o Can leverage existing knowledge (i.e. SQL)
    ● Not always the best for maintainability
      o Composition isn’t great
      o Testing is difficult
      o Type safety is lacking
  • 12. So… Cascading?
    ● Still pretty verbose!
    ● But you can use normal Java tools
      o Maven
      o JUnit
      o IDEs
    ● Handles the low-level details for you
    ● A good target for higher-level languages
  • 13. Scalding
    ● Concise, expressive syntax
    ● Testable
    ● Abstractable
    ● Composable
    Because it’s in a full-featured, functional language!
  • 14. But Scala is scary!
    ● Scalding doesn’t force you to use the more complicated features
    ● Can just write less-verbose Java if desired
    ● Functional programming is an important paradigm, especially for big data
    Learning new things is good for your brain :)
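  • 15. Example Scalding job
    import com.twitter.scalding._

    // Word count: split lines into words, pair each with 1, sum per word.
    class Webinar(args: Args) extends Job(args) {
      import TDsl._
      TextLine(args("input"))
        .flatMap { _.split("\\s+") }   // one line in, zero or more words out
        .map { w => (w, 1L) }
        .group                         // group by word (start of the reduce phase)
        .sum                           // add up the 1s for each word
        .write(TypedTsv[(String, Long)](args("output")))
    }

    “Hadoop is a system for counting words” - Oscar Boykin, @posco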
  • 16. Core concepts
    ● Source
      o How to read or write data
    ● TypedPipe[T]
      o A distributed list of T
      o Kind of like a Seq[T] in Scala’s collections library
    ● Grouped[K, T]
      o A grouping on K
      o Represents transition to reduce phase
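The three concepts on slide 16 can be pictured with ordinary Scala collections. This is a local analogy only, not Scalding code: `Seq` stands in for `TypedPipe[T]`, a `Map` of key to values stands in for `Grouped[K, T]`, and the object name and sample data are made up for illustration.

```scala
// Local analogy only: no Scalding dependency, nothing here is distributed.
object CoreConceptsSketch {
  // "Source": where the data comes from (here, an in-memory list of lines).
  val lines: Seq[String] = Seq("to be or not", "to be")

  // "TypedPipe[String]": a list of values; Seq is the local stand-in.
  val words: Seq[String] = lines.flatMap(_.split("\\s+"))

  // "Grouped[String, Long]": values gathered by key, which is where
  // a reduce phase would begin on a cluster.
  val grouped: Map[String, Seq[Long]] =
    words.map(w => (w, 1L)).groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // The reduce step: sum the values for each key.
  val counts: Map[String, Long] = grouped.map { case (k, vs) => (k, vs.sum) }

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w\t$c") }
}
```

The shape mirrors the word-count job on slide 15: a flat list of values, a grouping by key, then a per-key aggregation.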
  • 17. Word Co-Occurrence
    TextLine(args("input"))
      .flatMap { line =>
        val words = line.split("\\s+")
        // Emit (word, Map(otherWord -> 1)) for every pair of distinct words on the line.
        for (w1 <- words; w2 <- words if (w1 != w2))
          yield (w1, Map(w2 -> 1L))
      }
      .group[String, Map[String, Long]]
      .sum   // merges the maps per word, adding counts for matching keys
      .flatMap { case (word, wordMap) =>
        wordMap.map { case (otherWord, count) => (word, otherWord, count) }
      }
      .write(TypedTsv[(String, String, Long)](args("output")))
  • 18. Important concepts
    Scalding leverages a lot of Scala idioms, as well as concepts from functional programming
    ● map
      o a 1 to 1 mapping for every piece of data
    ● flatMap
      o a 1 to 0 or more mapping for every piece of data
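A small sketch of the difference between map and flatMap from slide 18, using plain Scala collections rather than a Scalding pipe. The object name and the sample lines are illustrative assumptions.

```scala
object MapVsFlatMap {
  val lines: Seq[String] = Seq("hello big data", "", "hello")

  // map: exactly one output per input, so three lines give three lengths.
  val lengths: Seq[Int] = lines.map(_.length)

  // flatMap: zero or more outputs per input; the empty line contributes nothing.
  val words: Seq[String] = lines.flatMap(_.split("\\s+").filter(_.nonEmpty))

  def main(args: Array[String]): Unit = {
    println(lengths) // one element per input line
    println(words)   // the words of all lines, flattened into one list
  }
}
```

The same two operations appear in the word-count job on slide 15: flatMap turns each line into its words, and map turns each word into a (word, 1L) pair.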
  • 19. Important concepts (continued)
    ● Typeclasses
      o The separation of computation from data types
      o Think Java’s Comparator (but way more powerful)
      o These are what power .sum
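To make the typeclass idea on slide 19 concrete, here is a minimal sketch in plain Scala. The `Semigroup` trait, its instances and `sumAll` are simplified, hypothetical stand-ins written for this example; they are not Algebird's or Scalding's actual definitions. The sketch suggests how a single generic sum can work on both `Long` and `Map[String, Long]`, which is the behaviour the word co-occurrence job on slide 17 relies on.

```scala
object TypeclassSketch {
  // Hypothetical minimal typeclass: how to combine two values of type T.
  // Like Java's Comparator, the behaviour lives outside the data type itself.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  // Instance for Long: ordinary addition.
  implicit val longSemigroup: Semigroup[Long] = new Semigroup[Long] {
    def plus(a: Long, b: Long): Long = a + b
  }

  // Instance for Map[K, V]: merge the maps, combining values that share a key.
  implicit def mapSemigroup[K, V](implicit sv: Semigroup[V]): Semigroup[Map[K, V]] =
    new Semigroup[Map[K, V]] {
      def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
        b.foldLeft(a) { case (acc, (k, v)) =>
          acc.updated(k, acc.get(k).map(sv.plus(_, v)).getOrElse(v))
        }
    }

  // One generic sum; the compiler selects the instance from the element type.
  def sumAll[T](items: Seq[T])(implicit sg: Semigroup[T]): Option[T] =
    items.reduceOption((a, b) => sg.plus(a, b))

  val total: Option[Long] = sumAll(Seq(1L, 2L, 3L))
  val merged: Option[Map[String, Long]] =
    sumAll(Seq(Map("a" -> 1L), Map("a" -> 2L, "b" -> 5L)))
}
```

Supporting a new data type means adding one more instance; `sumAll` itself never changes.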
  • 20. Limitations
    Scalding’s limitations are MapReduce’s limitations
    ● Bad at iterative jobs
    ● Lots of checkpointing, serialization, sorting
    However...
    ● Cascading on Tez could help!
      o In progress as part of Cascading 3.0
    ● So could Cascading on Spark!
  • 21. The cutting edge
    ● REPL support
    ● Executor[T]
      o Decoupling TypedPipes from specifics of the execution engine
      o Makes iterative algorithms much easier to express
    ● Macros
      o Allowing easier use of case classes
      o Closure analysis?
  • 22. Scalding at Twitter
    ● Thousands of users
      o Engineers AND data scientists
    ● Many thousands of jobs every day
      o ETL
      o Recommendations
      o Email
      o Time series analysis
    When you use Twitter, you’re using features powered by Scalding!
  • 23. Useful practices
    ● A standardized “Job” subclass with company-specific information
      o Want the common case to be as simple as possible
      o Especially should configure serialization for users
    ● Separate data from functions on data
      o At Twitter, this means Thrift for data, and various Scala functions operating on that data
      o Decouples the specification of some data from the derived data people want based on it
  • 24. Q&A
  • 25. Contribute!
    ● Scalding
    ● Algebird
      o Math-inspired aggregators (.sum uses it)
    ● Bijection
      o Conversion and serialization made fun
    ● Summingbird
      o Abstraction for batch and online map/reduce (see resources for more)
  • 26. More resources
    Scalding/Algebird
      • Oscar Boykin: Algebra for Scalable Analytics
      • Avi Bryant: Add ALL the Things
      • Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce
    You might also be interested in…
      • Summingbird! Streaming real-time and batch analytics, unified and made beautiful
      • Oscar Boykin: Introduction to Summingbird
      • Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin: Summingbird, A Framework for Integrating Batch and Online MapReduce Computations
  • 27. Next Webinar – Oct 2 – Spark
    Writing applications to Hadoop and YARN using Spark
      • October 2nd at 9am Pacific Time
      • Register
    Find all webinars
      • Hortonworks.com/webinars
    Find past recorded webinars
      • Hortonworks.com/webinars/#library
  • 28. Thank you!