YARN webinar series: Using Scalding to write applications to Hadoop and YARN
1. Scalding
YARN Webinar Series
September 18, 2014
Page 1 © Hortonworks Inc. 2014
Ajay Singh, Director - Hortonworks
Jonathan Coveney, Senior Software Engineer - Twitter
2. Agenda
Introduction: Ajay Singh, Hortonworks
Modern Data Architecture and how Cascading and Scalding fit in
Scalding: Jonathan Coveney, Twitter
Why Scalding?
Core Concepts and Limitations
Scalding at Twitter
Resources
3. Speakers
Ajay Singh is Hortonworks' Director of Technical
Channels and leads the strategic alliances with partners
from a technology standpoint, including driving alignment
on roadmaps, product certifications and demos. Ajay is
dedicated to building, scaling and delivering exceptional
go-to-market solutions with partners.
Jonathan Coveney currently works at Twitter, where he
has spent a lot of time maintaining and updating Scalding;
in the past, he has worked extensively on Apache Pig. He
is deeply interested in functional programming, as well as
developing usable, scalable APIs for data processing at
scale.
4. A Modern Data Architecture
[Diagram: a modern data architecture.
Sources: existing sources (CRM, ERP, clickstream, logs) and emerging sources (sensor, sentiment, geo, unstructured).
Data system: repositories (RDBMS, EDW, MPP) alongside Enterprise Hadoop (governance & integration, security, operations, data access, data management).
Applications: business analytics, custom applications, packaged applications.
Supporting tools: dev & data tools (build & test) and operational tools (manage & monitor).]
5. HDP 2.1: Enterprise Hadoop
[Diagram: HDP 2.1, Hortonworks Data Platform.
Governance & Integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS).
Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (in-memory analytics, ISV engines), plus Cascading.
Data Management: YARN (data operating system) on HDFS (Hadoop Distributed File System), nodes 1 to N.
Security: authentication, authorization, accounting, data protection; Storage: HDFS, Resources: YARN, Access: Hive, …, Pipeline: Falcon, Cluster: Knox.
Operations: provision, manage & monitor (Ambari, ZooKeeper); scheduling (Oozie).
Deployment choice: Linux, Windows, on-premise, cloud.]
6. Cascading SDK
HDP Integrates and delivers Cascading SDK
• Collection of tools, documentation, libraries,
tutorials and example projects
• Key Benefits
• Simplified Development
• Multi Language Support
• Reuse existing skills and tools
• Native YARN Integration
Hortonworks delivers Enterprise support
• Backed by Concurrent
Hortonworks and Concurrent Advance Enterprise Data Application
Development on Hadoop
7. HDP Integration of Cascading SDK
• Write once and deploy on your fabric of
choice
• Integration with data processing layer allows
Cascading to take advantage of advances in
interactive applications
• Sep 17th - Cascading 3.0 WIP Now Supports
Apache Tez
– http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/
[Diagram: presentation & application layer over two data processing engines, both running on YARN (efficient cluster resource management & shared services). Batch data processing on MapReduce is labeled CURRENT and interactive data processing on Tez is labeled WIP. Each engine hosts the same stack: Java (Cascading), Scala (Scalding), SQL (Lingual), ML (Pattern).]
Enable both existing and new applications to
provide value to the organization
8. Cascading.org Scalding Resources
Scalding Resources on Cascading.org
• Videos and Tutorials
• Mailing List
• Newsletter
Cascading 3.0 WIP With Tez Support
• https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez
Scalding Training Debuts This Fall
• In-person, 1-day class with labs
• Email: info@cascading.io
9. Jonathan Coveney
Twitter
@jco
10. Why Scalding?
Writing raw MapReduce is difficult!
● Scalding is
o Less verbose
o Less error prone (type checking!)
o Easier to evolve
o Performant enough
11. But what about Hive and Pig?
● Really good for certain things
o Excellent for quick, ad-hoc work
o Easy to understand
o Can leverage existing knowledge (e.g. SQL)
● Not always the best for maintainability
o Composition isn’t great
o Testing is difficult
o Type safety is lacking
12. So… Cascading?
● Still pretty verbose!
● But you can use normal Java tools
o Maven
o JUnit
o IDEs
● Handles the low level details for you
● A good target for higher level languages
13. Scalding
● Concise, expressive syntax
● Testable
● Abstractable
● Composable
Because it’s in a full-featured, functional language!
14. But Scala is scary!
● Scalding doesn’t force you to use more complicated
features
● Can just write less-verbose Java if desired
● Functional programming is an important paradigm,
especially for big data
Learning new things is good for your brain :)
15. Example Scalding job
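import com.twitter.scalding._

// Word count: split each line into words, then sum a count of 1 per word.
class Webinar(args: Args) extends Job(args) {
  import TDsl._
  TextLine(args("input"))
    .flatMap { _.split("\\s+") }  // one line -> many words
    .map { w => (w, 1L) }         // pair each word with a count of 1
    .group                        // group by word (start of the reduce phase)
    .sum                          // add up the counts for each word
    .write(TypedTsv[(String, Long)](args("output")))
}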
“Hadoop is a system for counting words” -Oscar Boykin, @posco
16. Core concepts
● Source
o How to read or write data
● TypedPipe[T]
o A distributed list of T
o Kind of like a Seq[T] in Scala’s collections library
● Grouped[K, T]
o A grouping on K
o Represents transition to reduce phase
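The "distributed list" analogy can be tried locally. The sketch below is not Scalding code: it uses only Scala's standard collections, with a Seq[String] standing in for a TypedPipe[String] and groupBy standing in for the move to Grouped[K, T]. The object and method names are made up for illustration.

```scala
// Plain Scala collections only; no Scalding or Hadoop dependency.
object LocalWordCount {
  // Seq[String] stands in for TypedPipe[String]: a list of lines.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+").toList)   // one line -> many words
      .filter(_.nonEmpty)                // drop empty tokens (e.g. from blank lines)
      .map(w => (w, 1L))                 // pair each word with a count of 1
      .groupBy(_._1)                     // local analogue of .group
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // analogue of .sum
}
```

For example, LocalWordCount.wordCount(Seq("a b a", "b c")) counts "a" and "b" twice and "c" once. On a cluster the grouping step is where data is shuffled to reducers; in memory it is just a groupBy.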
17. Word Co-Occurrence
TextLine(args("input"))
  .flatMap { line =>
    val words = line.split("\\s+")
    // Emit (w1, Map(w2 -> 1)) for every ordered pair of distinct words on the line
    for (w1 <- words; w2 <- words if w1 != w2) yield (w1, Map(w2 -> 1L))
  }
  .group[String, Map[String, Long]]
  .sum  // merges the maps for each word, adding counts for matching keys
  .flatMap { case (word, wordMap) =>
    wordMap.map { case (otherWord, count) => (word, otherWord, count) }
  }
  .write(TypedTsv[(String, String, Long)](args("output")))
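To see what summing Map values does in this job, here is a minimal in-memory sketch using only the Scala standard library. It is an illustration, not the Scalding implementation: mergeCounts is a hypothetical helper that spells out the key-by-key addition that .sum performs on Map[String, Long] values, and the object name is invented.

```scala
// Plain Scala collections only; no Scalding or Hadoop dependency.
object LocalCoOccurrence {
  // Merge two count maps, adding the counts of keys present in both.
  def mergeCounts(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  def coOccurrence(lines: Seq[String]): Map[String, Map[String, Long]] =
    lines
      .flatMap { line =>
        val words = line.split("\\s+").filter(_.nonEmpty).toList
        // One single-entry map per ordered pair of distinct words on the line
        for (w1 <- words; w2 <- words if w1 != w2) yield (w1, Map(w2 -> 1L))
      }
      .groupBy(_._1)
      .map { case (word, pairs) =>
        (word, pairs.map(_._2).foldLeft(Map.empty[String, Long])(mergeCounts))
      }
}
```

With the input Seq("a b", "a b c"), the entry for "a" becomes Map("b" -> 2, "c" -> 1): "b" appears next to "a" on both lines, "c" only on the second.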
18. Important concepts
Scalding leverages a lot of Scala idioms, as well as
concepts from functional programming
● map
o a 1 to 1 mapping for every piece of data
● flatMap
o a 1 to 0 or more mapping for every piece of data
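The difference between the two is easy to see on an ordinary Scala List, which behaves the same way for this purpose. This is a small standard-library sketch; the object name and sample data are invented for illustration.

```scala
object MapVsFlatMap {
  val lines: List[String] = List("hello world", "", "scalding")

  // map: exactly one output per input, so the result has the same length.
  val lengths: List[Int] = lines.map(_.length)

  // flatMap: zero or more outputs per input; the empty line contributes nothing.
  val words: List[String] =
    lines.flatMap(_.split("\\s+").filter(_.nonEmpty).toList)
}
```

Here lengths still has three elements, one per line, while words has three elements that come from only two of the lines.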
19. Important concepts (continued)
● Typeclasses
o The separation of computation from data types
o Think Java’s Comparator (but way more powerful)
o These are what power .sum
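A rough sketch of the idea, using only the Scala standard library: the Semigroup trait below is a simplified stand-in written for this example, not Algebird's actual definition, and sumAll is a hypothetical helper. It shows how one generic function can add Longs or merge Maps depending on which instance the compiler finds.

```scala
object TypeclassSum {
  // Hypothetical, simplified typeclass: how to combine two values of type T.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  object Semigroup {
    implicit val longSemigroup: Semigroup[Long] = new Semigroup[Long] {
      def plus(a: Long, b: Long): Long = a + b
    }

    // Maps combine key by key, using the Semigroup of their values.
    implicit def mapSemigroup[K, V](implicit sv: Semigroup[V]): Semigroup[Map[K, V]] =
      new Semigroup[Map[K, V]] {
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, acc.get(k).map(sv.plus(_, v)).getOrElse(v))
          }
      }
  }

  // One generic function; the compiler picks the instance from the element type.
  // Returns None for an empty input because a semigroup has no zero element.
  def sumAll[T](items: Seq[T])(implicit sg: Semigroup[T]): Option[T] =
    items.reduceOption((a, b) => sg.plus(a, b))
}
```

The computation (sumAll) never mentions Long or Map; the data types bring their own combining rule. That is the separation the slide describes, and it is why the same .sum call works for both the word count and the co-occurrence jobs.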
20. Limitations
Scalding’s limitations are MapReduce’s limitations
● Bad at iterative jobs
● Lots of checkpointing, serialization, sorting
However...
● Cascading on Tez could help!
o in progress as part of Cascading 3.0
● So could Cascading on Spark!
21. The cutting edge
● REPL support
● Executor[T]
o Decoupling TypedPipes from specifics of the execution
engine
o Makes iterative algorithms much easier to express
● Macros
o Allowing easier use of case classes
o Closure analysis?
22. Scalding at Twitter
● Thousands of users
o Engineers AND data scientists
● Many thousands of jobs every day
o ETL
o Recommendations
o Email
o Time series analysis
When you use Twitter, you’re using features powered by
Scalding!
23. Useful practices
● A standardized “Job” subclass with company-specific
information
o Want the common case to be as simple as possible
o In particular, it should configure serialization for users
● Separate data from functions on data
o At Twitter, this means Thrift for data, and various Scala
functions operating on that data
o Decouples the specification of some data from the derived
data people want based on it
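One way to picture the second practice, as a sketch only: PageView below is a hypothetical case class standing in for a generated Thrift struct, and the functions are invented examples of derived data. Nothing here reflects Twitter's actual schemas or code.

```scala
object DataVsFunctions {
  // Hypothetical record; stands in for a generated Thrift struct (data only).
  final case class PageView(userId: Long, url: String, durationMs: Long)

  // Derived data lives in plain functions, kept apart from the record definition.
  object PageViewFunctions {
    def isBounce(v: PageView): Boolean = v.durationMs < 1000L

    def totalTimeByUser(views: Seq[PageView]): Map[Long, Long] =
      views.groupBy(_.userId).map { case (user, vs) => (user, vs.map(_.durationMs).sum) }
  }
}
```

Because the record carries no logic, new derived values can be added as further functions without touching the data specification.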
25. Contribute!
● Scalding
● Algebird
o Math-inspired aggregators (.sum uses it)
● Bijection
o Conversion and serialization made fun
● Summingbird
o Abstraction for batch and online map/reduce (see resources for more)
26. More resources
Scalding/Algebird
• Oscar Boykin: Algebra for Scalable Analytics
• Avi Bryant: Add ALL the Things
• Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce
You might also be interested in…
• Summingbird! Streaming real-time and batch analytics, unified and made
beautiful
• Oscar Boykin: Introduction to Summingbird
• Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin:
Summingbird, A Framework for Integrating Batch and Online MapReduce
Computations
27. Next Webinar – Oct 2 - Spark
Writing applications to Hadoop and YARN using Spark
• October 2nd at 9am Pacific Time
• Register
Find all webinars
• Hortonworks.com/webinars
Find past recorded webinars
• Hortonworks.com/webinars/#library