SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
@ Coursera 
Daniel Chia 
@DanielJHChia 
Software Engineer, Infrastructure
Overview 
• Context 
• Growing Needs 
• Hive / Pig / Scalding
Technical (Online Stack) 
• 100% hosted on AWS 
• Service-oriented architecture 
• Mix of MySQL and Cassandra for persistence 
• Scala
Existing Warehouse 
Streaming
Future Warehouse Flow 
S3 
Event Data
Need 1: Expressive 
• Joins 
• Aggregations 
• Secondary sort 
• Multiple map-reduce
Need 2: Semi-structured Data 
• Increased usage of Cassandra 
• Events data
{ 
“timestamp”:1411359695744, 
“membershipState":"LearnerEnrolled" 
}
{ 
"typeName": "multipart", 
"definition": { 
"assignmentParts": { 
"id1": { 
"typeName": "plainText", 
"order": 0, 
"definition": { 
"prompt": "Write a sentence describing what you think about cereal." 
} 
}, 
"id2": { 
"typeName": "richText", 
"order": 1, 
"definition": { 
"prompt": "Write a long essay with lots of fancy formatting describing what you 
think about cereal." 
} 
}, 
"id3": { 
"typeName": "url", 
"order": 2, 
"definition": { 
"prompt": "Post a link to your favorite cereal." 
} 
}, 
"id4": { 
"typeName": "plainText", 
"order": 3, 
"definition": { 
…
Choices 
• Hive 
• Pig 
• Scalding
Hive 
• SQL-like language 
• Great for simple rollups and aggregations 
• Procedural transforms difficult to express
Pig 
• Mature 
• Procedural 
• Pig Latin + Lots of UDFs
Scalding – Pros 
• Succinct 
• Expressive 
• All code in one language 
• Re-use online data models
Scaling – Pros 
• Easy to test
Scalding – Cons 
• Have to learn Scala 
• More heavy weight for simple experimental things. 
• Many layers abstracted from MapReduce
Scalding – Example 
• User event data 
• Want to join with course and topic data
Scalding – Example 
val events = TypedTsv … /* load data */ 
.toTypedPipe 
val courses = TypedTsv … 
.toTypedPipe 
val topics = TypedTsv … 
.toTypedPipe
Scalding – Example 
events.groupBy(_.courseId) 
.leftJoin(courses.groupBy(_.courseId)) 
.groupBy(_._2.topicId) 
.leftJoin(topics.groupBy(_.topicId)) 
/* more analysis */
Scalding – Example 
events.groupBy(_.courseId) 
.leftJoin(courses.groupBy(_.courseId)) 
.groupBy(_._2.topicId) 
.leftJoin(topics.groupBy(_.topicId)) 
/* more analysis */
Scalding – Example 
events.groupBy(_.courseId) 
.leftJoin(courses.groupBy(_.courseId)) 
.groupBy(_._2.topicId) 
.sketch(reducer = 100) 
.leftJoin(topics.groupBy(_.topicId))
Scalding – Wish-list 
• More documentation 
• Scala 2.11 soon, please?
Questions? 
We’re hiring! 
coursera.org/jobs

Más contenido relacionado

La actualidad más candente

Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIIntroducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
ScyllaDB
 

La actualidad más candente (20)

BigData, NoSQL & ElasticSearch
BigData, NoSQL & ElasticSearchBigData, NoSQL & ElasticSearch
BigData, NoSQL & ElasticSearch
 
Azure CosmosDb - Where we are
Azure CosmosDb - Where we areAzure CosmosDb - Where we are
Azure CosmosDb - Where we are
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIIntroducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
 
NoSQL and AWS Dynamodb
NoSQL and AWS DynamodbNoSQL and AWS Dynamodb
NoSQL and AWS Dynamodb
 
Azure CosmosDB the new frontier of big data and nosql
Azure CosmosDB the new frontier of big data and nosqlAzure CosmosDB the new frontier of big data and nosql
Azure CosmosDB the new frontier of big data and nosql
 
Mongodb lab
Mongodb labMongodb lab
Mongodb lab
 
Azure CosmosDb
Azure CosmosDbAzure CosmosDb
Azure CosmosDb
 
Scaling horizontally on AWS
Scaling horizontally on AWSScaling horizontally on AWS
Scaling horizontally on AWS
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
 
Scylla Summit 2018: Scaling your time series data with Newts
Scylla Summit 2018: Scaling your time series data with NewtsScylla Summit 2018: Scaling your time series data with Newts
Scylla Summit 2018: Scaling your time series data with Newts
 
Big Data Overview Part 1
Big Data Overview Part 1Big Data Overview Part 1
Big Data Overview Part 1
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
 
CosmosDb for beginners
CosmosDb for beginnersCosmosDb for beginners
CosmosDb for beginners
 
Elasticsearch in Production (London version)
Elasticsearch in Production (London version)Elasticsearch in Production (London version)
Elasticsearch in Production (London version)
 
Build 2017 - P4010 - A lap around Azure HDInsight and Cosmos DB Open Source A...
Build 2017 - P4010 - A lap around Azure HDInsight and Cosmos DB Open Source A...Build 2017 - P4010 - A lap around Azure HDInsight and Cosmos DB Open Source A...
Build 2017 - P4010 - A lap around Azure HDInsight and Cosmos DB Open Source A...
 
Introduction to Cassandra (June 2010)
Introduction to Cassandra (June 2010)Introduction to Cassandra (June 2010)
Introduction to Cassandra (June 2010)
 
Lightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at CogentaLightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at Cogenta
 
SQL vs NoSQL
SQL vs NoSQLSQL vs NoSQL
SQL vs NoSQL
 

Destacado

Destacado (11)

Analytics with Splunk (Open edX)
Analytics with Splunk (Open edX)Analytics with Splunk (Open edX)
Analytics with Splunk (Open edX)
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
What we're learning about burnout and how DevOps can help
What we're learning about burnout and how DevOps can helpWhat we're learning about burnout and how DevOps can help
What we're learning about burnout and how DevOps can help
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Twitter by the Numbers
Twitter by the NumbersTwitter by the Numbers
Twitter by the Numbers
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Real-time systems at Twitter (Velocity 2012)
Real-time systems at Twitter (Velocity 2012)Real-time systems at Twitter (Velocity 2012)
Real-time systems at Twitter (Velocity 2012)
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA Summit
 
Scaling Twitter
Scaling TwitterScaling Twitter
Scaling Twitter
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 

Similar a Scalding @ Coursera

Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
MongoDB APAC
 

Similar a Scalding @ Coursera (20)

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Shooting rabbits with sling
Shooting rabbits with slingShooting rabbits with sling
Shooting rabbits with sling
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
 
Guacamole
GuacamoleGuacamole
Guacamole
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Training Slides: 351 - Tungsten Replicator for Data Warehouses
Training Slides: 351 - Tungsten Replicator for Data WarehousesTraining Slides: 351 - Tungsten Replicator for Data Warehouses
Training Slides: 351 - Tungsten Replicator for Data Warehouses
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
StackMate - CloudFormation for CloudStack
StackMate - CloudFormation for CloudStackStackMate - CloudFormation for CloudStack
StackMate - CloudFormation for CloudStack
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,
 

Último

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 

Último (20)

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 

Scalding @ Coursera