Spark Application Carousel
Spark Summit East 2015
About Today’s Talk
• About Me:
• Vida Ha - Solutions Engineer at Databricks.
• Goal:
• For beginning and early-intermediate Spark developers.
• Motivate you to start writing more apps in Spark.
• Share some tips I’ve learned along the way.
Today’s Applications Covered
• Web Log Analysis
• Basic Data Pipeline - Spark & Spark SQL
• Wikipedia Dataset
• Machine Learning
• Facebook API
• Graph Algorithms
Application 1: Web Log Analysis
Web Logs
• Why?
• Most organizations have web log data.
• The dataset is often too large and expensive to store in a traditional database.
• An awesome, easy way to learn Spark!
• What?
• Standard Apache Access Logs.
• Web logs flow in each day from a web server.
Reading in Log Files
access_logs = (sc.textFile(DBFS_SAMPLE_LOGS_FOLDER)
               # Call parse_apache_log_line on each line.
               .map(parse_apache_log_line)
               # Cache the parsed objects in memory.
               .cache())
# Call an action on the RDD to actually populate the cache.
access_logs.count()
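The slide assumes a parse_apache_log_line helper, which isn't shown in the deck. A minimal sketch, assuming the Apache Common Log Format and only the Row fields used in later slides:

import re
from pyspark.sql import Row

# Common Log Format: host ident user [time] "method endpoint protocol" status bytes
APACHE_LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)')

def parse_apache_log_line(line):
    match = APACHE_LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("Unparseable log line: %s" % line)
    return Row(ipAddress=match.group(1),
               endpoint=match.group(6),
               responseCode=int(match.group(8)),
               # A "-" content size means zero bytes were returned.
               contentSize=0 if match.group(9) == "-" else int(match.group(9)))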
Calculate Content Size Stats
content_sizes = (access_logs
                 .map(lambda row: row.contentSize)
                 .cache())  # Cache since multiple queries.
average_content_size = content_sizes.reduce(
    lambda x, y: x + y) / content_sizes.count()
min_content_size = content_sizes.min()
max_content_size = content_sizes.max()
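As an aside, PySpark can compute all of these numeric summaries in a single pass:

# stats() returns a StatCounter with count, mean, stdev, max, and min.
summary = content_sizes.stats()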
Frequent IP Addresses - Key/Value Pairs
ip_addresses_rdd = (access_logs
                    .map(lambda log: (log.ipAddress, 1))
                    .reduceByKey(lambda x, y: x + y)
                    .filter(lambda s: s[1] > n)
                    .map(lambda s: Row(ip_address=s[0],
                                       count=s[1])))
# Alternately, could just collect() the values.
# Register as a SQL table (createDataFrame is the Spark 1.3+ API).
sqlContext.createDataFrame(ip_addresses_rdd).registerTempTable("ip_addresses")
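Once registered, the temp table is queryable from SQL like any other table:

sqlContext.sql("SELECT * FROM ip_addresses LIMIT 10").collect()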
Other Statistics to Compute
• Response Code Count.
• Top Endpoints & Distribution.
• …and more.
A great way to learn various Spark transformations and
actions, and how to chain them together (see the sketch below).

* BUT Spark SQL makes this much easier!
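For instance, the response code count follows the same key/value pattern as the IP address count; a minimal sketch, assuming the parsed Row carries a responseCode field:

response_code_counts = (access_logs
                        .map(lambda log: (log.responseCode, 1))
                        .reduceByKey(lambda x, y: x + y)
                        .collect())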
Better: Register Logs as a Spark SQL Table
sqlContext.sql("""CREATE EXTERNAL TABLE access_logs
    ( ipaddress STRING … contentsize INT … )
    ROW FORMAT
    SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = '^(\\S+) (\\S+) (\\S+) …')
    LOCATION '/tmp/sample_logs'
""")
Content Sizes with Spark SQL
sqlContext.sql("""SELECT
    (SUM(contentsize) / COUNT(*)),  -- Average
    MIN(contentsize),
    MAX(contentsize)
FROM access_logs""")
Frequent IP Addresses with Spark SQL
sqlContext.sql("""SELECT
    ipaddress,
    COUNT(*) AS total
FROM access_logs
GROUP BY ipaddress
HAVING total > N""")
Tip: Use Partitioning
• Only analyze files from days you care about.

sqlContext.sql("ALTER TABLE access_logs
    ADD PARTITION (date='20150318')
    LOCATION '/logs/2015/3/18'")

• If your data rolls over between days, perhaps those few missed
logs don't matter.
Tip: Define Last N Day Tables for Caching
• Create another table with a similar format.
• Only register partitions for the last N days.
• Each night (sketched below):
• Uncache the table.
• Update the partition definitions.
• Recache:
    sqlContext.sql("CACHE TABLE access_logs_last_7_days")
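A minimal sketch of that nightly refresh, assuming a HiveContext, the table name above, and date partitions laid out like the earlier example:

from datetime import date, timedelta

sqlContext.sql("UNCACHE TABLE access_logs_last_7_days")
for delta in range(7):
    day = date.today() - timedelta(days=delta)
    sqlContext.sql(
        "ALTER TABLE access_logs_last_7_days "
        "ADD IF NOT EXISTS PARTITION (date='%s') "
        "LOCATION '/logs/%d/%d/%d'" %
        (day.strftime('%Y%m%d'), day.year, day.month, day.day))
# Dropping partitions older than N days is analogous (ALTER TABLE ... DROP PARTITION).
sqlContext.sql("CACHE TABLE access_logs_last_7_days")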
Tip: Monitor the Pipeline with Spark SQL
• Detect if your batch jobs are taking too long.
• Programmatically create a temp table with stats from
one run.
sqlContext.sql("CREATE TABLE IF NOT EXISTS
pipelineStats (runStart INT, runDuration INT)")
sqlContext.sql("INSERT INTO TABLE pipelineStats SELECT
runStart, runDuration FROM oneRun LIMIT 1")
• Coalesce the table from time to time to compact the many small files that repeated single-row inserts create. (A sketch of producing oneRun follows.)
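A sketch of how a run might record its own stats into the hypothetical oneRun table referenced above:

import time
from pyspark.sql import Row

run_start = int(time.time())
run_batch_job()  # hypothetical: the batch work being monitored
run_duration = int(time.time()) - run_start

one_run = sc.parallelize([Row(runStart=run_start, runDuration=run_duration)])
sqlContext.createDataFrame(one_run).registerTempTable("oneRun")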
Demo: Web Log Analysis
Application: Wikipedia
Tip: Use Spark to Parallelize Downloading
Wikipedia can be downloaded in one giant file, or you can
download the 27 parts.
val articlesRDD = sc.parallelize(articlesToRetrieve.toList, 4)
val retrieveInPartitions = (iter: Iterator[String]) => {
  iter.map(article => retrieveArticleAndWriteToS3(article)) }
val fetchedArticles =
  articlesRDD.mapPartitions(retrieveInPartitions).collect()
Processing XML data
• Very large (> 1 GB) compressed XML files are hard to
process.
• They are not easily splittable, so a single task ends up reading
the whole file.
• Solution: Break into text files where there is one XML
element per line (see the sketch below).
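Once each line holds exactly one XML element, the file splits cleanly across workers and can be parsed line by line; a minimal sketch with Python's standard library (the path and field names are illustrative):

import xml.etree.ElementTree as ET

def parse_article(line):
    # Each line is one complete <page>...</page> element.
    elem = ET.fromstring(line)
    return (elem.findtext('title'), elem.findtext('revision/text'))

articles = sc.textFile('/wikipedia/one_element_per_line').map(parse_article)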
ETL-ing your data with Spark
• Use an XML Parser to pull out fields of interest in the
XML document.
• Save the data in Parquet File format for faster querying.
• Register the Parquet files as a Spark SQL table, since
there is a clearly defined schema (see the sketch below).
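A sketch of the save-and-register step with the Spark 1.x Python API (the paths, table name, and the articles RDD of (title, text) pairs are assumptions carried over from the sketch above):

from pyspark.sql import Row

# Write the extracted fields as Parquet for faster, columnar querying.
articles_df = sqlContext.createDataFrame(
    articles.map(lambda a: Row(title=a[0], text=a[1])))
articles_df.saveAsParquetFile('/wikipedia/articles.parquet')

# Register the Parquet files as a queryable Spark SQL table.
sqlContext.parquetFile('/wikipedia/articles.parquet') \
    .registerTempTable('wikipedia')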
Using Spark for Fast Data Exploration
• CACHE the dataset for faster querying.
• Interactive programming experience.
• Use a mix of Python or Scala combined with SQL to
analyze the dataset.
Tip: Use MLlib to Learn from the Dataset
• Wikipedia articles are a rich dataset for the English
language.
• Word2Vec is a simple algorithm for learning synonyms, and
can be applied to the Wikipedia articles (see the sketch below).
• Try out your favorite ML/NLP algorithms!
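A minimal sketch with MLlib's Word2Vec Python API, reusing the hypothetical wikipedia table registered earlier (a naive whitespace tokenizer for brevity):

from pyspark.mllib.feature import Word2Vec

# Word2Vec expects an RDD of token lists.
tokenized = (sqlContext.table('wikipedia')
             .rdd.map(lambda row: row.text.split()))

model = Word2Vec().fit(tokenized)
# Returns (word, cosine similarity) pairs.
synonyms = model.findSynonyms('computer', 5)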
Demo: Wikipedia App
Application: Facebook API
Tip: Use Spark to Scrape Facebook Data
• Use Spark to make requests to the Facebook API for friends of
friends in parallel (see the sketch below).
• NOTE: The latest Facebook API will only show friends that
have also enabled the app.
• If you build a Facebook App and get more users to
accept it, you can build a more complete picture of the
social graph!
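A sketch of that parallel fan-out with the requests library; the Graph API endpoint shape, ACCESS_TOKEN, my_friend_ids, and the fetch_friends helper are all illustrative placeholders, not the deck's actual code:

import requests

def fetch_friends(user_id):
    # Hypothetical Graph API call; requires a valid user access token.
    resp = requests.get(
        'https://graph.facebook.com/v2.2/%s/friends' % user_id,
        params={'access_token': ACCESS_TOKEN})
    return [friend['id'] for friend in resp.json().get('data', [])]

# One HTTP request per friend, issued in parallel across the cluster.
friend_ids = sc.parallelize(my_friend_ids, 16)
friends_of_friends = friend_ids.flatMap(fetch_friends).collect()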
Tip: Use GraphX to Learn from the Data
• Use the PageRank algorithm to determine who's the
most popular**.
• Output vertex data: Facebook user id to name.
• Output edges: user id to user id.

** In this case it's my friends, so I'm clearly the most
popular.
Demo: Facebook API
Conclusion
• I hope this talk has inspired you to write Spark
applications on your favorite dataset.
• Hacking (and making mistakes) is the best way to learn.
• If you want to walk through some examples, see the
Databricks Spark Reference Applications:
• https://github.com/databricks/reference-apps
THE END
