Heuritech: Apache Spark REX


didmarin

Experience report (REX) on the use of Spark at Heuritech


ABOUT ME
Didier Marin
PhD in Computer Science (UPMC)
Machine Learning, Reinforcement Learning & Robotics
Co-founder of Heuritech
Likes functional programming and distributed computing

We develop tools to make sense of raw text data
Customer insight using the text of visited web pages

Data Analytics Platform
Qualify users using their web logs
50M lines/day
Match CRM and web data

WHY SPARK?
Performance, in particular when batch size < total RAM in cluster
More general than MapReduce (MR), high-level API
Extensions (ML, streaming) and connectors (Cassandra)
Growing community

PARSING LOGS
def parseLine(line: String): Either[ParsingError, LogData] = ???
val logs = sc.textFile("logfile").map(parseLine(_))
val validLogs = logs.flatMap(_.right.toOption)
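
To make the Either-based pattern above concrete, here is a minimal sketch of what parseLine might look like. The slides leave the body as ???, so the LogData fields, the ParsingError shape, and the space-separated log format are all assumptions made for illustration. A plain Seq stands in for the RDD so the snippet runs without a cluster; an RDD offers the same map and partial-function collect.

```scala
// Hypothetical types: the slides do not define LogData or ParsingError.
final case class LogData(userId: String, url: String, timestamp: Long)
final case class ParsingError(line: String, reason: String)

// Assumed format: "<userId> <url> <epochMillis>", separated by single spaces.
def parseLine(line: String): Either[ParsingError, LogData] =
  line.split(" ") match {
    case Array(user, url, ts) =>
      scala.util.Try(ts.toLong).toOption match {
        case Some(t) => Right(LogData(user, url, t))
        case None    => Left(ParsingError(line, "timestamp is not a number"))
      }
    case _ => Left(ParsingError(line, "expected 3 fields"))
  }

// A local Seq stands in for sc.textFile("logfile").
val lines = Seq(
  "u1 http://example.com/a 1000",
  "garbage",
  "u2 http://example.com/b notANumber"
)
val logs = lines.map(parseLine)
val validLogs = logs.collect { case Right(data) => data } // keep the successes
val errors = logs.collect { case Left(err) => err }       // keep the failures for inspection
```

Keeping the Left values around, rather than dropping them silently, makes it possible to count or sample malformed lines before deciding whether the parser needs to be adjusted.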

CLUSTER CONFIGURATION
LXC + Salt
N containers: 1 master/executor + (N-1) executors
Cassandra node for each Spark executor
Using an "uber"-JAR to submit jobs
Sharing data through NFS

MANAGING SPARK'S MEMORY
Default: 40 % working memory, 60 % cache (spark.storage.memoryFraction = 0.6 in Spark 1.x)
20 % of cache used to unroll blocks (spark.storage.unrollFraction = 0.2)
Explicit caching for huge RDDs we reuse:
validLogs.persist(StorageLevel.MEMORY_AND_DISK)
Partition tuning may be necessary (spark.default.parallelism)
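
The caching and partitioning points above can be sketched as a small standalone program. This is only an illustration under stated assumptions: the app name, the local master, the parallelism of 8, and the synthetic log lines are invented here and are not Heuritech's settings. In spark-shell a context named sc already exists, so the two lines that build one would be dropped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumed values: app name, local master, parallelism of 8, synthetic log lines.
val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "8") // default partition count for parallelize and shuffles
val sc = new SparkContext(conf)

val logs = sc.parallelize((1 to 1000).map(i => s"user$i /page/${i % 10}"))
val initialPartitions = logs.partitions.length

// MEMORY_AND_DISK keeps partitions in memory and spills those that do not fit to disk.
logs.persist(StorageLevel.MEMORY_AND_DISK)
val total = logs.count()                               // first action fills the cache
val page3 = logs.filter(_.endsWith("/page/3")).count() // later actions reuse the cached partitions

// Partitioning can also be changed per RDD instead of globally.
val repartitionedCount = logs.repartition(16).partitions.length

logs.unpersist() // free the cached blocks once the RDD is no longer needed
sc.stop()
```

Persisting only pays off when an RDD is read by more than one action, which is why the sketch runs two counts against the same cached data.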

AGGREGATION
val words = sc.parallelize(List("a", "b", "a", "c"))
words.groupBy(x => x).mapValues(_.size).collect
// Array((a,2),(b,1),(c,1))
words.map(x => (x, 1)).reduceByKey(_ + _).collect
// Array((a,2),(b,1),(c,1))
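
Both expressions above return the same counts, but they move different amounts of data across the network. The sketch below is a plain-Scala simulation, not Spark code: two hand-made "partitions" (an assumption for illustration, with different data from the slide) show that grouping ships every record through the shuffle, whereas reducing by key first combines values inside each partition and ships only one record per key per partition.

```scala
// Two hypothetical partitions of a word list.
val partitions: Seq[Seq[String]] = Seq(Seq("a", "a", "a", "b"), Seq("a", "c", "c"))

// groupBy-style: every (word, 1) pair crosses the shuffle unchanged.
val shuffledByGroup: Seq[(String, Int)] = partitions.flatten.map(w => (w, 1))

// reduceByKey-style: counts are pre-aggregated inside each partition (map-side combine).
val shuffledByReduce: Seq[(String, Int)] =
  partitions.flatMap { part =>
    part.groupBy(identity).map { case (w, ws) => (w, ws.size) }.toSeq
  }

// The final merge step is the same for both strategies.
def merge(records: Seq[(String, Int)]): Map[String, Int] =
  records.groupBy(_._1).map { case (w, rs) => (w, rs.map(_._2).sum) }

val viaGroup = merge(shuffledByGroup)
val viaReduce = merge(shuffledByReduce)

println(s"records shuffled: group=${shuffledByGroup.size}, reduce=${shuffledByReduce.size}")
```

The gap widens as keys repeat more often within a partition, which is typical of web logs where a few pages or users dominate.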

AGGREGATION
reduceByKey
see also combineByKey & foldByKey
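
As a rough illustration of the two related operators, the sketch below computes per-key totals with foldByKey and per-key averages with combineByKey. The (user, pageViews) pairs, the app name, and the local master are made up for the example. It assumes spark-core is on the classpath; in spark-shell the existing sc would be used instead of creating a new context.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed only on Spark versions before 1.3

// Assumed setup: local master and made-up (user, pageViews) pairs.
val sc = new SparkContext(new SparkConf().setAppName("by-key-sketch").setMaster("local[*]"))
val views = sc.parallelize(Seq(("u1", 3), ("u2", 5), ("u1", 1), ("u2", 7), ("u1", 2)))

// foldByKey: like reduceByKey, but each key starts from a zero value.
val totals = views.foldByKey(0)(_ + _).collectAsMap()

// combineByKey: the accumulator type (sum, count) differs from the value type Int.
val sumCounts = views.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge accumulators across partitions
)
val averages = sumCounts.mapValues { case (sum, n) => sum.toDouble / n }.collectAsMap()

sc.stop()
```

combineByKey is the more general of the two: it is the natural choice when the result per key is not of the same type as the input values, as with the (sum, count) pair here.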

USEFUL LINKS
Databricks knowledge base: github.com/databricks/spark-knowledgebase
Spark users mailing list: apache-spark-user-list.1001560.n3.nabble.com
Parsing Apache logs with Spark (Scala): alvinalexander.com/scala/analyzing-apache-access-logs-files-spark-scala


8. LAMBDA ARCHITECTURE

9. IMPLEMENTATION


14. AGGREGATION groupBy


17. THANK YOU! contact@heuritech.com
