Session four of my series on many cores turns to data, both big and small. It looks at MapReduce, approaching it sideways from a classic computer science perspective.
1. If the Data Cannot Come to
the Algorithm...
many cores with java
session four
data locality
copyright 2013 Robert Burrell Donkin robertburrelldonkin.name
this work is licensed under a Creative Commons Attribution 3.0 Unported License
2. Pre-emptive multi-tasking operating
systems use involuntary context switching
to provide the illusion of parallel processes
even when the hardware supports only a
single thread of execution.
Take Away from Session One
3. Even on a single core,
there's no escaping parallelism.
Take Away from Session Two
4. Take Away from Session Three
Code executing on different cores uses copies held
in registers and caches, so shared memory is likely
to be incoherent unless the program plays by the
rules of the software platform.
5. Gustafson's Law
S(p) = p − a(p − 1)
● S(p) is the speedup for p processors
● a is the non-parallelizable fraction
"in practice, the problem size scales with the number of
processors" John L. Gustafson
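A minimal sketch of Gustafson's scaled speedup in Java; the serial fraction and processor count here are illustrative values, not from the slides:

```java
// Sketch: scaled speedup under Gustafson's Law, S(p) = p - a(p - 1).
// The fraction a = 0.05 and p = 32 below are illustrative examples.
public class Gustafson {

    // Scaled speedup for p processors with non-parallelizable fraction a.
    static double speedup(int p, double a) {
        return p - a * (p - 1);
    }

    public static void main(String[] args) {
        // Even with a 5% serial fraction, 32 processors still yield ~30.45x.
        System.out.printf("%.2f%n", speedup(32, 0.05));
    }
}
```

Note how mild the penalty is compared with Amdahl's fixed-size view: because the problem grows with the machine, the serial fraction only costs a(p − 1).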
6. ● Think about Gustafson's Law...
● The quantity of data processed...
● ...scales linearly as processors are added.
● Throwing processors at the problem
works...
● ...at least sometimes.
Scales and Scaling
7. Divide and Conquer
● Back to the future
● Partition the data...
○ ...apply the same algorithm to each part and then
○ ...collate the answers.
● Natural to parallelise
● No contended shared memory
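The partition/apply/collate steps above map directly onto Java's fork/join framework (Java 7+). A sketch summing an array; the threshold and data are illustrative:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: divide and conquer with fork/join.
// Partition the data, apply the same algorithm to each part, collate.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // illustrative cut-off
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {          // small enough: solve directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;           // partition the data...
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                           // ...conquer each part in parallel...
        return right.compute() + left.join();  // ...and collate the answers
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum);               // 0 + 1 + ... + 9999
    }
}
```

Each subtask works only on its own index range, so there is no contended shared memory to synchronise.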
8. Data Locality
● When the algorithm is small
○ it's more efficient
■ to bring the algorithm to the data
■ than the data to the algorithm
● Whether the data is in
○ caches on cores in a many core computer, or in
○ disc storage in a distributed data store
9. Map and Reduce
● Partition the data
● The map algorithm
○ works in parallel
○ on local data
○ independently
● The reduce algorithm
○ collates output from map algorithms
● More complex systems built from these blocks
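Java 8 parallel streams express these two building blocks directly. A minimal sketch, with illustrative data:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: map and reduce as building blocks, via parallel streams.
// Each map runs independently per element; reduce collates the results.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("data", "locality", "matters");
        int totalLength = words.parallelStream()
                .mapToInt(String::length)   // map: independent, parallel
                .sum();                     // reduce: collate partial sums
        System.out.println(totalLength);
    }
}
```

Because each map call touches only its own element, the runtime is free to partition the work across cores without locks.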
10. Map-Reduce
As a Query Language
● NoSQL
● A popular alternative to SQL
○ for distributed data stores
● Why...?
○ Easy to
■ read and write
■ parallelize
○ Rich and full programming model
11. Map-Reduce
Crunching Big Data
● Commodity hardware
● Scales up to terabytes and petabytes
○ smoothly, by adding new nodes
● Map-Reduce platforms typically provide
○ fault tolerance, e.g. retry
○ orchestration
○ redundant data storage
● Statistical resilience
12. Take Away
When you want to be able to process big data
tomorrow by adding cores or computers, adopt
an appropriate architecture today.