Multilevel aggregation for Hadoop/MapReduce

The document proposes a multi-level aggregation approach for Hadoop MapReduce to reduce shuffle costs by combining map outputs at the node and rack level. A prototype showed a job was 1.7 times faster and restricted shuffle costs to 50% by having mappers call a combiner before outputs are shuffled. Future work includes adding fault tolerance and supporting frameworks like Pig and Hive. Feedback is welcomed on the approach.

Multi-level aggregation for
Hadoop MapReduce

Tsuyoshi Ozawa
NTT

© 2012 NTT Software Innovation Center

Overview
• Background
• Shuffle cost
• Approach
• Multi-level aggregation
• Progress
• Discussion on MAPREDUCE-4502
• Design note is available on this JIRA
• Prototyped to launch combiner per node

MapReduce Architecture
• MapReduce
• Programming model for large-scale processing
• 3 processing phases: Map, Shuffle, Reduce

[Figure: Map tasks in the Map phase feed Reduce tasks in the Reduce phase through the Shuffle phase.]

Shuffle Phase
• What happens?
• Reducers retrieve the outputs of Mappers
• Mapper-side read -> Reducer-side write
• Problem
• Can be a bottleneck in jobs
• Causes disk I/O
• Causes network I/O
• Current solution for aggregation processing
• Combiner
• Reduces I/O by mapper-side aggregation
• Apps: WordCount, N-gram, co-occurrence frequency

WordCount example: data is aggregated and gets smaller
(apple, 1,1,1,1) => (apple, 4)
(banana, 1,1) => (banana, 2)
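
The WordCount example above can be sketched in plain Java. This is a minimal illustration, not Hadoop code: the class and method names (`CombinerSketch`, `combine`, `combineMapOutput`) are hypothetical, and the grouped input stands in for the output of a single map task.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Sum the values of one key, as a WordCount combiner would.
    static int combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Apply the combiner to every key in one map task's grouped output.
    static Map<String, Integer> combineMapOutput(Map<String, List<Integer>> grouped) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            combined.put(e.getKey(), combine(e.getValue()));
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("apple", List.of(1, 1, 1, 1));
        grouped.put("banana", List.of(1, 1));
        System.out.println(combineMapOutput(grouped)); // prints {apple=4, banana=2}
    }
}
```

Six (word, 1) records shrink to two records before anything is shuffled, which is where the I/O saving comes from.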

Limitation of combiners (1)
• Scope is limited to a single MapTask
1. Many-core environment
• Xeon E5 series: 16 threads/CPU => 16 outputs are generated
• These files must be transferred through the network

[Figure: aggregation per map. Each Map writes IFiles, a Combiner per Map merges them into one IFile, and every combined IFile is sent to Reduce. The total is still large.]
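
To make the scope limitation concrete, here is a minimal sketch that counts the records leaving one node. It assumes 16 map tasks on the node whose combined outputs happen to contain the same two words; the class and method names are hypothetical and nothing here is Hadoop API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerTaskScopeSketch {
    // Records leaving the node when each task's combined output is shipped as-is.
    static int recordsWithoutNodeMerge(List<Map<String, Integer>> perTaskOutputs) {
        int total = 0;
        for (Map<String, Integer> out : perTaskOutputs) {
            total += out.size();
        }
        return total;
    }

    // Records leaving the node if the per-task outputs were merged by key first.
    static int recordsWithNodeMerge(List<Map<String, Integer>> perTaskOutputs) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> out : perTaskOutputs) {
            for (Map.Entry<String, Integer> e : out.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged.size();
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> outputs = new ArrayList<>();
        for (int task = 0; task < 16; task++) {
            outputs.add(Map.of("apple", 4, "banana", 2)); // one combined output per map task
        }
        System.out.println(recordsWithoutNodeMerge(outputs)); // prints 32
        System.out.println(recordsWithNodeMerge(outputs));    // prints 2
    }
}
```

The per-task combiner cannot see that all 16 outputs share the same keys, so the record count grows with the number of tasks per node.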

Limitation of combiners (2)
• Scope is limited to a single MapTask
1. Many-core environment
• Xeon E5 series: 16 threads/CPU => 16 outputs are generated
2. Processing middle-scale data (TB scale)
• Processing larger data needs more network bandwidth and disk I/O

[Figure: aggregation per map. Each Map's IFiles pass through a Combiner, and all raw IFiles must be sent over racks to the Reducer. Link labels: 10GbE, 1GbE.]

Multi-level aggregation
• Aggregating the results of maps per node/rack

[Figure: aggregation per node and per rack. Combined IFiles are aggregated per node and per rack, so a smaller IFile is sent over racks to the Reducer. Link labels: 10GbE, 1GbE.]
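
The idea can be sketched as two extra merge steps before the reducer. This is an illustrative simulation under assumed inputs (2 racks, 2 nodes per rack, 2 map tasks per node), not the prototype's implementation; `MultiLevelAggregationSketch`, `merge`, and `aggregateRack` are hypothetical names. It relies on the combine function being associative and commutative, which Hadoop combiners already need to be.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MultiLevelAggregationSketch {
    // Merge several partial results by key; the same function serves every level.
    static Map<String, Integer> merge(List<Map<String, Integer>> inputs) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> in : inputs) {
            for (Map.Entry<String, Integer> e : in.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    // Node-level aggregation for every node in a rack, then rack-level aggregation.
    static Map<String, Integer> aggregateRack(List<List<Map<String, Integer>>> rack) {
        List<Map<String, Integer>> nodeLevel = new ArrayList<>();
        for (List<Map<String, Integer>> node : rack) {
            nodeLevel.add(merge(node)); // one output per node instead of one per map task
        }
        return merge(nodeLevel);        // one output per rack crosses the rack boundary
    }

    public static void main(String[] args) {
        Map<String, Integer> task = Map.of("apple", 2, "banana", 1);
        List<Map<String, Integer>> node = List.of(task, task);       // 2 map tasks per node
        List<List<Map<String, Integer>>> rack = List.of(node, node); // 2 nodes per rack
        Map<String, Integer> rackA = aggregateRack(rack);
        Map<String, Integer> rackB = aggregateRack(rack);
        System.out.println(rackA);                        // prints {apple=8, banana=4}
        System.out.println(merge(List.of(rackA, rackB))); // prints {apple=16, banana=8}
    }
}
```

In this toy input, eight per-task outputs collapse into two per-rack outputs, and the reducer still computes the same totals.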

Design Concept
• Minimize overhead
• Adding a new task type causes a lot of overhead
• Modified Mapper to aggregate at the end stage
• Keep the current MapReduce design
• Fault tolerance against a few machine failures
• Each aggregation must run in a container on YARN
• From the Hadoopers' point of view
• Easy to switch the feature ON/OFF
(ideally, by adding only one line)
public static void main(String[] argv) {
    …
    conf.setCombinerClass(Reducer.class);
    // Proposed switches for the new feature
    conf.enableNodeLevelAggregation();
    conf.enableRackLevelAggregation();
    …
}

Progress
• Prototype
• Modified Mapper to call the combiner function at the last stage

• Benchmark
• Environment
• 40 nodes
• Core 2 Duo 2.4GHz x2
• Memory 4GB
• 1GbE
• Configuration
• Reducer: 1
• Input
• Texts generated by RandomTextWriter
• Benchmark program
• In-mapper combined Word Count
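
For readers unfamiliar with the benchmark program, in-mapper combining means the mapper accumulates counts in memory and emits them once at the end of the task, instead of emitting (word, 1) per token. The following is a plain-Java sketch of that pattern, not the benchmark's source; the class name and the `map`/`cleanup` methods only mirror the shape of a Hadoop mapper and are hypothetical here.

```java
import java.util.Map;
import java.util.TreeMap;

public class InMapperCombiningSketch {
    // Per-task state: running counts kept in memory across map() calls.
    private final Map<String, Integer> counts = new TreeMap<>();

    // Called once per input line: accumulate instead of emitting (word, 1).
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once at the end of the task: hand back the aggregated pairs.
    Map<String, Integer> cleanup() {
        return counts;
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        mapper.map("apple banana apple");
        mapper.map("apple");
        System.out.println(mapper.cleanup()); // prints {apple=3, banana=1}
    }
}
```

Even with this pattern, aggregation stops at the task boundary, so the benchmark still leaves room for node-level and rack-level combining.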

Prototype Benchmark: Job Time

[Figure: job time with the feature ON vs. OFF]

• About 1.7 times faster (roughly 2x)
• Shuffle cost is reduced to 50% at most

TODOs
• Node-level aggregation with FT (fault tolerance)
• Rack-level aggregation with FT
• The design note is available at MAPREDUCE-4502
• Need to change the umbilical protocol to support FT

• Support for high-level languages
• Pig/Hive support: when issuing a "GROUP BY" statement
• In other cases, multi-level aggregation may be switched off

Summary
• Multi-level aggregation: combining the results of maps per node/rack
• Node/rack-level combiner
• Needs an extended umbilical protocol for FT
• Benchmark with the prototype version
• 1.7 times faster
• Can restrict the shuffle cost to 50% at most
• TODOs
• Fault tolerance
• Pig/Hive support
• Special thanks to Chris, Karthik, Siddarsh, Robert, and Bikas for the discussions

• Any feedback is welcome!
