State of the Elephant: Hadoop 2009 Milestones and Goals for 2010

•

7 recomendaciones•4,669 vistas

The document discusses the state of Hadoop in 2009 and goals for 2010 and beyond. It focuses on improving data formats by proposing the use of Apache Avro, which provides an expressive, compact, and dynamic data serialization system with built-in capabilities for data sharing, remote procedure calls, and schema evolution. The document outlines how Avro could address current issues with data formats in Hadoop and its potential inclusion in future Hadoop releases.

Tecnología

State of the Elephant

Doug Cutting
Cloudera

2009 Hadoop Milestones
● A 2009 Sort Champion (O'Malley)
● won 100TB “Gray” sort @ .578TB/minute
● won “Minute” sort with 500GB
● Split Core Project in Three
● Common, HDFS & MapReduce
● Released 0.18.3, 0.19.[0-2], 0.20.[0-1]
● Many meetups, conferences, etc.
● Yada, yada, yada.

Goals for 2010 and beyond
● IMHO, YMMV, IANAM, etc.
● Concrete
● 1.0 release
– compatible APIs & RPCs for > 1 year
– Kerberos-based authentication
● Abstract
● faster, more reliable, available
● easier sharing
– of data & hardware resources
● spreadsheet-like interfaces
– provide non-programmers
– with powerful, interactive tools

Abstract Requirements
● security
● facilitate sharing of resources
● stable cross-language APIs
● facilitate diverse tools & apps
● expressive, inter-operable data
● facilitates sharing of datasets
● facilitates dynamic analyses

Data Formats
● today in Hadoop:
● text
– pro: inter-operable
– con: not expressive, inefficient
● Java Writable
– pro: expressive, efficient
– con: platform-specific, fragile

Protocol Buffers & Thrift
● expressive
● efficient (small & fast)
● but not very dynamic
● cannot browse arbitrary data
● no DESCRIBE or SHOW
● viewing a new dataset
– requires code generation & load
● writing a new dataset
– requires generating schema text
– plus code generation & load

Avro Data
● as expressive
● smaller and faster
● dynamic
● schema stored with data
– but factored out of instances
● APIs permit reading & creating
– arbitrary datatypes
– without generating & loading code

Avro Data
● includes a file format
● includes a textual encoding
● handles versioning
● if schema changes
● can still process data
● hope Hadoop apps will
● upgrade from text; &
● and standardize on Avro for data

Avro RPC
● leverage versioning support
● to permit different versions of services to
interoperate
● for Hadoop services, will
● provide cross-language access
● let apps talk to clusters running different versions

Avro Status
● Java & Python APIs
● C & C++ APIs making rapid progress
● 1.1 release
● added JSON data and comparators
● 1.2 release
● added HTTP & UDP-based RPC
● included in Hadoop 0.21
● as format for job history
● in sequence files

Avro Near Future
● full mapreduce support for Avro data
● enables fast comparators for non-Java apps
● Avro RPC used in Hadoop 0.22 (1.0)?
● provides compatibility; &
● native access from non-Java

Más contenido relacionado

La actualidad más candente

Apache Flink @ NYC Flink MeetupStephan Ewen

Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto

MongodB InternalsNorberto Leite

The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation

SeaweedFS introductionchrislusf

Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph Community

Building Better Data Pipelines using Apache AirflowSid Anand

Apache Flink Training: DataStream API Part 1 BasicFlink Forward

MongoDB InternalsSiraj Memon

Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )SANG WON PARK

Apache Calcite (a tutorial given at BOSS '21)Julian Hyde

Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward

Airflow - a data flow engineWalter Liu

Data Science Across Data Sources with Apache ArrowDatabricks

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama

Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks

Redshift VS BigQueryKostas Pardalis

SSD Deployment Strategies for MySQLYoshinori Matsunobu

Deep Dive: Memory Management in Apache SparkDatabricks

elasticsearch_적용 및 활용_정리Junyi Song

La actualidad más candente (20)

Apache Flink @ NYC Flink Meetup

Aggregated queries with Druid on terrabytes and petabytes of data

MongodB Internals

The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...

SeaweedFS introduction

Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim

Building Better Data Pipelines using Apache Airflow

Apache Flink Training: DataStream API Part 1 Basic

MongoDB Internals

Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )

Apache Calcite (a tutorial given at BOSS '21)

Demystifying flink memory allocation and tuning - Roshan Naik, Uber

Airflow - a data flow engine

Data Science Across Data Sources with Apache Arrow

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

Scaling your Data Pipelines with Apache Spark on Kubernetes

Redshift VS BigQuery

SSD Deployment Strategies for MySQL

Deep Dive: Memory Management in Apache Spark

elasticsearch_적용 및 활용_정리

Similar a State of the Elephant: Hadoop 2009 Milestones and Goals for 2010

Hw09 Next Steps For HadoopCloudera, Inc.

Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...Cloudera, Inc.

Introduction to Apache Sparkdatamantra

Hw09 Security And Api CompatibilityCloudera, Inc.

Plugging the Holes: Security and Compatability in HadoopOwen O'Malley

Blackray @ SAPO CodeBits 2009fschupp

Savanna - Elastic Hadoop on OpenStackSergey Lukjanov

Cloud Native API Design and ManagementAllBits BVBA (freelancer)

Present and future of unified, portable, and efficient data processing with A...DataWorks Summit

Change data captureRon Barabash

BlackRay - The open Source Data Enginefschupp

Introduction to Apache BeamKnoldus Inc.

GraphQL is actually restJakub Riedl

[Hadoop Meetup] Apache Hadoop 3 community update - Rohith SharmaNewton Alex

Apache Tez -- A modern processing enginebigdatagurus_meetup

Portable batch and streaming pipelines with Apache Beam (Big Data Application...Malo Denielou

Intro to Apache HadoopSufi Nawaz

Hadoop Introductionsheetal sharma

Apache PIGAnuja Gunale

Hadoop ppt1chariorienit

Similar a State of the Elephant: Hadoop 2009 Milestones and Goals for 2010 (20)

Hw09 Next Steps For Hadoop

Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...

Introduction to Apache Spark

Hw09 Security And Api Compatibility

Plugging the Holes: Security and Compatability in Hadoop

Blackray @ SAPO CodeBits 2009

Savanna - Elastic Hadoop on OpenStack

Cloud Native API Design and Management

Present and future of unified, portable, and efficient data processing with A...

Change data capture

BlackRay - The open Source Data Engine

Introduction to Apache Beam

GraphQL is actually rest

[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma

Apache Tez -- A modern processing engine

Portable batch and streaming pipelines with Apache Beam (Big Data Application...

Intro to Apache Hadoop

Hadoop Introduction

Apache PIG

Hadoop ppt1

Más de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.

Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.

2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.

Edc event vienna presentation 1 oct 2019Cloudera, Inc.

Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.

Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.

Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.

Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.

Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.

Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.

Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.

Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.

Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.

Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.

Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.

Extending Cloudera SDX beyond the PlatformCloudera, Inc.

Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.

Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.

Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.

Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.

Más de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx

Cloudera Data Impact Awards 2021 - Finalists

2020 Cloudera Data Impact Awards Finalists

Edc event vienna presentation 1 oct 2019

Machine Learning with Limited Labeled Data 4/3/19

Data Driven With the Cloudera Modern Data Warehouse 3.19.19

Introducing Cloudera DataFlow (CDF) 2.13.19

Introducing Cloudera Data Science Workbench for HDP 2.12.19

Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19

Leveraging the cloud for analytics and machine learning 1.29.19

Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19

Leveraging the Cloud for Big Data Analytics 12.11.18

Modern Data Warehouse Fundamentals Part 3

Modern Data Warehouse Fundamentals Part 2

Modern Data Warehouse Fundamentals Part 1

Extending Cloudera SDX beyond the Platform

Federated Learning: ML with Privacy on the Edge 11.15.18

Analyst Webinar: Doing a 180 on Customer 360

Build a modern platform for anti-money laundering 9.19.18

Introducing the data science sandbox as a service 8.30.18

Último

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Take control of your SAP testing with UiPath Test SuiteDianaGray10

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati

SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

"ML in Production",Oleksandr BaganFwdays

Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst

TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey

Story boards and shot lists for my a level piececharlottematthew16

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3

Gen AI in Business - Global Trends Report 2024.pdfAddepto

Anypoint Exchange: It’s Not Just a Repo!Manik S Magar

Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy

DevEX - reference for building teams, processes, and platformsSergiu Bodiu

Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge

From Family Reminiscence to Scholarly Archive .Alan Dix

Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

State of the Elephant: Hadoop 2009 Milestones and Goals for 2010

1. State of the Elephant Doug Cutting Cloudera

2. 2009 Hadoop Milestones ● A 2009 Sort Champion (O'Malley) ● won 100TB “Gray” sort @ .578TB/minute ● won “Minute” sort with 500GB ● Split Core Project in Three ● Common, HDFS & MapReduce ● Released 0.18.3, 0.19.[0-2], 0.20.[0-1] ● Many meetups, conferences, etc. ● Yada, yada, yada.

3. Goals for 2010 and beyond ● IMHO, YMMV, IANAM, etc. ● Concrete ● 1.0 release – compatible APIs & RPCs for > 1 year – Kerberos-based authentication ● Abstract ● faster, more reliable, available ● easier sharing – of data & hardware resources ● spreadsheet-like interfaces – provide non-programmers – with powerful, interactive tools

4. Abstract Requirements ● security ● facilitate sharing of resources ● stable cross-language APIs ● facilitate diverse tools & apps ● expressive, inter-operable data ● facilitates sharing of datasets ● facilitates dynamic analyses

5. Data Formats ● today in Hadoop: ● text – pro: inter-operable – con: not expressive, inefficient ● Java Writable – pro: expressive, efficient – con: platform-specific, fragile

6. Protocol Buffers & Thrift ● expressive ● efficient (small & fast) ● but not very dynamic ● cannot browse arbitrary data ● no DESCRIBE or SHOW ● viewing a new dataset – requires code generation & load ● writing a new dataset – requires generating schema text – plus code generation & load

7. Avro Data ● as expressive ● smaller and faster ● dynamic ● schema stored with data – but factored out of instances ● APIs permit reading & creating – arbitrary datatypes – without generating & loading code

8. Avro Data ● includes a file format ● includes a textual encoding ● handles versioning ● if schema changes ● can still process data ● hope Hadoop apps will ● upgrade from text; & ● and standardize on Avro for data

9. Avro RPC ● leverage versioning support ● to permit different versions of services to interoperate ● for Hadoop services, will ● provide cross-language access ● let apps talk to clusters running different versions

10. Avro Status ● Java & Python APIs ● C & C++ APIs making rapid progress ● 1.1 release ● added JSON data and comparators ● 1.2 release ● added HTTP & UDP-based RPC ● included in Hadoop 0.21 ● as format for job history ● in sequence files

11. Avro Near Future ● full mapreduce support for Avro data ● enables fast comparators for non-Java apps ● Avro RPC used in Hadoop 0.22 (1.0)? ● provides compatibility; & ● native access from non-Java

12. Thanks! hadoop.apache.org/avro

State of the Elephant: Hadoop 2009 Milestones and Goals for 2010

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a State of the Elephant: Hadoop 2009 Milestones and Goals for 2010

Similar a State of the Elephant: Hadoop 2009 Milestones and Goals for 2010 (20)

Más de Cloudera, Inc.

Más de Cloudera, Inc. (20)

Último

Último (20)

State of the Elephant: Hadoop 2009 Milestones and Goals for 2010