BDT202 The Hadoop Ecosystem - AWS re:Invent 2012
1. © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
2. EMR is Hadoop in the Cloud
Hadoop is an open-source framework for processing huge amounts of data in parallel on a cluster of machines
5. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
Put the data into Amazon Simple Storage Service (S3)
Launch the cluster using the EMR console, CLI, SDK, or APIs
Get the output from S3
You can also store everything in HDFS
6. [Diagram: Amazon S3 and the EMR cluster]
You can easily add and remove nodes
7. [Diagram: Amazon S3 and the EMR cluster]
When processing is complete, you can terminate the cluster (and stop paying)
8. options
10. Hive
• Data warehouse for Hadoop
• SQL-like query language (HiveQL)
• Initially developed at Facebook
Pig
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
11. HBase
• Column-oriented database
• Runs on top of HDFS
• Ideal for sparse data
• Random, read/write access
• Ideal for very large tables (billions of rows, millions of columns)
Mahout
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
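The HBase bullets on this slide (sparse data, random read/write access, very wide tables) can be pictured with a toy in-memory model. This is a minimal sketch only: the `SparseTable` class, its methods, and the row and column names are hypothetical and are not the HBase API. It shows the one idea that a sparse table stores only the cells that exist, so a row with two columns costs nothing for the millions of columns it does not use.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse table: row key -> {column name -> value}.

    Hypothetical illustration only; this is not how HBase is implemented
    and none of these names come from the HBase client API.
    """

    def __init__(self):
        # Only cells that were actually written take up space.
        self._rows = defaultdict(dict)

    def put(self, row, column, value):
        """Random write: set a single cell."""
        self._rows[row][column] = value

    def get(self, row, column, default=None):
        """Random read: fetch a single cell, or `default` if it was never set."""
        return self._rows.get(row, {}).get(column, default)

    def cell_count(self):
        """Number of cells that are physically stored."""
        return sum(len(columns) for columns in self._rows.values())


table = SparseTable()
table.put("user1", "info:name", "Srivas")
table.put("user2", "info:name", "Raina")
table.put("user2", "info:zip", "94305")  # only user2 has this column

print(table.get("user2", "info:zip"))
print(table.get("user1", "info:zip"))  # never written, so the default is returned
print(table.cell_count())
```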
12. Ganglia
• Scalable distributed monitoring
• View performance of the cluster and individual nodes
• Open source
R
• Language and software environment for statistical computing and graphics
• Open source
13. Hadoop
elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5
14. Hive
./elastic-mapreduce --create --alive \
  --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 \
  --instance-type m1.large \
  --hive-interactive \
  --hive-versions 0.7.1
15. HBase
elastic-mapreduce --create --hbase \
  --name "$USER HBase Cluster" \
  --num-instances 2 \
  --instance-type cc2.8xlarge
16. bootstrap action
elastic-mapreduce --create \
  --bootstrap-action s3://s3bucket/installganglia
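The cluster-creation commands on slides 13 to 16 all follow one pattern: `elastic-mapreduce --create` plus a few sizing flags and some application-specific flags. As a rough sketch, that pattern can be captured in a small helper that assembles the argument list. The helper name `build_create_command` and its parameters are assumptions made for illustration; only the flags and values are taken from the slides, and the sketch only prints the commands rather than running them.

```python
import shlex


def build_create_command(name=None, instance_type=None, num_instances=None,
                         alive=False, extra_flags=()):
    """Assemble the argument list for an `elastic-mapreduce --create` call.

    Hypothetical helper: it uses only flags that appear on the slides and
    does not validate them against the CLI.
    """
    argv = ["elastic-mapreduce", "--create"]
    if alive:
        argv.append("--alive")  # flag copied from slides 13 and 14
    if name is not None:
        argv += ["--name", name]
    if num_instances is not None:
        argv += ["--num-instances", str(num_instances)]
    if instance_type is not None:
        argv += ["--instance-type", instance_type]
    argv += list(extra_flags)  # application-specific flags, passed through as-is
    return argv


# Slide 13: plain Hadoop cluster.
hadoop_cmd = build_create_command(alive=True, instance_type="m1.xlarge",
                                  num_instances=5)

# Slide 14: interactive Hive cluster.
hive_cmd = build_create_command(
    name="Test Hive", alive=True, instance_type="m1.large", num_instances=5,
    extra_flags=["--hadoop-version", "0.20",
                 "--hive-interactive", "--hive-versions", "0.7.1"],
)

# shlex.join quotes arguments that contain spaces, such as the cluster name.
print(shlex.join(hadoop_cmd))
print(shlex.join(hive_cmd))
```

Keeping the command as a list of arguments, rather than one string, avoids quoting mistakes if it is later handed to something like `subprocess.run`.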
18. Hive
19. Hive
20. Hive
21. Hive
22. Hive
25. [Diagram labels]
Data Masking
Data Exchange
Data Quality
MDM
Data Transformation
Enterprise Data Integration
Connectivity
Identity Resolution
26. HParser UI
- any format
- any complexity
- easily
- in Map Reduce
[Diagram: real-world data source -> Hadoop (mappers M, M, M and reducer R) -> results]
29. End-to-End Flow
Construction (Windows): HParser UI -> transform definition
Execution (EMR): binary records (input) in -> Map Reduce -> out text records (output)
30. Real-World Data
HParser
Flat files
Logs
Records
XML, JSON
Industry standards
Ex. FIX, SWIFT, X12, ASN.1
Documents
Ex. PDF, Excel
31. ASN.1 on EMR Cluster
[Chart: minutes (y-axis, 0 to 60) versus number of nodes (x-axis: 4, 16, 24, 32), with one series each for 10 GB and 50 GB]
Notes:
- These are mapper times only. Add 60 sec of lead time (start) and 60 sec of tail time (reducer) to each run
- Amazon XL (Extra Large) instances – 64-bit, 15 GB RAM, 1.5 TB storage
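The first note means the charted values understate the full run time by a fixed two minutes. A minimal sketch of that adjustment follows; the function name `total_run_minutes` and the 30-minute input are hypothetical and are not readings from the chart.

```python
LEAD_SECONDS = 60  # start-up time per run, from the slide notes
TAIL_SECONDS = 60  # reducer time per run, from the slide notes


def total_run_minutes(mapper_minutes):
    """Convert a charted mapper time into an end-to-end run time, in minutes."""
    return mapper_minutes + (LEAD_SECONDS + TAIL_SECONDS) / 60


# Hypothetical mapper time, chosen only to show the arithmetic.
print(total_run_minutes(30))
```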
32. ASN.1 on EMR Cluster – 72 Nodes
[Chart: minutes (y-axis, 0 to 60) versus file size (x-axis: 10 GB, 100 GB, 400 GB, 700 GB, 1 TB)]
Notes:
- These are mapper times only. Add 60 sec of lead time (start) and 60 sec of tail time (reducer) to each run
- Amazon XL (Extra Large) instances – 64-bit, 15 GB RAM, 1.5 TB storage
37.                   Batch processing   Interactive analysis      Stream processing
Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
Data volume           TBs to PBs         GBs to PBs                Continuous stream
Programming model     MapReduce          Queries                   DAG
Users                 Developers         Analysts and developers   Developers
Google project        MapReduce          Dremel                    -
Open source project   Hadoop MapReduce   -                         Storm and S4
Introducing Apache Drill…
39. Avro IDL
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
JSON
{
  "name": "Srivas",
  "gender": "Male",
  "followers": 100
}
{
  "name": "Raina",
  "gender": "Female",
  "followers": 200,
  "zip": "94305"
}
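To make the contrast between the two halves of this slide concrete, here is a small sketch that checks the two JSON records against the `User` record and `Gender` enum from the Avro IDL. It is a hand-written strict check, not the Avro library, and the function name `check_user` is hypothetical. Under this strict reading, neither JSON record matches the schema exactly: the gender values are not spelled like the enum symbols, and the second record carries a `zip` field the schema does not declare.

```python
import json

# Transcribed from the Avro IDL on the slide.
GENDER_SYMBOLS = {"MALE", "FEMALE"}
USER_FIELDS = {"name": str, "gender": str, "followers": int}


def check_user(record):
    """Return a list of ways `record` deviates from the User schema.

    Hypothetical strict check for illustration; real Avro tooling has its
    own rules for decoding and schema resolution.
    """
    problems = []
    for field, expected_type in USER_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    gender = record.get("gender")
    if isinstance(gender, str) and gender not in GENDER_SYMBOLS:
        problems.append(f"gender {gender!r} is not an enum symbol")
    for field in record:
        if field not in USER_FIELDS:
            problems.append(f"field not in schema: {field}")
    return problems


srivas = json.loads('{"name": "Srivas", "gender": "Male", "followers": 100}')
raina = json.loads(
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'
)

for user in (srivas, raina):
    print(user["name"], check_user(user))
```

A self-describing format such as JSON accepts both records as they are; a schema-first format needs either the data or the schema to change before the records fit.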
41. Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
42. No RegionServers            Instant Recovery   High Throughput
No Manual Splits            No Compactions     No Garbage Collection
No Manual Merges            Snapshots          Consistent Low Latency
No Manual Administration    Mirroring          No Practical Scale Limits
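43. [Diagram: three storage stacks, each listed top to bottom]
Other Distributions: HBase (JVM), DFS (JVM), ext3, Disks
(unlabeled stack): HBase (JVM), MapR, Disks
(unlabeled stack): Unified, Disks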
44. 50B real-time auctions
#1 in audience reach
“M7 is really taking Hadoop to the next level. It allows us to do new things with our data.” - Jan Gelin, VP of Technical Operations
2M+ subscribers
10B+ records
“I’m really excited about M7 because it will address both the performance and the day-to-day challenges of HBase.” - Melinda Graham, Sr. Hadoop Engineer
Global leader in email intelligence
“M7 is a big win for us. It makes HBase really easy to use. It really helps us make better use of the data we have. It allows us to look at use cases we haven't had the opportunity to in the past.” - Andy Sautins, CTO
46. aws.amazon.com/elasticmapreduce
• Online Training
– Videos
– Articles/tutorials
• Documentation
– Getting Started Guide
– Developer Guide
– API Reference
• FAQs
• Paid Training
– 3-day Developer Course taught by Think Big Analytics
• On-Site Consulting
– EMR Bootcamp (for companies processing 1+ TB per day)
47. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.