Cloud Computing: Hadoop

Data Processing in the Cloud with Hadoop, presented at the Data Services World conference.

Published in: Technology, Education
  • thanks
    it helps to understand what is hadoop in cloud domain
  • Hope fully interesting topic
  • it clarifies many doubts... thank u
  • Hi dude,
    Thank u, It helped a lot.

  1. Data Processing in the Cloud, Parand Tony Darugar [email_address]
  2. What is Hadoop
     • Flexible infrastructure for large scale computation and data processing on a network of commodity hardware.
  3. Why?
     • A common infrastructure pattern extracted from building distributed systems
     • Scale
     • Incremental growth
     • Cost
     • Flexibility
  4. Built-in Resilience to Failure
     • When dealing with large numbers of commodity servers, failure is a fact of life
     • Assume failure, build protections and recovery into your architecture
         • Data level redundancy
         • Job/Task level monitoring and automated restart and re-allocation
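The "automated restart and re-allocation" idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not Hadoop's scheduler: `run_with_retries`, `TaskFailed`, `flaky_task`, and the node names are all hypothetical.

```python
class TaskFailed(Exception):
    """Raised by a simulated worker when a task attempt fails."""


def run_with_retries(task, nodes, max_attempts=4):
    """Run `task` on successive nodes until one attempt succeeds.

    Returns (result, attempts_used). Raises TaskFailed if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Re-allocate: each new attempt goes to the next node in the list.
        node = nodes[(attempt - 1) % len(nodes)]
        try:
            return task(node), attempt
        except TaskFailed as err:
            last_error = err
    raise TaskFailed(f"gave up after {max_attempts} attempts") from last_error


def flaky_task(node):
    # Hypothetical workload: "node-a" is broken, every other node succeeds.
    if node == "node-a":
        raise TaskFailed(f"{node} crashed")
    return f"done on {node}"
```

Calling `run_with_retries(flaky_task, ["node-a", "node-b", "node-c"])` fails once on `node-a` and then succeeds on `node-b`, so the caller never sees the first failure.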
  5. Current State of Hadoop Project
     • Top level Apache Foundation project
     • In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
     • Large, active user base, mailing lists, user groups
     • Very active development, strong development team
  6. Widely Adopted
     • A valuable and reusable skill set
         • Taught at major universities
         • Easier to hire for
         • Easier to train on
         • Portable across projects, groups
  7. Plethora of Related Projects
     • Pig
     • Hive
     • HBase
     • Cascading
     • Hadoop on EC2
     • JAQL, X-Trace, Happy, Mahout
  8. What is Hadoop
     • The Linux of distributed processing.
  9. How Does Hadoop Work?
  10. Hadoop File System
     • A distributed file system for large data
         • Your data in triplicate
         • Built-in redundancy, resiliency to large scale failures
         • Intelligent distribution, striping across racks
         • Accommodates very large data sizes
         • On commodity hardware
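To make "your data in triplicate" and "striping across racks" concrete, here is a toy Python sketch. It is not the real HDFS placement code: the 64 MB block size, the function names, and the placement rule (one replica on one rack, the rest on another) are assumptions chosen for illustration, and node names are assumed to be unique across racks.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(block_id, nodes_by_rack, replicas=3):
    """Pick up to `replicas` distinct nodes for one block.

    Toy policy: the first replica goes to a node on one rack, the remaining
    replicas go to nodes on the other racks first, so losing a whole rack
    does not lose the block.
    """
    racks = sorted(nodes_by_rack)
    first_rack = racks[block_id % len(racks)]
    first_nodes = nodes_by_rack[first_rack]
    chosen = [first_nodes[block_id % len(first_nodes)]]
    # Visit the other racks first, then fall back to the first rack.
    other_racks = [r for r in racks if r != first_rack]
    for rack in other_racks + [first_rack]:
        for node in nodes_by_rack[rack]:
            if len(chosen) < replicas and node not in chosen:
                chosen.append(node)
    return chosen
```

With `nodes_by_rack = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}`, a 150 MB file becomes three blocks, and every block ends up with three replicas spread over both racks.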
  11. Programming Model: Map/Reduce
     • Very simple programming model:
         • Map(anything) -> key, value
         • Sort, partition on key
         • Reduce(key, value) -> key, value
     • No explicit parallel processing or message passing semantics to program against
     • Programmable in Java or any other language (streaming)
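The three steps of the model (map, sort on key, reduce) can be simulated in plain Python without a cluster. The sketch below is a single-process stand-in, not Hadoop's API: `run_job`, `map_words`, and `reduce_counts` are hypothetical names, and partitioning across several reducers is left out, so everything goes to one "reducer".

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: sum all the values that share a key."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Run map, sort on key, then reduce, all inside one process."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # the "sort on key" step
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output
```

For example, `run_job(["the cat", "the dog"], map_words, reduce_counts)` returns `[("cat", 1), ("dog", 1), ("the", 2)]`. The mapper and reducer never mention threads, machines, or messages, which is the point of the model.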
  12. Processing Model
     • Create or allocate a cluster
     • Put data onto the file system:
         • Data is split into blocks, stored in triplicate across your cluster
     • Run your job:
         • Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
             • Move computation to data, not data to computation
  13. Processing Model (continued)
     • Run your job (continued):
         • Monitor Map workers, automatically restarting failed or slow tasks
         • Gather output of Map, sort and partition on key
         • Run Reduce tasks
             • Monitor workers, automatically restarting failed or slow tasks
     • Results of your job are now available on the Hadoop file system
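"Move computation to data" amounts to a scheduling preference. The following Python sketch shows one simple greedy way to express it; it is an assumption-laden illustration rather than Hadoop's actual scheduler, and `assign_map_tasks`, `block_locations`, and `free_slots` are hypothetical names.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring nodes that hold the block.

    block_locations: {block_id: [nodes holding a replica of that block]}
    free_slots:      {node: number of tasks the node can still accept}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict alone
    assignment = {}
    for block_id in sorted(block_locations):
        local = [n for n in block_locations[block_id] if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has capacity: ship the data to any free node.
            candidates = [n for n in sorted(slots) if slots[n] > 0]
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment
```

A task lands on a node with a local replica whenever such a node has a free slot; only when none does is the block read over the network. A real scheduler would also weigh rack locality and rerun slow tasks, which this sketch omits.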
  14. Hadoop on the Grid
     • Managed Hadoop clusters
     • Shared resources
         • Improved utilization
     • Standard data sets, storage
     • Shared, standardized operations management
     • Hosted internally or externally (e.g. on EC2)
  15. Usage Patterns
  16. ETL
     • Put large data source (e.g. log files) onto the Hadoop File System
     • Perform aggregations, transformations, normalizations on the data
     • Load into RDBMS / data mart
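The ETL pattern can be sketched end to end in Python. Everything here is a stand-in under stated assumptions: the log format (`<date> <path> <status>`), the `hits` table, and the function names are hypothetical, the aggregation runs in-process where a real deployment would run it as a Hadoop job, and SQLite stands in for the RDBMS or data mart.

```python
import sqlite3
from collections import Counter


def parse_line(line):
    """Parse one log line in the assumed '<date> <path> <status>' format."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return None  # skip malformed lines
    date, path, status = parts
    # Normalize: lowercase the path and drop any trailing slash.
    return date, (path.lower().rstrip("/") or "/"), int(status)


def aggregate(lines):
    """Transform and aggregate: count hits per (date, path, status)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record] += 1
    return sorted((d, p, s, n) for (d, p, s), n in counts.items())


def load(rows, conn):
    """Load the aggregated rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(day TEXT, path TEXT, status INTEGER, n INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
```

A typical use is `load(aggregate(log_lines), sqlite3.connect(":memory:"))`, after which the aggregated counts can be queried with ordinary SQL.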
  17. Reporting and Analytics
     • Run canned and ad-hoc queries over large data
     • Run analytics and data mining operations on large data
     • Produce reports for end-user consumption or loading into data mart
  18. Data Processing Pipelines
     • Multi-step pipelines for data processing
     • Coordination, scheduling, data collection and publishing of feeds
     • SLA carrying, regularly scheduled jobs
  19. Machine Learning & Graph Algorithms
     • Traverse large graphs and data sets, building models and classifiers
     • Implement machine learning algorithms over massive data sets
  20. General Back-End Processing
     • Implement significant portions of back-end, batch oriented processing on the grid
     • General computation framework
     • Simplify back-end architecture
  21. What Next?
     • Download Hadoop
     • Try it on your laptop
     • Try Pig
     • Deploy to multiple boxes
     • Try it on EC2