Cloud Big Data Architectures

2,245 views

Workshop at QCon Brazil

Published in: Software

  1. 1. Cloud Big Data Architectures Lynn Langit QCon Sao Paulo, Brazil 2016
  2. 2. About this Workshop Real-world Cloud Scenarios w/AWS, Azure and GCP 1.  Big Data Solution Types 2.  Data Pipelines 3.  ETL and Visualization 4.  Bonus…(if time allows)
  3. 3. Save ALL of your Data
  4. 4. “What is the ACTUAL Cost of ✘  Saving all Data ✘  Using newer technologies ✘  Going beyond Relational
  5. 5. About this Workshop Real-world Cloud Scenarios w/AWS, Azure and GCP 1.  When to use which type of Big Data Solution 2.  The new world of Data Pipelines 3.  ETL and Visualization Practicalities 4.  Bonus…(if time allows)
  6. 6. 1. Big Data – Yes! But what kind?
  7. 7. Pattern 1 ✘ Which type(s) of Big Data work best? -- when to use Hadoop -- when to use NoSQL and which type, i.e. key-value, document, graph, etc. -- when to use Big Relational and what type of workload for hot, warm or cold data
  8. 8. Choice… is good, right?
  9. 9. “When do I use…? ✘  Hadoop ✘  NoSQL ✘  Big Relational
  10. 10. Size Matters
  11. 11. One Vendor’s View
  12. 12. Where is Hadoop Used?
  13. 13. Hadoop is your LAST CHOICE ✘ Volume ✘ 10 TB or greater to start ✘ Growth of 25% YOY ✘ Where FROM ✘ Where TO ✘ Velocity and Variety ✘ Spark over HIVE ✘ Kafka and Samza ✘ Veracity ✘ Pay, train and hire team ✘ Top $$$ for talent ✘ IF you can find it ✘ WATCH OUT for Cloud Vendors who promise ‘easy access’ ✘ Complexity of ecosystem ✘ Cloudera knows best
  14. 14. “When do I use…? ✘  Hadoop ✘  NoSQL ✘  Big Relational
  15. 15. 225 NoSQL Database Types to Choose From
  16. 16. Let’s review some NoSQL concepts Key-Value Redis, Riak, Aerospike Graph Neo4j Document MongoDB Wide-Column Cassandra, HBase
  17. 17.
  18. 18. Key Questions - Storage ✘ Volume – how much now, what growth rate? ✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc… ✘ Velocity – batches, streams, both, what ingest rate? ✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
  19. 19. NoSQL Example ✘ Open Source is Free §  Rapid iteration, innovation §  Can start up for free (on premise) §  Can ‘rent’ for cheap or free on the cloud §  Can use with the command line for free §  Some vendors offer free online training §  Ex. www.neo4j.org ✘ Not Free §  Constant releases §  Can be deceptively hard to set up (time is money) §  Don’t forget to turn it off if on the cloud! §  GUI tools, support, training cost $$$ §  Ex. www.neo4j.com
  20. 20. Practice Applying Concepts - NoSQL
  21. 21. NoSQL Applied Log Files •  ??? Product Catalogs •  ??? Social Games •  ??? Social aggregators •  ??? Line-of-Business •  ???
  22. 22. NoSQL Applied Log Files •  Columnstore •  HBase Product Catalogs •  Key/Value •  Redis Social Games •  Document •  MongoDB Social aggregators •  Graph •  Neo4j Line-of-Business •  RDBMS •  SQL Server
  23. 23. More than NoSQL NoSQL ✘  Non-relational ✘  Can be optimized in-memory ✘  Eventually consistent ✘  Schema on Read ✘  Example: Aerospike NewSQL ✘  Relational plus more ✘  Often in-memory ✘  Some kind of SQL-layer ✘  Schema on Write ✘  Example: MemSQL U-SQL ✘  What??? ✘  Microsoft’s universal SQL language ✘  Example: Azure Data Lake
  24. 24. Focus
  25. 25. How Best to Store your Data? (Complexity / Scalability / Developer Cost) RDBMS: easy / medium / low; NoSQL: medium / big / high; Hadoop: hard / huge / very high
  26. 26. Real World Big Data -- When do I use what? RDBMS 65% NoSQL 30% Hadoop 5%
  27. 27. “Do the Cloud Vendors Understand Big Data Realities?
  28. 28. Cloud Big Data Vendors - Storage AWS ✘  5-10X market share of next competitor ✘  Most complete offering ✘  Most mature offering ✘  Notable: Big Relational GCP ✘  Lean, mean and cheap ✘  Fastest player ✘  Requires top developers ✘  Notable: Query as a Service Azure ✘  Catching up ✘  Best tooling integration ✘  Notable: On-premise integration
  29. 29. AWS Console: 17 Data Services
  30. 30. GCP Console: 8 Data Services
  31. 31. Azure Console: 15 Data Services
  32. 32. Cloud Offerings – Big Data (AWS | Google | Microsoft) Managed RDBMS: RDS, Aurora | Cloud SQL | Azure SQL; Data Warehouse: Redshift | BigQuery | Azure SQL Data Warehouse; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline | Azure Blobs, StorSimple; NoSQL Key-Value, NoSQL Wide Column: DynamoDB | Big Table, Cloud Datastore | Azure Tables; NoSQL Document, NoSQL Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE | DocumentDB, Neo4j on Azure; Hadoop: Elastic MapReduce | DataProc | Data Lake, HDInsight
  33. 33. Practice Applying Concepts – Real Cost of Storage Types
  34. 34. Cloud NoSQL Applied – AWS Log Files Product Catalogs Social Games Social aggregators Line-of-Business
  35. 35. Cloud NoSQL Applied – AWS Log Files •  Stream or Hadoop •  Kinesis or EMR Product Catalogs •  Key/Value •  DynamoDB Social Games •  Document •  MongoDB Social aggregators •  Graph •  Neo4j Line-of-Business •  RDBMS •  RDS
  36. 36. The fastest growing cloud-based Big Data products are… ???
  37. 37. The fastest growing cloud-based Big Data products are… Relational
  38. 38. “When do I use…? ✘  Hadoop ✘  NoSQL ✘  Big Relational
  39. 39. Practice Applying Concepts – Real Cost of Storage Types
  40. 40. Reasons to use Big Relational Cloud Services Developers DevOps Cloud Vendors – AWS Developers DevOps Cloud Vendors – GCP
  41. 41. Reasons to use Big Relational Cloud Services. Developers: Most know RDBMS query patterns; Many know basic administration. DevOps: Most know RDBMS administration; Many know basic RDBMS queries; Many know query optimization. Cloud Vendors - AWS: Aurora – RDBMS up to 64 TB; Redshift – $1k USD / 1 TB / year; Rich partner ecosystem – ETL; Integration with AWS products. Developers: Most know coding language patterns to interact with RDBMS systems. DevOps: Familiar RDBMS security patterns; Familiar auditing; Partner tooling integration. Cloud Vendors - GCP: Big Query – familiar SQL queries; No hassle streaming ingest; No hassle pay-as-you-go; Zero administration
  42. 42. My top Big Data Cloud Services
  43. 43. ETL is 75% of all Big Data Projects Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
  44. 44. About this Workshop Real-world Cloud Scenarios w/AWS, Azure and GCP 1.  When to use which type of Big Data Solution 2.  The new world of Data Pipelines 3.  ETL and Visualization Practicalities 4.  Bonus…(if time allows)
  45. 45. 2. Data Pipelines Build vs. Buy
  46. 46. Pattern 2 ✘ How to build optimized cloud-based data pipelines? -- Cloud-based ETL tools and processes -- includes load-testing patterns and security practices -- including connecting between different vendor clouds
  47. 47. Key Questions – Ingestion and ETL ✘ Volume – how much and how fast, now and future? ✘ Variety – what type(s) of data, any pre-processing needed? ✘ Velocity – batches or streaming? ✘ Veracity – verification on ingest needed? new data needed?
  48. 48. Together How does your data pipeline flow?
  49. 49. “Considering… ✘  Initial Load/Transform ✘  Data Quality ✘  Batch vs. Stream
  50. 50. Pipeline Phases Phase 0 Eval Current Data - Quality & Quantity Phase 1 Get New Data - Free or Premium Phase 2 Build MVP & Forecast volume and growth Phase 3 Load test at scale Phase 4 Deploy – secure, audit and monitor
  51. 51. Cloud Big Data Vendors - ETL AWS ✘  5X market share of next competitor ✘  Notable: Many, strong ETL Partners GCP ✘  Lean, mean and cheap ✘  Fastest player ✘  Notable: DataFlow requires Java or Python developers Azure ✘  Difficulty with scale ✘  Best tooling integration ✘  Notable: Nothing
  52. 52. How Best to Ingest and ETL your Data? (Complexity / Scalability / Developer Cost) RDBMS: medium / medium / low; NoSQL: medium / big / high; Hadoop: hard / huge / very high
  53. 53. “Considering… ✘  Initial Load/Transform ✘  Data Quality ✘  Batch vs. Stream
  54. 54. Building a Streaming Pipeline Stream Interval Window
  55. 55. “Near Real-time Streams Load Test All The Things
  56. 56. Key Questions - Streaming ✘ Volume – how much data now and predicted over next 12 months? ✘ Variety – what types of data now and future? ✘ Velocity – volume of input data / time now and near future? ✘ Veracity – volume of EXISTING data now
  57. 57. Cloud Big Data Vendors - Streaming AWS ✘  5X market share of next competitor ✘  Most complete offering ✘  Most mature offering ✘  Notable: Kinesis Firehose GCP ✘  Lean, mean and cheap ✘  Fastest player ✘  Requires top developers ✘  Notable: DataFlow flexible Azure ✘  Catching up ✘  Best tooling integration ✘  Notable: Stream Analytics integration with other products
  58. 58. AWS Console: 17 Data Services
  59. 59. GCP Console: 8 Data Services
  60. 60. Azure Console: 15 Data Services
  61. 61. Cloud Offerings – Data and Pipelines (AWS | Google | Microsoft) Managed RDBMS: RDS, Aurora | Cloud SQL | Azure SQL; Data Warehouse: Redshift | BigQuery | Azure SQL Data Warehouse; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline | Azure Blobs, StorSimple; NoSQL Key-Value, NoSQL Wide Column: DynamoDB | Big Table, Cloud Datastore | Azure Tables; Streaming or ML: Kinesis, AWS Machine Learning | DataFlow, Google Machine Learning | StreamInsight, Azure ML; NoSQL Document, NoSQL Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE | DocumentDB, Neo4j on Azure; Hadoop: Elastic MapReduce | DataProc | Data Lake, HDInsight; Cloud ETL: Data Pipelines | DataFlow | Azure Data Pipeline
  62. 62. How Best to Stream your Data? (Complexity / Scalability / Developer Cost) Batches: easy / medium / low; Windows: difficult / big / high; Real-time: very difficult / huge / high
  63. 63. Practice Applying Concepts
  64. 64. Designing Cloud Data Pipelines Log Files Product Catalogs Social Games Social aggregators Line-of-Business
  65. 65. About this Workshop Real-world Cloud Scenarios w/AWS, Azure and GCP 1.  When to use which type of Big Data Solution 2.  The new world of Data Pipelines 3.  ETL and Visualization Practicalities 4.  Bonus…(if time allows)
  66. 66. 3. Making Sense of Data Analytics and Presentation
  67. 67. Pattern 3 ✘ How best to Query and Visualize -- When to use business analytics vs. predictive analytics (machine learning) -- how best to present data to clients - partner visualization products or roll your own
  68. 68. Making Sense of Data Machine Learning Reports Presentation
  69. 69. Key Questions - Query ✘ Volume ✘ Variety ✘ Velocity ✘ Veracity
  70. 70. Graphs What is nature of your questions?
  71. 71. Cloud Big Data Vendors - Query AWS ✘  5X market share of next competitor ✘  Most complete offering ✘  Most mature offering ✘  Notable: Big Relational GCP ✘  Lean, mean and cheap ✘  Fastest player ✘  Notable: Flexible, powerful machine learning Azure ✘  WATCH OUT – Cost! ✘  Notable: Developer Tooling
  72. 72. Query Languages SQL Everyone knows it But how well do they know it? NoSQL Vendor Language Too many to list How will you learn it? Cypher Query language for graph databases The future? ORM Good, bad or horrible? Again, how well do they know it? HIVE Shown in too many vendor demos Really hard to make performant Machine Learning Queries SciPy, NumPy or Python R Language Julia Language Many more…
  73. 73. Practice Applying Concepts – Understanding D3
  74. 74. How Best to Query your Data? Business Analytics Predictive Analytics Developer Cost RDBMS NoSQL Hadoop
  75. 75. How Best to Query your Data? (Business Analytics / Predictive Analytics / Developer Cost) RDBMS: easy / medium / low; NoSQL: hard / very hard / very high; Hadoop: hard / hard / very high
  76. 76. Machine Learning aka Predictive Analytics AWS ML for developers GUI-based GCP 3 Flavors of ML Python-based languages Azure ML for Data Scientists R Language
  77. 77. Presentation If you can’t see it, it’s not worth it.
  78. 78. Dashboards ✘  More than KPIs ✘  Mobile ✘  Alerts ✘  Data Stories Innovation in Data Visualization Reports ✘  Level of Detail ✘  Meaningful Taxonomies ✘  Fast enough ✘  Drill for Data
  79. 79. D3 The language of Data Visualization
  80. 80. Cloud Big Data Vendors - Visualization AWS ✘  Most complete offering ✘  Notable: Partners & QuickSight GCP ✘  Big Query Partners ✘  Notable: New Dashboards Azure ✘  Integrated ✘  Notable: PowerBI
  81. 81. About this Workshop Real-world Cloud Scenarios w/AWS, Azure and GCP 1.  When to use which type of Big Data Solution 2.  The new world of Data Pipelines 3.  ETL and Visualization Practicalities 4.  Bonus…(if time allows)
  82. 82. 4. About IoT It’s happening now
  83. 83. Data Generation Device
  84. 84. IoT is Big Data Realized
  85. 85. $235,000,000,000 (the IoT market); 2017 (by the year); 20 billion devices (and a lot of users)
  86. 86. IoT all the Things
  87. 87. Cloud Big Data Vendors - IoT AWS ✘  First to market ✘  Most complete offering ✘  Most mature offering ✘  Notable: AWS IoT Rules GCP ✘  Still in Beta ✘  Fastest player ✘  Requires top developers ✘  Notable: Weave Azure ✘  Catching up ✘  Best tooling integration ✘  Notable: Device Mgmt.
  88. 88. Save ALL of your Data
  89. 89. The Next Generation…
  90. 90. ‘brigada! Any questions? You can find me at @lynnlangit
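
As a rough illustration of the sizing figures on slides 13 and 41: slide 13 names "10 TB or greater to start" and "growth of 25% YOY" as the volume bar for Hadoop, and slide 41 quotes Redshift at $1k USD / 1 TB / year. The sketch below compounds a starting volume year over year and prices it at a flat per-TB rate. The compounding model, the function names, and the use of the slide's 2016 price as a default are assumptions made for illustration, not the workshop's own method or a current price.

```python
from __future__ import annotations


def project_volume_tb(start_tb: float, yearly_growth: float, years: int) -> list[float]:
    """Return the projected volume in TB for year 0 through `years`.

    Assumes simple compound growth: each year is the previous year
    multiplied by (1 + yearly_growth).
    """
    return [start_tb * (1 + yearly_growth) ** year for year in range(years + 1)]


def yearly_storage_cost_usd(volume_tb: float, usd_per_tb_year: float = 1000.0) -> float:
    """Flat-rate cost estimate; the default rate is the figure quoted on slide 41."""
    return volume_tb * usd_per_tb_year


if __name__ == "__main__":
    # 10 TB growing 25% per year, the thresholds named on slide 13.
    for year, tb in enumerate(project_volume_tb(10, 0.25, 3)):
        cost = yearly_storage_cost_usd(tb)
        print(f"year {year}: {tb:.1f} TB, roughly ${cost:,.0f} per year")
```

Running the numbers this way before choosing a platform is one way to approach the "how much now, what growth rate?" question from slide 18.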
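
Slides 54 and 62 contrast batches, windows and real-time streams without saying what kind of window is meant. A minimal sketch of one common interpretation, fixed-size (tumbling) windows, is shown below in plain Python. The event shape `(timestamp_seconds, value)` and the function name are hypothetical; a managed service such as those listed on slide 57 would handle this differently.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_seconds):
    """Sum event values per fixed, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, value) pairs.
    Returns a dict mapping each window's start time to the sum of its values,
    ordered by window start.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    sums = defaultdict(float)
    for timestamp, value in events:
        # Floor the timestamp to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sorted(sums.items()))


if __name__ == "__main__":
    sample = [(0, 1.0), (5, 2.0), (10, 3.0), (19, 4.0), (20, 5.0)]
    print(tumbling_window_sums(sample, 10))
```

A batch job is the degenerate case of one very large window; shrinking the window moves the design toward the "real-time" row of slide 62, along with its higher complexity.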
