Big Data on Google Cloud

CEO @ AdFlex, ex-CTO at Eway & AdFlex, co-founder of DYNO at Eway JSC
Jul 3, 2017

Big Data on Google Cloud

  1. Big Data on Google Cloud. Tu Pham, I/O Extended 2017
  2. Tu Pham. CTO @ Dyno, a Data-as-a-Service company. Technologies: Java, Python, all kinds of databases, and cloud platforms from Google, AWS, and Azure. Interests: cloud computing and architecture, technology evolution, distributed systems. Husband, father, GDE, open source contributor. Photo: Lars Kruse, Aarhus Universitet. Introducing Dyno: tech marketing & digital agency.
  3. For the past 17 years, Google has been building out the world’s fastest, most powerful, highest-quality cloud infrastructure on the planet. Images by Connie Zhou
  4. Google Cloud Platform is built on the same infrastructure that powers Google. Images by Connie Zhou
  5. Google’s Platform “[Google's] ability to build, organize, and operate a huge network of servers and fiber-optic cables with an efficiency and speed that rocks physics on its heels. This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds.” - Wired
  6. 77 Peering locations
  7. Yes, We Can Power That: Web, Mobile, Storage & Database, Big Data, Highly Scalable System, Data Mining. Cloud Platform
  8. Google’s Mission: “Organize the world’s information and make it universally accessible and useful.”
  9. By 2020, there will be 8 billion connected smartphones, 2X more than today, and 32 billion connected “IoT” devices, 6X more than today. Source: Boston Consulting Group, “The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact”; IDC, 2015.
  10. Exploring the Cloud: IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), SaaS (Software-as-a-Service). Google Cloud Platform
  11. Google Compute Engine • Flexible Infrastructure • Custom VM Size • Online Disk Resizing • Network • Internal Network • Firewall • Load Balancing • External IP Address • Billing • Sustained Use Discounts • Preemptible VM
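The billing bullets on this slide can be made concrete with a small worked example. The slide gives no rates, so the sketch below assumes the tiered schedule Google published for sustained use discounts on standard machine types around the time of this talk: each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate. The tier values and function names are illustrative assumptions, not part of the original deck.

```python
# Assumed tier schedule (not stated on the slide): each quarter of the month
# is billed at a lower fraction of the base rate.
TIER_RATES = (1.0, 0.8, 0.6, 0.4)
TIER_WIDTH = 0.25  # each tier covers one quarter of the month


def billed_fraction(usage: float) -> float:
    """Return the billed cost as a fraction of one full month at the base rate.

    `usage` is the fraction of the month the VM was running (0.0 to 1.0).
    """
    if not 0.0 <= usage <= 1.0:
        raise ValueError("usage must be between 0 and 1")
    billed = 0.0
    for index, rate in enumerate(TIER_RATES):
        tier_start = index * TIER_WIDTH
        # Portion of the usage that falls inside this tier.
        usage_in_tier = min(max(usage - tier_start, 0.0), TIER_WIDTH)
        billed += usage_in_tier * rate
    return billed


def effective_discount(usage: float) -> float:
    """Return the discount relative to paying the base rate for all usage."""
    if usage == 0:
        return 0.0
    return 1.0 - billed_fraction(usage) / usage


for usage in (0.25, 0.5, 1.0):
    print(f"{usage:.0%} of the month -> {effective_discount(usage):.0%} discount")
```

Under these assumed tiers, a VM that runs for the whole month is billed 70% of the base price, which is an effective discount of 30%.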
  12. App Engine • Fully Managed Platform • Popular Programming Language Support • Flexible and Scalable Application Storage • Auto-scaling • Versioning and Traffic Splitting • Local Developer Tools • Third-party Frameworks and Extensions
  13. Cloud Pub/Sub • Global Presence • Flexible Delivery Options • Pull • Push • Data Reliability • Flow Control • Data Security and Protection
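The difference between pull and push delivery, and what flow control and reliability mean in practice, may be easier to see in a toy model. The sketch below is plain in-memory Python, not the Cloud Pub/Sub client library; every class and method name is hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class PullSubscription:
    """Toy pull subscription: messages wait until the subscriber asks."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[int, str]] = deque()
        self._outstanding: Dict[int, str] = {}
        self._next_id = 0

    def deliver(self, data: str) -> None:
        self._pending.append((self._next_id, data))
        self._next_id += 1

    def pull(self, max_messages: int) -> List[Tuple[int, str]]:
        """Hand out at most `max_messages`; this cap acts as flow control."""
        batch = []
        while self._pending and len(batch) < max_messages:
            ack_id, data = self._pending.popleft()
            self._outstanding[ack_id] = data
            batch.append((ack_id, data))
        return batch

    def ack(self, ack_id: int) -> None:
        self._outstanding.pop(ack_id, None)

    def redeliver_unacked(self) -> None:
        """Queue unacknowledged messages again (the reliability part)."""
        for ack_id, data in sorted(self._outstanding.items()):
            self._pending.append((ack_id, data))
        self._outstanding.clear()


class PushSubscription:
    """Toy push subscription: the service calls the endpoint right away."""

    def __init__(self, endpoint: Callable[[str], None]) -> None:
        self._endpoint = endpoint

    def deliver(self, data: str) -> None:
        self._endpoint(data)


class Topic:
    """Fans every published message out to all subscriptions."""

    def __init__(self) -> None:
        self._subscriptions = []

    def subscribe(self, subscription) -> None:
        self._subscriptions.append(subscription)

    def publish(self, data: str) -> None:
        for subscription in self._subscriptions:
            subscription.deliver(data)


received_by_push: List[str] = []
topic = Topic()
pull_sub = PullSubscription()
topic.subscribe(pull_sub)
topic.subscribe(PushSubscription(received_by_push.append))

for event in ("click", "view", "purchase"):
    topic.publish(event)

first_batch = pull_sub.pull(max_messages=2)   # subscriber takes two at a time
pull_sub.ack(first_batch[0][0])               # acknowledge only the first one
pull_sub.redeliver_unacked()                  # the unacked message is queued again
second_batch = pull_sub.pull(max_messages=2)
```

The push subscriber sees every message as soon as it is published, while the pull subscriber decides how many messages to take and gets unacknowledged ones back later.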
  14. Cloud Dataflow • Reliable & Consistent Processing • Unified Programming Model • Intelligent Work Scheduling • Auto Scaling • Monitoring • Open Source
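A "unified programming model" means the same pipeline logic serves both bounded (batch) and unbounded (streaming) input. The sketch below illustrates that idea in plain Python rather than with the Apache Beam SDK: one function counts events per fixed time window and accepts either a list or a generator. It assumes events arrive in timestamp order, which a real streaming engine does not require; the function name and the sample events are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event key)


def count_per_window(
    events: Iterable[Event], window_seconds: int = 60
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts) for each fixed window, in order.

    Works on any iterable, so the same code handles a finished batch
    (a list) and a live feed (a generator). Assumes in-order timestamps.
    """
    current_window = None
    counts: Counter = Counter()
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        if current_window is not None and window_start != current_window:
            # The previous window is complete, so emit it.
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        # Emit the last window once the input ends.
        yield current_window, dict(counts)


sample_events = [(0, "click"), (10, "view"), (59, "click"), (61, "click")]


def live_feed() -> Iterator[Event]:
    """Stand-in for a streaming source that produces events one by one."""
    for event in sample_events:
        yield event


batch_result = list(count_per_window(sample_events))
stream_result = list(count_per_window(live_feed()))
```

Both calls go through the same function and produce the same per-window counts, which is the property the slide's bullet refers to.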
  15. Cloud Storage • Versioning • Static Sites • Resumable Transfers • Object Change Notifications • TB Scale
  16. Cloud SQL • Fully Managed • Ease of Use • Highly Reliable • Flexible Charging • Security, Availability, Durability • Easy Migration & Data Portability • Optimized MySQL Versions
  17. BigQuery • Fully Managed Big Data Analytics Service • SQL Support • Fast • Scalable • Flexible and Familiar • Security and Reliability
  18. Cloud Dataproc • Includes Apache Hadoop, Apache Pig, Apache Hive, and Apache Spark • Fast and Scalable Data Processing • Flexible Virtual Machines • Resizable Cluster
  19. Cloud Datalab • Powerful Data Exploration • Scalable • Data Management • Visualization • Open Source (Jupyter)
  20. Google’s Data Services for everyone
  21. A serverless big data stack that scales automatically. A common configuration (diagram labels): stream input (events, metrics, etc.); batch input (raw logs, files, assets, Google Analytics data, etc.); Cloud Datalab; visualization and BI; applications and reports; co-workers; draw conclusions.
  22. 10+ Years of Tackling Big Data Problems (timeline, 2002 to 2015). Google papers and open source: GFS, MapReduce, Bigtable, Dremel, FlumeJava, MillWheel, PubSub, Apache Beam, TensorFlow. Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable.
  23. Transform Data into Actions (diagram labels). Stages: Data Ingestion, Databases, Storage, Data Preparation & Processing, Analytics, Exploration & Collaboration, Advanced Analytics & Intelligence. Sources: mobile apps, sensors and devices, web apps. Capabilities: messaging, logs, relational, key-value, document, SQL, wide column, object, stream processing, batch processing, data preparation, federated query, data catalog, data exploration, data visualization, development environment for machine learning, pre-trained machine learning models. Users: developers, data scientists, business analysts.
  24. Transform Data into Actions (products, grouped as extracted from the slide). Sources: mobile apps, sensors and devices, web apps. Data Ingestion: Cloud Pub/Sub, App Engine. Databases/Storage: Cloud SQL, Cloud Bigtable, Cloud Datastore, Cloud Storage. Data Preparation & Processing: Cloud Dataflow, Cloud Dataproc. Analytics: Google BigQuery, Google Analytics 360, Cloud Dataproc, Google Drive. Exploration & Collaboration: Google BigQuery, Cloud Datalab, Google Analytics 360, Cloud Dataproc. Advanced Analytics & Intelligence: Cloud Machine Learning, Translate API, Vision API, Speech API. Users: developers, data scientists, business analysts.
  25. Google Cloud Dataproc: Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
  26. Traditional Spark and Hadoop clusters
  27. Google Cloud Dataproc
  28. Google Cloud Dataproc, under the hood (diagram labels). Applications on the cluster: Spark, PySpark, Spark SQL, MapReduce, Pig, Hive. Dataproc cluster: Spark & Hadoop OSS, Cloud Dataproc Agent. Also shown: Dataproc Jobs, GCP Products, Google Cloud Services, Features, Data, Outputs.
  29. Easy, fast, cost-effective. Fast: things take seconds to minutes, not hours or weeks. Easy: be an expert with your data, not your data infrastructure. Cost-effective: pay for exactly what you use.
  30. Running Hadoop on Google Cloud (comparison chart). Options: On Premise, Vendor Hadoop, bdutil (free OSS toolkit), Dataproc (managed Hadoop). For each option the chart marks the following layers as Google managed or customer managed: Custom Code, Monitoring/Health, Dev Integration, Scaling (shown as Manual Scaling for one option), Job Submission, GCP Connectivity, Deployment, Creation.
  31. Cloud Dataproc, integrated: Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform. Storage, Operations, Data.
  32. Where Cloud Dataproc fits into GCP: Google Bigtable (HBase); Google BigQuery (analytics, data warehouse); Stackdriver Logging (logging ops.); Google Cloud Dataflow (batch/stream processing); Google Cloud Storage (HCFS/HDFS); Stackdriver Monitoring (monitoring).
  33. Google BigQuery makes complex data analysis simple: scales automatically; no setup or administration; stream up to 100,000 rows per second; easily integrates with third-party software.
  34. Google BigQuery Performance Example: running an inefficient regular expression over 100 billion rows in less than 60 seconds. Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
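To put the performance example in perspective, the quoted figures imply a lower bound on scan throughput. The short calculation below uses only the two numbers from the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
ROWS_SCANNED = 100_000_000_000  # 100 billion rows
MAX_SECONDS = 60                # "less than 60 seconds"

# Because the query finished in under 60 seconds, this is a lower bound.
min_rows_per_second = ROWS_SCANNED / MAX_SECONDS

print(f"at least {min_rows_per_second / 1e9:.2f} billion rows per second")
```

That works out to roughly 1.67 billion rows evaluated per second, at minimum.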
  35. Google BigQuery: the power of Google Dremel for everyone. Diagram labels: Storage, Compute, Fast Ingest, Query, Terabit Network.
  36. Making ad hoc queries. Before: 1000-core Hadoop cluster = 2.5 hours. After: BigQuery < 5 min. ● 500+ Games ● Hundreds of Analysts ● Terabytes of Data Daily
  37. “Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.” Gary Sanders, head of the bank's digital analytics function. https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics
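The before/after figures quoted on the two slides above can be turned into speedup factors with simple arithmetic. The helper below is a hypothetical convenience function; the inputs are the numbers from the slides, with "<5 min" treated as 5 minutes, so the first result is a lower bound.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times faster the 'after' duration is."""
    return before_minutes / after_minutes


# 1000-core Hadoop cluster (2.5 hours) versus BigQuery (under 5 minutes).
hadoop_vs_bigquery = speedup(before_minutes=2.5 * 60, after_minutes=5)

# Time to insight in the bank quote: 96 hours down to 30 minutes.
time_to_insight = speedup(before_minutes=96 * 60, after_minutes=30)

print(f"ad hoc queries: at least {hadoop_vs_bigquery:.0f}x faster")
print(f"time to insight: {time_to_insight:.0f}x faster")
```

The first case comes out to at least 30x and the second to 192x.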
  38. Big Data Challenges at Dyno - Multi-TB data warehouse - Raw input: more than 100 GB of new raw data per day (structured & unstructured) - 65 online data sources - Unlimited offline data sources - Facing data quality problems every day - From user information & behavior to user interest & intention - Managing a high-performance, cost-effective system
  39. JOIN THE FLIGHT - WE ARE HIRING IO Extended 2017 Twitter: @phamptu Email: tu@dyno.vn Frontend Developer: goo.gl/EY8RvV Backend Developer: goo.gl/BnmmK6