SlideShare a Scribd company logo
1 of 18
Download to read offline
©2017, Amazon Web Services, Inc. or its affiliates. All rights reserved
Big Data answers in seconds !
with Amazon Athena
Julien Simon, Principal Technical Evangelist, AWS
julsimon@amazon.fr 
@julsimon
Big Data the way it should be
Questions
(not data!)
Answers
We shouldn’t have to
care about how this
really works !
Data
We shouldn’t have to
mess with this at all
Want to build it yourself? You need to master this
•  Planning capacity for storage and compute
•  Handling different data formats, structured and
unstructured (CSV, JSON, Parquet, Avro, etc.)
•  Learning complex programming models and
languages (Map Reduce, Spark, Scala, etc.)
•  Keeping costs under control
•  Availability, performance, security and a few more
Need help with your own Hadoop?
•  Claranet: AWS Premier Consulting Partner
•  They can build and run your Cloudera Enterprise platforms
on top of AWS
•  Claranet has certified AWS and Cloudera experts
•  Security & compliance is built-in (ISO 27001, PCI-DSS)
•  24/7 support is available
•  Learn more on booth 110. Tell them I sent you ;)
https://www.claranet.fr
Amazon Athena
•  New service announced at re:Invent 2016
•  Run read-only SQL queries on S3 data
•  No data load, no indexing, no nothing
•  No infrastructure to create, manage or scale
•  Availability: us-east-1, us-east-2, us-west-2
•  Pricing: $5 per Terabyte scanned 

AWS re:Invent 2016: Introducing Athena (BDA303) https://www.youtube.com/watch?v=DxAuj_Ky5aw
Athena queries
•  Service based on Presto (already available in Amazon EMR)
•  Table creation: Apache Hive Data Definition Language
–  CREATE EXTERNAL_TABLE
•  ANSI SQL operators and functions: what Presto supports
•  Unsupported operations
–  User-defined functions (UDF or UDAFs)
–  Stored procedures
–  Any transaction found in Hive or Presto 
https://prestodb.io 
http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
Data formats supported by Athena
•  Unstructured
–  Apache logs, with customizable regular expression
•  Semi-structured
–  delimiter-separated values (CSV, OpenCSV)
–  Tab-separated values (TSV)
–  JSON
•  Structured
–  Apache Parquet https://parquet.apache.org/
–  Apache ORC https://orc.apache.org/ 
–  Apache Avro https://avro.apache.org/ 
•  Compression (Snappy, Zlib, GZIP) & partitioning
Data partitioning
•  Partitioning reduces the amount of scanned data
–  Better performance
–  Cost optimization
•  Data may be already partitioned in S3
–  CREATE EXTERNAL TABLE table_name(…) PARTITIONED BY (...) 
–  MSCK REPAIR TABLE table_name
•  Data can also be partitioned at table creation time
–  CREATE EXTERNAL TABLE table_name(…)
–  ALTER TABLE table_name ADD PARTITION …
Running queries on Athena
•  AWS Console (quite cool, actually)
–  Wizard for schema definition and table creation
–  Saved queries
–  Query history
–  Multiple queries in parallel

•  JDBC driver
–  SQL Workbench/J 
–  Java application
http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
Using columnar formats for fun and profit
•  Apache Parquet
•  Apache ORC
•  Ditto: better performance & cost optimization
•  You can convert your data to a columnar format with an
Amazon EMR cluster
•  More information and tutorial at
http://docs.aws.amazon.com/athena/latest/ug/convert-to-
columnar.html
Demo
GDELT Data set
•  Global Database of Events, Language and Tone Database 
–  300 categories of political & diplomatic activities around the world
–  Georeferenced to the city
–  Dating back to January 1, 1979
–  http://www.gdeltproject.org/ 
•  1543 CSV files in S3 (146 GB)
•  1 table (+ reference tables), 58 columns, 441M lines
•  https://aws.amazon.com/public-datasets/gdelt/
Using columnar formats for fun and profit
•  Hive makes it easy to convert from CSV to Parquet
https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html 
•  Large request 
–  CSV uncompressed : 26 seconds, 136GB scanned, $0.13
–  Parquet compressed : 4 seconds, 2.2GB scanned, $0.002
Athena in a nutshell
•  Run SQL queries on S3 data
•  No infrastructure
•  Multiple input formats supported
•  Pretty fast!
•  A simple, very cost-efficient option for ad-hoc
analysis
AWS User Groups
Lille
Paris
Rennes
Nantes
Bordeaux
Lyon
Montpellier
Toulouse
Côte d’Azur (new!)
facebook.com/groups/AWSFrance/
@aws_actus
“Amazon Web Services France”
https://aws.amazon.com/fr/events/webinaires/
Thank you!!
	
Julien	Simon,	Principal	Technical	Evangelist,	AWS	
julsimon@amazon.fr	
@julsimon

More Related Content

What's hot

Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
Amazon Web Services
 
Journey Through the Cloud - Data Analysis
Journey Through the Cloud - Data AnalysisJourney Through the Cloud - Data Analysis
Journey Through the Cloud - Data Analysis
Amazon Web Services
 

What's hot (20)

Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 
RMG207 Introduction to AWS CloudFormation - AWS re: Invent 2012
RMG207 Introduction to AWS CloudFormation - AWS re: Invent 2012RMG207 Introduction to AWS CloudFormation - AWS re: Invent 2012
RMG207 Introduction to AWS CloudFormation - AWS re: Invent 2012
 
Deep Learning on AWS (November 2016)
Deep Learning on AWS (November 2016)Deep Learning on AWS (November 2016)
Deep Learning on AWS (November 2016)
 
Automating your Infrastructure Deployment with AWS CloudFormation and AWS Ops...
Automating your Infrastructure Deployment with AWS CloudFormation and AWS Ops...Automating your Infrastructure Deployment with AWS CloudFormation and AWS Ops...
Automating your Infrastructure Deployment with AWS CloudFormation and AWS Ops...
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)
 
Amazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAmazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and Hosting
 
透過 Amazon Redshift 打造數據分析服務及 Amazon Redshift 新功能案例介紹
透過 Amazon Redshift 打造數據分析服務及 Amazon Redshift 新功能案例介紹透過 Amazon Redshift 打造數據分析服務及 Amazon Redshift 新功能案例介紹
透過 Amazon Redshift 打造數據分析服務及 Amazon Redshift 新功能案例介紹
 
CloudFormation Best Practices
CloudFormation Best PracticesCloudFormation Best Practices
CloudFormation Best Practices
 
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)Amazon Athena (April 2017)
Amazon Athena (April 2017)
 
Amazon EC2 Systems Manager (March 2017)
Amazon EC2 Systems Manager (March 2017)Amazon EC2 Systems Manager (March 2017)
Amazon EC2 Systems Manager (March 2017)
 
AWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS Cloud
 
Advanced Task Scheduling with Amazon ECS (June 2017)
Advanced Task Scheduling with Amazon ECS (June 2017)Advanced Task Scheduling with Amazon ECS (June 2017)
Advanced Task Scheduling with Amazon ECS (June 2017)
 
Build A Website on AWS for Your First 10 Million Users
Build A Website on AWS for Your First 10 Million UsersBuild A Website on AWS for Your First 10 Million Users
Build A Website on AWS for Your First 10 Million Users
 
Infrastructure as Code - AWS CloudFormation
Infrastructure as Code - AWS CloudFormationInfrastructure as Code - AWS CloudFormation
Infrastructure as Code - AWS CloudFormation
 
Scaling up to your first 10 million users - Pop-up Loft Tel Aviv
Scaling up to your first 10 million users - Pop-up Loft Tel AvivScaling up to your first 10 million users - Pop-up Loft Tel Aviv
Scaling up to your first 10 million users - Pop-up Loft Tel Aviv
 
Journey Through the Cloud - Data Analysis
Journey Through the Cloud - Data AnalysisJourney Through the Cloud - Data Analysis
Journey Through the Cloud - Data Analysis
 

Viewers also liked

Com clonar i com exportar i importar una màquina virtual
Com clonar i com exportar i importar una màquina virtualCom clonar i com exportar i importar una màquina virtual
Com clonar i com exportar i importar una màquina virtual
Edu Alias
 
Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)
Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)
Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)
Kevin Penalva-Halpin
 
Filafat ilmu kepolisian
Filafat ilmu kepolisianFilafat ilmu kepolisian
Filafat ilmu kepolisian
rara wibowo
 

Viewers also liked (20)

Building serverless apps with Node.js
Building serverless apps with Node.jsBuilding serverless apps with Node.js
Building serverless apps with Node.js
 
IoT: it's all about Data!
IoT: it's all about Data!IoT: it's all about Data!
IoT: it's all about Data!
 
Viadeo - Cost Driven Development
Viadeo - Cost Driven DevelopmentViadeo - Cost Driven Development
Viadeo - Cost Driven Development
 
Building Serverless APIs (January 2017)
Building Serverless APIs (January 2017)Building Serverless APIs (January 2017)
Building Serverless APIs (January 2017)
 
Devops with Amazon Web Services (January 2017)
Devops with Amazon Web Services (January 2017)Devops with Amazon Web Services (January 2017)
Devops with Amazon Web Services (January 2017)
 
Amazon AI (February 2017)
Amazon AI (February 2017)Amazon AI (February 2017)
Amazon AI (February 2017)
 
Continuous Deployment with Amazon Web Services
Continuous Deployment with Amazon Web ServicesContinuous Deployment with Amazon Web Services
Continuous Deployment with Amazon Web Services
 
Bonnes pratiques pour la gestion des opérations de sécurité AWS
Bonnes pratiques pour la gestion des opérations de sécurité AWSBonnes pratiques pour la gestion des opérations de sécurité AWS
Bonnes pratiques pour la gestion des opérations de sécurité AWS
 
Developing and deploying serverless applications (February 2017)
Developing and deploying serverless applications (February 2017)Developing and deploying serverless applications (February 2017)
Developing and deploying serverless applications (February 2017)
 
Amazon Inspector
Amazon InspectorAmazon Inspector
Amazon Inspector
 
Authentification et autorisation d'accès avec AWS IAM
Authentification et autorisation d'accès avec AWS IAMAuthentification et autorisation d'accès avec AWS IAM
Authentification et autorisation d'accès avec AWS IAM
 
An introduction to serverless architectures (February 2017)
An introduction to serverless architectures (February 2017)An introduction to serverless architectures (February 2017)
An introduction to serverless architectures (February 2017)
 
Advanced Task Scheduling with Amazon ECS
Advanced Task Scheduling with Amazon ECSAdvanced Task Scheduling with Amazon ECS
Advanced Task Scheduling with Amazon ECS
 
AWS re:Invent 2016 recap (part 2)
AWS re:Invent 2016 recap (part 2) AWS re:Invent 2016 recap (part 2)
AWS re:Invent 2016 recap (part 2)
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Presentación dia de andalucia pedro
Presentación dia de andalucia pedroPresentación dia de andalucia pedro
Presentación dia de andalucia pedro
 
Iot og personvern 2017
Iot og personvern 2017Iot og personvern 2017
Iot og personvern 2017
 
Com clonar i com exportar i importar una màquina virtual
Com clonar i com exportar i importar una màquina virtualCom clonar i com exportar i importar una màquina virtual
Com clonar i com exportar i importar una màquina virtual
 
Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)
Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)
Labo XX Antwerp - Densification strategies for the 20th century belt (BUUR)
 
Filafat ilmu kepolisian
Filafat ilmu kepolisianFilafat ilmu kepolisian
Filafat ilmu kepolisian
 

Similar to Big Data answers in seconds with Amazon Athena

Similar to Big Data answers in seconds with Amazon Athena (20)

Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep DiveLos Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
 
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
(ISM304) Oracle to Amazon RDS MySQL & Aurora: How Gallup Made the Move
(ISM304) Oracle to Amazon RDS MySQL & Aurora: How Gallup Made the Move(ISM304) Oracle to Amazon RDS MySQL & Aurora: How Gallup Made the Move
(ISM304) Oracle to Amazon RDS MySQL & Aurora: How Gallup Made the Move
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
 
Application design for the cloud using AWS
Application design for the cloud using AWSApplication design for the cloud using AWS
Application design for the cloud using AWS
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 

More from Julien SIMON

More from Julien SIMON (20)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)
 
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)
 
Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)
 
Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Big Data answers in seconds with Amazon Athena

  • 1. ©2017, Amazon Web Services, Inc. or its affiliates. All rights reserved Big Data answers in seconds ! with Amazon Athena Julien Simon, Principal Technical Evangelist, AWS julsimon@amazon.fr @julsimon
  • 2. Big Data the way it should be Questions (not data!) Answers We shouldn’t have to care about how this really works ! Data We shouldn’t have to mess with this at all
  • 3. Want to build it yourself? You need to master this •  Planning capacity for storage and compute •  Handling different data formats, structured and unstructured (CSV, JSON, Parquet, Avro, etc.) •  Learning complex programming models and languages (Map Reduce, Spark, Scala, etc.) •  Keeping costs under control •  Availability, performance, security and a few more
  • 4. Need help with your own Hadoop? •  Claranet: AWS Premier Consulting Partner •  They can build and run your Cloudera Enterprise platforms on top of AWS •  Claranet has certified AWS and Cloudera experts •  Security & compliance is built-in (ISO 27001, PCI-DSS) •  24/7 support is available •  Learn more on booth 110. Tell them I sent you ;) https://www.claranet.fr
  • 5.
  • 6. Amazon Athena •  New service announced at re:Invent 2016 •  Run read-only SQL queries on S3 data •  No data load, no indexing, no nothing •  No infrastructure to create, manage or scale •  Availability: us-east-1, us-east-2, us-west-2 •  Pricing: $5 per Terabyte scanned AWS re:Invent 2016: Introducing Athena (BDA303) https://www.youtube.com/watch?v=DxAuj_Ky5aw
  • 7. Athena queries •  Service based on Presto (already available in Amazon EMR) •  Table creation: Apache Hive Data Definition Language –  CREATE EXTERNAL_TABLE •  ANSI SQL operators and functions: what Presto supports •  Unsupported operations –  User-defined functions (UDF or UDAFs) –  Stored procedures –  Any transaction found in Hive or Presto https://prestodb.io http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
  • 8. Data formats supported by Athena •  Unstructured –  Apache logs, with customizable regular expression •  Semi-structured –  delimiter-separated values (CSV, OpenCSV) –  Tab-separated values (TSV) –  JSON •  Structured –  Apache Parquet https://parquet.apache.org/ –  Apache ORC https://orc.apache.org/ –  Apache Avro https://avro.apache.org/ •  Compression (Snappy, Zlib, GZIP) & partitioning
  • 9. Data partitioning •  Partitioning reduces the amount of scanned data –  Better performance –  Cost optimization •  Data may be already partitioned in S3 –  CREATE EXTERNAL TABLE table_name(…) PARTITIONED BY (...) –  MSCK REPAIR TABLE table_name •  Data can also be partitioned at table creation time –  CREATE EXTERNAL TABLE table_name(…) –  ALTER TABLE table_name ADD PARTITION …
  • 10. Running queries on Athena •  AWS Console (quite cool, actually) –  Wizard for schema definition and table creation –  Saved queries –  Query history –  Multiple queries in parallel •  JDBC driver –  SQL Workbench/J –  Java application http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
  • 11. Using columnar formats for fun and profit •  Apache Parquet •  Apache ORC •  Ditto: better performance & cost optimization •  You can convert your data to a columnar format with an Amazon EMR cluster •  More information and tutorial at http://docs.aws.amazon.com/athena/latest/ug/convert-to- columnar.html
  • 12. Demo
  • 13. GDELT Data set •  Global Database of Events, Language and Tone Database –  300 categories of political & diplomatic activities around the world –  Georeferenced to the city –  Dating back to January 1, 1979 –  http://www.gdeltproject.org/ •  1543 CSV files in S3 (146 GB) •  1 table (+ reference tables), 58 columns, 441M lines •  https://aws.amazon.com/public-datasets/gdelt/
  • 14. Using columnar formats for fun and profit •  Hive makes it easy to convert from CSV to Parquet https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html •  Large request –  CSV uncompressed : 26 seconds, 136GB scanned, $0.13 –  Parquet compressed : 4 seconds, 2.2GB scanned, $0.002
  • 15. Athena in a nutshell •  Run SQL queries on S3 data •  No infrastructure •  Multiple input formats supported •  Pretty fast! •  A simple, very cost-efficient option for ad-hoc analysis
  • 16. AWS User Groups Lille Paris Rennes Nantes Bordeaux Lyon Montpellier Toulouse Côte d’Azur (new!) facebook.com/groups/AWSFrance/ @aws_actus “Amazon Web Services France”