SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
1Page
Building a Self Service Analytics
Platform on Hadoop
Avinash Ramineni
2Page
Clairvoyant
3Page
Clairvoyant Services
4Page
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS ?
5Page
Challenges
• Data in Silos
• Acquires Perspectives as data is moved
• Data availability delays
• Legacy Systems handling the Volume , Veracity and Velocity
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
6Page
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
7Page
Without Self-Service
8Page
Self-Service at all Levels
Ingest Organize Enrich Analyze Dashboards
AnalyzeIngest Organize Enrich Insights
9Page
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring your own Compute (BYOC)
• HA / DR
• Open Source Stack
1
0
Page
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
1
1
Page
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables usage based cost model
Centralized / Common S3 Storage
Marketing
Cluster
Centralized
Storage
Personalization
Cluster
Main
Cluster
1
2
Page
Architecture
1
3
Page
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
1
4
Page
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data Parquet
• Derivatives
• Compaction
• Standardization
1
5
Page
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented Via SQL on Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh
1
6
Page
Key Takeaways - Cloud
• Hadoop Cloud ready-ness
• Cloudera Director Limitations
• Multi-Availability zone, regions
• Storage
• Instance Storage
• EBS Volumes
• gp2 vs st1
• S3 Eventual Consistency
1
7
Page
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip
1
8
Page
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark
1
9
Page
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction
2
0
Page
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies
2
1
Page
Data Security
2
2
Page
Questions
• Principal @ Clairvoyant
• Email: avinash@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/avinashramineni

Más contenido relacionado

La actualidad más candente

Big Data and Data Virtualization
Big Data and Data VirtualizationBig Data and Data Virtualization
Big Data and Data Virtualization
Kenneth Peeples
 

La actualidad más candente (20)

Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricUsing a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
 
Creating an Enterprise AI Strategy
Creating an Enterprise AI StrategyCreating an Enterprise AI Strategy
Creating an Enterprise AI Strategy
 
Modern Data Discovery and Integration in Retail Banking
Modern Data Discovery and Integration in Retail BankingModern Data Discovery and Integration in Retail Banking
Modern Data Discovery and Integration in Retail Banking
 
Necessity of Data Lakes in the Financial Services Sector
Necessity of Data Lakes in the Financial Services SectorNecessity of Data Lakes in the Financial Services Sector
Necessity of Data Lakes in the Financial Services Sector
 
Accelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data FabricAccelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data Fabric
 
The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...
The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...
The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...
 
Solution architecture for big data projects
Solution architecture for big data projectsSolution architecture for big data projects
Solution architecture for big data projects
 
Sustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive AnalyticsSustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive Analytics
 
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESBData Integration Alternatives: When to use Data Virtualization, ETL, and ESB
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Data Services Marketplace
Data Services MarketplaceData Services Marketplace
Data Services Marketplace
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by Denodo
 
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningRisk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
 
Big Data and Data Virtualization
Big Data and Data VirtualizationBig Data and Data Virtualization
Big Data and Data Virtualization
 
Cortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogCortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data Catalog
 
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup
ML Infra @ Spotify: Lessons Learned - Romain Yon -  NYC ML MeetupML Infra @ Spotify: Lessons Learned - Romain Yon -  NYC ML Meetup
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup
 
Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...
Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...
Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...
 
Supporting Data Services Marketplace using Data Virtualization
Supporting Data Services Marketplace using Data VirtualizationSupporting Data Services Marketplace using Data Virtualization
Supporting Data Services Marketplace using Data Virtualization
 
Denodo Data Virtualization - IT Days in Luxembourg with Oktopus
Denodo Data Virtualization - IT Days in Luxembourg with OktopusDenodo Data Virtualization - IT Days in Luxembourg with Oktopus
Denodo Data Virtualization - IT Days in Luxembourg with Oktopus
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 

Destacado

Destacado (7)

Big data it’s impact on the finance function
Big data it’s impact on the finance functionBig data it’s impact on the finance function
Big data it’s impact on the finance function
 
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
 
Building enterprise advance analytics platform
Building enterprise advance analytics platformBuilding enterprise advance analytics platform
Building enterprise advance analytics platform
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 

Similar a Building A Self Service Analytics Platform on Hadoop

Presentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation PresentationPresentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation Presentation
bangel105
 
Keynote oracle days final 16x9 v3.alain
Keynote oracle days final 16x9 v3.alainKeynote oracle days final 16x9 v3.alain
Keynote oracle days final 16x9 v3.alain
Doina Draganescu
 

Similar a Building A Self Service Analytics Platform on Hadoop (20)

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
Presentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation PresentationPresentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation Presentation
 
Managing storage on Prem and in Cloud
Managing storage on Prem and in CloudManaging storage on Prem and in Cloud
Managing storage on Prem and in Cloud
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullySQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
Keynote oracle days final 16x9 v3.alain
Keynote oracle days final 16x9 v3.alainKeynote oracle days final 16x9 v3.alain
Keynote oracle days final 16x9 v3.alain
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.
 
What are clouds made from
What are clouds made fromWhat are clouds made from
What are clouds made from
 
Apache Geode Meetup, London
Apache Geode Meetup, LondonApache Geode Meetup, London
Apache Geode Meetup, London
 
What is Cloud computing?
What is Cloud computing?What is Cloud computing?
What is Cloud computing?
 
Accelerating Business Intelligence Solutions with Microsoft Azure pass
Accelerating Business Intelligence Solutions with Microsoft Azure   passAccelerating Business Intelligence Solutions with Microsoft Azure   pass
Accelerating Business Intelligence Solutions with Microsoft Azure pass
 
Data Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryData Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal History
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Building A Self Service Analytics Platform on Hadoop

  • 1. 1Page Building a Self Service Analytics Platform on Hadoop Avinash Ramineni
  • 4. 4Page Quick Poll • Big Data Deployments in Prod • Hadoop Distributions • People use Ecosystems rather than tools • Architecture was implemented on Cloudera • Cloud Experience – AWS ?
  • 5. 5Page Challenges • Data in Silos • Acquires Perspectives as data is moved • Data availability delays • Legacy Systems handling the Volume , Veracity and Velocity • Extracting data from legacy systems • Lack of Self-Service Capabilities • Knowledge becomes tribal – instead of institutional • Security / Compliance Requirements
  • 6. 6Page Data Lake Attributes • Data Democratization • Data Discovery • Data Lineage • Self-Service capabilities • Metadata Management
  • 8. 8Page Self-Service at all Levels Ingest Organize Enrich Analyze Dashboards AnalyzeIngest Organize Enrich Insights
  • 9. 9Page Key Design Tenets • Separation of Compute and Storage • Independently scale compute and storage • Data Democratization and Governance • Bring your own Compute (BYOC) • HA / DR • Open Source Stack
  • 10. 1 0 Page Separation of Compute and Storage • Scale storage and compute independently • Shifts bottleneck from Disk IO to Network • Centralized Data Storage • Data Democratization • No data duplication • Easier Hardware upgrade paths • Flexible Architecture • DR Simplified
  • 11. 1 1 Page BYOC (Bring Your Own Cluster) • Each department/application can bring its own Hadoop cluster • Eliminates the need for very large clusters • Easier to administer and maintain • Reduces multi-tenancy issues • Clusters can be upgraded independently • Enables usage based cost model Centralized / Common S3 Storage Marketing Cluster Centralized Storage Personalization Cluster Main Cluster
  • 13. 1 3 Page Architecture – Data Ingestion Layer • DB Ingestor • Stream Ingestor • Kafka and Spark Streaming • File Ingestor • FTP / SFTP / Logs • Ingestion using Service API
  • 14. 1 4 Page Architecture – Data Processing Layer • Storage layer carved into logical buckets • Landing, Raw, Derived and Delivery • Schema stored with data (no guesswork) • Platform Jobs • Converting text to Parquet • Saving streaming data Parquet • Derivatives • Compaction • Standardization
  • 15. 1 5 Page Architecture – Data Delivery Layer • Data Delivery • SQL - Spark Thrift Server / Impala • Tableau, SQL IDE, Applications • Self Service • Derivatives • Represented Via SQL on Delivery Layer • Stored in Derived Storage Layer • Metadata driven • Derived Layer Generators • Long running Spark Job • Derivative Refresh
  • 16. 1 6 Page Key Takeaways - Cloud • Hadoop Cloud ready-ness • Cloudera Director Limitations • Multi-Availability zone, regions • Storage • Instance Storage • EBS Volumes • gp2 vs st1 • S3 Eventual Consistency
  • 17. 1 7 Page Key Takeaways - Spark Thrift Server • Spark Thrift Server Support • Performance Tuning • Concurrency • partition strategy • Cache Tables • Compression Codec for Parquet • Snappy vs gzip
  • 18. 1 8 Page Key Takeaways - Security • Secure by Design, Secure by Default • Access to Data on S3 • IAM Roles • Sentry • Support for Spark • Kerberos • Spark Thrift Server • Navigator • Support for Spark
  • 19. 1 9 Page Key Takeaways - General • Rapidly Changing Technology • Feature addition • Documentation • Bugs • Jar hell • Small files • Performance Issues • Compaction
  • 20. 2 0 Page Key Takeaways - General • Partition Strategy • Parquet Files • Balancing parallelism and throughput • Table Partitions • Cluster sizing, optimization and tuning • Integrating with Corporate infrastructure • Deployment practices • Monitoring and Alerting • Information Security Policies
  • 22. 2 2 Page Questions • Principal @ Clairvoyant • Email: avinash@clairvoyantsoft.com • LinkedIn: https://www.linkedin.com/in/avinashramineni