These slides were presented by Avinash Ramineni of Clairvoyant to the Atlanta Apache Spark User Group on Wednesday, March 22, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238109721/
Slide 4
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS ?
Slide 5
Challenges
• Data in Silos
• Data acquires different perspectives as it is moved between systems
• Data availability delays
• Legacy systems struggle to handle the Volume, Velocity, and Veracity of data
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
Slide 6
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
Slide 8
Self-Service at all Levels
Ingest → Organize → Enrich → Analyze → Dashboards
Ingest → Organize → Enrich → Analyze → Insights
Slide 9
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring Your Own Cluster (BYOC)
• HA / DR
• Open Source Stack
Slide 10
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
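The separation above can be sketched as a couple of Spark configuration lines that point a compute cluster at shared S3-backed storage; the bucket name and credentials provider here are assumptions, not from the deck:

```
# spark-defaults.conf — point Spark SQL at a shared, S3-backed warehouse
spark.sql.warehouse.dir                        s3a://shared-data-lake/warehouse
spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.InstanceProfileCredentialsProvider
```

With storage externalized like this, a cluster can be resized or replaced without moving data, which is what makes the hardware-upgrade and DR points above simpler.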
Slide 11
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables a usage-based cost model
(Diagram: Marketing, Personalization, and Main clusters each attached to centralized/common S3 storage)
Slide 13
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
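A minimal sketch of the "DB Ingestor" pattern above: pull only rows newer than the last ingested watermark. `sqlite3` stands in for the source database, and the table/column names (`events`, `updated_at`) are illustrative assumptions:

```python
import sqlite3

def ingest_increment(conn, last_watermark):
    """Incremental DB ingest: fetch only rows newer than the last watermark."""
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest row seen, so the next run resumes here.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo with an in-memory stand-in for the source database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, "a", 100), (2, "b", 200), (3, "c", 300)])

rows, wm = ingest_increment(conn, last_watermark=100)  # rows 2 and 3 are new
```

The stream and file ingestors follow the same shape: track a position (Kafka offset, file checkpoint) and pull only what is new.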
Slide 14
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data as Parquet
• Derivatives
• Compaction
• Standardization
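The logical buckets above can be sketched as a path convention; the bucket name `data-lake` and the source/dataset naming are assumptions for illustration:

```python
# The four logical zones the storage layer is carved into (from the slide).
ZONES = ("landing", "raw", "derived", "delivery")

def storage_path(zone, source, dataset):
    """Build the logical-bucket prefix for a dataset (bucket name is an assumption)."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://data-lake/{zone}/{source}/{dataset}"

print(storage_path("raw", "crm", "accounts"))
# s3://data-lake/raw/crm/accounts
```

Platform jobs then move data between zones: text-to-Parquet conversion writes from `landing` into `raw`, and derivative jobs write from `raw` into `derived`.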
Slide 15
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented via SQL on the Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh
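A hedged sketch of "represented via SQL on the Delivery Layer": a derivative materialized in the derived zone is exposed to consumers as a SQL view. All schema, table, and column names here are illustrative, not from the deck:

```sql
-- Hypothetical derivative exposed through the delivery layer.
CREATE VIEW delivery.daily_clicks AS
SELECT user_id, dt, COUNT(*) AS clicks
FROM derived.click_events
GROUP BY user_id, dt;
```

Because the derivative is metadata-driven, a refresh can regenerate the backing table in the derived layer without consumers changing the SQL they query.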
Slide 17
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• Partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip
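The "Cache Tables" and "Compression Codec" bullets can be sketched as Spark SQL statements issued through the Thrift Server session (e.g. via beeline); the table name is an illustrative assumption:

```sql
-- Snappy decompresses faster; gzip compresses smaller. Pick per workload.
SET spark.sql.parquet.compression.codec=snappy;

-- Pin a hot table in executor memory for concurrent BI queries.
CACHE TABLE delivery.daily_clicks;
```

Caching trades executor memory for latency, so it suits small, frequently queried delivery tables rather than large raw ones.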
Slide 18
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark
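The "IAM Roles" bullet can be sketched as a policy granting one department's cluster read access to its prefix of the shared bucket; the bucket name and prefix are assumptions:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:ListBucket", "s3:GetObject"],
    "Resource": [
      "arn:aws:s3:::data-lake",
      "arn:aws:s3:::data-lake/raw/marketing/*"
    ]
  }]
}
```

Attaching such a policy to an instance profile role means each BYOC cluster gets data access scoped at the storage layer, without sharing credentials.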
Slide 19
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction
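The small-files/compaction takeaway reduces to simple arithmetic: coalesce each partition's many small files into a few files near the block size. The 128 MiB target here is a common choice, assumed rather than stated in the deck:

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # target output file size (assumption)

def compaction_file_count(total_bytes):
    """How many output files to coalesce a partition's small files into."""
    return max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

# e.g. thousands of small streaming files totalling 1 GiB collapse into 8 files
print(compaction_file_count(1024 * 1024 * 1024))  # 8
```

In Spark this count would typically feed `coalesce()`/`repartition()` before rewriting the partition, cutting the per-file open/list overhead that causes the performance issues above.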
Slide 20
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies
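The partition-strategy takeaway can be sketched as Hive-style date partitioning, which keeps partition counts bounded and lets query engines prune by date; the path and table names are illustrative assumptions:

```python
from datetime import date

def partition_path(table_root, d):
    """Hive-style date partition path: enables pruning on year/month/day predicates."""
    return f"{table_root}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

print(partition_path("s3://data-lake/derived/clicks", date(2017, 3, 22)))
# s3://data-lake/derived/clicks/year=2017/month=03/day=22
```

The balancing act from the slide: too few partitions limits parallelism, while too many produces the small-files problem the compaction jobs exist to fix.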