This presentation highlights the need for Presto in a modern data analytics stack.
Reference architecture for a multi-tenant Presto offering on GCP at Walmart.
Components used for automated deployment, auto-scaling, caching, and security integrations with open-source Presto.
Agenda
• Data stores @ Walmart Labs
• Need for Presto PaaS offering
• Presto in GCP
• Presto deployment & auto-scaling
• Authentication & Authorization
• Monitoring
• Best practices and tuning
Data stores @ Walmart Labs
Access needs vary from team to team – one solution does not fit all…
Motivation for Presto..
• DataLake cluster - powered by on-prem Hadoop/HDFS
• Compute storage colocation – GOOD
• Need to ingest data from all diverse sources – CHALLENGING
• Scaling out compute with growing needs – CHALLENGING
• Need to separate storage & compute / support federated query capability – PRESTO..
• Isolated clusters in private cloud powering dedicated data-marts
Data journey
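Presto's federated query capability lets a single query span the diverse stores above. A hypothetical example (catalog, schema, and table names are illustrative, not from the deck; both catalogs are assumed to be configured):

```sql
-- Join an on-prem Hive table with a Cassandra table in one Presto query,
-- without moving the data into a single store first.
SELECT o.order_id, o.total, c.loyalty_tier
FROM hive.sales.orders o
JOIN cassandra.crm.customers c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2019-01-01';
```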
Presto & Alluxio
Works well together…
[Charts: small-range query response time (lower is better), large scan query response time (lower is better), and concurrency (higher is better) – Presto vs. Presto + Alluxio]
• Leverage public clouds – GCP
• Better scalability and effective cluster utilization via auto-scaling
• Performant query response times
• Security
– Authentication – LDAP
– Authorization – work with existing policies
• Handle sensitive data – encryption at rest & over the wire
• Efficient Monitoring & alerting
PaaS offering - requirements
• Cloud Dataproc init scripts or optional component image -
https://cloud.google.com/dataproc/docs/tutorials/presto-dataproc
– Super easy to spawn a Presto cluster
– Elevated cost due to managed services such as Dataproc
– Overhead of additional Hadoop components
– Difficult to source a new catalog or deploy config changes
• Alluxio – no GCP-managed deployment
• Presto-admin – can be used for deployment and configuration, but not auto-scaling
• Need for a lower-level deployment strategy
Presto on GCP
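The Dataproc route above can be as simple as one command. A sketch (cluster name, region, and image version are placeholders; see the linked tutorial for the init-action variant):

```shell
# Spawn a Dataproc cluster with the Presto optional component enabled.
gcloud dataproc clusters create presto-demo \
  --region=us-central1 \
  --image-version=1.4 \
  --optional-components=PRESTO
```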
• Open source Presto data-ops - https://github.com/takari/presto-dataops
• Framework to deploy and auto-scale Presto cluster in GCP
• Leverages Ansible & GCP Deployment Manager
• Auto-scaling via configurable cluster wide CPU & memory usage threshold
• Our recent changes – will be released soon
– Alluxio deployment co-located with Presto workers
– Efficient configurability – suitable for multiple envs
– More auto-scaling configs
GCP Presto auto-deployment
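The threshold-based auto-scaling decision described above can be sketched as follows. The thresholds, step size, and function name are illustrative; the actual presto-dataops configuration keys and logic may differ:

```python
# Sketch of cluster-wide, threshold-based auto-scaling (illustrative values).

def scale_decision(cpu_util, mem_util, scale_up=0.75, scale_down=0.30, step=2):
    """Return the change in worker count given cluster-wide CPU/memory utilization.

    Scale up when either resource is above its high-water mark; scale down
    only when both are below the low-water mark; otherwise hold steady.
    """
    if cpu_util >= scale_up or mem_util >= scale_up:
        return +step   # add workers
    if cpu_util <= scale_down and mem_util <= scale_down:
        return -step   # remove workers
    return 0           # no change
```

Requiring both resources to be low before scaling down avoids flapping when, say, CPU is idle but memory is still holding query state.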
• Access interface
– Presto CLI
– Lower-level REST endpoints
– Presto JDBC
• Coordinator LDAP Auth
– Authorization based on AD group membership
– Consider using a firewall to limit access to the HTTP endpoint
– Wrap the Presto CLI to use HTTPS endpoints
• Hadoop-backed Hive catalog
– Kerberos authentication; leverage impersonation for fine-grained authorization and auditing
– Known issue – Hive metastore impersonation is not supported; Presto requires explicit read access on the file system
Presto authentication
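Coordinator LDAP password authentication is configured roughly as below (hostnames, keystore paths, and the bind pattern are placeholders; consult the Presto LDAP docs for your deployment):

```properties
# etc/config.properties (coordinator) – enable password auth over HTTPS
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto/keystore.jks
http-server.https.keystore.key=changeit

# etc/password-authenticator.properties
password-authenticator.name=ldap
ldap.url=ldaps://ldap.example.com:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com
```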
• Caching enabled for Ranger policies
• Ranger policy check for authorization happens before the query parsing phase
• Ranger policy check is done against both the username and Unix groups
Presto Ranger authorization
System catalog - https://prestosql.io/docs/current/connector/system.html
• system.runtime.queries - information about currently running queries
– Check whether queries are spending too much time in the queued/analysis state
– Identify long-running queries hogging cluster resources
• system.runtime.tasks - information about how many rows and bytes are processed by each task
– Identify query performance issues by looking at split and data distribution across nodes
– Identify parallelism issues for shuffle-intensive queries
Presto monitoring – key metrics
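The checks above map to simple queries against the system catalog. For example (column names per the linked connector docs):

```sql
-- Queries stuck in the queue
SELECT query_id, user, state, created
FROM system.runtime.queries
WHERE state = 'QUEUED';

-- Oldest queries still executing (candidates for hogging resources)
SELECT query_id, user, query
FROM system.runtime.queries
WHERE state = 'RUNNING'
ORDER BY created
LIMIT 10;
```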
JMX catalog - https://prestosql.io/docs/current/connector/jmx.html
Ability to query all MBeans from all nodes in the cluster.
• jmx.current."java.lang:type=memory"
Utilization of memory: heap & off-heap, etc.
• jmx.current."java.lang:type=operatingsystem"
Utilization of swap size, file descriptors, CPU load average, etc.
Presto monitoring – key metrics
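For example, per-node heap usage can be pulled from the memory MBean (the JMX connector exposes MBean attributes as columns; exact column names and types depend on the JVM and Presto version):

```sql
SELECT node, heapmemoryusage
FROM jmx.current."java.lang:type=memory";
```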
• ORC compression – ZLIB
– Point queries perform well with Snappy
– For large aggregations, ZLIB is better
• Enable bloom filters on columns frequently used in filters
• Enable sorting on frequently used columns (boosts query performance at the cost of higher ingestion time)
• Increase ORC stripe & stride size
– ORC files are splittable at the stripe level, which affects parallelism
– We observed an 18%–22% increase in Presto parallelism (after setting stripe size = 128 MB and index stride = 16K)
• Enable table & column stats (most important)
– Stats can now be computed via Presto - https://prestosql.io/docs/current/sql/analyze.html
ORC storage recommendations
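The recommendations above translate to Hive ORC table properties plus a Presto ANALYZE. A sketch (table and column names are illustrative; the property keys are the standard Hive ORC ones):

```sql
-- Hive DDL: ZLIB compression, bloom filter, larger stripe/stride
CREATE TABLE sales_orc (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',
  'orc.bloom.filter.columns' = 'customer_id',
  'orc.stripe.size' = '134217728',    -- 128 MB stripes
  'orc.row.index.stride' = '16000'    -- ~16K-row index stride
);

-- Presto: populate table & column statistics for the optimizer
ANALYZE hive.default.sales_orc;
```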