Presto Meetup 2016 Small Start

1,775 views

Slides presented at Presto Meetup 2016.

Published in: Technology

  1. 1. Presto Meetup 2016 My Use Case: Small Start
  2. 2. Self-introduction • @toyama0919 • Analytics infrastructure. • Recently working on Embulk… • Using Presto for one and a half years.
  3. 3. Our development situation • We commonly use SQL. • Marketing staff don't write SQL. • I often write complicated SQL, around 100 lines long. • We love OSS. • We don't use UPDATE, INSERT, or DELETE through Presto.
  4. 4. Our business situation • We manage and operate a B2B website. • Our data lifecycle is long. • The business side doesn't write SQL. • They watch re:dash and Adobe Analytics. • Sales have increased for 15 straight years.
  5. 5. Analysts want data quickly
  6. 6. [Architecture diagram; labels: Ruby, Batch, Collect, Batch, Visualize, Data Store, (Digdag)]
  7. 7. Analytics priority 1. Direct SQL 2. Presto 3. ETL
  8. 8. Cost differs greatly from 1 to 3
  9. 9. Why use Presto? • Cross-server join • Window functions • UDFs
  10. 10. Cross Server Join
  11. 11. Join • Cross-server and cross-database. • A single Presto query can combine data from multiple sources. • We run join queries across multiple sources. • Reduces ETL pain.
  12. 12. Collect data in one place? • It is equivalent if one query can fetch the data. • I don't want duplicate data (master data, user data). • Collecting the data in one place has a high development cost.
  13. 13. with mysql_user as ( select user_id, user_name from mysql.schema.users ), redshift_user_log as ( select user_id, log_time from redshift.schema.pageview ) select mysql_user.user_id, mysql_user.user_name, count(*) from mysql_user inner join redshift_user_log on mysql_user.user_id = redshift_user_log.user_id group by mysql_user.user_id, mysql_user.user_name
  14. 14. UDF
  15. 15. Features MySQL does not support • Window functions • WITH queries – recursive WITH is not supported. • URL functions • Array data type • CROSS JOIN UNNEST
  16. 16. URL Function select url_encode('Presto最高'); => Presto%E6%9C%80%E9%AB%98 select url_decode('Presto%E6%9C%80%E9%AB%98'); => Presto最高
  17. 17. Regexp Function select regexp_extract_all('1a 2b 14m', '\d+') => [1, 2, 14] select regexp_extract( '超低床型自動梱包機 RQ-8LD', '([a-zA-Z0-9-]+)' ) => RQ-8LD
  18. 18. What SQL is not good at • String normalization. • Splitting CSV strings. • Morphological analysis.
  19. 19. Normalization select normalize(upper('hoge'), NFKC) #=> HOGE
  20. 20. Array type select split(keywords, ',') as keywords from mysql_keywords_table • Input (keywords): keyword1,keyword2,keyword3 • Output (keywords): ['keyword1','keyword2','keyword3']
  21. 21. horizontal to vertical SELECT keyword FROM mysql_keywords_table CROSS JOIN UNNEST(split(keywords, ',')) AS t (keyword)
  22. 22. horizontal to vertical • Input rows (keywords): keyword1,keyword2,keyword3 | keyword4,keyword5 | keyword6 | keyword1 • Output rows (keyword): keyword1 | keyword2 | keyword3 | keyword4 | keyword5 | keyword6 | keyword1
  23. 23. Window functions • We use window functions on MySQL data (Presto on MySQL). • The data source is MySQL, but Presto's functions are available. • However, MySQL's own functions cannot be used.
  24. 24. Rank function on mysql select company_id, category_id, count(*), rank() over ( partition by company_id order by count(*) desc ) from mysql.schema.mysql_table group by company_id, category_id
  25. 25. Other window functions • last_value • first_value • dense_rank • percent_rank
  26. 26. Prestogres • PostgreSQL protocol gateway for Presto. • Rewrites Presto queries before sending them to PostgreSQL. • Has password-based authentication and SSL.
  27. 27. Why Prestogres? • Connectivity from other applications. – pgAdmin, the psql command. – re:dash can connect to Presto over the PostgreSQL protocol. – But it can also connect to Presto directly. • Connecting to Presto requires a Presto client. – I don't want to use the Java client. • Presto's security is weak. – Authentication is handled by Prestogres.
  28. 28. Prestogres limitations • Prepared statements. – Presto does not support them either. – So embulk-input-postgresql does not work. • Can't fetch the schema via SQL. • Temporary tables • DROP TABLE
  29. 29. re:dash • Visualization platform written in Python. • Supports many data sources. • Share queries with team members. • Schedule queries (daily, hourly). • Very active contributions.
  30. 30. Presto queries increased rapidly with re:dash • The number of Presto queries increased more than tenfold. • It is no different from writing ETL on re:dash. • re:dash has a good reputation internally.
  31. 31. Okay, all analytics problems solved!
  32. 32. No... we can't escape from ETL
  33. 33. Embulk with Presto • We use our own embulk-input-presto plugin. – Supports the JSON type. • Create point-in-time data. • Create machine learning data.
  34. 34. Why Embulk? • Very active plugin ecosystem. • Complicated string analysis cannot be done with SQL alone. • Combining it with Digdag is very powerful. • We want to take the shortest path. • Fluentd is overworked..
  35. 35. Operation
  36. 36. Install by RPM • Presto has an RPM. – It is not distributed. – It must be built from source.. • Includes an init script. • But it does not support OpenJDK.. – Pull request in progress..
  37. 37. AWS integration • We run Presto on EC2. • We don't use EMR. • Workers are spot instances of multiple instance types. – This prevents them all going down at once.
  38. 38. Networking • Place the Presto cluster (coordinator and workers) in the same AZ. • Across AZs, traffic cost (and money) is very high. – Do not go multi-AZ.
  39. 39. Networking on AWS [diagram; labels: Availability Zone, Availability Zone, coordinator, worker, worker, worker]
  40. 40. Problems • Very large repository. • The coordinator is a SPOF. • Queries over a long time range cause OutOfMemoryError.
  41. 41. Very large repository • Monolithic application. – I want separate repositories. • The first build takes 30 minutes. • Subsequent builds take 10 minutes. • All connectors live in the main repository. – MongoDB, Kafka, Cassandra.. – Elasticsearch support is coming soon. • Hard to contribute.
  42. 42. Big change for JDBC • Supports predicate pushdown for multiple data types. • We used a patched Presto… • MySQL users, give it a try.
  43. 43. Impressions of Presto I have heard • It's an extension of Hadoop technology. => I don't know Hadoop. Presto has many connectors. • Parallel processing looks difficult. => Presto has no storage, so the impact is small. • I don't have such big data. => Neither do I; I'm not a big player.
  44. 44. Summary • Presto is great software. • It is not that difficult. • Let's use it more.
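
To make slides 21, 22 and 24 concrete, here is a minimal Python sketch of what `CROSS JOIN UNNEST(split(keywords, ','))` and `rank() over (partition by company_id order by count(*) desc)` compute. It only illustrates the semantics; it is not how Presto executes these queries. The function names `unnest_keywords` and `rank_categories` and the sample rows in the rank example are hypothetical, invented for this sketch.

```python
from collections import Counter, defaultdict


def unnest_keywords(rows):
    """Mimic CROSS JOIN UNNEST(split(keywords, ',')): one output row per keyword."""
    return [keyword for keywords in rows for keyword in keywords.split(",")]


def rank_categories(pairs):
    """Mimic rank() over (partition by company_id order by count(*) desc).

    `pairs` is an iterable of (company_id, category_id) rows. Returns tuples
    (company_id, category_id, count, rank); ties share a rank and leave a gap
    after them, as SQL rank() does.
    """
    counts = Counter(pairs)  # group by company_id, category_id
    by_company = defaultdict(list)
    for (company_id, category_id), n in counts.items():
        by_company[company_id].append((category_id, n))

    result = []
    for company_id in sorted(by_company):
        # order by count(*) desc within each partition
        ordered = sorted(by_company[company_id], key=lambda item: -item[1])
        rank = 0
        previous = None
        for position, (category_id, n) in enumerate(ordered, start=1):
            if n != previous:  # a new count value starts a new rank
                rank = position
                previous = n
            result.append((company_id, category_id, n, rank))
    return result


if __name__ == "__main__":
    # Input rows taken from slide 22.
    rows = ["keyword1,keyword2,keyword3", "keyword4,keyword5", "keyword6", "keyword1"]
    print(unnest_keywords(rows))

    # Hypothetical (company_id, category_id) rows for the rank example.
    sample = [(1, "a"), (1, "a"), (1, "b"), (1, "c"), (2, "a")]
    for row in rank_categories(sample):
        print(row)
```

The first call reproduces the seven output rows of slide 22. In the rank example, company 1 has category `a` at rank 1 and categories `b` and `c` tied at rank 2.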