Embulk - The Evolving Bulk Data Loader

Embulk Meetup Tokyo #2


  1. Embulk - The Evolving Bulk Data Loader
     Sadayuki Furuhashi, Founder & Software Architect
     Embulk Meetup Tokyo #2
  2. A little about me…
     Sadayuki Furuhashi (github: @frsyuki), Founder & Software Architect
     > Fluentd - Unified log collection infrastructure
     > Embulk - Plugin-based parallel ETL
  3. What’s Embulk?
     > An open-source parallel bulk data loader
     > loads records from “A” to “B”
     > using plugins
     > for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.)
     > to make data integration easy,
     > which was very painful (broken records, transactions (idempotency), performance, …)
  4. The pains of bulk data loading
     Example: load a 10GB CSV file to PostgreSQL
     > 1. First attempt → fails
     > 2. Write a script to clean the records
       • Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
       • Convert “\N” → “”
       • many more cleanings…
     > 3. Second attempt → another error
       • Convert “Inf” → “Infinity”
     > 4. Fix the script, retry, retry, retry…
     > 5. Oh, some data got loaded twice!?
  5. The pains of bulk data loading
     Example: load a 10GB CSV file to PostgreSQL
     > 6. OK, the script worked.
     > 7. Register it with cron to sync data every day.
     > 8. One day… it fails with another error
       • Convert invalid UTF-8 byte sequences to U+FFFD
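
The cleaning steps listed in this example can be sketched in Python. This is only an illustration of the kind of hand-written script the slides describe, not Embulk code. The function names are hypothetical, and the NULL marker is assumed to be `\N`.

```python
from datetime import datetime, timezone


def clean_timestamp(value: str) -> str:
    """Convert an ISO 8601 UTC timestamp such as 2015-01-27T19:05:00Z
    into the '2015-01-27 19:05:00 UTC' form used in the example."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%d %H:%M:%S UTC")


def clean_null(value: str, null_marker: str = r"\N") -> str:
    """Replace the NULL marker (assumed here to be backslash-N) with an empty string."""
    return "" if value == null_marker else value


def clean_float(value: str) -> str:
    """Rewrite 'Inf' as 'Infinity', the spelling used in the example."""
    return "Infinity" if value == "Inf" else value


def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, replacing each invalid byte sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

Each rule tends to be discovered only after a load fails, which is why the retry loop in the example is so painful.
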
  6. The pains of bulk data loading
     Example: load 10GB CSV × 720 files
     > Most scripts are slow.
       • People have little time to optimize bulk load scripts
     > One file takes 1 hour → 720 files take 1 month (!?)
     A lot of integration effort for each storage and format:
     > XML, JSON, Apache log format (+some custom), …
     > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
     > MongoDB, Elasticsearch, Redshift, Salesforce, …
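
The one-month figure follows from simple arithmetic: 720 files at 1 hour each is 720 hours, or 30 days. A small sketch makes the effect of parallelism visible. The worker count of 8 is a hypothetical value, and the calculation assumes the files are independent and that workers scale perfectly.

```python
FILES = 720
HOURS_PER_FILE = 1  # from the example: one file takes about 1 hour


def total_days(files: int, hours_per_file: float, workers: int = 1) -> float:
    """Wall-clock days needed, assuming independent files and perfect scaling."""
    return files * hours_per_file / workers / 24


sequential = total_days(FILES, HOURS_PER_FILE)           # 30.0 days
parallel = total_days(FILES, HOURS_PER_FILE, workers=8)  # 3.75 days with 8 hypothetical workers
```
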
  7. The problems:
     > Data cleaning (normalization)
       • How to normalize broken records?
     > Error handling
       • How to remove broken records?
     > Idempotent retrying
       • How to retry without duplicated loading?
     > Performance optimization
       • How to optimize the code or parallelize?
  8. [Diagram: Embulk bulk-loads data between systems through plugins. Systems shown: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis]
     ✓ Parallel execution
     ✓ Data validation
     ✓ Error recovery
     ✓ Deterministic behavior
     ✓ Resuming
  9. Embulk’s Plugin Architecture [Diagram: Embulk Core surrounded by Input, Output, Executor, Filter, and Guess plugins]
  10. Embulk’s Plugin Architecture [Diagram: same layout, with the Input plugin broken down into FileInput, Parser, and Decoder plugins]
  11. Embulk’s Plugin Architecture [Diagram: same layout, with the Output plugin also broken down into FileOutput, Formatter, and Encoder plugins]
  12. Execution overview [Diagram: a Transaction, which runs on a single thread, creates taskCount Tasks, which run on multiple threads (or machines). Each Task receives its own index: { taskIndex: 0, task: {…} } … { taskIndex: 2, task: {…} }]
  13. Parallel execution [Diagram: Tasks are taken from a task queue and run in parallel on threads (embulk-executor-local-thread)]
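
The split between a single-threaded transaction and tasks that run in parallel can be sketched with Python's standard thread pool. This is a rough analogy, not Embulk's executor: `run_task`, `run_transaction`, and the shape of the `task` dictionary are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index: int, task: dict) -> dict:
    """Hypothetical task body: each task handles one slice of the input."""
    return {"taskIndex": task_index, "file": task["files"][task_index]}


def run_transaction(task: dict, max_threads: int = 4) -> list:
    """Runs on a single thread: decides task_count, then fans tasks out to a pool."""
    task_count = len(task["files"])
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() hands each task index to a worker thread and returns results in index order.
        return list(pool.map(lambda i: run_task(i, task), range(task_count)))


reports = run_transaction({"files": ["a.csv", "b.csv", "c.csv"]})
```

Every task receives the same shared `task` configuration plus its own index, which mirrors the `{ taskIndex, task }` pairs shown on the execution overview slide.
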
  14. Distributed execution [Diagram: Tasks from the task queue run as map tasks on Hadoop (embulk-executor-mapreduce)]
  15. Distributed execution (w/ partitioning) [Diagram: Tasks from the task queue run on Hadoop as Map - Shuffle - Reduce (embulk-executor-mapreduce)]
  16. Transaction control
      fileInput.transaction {            (file input plugin)
        parser.transaction {             (parser plugin)
          filters.transaction {          (filter plugins)
            formatter.transaction {      (formatter plugin)
              fileOutput.transaction {   (file output plugin)
                executor.transaction {   (executor plugin)
                  …                      (Task, Task)
                }
              }
            }
          }
        }
      }
  17. Task configuration
      fileInput.transaction { fileInputTask, taskCount →
        parser.transaction { parserTask, schema →
          filters.transaction { filterTasks, schema →
            formatter.transaction { formatterTask →
              fileOutput.transaction { fileOutputTask →
                executor.transaction { →
                  task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask, }
                  taskCount.times.inParallel { taskIndex →
                    run(taskIndex, task)
                  }
      } } } } } }
      > taskCount is decided by the input
      > schema is decided by the input, and may be modified by filters
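
The nested-transaction pattern can be sketched in Python with callbacks: each plugin's transaction decides its part of the configuration and passes it inward, and the innermost block runs the tasks. This is a simplified illustration, not Embulk's API. All function names, file names, and column names are hypothetical, and the formatter, file output, and executor levels are left out for brevity.

```python
def file_input_transaction(control):
    # The input decides how many tasks there are (here: one per file).
    files = ["a.csv", "b.csv"]
    return control({"files": files}, len(files))


def parser_transaction(control):
    # The parser decides the schema.
    return control({"delimiter": ","}, ["id", "name"])


def filter_transaction(schema, control):
    # A filter may modify the schema (here: it drops the first column).
    return control({"drop": schema[0]}, schema[1:])


def run(task_index, task):
    return f"task {task_index} writes columns {task['schema']}"


def transaction():
    return file_input_transaction(lambda input_task, task_count:
        parser_transaction(lambda parser_task, schema:
            filter_transaction(schema, lambda filter_task, out_schema:
                [run(i, {"input": input_task, "parser": parser_task,
                         "filter": filter_task, "schema": out_schema})
                 for i in range(task_count)])))
```

Because every level wraps the levels inside it, each plugin could commit or roll back its own part once the inner work has finished, which is presumably the point of the nesting.
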
  18. Task execution [Diagram: each Task wires together the file input, parser, filter, formatter, and file output plugins]
      fileInput.open()
      parser.run(fileInput, pageOutput)
      formatter.open(fileOutput)
      fileOutput.open()
  19. Type conversion
      > Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
      > Embulk type system: boolean, long, double, string, timestamp
      > Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
      Input types are converted to Embulk types by the input plugin (or by the parser plugin if the input is file-based).
      Embulk types are converted to output types by the output plugin (or by the formatter plugin if the output is file-based).
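
The input-side conversion can be pictured as a lookup table from source types to Embulk's five types. The mapping below is an assumption made for illustration. It is a plausible reading of the slide, not taken from any plugin's actual code, and the fallback to `string` is also hypothetical.

```python
# Assumed mapping from the PostgreSQL types on the slide to Embulk's five types.
POSTGRESQL_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_type(input_type: str) -> str:
    """Look up the Embulk type, falling back to string for unknown types (an assumption)."""
    return POSTGRESQL_TO_EMBULK.get(input_type, "string")
```

A small intermediate type system means each input or output plugin only has to know its own types and Embulk's, rather than every other system's.
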
  20. What’s added since the first release?
      • v0.3
        • Resuming
        • Filter plugin type
      • v0.4
        • Plugin template generator
        • Incremental execution (ConfigDiff)
        • Isolated ClassLoaders for Java plugins
        • Polyglot command launcher
  21. What’s added since the first release?
      • v0.6
        • Executor plugin type
        • Liquid template engine
      • v0.7
        • EmbulkEmbed & Embulk::Runner
        • Plugin bundle (embulk-mkbundle)
        • JRuby 9000
        • Gradle v2.6
  22. Resuming
      • Retries a failed transaction without retrying everything.
      • Skips successful tasks by using information stored in a file by the previous transaction.
      • embulk run config.yml -r resume-state.yml
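
The idea behind resuming can be sketched as follows: record which tasks finished, and skip them on the next attempt. This is a minimal illustration only. The JSON state file and the `run_with_resume` function are hypothetical and do not reflect the actual format of `resume-state.yml`.

```python
import json
from pathlib import Path


def run_with_resume(tasks, run_task, state_path):
    """Run every task, skipping those recorded as successful in state_path."""
    path = Path(state_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for index, task in enumerate(tasks):
        if index in done:
            continue  # finished in a previous attempt
        try:
            run_task(task)
        except Exception:
            # Persist progress so the next attempt can skip successful tasks.
            path.write_text(json.dumps(sorted(done)))
            raise
        done.add(index)
    path.unlink(missing_ok=True)  # everything succeeded; the state is no longer needed
```

Calling the function again with the same state file after a failure reruns only the tasks that had not completed.
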
  23. Filter plugin type
      • Filters rows out, filters columns out, or enriches the data.
      • 18 plugins released.
  24. Plugin template generator
      • Generates a template of a plugin.
      • The generated code is ready to compile.
        > You modify & compile it to do your work.
      • embulk new <category> <name>
  25. Incremental execution
      • Stores the last file name or row in a file, and the next execution starts from there.
      • Use case: sync new files on S3 to Elasticsearch every day.
      • embulk run config.yml -o next-config.yml
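
Incremental execution amounts to remembering a high-water mark between runs. The sketch below assumes that file names sort in arrival order, which is an assumption of this example. `select_new_files` and the file names are hypothetical and are not part of Embulk.

```python
def select_new_files(all_files, last_file=None):
    """Return files that sort after last_file, plus the value to store for the next run."""
    ordered = sorted(all_files)
    new_files = [f for f in ordered if last_file is None or f > last_file]
    next_last = new_files[-1] if new_files else last_file
    return new_files, next_last


# Day 1: nothing has been loaded yet, so every file is new.
day1, last = select_new_files(["logs/01.csv", "logs/02.csv"])
# Day 2: only the file that arrived after the stored name is picked up.
day2, last = select_new_files(["logs/01.csv", "logs/02.csv", "logs/03.csv"], last)
```

The value returned as `next_last` plays the role of the state written out by `-o next-config.yml`: it is the only thing the next run needs in order to continue from where this one stopped.
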
  26. Isolated ClassLoaders for Java plugins
      • Embulk can load multiple versions of Java plugins.
  27. Plugin Version Conflicts [Diagram: in one Java runtime with Embulk Core, embulk-input-s3.jar uses aws-sdk.jar v1.9 and embulk-output-redshift.jar uses aws-sdk.jar v1.10. Version conflicts!]
  28. Multiple ClassLoaders in JVM [Diagram: embulk-input-s3.jar with aws-sdk.jar v1.9 is loaded by Class Loader 1, and embulk-output-redshift.jar with aws-sdk.jar v1.10 by Class Loader 2, giving isolated environments]
  29. Polyglot launcher script
      • embulk.jar is a jar file.
      • embulk.jar is a shell script.
      • embulk.jar is a bat script.
      • It sets JVM options to improve performance.
      • ./embulk run abc
  30. Executor plugin type
      • embulk-executor-mapreduce executes tasks in a distributed environment.
  31. Liquid template engine
      • A config file can include variables.
  32. EmbulkEmbed & Embulk::Runner
      • Embed Embulk in an application.
  33. Plugin bundle
      • Uses fixed versions of plugins.
      • embulk mkbundle my-project
      • embulk run -b my-project config.yml
  34. Gradle v2.6
      • Continuous compiling.
      • “embulk migrate .” upgrades the Gradle version of your plugin project.
      • ./gradlew -t build
  35. Future plan
      • v0.8
        • JSON type (issue #306)
        • Error plugin type (#27, #124)
        • More (or less) concurrency for output (#231)
      • v0.9
        • More Guess (#242, #235)
        • Multiple jobs using a single config file (#167)
