Fighting Against Chaotically Separated Values with Embulk

We created a plugin-based data collection tool that can read any chaotically formatted file called "CSV" by guessing its schema automatically.

Talked at csv,conf,v2 in Berlin
http://csvconf.com/


  1. Fighting Against Chaotically Separated Values with Embulk
     Sadayuki Furuhashi, Founder & Software Architect
     csv,conf,v2
  2. A little about me…
     Sadayuki Furuhashi
     An open-source hacker. github: @frsyuki
     A founder of Treasure Data, Inc., located in Silicon Valley.
     OSS projects I founded:
     Fluentd - Unified log collection infrastructure
     Embulk - Plugin-based ETL tool
  3. A little about me…
     MessagePack: "It's like JSON. but fast and small."
  4. What's Embulk?
     > An open-source parallel bulk data loader
       > that loads records from "A" to "B"
       > using plugins
       > for various kinds of "A" and "B" (storage, RDBMS, NoSQL, cloud services, etc.)
       > to make data loading easy,
       > which used to be very painful (broken records, transactions (idempotency), performance, …)
  5. The pains of bulk data loading
     Example: load a 10GB CSV file to PostgreSQL
     > 1. Run a script → fails!
     > 2. Improve the script to normalize records
       • Convert "2015-01-27T19:05:00Z" → "2015-01-27 19:05:00 UTC"
       • Convert "\N" → ""
       • many more normalizations…
     > 3. Second attempt → another error!
       • Convert "Inf" → "Infinity"
     > 4. Improve the script, fix, retry, fix, retry…
     > 5. Oh, some data are loaded twice!?
  6. The pains of bulk data loading
     Example: load a 10GB CSV file to PostgreSQL
     > 6. OK, the script worked well today.
     > 7. Register it in cron to sync data every day.
     > 8. One day… it fails with another error
       • Convert invalid UTF-8 byte sequences to U+FFFD
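The normalization steps listed on slides 5 and 6 can be sketched in a few lines of Python. This is only an illustration of the kind of hand-written script the talk describes, not Embulk code; the function names are hypothetical, and the NULL marker is assumed to be the backslash-N convention used by PostgreSQL and MySQL dumps.

```python
import csv
from datetime import datetime

NULL_MARKER = r"\N"  # assumption: backslash-N marks NULL in the source file


def normalize_field(value):
    """Apply the per-field conversions named on the slides."""
    if value == NULL_MARKER:
        return ""
    if value in ("Inf", "-Inf"):
        # PostgreSQL spells infinity as "Infinity" / "-Infinity".
        return value + "inity"
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return value  # not an ISO 8601 UTC timestamp: leave unchanged
    return parsed.strftime("%Y-%m-%d %H:%M:%S") + " UTC"


def normalize_line(raw):
    """Decode one raw CSV line and normalize every field.

    Invalid UTF-8 byte sequences are replaced with U+FFFD instead of
    aborting the whole load.
    """
    text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
    fields = next(csv.reader([text]))
    return [normalize_field(field) for field in fields]
```

Each new failure mode adds another branch to a script like this, which is the maintenance burden the slides complain about.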
  7. The pains of bulk data loading
     Example: load 10GB CSV × 720 files
     > Most scripts are slow.
       • People have little time to optimize bulk load scripts
     > One file takes 1 hour → 720 files take 1 month (!?)
     A lot of effort for each format & storage:
     > XML, JSON, Apache log format (+some custom), …
     > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
     > MongoDB, Elasticsearch, Redshift, Salesforce, …
  8. The problems:
     > Difficult to parse files correctly
       > How is the CSV file formatted?
     > Complex error handling
       > How to detect and remove broken records robustly?
     > Transactional load, or idempotent retrying
       > How to retry without duplicated loading?
     > Hard to optimize performance
       > How to parallelize the bulk data loading?
     > Many formats & storages in the world
       > How to save my time?
  9. The problems at Treasure Data
     What's "Treasure Data"?
     > "Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required."
     > Customers want to try Treasure Data, but
       > SEs write scripts to bulk load their data. Hard work :(
     > Customers want to migrate their big data, but
       > It's hard work :(
     > Fluentd solved streaming data collection, but
       > bulk data loading is another problem.
  10. Solution: Embulk is an open-source, plugin-based parallel bulk data loader that makes data loading easy and fast. (IMPORTANT!)
  11. (Diagram) Amazon S3, MySQL, FTP, CSV files, access logs, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis, …and many more by plugins.
      Reliable framework :-) Parallel execution, transaction, auto guess.
  12. Demo
  13. $ embulk selfupdate
      $ embulk example demo
      $ vi demo/csv/sample_01.csv.gz
      $ embulk guess demo/seed.yml -o config.yml
      $ embulk run config.yml
      $ vi config.yml
      $ embulk run config.yml

      Output section of config.yml:
      out:
        type: postgresql
        host: localhost
        user: pg
        password: ''
        database: embulk_demo
        table: sample1
        mode: replace

      Edits made to the sample CSV during the demo:
      :%s/,/\t/g
      :%s/""/"/g
      # Created by Sada
      # This is a comment
      \N
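To give a feel for what a guess step has to do, here is a minimal Python sketch that infers a delimiter and column types from a sample. It is not Embulk's guess plugin and makes no claim about how `embulk guess` works internally; the candidate delimiters, the NULL strings, the single timestamp format, and all function names are assumptions chosen for illustration.

```python
import csv
import io
from datetime import datetime

CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]  # assumed candidates
NULL_STRINGS = {"", r"\N"}                    # assumed NULL spellings


def _parses(convert, value):
    """Return True when convert(value) does not raise ValueError."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


# Checked in order; the first type that fits every value wins.
TYPE_CHECKS = [
    ("boolean", lambda v: v.lower() in ("true", "false")),
    ("long", lambda v: _parses(int, v)),
    ("double", lambda v: _parses(float, v)),
    ("timestamp", lambda v: _parses(
        lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"), v)),
]


def guess_delimiter(lines):
    """Pick the delimiter that appears most often on its worst line."""
    return max(CANDIDATE_DELIMITERS,
               key=lambda d: min(line.count(d) for line in lines))


def guess_type(values):
    present = [v for v in values if v not in NULL_STRINGS]
    if not present:
        return "string"
    for name, check in TYPE_CHECKS:
        if all(check(v) for v in present):
            return name
    return "string"


def guess_schema(sample):
    """Guess delimiter and column types, skipping '#' comment lines."""
    lines = [line for line in sample.splitlines()
             if line.strip() and not line.startswith("#")]
    if len(lines) < 2:
        raise ValueError("need a header line and at least one data line")
    delimiter = guess_delimiter(lines)
    rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = [(name, guess_type(col)) for name, col in zip(header, zip(*data))]
    return {"delimiter": delimiter, "columns": columns}
```

The sketch handles the same kinds of edits shown in the demo (tab delimiters, comment lines, a NULL marker) because comment lines are skipped and NULL strings are ignored when inferring a column's type.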
  14. Embulk's Plugin Architecture
      (Diagram) Embulk Core with Input, Filter, Output, Executor, and Guess plugins.
  15. Embulk's Plugin Architecture
      (Diagram) As above, with the Input side split into FileInput, Parser, and Decoder plugins.
  16. Embulk's Plugin Architecture
      (Diagram) As above, with the Output side also split into FileOutput, Formatter, and Encoder plugins.
  17. Examples of Plugins (input)
      File Input: Amazon S3, Google Cloud Storage, HDFS, Riak CS, SCP, FTP, …
      File Parser: CSV, JSON, MessagePack, Excel, Apache common logs, pcap format, XML / XPath, regexp, grok, …
      Input: PostgreSQL, MySQL, Oracle, Vertica, Redis, Amazon Redshift, Amazon DynamoDB, Salesforce.com, JIRA, Mixpanel, …
  18. Examples of Plugins (output)
      File Output: Amazon S3, Google Cloud Storage, HDFS, SFTP, SCP, FTP, …
      File Formatter: CSV, JSON, MessagePack, Excel, …
      Output: PostgreSQL, MySQL, Oracle, Vertica, Redis, Amazon Redshift, Elasticsearch, Salesforce.com, Treasure Data, BigQuery, …
  19. Examples of Plugins (filters)
      > Filtering columns out by conditions
      > Extracting values from a JSON column to columns (JSON flattening)
      > Converting User-Agent strings to browser name, OS name, etc.
      > Parsing a query string ("?k1=v1&k2=v2…") to columns
      > Applying a SHA1 hash to a column
      > …
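Two of the filters named on slide 19 are easy to picture as record-to-record functions. The Python below is a rough sketch of that idea only; it does not use the Embulk filter plugin API, and the record layout (a plain dict) and function names are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl


def expand_query_string(record, column):
    """Replace a query-string column ("?k1=v1&k2=v2") with one column per key."""
    expanded = {key: value for key, value in record.items() if key != column}
    # Keep only the part after the first "?" (or everything if there is none).
    query = record[column].split("?", 1)[-1]
    expanded.update(parse_qsl(query))
    return expanded


def sha1_column(record, column):
    """Return a copy of the record with one column replaced by its SHA1 hex digest."""
    hashed = dict(record)
    hashed[column] = hashlib.sha1(record[column].encode("utf-8")).hexdigest()
    return hashed
```

For example, `expand_query_string({"id": 1, "qs": "?k1=v1&k2=v2"}, "qs")` yields a record with `id`, `k1`, and `k2` columns.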
  20. Use case 1: Sync PostgreSQL to Elasticsearch
      PostgreSQL (embulk-input-postgresql)
      → column filter (embulk-filter-column): remove unnecessary columns
      → encrypt filter (embulk-filter-encrypt): encrypt password columns
      → Elasticsearch (embulk-output-elasticsearch)
  21. Use case 2: Load CSV on S3 to Analytics
      csv.gz on S3 (embulk-input-s3 + embulk-decoder-gzip + embulk-parser-csv)
      → Treasure Data (embulk-output-td), BigQuery (embulk-output-bigquery), Redshift (embulk-output-redshift)
      Distributed execution on Hadoop (embulk-executor-mapreduce)
  22. Use case 3: Embulk as a Service at Treasure Data
      (Diagram) REST API call, MySQL
  23. Internal Architecture
  24. Plugin API
      > A plugin is written in Java or Ruby (JRuby).
      > A plugin implements "transaction" and "task".
        > transaction controls the entire bulk loading session.
          > create a destination table, create a directory, commit the transaction, etc.
        > transaction creates multiple tasks.
          > tasks load data.
          > Embulk runs tasks in parallel.
          > Embulk retries tasks if necessary.
  25. Transaction stage & Task stage
      Transaction: runs on a single thread and creates taskCount tasks.
      Task: { taskIndex: 0, task: {…} } … { taskIndex: 2, task: {…} }; runs on multiple threads (or machines).
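The split between a single-threaded transaction stage and parallel, retryable tasks can be sketched as follows. This is a simplified Python model written for this transcript, not the real Java/JRuby plugin API: the class, method names, retry count, and thread-pool size are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


class ExampleOutputPlugin:
    """Hypothetical plugin with a transaction stage and a task stage."""

    def __init__(self):
        self.committed = None

    def transaction(self, config, task_count, control):
        # Transaction stage: runs once, on a single thread.
        task = {"table": config["table"]}      # e.g. create the destination table
        reports = control(task, task_count)    # run all tasks
        self.committed = reports               # e.g. commit the transaction
        return reports

    def run(self, task, task_index, records):
        # Task stage: loads one slice of the data.
        return {"taskIndex": task_index, "loaded": len(records)}


def run_with_retry(fn, attempts=3):
    """Call fn, retrying on any exception up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as error:  # a real loader would be more selective
            last_error = error
    raise last_error


def run_bulk_load(plugin, config, inputs):
    """Run one task per input slice in parallel inside a single transaction."""
    def control(task, task_count):
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [
                pool.submit(run_with_retry,
                            lambda i=i: plugin.run(task, i, inputs[i]))
                for i in range(task_count)
            ]
            return [future.result() for future in futures]

    return plugin.transaction(config, len(inputs), control)
```

Because the commit happens only after every task has reported back, a failed task can be retried without the destination seeing partial or duplicated data, which is the point of separating the two stages.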
  26. Transaction control
      fileInput.transaction {              (file input plugin)
        parser.transaction {               (parser plugin)
          filters.transaction {            (filter plugins)
            formatter.transaction {        (formatter plugin)
              fileOutput.transaction {     (file output plugin)
                executor.transaction {     (executor plugin)
                  … (runs the tasks)
                }
              }
            }
          }
        }
      }
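The nesting on slide 26 means each plugin's transaction wraps the next one, so setup runs from the outside in and commits run from the inside out. A small Python sketch of that ordering, using hypothetical stand-in functions rather than real plugins:

```python
STAGE_NAMES = ["fileInput", "parser", "filters",
               "formatter", "fileOutput", "executor"]


def make_stage(name, log):
    """Build a stand-in transaction that logs when it begins and commits."""
    def transaction(control):
        log.append("begin " + name)
        result = control()          # hand control to the next (inner) stage
        log.append("commit " + name)
        return result
    return transaction


def nest(stages, innermost):
    """Wrap `innermost` so that stages[0] ends up as the outermost transaction."""
    control = innermost
    for stage in reversed(stages):
        control = (lambda s, inner: lambda: s(inner))(stage, control)
    return control


def run_session():
    log = []
    stages = [make_stage(name, log) for name in STAGE_NAMES]
    nest(stages, lambda: log.append("run tasks"))()
    return log
```

Calling `run_session()` returns a log that begins with `begin fileInput`, reaches `run tasks` only after all six stages have begun, and ends with `commit fileInput`.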
  27. Task execution
      (Diagram) Within each task:
      parser.run(fileInput, pageOutput)   (parser plugin)
      fileInput.open()                    (file input plugin)
      … filter plugins …
      formatter.open(fileOutput)          (formatter plugin)
      fileOutput.open()                   (file output plugin)
  28. Parallel execution of tasks
      (Diagram) Tasks are taken from a task queue and run in parallel on threads (embulk-executor-local-thread).
  29. Distributed execution of tasks
      (Diagram) Tasks are taken from a task queue and run on Hadoop as map tasks (embulk-executor-mapreduce).
  30. Distributed execution (w/ partitioning)
      (Diagram) Tasks are taken from a task queue and run on Hadoop as Map - Shuffle - Reduce (embulk-executor-mapreduce).
      Useful to partition data by hour or day before loading data to a storage.
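Partitioning by hour before loading amounts to grouping records by a truncated timestamp, which is what a shuffle keyed on the hour achieves. The sketch below shows that grouping in plain Python on a single machine; it does not involve Hadoop or embulk-executor-mapreduce, and the column name and timestamp format are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_hour(records, time_column="time"):
    """Group records into buckets keyed by the hour of their timestamp."""
    partitions = defaultdict(list)
    for record in records:
        timestamp = datetime.strptime(record[time_column], "%Y-%m-%d %H:%M:%S")
        partitions[timestamp.strftime("%Y-%m-%d %H:00")].append(record)
    return dict(partitions)
```

Each bucket can then be written to its own file or table partition, so a later reload of one hour replaces only that hour's data.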
  31. Past & Future
  32. What's added since the first release?
      • v0.3 (Feb, 2015)
        • Resuming
        • Filter plugin type
      • v0.4 (Feb, 2015)
        • Plugin template generator
        • Incremental load (ConfigDiff)
        • Isolated ClassLoaders for Java plugins
        • Polyglot command launcher
  33. What's added since the first release?
      • v0.6 (Apr, 2015)
        • Executor plugin type
        • Liquid template engine
      • v0.7 (Aug, 2015)
        • EmbulkEmbed & Embulk::Runner
        • Plugin bundle (embulk-mkbundle)
        • JRuby 9000
        • Gradle v2.6
  34. What's added since the first release?
      • v0.8 (Jan, 2016)
        • JSON column type
        • Page scattering for more parallel execution
  35. Future plan
      • v0.9
        • Error plugin type (#27)
        • Stats & metrics (#199)
      • v0.10
        • More Guess (#242, #235)
        • Multiple jobs using a single config file (#167)
  36. Hacks (if time allows)
  37. Plugin Version Conflicts
      (Diagram) On one Java runtime, Embulk Core loads embulk-input-s3.jar with aws-sdk.jar v1.9 and embulk-output-redshift.jar with aws-sdk.jar v1.10. Version conflicts!
  38. Avoiding Conflicts in JVM
      (Diagram) Isolated environments: Class Loader 1 holds embulk-input-s3.jar with aws-sdk.jar v1.9; Class Loader 2 holds embulk-output-redshift.jar with aws-sdk.jar v1.10.
  39. Liquid template engine
      • A config file can include variables.
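The idea of variables in a config file can be illustrated with a tiny substitution function. This Python sketch is a stand-in for a template engine, not Liquid itself, and it supports only one placeholder form, `{{ env.NAME }}`, which is assumed here for illustration; the variable names are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*env\.(\w+)\s*\}\}")


def render_config(template, env):
    """Replace each {{ env.NAME }} placeholder with env["NAME"].

    Raises KeyError if a placeholder names a variable that is not set.
    """
    return PLACEHOLDER.sub(lambda match: env[match.group(1)], template)
```

With this, a line such as `host: {{ env.PG_HOST }}` rendered against `{"PG_HOST": "localhost"}` becomes `host: localhost`, so the same config file can be reused across environments and secrets can stay out of it.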
  40. ./embulk.jar
      $ ./embulk.jar guess example.yml
      executable jar!
  41. Header of embulk.jar
      : <<BAT
      @echo off
      setlocal
      set this=%~f0
      set java_args=
      rem ...
      java %java_args% -jar %this% %args%
      exit /b %ERRORLEVEL%
      BAT
      # ...
      exec java $java_args -jar "$0" "$@"
      exit 127
      PK...
  42. embulk.jar is a shell script (same header as slide 41)
      > ":" is a command that does nothing.
      > The lines from "<<BAT" to "BAT" are the argument of the ":" command (heredoc).
      > #!/bin/sh is optional. Empty first line means a shell script.
      > "exec java … -jar "$0"" runs java -jar $0.
      > The shell script exits at "exit 127" (following data is ignored).
  43. embulk.jar is a bat file (same header as slide 41)
      > ":" means a comment line.
      > The .bat exits at "exit /b %ERRORLEVEL%" (following lines are ignored).
  44. embulk.jar is a jar file (same header as slide 41)
      > The jar (zip) format ignores headers (file entries are in the footer).
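The claim that the zip format tolerates a leading header can be checked with Python's standard `zipfile` module, which locates the archive through the directory at the end of the file. The sketch below builds a small archive in memory and prepends a shortened script header to it; the header text and the manifest entry are illustrative only and are not the real contents of embulk.jar.

```python
import io
import zipfile


def build_polyglot(header, files):
    """Return `header` bytes followed by a zip archive containing `files`."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return header + archive.getvalue()


def list_entries(blob):
    """Open the blob as a zip archive, ignoring whatever precedes it."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Shortened, illustrative header in the spirit of slide 41.
HEADER = b': <<BAT\n@echo off\nrem ...\nBAT\nexec java -jar "$0" "$@"\nexit 127\n'
BLOB = build_polyglot(HEADER, {"META-INF/MANIFEST.MF": "Manifest-Version: 1.0\n"})
```

`list_entries(BLOB)` still returns the manifest entry even though the file begins with script text, which is why one file can serve as a shell script, a .bat file, and a jar at the same time.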
  45. Type conversion
      Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
      Embulk type system: boolean, long, double, string, timestamp
      Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
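Funnelling many source types into the five Embulk types can be pictured as a lookup table. The slide lists the type names but not which source type maps to which Embulk type, so the pairings below are a plausible guess for illustration, not the mapping used by any actual plugin; the fallback to `string` is likewise an assumption.

```python
import re

# Assumed pairings between the PostgreSQL type names on the slide and Embulk types.
POSTGRESQL_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_type(source_type):
    """Map a source column type to an Embulk type, defaulting to string."""
    # Drop a length or precision suffix such as "(255)" before the lookup.
    base = re.sub(r"\(.*\)", "", source_type).strip().lower()
    return POSTGRESQL_TO_EMBULK.get(base, "string")
```

Keeping the intermediate type system small means an input plugin and an output plugin each need only one such table, instead of one conversion per input and output pair.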