A Must-See for Data Scientists! The M-1 Grand Prix


A must-see for data scientists!
The M-1 Grand Prix
Who stands at the pinnacle of preprocessing!?
The contestants include R's dplyr, PostgreSQL, and NYSOL's M-Commands.

Published in: Data & Analytics

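  1. 2014/8/30, the 42nd R Study Group @ Tokyo (#TokyoR)
  2. Introducing the contestants
  3. Contestant 1: R (denoted R_base). "I'll take on this contest without using any packages! People tell me I'm pure."
  4. Contestant 2: R (denoted R_pkg). "I'll compete with 'dplyr' and 'data.table', reputed to be the strongest packages! The combo finishing move: %>%! High speed: fread()!"
  5. Contestant 3: PostgreSQL. "I'll show you the power of a database! The time-honored SQL statement!"
  6. Contestant 4: NYSOL (read 「にそる」, "nisoru"). Open-source software born in Japan. Its M-Commands are chained with UNIX pipes rather than %>%.
  7. Because of appearance fees and time constraints, only the four contestants above take part.
  8. Rules
  9. The rules are simple! CSV data -> preprocessing -> processed CSV data. The contestant with the shortest elapsed time wins! Elapsed time: the total time taken to apply several preprocessing tasks to each CSV dataset, up to the point where the processed CSV data has been saved.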
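  10. The preprocessing tasks
  11. Preprocessing 1: column selection (denoted selCol). Input columns: A, B, C, D. Column selection (B, C) outputs columns B, C.
  12. Preprocessing 2: row selection (denoted selRow). Input columns: A, B, C, D, with B values あ, い, う, あ. Row selection (B = あ) keeps the two rows where B is あ.
  13. Preprocessing 3: column calculation (denoted aggregating). Input columns: A, B, C, D, with (B, C) values (8, 2), (5, 1), (3, 1). Column calculation (E = B - C) appends column E with values 6, 4, 2.
  14. Preprocessing 4: sorting (denoted sorting). Input columns: A, B, C, D, with (B, C) values (あ, 2), (い, 1), (う, 1), (あ, 1). Sorting by (B, C) gives (あ, 1), (あ, 2), (い, 1), (う, 1).
  15. Preprocessing 5: combined (denoted mix). Input columns: A, B, C, D, with (B, C) values (あ, 2), (い, 1), (う, 1), (あ, 1). Combining preprocessing 1 to 4 outputs columns B, C, E with rows (あ, 1, 2), (あ, 2, 6).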
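  16. Details of the CSV data
  17. Open, raw data. First, multiple datasets in the same format were concatenated into one (more than 100 million records). http://stat-computing.org/dataexpo/2009/
  18. Random samples drawn from the concatenated data (six CSV datasets). All six have 29 columns; they differ only in the number of records.
      No.  Records       Size
      1    1,000         approx. 100 KB
      2    10,000        approx. 1 MB
      3    100,000       approx. 10 MB
      4    1,000,000     approx. 100 MB
      5    10,000,000    approx. 1 GB
      6    100,000,000   approx. 10 GB
  19. Test environment
      OS: OS X Version 10.9.4 (MacBook Pro)
      CPU: 2.4 GHz Intel Core i7 (4 cores)
      Memory: 16 GB (1600 MHz DDR3)
      Storage: SSD
      Software: R version 3.0.3, PostgreSQL version 9.3.4, NYSOL version 1.1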
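  20. A question for everyone in the audience
  21. Press the switch at your seat for the number you predict will win! 1: R (without packages) 2: R (with packages) 3: PostgreSQL 4: NYSOL
  22. Press the switch at your seat for the number you predict will win! 1: R (without packages) 2: R (with packages) 3: PostgreSQL 4: NYSOL. Results, in the order shown on the slide: 0%, 45%, 5%, 50%.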
  23. Dataset list (repeated from slide 18).
  24. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis mark: 0.5 s]
  25. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL, by task: column selection, row selection, column calculation, sorting, combined; axis mark: 0.5 s]
  26. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis mark: 0.5 s]
  27. Dataset list (repeated from slide 18).
  28. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis mark: 1 s]
  29. Dataset list (repeated from slide 18).
  30. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis marks: 5 s, 1 s]
  31. Dataset list (repeated from slide 18).
  32. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis marks: 45 s, 5 s]
  33. Dataset list (repeated from slide 18).
  34. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis marks: 5 min, 1 min]
  35. Dataset list (repeated from slide 18).
  36. [Chart: elapsed time for R_base, R_pkg, PostgreSQL, and NYSOL; axis marks: 1 hour, 30 min, 10 min] For R, some preprocessing tasks could not be measured because of memory errors.
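  37. Announcing the results!
  38. (R_pkg)
  39. Summary and discussion
      NYSOL was the fastest in all 30 tests (6 datasets, 5 preprocessing tasks).
      PostgreSQL was the next fastest after NYSOL, but indexes were not used this time, so there is still room for further performance improvement.
      Comparing R_base and R_pkg, R_pkg's elapsed time improves once the data reaches 10 MB or more, and it is especially fast on the combined (mix) task at 1 GB. For data smaller than 10 MB, R_base tended to be faster.
      R also has room for performance improvement through how the scripts are written, the choice of packages, and so on. (magrittr? pipeR?)
  40. The next M-1 Grand Prix
      The original M-1 (manzai comedy) has been confirmed to return (summer 2015).
      As for M-1 (preprocessing), would someone please consider hosting the next round? (The programs used this time will be made public.)
      Look forward to both!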
  41. Supplementary material
  42. Column selection (B, C)
      R_base:
          data <- read.csv("input.csv", header = TRUE, stringsAsFactors = FALSE)
          write.csv(data[, c("B", "C")], "output.csv", row.names = FALSE)
      R_pkg:
          library(data.table)
          library(dplyr)
          data <- fread("input.csv", header = TRUE, stringsAsFactors = FALSE, showProgress = FALSE)
          write.table(select(data, B, C), "output.csv", sep = ",", row.names = FALSE)
      PostgreSQL:
          set search_path=schema_name;
          COPY table_name FROM 'input.csv' WITH CSV HEADER NULL AS 'NA';
          COPY (select B,C from table_name) TO 'output.csv' WITH CSV HEADER NULL AS 'NA';
          truncate table table_name;
      NYSOL:
          mcut f=B,C i=input.csv o=output.csv
      Note: some details, such as the input file paths, are omitted.
  43. Row selection (B = あ)
      R_base:
          data <- read.csv("input.csv", header = TRUE, stringsAsFactors = FALSE)
          write.csv(data[data$B == "あ", ], "output.csv", row.names = FALSE)
      R_pkg:
          library(data.table)
          library(dplyr)
          data <- fread("input.csv", header = TRUE, stringsAsFactors = FALSE, showProgress = FALSE)
          setkey(data, B)
          write.table(filter(data, B == "あ"), "output.csv", sep = ",", row.names = FALSE)
      PostgreSQL:
          set search_path=schema_name;
          COPY table_name FROM 'input.csv' WITH CSV HEADER NULL AS 'NA';
          COPY (select * from table_name where B='あ') TO 'output.csv' WITH CSV HEADER NULL AS 'NA';
          truncate table table_name;
      NYSOL:
          mselstr f=B v=あ i=input.csv o=output.csv
      Note: some details, such as the input file paths, are omitted.
  44. Column calculation (E = B - C)
      R_base:
          data <- read.csv("input.csv", header = TRUE, stringsAsFactors = FALSE)
          write.csv(transform(data, E = B - C), "output.csv", row.names = FALSE)
      R_pkg:
          library(data.table)
          library(dplyr)
          data <- fread("input.csv", header = TRUE, stringsAsFactors = FALSE, showProgress = FALSE)
          write.table(mutate(data, E = B - C), "output.csv", sep = ",", row.names = FALSE)
      PostgreSQL:
          set search_path=schema_name;
          COPY table_name FROM 'input.csv' WITH CSV HEADER NULL AS 'NA';
          COPY (select *,B-C as E from table_name) TO 'output.csv' WITH CSV HEADER NULL AS 'NA';
          truncate table table_name;
      NYSOL:
          mcal c='${B}-${C}' a=E i=input.csv o=output.csv
      Note: some details, such as the input file paths, are omitted.
  45. Sorting (B, C)
      R_base:
          data <- read.csv("input.csv", header = TRUE, stringsAsFactors = FALSE)
          write.csv(data[order(data$B, data$C), ], "output.csv", row.names = FALSE)
      R_pkg:
          library(data.table)
          library(dplyr)
          data <- fread("input.csv", header = TRUE, stringsAsFactors = FALSE, showProgress = FALSE)
          write.table(arrange(data, B, C), "output.csv", sep = ",", row.names = FALSE)
      PostgreSQL:
          set search_path=schema_name;
          COPY table_name FROM 'input.csv' WITH CSV HEADER NULL AS 'NA';
          COPY (select * from table_name order by B,C) TO 'output.csv' WITH CSV HEADER NULL AS 'NA';
          truncate table table_name;
      NYSOL:
          msortf f=B,C i=input.csv o=output.csv
      Note: some details, such as the input file paths, are omitted.
  46. Combined (tasks 1 to 4)
      R_base:
          data <- read.csv("input.csv", header = TRUE, stringsAsFactors = FALSE)
          data.trn <- transform(data[data$B == "あ", c("B", "C")], E = B - C)
          write.csv(data.trn[order(data.trn$B, data.trn$C), ], "output.csv", row.names = FALSE)
      R_pkg:
          library(data.table)
          library(dplyr)
          data <- fread("input.csv", header = TRUE, stringsAsFactors = FALSE, showProgress = FALSE)
          data.mixed = data %>% select(B, C) %>% filter(B == "あ") %>% mutate(E = B - C) %>% arrange(B, C)
          write.table(data.mixed, "output.csv", sep = ",", row.names = FALSE)
      Note: some details, such as the input file paths, are omitted.
  47. Combined (tasks 1 to 4)
      PostgreSQL:
          set search_path=schema_name;
          COPY table_name FROM 'input.csv' WITH CSV HEADER NULL AS 'NA';
          COPY (select B,C , B-C as E from table_name where B='あ' order by B,C) TO 'output.csv' WITH CSV HEADER NULL AS 'NA';
          truncate table table_name;
      NYSOL:
          mcut f=B,C i=input.csv | mselstr f=B v=あ | mcal c='${B}-${C}' a=E | msortf f=B,C o=output.csv
      Note: some details, such as the input file paths, are omitted.
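  48. [Table: elapsed time, in seconds, for 1,000 records (approx. 100 KB), 10,000 records (approx. 1 MB), and 100,000 records (approx. 10 MB)]
  49. [Table: elapsed time, in seconds, for 1,000,000 records (approx. 100 MB), 10,000,000 records (approx. 1 GB), and 100,000,000 records (approx. 10 GB)]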
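
The five preprocessing tasks on slides 11 to 15 can be illustrated with a small sketch in Python's standard `csv` module. This is not one of the contestants and says nothing about their speed; it only shows what each task does to a table. The helper names (`sel_col`, `sel_row`, `calc`, `sort_rows`), the sample data, and the choice to filter on column A are all assumptions made for this illustration. Column A is used as the filter so that B and C can stay numeric for E = B - C.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def sel_col(rows, cols):
    """Column selection: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]


def sel_row(rows, col, value):
    """Row selection: keep rows whose `col` equals `value`."""
    return [r for r in rows if r[col] == value]


def calc(rows, new_col, left, right):
    """Column calculation: append new_col = left - right (as integers)."""
    return [{**r, new_col: str(int(r[left]) - int(r[right]))} for r in rows]


def sort_rows(rows, keys):
    """Sorting: order rows by the given columns, compared as strings."""
    return sorted(rows, key=lambda r: tuple(r[k] for k in keys))


def write_rows(rows):
    """Serialize rows back to CSV text, header first."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data, not taken from the benchmark datasets.
SAMPLE = "A,B,C,D\nx,8,2,p\ny,5,1,q\nx,3,1,r\n"

rows = read_rows(SAMPLE)
# "mix": column selection, row selection, column calculation, then sorting.
mixed = sort_rows(
    calc(sel_row(sel_col(rows, ["A", "B", "C"]), "A", "x"), "E", "B", "C"),
    ["B", "C"],
)
result = write_rows(mixed)
print(result)
```

The chained call mirrors the pipeline style of the `%>%` and UNIX-pipe versions in the supplementary slides: each step takes a table and returns a table.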

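
The rule on slide 9 counts elapsed time from reading the CSV through saving the processed CSV, summed over the tasks. The slides do not say which timing tool was used, so the following is only a minimal sketch of that idea in Python. The `timed` helper, the `column_select` task, and the generated 1,000-row input are assumptions made for this illustration.

```python
import csv
import os
import tempfile
import time


def column_select(src, dst, cols):
    """Read src, keep `cols`, write dst: one complete read-to-write task."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        # extrasaction="ignore" drops the columns that are not in `cols`.
        writer = csv.DictWriter(fout, fieldnames=cols, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


def timed(task, *args):
    """Return wall-clock seconds for one task, from read through write."""
    start = time.perf_counter()
    task(*args)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.csv")
    dst = os.path.join(tmp, "output.csv")

    # Generate a hypothetical 1,000-row input with columns A, B, C, D.
    with open(src, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["A", "B", "C", "D"])
        for i in range(1000):
            w.writerow([i, i * 2, i % 7, "x"])

    # Total elapsed time is the sum over the individual runs.
    times = [timed(column_select, src, dst, ["B", "C"]) for _ in range(3)]
    total = sum(times)

    with open(dst, newline="") as f:
        out_rows = list(csv.reader(f))

print(f"total elapsed: {total:.4f} s")
```

Timing the whole read-to-write span, rather than only the transformation, matches the slide's definition and is why file I/O speed (for example `fread()` versus `read.csv`) matters to the ranking.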