How to tackle
big TSV files
2016.02.12 Fri.
Shimono Toshiyuki
General techniques for most data preprocessing tasks,
and an introduction to the author's own tools
bin4tsv 1
Even in today's age of BIG DATA,
digital data
- rarely tells us anything by itself,
- needs to be handled skillfully,
- and thus requires new technology!
-- Although humans and animals easily
handle real-world data via their “sense organs”!
Picture the real situation
when you receive large data files
to analyze as part of your work:
“Check them quickly, set up the environment, and visualize your
results.”
Today's topic: hints for speedy (pre-)analysis
with the CLI.
CLI = “Command Line Interface”
What is 前処理 (“MAE SHORI”, roughly “preprocessing”) in English ??
CHAPTER 1. PRECONDITIONS
Toward speedy actions for tsv files with 1,000–100,000,000 lines
Why the TSV format?
It is easier to handle than CSV!
CSV = Comma-Separated Values;
values are often enclosed in double quotes (").
TSV = simply Tab-Separated Values;
less, cut, column, etc. handle TSV files well
-- these Unix commands will appear again later.
Note that CSV is described in RFC 4180.
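A small illustration of why tab separation is easier to handle with plain Unix tools. The sample record and the variable names are made up for this sketch; it only assumes a POSIX shell with printf and cut.

```shell
# A CSV record whose second field contains a comma inside quotes.
csv_line='1,"Tokyo, Japan",42'
# The same record as TSV (fields separated by a tab).
tsv_line=$(printf '1\tTokyo, Japan\t42')

# Naive comma splitting cuts the quoted field in half.
csv_field2=$(printf '%s\n' "$csv_line" | cut -d, -f2)
# Tab splitting returns the whole field.
tsv_field2=$(printf '%s\n' "$tsv_line" | cut -f2)

echo "CSV field 2: $csv_field2"
echo "TSV field 2: $tsv_field2"
```

Handling the CSV case correctly needs a real CSV parser, whereas the TSV case works with cut alone as long as no field contains a tab.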
Concerning SQL:
When you use an SQL database,
you need to design tables and columns first,
so today's TSV-handling techniques still matter.
The techniques presented from here on
are not easy to express in SQL statements.
Think about summing all the numbers in a big table.
Which is faster, SQL or the tsv/csv format?
.. Type determination and outlier screening are must-do tasks.
So are cross tables, Venn diagrams, quantile plots, etc.
True data analysis is not an aggregation of micro queries!
Retrieval from SQL
Pre-analysis required!
Importing into SQL again.
Preparatory knowledge:
You are assumed to be familiar with
Unix/Linux commands such as:
less wc gawk cut head tail grep sed
file od iconv nkf lv
diff sdiff sort uniq paste column
-- including the minor options of the above.
Also with techniques that use:
<( ) process substitution, stderr, /dev/null
CHAPTER 2.
LITTLE-KNOWN BASIC TECHNIQUES
Toward speedy actions for tsv files with 1,000–100,000,000 lines
Coloring brings comfort.
Case 1: very many columns in a TSV file
Case 2: many numbers with many digits
Can you speed-read “5 trillion”, a 13-digit number?
The ‘less’ screen pager handles tabs well, but coloring is much more comfortable.
Supplementary Page
Your client gave you data with tens of
columns. You view it in an editor and want
to follow a few columns of interest.
If you color every 5 consecutive columns alternately in black and
blue, the situation improves greatly.
Typing “-x 12 ENTER” in less also improves things a little.
Almost no existing basic CLI tool does digit
grouping. If you group digits by inserting commas,
you may have to cope with a collapsed
layout. A simple solution is to color the digit groups
differently. The figure above shows this solution, in which at
most 3 consecutive digits, counted from the right, share the
same color.
(Figure captions: commas have collapsed the layout; coloring causes little trouble.)
Supplementary Page
Venn Diagrams
Scenario:
Transaction data for Campaign (A) and Campaign (B);
target-customer data from the Tokyo branch (C) and the Osaka branch (D).
How, then, do you compare the 4 sets of key-column values?
You need to draw that complex Venn diagram because:
1. What your client or colleague told you may turn out to be “different” from the data.
2. Data files can be damaged while being exported or transferred.
A Venn diagram for 4 sets requires real skill!
Supplementary Page
(Figure: Venn diagram of four sets labeled A, B, C, D)
When you receive multiple data files that share common “key” columns, what do
you do? It is essential to draw the Venn diagram of the keys of each file and to fill in the
number of elements in each of its regions. Then you can see (1)
how one set contains another, (2) how the sets overlap, (3) how
erroneous cases occur, and so on. The numbers filled into the regions are also
important for checking the data-fab (data-processing) steps that follow.
By the way, can you really draw a Venn diagram and fill in the 15
numbers for 4 given sets, neatly and quickly?
(Also, some combinations of the 15 regions, maybe 99 or more, are rather
meaningful. What are they, and how should they be represented?)
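The region counts described above can be approximated with standard tools. The following is a minimal sketch, not the author's venn2-3 or venn4 command: the function name `venn_counts` is hypothetical, and it assumes each file is a non-empty TSV whose first column is the key.

```shell
# Print "pattern<TAB>count" for every non-empty Venn region.
# A pattern such as 1011 means: key present in files 1, 3 and 4, absent from file 2.
venn_counts() {
  awk -F '\t' '
    FNR == 1 { nfiles++ }                    # a new input file starts (empty files are not counted)
    { has[$1, nfiles] = 1; keys[$1] = 1 }    # remember which file each key was seen in
    END {
      for (k in keys) {
        pat = ""
        for (f = 1; f <= nfiles; f++) pat = pat (((k, f) in has) ? "1" : "0")
        count[pat]++
      }
      for (p in count) print p "\t" count[p]
    }
  ' "$@" | sort
}
```

Called as `venn_counts A.tsv B.tsv C.tsv D.tsv` (file names are placeholders), it prints at most 15 lines for four files, one per region, since the pattern 0000 can never occur.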
▲ Branko Grünbaum, figure from ja.wikipedia
Frequency Table
Do you use ‘sort | uniq -c’? Why not use something shorter?
The command presented here also works well in Japanese-language environments,
and offers functions such as various sort orders and cumulative sums.
Example: statuses of Twitter account pages.
About 0.01 - 0.1% of them seem not to work well.
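For comparison, a frequency table can be sketched with awk and sort alone. This is an illustration under stated assumptions, not the author's command: the function name `freq_table` is hypothetical, and it counts whole lines.

```shell
# Count identical lines and print "count<TAB>value",
# most frequent first, ties broken by value.
freq_table() {
  tab=$(printf '\t')
  awk '{ n[$0]++ } END { for (v in n) print n[v] "\t" v }' "$@" |
    sort -t "$tab" -k1,1nr -k2,2
}
```

To count one column only, feed it through cut first, for example `cut -f3 data.tsv | freq_table` (the file name is a placeholder).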
Cross Table
‘crosstable’ promptly outputs a cross table without SQL or an Excel pivot table.
Main functions: (1) counting, for 2-column data;
(2) summing the values of the first column, for 3-column data.
Known as a “contingency table” in statistics.
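The counting case (1) can be sketched in awk. This is not the author's ‘crosstable’ implementation: the function name `cross_count` is hypothetical, the input is assumed to be 2-column TSV, and row and column labels are printed in order of first appearance.

```shell
# Build a 2-dimensional count table from 2-column TSV input.
cross_count() {
  awk -F '\t' '
    !($1 in rseen) { rseen[$1] = 1; rname[++nr] = $1 }   # row labels, first-appearance order
    !($2 in cseen) { cseen[$2] = 1; cname[++nc] = $2 }   # column labels likewise
    { cell[$1, $2]++ }
    END {
      line = ""
      for (j = 1; j <= nc; j++) line = line "\t" cname[j]
      print line                                         # header row starts with an empty cell
      for (i = 1; i <= nr; i++) {
        line = rname[i]
        for (j = 1; j <= nc; j++) line = line "\t" (cell[rname[i], cname[j]] + 0)
        print line
      }
    }
  ' "$@"
}
```

The output is itself TSV, so it can be pasted into a spreadsheet or piped into column -t for viewing.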
Supplementary Page
“When was the server down?”
After getting the ‘crosstable’ output that counts the records of some event,
copy and paste it into Excel and apply “conditional formatting”.
Date/Hour 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013-12-10 32 157 153 156 139 161 170 182 315 356 276 221 256
2013-12-11 127 93 64 58 39 72 90 144 200 149 170 116 177 144 170 149 156 268 318 253 216 187 203 235
2013-12-12 136 86 68 63 46 61 64 115 177 134 130 121 463 582 423 442 703 1285 1394 1446 1444 1611 2023 1974
2013-12-13 1532 958 634 478 396 403 526 837 1122 1097 1457 1386 1988 1816 2037 3025 3662 4125 4551 5062 4655 4993 5576 2969
2013-12-14 2873 1626 1139 1018 529 682 807 1164 1655 2115 2536 2715 2868 3454 3720 3812 3611 3438 4202 5005 2557 3133 5031 4274
2013-12-15 3617 2232 1694 1330 1051 974 1173 1732 2590 3549 4942 5871 5306 3056 3571 5247 6666 6798 6964 2597 1710 3382 7879 6425
2013-12-16 4623 2741 1631 1304 927 1094 1185 2053 2597 2474 3415 3614 4580 4836 4187 4999 4115 4122 5957 6427 6328 6793 6545 5963
2013-12-17 4045 2405 1545 1189 1007 946 1237 2692 3532 3435 4178 4925 6158 5049 4756 4375 3890 4138 5090 6265 7628 7077 6639 6325
2013-12-18 4458 2807 1713 1404 1131 1051 1347 1954 2299 2624 3412 2707 1962 1773 1910 3851 5263 6359 7177 6583 6936 6644 6234 6119
2013-12-19 4637 3064 1739 1509 1141 1100 1394 2362 2994 3897 3999 4344 5260 6844 5092 4459 5706 6004 7407 3817 1672 4132 6727 5514
2013-12-20 5393 3082 1977 1973 1321 1462 1667 974 1222 1423 1818 2954 5261 4082 4889 6194 6355 6755 9276 6215 6275 6152 5849 5959
2013-12-21 4924 3247 1859 1637 1331 1414 1617 1249 1327 2072 2385 2784 2927 2630 2740 2916 2871 3139 6222 5042 4680 4828 4829 4633
2013-12-22 3838 2702 2073 1836 1417 1186 1427 1442 1633 2043 3376 4120 4346 4159 4473 4759 4450 4773 4962 6876 7718 7327 7472 6768
2013-12-23 5260 3316 2134 1776 1258 1231 1672 1299 894 1023 1195 2402 3516 3584 5172 7778 8700 8331 8030 6035 7712 7669 8084 7450
2013-12-24 7098 4150 2293 1774 1205 1189 1589 1476 1145 1733 2023 2678 4647 3835 4571 6430 6943 7864 8863 5181 6080 7125 8609 6496
2013-12-25 4905 2734 1695 1604 1377 1360 1580 1200 1289 1335 2341 4287 5532 4332 5077 6474 6952 7888 7702 3655 6388 9370 9115 8470
2013-12-26 7468 3877 2269 1959 1370 1398 1908 467 39 140 155 143 131 418 1081 2148 8863 8775 8934 8498 9086 8356
2013-12-27 6419 4015 1833 1627 1421 1361 1590 2511 3768 3996 5245 6196 7894 8098 7373 7497 7296 7324 8437 8120 8059 8384 8435 8277
2013-12-28 7678 4543 2619 2041 1587 1568 1726 2539 3578 4911 5945 7082 7563 7343 7916 7529 7863 8234 8165 8076 8528 8763 8670 7892
2013-12-29 6661 4372 2621 1823 1336 1224 1433 2290 3223 4477 5873 7296 7842 7614 7975 8060 8639 8402 8889 8398 8457 8320 9454 8700
2013-12-30 7607 4656 2765 2100 1528 1440 1635 2511 3997 5140 7044 7658 9435 9147 9147 10058 9956 10529 9852 9662 10644 11448 12848 12136
2013-12-31 10349 6155 3885 2703 1900 1803 2110 3586 5362 8051 10713 12249 13035 13001 13298 15048 15950 17235 17573 15644 16229 15365 14898 17461
2014-01-01 22849 15263 4149 3157 4218 4828 7216 15086 7594 12145 27222 2067 2511 2540 2478 4938 9088 14816 21263 7463 1861 4411 10241 9638
2014-01-02 8627 4964 2749 4744 3729 4168 6328 1915 1342 1889 2848 3233 3197 3612 4528 4449 7701 12565 20139 6596 4637 6998 8677 8949
2014-01-03 6857 5227 6151 3906 2678 2656 2899 2227 1378 2470 2706 3105 3573 3287 3754 7378 10992 11090 10936 5549 6432 8894 9887 8318
2014-01-04 6833 4113 2371 2687 2248 2016 2138 1689 1079 1437 1921 2448 2451 4279 7620 8104 8364 8326 7957 6164 5549 5233 5850 8926
2014-01-05 9757 6208 3823 2627 2073 1782 1854 695 419 511 661 1073 1976 2815 3173 3884 9464 11131 11139 9671 9997 10924 9716 10967
2014-01-06 7623 4818 2873 2159 1656 1480 1733 2948 3538 4550 5638 6605 9118 7973 7552 7691 7392 8170 7714 8308 8974 9368 10272 9641
2014-01-07 7025 4124 2604 1800 1488 1417 1647 1235 1434 2369 3146 3886 5686 6811 6426 6926 7317 8297 8453 6858 9441 9961 9958 8910
2014-01-08 7551 3997 2606 1837 1379 1248 1872 886 647 1022 1465 1828 2463 2138 1909 2163 2298 2832 2944 5583 7953 9266 10316 10737
2014-01-09 9161 4801 2999 2361 1871 1734 2094 3674 5135 5177 6758 7622 11370 10181 10548 12154 13235 13557 11017 4679 5650 6648 7858 7154
2014-01-10 5273 3243 1795 1349 1206 1020 1556 673 526 505 698 827 1039 996 1127 1232 2925 5981 10609 1481
An alternative to the histogram
Use a “quantile plot”.
You can read each percentile directly.
The program uses the R language internally, so R must be installed.
Green: number of accounts followed (“following”)
Blue: number of followers
for millions of Twitter accounts
Same plot on a LOG SCALE
The graphs are designed
so that precise numbers can be
read off the curves.
<- The wall at 2000
Supplementary Page
What is a quantile plot? (Comparison with the histogram.)
(The author hopes to retire the histogram and replace it with the so-called
“quantile plot” or “fractile plot”. The latter avoids a
problem of the histogram: you must ‘carefully’ choose how to group the
given values, and that choice affects the subsequent analysis.)
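The numbers behind a quantile plot can be computed without any binning choice. The sketch below is only an illustration and is not the author's R-based program: the function name `quantiles` is hypothetical, the input is assumed to be one number per line, and a simple floor-based rank rule is used.

```shell
# Print the 0, 25, 50, 75 and 100 percent points of the input numbers.
quantiles() {
  sort -n "$@" | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      for (p = 0; p <= 100; p += 25) {
        i = int((NR - 1) * p / 100) + 1   # 1-based rank of the p-percent point
        print p "%\t" v[i]
      }
    }'
}
```

Plotting the sorted values against their rank fraction gives the quantile plot itself; the function above just samples five points on that curve.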
Sampling, Shuffling
Unless you look at “randomly sampled data”,
you will probably get a false impression of
the data. Sample and shuffle! That is statistics.
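Sampling and shuffling can be sketched with awk's random number generator. These are illustrations, not the author's sampleL and shuffleL commands: the function names and the explicit seed argument are assumptions made here so that a run can be repeated.

```shell
# Keep each line independently with probability p.
# Usage: sample_lines P SEED [FILE...]
sample_lines() {
  p=$1; seed=$2; shift 2
  awk -v p="$p" -v seed="$seed" 'BEGIN { srand(seed) } rand() < p' "$@"
}

# Shuffle lines by attaching a random key, sorting on it, and removing it again.
# Usage: shuffle_lines SEED [FILE...]
shuffle_lines() {
  seed=$1; shift
  tab=$(printf '\t')
  awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.17f\t%s\n", rand(), $0 }' "$@" |
    sort -t "$tab" -k1,1n | cut -f2-
}
```

For example, `sample_lines 0.001 1 big.tsv | less` shows roughly one line in a thousand (the file name is a placeholder). The shuffle sorts the whole input, so it suits files of moderate size.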
CHAPTER 3.
WHAT A DATA SCIENTIST DOES
WHEN DATA FILES COME
Last Chapter.
The social-skill side:
1. Check immediately whether the data files are
“different!” from what you were told. You are responsible.
2. Doubt everything you have heard about the data,
and check it quickly.
Also have a method for checking it quickly.
3. The total number of lines in each data file
is important. Record these numbers in your notes.
- You will easily forget the file names.
- You will easily get lost in a maze
during hectic analysis work.
- But those numbers will help you greatly!
Daily wisdom of data scientists
The technical side
1. Check that your file matches your client's
intention. Use less, file, nkf -g, lv, etc.
2. Record the sizes. ‘wc -l’ and ‘awk "{print NF}" | uniq -c’
show the numbers of lines and columns.
3. Convert the character encoding to UTF-8 and the
line separator to “\n” (0x0A in ASCII).
4. Convert the CSV file into TSV!
Beforehand, run fgrep “(Ctrl+V TAB)” file to see whether the file already contains tabs.
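Steps 2 to 4 above can be sketched as small shell functions. The function names are hypothetical and the file is assumed to be tab-separated; note that awk needs -F '\t' here, because by default it splits on any whitespace and would miscount fields that contain spaces.

```shell
# Step 2: print the line count, then "columns<TAB>lines" for each column count found.
tsv_shape() {
  wc -l < "$1" | tr -d ' '
  awk -F '\t' '{ print NF }' "$1" | sort -n | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Step 3 (line separators only): turn CRLF line ends into LF.
to_lf() {
  tr -d '\r'
}

# Step 4 pre-check: succeed if the file already contains a tab character.
has_tab() {
  tab=$(printf '\t')
  grep -q "$tab" "$1"
}
```

A clean TSV file yields exactly one line after the line count from `tsv_shape`; more than one means some lines have a different number of columns. Character encoding conversion itself is left to iconv or nkf, as the slide suggests.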
Pay attention:
1. CSV files may have been altered by Microsoft
Excel (e.g. “1234” -> “1,234”; “5/1” -> “5月”, i.e. “May”).
2. Database software is often buggy
in its import/export functions.
3. What you think is utf8 may be utf8mb4. What
you think is SJIS may be CP932.
-- Be careful!
-- Gain experience!
You need
5 to 20 rounds of debugging
whenever you do something new.
So try to build
simple, error-free tools,
and reuse them.
Desirable functions to implement in
the commands
1. Data files may begin with a “header” line.
Support that.
2. Gzipped files can be read and decompressed
quickly. Handle gzip files directly.
3. Processing may take a long time, so show
progress status while running.
4. Also show a useful message when
the process is interrupted by Control-C.
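Points 2 and 3 can be sketched in shell as follows. This is an illustration only, not how the bin4tsv commands are implemented: both function names are hypothetical, and gzip files are recognized purely by the .gz suffix.

```shell
# Point 2: read a file whether or not it is gzipped (decided by the file name).
read_maybe_gz() {
  case $1 in
    *.gz) gzip -dc -- "$1" ;;
    *)    cat -- "$1" ;;
  esac
}

# Point 3: pass lines through unchanged while reporting progress on stderr.
# Usage: ... | with_progress N    (report every N lines)
with_progress() {
  awk -v every="$1" '
    NR % every == 0 { print NR " lines read" > "/dev/stderr" }
    { print }
    END { print "done: " NR " lines" > "/dev/stderr" }
  '
}
```

Because the progress messages go to stderr, they do not contaminate the data flowing down the pipe, for example `read_maybe_gz data.tsv.gz | with_progress 1000000 | cut -f1` (the file name is a placeholder).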
APPENDIX CHAPTER :
COMMANDS PROVIDED..
Appendix.
First of all.. (about the commands)
Each command the author provides
basically works standalone. Some may require
installing libraries (using cpan).
You can start by trying:
crosstable, colors, colsummary
- The command names are still changing:
colors, color, or coloring
colsummary
• 1st (white): column number
• 2nd (green, bright): number of distinct values
• 3rd (blue): average value
• 4th (yellow): column name
• 5th (white, bright): value range
• 6th (white): most frequent values, from the top
• 7th (green): frequency counts, top to bottom
‘colsummary’ is not designed to show the string length of
column values; that function is provided by another command,
‘lengths’. The average is useful when you want to decipher
additive relations among columns.
This is the command my friend
appreciates most!
samecols
At position (a,b) of the matrix,
the WHITE number counts the lines where col[a] == col[b],
and the YELLOW number counts the lines where col[a] != col[b].
This matrix helps when you receive a table file with many columns and want
to know which groups of columns are similar. Sometimes 2 or 3
columns are mostly identical, and you can easily see how that comes about.
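The pairwise comparison can be sketched in awk. This is not the author's ‘samecols’: the function name `same_cols` is hypothetical, the output is a list of column pairs rather than a colored matrix, and a line that lacks one of the two columns is counted as differing.

```shell
# For every column pair a < b, print: a, b, lines where equal, lines where different.
same_cols() {
  awk -F '\t' '
    NF > ncol { ncol = NF }
    {
      for (a = 1; a <= NF; a++)
        for (b = a + 1; b <= NF; b++)
          if (($a "") == ($b "")) same[a, b]++    # force string comparison
    }
    END {
      for (a = 1; a <= ncol; a++)
        for (b = a + 1; b <= ncol; b++)
          print a "\t" b "\t" (same[a, b] + 0) "\t" (NR - same[a, b])
    }
  ' "$@"
}
```

Pairs whose “different” count is small but not zero are the interesting ones: the few differing lines often reveal how two nearly identical columns diverge.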
headkeep
Basically, every command
provided by bin4tsv is designed
to treat the first line of its input as a header
when it is given a special option
(simply, the -= option).
However, standard Unix/Linux
commands do not follow this
bin4tsv rule, so a bridging tool is
provided: the “headkeep”
command.
Example:
headkeep sort < data
# sorts only from the 2nd line onward.
headkeep tail -3 < data
# outputs the 1st line and the last 3 lines.
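The idea behind headkeep can be sketched as a shell function. This is an assumption about the behavior described above, not the author's implementation, and the name `head_keep` is hypothetical. It relies on the shell's read builtin consuming only the first line, so that the rest of stdin remains for the wrapped command.

```shell
# Echo the first line of stdin, then run the given command on the remaining lines.
head_keep() {
  # read fails on EOF; keep going if a last line without newline was still captured
  IFS= read -r header || [ -n "$header" ] || return 0
  printf '%s\n' "$header"
  "$@"
}
```

With this sketch, `head_keep sort < data` leaves the header line in place and sorts everything below it.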
venn2-3
The Venn diagram region sizes for 2 sets:
The Venn diagram region sizes for 3 sets:
At this moment, this command depends on List::Compare and is slow.
1. Commands for primitive checking
Command     Output / Remarks
csv2tsv     TSV from CSV. Depends on Text::CSV_XS.
tabsplit    One file per column of the input.
headkeep    The 1st line passes through; the remaining lines go to the specified command.
color       Numbers colored with “-3”; tab stops set with “-t num”.
colsummary  Properties of each column of a file, presented quite nicely.
lengths     Maps each cell to its string length. Try combining it with colsummary.
chars       Splits each string into characters, one per line. Try combining it with sdiff.
headtail    Shows the first 3 lines, the {5,10,20}x{1,10,100,1000,..}-th lines, and the last 3 lines.
sampleL     Outputs randomly chosen lines, with a specified probability or with weights.
shuffleL    Shuffles the lines (for a non-huge number of lines).
Each command works standalone at this moment.
The command names may change in the future.
2. Commands related to subtotals
Command     Output / Remarks
freq        Frequency table. Similar to ‘sort | uniq -c’.
freqfreq    Frequency table of the frequencies. Similar to ‘freq | cut -f1 | freq’.
crosstable  2-dimensional frequency table. The input may have 3 columns, in which case subtotals are output.
samecols    Square matrix of how many values agree between each pair of variables.
venn2-3     Equivalent to drawing a Venn diagram of the keys of 2 or 3 files.
venn4       Equivalent to drawing a Venn diagram of the keys of 4 files.
fractile    Fractile plot, also known as a quantile plot.
marginsum   Experimental; attaches marginal sums to a matrix.
3. Commands for combining tables
Command     Output / Remarks
columns     Extracts and/or deletes a specified subset of the columns of a table.
colAt       Searches the 1st line for a column name and returns its position.
kvcmp       Functionally similar to ‘sdiff <(sort -k num1 file1) <(sort -k num2 file2)’.
join2       join2 refFile < keys ; a replacement for the Unix join command.
keytrack    Aligns many target columns for the given keys (stdin), across many given files.
alluniq     Checks whether the given keys are all unique.
layers      Experimental; numeric distribution of one column for each value of another column.
This part is at the concept-design stage, so it is subject to change.
4. Commands for other utilities
Command     Output / Remarks
inarow      “In a row”. Outputs only the lines whose specified column has the same value a specified number of consecutive times.
groupsum    Experimental; subtotals of a specified column over each consecutive NUM lines.
transpose   Experimental; the transpose of the given input matrix.
rbind4R     Experimental; outputs an R command string built from the given input matrix.
This part is also at the concept-design stage, so it is subject to change.
Abbreviations / Glossary
Abbr.        Stands for; meaning
tbl.         table
col.         column (of a table)
var., vars.  variable(s). A var. is a more abstract concept than a column.
val.         value. It is specific/concrete; cf. “the case when var. X equals val. ‘a’”.
freq.        frequency; the number of times a specific val. of a specified var. appears.
num.         number. Often a mathematical integer; not text or a character.
line
key          A special var., used as an index to refer to other cols. on the same line.

Más contenido relacionado

Destacado (13)

Global Childcare in France
Global Childcare in FranceGlobal Childcare in France
Global Childcare in France
 
Project management mistakes to avoid
Project management mistakes to avoidProject management mistakes to avoid
Project management mistakes to avoid
 
Masgnb seminar operator_gnb_2013-program
Masgnb seminar operator_gnb_2013-programMasgnb seminar operator_gnb_2013-program
Masgnb seminar operator_gnb_2013-program
 
evaluatie van flipping de lessenserie, ervaringen met leerlingen
evaluatie van flipping de lessenserie, ervaringen met leerlingenevaluatie van flipping de lessenserie, ervaringen met leerlingen
evaluatie van flipping de lessenserie, ervaringen met leerlingen
 
Presentación efectiva, facebook
Presentación efectiva, facebookPresentación efectiva, facebook
Presentación efectiva, facebook
 
Mrs craig final exam 3
Mrs craig   final exam 3Mrs craig   final exam 3
Mrs craig final exam 3
 
Pdf
PdfPdf
Pdf
 
spacelab_Vienna's Vocational Training Guarantee
spacelab_Vienna's Vocational Training Guaranteespacelab_Vienna's Vocational Training Guarantee
spacelab_Vienna's Vocational Training Guarantee
 
Tissue culture
Tissue cultureTissue culture
Tissue culture
 
職場生存之道:內向心理學(二)
職場生存之道:內向心理學(二)職場生存之道:內向心理學(二)
職場生存之道:內向心理學(二)
 
Counting Frogs
Counting FrogsCounting Frogs
Counting Frogs
 
Dating columns
Dating columnsDating columns
Dating columns
 
Presentation software
Presentation softwarePresentation software
Presentation software
 

Similar a bin4tsv

Fine grained monitoring
Fine grained monitoringFine grained monitoring
Fine grained monitoring
Iben Rodriguez
 
CSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdf
CSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdfCSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdf
CSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdf
NourhanTarek23
 
Brian Suda: Designing with data (Webdagene 2014)
Brian Suda: Designing with data (Webdagene 2014)Brian Suda: Designing with data (Webdagene 2014)
Brian Suda: Designing with data (Webdagene 2014)
webdagene
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
cookie1969
 
CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
sundararavind
 

Similar a bin4tsv (20)

Future Architecture of Streaming Analytics: Capitalizing on the Analytics of ...
Future Architecture of Streaming Analytics: Capitalizing on the Analytics of ...Future Architecture of Streaming Analytics: Capitalizing on the Analytics of ...
Future Architecture of Streaming Analytics: Capitalizing on the Analytics of ...
 
Numerical and Statistical Quantifications of Biodiversity: Two-At-A-Time Equa...
Numerical and Statistical Quantifications of Biodiversity: Two-At-A-Time Equa...Numerical and Statistical Quantifications of Biodiversity: Two-At-A-Time Equa...
Numerical and Statistical Quantifications of Biodiversity: Two-At-A-Time Equa...
 
Ohecc_Bb_student_activity
Ohecc_Bb_student_activityOhecc_Bb_student_activity
Ohecc_Bb_student_activity
 
Performance Risk Management
Performance Risk ManagementPerformance Risk Management
Performance Risk Management
 
The future of AI programs in lens design
The future of AI programs in lens designThe future of AI programs in lens design
The future of AI programs in lens design
 
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
 
Empowering the quantum revolution with Q#
Empowering the quantum revolution with Q#Empowering the quantum revolution with Q#
Empowering the quantum revolution with Q#
 
Criteo Infraestructure: Hadoop Datacenter
Criteo Infraestructure: Hadoop DatacenterCriteo Infraestructure: Hadoop Datacenter
Criteo Infraestructure: Hadoop Datacenter
 
Fine grained monitoring
Fine grained monitoringFine grained monitoring
Fine grained monitoring
 
Tablice rozkładu t-studenta
Tablice rozkładu t-studentaTablice rozkładu t-studenta
Tablice rozkładu t-studenta
 
CSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdf
CSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdfCSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdf
CSE031.Lecture_07-FlowCharts_Pseudocode .Part_II.pdf
 
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
 
Brian Suda: Designing with data (Webdagene 2014)
Brian Suda: Designing with data (Webdagene 2014)Brian Suda: Designing with data (Webdagene 2014)
Brian Suda: Designing with data (Webdagene 2014)
 
VizzMaintenance Eclipse Plugin Metrics
VizzMaintenance Eclipse Plugin MetricsVizzMaintenance Eclipse Plugin Metrics
VizzMaintenance Eclipse Plugin Metrics
 
Cassandra Community Webinar | Practice Makes Perfect: Extreme Cassandra Optim...
Cassandra Community Webinar | Practice Makes Perfect: Extreme Cassandra Optim...Cassandra Community Webinar | Practice Makes Perfect: Extreme Cassandra Optim...
Cassandra Community Webinar | Practice Makes Perfect: Extreme Cassandra Optim...
 
To find raise to five of any number
To find raise to five of any numberTo find raise to five of any number
To find raise to five of any number
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
 
Alan Robinson
Alan RobinsonAlan Robinson
Alan Robinson
 
Cytoscape Tutorial Session 2 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
Cytoscape Tutorial Session 2 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)Cytoscape Tutorial Session 2 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
Cytoscape Tutorial Session 2 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
 

Más de Toshiyuki Shimono

新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで
Toshiyuki Shimono
 

Más de Toshiyuki Shimono (20)

国際産業数理・応用数理会議のポスター(作成中)
国際産業数理・応用数理会議のポスター(作成中)国際産業数理・応用数理会議のポスター(作成中)
国際産業数理・応用数理会議のポスター(作成中)
 
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
 
extracting only a necessary file from a zip file
extracting only a necessary file from a zip fileextracting only a necessary file from a zip file
extracting only a necessary file from a zip file
 
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
 
新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで
 
Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
 Multiplicative Decompositions of Stochastic Distributions and Their Applicat... Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
 
Theory to consider an inaccurate testing and how to determine the prior proba...
Theory to consider an inaccurate testing and how to determine the prior proba...Theory to consider an inaccurate testing and how to determine the prior proba...
Theory to consider an inaccurate testing and how to determine the prior proba...
 
Interpreting Multiple Regression via an Ellipse Inscribed in a Square Extensi...
Interpreting Multiple Regressionvia an Ellipse Inscribed in a Square Extensi...Interpreting Multiple Regressionvia an Ellipse Inscribed in a Square Extensi...
Interpreting Multiple Regression via an Ellipse Inscribed in a Square Extensi...
 
Sqlgen190412.pdf
Sqlgen190412.pdfSqlgen190412.pdf
Sqlgen190412.pdf
 
BigQueryを使ってみた(2018年2月)
BigQueryを使ってみた(2018年2月)BigQueryを使ってみた(2018年2月)
BigQueryを使ってみた(2018年2月)
 
Seminar0917
Seminar0917Seminar0917
Seminar0917
 
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
 
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
 
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
 
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
 
企業等に蓄積されたデータを分析するための処理機能の提案
企業等に蓄積されたデータを分析するための処理機能の提案企業等に蓄積されたデータを分析するための処理機能の提案
企業等に蓄積されたデータを分析するための処理機能の提案
 
新入社員の頃に教えて欲しかったようなことなど
新入社員の頃に教えて欲しかったようなことなど新入社員の頃に教えて欲しかったようなことなど
新入社員の頃に教えて欲しかったようなことなど
 
ページャ lessを使いこなす
ページャ lessを使いこなすページャ lessを使いこなす
ページャ lessを使いこなす
 
Guiを使わないテキストデータ処理
Guiを使わないテキストデータ処理Guiを使わないテキストデータ処理
Guiを使わないテキストデータ処理
 
データ全貌把握の方法170324
データ全貌把握の方法170324データ全貌把握の方法170324
データ全貌把握の方法170324
 

Último

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 

Último (20)

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

bin4tsv

  • 1. How to tackle big TSV files. 2016.02.12 (Fri.), Shimono Toshiyuki. General techniques for most data preprocessing tasks, and an introduction to the author's own tools (bin4tsv).
  • 2. Even in today's BIG DATA age, digital data (1) rarely tells anything by itself, (2) needs to be handled skillfully, and (3) thus requires new technology, although humans and animals easily handle data on the earth via their “sense organs”!
  • 3. Think of the real situation when you receive big data files to analyze as part of your work: “check them quickly, set up environments, and visualize your results...”
  • 4. Hints for speedy (pre-)analysis with the CLI are presented here today. CLI = “Command Line Interface”. What is “MAE SHORI” (前処理, i.e. preprocessing) in English?
  • 5. CHAPTER 1. PRECONDITIONS. Toward speedy actions for TSV files with 1,000–100,000,000 lines.
  • 6. Why TSV format? It is easier to handle than CSV! CSV = Comma-Separated Values; each value is often enclosed in double quotes ("). TSV = simply Tab-Separated Values. less, cut, column, etc. handle TSV files well; these Unix commands will appear again later. Note that CSV is formally specified in RFC 4180.
  • 7. Concerning SQL: when you use an SQL database, you need to design tables and columns first, so today's TSV-handling techniques matter. The techniques presented from now on are not easy to express in SQL statements. Think of summing all the numbers in a big table: which is faster, SQL or a TSV/CSV file? Type determination and outlier screening are must-do tasks, as are cross tables, Venn diagrams, quantile plots, etc. True data analysis is not an aggregation of micro queries! (Diagram: retrieval from SQL, then the required pre-analysis, then importing into SQL again.)
  • 8. Preparatory knowledge: you are supposed to be familiar with Unix/Linux commands such as: less wc gawk cut head tail grep sed file od iconv nkf lv diff sdiff sort uniq paste column, including their minor options. Also with techniques that use <( ) process substitution, stderr, and /dev/null.
  • 9. CHAPTER 2. UNKNOWN BASIC TECHNIQUES. Toward speedy actions for TSV files with 1,000–100,000,000 lines.
  • 10. Coloring makes reading comfortable. Case 1: very many columns in a TSV file. Case 2: many numbers with many digits. Can you speed-read “5 trillion” written with 13 digits? The ‘less’ screen pager handles tabs well, but coloring is much more comfortable.
  • 11. Supplementary page. Your client gave you data with tens of columns. You view it in an editor and want to follow some columns of interest. If you color every 5 consecutive columns alternately black and blue, the situation improves greatly. Typing “-x 12 ENTER” in less also improves things a little.
  • 12. Supplementary page. Almost no existing primitive CLI tool does digit grouping. If you try it by inserting commas between digits, you may need to deal with the collapsed layout. A simple solution is to color the digits differently: in the solution shown on the slide, each group of at most 3 consecutive digits, counted from the right, shares the same color. (Slide captions: commas have collapsed the layout; coloring causes little trouble.)
  • 13. Venn Diagrams. Scene: transaction data of Campaign (A) and Campaign (B); target customer data from the Tokyo branch (C) and the Osaka branch (D). Then, how do you compare the 4 sets of key-column values? You need to draw that complex Venn diagram because: 1. Possibly, what your client or colleague told you is “different” from the actual data. 2. Data files can be damaged while being exported or conveyed. A Venn diagram for 4 sets requires neat skills!
  • 14. Supplementary page. When you get multiple data files that share common “key” columns, what would you do? It is essential to draw the Venn diagram for the keys of each file and to fill in the number of elements in each of its regions. Then you can see (1) how one set contains another, (2) how sets overlap each other, (3) where erroneous cases occur, and so on. The numbers filled into the regions are also important for checking the data-processing (“data-fab”) steps that follow. By the way, can you draw a Venn diagram and fill in the 15 numbers for 4 given sets, really neatly and quickly? (And some combinations of the 15 regions, maybe 99 or more, are rather meaningful. What are they, and how should they be represented?) (Figure: 4-set Venn diagram by Branko Grünbaum, from ja.wikipedia.)
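The region counts for two key files can be sketched with standard tools. This is a minimal illustration built on `sort -u` and `comm`, not the author's venn2-3 command; the function name `venn2_sketch`, the region labels, and the demo file names are assumptions made for this example.

```sh
# venn2_sketch FILE_A FILE_B: sizes of the three Venn regions for two key
# files (one key per line). Duplicate keys inside a file are counted once.
venn2_sketch() {
  _va=$(mktemp) && _vb=$(mktemp) || return 1
  sort -u "$1" > "$_va"
  sort -u "$2" > "$_vb"
  # comm -23: lines only in A, comm -12: lines in both, comm -13: only in B.
  printf 'A only\t%s\n' "$(comm -23 "$_va" "$_vb" | wc -l | tr -d ' ')"
  printf 'A and B\t%s\n' "$(comm -12 "$_va" "$_vb" | wc -l | tr -d ' ')"
  printf 'B only\t%s\n' "$(comm -13 "$_va" "$_vb" | wc -l | tr -d ' ')"
  rm -f "$_va" "$_vb"
}

# Demo with two small, made-up key files.
demo_dir=$(mktemp -d)
printf 'k1\nk2\nk3\n' > "$demo_dir/campaign_a.txt"
printf 'k2\nk3\nk4\nk5\n' > "$demo_dir/campaign_b.txt"
venn2_sketch "$demo_dir/campaign_a.txt" "$demo_dir/campaign_b.txt"
rm -rf "$demo_dir"
```

With 4 sets the same idea needs 15 regions, which is where a dedicated command becomes worthwhile.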
  • 15. Frequency Table. Do you use `sort | uniq -c`? Why not use a shorter way? The presented command also works quite well in Japanese environments, and has further functions such as various kinds of sorting and cumulative sums. (Slide example: statuses of Twitter account pages; 0.01–0.1% seem not to work well.)
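For comparison, the long form mentioned on this slide can be wrapped as follows. This is only a sketch of the `sort | uniq -c` idiom with the counts moved into a tab-separated column; the name `freq_sketch` and the output layout are assumptions, and the author's `freq` command may behave differently.

```sh
# freq_sketch: count how often each input line occurs, most frequent first.
# Output columns: count <TAB> value.
freq_sketch() {
  sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}

# Demo on inline data.
printf 'tokyo\nosaka\ntokyo\nnagoya\ntokyo\nosaka\n' | freq_sketch
```

The final awk step only strips the padding that `uniq -c` puts in front of the count, so the result stays a valid TSV stream for later commands.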
  • 16. Cross Table. `crosstable` promptly outputs a cross table without SQL or an Excel pivot table. Its main functions: (1) counting for 2-column data, (2) summing up the first column's values for 3-column data. This is the “contingency table” of statistics.
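The counting case (1) can be approximated in plain awk. The sketch below is an assumption-laden stand-in, not the `crosstable` implementation: the function name `crosstab_sketch` is hypothetical, rows and columns appear in first-seen order, and empty cells are printed as 0.

```sh
# crosstab_sketch: read two tab-separated columns (row key, column key)
# and print a table of counts, with a header row of column keys.
crosstab_sketch() {
  awk -F '\t' '
    {
      # Remember row and column keys in the order they first appear.
      if (!($1 in rseen)) { rseen[$1] = 1; rows[++nrow] = $1 }
      if (!($2 in cseen)) { cseen[$2] = 1; cols[++ncol] = $2 }
      count[$1, $2]++
    }
    END {
      line = ""
      for (j = 1; j <= ncol; j++) line = line "\t" cols[j]
      print line
      for (i = 1; i <= nrow; i++) {
        line = rows[i]
        for (j = 1; j <= ncol; j++) line = line "\t" (count[rows[i], cols[j]] + 0)
        print line
      }
    }'
}

# Demo: date and hour of four made-up events.
printf '2013-12-10\t08\n2013-12-10\t08\n2013-12-10\t09\n2013-12-11\t08\n' |
  crosstab_sketch
```

Because the whole table is held in memory, this suits key columns with a modest number of distinct values, such as date by hour.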
  • 17. Supplementary page: “When was the server down?” After getting the “crosstable” that counts the records of some events, copy and paste the output into Excel and apply “conditional formatting”. In the table below, rows are dates and columns are hours (0–23). Rows with fewer than 24 values are listed in source order; the positions of their empty cells are not recoverable here.
      Date/Hour 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
      2013-12-10 32 157 153 156 139 161 170 182 315 356 276 221 256
      2013-12-11 127 93 64 58 39 72 90 144 200 149 170 116 177 144 170 149 156 268 318 253 216 187 203 235
      2013-12-12 136 86 68 63 46 61 64 115 177 134 130 121 463 582 423 442 703 1285 1394 1446 1444 1611 2023 1974
      2013-12-13 1532 958 634 478 396 403 526 837 1122 1097 1457 1386 1988 1816 2037 3025 3662 4125 4551 5062 4655 4993 5576 2969
      2013-12-14 2873 1626 1139 1018 529 682 807 1164 1655 2115 2536 2715 2868 3454 3720 3812 3611 3438 4202 5005 2557 3133 5031 4274
      2013-12-15 3617 2232 1694 1330 1051 974 1173 1732 2590 3549 4942 5871 5306 3056 3571 5247 6666 6798 6964 2597 1710 3382 7879 6425
      2013-12-16 4623 2741 1631 1304 927 1094 1185 2053 2597 2474 3415 3614 4580 4836 4187 4999 4115 4122 5957 6427 6328 6793 6545 5963
      2013-12-17 4045 2405 1545 1189 1007 946 1237 2692 3532 3435 4178 4925 6158 5049 4756 4375 3890 4138 5090 6265 7628 7077 6639 6325
      2013-12-18 4458 2807 1713 1404 1131 1051 1347 1954 2299 2624 3412 2707 1962 1773 1910 3851 5263 6359 7177 6583 6936 6644 6234 6119
      2013-12-19 4637 3064 1739 1509 1141 1100 1394 2362 2994 3897 3999 4344 5260 6844 5092 4459 5706 6004 7407 3817 1672 4132 6727 5514
      2013-12-20 5393 3082 1977 1973 1321 1462 1667 974 1222 1423 1818 2954 5261 4082 4889 6194 6355 6755 9276 6215 6275 6152 5849 5959
      2013-12-21 4924 3247 1859 1637 1331 1414 1617 1249 1327 2072 2385 2784 2927 2630 2740 2916 2871 3139 6222 5042 4680 4828 4829 4633
      2013-12-22 3838 2702 2073 1836 1417 1186 1427 1442 1633 2043 3376 4120 4346 4159 4473 4759 4450 4773 4962 6876 7718 7327 7472 6768
      2013-12-23 5260 3316 2134 1776 1258 1231 1672 1299 894 1023 1195 2402 3516 3584 5172 7778 8700 8331 8030 6035 7712 7669 8084 7450
      2013-12-24 7098 4150 2293 1774 1205 1189 1589 1476 1145 1733 2023 2678 4647 3835 4571 6430 6943 7864 8863 5181 6080 7125 8609 6496
      2013-12-25 4905 2734 1695 1604 1377 1360 1580 1200 1289 1335 2341 4287 5532 4332 5077 6474 6952 7888 7702 3655 6388 9370 9115 8470
      2013-12-26 7468 3877 2269 1959 1370 1398 1908 467 39 140 155 143 131 418 1081 2148 8863 8775 8934 8498 9086 8356
      2013-12-27 6419 4015 1833 1627 1421 1361 1590 2511 3768 3996 5245 6196 7894 8098 7373 7497 7296 7324 8437 8120 8059 8384 8435 8277
      2013-12-28 7678 4543 2619 2041 1587 1568 1726 2539 3578 4911 5945 7082 7563 7343 7916 7529 7863 8234 8165 8076 8528 8763 8670 7892
      2013-12-29 6661 4372 2621 1823 1336 1224 1433 2290 3223 4477 5873 7296 7842 7614 7975 8060 8639 8402 8889 8398 8457 8320 9454 8700
      2013-12-30 7607 4656 2765 2100 1528 1440 1635 2511 3997 5140 7044 7658 9435 9147 9147 10058 9956 10529 9852 9662 10644 11448 12848 12136
      2013-12-31 10349 6155 3885 2703 1900 1803 2110 3586 5362 8051 10713 12249 13035 13001 13298 15048 15950 17235 17573 15644 16229 15365 14898 17461
      2014-01-01 22849 15263 4149 3157 4218 4828 7216 15086 7594 12145 27222 2067 2511 2540 2478 4938 9088 14816 21263 7463 1861 4411 10241 9638
      2014-01-02 8627 4964 2749 4744 3729 4168 6328 1915 1342 1889 2848 3233 3197 3612 4528 4449 7701 12565 20139 6596 4637 6998 8677 8949
      2014-01-03 6857 5227 6151 3906 2678 2656 2899 2227 1378 2470 2706 3105 3573 3287 3754 7378 10992 11090 10936 5549 6432 8894 9887 8318
      2014-01-04 6833 4113 2371 2687 2248 2016 2138 1689 1079 1437 1921 2448 2451 4279 7620 8104 8364 8326 7957 6164 5549 5233 5850 8926
      2014-01-05 9757 6208 3823 2627 2073 1782 1854 695 419 511 661 1073 1976 2815 3173 3884 9464 11131 11139 9671 9997 10924 9716 10967
      2014-01-06 7623 4818 2873 2159 1656 1480 1733 2948 3538 4550 5638 6605 9118 7973 7552 7691 7392 8170 7714 8308 8974 9368 10272 9641
      2014-01-07 7025 4124 2604 1800 1488 1417 1647 1235 1434 2369 3146 3886 5686 6811 6426 6926 7317 8297 8453 6858 9441 9961 9958 8910
      2014-01-08 7551 3997 2606 1837 1379 1248 1872 886 647 1022 1465 1828 2463 2138 1909 2163 2298 2832 2944 5583 7953 9266 10316 10737
      2014-01-09 9161 4801 2999 2361 1871 1734 2094 3674 5135 5177 6758 7622 11370 10181 10548 12154 13235 13557 11017 4679 5650 6648 7858 7154
      2014-01-10 5273 3243 1795 1349 1206 1020 1556 673 526 505 698 827 1039 996 1127 1232 2925 5981 10609 1481
  • 18. An alternative to the histogram: use the “quantile plot”. You can read each percentile directly. The program uses the R language internally, so you must install R. (Figure: green = number of followings, blue = number of followers, for millions of Twitter accounts; the same plot is also shown in log scale. The graphs are designed so that precise numbers are readable from the curves. An annotation marks “the wall of 2000”.)
  • 19. Supplementary page. What is a quantile plot? (Comparison with the histogram.) The author hopes to banish the histogram and replace it with the so-called “quantile plot” or “fractile plot”. The latter avoids a problem of the histogram, which requires `carefully' choosing how to group the given values, a choice that affects the subsequent analysis.
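The numbers behind a quantile plot can be inspected without any plotting at all. The sketch below prints a few percentiles with the nearest-rank rule (the value at position ceil(p/100 × n) of the sorted input); the function name `quantiles_sketch` and the chosen percentiles are assumptions, and the author's `fractile` command draws an actual plot through R instead.

```sh
# quantiles_sketch: read one number per line and print the 0th, 25th, 50th,
# 75th and 100th percentiles using the nearest-rank rule.
quantiles_sketch() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      n = split("0 25 50 75 100", p, " ")
      for (i = 1; i <= n; i++) {
        # Integer ceiling of p * NR / 100, clamped to the first value.
        k = int((p[i] * NR + 99) / 100)
        if (k < 1) k = 1
        print p[i] "%\t" v[k]
      }
    }'
}

# Demo on ten unsorted numbers.
printf '%s\n' 5 1 9 3 7 2 8 4 10 6 | quantiles_sketch
```

No bin width has to be chosen here, which is the advantage over the histogram that the slide points out.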
  • 20. Sampling, Shuffling. Without looking at randomly sampled data, you would probably come to believe something false about the data. Sample and shuffle! It's statistics.
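A simple way to look at a random subset of lines is to keep each line with a fixed probability. The following is a sketch using awk's `rand()`; the name `sample_sketch` and its arguments are assumptions and do not describe the options of the author's `sampleL`.

```sh
# sample_sketch P [SEED]: keep each input line independently with
# probability P (0 to 1). An optional SEED makes the run repeatable
# on the same awk implementation.
sample_sketch() {
  awk -v p="$1" -v seed="${2:-}" '
    BEGIN { if (seed != "") srand(seed); else srand() }
    rand() < p'
}

# Demo: keep roughly 30% of ten lines.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | sample_sketch 0.3 42
```

The output size varies from run to run. Where GNU coreutils is installed, `shuf -n 1000 file` is an alternative that draws a fixed number of lines, and plain `shuf` shuffles them.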
  • 21. CHAPTER 3. WHAT A DATA SCIENTIST DOES WHEN DATA FILES COME. The last chapter.
  • 22. Social-skill side (daily wisdom of data scientists): 1. Immediately check whether the data files are “different!” from what you expected. You are responsible. 2. Doubt everything you have heard about the data and check it quickly; also have a method for doing so quickly. 3. The total number of lines of each data file is important. Record them and take notes. You will easily forget the file names, and you will easily get lost in a maze during your dizzying analysis work, but those numbers help you greatly!
  • 23. Technical side: 1. Check whether your file suits your client's intention. Use less, file, nkf -g, lv, etc. 2. Record the sizes. `wc -l` and `awk "{print NF}" | uniq -c` show the numbers of lines and columns. 3. Convert the character encoding to UTF-8 and the line separator to “\n” (0A in ASCII). 4. Convert the CSV file into TSV! Beforehand, run fgrep “(Ctrl+V TAB)” file to check whether the data already contains tab characters.
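Steps 2 to 4 above can be sketched as small shell functions. The names `shape_sketch`, `tab_lines` and `strip_cr` are hypothetical helpers for this illustration, and the field separator is assumed to be a tab (the `awk` one-liner on the slide uses awk's default whitespace splitting instead).

```sh
# shape_sketch FILE: print the line count, then how many lines have each
# number of tab-separated fields. A clean table shows one field count only.
shape_sketch() {
  printf 'lines\t%s\n' "$(wc -l < "$1" | tr -d ' ')"
  awk -F '\t' '{ print NF }' "$1" | sort -n | uniq -c |
    awk '{ print "fields=" $2 "\t" $1 }'
}

# tab_lines FILE: number of lines that already contain a tab character,
# worth knowing before converting a CSV file to TSV.
tab_lines() {
  grep -c "$(printf '\t')" "$1" || true
}

# strip_cr: delete carriage returns so that CRLF line ends become LF (0A).
strip_cr() {
  tr -d '\r'
}

# Demo on a small made-up file whose last row is one field short.
demo_file=$(mktemp)
printf 'id\tname\tcity\n1\tabe\ttokyo\n2\tito\n' > "$demo_file"
shape_sketch "$demo_file"
tab_lines "$demo_file"
rm -f "$demo_file"
```

Recording the `shape_sketch` output next to each file name gives the line and column counts that the previous slide recommends noting down.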
  • 24. Pay attention: 1. CSV files may have been altered by Microsoft Excel (e.g. “1234” -> “1,234”; “5/1” -> “5月”, i.e. “May”). 2. Database software is often buggy in its import/export functions. 3. What you think is utf8 may be utf8mb4. What you think is SJIS may be CP932. Be careful! Be experienced!
  • 25. You need 5–20 rounds of debugging when you do something new. So try to make simple, error-free tools, and reuse them.
  • 26. Desirable functions to implement in the commands: 1. Data files may begin with a “header” line; handle that. 2. Gzipped files can be read and decompressed quickly; deal with gzip files directly. 3. A process may take a long time, so show its status while it is running. 4. Show some useful messages when the process is interrupted by Control-C.
  • 27. APPENDIX CHAPTER: COMMANDS PROVIDED.
  • 28. First of all (about the commands): each of the commands the author provides basically works standalone. Some may require library installations (using cpan). You can start by trying: crosstable, colors, colsummary. The command names are still changing: colors, color, or coloring.
  • 29. colsummary. This is the command most appreciated by my friends! Output fields:
      1st (white): column number
      2nd (green, bright): number of different values
      3rd (blue): average
      4th (yellow): column name
      5th (white, bright): value range
      6th (white): frequent values, from the top
      7th (green): their frequencies, top to bottom
      ‘colsummary’ is not designed to show the string lengths of column values; that function is implemented by another command, ‘lengths’. The average is useful when you want to decipher the additive relations among columns.
  • 30. samecols. At position (a,b) of the matrix, the WHITE part counts the lines where col[a] == col[b] holds, and the YELLOW part counts the lines where col[a] != col[b] holds. This matrix is helpful when you get a table file with many columns and want to know which groups of columns are similar. Sometimes 2 or 3 columns are mostly identical, and you can easily find out where that happens.
  • 31. headkeep. Basically, all the commands provided by bin4tsv are designed to handle the header line of the input when given a special option (simply, the -= option). However, the native Unix/Linux commands are outside this bin4tsv rule, so an intermediary is provided: the “headkeep” command. Examples: `headkeep sort < data` sorts only from the 2nd line on; `headkeep tail -3 < data` outputs the 1st line and the last 3 lines.
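The behavior described for headkeep can be imitated with a few lines of shell. This is a sketch of the idea only, under the assumption that the header is exactly one line; the name `headkeep_sketch` is hypothetical and the real command may be implemented quite differently.

```sh
# headkeep_sketch CMD [ARGS...]: pass the first input line through
# untouched, then run CMD on the remaining lines of standard input.
headkeep_sketch() {
  # Stop quietly on empty input; accept a last line without a newline.
  IFS= read -r header || [ -n "$header" ] || return 0
  printf '%s\n' "$header"
  "$@"
}

# Demo: sort the data rows while the header stays on top.
printf 'name\tscore\nbob\t3\nalice\t9\ncarol\t5\n' | headkeep_sketch sort
```

The shell's `read` consumes only the first line, so the command that follows sees the input starting from the 2nd line.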
  • 32. venn2-3. Shows the Venn diagram region sizes for 2 sets and for 3 sets. At the moment, this command depends on List::Compare and is slow.
  • 33. 1. Commands for primitive checking (command: output / remarks). Each command works standalone at the moment. The command names may change in the future.
      csv2tsv: TSV from CSV. Depends on Text::CSV_XS.
      tabsplit: One file for each column of the input.
      headkeep: The 1st line passes through; the following lines go to the specified command.
      color: Colored numbers with “-3” / tabbing with “-t num”.
      colsummary: Properties of each column of a file, shown quite nicely.
      lengths: Maps each cell to its string length. Try combining it with colsummary.
      chars: Splits strings into characters, on separate lines. Try combining it with sdiff.
      headtail: Shows the first 3 lines, the {5,10,20}x{1,10,100,1000,..}-th lines, and the last 3 lines.
      sampleL: Randomly chosen lines, with specified or weighted probabilities.
      shuffleL: Shuffles the lines (for a non-huge number of lines).
  • 34. 2. Commands related to subtotals (command: output / remarks).
      freq: Frequency tbl. Similar to ‘sort | uniq -c’.
      freqfreq: Freq. tbl. of freq. num. Similar to ‘freq | cut -f1 | freq’.
      crosstable: 2-dim freq. tbl. Input may be 3-col, in which case it outputs subtotals.
      samecols: Square matrix of how many values are the same between paired vars.
      venn2-3: Equivalent to drawing a Venn diagram for the keys of 2 or 3 files.
      venn4: Equivalent to drawing a Venn diagram for the keys of 4 files.
      fractile: Fractile plot, also known as quantile plot.
      marginsum: Experimental; attaches the marginal sums of a matrix.
  • 35. 3. Commands for combining tables (command: output / remarks). This part is still at the concept-design stage and is likely to change.
      columns: Retrieves and/or deletes a specified subset of the columns of a table.
      colAt: Searches the 1st line for a column name and returns its position.
      kvcmp: Functionally similar to ‘sdiff <(sort -k num1 file1) <(sort -k num2 file2)’.
      join2: join2 refFile < keys ; a replacement for the Unix join command.
      keytrack: Aligns many target cols for given keys (stdin), given many files.
      alluniq: Checks whether the given keys are all unique.
      layers: Experimental; numeric distributions of a col. for each val. of another col.
  • 36. 4. Commands for other utilities (command: output / remarks). This part is also at the concept-design stage and is likely to change.
      inarow: “In a row”. Outputs only the lines whose specified col. has the same value the specified number of consecutive times.
      groupsum: Experimental; subtotals of the specified col. for each consecutive NUM lines.
      transpose: Experimental; the transposed matrix of the given input matrix.
      rbind4R: Experimental; outputs the R command string from the given input matrix.
  • 37. Abbreviations / Glossary (abbr.: stands for; meaning).
      tbl.: table
      col.: column (of a table)
      var., vars.: variable(s). A var. is a more abstract concept than a column.
      val.: value. It is specific/concrete; cf. “the case when var. X equals val. ‘a’”.
      freq.: frequency, the number of appearances of a specific val. / specified var.
      num.: number. Often a mathematical integer; not text or a character.
      line key: a special var., used as an index to refer to other col. on the same line.