bin4tsv
1. How to tackle
big TSV files
2016.02.12 Fri.
Shimono Toshiyuki
General techniques for much of data preprocessing,
and an introduction to the author's own tools
2. Even in today's BIG DATA age,
Digital Data
- rarely tells anything by itself,
- needs to be handled skillfully,
- thus, new technology is required!
-- Although humans and animals can easily
handle data on the earth via “sense organs”!
3. Think of the real situation
when you receive big data files
to analyze as part of your work:
“Check it quickly, set up environments, and visualize your
result...”
4. Hints for speedy (pre-)analysis
with the CLI are presented here today.
CLI = “Command Line Interface”
前処理 (“MAE SHORI”)
What is “MAE SHORI” in English? Roughly, “preprocessing”.
6. Why TSV format?
Easier to handle than CSV!
CSV = Comma Separated Values
each value is often enclosed in double quotes (").
TSV = simply Tab Separated Values
less, cut, column, etc. handle TSV files well,
-- these Unix commands will appear again later.
Note that CSV is formally described by RFC 4180.
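The difference between the two formats can be sketched in Python. This is a minimal illustration, not the author's csv2tsv command; the function name `csv_to_tsv` and the sample data are assumptions made for this example. It relies on the standard `csv` module to undo the quoting, and refuses cells that plain TSV cannot represent.

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Convert quoted CSV text to plain TSV (hypothetical helper)."""
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        for cell in row:
            # A tab or newline inside a cell would break the TSV layout.
            if "\t" in cell or "\n" in cell or "\r" in cell:
                raise ValueError(f"cell cannot be stored in plain TSV: {cell!r}")
        out_lines.append("\t".join(row))
    return "\n".join(out_lines) + ("\n" if out_lines else "")

# Commas inside quoted cells are why a naive split on "," fails for CSV.
sample = 'id,name,amount\n1,"Sato, Taro","1,234"\n'
print(csv_to_tsv(sample), end="")
```

Once the data is in TSV, every cell boundary is a single tab, which is what lets tools such as cut work without a CSV parser.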
7. Concerning SQL :
When you use an SQL database,
you need to design tables/columns first,
so today's TSV-handling techniques matter.
The techniques presented from now on
are not easy to write as SQL statements.
Think of summing all the numbers in a big table.
Which is faster, SQL or tsv/csv format?
.. Type determination and outlier screening are must-do tasks.
Cross tables, Venn diagrams, quantile plots, etc.
True data analysis is not an aggregation of micro queries!
Retrieval from SQL
Pre-analysis required!
Importing into SQL again.
8. Preparatory Knowledge :
You are supposed to be familiar with
Unix/Linux commands such as :
less wc gawk cut head tail grep sed
file od iconv nkf lv
diff sdiff sort uniq paste column
-- including minor options of the above.
Also with techniques such as :
<( ) process substitution, stderr, /dev/null
9. CHAPTER 2.
LITTLE-KNOWN BASIC TECHNIQUES
Toward speedy handling of TSV files with 1,000–100,000,000 lines
10. Coloring makes reading comfortable.
Case 1 : So many columns in a TSV file
Case 2 : Many numbers with many digits
Can you speed-read “5 trillion” with 13 digits ?
The ‘less’ screen pager handles tabs well, but coloring is much more comfortable for you.
11. Supplementary Page
Your client gave you data with tens of
columns. You view it in an editor and want
to follow a few columns of interest.
If you color every 5 consecutive columns alternately black and
blue, the situation improves greatly.
Try “-x 12 ENTER” in less, and things improve a little.
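The alternating column coloring can be sketched in Python with ANSI escape sequences. This is only an illustration of the idea, not the author's color command; the function name `color_column_groups` is hypothetical, and "black" is taken here to mean the terminal's default color.

```python
BLUE = "\033[34m"   # ANSI escape sequence for blue foreground
RESET = "\033[0m"   # back to the terminal's default color

def color_column_groups(line: str, group: int = 5) -> str:
    """Color every other block of `group` tab-separated columns blue."""
    cells = line.rstrip("\n").split("\t")
    out = []
    for i, cell in enumerate(cells):
        if (i // group) % 2 == 1:      # columns 6-10, 16-20, ... (1-based)
            out.append(f"{BLUE}{cell}{RESET}")
        else:                          # columns 1-5, 11-15, ... keep the default color
            out.append(cell)
    return "\t".join(out)

print(color_column_groups("\t".join(f"c{i}" for i in range(1, 13))))
```

When the output is piped to a pager, less needs its -R option so that the escape sequences are shown as colors.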
12. Supplementary Page
Almost no existing basic CLI tool does digit
grouping. If you try it by putting commas between
digits, you may have to deal with the
collapsed layout. A simple solution is to color the digits
differently. The figure shows this solution, in which each
group of at most 3 consecutive digits, counted from the right,
shares the same color.
(Figure captions: “Commas have collapsed the layout” / “Coloring causes little trouble”)
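Digit grouping by color rather than by commas can be sketched as follows. This is an assumed illustration, not the author's implementation: the function name `color_digit_groups` and the two colors are arbitrary choices, and every run of digits is treated as one integer (decimal fractions get no special handling).

```python
import re

COLORS = ["\033[36m", "\033[33m"]  # cyan and yellow, alternating per 3-digit group
RESET = "\033[0m"

def color_digit_groups(text: str) -> str:
    """Color each run of digits in groups of 3, counted from the right."""
    def paint(match):
        digits = match.group(0)
        pieces = []
        end = len(digits)
        k = 0
        while end > 0:
            start = max(0, end - 3)
            pieces.append(COLORS[k % 2] + digits[start:end] + RESET)
            end = start
            k += 1
        return "".join(reversed(pieces))
    return re.sub(r"\d+", paint, text)

print(color_digit_groups("total\t5000000000000"))
```

Because no visible character is inserted, the column widths stay the same, which is the advantage over commas described above.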
13. Venn Diagrams
Scene:
Transaction data of Campaign (A) and Campaign (B).
Target customer data from the Tokyo branch (C) and the Osaka branch (D).
Then, how do you compare the 4 sets of key-column values?
You need to draw that complex Venn diagram because :
1. What your client/colleague told you may turn out to be “different” from the data.
2. Data files can be damaged while being exported or transferred.
A Venn diagram for 4 sets requires real skill !
(Figure: Venn diagram of the four sets A, B, C, D)
14. Supplementary Page
When you get multiple data files that share common “key” columns, what do
you do? Drawing the Venn diagram for the keys of each file, and filling in the
number of elements in each of its regions, is essential. Then you can see (1)
how one set contains another, (2) how sets overlap each other, (3) how
erroneous cases occur, and so on. The numbers filled into the regions are also
important for checking the data-processing steps that follow.
By the way, can you draw a Venn diagram and fill in the 15
numbers for 4 given sets, really neatly and quickly?
(Also, some combinations of the 15 regions, maybe 99 or more, are rather
meaningful. What are they, and how should they be represented?)
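Counting the elements of each region does not require drawing anything. The sketch below is an assumed illustration, not the author's venn4 command: the function name `venn_regions` and the toy sets are made up. Each key is labeled by the names of the sets that contain it, so 4 sets give at most 2^4 - 1 = 15 non-empty labels.

```python
from collections import Counter

def venn_regions(named_sets: dict) -> Counter:
    """Count keys per Venn region; a region is named by the sets containing the key."""
    names = list(named_sets)
    regions = Counter()
    for key in set().union(*named_sets.values()):
        members = "".join(n for n in names if key in named_sets[n])
        regions[members] += 1
    return regions

# Toy key sets standing in for the key columns of four files.
sets = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 5, 6},
    "D": {1, 6, 7},
}
for region, count in sorted(venn_regions(sets).items()):
    print(region, count, sep="\t")
```

A region label such as "AB" means "in A and B but in neither C nor D", so the counts over all labels add up to the number of distinct keys.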
(Figure: Venn diagram. ▲ Branko Grünbaum, figure from ja.wikipedia)
15. Frequency Table
Do you use ‘sort | uniq -c’ ? Why not use a shorter way?
The presented command also works well in Japanese environments,
and also offers functions such as various sort orders and cumulative sums.
(Example data: statuses of Twitter account pages.
About 0.01 - 0.1% of them seem not to work well.)
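A frequency table with cumulative sums can be sketched in a few lines of Python. This is not the author's freq command; the function name `freq_table` and the output layout (count, cumulative count, cumulative ratio, value) are assumptions for illustration.

```python
from collections import Counter

def freq_table(values):
    """Return rows of (count, cumulative count, cumulative ratio, value), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    # Sort by descending count, then by value so that ties come out in a stable order.
    for value, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        running += n
        rows.append((n, running, running / total, value))
    return rows

for row in freq_table(["ok", "ok", "error", "ok", "timeout", "error"]):
    print(*row, sep="\t")
```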
16. Cross Table
‘crosstable’ promptly outputs a cross table without SQL or an Excel pivot table.
The main functions : (1) counting for 2-column data,
(2) summing up the first column's values for 3-column data.
This is called a “contingency table” in statistics.
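The counting mode (1) can be sketched as follows. This is an assumed illustration rather than the author's crosstable command; the function name `cross_table` and the sample pairs are hypothetical, and the summing mode (2) is left out.

```python
from collections import Counter

def cross_table(pairs):
    """Build a tab-separated contingency table from (row_key, col_key) string pairs."""
    counts = Counter(pairs)
    row_keys = sorted({r for r, _ in counts})
    col_keys = sorted({c for _, c in counts})
    lines = ["\t".join([""] + col_keys)]          # header row: column keys
    for r in row_keys:
        cells = [str(counts[(r, c)]) for c in col_keys]  # missing pairs count as 0
        lines.append("\t".join([r] + cells))
    return "\n".join(lines)

pairs = [("Tokyo", "A"), ("Tokyo", "B"), ("Osaka", "A"), ("Tokyo", "A")]
print(cross_table(pairs))
```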
18. An alternative to the histogram
Use a “quantile plot”.
You can read each percentile directly.
The program uses the R language internally, so you must install R.
(Figure: quantile plots for millions of Twitter accounts. Green : number of accounts followed. Blue : number of followers. The same plot is also shown in LOG-SCALE, with an arrow marking “the wall of 2000”.)
The graphs are designed so that precise numbers are readable from the curves.
19. Supplementary Page
What is a quantile plot? (Comparison with the histogram.)
(The author hopes to retire the histogram and replace it with the so-called
“quantile plot” or “fractile plot”. The latter avoids a problem of the
histogram, which requires `carefully’ choosing how to group the
given values, a choice that affects the subsequent analysis.)
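The points of a quantile plot need no grouping decision at all, as this sketch shows. It only computes the points and is not the author's fractile command (which draws the plot with R); the function names and the nearest-rank percentile definition are assumptions chosen for simplicity.

```python
import math

def quantile_points(values):
    """Pair each sorted value with the fraction of values less than or equal to it."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i + 1) / n, v) for i, v in enumerate(ordered)]

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

followers = [5, 1, 3, 2, 4]
print(quantile_points(followers))
print(percentile(followers, 50))
```

Plotting the fraction on one axis and the value on the other gives the curve; any percentile can then be read off directly, with no bin width to choose.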
20. Sampling, Shuffling
Without looking at “randomly sampled data”,
you will probably get a false impression of
the data. Sample and Shuffle! It’s statistics.
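Both operations are short to write, which is a reason to do them routinely. The sketch below is an illustration in Python, not the author's sampleL or shuffleL; the function names and the `seed` parameter are assumptions added so that a run can be reproduced.

```python
import random

def sample_lines(lines, probability, seed=None):
    """Keep each line independently with the given probability, preserving order."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < probability]

def shuffle_lines(lines, seed=None):
    """Return the lines in random order (the whole input is held in memory)."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

lines = [f"row{i}" for i in range(20)]
print(sample_lines(lines, 0.25, seed=1))
print(shuffle_lines(lines, seed=1)[:5])
```

Sampling line by line works on streams of any length, while shuffling needs all lines at once, so it suits inputs that fit in memory.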
21. CHAPTER 3.
WHAT A DATA SCIENTIST DOES
WHEN DATA FILES COME
Last Chapter.
22. Social skill sides:
1. Immediately check whether the data files are
“different!” or not. You are responsible.
2. Doubt everything you have heard about the data,
and check it quickly.
Also have a method for doing so quickly.
3. The total number of lines of each data file
is important. Record them and take notes.
- You will easily forget the file names.
- You will easily get lost in a maze
during your dizzying analysis work.
- But those numbers greatly help you !
Daily wisdom of data scientists
23. Technical sides
1. Check whether your file matches your client’s
intention. Use less, file, nkf -g, lv, etc.
2. Record the sizes. ‘wc -l’ and ‘awk "{print
NF}" | uniq -c’ show the numbers of lines and columns.
3. Convert the character encoding to UTF-8 and the
line separator to “\n” ( 0A in ASCII ).
4. Convert the CSV file into TSV!
Beforehand, run fgrep “(Ctrl+V TAB)” file to check whether the file already contains tab characters.
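Steps 2 and 3 can be sketched together in Python. This is an assumed illustration, not part of bin4tsv: the function names are hypothetical, CP932 is only an example source encoding, and fields are counted per tab here, whereas awk's default NF splits on any whitespace.

```python
from collections import Counter

def normalize(raw: bytes, encoding: str = "cp932") -> str:
    """Decode bytes from the given encoding and unify line separators to LF."""
    text = raw.decode(encoding)
    return text.replace("\r\n", "\n").replace("\r", "\n")

def shape(text: str):
    """Return (number of lines, {number of tab-separated fields: number of lines})."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    field_counts = Counter(line.count("\t") + 1 for line in lines)
    return len(lines), dict(field_counts)

raw = "key\tvalue\r\nA\t1\r\nB\r\n".encode("cp932")
print(shape(normalize(raw)))
```

A field-count distribution with more than one entry, as in this toy input, is a quick sign that some lines are malformed.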
24. Pay Attention :
1. CSV files may have been altered by Microsoft
Excel. (e.g. “1234” -> “1,234”; “5/1” -> “5月” (“May”))
2. Much database software has buggy
import/export functions.
3. What you think is utf8 may be utf8mb4. What
you think is SJIS may be CP932.
-- Be careful !
-- Gain experience !
25. You need
5-20 rounds of debugging
when you do something new.
So try to make
simple and error-free tools,
and reuse them.
26. Desirable functions to be implemented in
the commands
1. Data files may begin with a “header” line.
Handle that.
2. GZIPPED files can be read/decompressed
quickly. Deal with gzip files directly.
3. Processing may take a long time, so show
the progress while it is running.
4. And show some useful messages when
the process is interrupted by Control-C.
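The four points above can be sketched in one small Python function. This is an illustration under assumptions, not how the bin4tsv commands are written: the function names, the UTF-8 encoding, and the reporting interval are all made up for the example.

```python
import gzip
import sys

def open_text(path):
    """Open a plain or gzip-compressed UTF-8 text file, decided by the gzip magic bytes."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")

def count_data_lines(path, has_header=True, report_every=1_000_000):
    """Count data lines, skipping an optional header and reporting progress on stderr."""
    count = 0
    header = None
    try:
        with open_text(path) as f:
            if has_header:
                header = f.readline().rstrip("\n")
            for count, _ in enumerate(f, start=1):
                if count % report_every == 0:
                    print(f"{count} lines read ...", file=sys.stderr)
    except KeyboardInterrupt:
        # Control-C: say how far we got instead of dying silently.
        print(f"interrupted after {count} data lines", file=sys.stderr)
    return header, count
```

Progress and interruption messages go to stderr so that they never mix with the data on stdout.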
28. First of all.. (about the commands)
Each of the commands the author provides
basically works standalone. Some may require
library installations (using cpan).
You can start by trying :
crosstable, colors, colsummary
- The command names are still changing :
colors or color or coloring
29. colsummary
• 1st (white) : column number
• 2nd (green, bright) : number of different values
• 3rd (blue) : average value
• 4th (yellow) : column name
• 5th (white, bright) : value range
• 6th (white) : most frequent values, from the top
• 7th (green) : their frequencies, from top to bottom
‘colsummary’ is not designed to show the string length of
column values. That function is implemented by another command,
‘lengths’. The average is useful when you want to decipher
additive relations among columns.
This is the command my friend appreciates most !
30. samecols
At position (a,b) of the matrix,
the number of lines satisfying col[a] == col[b] is shown in the WHITE part,
the number of lines satisfying col[a] != col[b] is shown in the YELLOW part.
This matrix is helpful when you get a table file with many columns and want
to know which groups of columns are similar. Sometimes 2 or 3
columns are mostly identical, and you can easily see how that happens.
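The two counts at each position can be sketched as a pair of matrices. This is an assumed illustration, not the author's samecols command; the function name `same_cols_matrix` and the toy rows are hypothetical, and all rows are assumed to have the same number of columns.

```python
def same_cols_matrix(rows):
    """Return (same, diff): per column pair, how many rows have equal / unequal values."""
    ncols = len(rows[0]) if rows else 0
    same = [[0] * ncols for _ in range(ncols)]
    for row in rows:
        for a in range(ncols):
            for b in range(ncols):
                if row[a] == row[b]:
                    same[a][b] += 1
    # Every row is either "same" or "different" for a pair, so diff is the remainder.
    diff = [[len(rows) - same[a][b] for b in range(ncols)] for a in range(ncols)]
    return same, diff

rows = [["1", "1", "x"], ["2", "2", "2"], ["3", "4", "3"]]
same, diff = same_cols_matrix(rows)
for line in same:
    print(*line, sep="\t")
```

A pair whose "same" count is close to the number of rows points to nearly duplicated columns.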
31. headkeep
Basically, all the commands
provided by bin4tsv are designed
to handle the header line of the input
when the command is
given a special option ( simply,
the -= option ).
However, native Unix/Linux
commands fall outside this rule of
bin4tsv, so a bridging tool is
provided. That is the “headkeep”
command.
Example:
headkeep sort < data
# sorts only from the 2nd line onward.
headkeep tail -3 < data
# outputs the 1st line and the last 3 lines.
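The behavior described above can be imitated in Python with `subprocess`. This is a sketch of the idea, not the author's headkeep; the function signature is an assumption, and the child command in the demo is a small Python one-liner standing in for sort so that the example does not depend on which Unix tools are installed.

```python
import subprocess
import sys

def headkeep(command, text):
    """Pass the first line through unchanged; feed the remaining lines to `command`."""
    head, sep, rest = text.partition("\n")
    result = subprocess.run(command, input=rest, capture_output=True, text=True, check=True)
    return head + sep + result.stdout

# Stand-in for `sort`: a child Python process that sorts its standard input.
sort_cmd = [sys.executable, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
print(headkeep(sort_cmd, "name\nc\na\nb\n"), end="")
```

On a Unix system, passing `["sort"]` or `["tail", "-3"]` as `command` corresponds to the two command lines in the example above.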
32. venn2-3
The Venn Diagram region sizes for 2 sets :
The Venn Diagram region sizes for 3 sets :
At this moment, this command depends on List::Compare and is slow.
33. 1. Commands for primitive checking
Command      Output / Remarks
csv2tsv      TSV from CSV. Depends on Text::CSV_XS.
tabsplit     One file for each column of the input.
headkeep     The 1st line passes through. The remaining lines go to the specified command.
color        Colors numbers with “-3” / sets tabbing with “-t num”.
colsummary   Properties of each column of a file, shown quite nicely.
lengths      Maps each cell to its string length. Try combining with colsummary.
chars        Splits each string into single characters, on separate lines. Try combining with sdiff.
headtail     Shows the first 3 lines, the {5,10,20}x{1,10,100,1000,..}-th lines, and the last 3 lines.
sampleL      Randomly chosen lines, with specified or weighted probabilities.
shuffleL     Shuffles the lines (for a non-huge number of lines).
Each command works stand-alone at this moment.
The command names may change in the future.
34. 2. Commands related to subtotals
Command      Output / Remarks
freq         Frequency tbl. Similar to ‘sort | uniq -c’.
freqfreq     Freq. tbl. of freq. nums. Similar to ‘freq | cut -f1 | freq’.
crosstable   2-dim. freq. tbl. Input may be 3-col., in which case subtotals are output.
samecols     Square matrix of how many values are the same between paired vars.
venn2-3      Equivalent to drawing a Venn diagram for the keys of 2 or 3 files.
venn4        Equivalent to drawing a Venn diagram for the keys of 4 files.
fractile     Fractile plot, also known as a quantile plot.
marginsum    Experimental; attaches marginal sums to a matrix.
35. 3. Commands for combining tables
Command      Output / Remarks
columns      Retrieves and/or deletes a specified subset of the columns of a table.
colAt        Searches the 1st line for a col. name and returns its position.
kvcmp        Functionally similar to ‘sdiff <(sort -k num1 file1) <(sort -k num2 file2)’.
join2        join2 refFile < keys ; a replacement for the Unix join command.
keytrack     Aligns many target cols. for given keys (stdin), across many given files.
alluniq      Checks whether the given keys are all unique.
layers       Experimental; numeric distributions of a col. for each val. of another col.
This part is still at the concept-design stage and is therefore likely to change.
36. 4. Commands for other utilities
Command      Output / Remarks
inarow       “In a row”. Outputs only the lines whose specified col. has the same value the specified number of consecutive times.
groupsum     Experimental; subtotals of a specified col. for each consecutive NUM lines.
transpose    Experimental; the transpose of the given input matrix.
rbind4R      Experimental; outputs an R command string for the given input matrix.
This part is also still at the concept-design stage and is therefore likely to change.
37. Abbreviations / Glossary
Abbr.        Stands for ; Meaning
tbl.         table
col.         column (of a table)
var. vars.   variable(s). A var. is a more abstract concept than a column.
val.         value. It is specific/concrete; cf. “the case when var. X equals val. ‘a’”.
freq.        frequency; the number of occurrences of a specific val. of a specified var.
num.         number. Often a mathematical integer, not text or a character.
line
key          A special var. used as an index to refer to other cols. on the same line.