Presented at IEEE BigData 2017, Boston, on Dec 11, 2017,
in the "3rd International Workshop on Methodologies to Improve Big Data projects".
The author is Toshiyuki Shimono, Digital Garage, Inc.
(This file is in PDF format rather than MS PowerPoint format, for a significantly smaller file size.)
2. Work Contributions
1. For exploring an unknown DB,
  • Organized the milestones.
  • Conceived the methods.
2. Proposed a beneficial software tool for “Big Data”.
  • No other such tool seems to exist except [Shimono 16],
    based on the surveys [Saltz, Shamshurin 16], [Kumar, Alencar 16].
3. Reducing the labor needed to understand a DB.
  • By shrinking it from months to a week.
  • “Knowing” latently dominates a data analysis project.
A similar slide appears again at the end.
12. Assumptions about the environment
• CLI (Command Line Interface)
  • to produce SQL statements.
• SQL-type DB
  • to store the data.
  • to process the data.
[Figure: two screenshots, labeled "Command Line Interface (CLI)" and "SQL Client software". Callouts: "SQL statements are entered here." and "The SQL output appears here."]
18. The program functions
Program name         What the produced SQL statement(s) do
serverInfo           Show the version information of the SQL DB system.
tableLines           Count the records of each table.
tableColumns         Show the column information of all tables.
sampleRows           Randomly sample rows.
minMax               Take the min/max of each column. Also take the 4 values.
mostFreq/FewId       Take the most/least frequent values of each column.
distinctCount        Count the distinct values of each column.
hasChar/nullCount    Count the values containing a specific character, or the null values.
byteTable/byteCol    Compute or estimate the byte size of each table or column.
vennTwo              Calculate how sets of values overlap.
newTable             Create a table with ease.
hashSum              Sum numerically mapped SHA-1 values to compare tables.
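
As a rough illustration of how programs such as tableLines, distinctCount, nullCount and minMax might emit SQL, the following Python sketch builds the statements as strings and runs them against an in-memory SQLite database. The function names, the exact SQL text, and the `sales` table are assumptions made for this example; they are not taken from the actual tool.

```python
import sqlite3

def table_lines_sql(table):
    """Build SQL that counts the records of one table (cf. tableLines)."""
    return f'SELECT COUNT(*) FROM "{table}"'

def column_profile_sql(table, column):
    """Build SQL returning distinct count, null count, min and max of one
    column (cf. distinctCount, nullCount, minMax)."""
    return (
        f'SELECT COUNT(DISTINCT "{column}"), '
        f'SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END), '
        f'MIN("{column}"), MAX("{column}") '
        f'FROM "{table}"'
    )

def profile_table(conn, table):
    """Run the generated statement for every column of `table`."""
    # PRAGMA table_info is SQLite-specific; other DB systems expose column
    # metadata differently (e.g. through information_schema.columns).
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    return {c: conn.execute(column_profile_sql(table, c)).fetchone() for c in columns}

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, shop TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 250), (3, "A", None), (4, "C", 80)],
)

row_count = conn.execute(table_lines_sql("sales")).fetchone()[0]
profile = profile_table(conn, "sales")
print(row_count)
for column, stats in profile.items():
    print(column, stats)
```

Generating the statement as text keeps the sketch close to the idea of a CLI program that only produces SQL, leaving the storage and computation to the DB.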
26. Are the 4 values enough to see a column?
Skip this page unless time permits.
• 2 values (e.g. min/max) would not work ☹.
• 4 values can be misleading in rare cases,
  but so far they work well in practice, as shown later.
• How about 5, 6, or more values?
  • The min/max from the 3rd set can be added.
  • This is indeed good for seeing the various/lengthy text values ☺.
  • But it stops being simple: it requires complex SQL.
  • It took much computation time when I once tried it ☹.
39. An application example.
1. You may have a lot of tables.
2. You understand each of their columns by:
  • seeing some of the concrete values,
  • seeing the special and anomalous values,
  • identifying all the columns that share the same codes.
3. Thereafter, you can:
  1. narrow down to modest-sized tables.
  2. easily handle the data for visualizations.
  3. summarize the data you need into one table
     that can be handled by many mathematical methods.
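
Finding the columns that share the same codes (the last bullet of step 2) amounts to measuring how the value sets of two columns overlap, which is what vennTwo is described as calculating. A minimal sketch in Python with SQLite follows. The function name `venn_two`, the SQL, and the `orders`/`customers` tables are illustrative assumptions, not the tool's actual implementation.

```python
import sqlite3

def venn_two(conn, table_a, col_a, table_b, col_b):
    """Count distinct non-null values found only in A, in both, and only in B."""
    a = f'SELECT DISTINCT "{col_a}" AS v FROM "{table_a}" WHERE "{col_a}" IS NOT NULL'
    b = f'SELECT DISTINCT "{col_b}" AS v FROM "{table_b}" WHERE "{col_b}" IS NOT NULL'
    both = conn.execute(f"SELECT COUNT(*) FROM ({a} INTERSECT {b})").fetchone()[0]
    only_a = conn.execute(f"SELECT COUNT(*) FROM ({a} EXCEPT {b})").fetchone()[0]
    only_b = conn.execute(f"SELECT COUNT(*) FROM ({b} EXCEPT {a})").fetchone()[0]
    return only_a, both, only_b

# Hypothetical tables: orders.customer_id is expected to reference customers.id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 2), (13, 5), (14, None)],
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)

overlap = venn_two(conn, "orders", "customer_id", "customers", "id")
print(overlap)  # (only in orders, in both, only in customers)
```

A large "in both" count relative to the two "only" counts suggests that the two columns use the same code system; values that appear on only one side are candidates for anomalies worth inspecting.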
41. Contributions (summary)
1. For exploring an unknown DB,
  • Organized the milestones.
  • Conceived the methods.
2. Proposed a beneficial software tool for Big Data.
  • No other such tool seems to exist except [Shimono 16],
    based on the surveys [Saltz, Shamshurin 16], [Kumar, Alencar 16].
3. Reducing the labor needed to understand a DB.
  • By shrinking it from months to only a week.
  • “Knowing” latently dominates a data analysis project.
51. Are the 4 values enough to see a column?
Skip this page unless time permits.
• Only 2 values (e.g. min/max) would not work ☹.
• Only 4 values may possibly be misleading ☹.
• Aligning more than 4 values:
  • The min/max from some (the third) set can be added.
  • This is indeed good for seeing the various/lengthy text values ☺.
  • It took much computation time when I once tried it ☹.
  • SQL may really need “second_min” and “second_max”.
• Misc.
  • Care for null values is desirable.
  • The frequency number may be desirable.
  • Information on the value lengths is helpful.
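
Standard SQL has no built-in “second_min” or “second_max” aggregate, which is one reason showing more values per column needs more complex statements. The sketch below shows one possible way to emulate them with scalar subqueries, using Python and SQLite. The function name `second_min_max_sql`, the table `t`, and the choice of these particular four values are assumptions for illustration; this is not necessarily how the tool defines its 4 values.

```python
import sqlite3

def second_min_max_sql(table, column):
    """Build SQL returning min, second-smallest, second-largest and max
    distinct values of a column (NULLs are ignored by the comparisons)."""
    col, tbl = f'"{column}"', f'"{table}"'
    return (
        f"SELECT MIN({col}), "
        f"(SELECT MIN({col}) FROM {tbl} WHERE {col} > (SELECT MIN({col}) FROM {tbl})), "
        f"(SELECT MAX({col}) FROM {tbl} WHERE {col} < (SELECT MAX({col}) FROM {tbl})), "
        f"MAX({col}) FROM {tbl}"
    )

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?)", [(3,), (7,), (7,), (10,), (None,), (1,)]
)

four_values = conn.execute(second_min_max_sql("t", "x")).fetchone()
print(four_values)
```

Each extra value costs additional scans of the column in this formulation, which is consistent with the note above that going beyond 4 values took much computation time.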