SlideShare una empresa de Scribd logo
1 de 19
MeasureCamp VI
*NIX for ETL
(Using Scripting to Automate Data Cleaning & Transformation)
Andrew Hood
@lynchpin
Is this a 1970s tribute presentation?
UNIX
BSD Mac OS X
Linux Android
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 2
Why is *NIX relevant to analytics?
• Most input data is crap. And increasingly large.
• Requires data processing:
– Sequences of transformations
– Reproducibility
– Flexibility
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 3
Download
file(s)
Remove dodgy
characters
Fix date
format
Load into
database
Do I have it/how do I get *NIX?
• You might have it already:
– Max OS X, Android, …
• Linux
– Stick on an old PC
• E.g. http://www.ubuntu.com/
– 12 months free on AWS
• http://aws.amazon.com/free/
• Windows
– MinGW/MSYS ports of all the common command line tools
• http://www.mingw.org/
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 4
3 Key UNIX Principles
1. Small is Beautiful
2. Everything is a Filter
3. Silence is Golden
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 5
Principle 1: Small is Beautiful
• A UNIX command does one simple thing efficiently
– ls: lists files in the current directory
– cat: concatenates files together
– sort: er, sorts stuff
– uniq: de-duplicates stuff
– grep: search/filter on steroids
– sed: search and replace on steroids
– awk: calculator on steroids
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 6
In contrast to…
• One tool that does lots of things badly.
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 7
Learning Curve Disclaimer
• Commands are named to minimise typing (not deliberately
maximise obfuscation!)
– ls = list
– mv = move
– cp = copy
– pwd = present working directory
• man [command] is your friend – the manual pages
– E.g. man ls
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 8
Principle 2: Everything is a Filter
• Pretty much every command can take an input and generate
an output based on that input.
– This is rather good for data processing!
• Key punctuation:
– | chains output of one command to input of the next
– > redirects output to a file (or device)
• Example:
– cat myfile.txt | sort | uniq > sorted-and-deduplicated.txt
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 9
Terry Pratchett Tribute Slide
• How to remember what pipes and redirection do
• Source: alt.fan.pratchett (yes, USENET)
• A better LOL:
– C|N>K
• Coffee (filtered) through nose (redirected) to keyboard
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 10
Key Principle 3: Silence is Golden
• Commands have 3 default streams:
– STDIN: text streaming into the command
– STDOUT: text streaming out of the command
– STDERR: any errors or warnings from running the command
• If a command completes successfully you get no messages
– No “Done.”, “15 files copied” or “Are you sure?” prompts!
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 11
Example 1: Quick Line/Field Count
• How big are these massive compressed Webtrends Logs for December 2014?
% zcat *2014-12*.log.gz | wc -l
1514139
• And how many fields are there in the file?
% zcat *.log.gz | head -n5 | awk '{ print NF+1 }'
20
19
19
19
19
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 12
Example 1: Quick Line/Field Count
• Why 20 fields in the first line and 19 in the rest?
% zcat *2014-12*.log.gz | head -n1
Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-version cs(User-Agent)
cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id
• Let’s get rid of that header line and pop into one file:
% zcat *2014-12*.log.gz | grep –v '^Fields' > 2014-12.log
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 13
Example 2: Quick & Dirty Sums
• How many unique IP addresses in that logfile?
Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc-
bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id
% cat 2014-12.log | cut -d" " -f3 | sort | uniq | wc -l
8695
• What are the top 5 IP addresses by hits?
% cat 2014-12.log | cut -d" " -f3 | sort | uniq -c | sort -nr | head -n5
1110 91.214.5.128
840 193.108.78.10
730 193.108.73.47
184 203.112.87.138
173 203.112.87.220
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 14
(A Note on Regular Expressions)
• Best learn to love these
– . is any character.
–  escapes a character that has a meaning i.e. . is a literal full stop
– * Matches 0 or more characters
– + Matches 1 or more characters
– ? Matches 0 or 1 characters
– ^ Starts With
– $ Ends With
– [ ] matches a list or group of charcaters (e.g. [A-Za-z] any letter)
– ( ) groups and captures characters
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 15
Example 3: Fixing The Quotes/Date
% cat test.csv
Joe Bloggs,56.78,05/12/2014
Bill Bloggs,123.99,12/06/2014
% sed 's/[[:alnum:] /.]*/"&"/g' test.csv
"Joe Bloggs","56.78","05/12/2014"
"Bill Bloggs","123.99","12/06/2014“
% sed 's/[[:alnum:] /.]*/"&"/g' test.csv | sed 's|(..)/(..)/(....)|3-2-1|'
"Joe Bloggs","56.78","2014-12-05"
"Bill Bloggs","123.99","2014-06-12"
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 16
Example 4: Clean & Database Load
cat input.csv 
| grep '^[0-9]' 
| sed 's/,,/,NULL,/g' 
| sed 's/[^[:print:]]//g' 
| psql -c "copy mytable from stdin with delimiter ','"
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 17
Example 5: Basic Scripting
#!/bin/sh
DATE=`date -v -1d +%Y-%m-%d`
SOURCE="input-$DATE.zip"
LOGFILE="$DATE.log"
sftp -b /dev/stdin username@host 2>&1 >>$LOGFILE <<EOF
cd incoming
get $SOURCE
EOF
if [ ! -f "$SOURCE" ]; then
echo "Source file $SOURCE not found" >> $LOGFILE
exit 99
fi
unzip $SOURCE
echo "Source file $SOURCE successfully downloaded and decompressed" >> $LOGFILE
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 18
Resources
• UNIX Fundamentals:
– http://freeengineer.org/learnUNIXin10minutes.html
– http://www.ee.surrey.ac.uk/Teaching/Unix/
• Cheat Sheet:
– http://www.pixelbeat.org/cmdline.html
• Basic Analytics Applications
– http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
• Regular Expressions:
– http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf
– http://www.regexr.com/
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 19

Más contenido relacionado

La actualidad más candente

RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsJean-Paul Calbimonte
 
VoxxedDays Luxembourg 2019
VoxxedDays Luxembourg 2019VoxxedDays Luxembourg 2019
VoxxedDays Luxembourg 2019Cédrick Lunven
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyUwe Korn
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...Flink Forward
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Stephan Ewen
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetupKostas Tzoumas
 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastUwe Korn
 
ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用LINE Corporation
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkmxmxm
 
Predictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonPredictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonVasia Kalavri
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Ververica
 
LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話LINE Corporation
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonStephan Ewen
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkVasia Kalavri
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingKostas Tzoumas
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestDataGyula Fóra
 

La actualidad más candente (20)

RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
 
VoxxedDays Luxembourg 2019
VoxxedDays Luxembourg 2019VoxxedDays Luxembourg 2019
VoxxedDays Luxembourg 2019
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Data Analysis With Pandas
Data Analysis With PandasData Analysis With Pandas
Data Analysis With Pandas
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fast
 
ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Predictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonPredictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with Strymon
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
 
LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 

Destacado

Measurecamp 6 session: Effective Customer Feedback & Measurement Frameworks
Measurecamp 6 session: Effective Customer Feedback & Measurement FrameworksMeasurecamp 6 session: Effective Customer Feedback & Measurement Frameworks
Measurecamp 6 session: Effective Customer Feedback & Measurement FrameworksSean Burton
 
Breaking down the barriers to the use of digital analytics
Breaking down the barriers to the use of digital analyticsBreaking down the barriers to the use of digital analytics
Breaking down the barriers to the use of digital analyticsPeter O'Neill
 
User-Centric Analytics (MeasureCamp Talk)
User-Centric Analytics (MeasureCamp Talk)User-Centric Analytics (MeasureCamp Talk)
User-Centric Analytics (MeasureCamp Talk)Taste Medio
 
Superweek 2015 traffic attribution
Superweek 2015 traffic attributionSuperweek 2015 traffic attribution
Superweek 2015 traffic attributionJacob Kildebogaard
 
A/B Testing Pitfalls - MeasureCamp London 2015
A/B Testing Pitfalls - MeasureCamp London 2015A/B Testing Pitfalls - MeasureCamp London 2015
A/B Testing Pitfalls - MeasureCamp London 2015Michal Parizek
 
Customer Lifetime Value
Customer Lifetime ValueCustomer Lifetime Value
Customer Lifetime Valuepavel jašek
 
31 Ways To Destroy Your Google Analytics Implementation
31 Ways To Destroy Your Google Analytics Implementation31 Ways To Destroy Your Google Analytics Implementation
31 Ways To Destroy Your Google Analytics ImplementationCharles Meaden
 
Measurecamp - Improving e commerce tracking with universal analytics
Measurecamp - Improving e commerce tracking with universal analyticsMeasurecamp - Improving e commerce tracking with universal analytics
Measurecamp - Improving e commerce tracking with universal analyticsMatt Clarke
 
Getting to the People Behind The Keywords
Getting to the People Behind The KeywordsGetting to the People Behind The Keywords
Getting to the People Behind The KeywordsCarmen Mardiros
 
Don't you have a Funnel view? - Measure Camp VI
Don't you have a Funnel view? - Measure Camp VIDon't you have a Funnel view? - Measure Camp VI
Don't you have a Funnel view? - Measure Camp VIXavier Colomes
 
BrainSINS and AWS meetup Keynote
BrainSINS and AWS meetup KeynoteBrainSINS and AWS meetup Keynote
BrainSINS and AWS meetup KeynoteAndrés Collado
 
The Future of Customer Engagement
The Future of Customer EngagementThe Future of Customer Engagement
The Future of Customer EngagementAlterian
 
Customer lifetime value
Customer lifetime valueCustomer lifetime value
Customer lifetime valueyalisassoon
 
Predictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advicePredictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and adviceThe Marketing Distillery
 
The Who, What, Why, How of Lead Scoring
The Who, What, Why, How of Lead ScoringThe Who, What, Why, How of Lead Scoring
The Who, What, Why, How of Lead ScoringMarketo
 
Customer Engagement Model (CEM)
Customer Engagement Model (CEM)Customer Engagement Model (CEM)
Customer Engagement Model (CEM)Mani Dev Gyawali
 
Customer Insight Analysis
Customer Insight AnalysisCustomer Insight Analysis
Customer Insight AnalysisVC4
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesKimberley Mitchell
 

Destacado (20)

Measurecamp 6 session: Effective Customer Feedback & Measurement Frameworks
Measurecamp 6 session: Effective Customer Feedback & Measurement FrameworksMeasurecamp 6 session: Effective Customer Feedback & Measurement Frameworks
Measurecamp 6 session: Effective Customer Feedback & Measurement Frameworks
 
Measure camp pres 5 cro myths
Measure camp pres   5 cro mythsMeasure camp pres   5 cro myths
Measure camp pres 5 cro myths
 
Breaking down the barriers to the use of digital analytics
Breaking down the barriers to the use of digital analyticsBreaking down the barriers to the use of digital analytics
Breaking down the barriers to the use of digital analytics
 
User-Centric Analytics (MeasureCamp Talk)
User-Centric Analytics (MeasureCamp Talk)User-Centric Analytics (MeasureCamp Talk)
User-Centric Analytics (MeasureCamp Talk)
 
Superweek 2015 traffic attribution
Superweek 2015 traffic attributionSuperweek 2015 traffic attribution
Superweek 2015 traffic attribution
 
A/B Testing Pitfalls - MeasureCamp London 2015
A/B Testing Pitfalls - MeasureCamp London 2015A/B Testing Pitfalls - MeasureCamp London 2015
A/B Testing Pitfalls - MeasureCamp London 2015
 
Customer Lifetime Value
Customer Lifetime ValueCustomer Lifetime Value
Customer Lifetime Value
 
31 Ways To Destroy Your Google Analytics Implementation
31 Ways To Destroy Your Google Analytics Implementation31 Ways To Destroy Your Google Analytics Implementation
31 Ways To Destroy Your Google Analytics Implementation
 
Measurecamp - Improving e commerce tracking with universal analytics
Measurecamp - Improving e commerce tracking with universal analyticsMeasurecamp - Improving e commerce tracking with universal analytics
Measurecamp - Improving e commerce tracking with universal analytics
 
Getting to the People Behind The Keywords
Getting to the People Behind The KeywordsGetting to the People Behind The Keywords
Getting to the People Behind The Keywords
 
Don't you have a Funnel view? - Measure Camp VI
Don't you have a Funnel view? - Measure Camp VIDon't you have a Funnel view? - Measure Camp VI
Don't you have a Funnel view? - Measure Camp VI
 
BrainSINS and AWS meetup Keynote
BrainSINS and AWS meetup KeynoteBrainSINS and AWS meetup Keynote
BrainSINS and AWS meetup Keynote
 
The Future of Customer Engagement
The Future of Customer EngagementThe Future of Customer Engagement
The Future of Customer Engagement
 
Customer lifetime value
Customer lifetime valueCustomer lifetime value
Customer lifetime value
 
Predictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advicePredictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advice
 
The Who, What, Why, How of Lead Scoring
The Who, What, Why, How of Lead ScoringThe Who, What, Why, How of Lead Scoring
The Who, What, Why, How of Lead Scoring
 
Customer Engagement Model (CEM)
Customer Engagement Model (CEM)Customer Engagement Model (CEM)
Customer Engagement Model (CEM)
 
Customer Insight Analysis
Customer Insight AnalysisCustomer Insight Analysis
Customer Insight Analysis
 
Customer engagement
Customer engagementCustomer engagement
Customer engagement
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use Cases
 

Similar a Nix for etl using scripting to automate data cleaning & transformation

Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...DataKitchen
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedGuido Schmutz
 
Applied Detection and Analysis Using Flow Data - MIRCon 2014
Applied Detection and Analysis Using Flow Data - MIRCon 2014Applied Detection and Analysis Using Flow Data - MIRCon 2014
Applied Detection and Analysis Using Flow Data - MIRCon 2014chrissanders88
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInrajappaiyer
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)Toshiyuki Shimono
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Hakka Labs
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019
Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019
Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019Codemotion
 
Go with the Flow-v2
Go with the Flow-v2Go with the Flow-v2
Go with the Flow-v2Zobair Khan
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Cloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisCloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisAlex Henthorn-Iwane
 

Similar a Nix for etl using scripting to automate data cleaning & transformation (20)

Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms compared
 
Applied Detection and Analysis Using Flow Data - MIRCon 2014
Applied Detection and Analysis Using Flow Data - MIRCon 2014Applied Detection and Analysis Using Flow Data - MIRCon 2014
Applied Detection and Analysis Using Flow Data - MIRCon 2014
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedIn
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019
Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019
Pablo Musa - Managing your Black Friday Logs - Codemotion Amsterdam 2019
 
Go with the Flow-v2
Go with the Flow-v2Go with the Flow-v2
Go with the Flow-v2
 
Go with the Flow
Go with the Flow Go with the Flow
Go with the Flow
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Cloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisCloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow Analysis
 

Más de Lynchpin Analytics Consultancy

Attribution Modelling: Real Business Applications that work.
Attribution Modelling: Real Business Applications that work. Attribution Modelling: Real Business Applications that work.
Attribution Modelling: Real Business Applications that work. Lynchpin Analytics Consultancy
 
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinMeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinLynchpin Analytics Consultancy
 
Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...
Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...
Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...Lynchpin Analytics Consultancy
 

Más de Lynchpin Analytics Consultancy (6)

Attribution Modelling: Real Business Applications that work.
Attribution Modelling: Real Business Applications that work. Attribution Modelling: Real Business Applications that work.
Attribution Modelling: Real Business Applications that work.
 
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinMeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
 
Lynchpin Measurement and Analytics Survey 2015
Lynchpin Measurement and Analytics Survey 2015Lynchpin Measurement and Analytics Survey 2015
Lynchpin Measurement and Analytics Survey 2015
 
Econsultancy measurement-and-analytics-2015
Econsultancy measurement-and-analytics-2015Econsultancy measurement-and-analytics-2015
Econsultancy measurement-and-analytics-2015
 
Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...
Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...
Brighton seo april 2015 lynchpin. segementation clustering presentation. crea...
 
Data Analytics Time to Grow Up
Data Analytics Time to Grow Up Data Analytics Time to Grow Up
Data Analytics Time to Grow Up
 

Último

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 

Último (20)

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 

Nix for etl using scripting to automate data cleaning & transformation

  • 1. MeasureCamp VI *NIX for ETL (Using Scripting to Automate Data Cleaning & Transformation) Andrew Hood @lynchpin
  • 2. Is this a 1970s tribute presentation? UNIX BSD Mac OS X Linux Android 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 2
  • 3. Why is *NIX relevant to analytics? • Most input data is crap. And increasingly large. • Requires data processing: – Sequences of transformations – Reproducibility – Flexibility 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 3 Download file(s) Remove dodgy characters Fix date format Load into database
  • 4. Do I have it/how do I get *NIX? • You might have it already: – Max OS X, Android, … • Linux – Stick on an old PC • E.g. http://www.ubuntu.com/ – 12 months free on AWS • http://aws.amazon.com/free/ • Windows – MinGW/MSYS ports of all the common command line tools • http://www.mingw.org/ 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 4
  • 5. 3 Key UNIX Principles 1. Small is Beautiful 2. Everything is a Filter 3. Silence is Golden 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 5
  • 6. Principle 1: Small is Beautiful • A UNIX command does one simple thing efficiently – ls: lists files in the current directory – cat: concatenates files together – sort: er, sorts stuff – uniq: de-duplicates stuff – grep: search/filter on steroids – sed: search and replace on steroids – awk: calculator on steroids 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 6
  • 7. In contrast to… • One tool that does lots of things badly. 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 7
  • 8. Learning Curve Disclaimer • Commands are named to minimise typing (not deliberately maximise obfuscation!) – ls = list – mv = move – cp = copy – pwd = present working directory • man [command] is your friend – the manual pages – E.g. man ls 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 8
  • 9. Principle 2: Everything is a Filter • Pretty much every command can take an input and generate an output based on that input. – This is rather good for data processing! • Key punctuation: – | chains output of one command to input of the next – > redirects output to a file (or device) • Example: – cat myfile.txt | sort | uniq > sorted-and-deduplicated.txt 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 9
  • 10. Terry Pratchett Tribute Slide • How to remember what pipes and redirection do • Source: alt.fan.pratchett (yes, USENET) • A better LOL: – C|N>K • Coffee (filtered) through nose (redirected) to keyboard 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 10
  • 11. Key Principle 3: Silence is Golden • Commands have 3 default streams: – STDIN: text streaming into the command – STDOUT: text streaming out of the command – STDERR: any errors or warnings from running the command • If a command completes successfully you get no messages – No “Done.”, “15 files copied” or “Are you sure?” prompts! 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 11
  • 12. Example 1: Quick Line/Field Count • How big are these massive compressed Webtrends Logs for December 2014? % zcat *2014-12*.log.gz | wc -l 1514139 • And how many fields are there in the file? % zcat *.log.gz | head -n5 | awk '{ print NF+1 }' 20 19 19 19 19 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 12
  • 13. Example 1: Quick Line/Field Count • Why 20 fields in the first line and 19 in the rest? % zcat *2014-12*.log.gz | head -n1 Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id • Let’s get rid of that header line and pop into one file: % zcat *2014-12*.log.gz | grep –v '^Fields' > 2014-12.log 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 13
  • 14. Example 2: Quick & Dirty Sums • How many unique IP addresses in that logfile? Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc- bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id % cat 2014-12.log | cut -d" " -f3 | sort | uniq | wc -l 8695 • What are the top 5 IP addresses by hits? % cat 2014-12.log | cut -d" " -f3 | sort | uniq -c | sort -nr | head -n5 1110 91.214.5.128 840 193.108.78.10 730 193.108.73.47 184 203.112.87.138 173 203.112.87.220 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 14
  • 15. (A Note on Regular Expressions) • Best learn to love these – . is any character. – escapes a character that has a meaning i.e. . is a literal full stop – * Matches 0 or more characters – + Matches 1 or more characters – ? Matches 0 or 1 characters – ^ Starts With – $ Ends With – [ ] matches a list or group of charcaters (e.g. [A-Za-z] any letter) – ( ) groups and captures characters 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 15
  • 16. Example 3: Fixing The Quotes/Date % cat test.csv Joe Bloggs,56.78,05/12/2014 Bill Bloggs,123.99,12/06/2014 % sed 's/[[:alnum:] /.]*/"&"/g' test.csv "Joe Bloggs","56.78","05/12/2014" "Bill Bloggs","123.99","12/06/2014“ % sed 's/[[:alnum:] /.]*/"&"/g' test.csv | sed 's|(..)/(..)/(....)|3-2-1|' "Joe Bloggs","56.78","2014-12-05" "Bill Bloggs","123.99","2014-06-12" 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 16
  • 17. Example 4: Clean & Database Load cat input.csv | grep '^[0-9]' | sed 's/,,/,NULL,/g' | sed 's/[^[:print:]]//g' | psql -c "copy mytable from stdin with delimiter ','" 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 17
  • 18. Example 5: Basic Scripting #!/bin/sh DATE=`date -v -1d +%Y-%m-%d` SOURCE="input-$DATE.zip" LOGFILE="$DATE.log" sftp -b /dev/stdin username@host 2>&1 >>$LOGFILE <<EOF cd incoming get $SOURCE EOF if [ ! -f "$SOURCE" ]; then echo "Source file $SOURCE not found" >> $LOGFILE exit 99 fi unzip $SOURCE echo "Source file $SOURCE successfully downloaded and decompressed" >> $LOGFILE 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 18
  • 19. Resources • UNIX Fundamentals: – http://freeengineer.org/learnUNIXin10minutes.html – http://www.ee.surrey.ac.uk/Teaching/Unix/ • Cheat Sheet: – http://www.pixelbeat.org/cmdline.html • Basic Analytics Applications – http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/ • Regular Expressions: – http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf – http://www.regexr.com/ 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 19