SlideShare una empresa de Scribd logo
1 de 37
Descargar para leer sin conexión
Dirty Data? Clean it up!
Or, how to do data science in the real world.
Dan Lynn
CEO, AgilData
@danklynn
dan@agildata.com
Patrick Russell
Director, Data Science, Craftsy
@patrickrm101
patrick@craftsy.com
© Phil Mislinksi - www.pmimage.com
Patrick Russell - Bass
Director, Data Science, Craftsy
Dan Lynn - Guitar
CEO, AgilData
© Phil Mislinksi - www.pmimage.com
www.craftsy.com
Learn It. Make it.
Explore expert-led video classes and shop
the best yarn, fabric and supplies for
quilting, sewing, knitting, cake
decorating & more.
© Phil Mislinksi - www.pmimage.com
EXPERT SOLUTIONS AND SERVICES FOR
COMPLEX DATA PROBLEMS
At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on
the promise of Big Data and complex data infrastructures:
● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined
with 24×7 remote managed services for DBA/DevOps
● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data
pipeline orchestration, ETL, APIs and custom applications.
www.agildata.com
Hey, you’re a data scientist, right? Great!
We have millions of users. How we can use email
to monetize our user base better?
— Marketing
1 / 1 + exp(-x)
https://www.etsy.com/shop/NausicaaDistribution
Source: https://www.oreilly.com/ideas/2015-data-science-salary-survey
http://www.lavante.com/the-hub/ap-industry/lavante-and-spend-matters-look-at-how-dirty-vendor-data-impacts-your-bottom-line/
Data Cleansing
Data Cleansing
● Dates & Times
● Numbers & Strings
● Addresses
● Clickstream Data
● Handling missing data
● Tidy Data
Dates & Times
● Timestamps can mean different things
○ ingested_date, event_timestamp
● Clocks can’t be trusted
○ Server time: which server? Is it synchronized?
○ Client time? Is there a synchronizing time scheme?
● Timezones
○ What tz is your own data in?
○ Your email provider? Your adwords account? Your Google Analytics?
Numbers & Strings
● Use the right types for your numbers (int, bigint, float, numeric
etc)
● Murphy’s Law of text inputs: If a user can put something in a text
field, anything and everything will happen.
● Watch out for floating point precision mistakes
Addresses
● Parsing / validation is not something you want to do yourself
○ USPS has validation and zip lookup for US addresses: https://www.usps.
com/business/web-tools-apis/documentation-updates.htm
● Remember zip codes are strings. And the rest of the world does not
use U.S. zips.
● IP geolocation: Get lat/long, state, city, postal & ISP, from visitor
IPs
○ https://www.maxmind.com/en/geoip2-city
○ This is ALWAYS approximate
● If working with GIS, recommend http://postgis.net/
○ Vanilla postgres also has earthdistance for great circle distance
Clickstream Data
● User agent => Device: Don’t do this yourself (we use WURFL and Google
Analytics)
● Query strings follow the rules of text. Everything will show up
○ They might be truncated
○ URL encoding might be missing characters (%2 instead of %20)
○ Use a library to parse params (ie Python ships with urlparse.parse_qs)
● If your system creates sessions (tomcat, Google Analytics), don’t be
afraid to create your own sessions on top of the pageview data
○ You’ll cross channel and cross device behavior this way
Clickstream Data
Missing / empty data
● Easy to overlook but important
● What does missing data in the context of your analysis mean?
○ Not collected (why not?)
○ Error state
○ N/A or undefined
○ Especially for histograms, missing data lead to very poor conclusions.
● Does your data use sentinel values? (ie -9999 or “null”)
○ df[‘nps_score’].replace(-9999, np.nan)
● Imputation
● Storage
Tidy Data
● Conceptual framework for structuring data for analysis and fitting
○ Each variable forms a column
○ Each observation is a row
○ Each type of observational unit forms a table
● Pretty much normal form from relational databases for stats
● Tidy can be different depending on the question asked
● R (dplyr, tidyr) and Python (pandas) have functions for making your
long data wide & wide data long (stack, unstack, melt, pivot)
● Paper: http://vita.had.co.nz/papers/tidy-data.pdf
● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
Tidy Data
● Example might be market place transaction data with 1 row per
transaction
● You might want to do analysis on participants, 1 row per participant
Hey, that’s a great model. How can we build it
into our decision-making process?
— Marketing
Operationalizing Data Science
● Doing an analysis once rarely delivers lasting value.
● The business needs continuous insight, so you need to get this stuff
into production.
○ Hosting
○ ETL
○ Pipelines
Operationalizing Data Science
Hosting
● Delivering continuous analyses requires operational infrastructure
○ Database(s)
○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc..)
○ REST services / microservices
● These all have uptime requirements. You need to involve your (dev)ops
team earlier rather than later.
● Microservices / REST endpoints have architectural implications
● Visualization tools
○ Local (e.g. Jupyter, Zeppelin)
○ On-premise (Arcadia Data, Tableau, Qlik)
○ Hosted (Chartio)
● Visualization tools often require a SQL interface, thus….
ETL - Extract, Transform, Load
● Often used to herd data into some kind of data warehouse (e.g. RDBMS
+ star schema, Hadoop w/ unstructured data, etc..)
● Not just for data warehousing
● Not just for modeling
● No general solution
● Tooling
○ Apache Spark, Apache Sqoop
○ Commercial Tools: Informatica, Vertica, SQL Server, DataVirtuality etc…
● And then there is Apache Kafka…and the “NoETL” movement
○ Book: “I <3 Logs” - by Jay kreps
○ Replay history from the beginning of time as needed
ETL - Extract, Transform, Load - Example
● Not just for production runs
○ For example, Patrick does a lot of time-to-event analysis on email opens,
transactions, visits.
■ Survival functions, etc...
○ Setup ETL that builds tables With the right shape to put right into models
Pipelines
● From data to model output
● Define dependencies and define DAG for the work
○ Steps defined by assigning input as output of prior steps
○ Luigi (http://luigi.readthedocs.io/en/stable/index.html)
○ Drake (https://github.com/Factual/drake)
○ Scikit learn has its own Pipeline
■ That can be part of your bigger pipeline
● Scheduling can be trickier than you think
○ Resource contention
○ Loose dependencies
○ Cron is fine but Jenkins works really well for this!
● Don’t be afraid to create and teardown full environments as steps
○ For example, spin up and configure an EMR cluster, do stuff, tear it down*
* make your VP of Infrastructure less miserable
Pipelines - Luigi
● Written in Python. Steps implemented by subclassing Task
● Visualize your DAG
● Supports data in relational DBs, Redshift, HDFS, S3, file system
● Flexible and extensible
● Can parallelize jobs
● Workflow runs by executing last step scheduling all dependencies
Pipelines - Luigi
Pipelines - Drake
● JVM (written in Clojure)
● Like a Makefile but for data work
● Supports commands in Shell, Python, Ruby, Clojure
Pipelines - More Tools
● Oozie
○ The default job orchestration engine for Hadoop. Can chain together multiple jobs
to form a complete DAG.
○ Open source
● Kettle
○ Old-school, but still relevant.
○ Visual pipeline designer. Execution engine
○ Open source
● Informatica
○ Visual pipeline designer, mature toolset
○ Commercial
● Datavirtuality
○ Treats all your stores (including Google Analytics) like schemas in a single db
○ Great for microservice architectures
○ Commercial
© Patrick Coppinger
Thanks!
dan@agildata.com — patrick@craftsy.com
@danklynn — @patrickrm101
Shameless Plug:
Tonight at Galvanize, join us at the Denver/Boulder Big Data Meetup
to learn about distributed system design! (ask Dan for details)
References
● I Heart Logs
○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
● Tidy Data
○ http://vita.had.co.nz/papers/tidy-data.pdf
Additional Tools
● Scientific python stack (ipython, numpy, scipy, pandas, matplotlib…)
● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…)
● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data
● jq: fast command line tool for working with json (ie pipe cURL to jq)
● psql (if you use postgresql or Redshift)

Más contenido relacionado

La actualidad más candente

Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Apache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemApache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
 

La actualidad más candente (20)

Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark Core
Spark CoreSpark Core
Spark Core
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Apache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemApache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source Ecosystem
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 

Destacado

【STR2 Klab プレゼンテーション】
【STR2 Klab プレゼンテーション】【STR2 Klab プレゼンテーション】
【STR2 Klab プレゼンテーション】Up Hatch
 
【STR2 株式会社ケイブ プレゼンテーション】
【STR2 株式会社ケイブ プレゼンテーション】【STR2 株式会社ケイブ プレゼンテーション】
【STR2 株式会社ケイブ プレゼンテーション】Up Hatch
 
Understanding the misconception
Understanding the misconceptionUnderstanding the misconception
Understanding the misconceptionleondorsey1986
 
Stacked deck presentation (1)
Stacked deck presentation (1)Stacked deck presentation (1)
Stacked deck presentation (1)Joe Hines
 
Slides with sound
Slides with soundSlides with sound
Slides with soundJenn Agee
 
Fall sem 2010 exam set #3
Fall sem 2010 exam set #3Fall sem 2010 exam set #3
Fall sem 2010 exam set #3tracygrem
 
Berkley Building Materials Project Gallary
Berkley Building Materials Project GallaryBerkley Building Materials Project Gallary
Berkley Building Materials Project Gallarysajidd
 
Technology and accountability – ideas
Technology and accountability – ideasTechnology and accountability – ideas
Technology and accountability – ideasLaina Emmanuel
 
Wikipedia: a model for using the Internet for good
Wikipedia: a model for using the Internet for goodWikipedia: a model for using the Internet for good
Wikipedia: a model for using the Internet for goodPete Forsyth
 
Mobile-led innovations for Direct customer relationships
Mobile-led innovations for Direct customer relationshipsMobile-led innovations for Direct customer relationships
Mobile-led innovations for Direct customer relationshipsRamesh Raman
 
Seeds Of Greatness
Seeds Of GreatnessSeeds Of Greatness
Seeds Of Greatnessmjandmichael
 
Interculturalidad compilacion de temas
Interculturalidad compilacion de temasInterculturalidad compilacion de temas
Interculturalidad compilacion de temasTeodoro Pozo Uribe
 

Destacado (20)

Tijerina-RDA-NISO-Task Groups-sept11
Tijerina-RDA-NISO-Task Groups-sept11Tijerina-RDA-NISO-Task Groups-sept11
Tijerina-RDA-NISO-Task Groups-sept11
 
【STR2 Klab プレゼンテーション】
【STR2 Klab プレゼンテーション】【STR2 Klab プレゼンテーション】
【STR2 Klab プレゼンテーション】
 
Pdr v2
Pdr v2Pdr v2
Pdr v2
 
【STR2 株式会社ケイブ プレゼンテーション】
【STR2 株式会社ケイブ プレゼンテーション】【STR2 株式会社ケイブ プレゼンテーション】
【STR2 株式会社ケイブ プレゼンテーション】
 
Understanding the misconception
Understanding the misconceptionUnderstanding the misconception
Understanding the misconception
 
Stacked deck presentation (1)
Stacked deck presentation (1)Stacked deck presentation (1)
Stacked deck presentation (1)
 
Slides with sound
Slides with soundSlides with sound
Slides with sound
 
Fall sem 2010 exam set #3
Fall sem 2010 exam set #3Fall sem 2010 exam set #3
Fall sem 2010 exam set #3
 
Silver
SilverSilver
Silver
 
It og medier.
It og medier. It og medier.
It og medier.
 
Jasmine soap
Jasmine soapJasmine soap
Jasmine soap
 
Berkley Building Materials Project Gallary
Berkley Building Materials Project GallaryBerkley Building Materials Project Gallary
Berkley Building Materials Project Gallary
 
Technology and accountability – ideas
Technology and accountability – ideasTechnology and accountability – ideas
Technology and accountability – ideas
 
Wikipedia: a model for using the Internet for good
Wikipedia: a model for using the Internet for goodWikipedia: a model for using the Internet for good
Wikipedia: a model for using the Internet for good
 
Presentacio1
Presentacio1Presentacio1
Presentacio1
 
EveryCoin
EveryCoinEveryCoin
EveryCoin
 
Epidermis
EpidermisEpidermis
Epidermis
 
Mobile-led innovations for Direct customer relationships
Mobile-led innovations for Direct customer relationshipsMobile-led innovations for Direct customer relationships
Mobile-led innovations for Direct customer relationships
 
Seeds Of Greatness
Seeds Of GreatnessSeeds Of Greatness
Seeds Of Greatness
 
Interculturalidad compilacion de temas
Interculturalidad compilacion de temasInterculturalidad compilacion de temas
Interculturalidad compilacion de temas
 

Similar a Clean Up Dirty Data and Operationalize Data Science

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...VMware Tanzu
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
How to become a data scientist
How to become a data scientist How to become a data scientist
How to become a data scientist Manjunath Sindagi
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Starring sakila my sql university 2009
Starring sakila my sql university 2009Starring sakila my sql university 2009
Starring sakila my sql university 2009David Paz
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfprevota
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 

Similar a Clean Up Dirty Data and Operationalize Data Science (20)

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
How to become a data scientist
How to become a data scientist How to become a data scientist
How to become a data scientist
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Starring sakila my sql university 2009
Starring sakila my sql university 2009Starring sakila my sql university 2009
Starring sakila my sql university 2009
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdf
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 

Más de Dan Lynn

The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsDan Lynn
 
AgilData - How I Learned to Stop Worrying and Evolve with On-Demand Schemas
AgilData - How I Learned to Stop Worrying and Evolve  with On-Demand SchemasAgilData - How I Learned to Stop Worrying and Evolve  with On-Demand Schemas
AgilData - How I Learned to Stop Worrying and Evolve with On-Demand SchemasDan Lynn
 
Data Streaming Technology Overview
Data Streaming Technology OverviewData Streaming Technology Overview
Data Streaming Technology OverviewDan Lynn
 
Data decay and the illusion of the present
Data decay and the illusion of the presentData decay and the illusion of the present
Data decay and the illusion of the presentDan Lynn
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Dan Lynn
 
Storing and manipulating graphs in HBase
Storing and manipulating graphs in HBaseStoring and manipulating graphs in HBase
Storing and manipulating graphs in HBaseDan Lynn
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
When it rains: Prepare for scale with Amazon EC2
When it rains: Prepare for scale with Amazon EC2When it rains: Prepare for scale with Amazon EC2
When it rains: Prepare for scale with Amazon EC2Dan Lynn
 

Más de Dan Lynn (8)

The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
 
AgilData - How I Learned to Stop Worrying and Evolve with On-Demand Schemas
AgilData - How I Learned to Stop Worrying and Evolve  with On-Demand SchemasAgilData - How I Learned to Stop Worrying and Evolve  with On-Demand Schemas
AgilData - How I Learned to Stop Worrying and Evolve with On-Demand Schemas
 
Data Streaming Technology Overview
Data Streaming Technology OverviewData Streaming Technology Overview
Data Streaming Technology Overview
 
Data decay and the illusion of the present
Data decay and the illusion of the presentData decay and the illusion of the present
Data decay and the illusion of the present
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
 
Storing and manipulating graphs in HBase
Storing and manipulating graphs in HBaseStoring and manipulating graphs in HBase
Storing and manipulating graphs in HBase
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
When it rains: Prepare for scale with Amazon EC2
When it rains: Prepare for scale with Amazon EC2When it rains: Prepare for scale with Amazon EC2
When it rains: Prepare for scale with Amazon EC2
 

Último

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Último (20)

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Clean Up Dirty Data and Operationalize Data Science

  • 1. Dirty Data? Clean it up! Or, how to do data science in the real world. Dan Lynn CEO, AgilData @danklynn dan@agildata.com Patrick Russell Director, Data Science, Craftsy @patrickrm101 patrick@craftsy.com
  • 2. © Phil Mislinksi - www.pmimage.com Patrick Russell - Bass Director, Data Science, Craftsy Dan Lynn - Guitar CEO, AgilData
  • 3. © Phil Mislinksi - www.pmimage.com www.craftsy.com Learn It. Make it. Explore expert-led video classes and shop the best yarn, fabric and supplies for quilting, sewing, knitting, cake decorating & more.
  • 4. © Phil Mislinksi - www.pmimage.com EXPERT SOLUTIONS AND SERVICES FOR COMPLEX DATA PROBLEMS At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on the promise of Big Data and complex data infrastructures: ● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined with 24×7 remote managed services for DBA/DevOps ● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data pipeline orchestration, ETL, APIs and custom applications. www.agildata.com
  • 5. Hey, you’re a data scientist, right? Great! We have millions of users. How we can use email to monetize our user base better? — Marketing
  • 6. 1 / 1 + exp(-x)
  • 7.
  • 8.
  • 9.
  • 10.
  • 12.
  • 15. Data Cleansing ● Dates & Times ● Numbers & Strings ● Addresses ● Clickstream Data ● Handling missing data ● Tidy Data
  • 16. Dates & Times ● Timestamps can mean different things ○ ingested_date, event_timestamp ● Clocks can’t be trusted ○ Server time: which server? Is it synchronized? ○ Client time? Is there a synchronizing time scheme? ● Timezones ○ What tz is your own data in? ○ Your email provider? Your adwords account? Your Google Analytics?
  • 17. Numbers & Strings ● Use the right types for your numbers (int, bigint, float, numeric etc) ● Murphy’s Law of text inputs: If a user can put something in a text field, anything and everything will happen. ● Watch out for floating point precision mistakes
  • 18. Addresses ● Parsing / validation is not something you want to do yourself ○ USPS has validation and zip lookup for US addresses: https://www.usps. com/business/web-tools-apis/documentation-updates.htm ● Remember zip codes are strings. And the rest of the world does not use U.S. zips. ● IP geolocation: Get lat/long, state, city, postal & ISP, from visitor IPs ○ https://www.maxmind.com/en/geoip2-city ○ This is ALWAYS approximate ● If working with GIS, recommend http://postgis.net/ ○ Vanilla postgres also has earthdistance for great circle distance
  • 19. Clickstream Data ● User agent => Device: Don’t do this yourself (we use WURFL and Google Analytics) ● Query strings follow the rules of text. Everything will show up ○ They might be truncated ○ URL encoding might be missing characters (%2 instead of %20) ○ Use a library to parse params (ie Python ships with urlparse.parse_qs) ● If your system creates sessions (tomcat, Google Analytics), don’t be afraid to create your own sessions on top of the pageview data ○ You’ll cross channel and cross device behavior this way
  • 21. Missing / empty data ● Easy to overlook but important ● What does missing data in the context of your analysis mean? ○ Not collected (why not?) ○ Error state ○ N/A or undefined ○ Especially for histograms, missing data lead to very poor conclusions. ● Does your data use sentinel values? (ie -9999 or “null”) ○ df[‘nps_score’].replace(-9999, np.nan) ● Imputation ● Storage
  • 22. Tidy Data ● Conceptual framework for structuring data for analysis and fitting ○ Each variable forms a column ○ Each observation is a row ○ Each type of observational unit forms a table ● Pretty much normal form from relational databases for stats ● Tidy can be different depending on the question asked ● R (dplyr, tidyr) and Python (pandas) have functions for making your long data wide & wide data long (stack, unstack, melt, pivot) ● Paper: http://vita.had.co.nz/papers/tidy-data.pdf ● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
  • 23. Tidy Data ● Example might be market place transaction data with 1 row per transaction ● You might want to do analysis on participants, 1 row per participant
  • 24. Hey, that’s a great model. How can we build it into our decision-making process? — Marketing
  • 26. ● Doing an analysis once rarely delivers lasting value. ● The business needs continuous insight, so you need to get this stuff into production. ○ Hosting ○ ETL ○ Pipelines Operationalizing Data Science
  • 27. Hosting ● Delivering continuous analyses requires operational infrastructure ○ Database(s) ○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc..) ○ REST services / microservices ● These all have uptime requirements. You need to involve your (dev)ops team earlier rather than later. ● Microservices / REST endpoints have architectural implications ● Visualization tools ○ Local (e.g. Jupyter, Zeppelin) ○ On-premise (Arcadia Data, Tableau, Qlik) ○ Hosted (Chartio) ● Visualization tools often require a SQL interface, thus….
  • 28. ETL - Extract, Transform, Load ● Often used to herd data into some kind of data warehouse (e.g. RDBMS + star schema, Hadoop w/ unstructured data, etc..) ● Not just for data warehousing ● Not just for modeling ● No general solution ● Tooling ○ Apache Spark, Apache Sqoop ○ Commercial Tools: Informatica, Vertica, SQL Server, DataVirtuality etc… ● And then there is Apache Kafka…and the “NoETL” movement ○ Book: “I <3 Logs” - by Jay kreps ○ Replay history from the beginning of time as needed
  • 29. ETL - Extract, Transform, Load - Example ● Not just for production runs ○ For example, Patrick does a lot of time-to-event analysis on email opens, transactions, visits. ■ Survival functions, etc... ○ Setup ETL that builds tables With the right shape to put right into models
  • 30. Pipelines ● From data to model output ● Define dependencies and define DAG for the work ○ Steps defined by assigning input as output of prior steps ○ Luigi (http://luigi.readthedocs.io/en/stable/index.html) ○ Drake (https://github.com/Factual/drake) ○ Scikit learn has its own Pipeline ■ That can be part of your bigger pipeline ● Scheduling can be trickier than you think ○ Resource contention ○ Loose dependencies ○ Cron is fine but Jenkins works really well for this! ● Don’t be afraid to create and teardown full environments as steps ○ For example, spin up and configure an EMR cluster, do stuff, tear it down* * make your VP of Infrastructure less miserable
  • 31. Pipelines - Luigi ● Written in Python. Steps implemented by subclassing Task ● Visualize your DAG ● Supports data in relational DBs, Redshift, HDFS, S3, file system ● Flexible and extensible ● Can parallelize jobs ● Workflow runs by executing last step scheduling all dependencies
  • 33. Pipelines - Drake ● JVM (written in Clojure) ● Like a Makefile but for data work ● Supports commands in Shell, Python, Ruby, Clojure
  • 34. Pipelines - More Tools ● Oozie ○ The default job orchestration engine for Hadoop. Can chain together multiple jobs to form a complete DAG. ○ Open source ● Kettle ○ Old-school, but still relevant. ○ Visual pipeline designer. Execution engine ○ Open source ● Informatica ○ Visual pipeline designer, mature toolset ○ Commercial ● Datavirtuality ○ Treats all your stores (including Google Analytics) like schemas in a single db ○ Great for microservice architectures ○ Commercial
  • 35. © Patrick Coppinger Thanks! dan@agildata.com — patrick@craftsy.com @danklynn — @patrickrm101 Shameless Plug: Tonight at Galvanize, join us at the Denver/Boulder Big Data Meetup to learn about distributed system design! (ask Dan for details)
  • 36. References ● I Heart Logs ○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382 ● Tidy Data ○ http://vita.had.co.nz/papers/tidy-data.pdf
  • 37. Additional Tools ● Scientific python stack (ipython, numpy, scipy, pandas, matplotlib…) ● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…) ● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data ● jq: fast command line tool for working with json (ie pipe cURL to jq) ● psql (if you use postgresql or Redshift)