Dirty data? Clean it up! - Datapalooza Denver 2016


Dan Lynn (AgilData) & Patrick Russell (Craftsy) present on how to do data science in the real world. We discuss data cleansing, ETL, pipelines, hosting, and share several tools used in the industry.


  1. Dirty Data? Clean it up! Or, how to do data science in the real world.
     Dan Lynn, CEO, AgilData, @danklynn, dan@agildata.com
     Patrick Russell, Director, Data Science, Craftsy, @patrickrm101, patrick@craftsy.com
  2. © Phil Mislinski - www.pmimage.com
     Patrick Russell - Bass. Director, Data Science, Craftsy
     Dan Lynn - Guitar. CEO, AgilData
  3. © Phil Mislinski - www.pmimage.com
     www.craftsy.com: Learn It. Make it. Explore expert-led video classes and shop the best yarn, fabric and supplies for quilting, sewing, knitting, cake decorating & more.
  4. © Phil Mislinski - www.pmimage.com
     EXPERT SOLUTIONS AND SERVICES FOR COMPLEX DATA PROBLEMS
     At AgilData, we help you get the most out of your data. We provide software and services to help firms deliver on the promise of Big Data and complex data infrastructures:
     ● AgilData Scalable Cluster for MySQL – massively scalable and performant MySQL databases combined with 24×7 remote managed services for DBA/DevOps
     ● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data pipeline orchestration, ETL, APIs and custom applications.
     www.agildata.com
  5. Hey, you’re a data scientist, right? Great! We have millions of users. How can we use email to monetize our user base better? — Marketing
  6. 1 / (1 + exp(-x))
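(The slide's answer to Marketing is the logistic, or sigmoid, function. As a quick illustration that is not part of the original deck, a minimal NumPy version:)

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # [0.119... 0.5 0.880...]
```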
  7. https://www.etsy.com/shop/NausicaaDistribution
  8. Source: https://www.oreilly.com/ideas/2015-data-science-salary-survey
  9. Data Cleansing
     http://www.lavante.com/the-hub/ap-industry/lavante-and-spend-matters-look-at-how-dirty-vendor-data-impacts-your-bottom-line/
  10. Data Cleansing
      ● Dates & Times
      ● Numbers & Strings
      ● Addresses
      ● Clickstream Data
      ● Handling missing data
      ● Tidy Data
  11. Dates & Times
      ● Timestamps can mean different things
        ○ ingested_date, event_timestamp
      ● Clocks can’t be trusted
        ○ Server time: which server? Is it synchronized?
        ○ Client time? Is there a synchronizing time scheme?
      ● Timezones (see the sketch below)
        ○ What timezone is your own data in?
        ○ Your email provider? Your AdWords account? Your Google Analytics?
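To make the timezone point concrete, here is a minimal sketch, not from the slides, of normalizing a naive timestamp to UTC with the Python standard library; the raw value and the Denver zone are illustrative assumptions:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# A "naive" timestamp logged by a server in Denver; nothing in the value
# itself says which timezone it was recorded in.
raw = "2016-02-11 09:30:00"
local = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(
    tzinfo=ZoneInfo("America/Denver")
)

# Store and compare everything in UTC so different clock sources line up.
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2016-02-11T16:30:00+00:00
```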
  12. Numbers & Strings
      ● Use the right types for your numbers (int, bigint, float, numeric, etc.)
      ● Murphy’s Law of text inputs: if a user can put something in a text field, anything and everything will happen.
      ● Watch out for floating point precision mistakes (see the sketch below)
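A minimal illustration of the floating point caveat, not from the slides: binary floats cannot represent many decimal fractions exactly, so money-like values are safer as Decimal (or integer cents):

```python
from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                 # False
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```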
  13. Addresses
      ● Parsing / validation is not something you want to do yourself
        ○ USPS has validation and zip lookup for US addresses: https://www.usps.com/business/web-tools-apis/documentation-updates.htm
      ● Remember zip codes are strings, and the rest of the world does not use U.S. zips (see the sketch below)
      ● IP geolocation: get lat/long, state, city, postal & ISP from visitor IPs
        ○ https://www.maxmind.com/en/geoip2-city
        ○ This is ALWAYS approximate
      ● If working with GIS, we recommend http://postgis.net/
        ○ Vanilla Postgres also has earthdistance for great circle distance
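As a quick, hedged illustration of why zip codes should stay strings (the tiny CSV here is made up): letting pandas infer an integer type silently drops leading zeros.

```python
import io
import pandas as pd

csv = io.StringIO("customer_id,zip\n1,02134\n2,80202\n")

# Default type inference turns zip into an integer and drops the leading zero.
print(pd.read_csv(csv)["zip"].tolist())                       # [2134, 80202]

csv.seek(0)
# Forcing a string dtype preserves the code exactly as entered.
print(pd.read_csv(csv, dtype={"zip": str})["zip"].tolist())   # ['02134', '80202']
```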
  14. Clickstream Data
      ● User agent => device: don’t do this yourself (we use WURFL and Google Analytics)
      ● Query strings follow the rules of text. Everything will show up
        ○ They might be truncated
        ○ URL encoding might be missing characters (%2 instead of %20)
        ○ Use a library to parse params (e.g. Python ships with urlparse.parse_qs); see the sketch below
      ● If your system creates sessions (Tomcat, Google Analytics), don’t be afraid to create your own sessions on top of the pageview data
        ○ You’ll capture cross-channel and cross-device behavior this way
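A minimal sketch of library-based query string parsing. The slide names Python 2's urlparse.parse_qs; in Python 3 the same function lives in urllib.parse, and the URL below is a made-up example:

```python
from urllib.parse import urlparse, parse_qs

url = ("https://www.example.com/landing"
       "?utm_source=email&utm_campaign=spring%20sale&sku=123&sku=456")

params = parse_qs(urlparse(url).query)
print(params["utm_campaign"])  # ['spring sale']  (URL decoding handled for you)
print(params["sku"])           # ['123', '456']   (repeated keys come back as lists)
```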
  15. Clickstream Data
  16. Missing / empty data
      ● Easy to overlook but important
      ● What does missing data mean in the context of your analysis?
        ○ Not collected (why not?)
        ○ Error state
        ○ N/A or undefined
        ○ Especially for histograms, missing data can lead to very poor conclusions.
      ● Does your data use sentinel values? (e.g. -9999 or "null"; see the sketch below)
        ○ df['nps_score'].replace(-9999, np.nan)
      ● Imputation
      ● Storage
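Expanding the slide's one-liner into a runnable sketch (the nps_score column and the -9999 sentinel are the slide's own illustration; the data is synthetic):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nps_score": [9, -9999, 7, -9999, 10]})

# Sentinel values poison means, histograms, and model fits;
# convert them to real missing values before any analysis.
df["nps_score"] = df["nps_score"].replace(-9999, np.nan)

print(df["nps_score"].mean())        # 8.67 (NaN ignored); with sentinels it would be -3994.4
print(df["nps_score"].isna().sum())  # 2 missing observations
```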
  17. Tidy Data
      ● Conceptual framework for structuring data for analysis and fitting
        ○ Each variable forms a column
        ○ Each observation forms a row
        ○ Each type of observational unit forms a table
      ● Essentially the normal forms of relational databases, applied to statistics
      ● "Tidy" can look different depending on the question asked
      ● R (dplyr, tidyr) and Python (pandas) have functions for making your long data wide and your wide data long (stack, unstack, melt, pivot)
      ● Paper: http://vita.had.co.nz/papers/tidy-data.pdf
      ● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
  18. Tidy Data
      ● An example might be marketplace transaction data with 1 row per transaction
      ● You might instead want to do analysis on participants, with 1 row per participant (see the sketch below)
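A hedged sketch of that reshaping idea in pandas; the column names (buyer_id, seller_id, amount) are assumptions for illustration. Going from one row per transaction to one row per participant is a melt followed by an aggregation:

```python
import pandas as pd

transactions = pd.DataFrame({
    "txn_id":    [1, 2, 3],
    "buyer_id":  ["u1", "u2", "u1"],
    "seller_id": ["u3", "u3", "u2"],
    "amount":    [20.0, 35.0, 15.0],
})

# Wide -> long: one row per (transaction, role) pair.
long = transactions.melt(
    id_vars=["txn_id", "amount"],
    value_vars=["buyer_id", "seller_id"],
    var_name="role",
    value_name="user_id",
)

# One row per participant: the tidy unit for a per-user analysis.
per_user = long.groupby("user_id")["amount"].agg(["count", "sum"])
print(per_user)
```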
  19. Hey, that’s a great model. How can we build it into our decision-making process? — Marketing
  20. Operationalizing Data Science
  21. Operationalizing Data Science
      ● Doing an analysis once rarely delivers lasting value.
      ● The business needs continuous insight, so you need to get this stuff into production.
        ○ Hosting
        ○ ETL
        ○ Pipelines
  22. Hosting
      ● Delivering continuous analyses requires operational infrastructure
        ○ Database(s)
        ○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc.)
        ○ REST services / microservices
      ● These all have uptime requirements. You need to involve your (dev)ops team earlier rather than later.
      ● Microservices / REST endpoints have architectural implications
      ● Visualization tools
        ○ Local (e.g. Jupyter, Zeppelin)
        ○ On-premise (Arcadia Data, Tableau, Qlik)
        ○ Hosted (Chartio)
      ● Visualization tools often require a SQL interface, thus…
  23. ETL - Extract, Transform, Load
      ● Often used to herd data into some kind of data warehouse (e.g. RDBMS + star schema, Hadoop with unstructured data, etc.)
      ● Not just for data warehousing
      ● Not just for modeling
      ● No general solution
      ● Tooling
        ○ Apache Spark, Apache Sqoop
        ○ Commercial tools: Informatica, Vertica, SQL Server, DataVirtuality, etc.
      ● And then there is Apache Kafka… and the "NoETL" movement
        ○ Book: "I <3 Logs" by Jay Kreps
        ○ Replay history from the beginning of time as needed
  24. ETL - Extract, Transform, Load - Example
      ● Not just for production runs
        ○ For example, Patrick does a lot of time-to-event analysis on email opens, transactions, visits.
          ■ Survival functions, etc.
        ○ Set up ETL that builds tables with the right shape to drop straight into models (see the sketch below)
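A hedged illustration of "tables with the right shape"; the event names and columns are assumptions, not Craftsy's actual schema. The idea is to turn per-user event dates into one row per user with a duration and an observed flag, which is the shape most survival-analysis tooling expects:

```python
import pandas as pd

# Raw per-user event dates, e.g. the output of an upstream ETL join.
users = pd.DataFrame({
    "user_id":          ["u1", "u2", "u3"],
    "first_email_open": pd.to_datetime(["2016-01-02", "2016-01-05", "2016-01-07"]),
    "first_purchase":   pd.to_datetime(["2016-01-09", pd.NaT, "2016-01-08"]),
})

cutoff = pd.Timestamp("2016-02-01")  # end of the observation window

# One row per user: time from first email open to purchase (or to the
# cutoff if the user never purchased), plus a censoring indicator.
users["observed"] = users["first_purchase"].notna()
end = users["first_purchase"].fillna(cutoff)
users["duration_days"] = (end - users["first_email_open"]).dt.days

print(users[["user_id", "duration_days", "observed"]])
```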
  25. Pipelines
      ● From data to model output
      ● Define dependencies and define a DAG for the work
        ○ Steps are defined by assigning their input as the output of prior steps
        ○ Luigi (http://luigi.readthedocs.io/en/stable/index.html)
        ○ Drake (https://github.com/Factual/drake)
        ○ Scikit-learn has its own Pipeline (see the sketch below)
          ■ That can be part of your bigger pipeline
      ● Scheduling can be trickier than you think
        ○ Resource contention
        ○ Loose dependencies
        ○ Cron is fine, but Jenkins works really well for this!
      ● Don’t be afraid to create and tear down full environments as steps
        ○ For example, spin up and configure an EMR cluster, do stuff, tear it down*
      * make your VP of Infrastructure less miserable
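Since the slide mentions scikit-learn's Pipeline as a building block inside a larger workflow, here is a minimal sketch; the feature data is synthetic and the step names are arbitrary:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chaining preprocessing and a model keeps the whole fit/predict step
# reproducible and easy to call from a Luigi or Drake task.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

X = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
y = [0, 0, 1, 1]

pipe.fit(X, y)
print(pipe.predict([[2.5, 2.5]]))
```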
  26. Pipelines - Luigi
      ● Written in Python. Steps are implemented by subclassing Task
      ● Visualize your DAG
      ● Supports data in relational DBs, Redshift, HDFS, S3, the file system
      ● Flexible and extensible
      ● Can parallelize jobs
      ● A workflow runs by executing the last step, which schedules all of its dependencies (see the sketch below)
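A hedged two-task sketch of the subclass-Task pattern; the file names and contents are made up, and a real job would point at HDFS, S3, or a database target instead of local files:

```python
import luigi


class ExtractEvents(luigi.Task):
    """Pretend extract step: writes raw events to a local file."""

    def output(self):
        return luigi.LocalTarget("events_raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("user_id,amount\nu1,20\nu2,35\n")


class SummarizeEvents(luigi.Task):
    """Depends on ExtractEvents; Luigi runs the upstream task first."""

    def requires(self):
        return ExtractEvents()

    def output(self):
        return luigi.LocalTarget("events_summary.txt")

    def run(self):
        with self.input().open() as f:
            rows = f.read().splitlines()[1:]  # skip the header row
        total = sum(float(r.split(",")[1]) for r in rows)
        with self.output().open("w") as f:
            f.write(f"total={total}\n")


if __name__ == "__main__":
    # Executing the last step schedules its whole dependency chain.
    luigi.build([SummarizeEvents()], local_scheduler=True)
```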
  27. Pipelines - Luigi
  28. Pipelines - Drake
      ● JVM-based (written in Clojure)
      ● Like a Makefile, but for data work
      ● Supports commands in shell, Python, Ruby, Clojure
  29. Pipelines - More Tools
      ● Oozie
        ○ The default job orchestration engine for Hadoop. Can chain together multiple jobs to form a complete DAG.
        ○ Open source
      ● Kettle
        ○ Old-school, but still relevant.
        ○ Visual pipeline designer and execution engine
        ○ Open source
      ● Informatica
        ○ Visual pipeline designer, mature toolset
        ○ Commercial
      ● DataVirtuality
        ○ Treats all your stores (including Google Analytics) like schemas in a single database
        ○ Great for microservice architectures
        ○ Commercial
  30. © Patrick Coppinger
      Thanks! dan@agildata.com — patrick@craftsy.com
      @danklynn — @patrickrm101
      Shameless plug: Tonight at Galvanize, join us at the Denver/Boulder Big Data Meetup to learn about distributed system design! (Ask Dan for details.)
  31. References
      ● I Heart Logs
        ○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
      ● Tidy Data
        ○ http://vita.had.co.nz/papers/tidy-data.pdf
  32. Additional Tools
      ● Scientific Python stack (IPython, NumPy, SciPy, pandas, matplotlib…)
      ● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…)
      ● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data
      ● jq: fast command line tool for working with JSON (e.g. pipe cURL output to jq)
      ● psql (if you use PostgreSQL or Redshift)
