Data Scientist Toolbox

  1. Data Scientist Toolbox: Andrei Savu (Axemblr.com), BigData.ro 2013
  2. Me • Founder of Axemblr.com • Organizer of Bucharest JUG (bjug.ro) • Passion for DevOps, Data Analysis • Connect with me on LinkedIn
  3. @ Axemblr • Service Deployment Orchestration • Infrastructure Automation (DevOps) • Apache Hadoop On-Demand Appliance • Axemblr Provisionr: https://github.com/axemblr/axemblr-provisionr
  4. (Big)Data in a nutshell • Business Intelligence / Research Evolved • Significant change in Decision Making • Enables new Products & Features • Enables new Business Models
  5. Data Scientist • Has a Business / Research oriented perspective • Knowledge of statistics & software engineering (AI, infrastructure) • Ability to explore questions and formulate hypotheses to be tested
  6. Data Science Project • Focused on particular business goals • Based on a set of important questions • Result: answers that support business decisions
  7. The Algorithm • Find *Important* Questions • Identify & Extract Data • Store & Sample • Analyse • Visualization • Create Pipelines • Automate & Deploy • Learn & Repeat!
  8. Start w/ “Big” Questions ... answer them with (Big)Data • How can we understand & improve the conversion rate? • How can we increase customer satisfaction? • How can we find important mentions in social media?
  9. Identify Data Sources OR add more probes / sensors as needed: Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
  10. Extract Data ... to a medium that allows you to run arbitrary queries: Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
  11. Extract • Database dump tool, replicas or backups • External web services • Apache Sqoop (SQL-to-Hadoop) • Implement pipelines / real-time streams • Write custom tools as needed
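As a rough illustration of the "write custom tools as needed" bullet, the sketch below dumps one database table to CSV so it can be queried elsewhere. It assumes an in-memory SQLite database; the `orders` table and the `export_table` helper are hypothetical names made up for this example, not tools from the talk.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of `table` to `out` as CSV, header row first.

    `table` is interpolated into the SQL text, so it must come from
    trusted code, never from user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow(col[0] for col in cursor.description)
    writer.writerows(cursor)


# Hypothetical source data standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

buffer = io.StringIO()
export_table(conn, "orders", buffer)
csv_text = buffer.getvalue()
```

For real volumes, a dedicated dump tool or Apache Sqoop, as listed on the slide, is likely the better choice; a script like this only suits small tables.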
  12. Curate: Unfortunately, Data is Messy
  13. Curate - Your Way • Use or develop tools / scripts • On large volumes there are no obvious choices • Custom ways of filtering & aggregating large streams (e.g. Twitter, sensors) • Reuse existing software components for data curation / validation
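One possible shape for a custom filtering and aggregation script is sketched below. It assumes a made-up `timestamp,user,event` line format; `parse_line`, `curate`, and `count_events` are illustrative names. Generators keep memory use flat, which matters when the input is a large stream.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional


def parse_line(line: str) -> Optional[dict]:
    """Parse 'timestamp,user,event'; return None for malformed rows."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3 or not all(parts):
        return None
    timestamp, user, event = parts
    # Normalise case so 'CLICK' and 'click' aggregate together.
    return {"timestamp": timestamp, "user": user.lower(), "event": event.lower()}


def curate(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield only the rows that parse cleanly."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield record


def count_events(records: Iterable[dict]) -> Counter:
    """Aggregate the curated stream into per-event counts."""
    return Counter(r["event"] for r in records)


# Hypothetical messy input: one junk line and one row with a missing user.
raw = [
    "2013-04-01T10:00:00,Alice,click",
    "garbage line",
    "2013-04-01T10:00:05,bob,CLICK",
    "2013-04-01T10:00:09,,purchase",
    "2013-04-01T10:01:00,alice,purchase",
]
event_counts = count_events(curate(raw))
```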
  14. DataWrangler: Interactive System for Data cleaning and transformation http://vis.stanford.edu/wrangler/
  15. OpenRefine (formerly Google Refine) https://github.com/OpenRefine/OpenRefine
  16. Sample (by time, etc.): As needed to support interactive exploration
  17. Why Sample? • Interactive exploration to create and check assumptions, to create algorithms • Be careful with “Statistical Significance” • Sample Smart: By time, By location etc.
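"Sample by time" can be read in several ways. The sketch below takes one reading: keep a fixed number of random records from each hour, so quiet hours are not drowned out by busy ones. The record layout and the `sample_by_hour` helper are assumptions for illustration.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_by_hour(records, per_hour, seed=0):
    """Keep at most `per_hour` randomly chosen records from each hour bucket."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    buckets = defaultdict(list)
    for record in records:
        ts = datetime.fromisoformat(record["timestamp"])
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(record)
    sample = []
    for hour in sorted(buckets):
        rows = buckets[hour]
        sample.extend(rng.sample(rows, min(per_hour, len(rows))))
    return sample


# Hypothetical data: one record every five minutes across three hours.
records = [
    {"timestamp": f"2013-04-01T{hour:02d}:{minute:02d}:00", "value": minute}
    for hour in (9, 10, 11)
    for minute in range(0, 60, 5)
]
sample = sample_by_hour(records, per_hour=3)
```

The same bucketing idea carries over to location or any other key: only the function that maps a record to its bucket changes.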
  18. Analyse Sample: This is where the fun begins
  19. Analyse Sample • Create models • Create algorithms • Check hypotheses • Faster feedback loops & Immediate Gratification
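To make "check hypotheses" concrete, here is a minimal sketch of a permutation test on a sample, applied to the conversion-rate question raised earlier. The talk names no particular test, so this technique is only one option; the data is invented, and `permutation_test` is a hypothetical helper.

```python
import random


def permutation_test(a, b, rounds=2000, seed=0):
    """Estimate how often a random relabelling of the pooled outcomes
    gives a difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    return hits / rounds


# Invented sample: 1 = visitor converted, 0 = visitor did not.
control = [1] * 10 + [0] * 90   # 10% conversion
variant = [1] * 30 + [0] * 70   # 30% conversion
p_value = permutation_test(control, variant)
```

A small `p_value` suggests the gap is unlikely to be a relabelling accident, but the earlier caution about statistical significance still applies: a biased sample gives a misleading answer whatever the test says.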
  20. Excel-like
  21. Python
  22. RStudio
  23. Gephi.org
  24. Analyse All: apply your results to the entire data set
  25. How to Analyse All? • “Easy” on a single machine • Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc. • Key: Leverage existing tools • Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
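The step from a single machine to a custom MapReduce job is easier to see with a small sketch: a per-chunk map step followed by an associative merge. Everything below runs in one process; the `path status` log format and the names `map_chunk` and `merge` are assumptions for illustration, not Hadoop APIs.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count status codes within one chunk of log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1  # second field is the (hypothetical) status
    return counts


def merge(left, right):
    """Reduce step: combine two partial counts. Because addition is
    associative, partial results can be merged in any grouping."""
    return left + right


# Each inner list stands in for a file split handled by a separate worker.
chunks = [
    ["/home 200", "/cart 200", "/missing 404"],
    ["/home 200", "/checkout 500"],
]
totals = reduce(merge, map(map_chunk, chunks), Counter())
```

Since each chunk is processed independently, the `map` call could be swapped for a worker pool or a cluster framework without changing `map_chunk` or `merge`.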
  26. Visualization: Communicate meaning w/ Graphics
  27. http://selection.datavisualization.ch/
  28. Automate & Deploy: Make it part of your internal dashboard
  29. Learn & Repeat: Answers usually generate new questions
  30. Thanks! Questions? Andrei Savu / asavu@axemblr.com / @andreisavu