First steps in Data Mining Kindergarten

Alexey Zinoviev presented this paper at the Second Thumbtack Technology Expert Day.

It covers the following topics: Data Mining, Machine Learning, Octave, and the R language.

YouTube: http://youtu.be/kGIP6XeWiaA

First steps in Data Mining Kindergarten

  1. 1. Speaker: Alexey Zinoviev First Steps in Data Mining Kindergarten
  2. 2. About ● I am a scientist. My areas of interest include graph theory, machine learning, traffic jam prediction, and Big Data algorithms. ● But I'm also a programmer, so I'm interested in NoSQL databases, Java, JavaScript, Android, MongoDB, Cassandra, Hadoop, MapReduce, metaprogramming, and reflection. ● I am a fan of various geo APIs (the Maps API, for example)
  3. 3. Data mining Mining coal in your data
  4. 4. Hey, man, predict me something!
  5. 5. Man or sofa?
  6. 6. ● Which loan applicants are high-risk? ● Which customers will respond to a planned promotion? ● How do we detect phone card fraud? ● How do customer profiles change over time? ● Which customers prefer product A over product B? ● What is the revenue prediction for next year? ● Which students are more likely to transfer than others? ● Which taxpayers may be cheating the system? ● Who is most likely to violate a probation sentence? Typical questions for DM
  7. 7. What is Data Mining?
  8. 8. Statistics?
  9. 9. Tag cloud?
  10. 10. Data visualization?
  11. 11. Not OLAP, 100%
  12. 12. 1. Selection 2. Pre-processing 3. Transformation 4. Data Mining 5. Interpretation/Evaluation Magic part of KDD (Knowledge Discovery in Databases)
  13. 13. 1. Share your data with us 2. Thumbtack’s magic manipulations 3. Get Answers Machine 4. PROFIT!!! Thumbtack’s definition
  14. 14. Data
  15. 15. ● Facebook users, tweets ● Weather ● Sea routes ● Trade transactions ● Government ● Medicine (genomic data) ● Telecommunications (phone call records) Data examples
  16. 16. ● Relational databases (transactional data with many tables) ● Data warehouses (historical data, aggregated and updated periodically) ● Files (in a special format, e.g. CSV, or proprietary binary) ● Internet or electronic mail (HTML, XML, web search results, e-mails) ● Scientific and research tools (R, Octave, Matlab) Data sources
  17. 17. Target data
  18. 18. Targeted Advertising
  19. 19. ● All your personal data (PD) is being deeply mined ● The industry of collecting, aggregating, and brokering PD is called “database marketing.” ● 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer at Acxiom Pay with your personal data
  20. 20. Preprocessing
  21. 21. ● Select small pieces ● Define default values for missing data ● Remove strange signals from the data ● Merge several tables into one if required
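The preprocessing steps on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the `cities` lookup table, the helper names, and the "10 times the median" outlier rule are all made up for the example and do not come from the talk.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": 51000},
    {"id": 4, "age": 41, "income": 9900000},  # a "strange signal"
    {"id": 5, "age": 38, "income": 50000},
]
# Hypothetical second table, keyed by record id.
cities = {1: "Omsk", 2: "Moscow", 3: "Omsk", 5: "Kazan"}


def fill_missing(rows, field):
    """Replace missing values of `field` with the median of the known ones."""
    default = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]


def drop_outliers(rows, field, factor=10):
    """Drop rows whose value exceeds `factor` times the median (a crude rule)."""
    limit = factor * median(r[field] for r in rows)
    return [r for r in rows if r[field] <= limit]


def merge(rows, lookup, field, default="unknown"):
    """Attach a column from a second table, joined on the record id."""
    return [{**r, field: lookup.get(r["id"], default)} for r in rows]


clean = merge(drop_outliers(fill_missing(raw, "age"), "income"), cities, "city")
for row in clean:
    print(row)
```

In practice the default value and the outlier rule depend heavily on the data; the median is used here only because it is robust to the spike in record 4.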
  22. 22. Pattern mining
  23. 23. ● Set of items & transactions ● Need to find rules (simple implication: if A & B then C) Association rule learning
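A rule of the form "if A & B then C" is usually judged by its support (how often A, B and C occur together) and its confidence (how often C appears when A and B do). The sketch below computes both for a made-up set of shopping baskets; the item names and helper functions are assumptions for illustration, not part of the talk.

```python
# Hypothetical transactions: each one is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base


# Rule: if bread & butter then milk
rule_support = support({"bread", "butter", "milk"}, transactions)
rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Algorithms such as Apriori search for all rules above chosen support and confidence thresholds; this sketch only scores one rule that is given in advance.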
  24. 24. It is the process of finding a model (or function) that describes and distinguishes data classes, in order to predict the class of objects whose class label is unknown. What is Cluster Analysis?
  25. 25. ● Statistical process for estimating the relationships among variables ● The estimation target is a function (it can be a probability distribution) ● Can be linear, polynomial, nonlinear, etc. Regression
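For the linear case, the function being estimated is a straight line y = slope * x + intercept, and ordinary least squares gives both coefficients in closed form. A minimal sketch, assuming a single input variable and invented sample points (the function name `fit_line` is hypothetical):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for an unseen x
print(slope, intercept, prediction)
```

Polynomial and nonlinear regression follow the same idea with a different family of functions; in R, which the talk recommends, the built-in `lm` function covers the linear case.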
  26. 26. ● Training set of classified examples (supervised learning) ● Test set of non-classified items ● Main goal: find a function (classifier) that maps input data to a category ● Computer vision, drug discovery, speech recognition, biometric identification, credit scoring Classification
  27. 27. A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Decision trees
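Each internal node of such a tree is a yes/no question on one variable. One common way to choose the question, used by CART-style learners, is to pick the threshold that minimizes the weighted Gini impurity of the two resulting groups. The sketch below does this for a single split; the ages and labels are invented for illustration and are not the Titanic data set.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, impurity) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    points = sorted(set(values))
    # Candidate thresholds are midpoints between neighboring distinct values.
    for low, high in zip(points, points[1:]):
        t = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Invented passengers: age and whether they survived (1) or not (0).
ages = [4, 8, 30, 45, 60, 25]
survived = [1, 1, 0, 0, 0, 0]

threshold, impurity = best_threshold(ages, survived)
print(f"split at age <= {threshold}, weighted Gini = {impurity}")
```

A full tree learner repeats this search over every variable and then recurses into each group until the leaves are pure enough or too small to split.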
  28. 28. ● There are two classes of objects, A & B (red & blue) ● Define the class of a new object based on information about its neighbors ● By changing the boundary of the area around the new object, we form its set of neighbors ● The new object is B because the majority of its neighbors are B kNN
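The majority vote described on this slide fits in a few lines. This is a minimal sketch, assuming two-dimensional points, Euclidean distance, and invented training data; the function name `knn_predict` is hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(train, point, k=3):
    """Label `point` by majority vote among its k nearest training examples."""
    neighbors = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Invented training set: (coordinates, class label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 6.5), "B"), ((4.5, 5.0), "B"),
]

print(knn_predict(train, (5.0, 5.2)))  # nearest neighbors are all B
print(knn_predict(train, (1.2, 1.3)))  # nearest neighbors are all A
```

The parameter `k` plays the role of the "boundary" on the slide: a larger k widens the area around the new object and brings more neighbors into the vote.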
  29. 29. Skills & Tools
  30. 30. Most popular Data Mining algorithms
  31. 31. ● R is free and R is a language ● Graphics and data visualization ● A flexible statistical analysis toolkit ● Access to powerful, cutting-edge analytics ● A robust, vibrant community ● Unlimited possibilities R
  32. 32. ● Octave is free and Octave is a language ● C++ support (you can call Octave functions from C/C++) ● Easy prototyping for Matlab ● Extensive graphics capabilities for data visualization and manipulation ● Java support Octave
  33. 33. In conclusion ● All we need is math … ● Use R & Octave ● Data is only the first step ● Dependencies are everywhere
  34. 34. Your questions?
