Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.
Próximo SlideShare
Helping Computers Find Meaning they Lost in Translation
Siguiente
Descargar para leer sin conexión y ver en pantalla completa.

0

Compartir

IP Clustering as Exploratory Data Analysis

Descargar para leer sin conexión

Class project presentation; MATH331 Optimization, Fall '09

Libros relacionados

Gratis con una prueba de 30 días de Scribd

Ver todo
  • Sé el primero en recomendar esto

IP Clustering as Exploratory Data Analysis

  1. 1. Clustering Using Integer Programming Katie Kuksenok Michael Brooks
  2. 2. What are we doing? • Investigating IP formulations of the clustering problem • Developing a web application for clustering • Applying clustering to several datasets • Visualizing resulting clusters
  3. 3. The Clustering Problem • Input: ▫ N entities ▫ Each entity has a value for each of M features • Goal: ▫ Partition the entities “optimally” • Issues with clustering: ▫ Definition of “optimal” ▫ Different feature types don’t work well together
  4. 4. The Clustering Problem Exposes high-level properties of data Important for exploratory data analysis in data mining Finding the optimal clustering is NP-Complete
  5. 5. Grotschel and Wakabayashi IP • Recall the Grotschel-Wakabayashi formulation (clustering wild cats): • Variables: xij = 1 if i and j are in the same cluster • Data: wij = disagreements(i,j) – agreements(i,j) • Minimize: SUM { wij xij : over all i, j }
  6. 6. Formulation: Constraints • Constraints for reflexivity, symmetry, transitivity reduce to the following constraints:
  7. 7. Exploratory Clustering Application • Currently in development • User provides data to website • Web server processes data ▫ Produces OPL code ▫ Runs the model in LPSolve (another MIP solver) • Results are returned to the user, with visuals
  8. 8. Preliminary Exploration • With working parts of application: • can cluster and visualize datasets • verified Grotschel/Wakabayashi clustering results • used clustering to explore senate voting records
  9. 9. Data Sets (Cats) • Cluster 30 varieties of wild cats (from paper) • Cats have 14 features • Reproduced results from paper • Model has 12180 constraints and 435 variables • Runs in less than 1 second • No branching
  10. 10. Data Sets (Senate Voting Records) • 100 senators • Senate records for 8 passed bills from this year • Model has 485100 constraints and 4950 variables • Runs in 1 minute • No branching
  11. 11. Data Sets (Senate Voting By State) • Grouped senate voting by state • If senators for a state disagreed, ‘Maybe’ • Model has 58800 constraints and 1225 variables • Runs in 1.5 seconds • No branching
  12. 12. What now? • The Application ▫ We’d like to put it all together • Branch and Bound ▫ No real-life datasets so far have required branching ▫ We have constructed datasets that did ▫ Can we determine under what conditions branching will be required?
  13. 13. This is a guepard (cheetah in French?): …any other questions?
  14. 14. Approaches to Clustering • What makes a good clustering algorithm? ▫ Runtime and space efficiency (when N is large) Many algorithms ▫ Ability to quickly incorporate new data ▫ Ability to deal with many-dimensional IP data (when M is large) ▫ Optimality (closeness to)

Class project presentation; MATH331 Optimization, Fall '09

Vistas

Total de vistas

1.040

En Slideshare

0

De embebidos

0

Número de embebidos

2

Acciones

Descargas

8

Compartidos

0

Comentarios

0

Me gusta

0

×