Data has shape, and shape has meaning - Josep Curto
1. Data has shape, and shape has meaning
Josep Curto
CEO, Delfos Research | Academic Director, Master's in Big Data and BI, UOC
@josepcurto, 2018
2. About me
• CEO, Delfos Research
• Academic Director, Master's in Big Data and BI, UOC
• Advisor, Institute of Passion
• Author of multiple articles and books
17. If TDA is so fantastic, why aren't we using it?
18. Original data: [100,480,507:3] (300 million elements)
Formatted data: [17,770:480,189] (8.5 billion elements)
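The element counts on this slide follow directly from the two shapes. The short check below reproduces them; the fill ratio at the end is only an illustration and assumes that each original row lands in exactly one cell of the formatted matrix, which the slide does not state.

```python
# Shapes as written on the slide: [rows:columns].
original_rows, original_cols = 100_480_507, 3
formatted_rows, formatted_cols = 17_770, 480_189

original_elements = original_rows * original_cols
formatted_elements = formatted_rows * formatted_cols

# Assumption (not stated on the slide): every original row fills exactly
# one cell of the formatted matrix, so the rest of the cells are empty.
fill_ratio = original_rows / formatted_elements

print(f"original:  {original_elements:,} elements")
print(f"formatted: {formatted_elements:,} elements")
print(f"growth:    {formatted_elements / original_elements:.1f}x")
print(f"filled:    {fill_ratio:.2%}")
```

Under that assumption, almost all of the formatted matrix is empty, which is one reason the reshaped data is so much harder to process than the original.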
19. Any ideas? Divide and conquer
• Split the dataset into buckets by range of movie_ids
• Pivot each data bucket (rows: movies, columns: users)
…
• Learn PCA coefficients on a random subset
• Perform serial executions of PCA on each batch using the previously learned PCA vectors
• Merge the batches into the whole dataset
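The divide-and-conquer steps on this slide can be sketched on toy data as follows. This is a minimal illustration, not the author's pipeline: it assumes NumPy is available, the ratings are randomly generated, and the names `pivot`, `bucket_size`, and `n_components` are hypothetical. Missing ratings are filled with zeros, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data in long format: (movie_id, user_id, rating).
n_movies, n_users, n_components = 12, 6, 2
ratings = np.array([(m, u, rng.integers(1, 6))
                    for m in range(n_movies) for u in range(n_users)
                    if rng.random() < 0.7])

def pivot(rows, movie_lo, movie_hi):
    """Pivot long-format rows into a dense (movies x users) block."""
    block = np.zeros((movie_hi - movie_lo, n_users))
    for movie, user, rating in rows:
        block[movie - movie_lo, user] = rating
    return block

# Steps 1 and 2: split by range of movie_ids, then pivot each bucket.
bucket_size = 4
buckets = []
for lo in range(0, n_movies, bucket_size):
    hi = min(lo + bucket_size, n_movies)
    in_range = ratings[(ratings[:, 0] >= lo) & (ratings[:, 0] < hi)]
    buckets.append(pivot(in_range, lo, hi))

# Step 3: learn PCA coefficients on a random subset of rows (two per bucket).
sample = np.vstack([b[rng.choice(len(b), size=2, replace=False)] for b in buckets])
mean = sample.mean(axis=0)
_, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
components = vt[:n_components]  # shape: (n_components, n_users)

# Steps 4 and 5: project each batch with the learned vectors, then merge.
reduced = np.vstack([(b - mean) @ components.T for b in buckets])
print(reduced.shape)
```

Only one bucket needs to be in memory during the projection step, which is the point of learning the PCA vectors once on a small sample and reusing them.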
Predictive Analytics approach: Fit predictive models to the data. But the complexity of the data means hypothesis testing is often challenging. We need to know what questions to ask. Are we asking the right questions? With big data, insights can be slow.
Conventional approaches for reduction and visualization: Use linear and nonlinear dimension reduction techniques such as PCA, MCA, and MDS. But even when they work, they are sensitive to the choice of distance metric and do not preserve the topological structure of the data.
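A small example may help with the claim about topological structure. The sketch below, which assumes NumPy is available and uses a made-up ellipse as data, projects a closed loop onto its first principal component. Two points on opposite sides of the loop end up at the same 1-D coordinate, so the loop itself is no longer visible after the reduction.

```python
import numpy as np

# Hypothetical data: 100 points on an ellipse, i.e. one closed loop in 2-D.
t = 2 * np.pi * np.arange(100) / 100
points = np.column_stack([2 * np.cos(t), np.sin(t)])

# PCA via eigen-decomposition of the covariance matrix.
centered = points - points.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
first_pc = eigvecs[:, -1]          # eigh returns eigenvalues in ascending order
projected = centered @ first_pc    # 1-D coordinates along the first component

# Two points on opposite sides of the loop (t = pi/2 and t = 3*pi/2).
a, b = 25, 75
dist_2d = np.linalg.norm(points[a] - points[b])
dist_1d = abs(projected[a] - projected[b])
print(f"distance in the plane: {dist_2d:.2f}, after PCA to 1-D: {dist_1d:.2f}")
```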
Data-Driven Discovery Approach: Hypothesis-free approach based on computational topology to qualitatively analyze functions on very high-dimensional data and visualize the data structure in low-dimensional topological spaces. Topological data analysis (TDA) reveals structures in the data that have invariant properties and can propel insight and improve hypothesis generation and predictive modeling; “digital serendipity” (Singh 2013).
TDA draws on the theory of topological spaces and simplicial complexes (algebraic topology); its implementation relies on computational topology (computational geometry, computational complexity theory, and computer science). See, e.g., Carlsson 2009; Lum et al. 2012; Singh, Mémoli and Carlsson 2007.
TDA applies a function (lens) to a data set and builds a compressed summary of the data.
A visual network of nodes (each representing a group of similar data points) connected by edges is created using four types of parameters:
• Metric (measure of similarity)
• Lenses (functions on the data)
• Bin resolution
• Bin overlap
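How the four parameters interact can be sketched in plain Python. This is a simplified illustration in the spirit of the construction described by Singh, Mémoli and Carlsson, not a reference implementation: the function names are hypothetical, and the clustering step (linking points closer than a fixed `threshold`) is an assumption chosen for brevity.

```python
import math
from itertools import combinations

def cover(lo, hi, resolution, overlap):
    """Split [lo, hi] into `resolution` bins, each widened by `overlap` (a fraction of the bin width)."""
    width = (hi - lo) / resolution
    pad = width * overlap / 2
    return [(lo + i * width - pad, lo + (i + 1) * width + pad) for i in range(resolution)]

def clusters(indices, data, metric, threshold):
    """Group points into connected components, linking points closer than `threshold`."""
    remaining, groups = set(indices), []
    while remaining:
        stack = [remaining.pop()]
        group = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in remaining if metric(data[i], data[j]) <= threshold}
            remaining -= near
            group |= near
            stack.extend(near)
        groups.append(frozenset(group))
    return groups

def mapper(data, lens, metric, resolution, overlap, threshold):
    """Return (nodes, edges): nodes are clusters of point indices, edges join clusters that share points."""
    values = [lens(p) for p in data]
    nodes = []
    for lo, hi in cover(min(values), max(values), resolution, overlap):
        in_bin = [i for i, v in enumerate(values) if lo <= v <= hi]
        nodes.extend(clusters(in_bin, data, metric, threshold))
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2) if nodes[a] & nodes[b]]
    return nodes, edges

# Toy data: 40 points on a circle; the lens is the x coordinate.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40)) for k in range(40)]
nodes, edges = mapper(circle, lens=lambda p: p[0], metric=math.dist,
                      resolution=3, overlap=0.5, threshold=0.4)
print(len(nodes), "nodes,", len(edges), "edges")
```

On this toy circle the result is four nodes joined in a cycle, so the compressed network keeps the loop that the 40 original points trace out. Changing the resolution or the overlap changes how many nodes and edges appear.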
Metrics: correlation, Euclidean distance, cosine, Hamming, categorical cosine, user-defined...
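To show why the choice of metric matters, here is a minimal sketch of four of the listed metrics using their textbook definitions, written in plain Python. The vectors are made up for illustration, and the categorical cosine metric is left out because its definition is not given here.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """1 - cosine similarity: depends on direction only, not on length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norm

def correlation_distance(p, q):
    """1 - Pearson correlation: cosine distance after centering each vector."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    return cosine_distance([a - mp for a in p], [b - mq for b in q])

def hamming(p, q):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

# Hypothetical vectors: v is u scaled by 2.
u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print("euclidean:  ", round(euclidean(u, v), 4))
print("cosine:     ", round(cosine_distance(u, v), 4))
print("correlation:", round(correlation_distance(u, v), 4))
print("hamming:    ", round(hamming("karolin", "kathrin"), 4))
```

The two vectors are far apart under the Euclidean metric but essentially identical under the cosine and correlation metrics, so the same data can produce quite different networks depending on the metric selected.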
Functions: mean, variance, density, centrality, PCA, MDS, user-defined...
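A lens assigns one number to every data point. The sketch below gives one plausible reading of four of the listed functions in plain Python; the exact definitions used by any particular TDA tool may differ, and the function names, the Gaussian kernel, and the `bandwidth` parameter are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

def row_mean(data):
    """Mean of each point's coordinates."""
    return [mean(row) for row in data]

def row_variance(data):
    """Population variance of each point's coordinates."""
    return [pvariance(row) for row in data]

def gaussian_density(data, bandwidth=1.0):
    """Unnormalised density estimate: sum of Gaussian kernels over all points."""
    return [sum(math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in data)
            for p in data]

def linf_centrality(data):
    """Distance from each point to the point farthest from it; small values are central."""
    return [max(math.dist(p, q) for q in data) for p in data]

# Hypothetical data: three nearby points and one outlier.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
density = gaussian_density(data)
centrality = linf_centrality(data)
for point, d, c in zip(data, density, centrality):
    print(point, round(d, 3), round(c, 3))
```

The outlier gets the lowest density value, so a density lens would place it in a different bin from the three clustered points.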
Supervised Machine Learning Models: Applying TDA to machine learning outputs together with the outcome variables can enhance models by exposing systematic error and by supporting the construction of local models rather than a single global model.
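The idea of systematic error and local models can be illustrated with a toy regression in plain Python. Everything here is hypothetical: the data is made up, and the group labels "A" and "B" stand in for regions that a topological network might expose. The sketch does not perform any TDA itself.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean_residual(model, xs, ys):
    """Signed mean error; a value far from zero signals systematic error."""
    slope, intercept = model
    return mean(y - (slope * x + intercept) for x, y in zip(xs, ys))

# Hypothetical data: two groups that follow different linear trends.
groups = {
    "A": ([0, 1, 2, 3, 4], [0, 2, 4, 6, 8]),    # y = 2x
    "B": ([0, 1, 2, 3, 4], [10, 9, 8, 7, 6]),   # y = 10 - x
}
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]

global_model = fit_line(all_x, all_y)
local_models = {name: fit_line(xs, ys) for name, (xs, ys) in groups.items()}

for name, (xs, ys) in groups.items():
    print(name,
          "global bias:", round(mean_residual(global_model, xs, ys), 3),
          "local bias:", round(mean_residual(local_models[name], xs, ys), 3))
```

The single global model over-predicts one group and under-predicts the other by the same amount, while a separate model per group removes that bias.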