Industry surveys [1] reveal that the number-one hassle of data scientists is cleaning data before it can be analyzed. Textbook statistical modeling copes with noisy signals, but errors of a discrete nature break standard machine-learning tools. I will discuss how to easily run machine learning on data tables with two common dirty-data problems: missing values and non-normalized entries. For both problems, I will show how to run standard machine-learning tools such as scikit-learn in the presence of such errors. The talk will be didactic and will discuss simple software solutions. It will build on the latest improvements to scikit-learn for preprocessing and missing values, and on the DirtyCat package [2] for non-normalized entries. I will also summarize theoretical analyses from recent machine-learning publications.
This talk targets data practitioners. Its goal is to help data scientists analyze data with such errors more efficiently and to understand their impact.
For missing values, I will use simple arguments and examples to outline how to obtain asymptotically good predictions [3]. Two components are key: imputation and adding an indicator of missingness. I will explain theoretical guidelines for both, and I will show how to implement these ideas in practice with scikit-learn, either as a learner or as a preprocessor.
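As a minimal sketch of this recipe with scikit-learn (mean imputation plus a missingness-indicator mask, feeding a standard estimator; the synthetic data is purely illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative data: 1000 samples, 5 features, ~20% of values missing at random
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = X[:, 0] + rng.randn(1000)
X[rng.rand(*X.shape) < 0.2] = np.nan

# Mean imputation; add_indicator=True appends one binary mask column
# per feature containing missing values, so the learner can use missingness
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
print(cross_val_score(model, X, y, cv=5).mean())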
For non-normalized categories, I will show that vectorizing their string representations gives a simple but powerful solution that plugs into standard statistical analysis tools [4].
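A minimal sketch of this idea, assuming the SimilarityEncoder API of the dirty_cat package [2] (the category strings below are illustrative only):

import numpy as np
from dirty_cat import SimilarityEncoder

# Illustrative non-normalized categories: several spellings of the same entity
X = np.array([["accountant"], ["accountant II"], ["acountant"],
              ["police officer"], ["police offcer"]])

# SimilarityEncoder replaces one-hot encoding with continuous string
# similarities, so close spellings get close vector representations
enc = SimilarityEncoder()
X_enc = enc.fit_transform(X)
print(X_enc.shape)  # (n_samples, one column per distinct category seen in fit)

The resulting matrix can be fed to any scikit-learn estimator, in place of a one-hot encoding.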
[1] Kaggle, "The State of ML and Data Science 2017". https://www.kaggle.com/surveys/2017
[2] DirtyCat. https://dirty-cat.github.io/stable/
[3] Julie Josse, Nicolas Prost, Erwan Scornet, and Gaël Varoquaux (2019). "On the consistency of supervised learning with missing values". https://arxiv.org/abs/1902.06931
[4] Patricio Cerda, Gaël Varoquaux, and Balázs Kégl (2018). "Similarity encoding for learning with dirty categorical variables". Machine Learning 107(8-10): 1477. https://arxiv.org/abs/1806.00979