
Spark + Scikit Learn- Performance Tuning

915 views

Published on

These slides show how to integrate Spark with scikit-learn, two powerful tools in the big data area. When Spark does the data preprocessing and then produces the training data set for scikit-learn, performance issues arise. Here I share some tips on how to overcome them.

Published in: Data & Analytics


  1. Spark + Scikit-Learn - Performance Tuning
  2. Who am I? ● Kent (施晨揚) ● Passionate about Machine Learning & Big Data ● Father of two kids https://www.facebook.com/texib
  3. What? Key Factors Influencing Performance • Large Raw Data Size (4 Billion Records) • Large Number of Cookies (40 Million Records) • Machine Learning Library - Prediction Function Cost
  4. How? ● Spark o Parallel Computing o Scalable o Very Powerful Data Processing Tool o But its Machine Learning Library… ● Python Scikit-Learn o Very Powerful Machine Learning Library o But parts of it can only use a single core XD
  5. So We Have an Idea!
  6. Data Size - the Major Problem: Prediction Data is ~100 times larger than Training Data
  7. Use Python - workflow: Prepare Data → Train Model → Prepare Prediction Data → Do Prediction; about 30 mins vs. > 1 week
  8. Aggregation - Where Is It Slow? (50%) • Aggregating 4 Billion Rows down to 40 Million Cookies is a Very Time-Consuming Job
  9. Use mapPartitions() • Instead of using reduceByKey() with your own aggregation logic • How: • Step 1: use the DB (Redshift) to prepare the prediction data, ordered by cookie • Step 2: use mapPartitions() locally to do batch prediction
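The batch-prediction idea on this slide can be sketched with the standard library alone (no Spark needed). `fake_predict` is an illustrative stand-in for an expensive per-call model such as scikit-learn's `model.predict`, and the partition loop mimics what `mapPartitions()` would give you on Spark - all names here are assumptions, not from the deck:

```python
def fake_predict(batch):
    # Stand-in for model.predict(X): one call handles a whole batch,
    # amortizing the per-call overhead the deck identifies as the cost.
    return [x * 2 for x in batch]

def predict_per_row(rows):
    # "Atomic job": one prediction call per row (slide 12).
    return [fake_predict([r])[0] for r in rows]

def predict_per_partition(partitions):
    # "Batch job": one prediction call per partition (slide 13),
    # which is what mapPartitions() hands each worker.
    out = []
    for part in partitions:
        out.extend(fake_predict(part))
    return out

rows = [1, 2, 3, 4, 5, 6]
partitions = [rows[0:3], rows[3:6]]
# Same results either way; the batch version just makes far fewer calls.
assert predict_per_row(rows) == predict_per_partition(partitions)
```

The design point is that both paths compute the same answers; only the number of calls into the (expensive) prediction function changes.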
  10. Use ReduceByKey - input partitions: (A,1) (B,2) (A,3) | (A,3) (B,4) (C,2) | (B,2) (C,2) (D,1) | (C,1) (D,2) (D,1) → shuffle and merge → (A,7) (B,8) (C,5) (D,4) - 24 hours!!
  11. Use DB to Pre-Sort and mapPartitions - pre-sorted partitions: (A,1) (A,3) (A,3) | (B,2) (B,4) (B,2) | (C,2) (C,2) (C,1) | (D,1) (D,2) (D,1) → per-partition aggregation → (A,7) (B,8) (C,5) (D,4) - 12 hours!!
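The contrast in slides 10-11 can be mimicked with the standard library: `reduce_by_key` groups values by key the way `reduceByKey()` effectively does across a shuffle, while `agg_presorted` assumes the rows already arrive ordered by key - as the deck gets them from Redshift - so a single streaming pass with `itertools.groupby` suffices. Function names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def reduce_by_key(pairs):
    # Analogue of rdd.reduceByKey(add): works on any order,
    # but on a cluster this ordering-by-key costs a full shuffle.
    acc = {}
    for k, v in pairs:
        acc[k] = acc.get(k, 0) + v
    return acc

def agg_presorted(pairs):
    # Analogue of pre-sorting in the DB, then mapPartitions():
    # pairs must already be ordered by key, so one linear scan suffices.
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(pairs, key=itemgetter(0))}

# The pre-sorted input from slide 11:
pairs = [("A", 1), ("A", 3), ("A", 3), ("B", 2), ("B", 4), ("B", 2),
         ("C", 2), ("C", 2), ("C", 1), ("D", 1), ("D", 2), ("D", 1)]
assert agg_presorted(pairs) == reduce_by_key(pairs)
assert agg_presorted(pairs) == {"A": 7, "B": 8, "C": 5, "D": 4}
```

The trade the deck is making: pay the sorting cost once in the database, so Spark never has to shuffle 4 billion rows.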
  12. Prediction - Atomic Job: Do Prediction
  13. Prediction - Batch Job: Do Prediction
  14. Conclusion • Using the DB to pre-sort the data is better than doing the aggregation in Spark • Batch prediction is better than atomic prediction
  15. Another Case - Spam Article Classifier • Article Structure Classifier • Article Content Classifier • Bag of Words • High-Dimensional Feature Space • Very Sparse Vectors • Large Number of Documents
  16. Original: sc.textFile → RDD → text to terms → RDD → collect to Python list (2 Million Docs - Bang!!) → dict vectorize → sparse vector → tf-idf transform → build classifier
  17. New: sc.textFile → RDD → text to terms → RDD → distinct RDD → collect terms and build vectorizer → vectorize terms to sparse vector RDD → collect sparse vectors to list → use vstack to turn list into sparse matrix → tf-idf transform → tf-idf sparse matrix → build classifier → build classifier
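The pipeline in the last slide - collect distinct terms, vectorize each document into a sparse row, then stack the rows into one sparse matrix - can be sketched with plain dicts standing in for SciPy sparse types. In the real pipeline the stacking would be `scipy.sparse.vstack`; here `vstack` is a hypothetical stdlib-only analogue, and all names are illustrative:

```python
def build_vocab(docs):
    # "collect terms and build vectorizer": distinct terms -> column index.
    terms = sorted({t for doc in docs for t in doc})
    return {t: i for i, t in enumerate(terms)}

def to_sparse_vector(doc, vocab):
    # One doc -> {column: count}; only non-zero entries are stored,
    # which is what keeps millions of docs from blowing up the driver.
    vec = {}
    for t in doc:
        vec[vocab[t]] = vec.get(vocab[t], 0) + 1
    return vec

def vstack(rows):
    # Stand-in for scipy.sparse.vstack: stack row vectors into one
    # {(row, col): value} matrix ready to feed a classifier.
    return {(i, c): v for i, row in enumerate(rows) for c, v in row.items()}

docs = [["spam", "buy", "buy"], ["hello", "world"]]
vocab = build_vocab(docs)
matrix = vstack([to_sparse_vector(d, vocab) for d in docs])
```

The key point the slide makes is that the dense collect of 2 million documents ("Bang!!") disappears: only term indices and counts ever leave the executors.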
