
Azure Machine Learning - Deep Learning with Python, R, Spark, and CNTK


  1. 1. Azure Machine Learning: Other Topics. Microsoft Taiwan, Technical Evangelist 吳宏彬, 8/25/2016
  2. 2. What is the R language? Open source: the “lingua franca” for analytics, computing, and modeling. Global community: millions of users. Ecosystem: 7000+ algorithms, test data, and evaluations. Scalability: can be scaled to big data and big analytics.
  3. 3. Polls of data miners and analytics professionals on their software choices since 2007 Source: http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
  4. 4. • R is developed and contributed to by the open source community • CRAN: the Comprehensive R Archive Network • Package repository of R • 7500+ packages, covering all aspects of statistical analysis, machine learning, natural language processing … • Still growing exponentially • Free! Source: http://r4stats.com/2014/04/07/r-continues-its-rapid-growth/
  5. 5. 1. Seasonal ARIMA 2. Non-seasonal ARIMA 3. Seasonal ETS 4. Non-seasonal ETS 5. Average of seasonal ETS and seasonal ARIMA
  6. 6. Mean Error (ME) - Average forecasting error (an error is the difference between the predicted value and the actual value) on the test dataset Root Mean Squared Error (RMSE) - The square root of the average of squared errors of predictions made on the test dataset. Mean Absolute Error (MAE) - The average of absolute errors Mean Percentage Error (MPE) - The average of percentage errors Mean Absolute Percentage Error (MAPE) - The average of absolute percentage errors Mean Absolute Scaled Error (MASE) Symmetric Mean Absolute Percentage Error (sMAPE)
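The error measures listed on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the talk. The function name `forecast_errors` is hypothetical, the sign convention (error = predicted minus actual) is an assumption taken from the slide's wording, and sMAPE and MASE use one common definition each (the 200 * |e| / (|a| + |p|) form, and scaling by a non-seasonal naive forecast on the training series), since the slide does not define them.

```python
import math

def forecast_errors(actual, predicted, train=None):
    """Return the slide's error measures for one test set (hypothetical helper)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    n = len(actual)
    # Assumption: error = predicted - actual, following the slide's wording.
    errors = [p - a for a, p in zip(actual, predicted)]
    # Percentage errors are relative to the actual value (requires actual != 0).
    pct = [100.0 * e / a for e, a in zip(errors, actual)]
    result = {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct) / n,
        "MAPE": sum(abs(x) for x in pct) / n,
        # Assumption: the symmetric variant with |actual| + |predicted| in the denominator.
        "sMAPE": sum(200.0 * abs(e) / (abs(a) + abs(p))
                     for e, a, p in zip(errors, actual, predicted)) / n,
    }
    if train is not None and len(train) > 1:
        # Assumption: MASE scales MAE by the in-sample MAE of a naive
        # one-step forecast (each value predicted by the previous one).
        naive_mae = sum(abs(train[i] - train[i - 1])
                        for i in range(1, len(train))) / (len(train) - 1)
        result["MASE"] = result["MAE"] / naive_mae
    return result

scores = forecast_errors([100, 200, 400], [110, 190, 400], train=[90, 100, 120])
```

Note that ME and MPE can be near zero even when individual errors are large, because positive and negative errors cancel; that is why the absolute and squared variants are usually reported alongside them.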
  7. 7. Columns: (unlabeled) | Microsoft R Open | Microsoft R Server. Data size: in-memory | in-memory | in-memory or disk-based. Speed of analysis: single-threaded | multi-threaded | multi-threaded, parallel processing across 1:N servers. Support: community | community | community + commercial. Analytic breadth & depth: 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions. License: open source | open source | commercial license, supported release with indemnity.
  8. 8. • Supports standard Python library types such as Pandas data frames and NumPy arrays. • Python code executes on Anaconda 2.1, which comes with close to 200 of the most common Python packages (such as NumPy, SciPy, and scikit-learn). • Outputs images generated with Matplotlib.
  9. 9. KNN
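The slide names KNN (k-nearest neighbors) without further detail. As a rough sketch of the idea only, a query point is classified by majority vote among the k closest training points. The function name `knn_predict`, the Euclidean distance, and the toy data are assumptions for illustration and are not taken from the talk's demo.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by Euclidean distance to the query (Python 3.8+).
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the k nearest points and return the most frequent one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, (0.5, 0.5), k=3)
```

In practice a library implementation such as scikit-learn's would be the usual choice; the version above only shows the mechanics.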
  10. 10. What is Spark?
  11. 11. Data is growing faster than processing speeds Only solution is to parallelize data processing on large clusters Example: HDInsight
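The idea of parallelizing data processing can be sketched in plain Python: split the data into partitions, process each partition independently, then merge the partial results. This is only a single-machine stand-in for what a cluster does. It does not use the Spark API, the function names are hypothetical, and a thread pool is used purely to show the structure (CPython threads do not speed up CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(records, num_partitions):
    """Split records into roughly equal chunks, one per hypothetical worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def count_words(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, num_partitions=4):
    """Count words by processing partitions independently and merging results."""
    parts = partition(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, parts))
    # Reduce step: merge the per-partition counts into one result.
    return reduce(lambda a, b: a + b, partials, Counter())

totals = parallel_word_count(
    ["spark is fast", "spark scales", "hadoop is storage"], num_partitions=2
)
```

Because each partition is processed without reference to the others, the same structure carries over to many machines, which is the property cluster frameworks rely on.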
  12. 12. What is Spark? Fast, expressive cluster computing system compatible with Apache Hadoop • Works with any Hadoop-supported storage system (HDFS, S3, Avro, …) Improves efficiency through: • In-memory computing primitives • General computation graphs (up to 100× faster) Improves usability through: • Rich APIs in Java, Scala, Python • Interactive shell (often 2-10× less code) Spark was started by Matei Zaharia at UC Berkeley AMPLab in 2009, was open sourced in 2010, and was donated to Apache in 2013.
  13. 13. Spark for Azure HDInsight [Diagram: clients (decision makers), a Spark cluster of Spark nodes, and a storage layer]
  14. 14. Spark Notebooks Using the Spark shell to run interactive queries Using the Spark shell to run Spark SQL queries Using a standalone Scala program
  15. 15. Spark Notebooks: Zeppelin for Scala users; Jupyter for Python users
  16. 16. Programming Spark
  17. 17. 2015 System Human Error Rate 4% Speech Recognition could reach human parity in the next 3 years
  19. 19. Microsoft won first place in every ImageNet 2015 competition category using deep learning. ImageNet classification top-5 error (%): 28.2 (ILSVRC 2010, NEC America), 25.8 (ILSVRC 2011, Xerox), 16.4 (ILSVRC 2012, AlexNet), 11.7 (ILSVRC 2013, Clarifai), 7.3 (ILSVRC 2014, VGG), 6.7 (ILSVRC 2014, GoogLeNet), 3.5 (ILSVRC 2015, MSRA ResNet). All 5 Microsoft entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.
  20. 20. CNTK At the Heart: Computational Networks • A generalization of machine learning models that can be described as a series of computational steps. • E.g., DNN, CNN, RNN, LSTM, DSSM, Seq2Seq, log-linear model • Representation: • A list of computational nodes denoted as n = {node name : operation name} • The parent-children relationship describing the operands {n : c1, ···, cKn} • Kn is the number of children of node n. For leaf nodes Kn = 0. • Order of the children matters: e.g., XY is different from YX • Given the inputs (operands), the value of the node can be computed. • Can flexibly describe deep learning models. • Adopted by many other popular tools as well
  21. 21. • A generalization of machine learning models that can be described as a series of computational steps. • E.g., DNN, CNN, RNN, LSTM, DSSM, log-linear model • Representation: • A list of computational nodes denoted as n = {node name : operation name} • The parent-children relationship describing the operands {n : c1, ···, cKn} • Kn is the number of children of node n. For leaf nodes Kn = 0. • Order of the children matters: e.g., XY is different from YX • Given the inputs (operands), the value of the node can be computed. • Can flexibly describe deep learning models. • Adopted by many other popular tools as well
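The representation described on these slides (a map from node name to operation name, plus an ordered list of children per node) can be sketched as follows. This is a toy illustration and not CNTK code: the operation names, the `evaluate` function, and the dictionary layout are assumptions chosen to mirror the slide's notation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def plus(a, b):
    """Add two matrices of the same shape element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical operation names mapped to their implementations.
OPERATIONS = {"Times": matmul, "Plus": plus}

def evaluate(nodes, children, inputs, target):
    """Compute the value of `target` by first evaluating its children in order."""
    cache = dict(inputs)  # leaf nodes (Kn = 0) take their values from the inputs

    def value(name):
        if name not in cache:
            # The operand order follows the children list, so XY differs from YX.
            operands = [value(child) for child in children[name]]
            cache[name] = OPERATIONS[nodes[name]](*operands)
        return cache[name]

    return value(target)

# n = {node name : operation name}
nodes = {"X": "Input", "Y": "Input", "XY": "Times", "YX": "Times"}
# {n : c1, ..., cKn}
children = {"X": [], "Y": [], "XY": ["X", "Y"], "YX": ["Y", "X"]}
inputs = {"X": [[1, 2], [3, 4]], "Y": [[0, 1], [1, 0]]}

xy = evaluate(nodes, children, inputs, "XY")
yx = evaluate(nodes, children, inputs, "YX")
```

Evaluating the two product nodes gives different matrices for the same pair of inputs, which is the point of the slide's remark that the order of children matters.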
  22. 22. “CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.” [Chart: speed comparison (samples/second), higher = better, for CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); note: December 2015.] Theano only supports 1 GPU. Achieved with the 1-bit gradient quantization algorithm. * TensorFlow added distributed compute support in April 2016.
  23. 23. • Microsoft researcher Slawek Smyl won CIF 2016 using an LSTM neural network • Powered by CNTK
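The 1-bit gradient quantization mentioned on this slide can be sketched roughly as follows: each gradient value is reduced to a single bit (its sign) before being exchanged between workers, and the quantization error is kept locally and added to the next gradient so that it is not lost. This is a simplified illustration under stated assumptions, not CNTK's implementation. The function names are hypothetical, and using the mean of each sign group as the reconstruction value is one possible choice among several.

```python
def quantize_1bit(gradient, residual):
    """Quantize a gradient to one bit per value, carrying the error forward.

    Returns (bits, pos_value, neg_value, new_residual).
    """
    # Add the error left over from the previous step before quantizing.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [c >= 0 for c in corrected]
    pos = [c for c in corrected if c >= 0]
    neg = [c for c in corrected if c < 0]
    # Assumption: reconstruct each bit as the mean of its sign group.
    pos_value = sum(pos) / len(pos) if pos else 0.0
    neg_value = sum(neg) / len(neg) if neg else 0.0
    restored = dequantize_1bit(bits, pos_value, neg_value)
    # The residual is whatever the one-bit representation failed to capture.
    new_residual = [c - q for c, q in zip(corrected, restored)]
    return bits, pos_value, neg_value, new_residual

def dequantize_1bit(bits, pos_value, neg_value):
    """Rebuild an approximate gradient from the bits and two reconstruction values."""
    return [pos_value if b else neg_value for b in bits]

gradient = [0.5, -0.25, 1.5, -0.75]
bits, pos_value, neg_value, residual = quantize_1bit(gradient, [0.0] * len(gradient))
approx = dequantize_1bit(bits, pos_value, neg_value)
```

Only the bits and the two reconstruction values need to be communicated, which is where the bandwidth saving comes from; the residual stays on the worker that produced it.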
  24. 24. CIF Competition 2016 – Final Results • Contestant 1 – Slawek Smyl (LSTM-based NN on deseasonalized data) • Contestant 2 – Slawek Smyl (weighted average of my 3 methods) • Contestant 3 – prof. Sven Crone (Multilayer Perceptron with a thorough feature search) • Contestant 4 - Mikhail Artyukhov (previous competition winner, ensemble models) • Contestant 5 - Joerg Wichard, Bayer Healthcare AG (Adaptive Forecasting Strategy with Hybrid Ensemble Models) • Contestant 6 – Slawek Smyl (LSTM-based NN)
  25. 25. CNTK Demo
  26. 26. CNTK Architecture [Diagram components: CN Builder, CN Description, Lambda, IDataReader (task-specific reader), ILearner (SGD, AdaGrad, etc.), IExecutionEngine (CPU/GPU), Features & Labels; arrows labeled Use, Build, Load, Get data, Evaluate, Compute Gradient]
  27. 27. (1) Kai Chen and Qiang Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering”, in International Conference on Acoustics, Speech and Signal Processing, March 2016, Shanghai, China.
  28. 28. • CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux • CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy with a new reader design • Network definition language, macros, and model editing language (as well as Python and C++ bindings in the future) make network design and modification easy • Compared to other tools, CNTK has a great balance between efficiency, performance, and flexibility
  29. 29. microsoft.com/cognitive
  30. 30. Columns: Mahout | Spark ML | Azure ML | R Server. Shared service: No | No | Yes | No. Deployment model: PaaS | PaaS | PaaS | IaaS. Extensibility: High | High | Medium | High. Deployment complexity: Medium | High | Low | Medium. Cost: High | High | Low | High. Programming languages: Java/Scala | Scala/Java/Python | Python/R | R. Algorithms: Limited (growing) | MLlib/scikit | Many (scikit/CRAN) | Many (CRAN). Scalability: High | High | Medium | Medium.
  31. 31. xgboost Vowpal Wabbit Rattle CNTK *Copy
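  32. 32. On demand in the cloud • All kinds of data • Rapid service deployment • Data sharing and collaboration • Open • Supports the complete data analysis workflow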
  33. 33. https://gallery.cortanaintelligence.com/
  34. 34. The only vendor offering a complete solution, from data ingestion to generating actions and presenting data
  35. 35. Components of R Open and Microsoft R Server (with DeployR and DevelopR). ConnectR: • High-speed & direct connectors. Available for: • High-performance XDF • SAS, SPSS, delimited & fixed format text data files • Hadoop HDFS (text & XDF) • Teradata Database & Aster • EDWs and ADWs • ODBC. ScaleR: • Ready-to-use high-performance big data analytics • Fully parallelized analytics • Data prep & data distillation • Descriptive statistics & statistical tests • Range of predictive functions • User tools for distributing customized R algorithms across nodes. DistributedR: • Distributed computing framework • Delivers cross-platform portability. Available on: • Windows Servers • Red Hat and SuSE Linux Servers • Teradata Database • Cloudera Hadoop • Hortonworks Hadoop • MapR Hadoop. R+CRAN: • Open source R interpreter • R 3.2.2 • Freely available huge range of R algorithms • Algorithms callable by RevoR • 100% compatible with existing R scripts, functions, and packages. RevoR: • Performance-enhanced R interpreter • Based on open source R • Adds high-performance math library to speed up linear algebra functions.
  36. 36. Data Step: • Data import (delimited, fixed, SAS, SPSS, ODBC) • Variable creation & transformation • Recode variables • Factor variables • Missing value handling • Sort, Merge, Split • Aggregate by category (means, sums). Descriptive Statistics: • Min / Max, Mean, Median (approx.) • Quantiles (approx.) • Standard Deviation • Variance • Correlation • Covariance • Sum of Squares (cross product matrix for set variables) • Pairwise Cross tabs • Risk Ratio & Odds Ratio • Cross-Tabulation of Data (standard tables & long form) • Marginal Summaries of Cross Tabulations. Statistical Tests: • Chi Square Test • Kendall Rank Correlation • Fisher’s Exact Test • Student’s t-Test. Sampling: • Subsample (observations & variables) • Random Sampling. Predictive Models: • Sum of Squares (cross product matrix for set variables) • Multiple Linear Regression • Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), user-defined distributions & link functions • Covariance & Correlation Matrices • Logistic Regression • Classification & Regression Trees • Predictions/scoring for models • Residuals for all models. Cluster Analysis: • K-Means. Classification: • Decision Trees • Decision Forests • Gradient Boosted Decision Trees • Naïve Bayes. Variable Selection: • Stepwise Regression. Simulation: • Simulation (e.g. Monte Carlo) • Parallel Random Number Generation. Combination / Custom Algorithms: • rxDataStep • rxExec • PEMA-R API.
  37. 37. Additional Resources • CNTK: • https://github.com/Microsoft/CNTK • Contains all the source code and example setups • You may understand better how CNTK works by reading the source code • New features are added constantly • How to contact: • CNTK team: ask a question on CNTK GitHub! • Alexey: • Email: alexey.kamenev@microsoft.com • LinkedIn: https://www.linkedin.com/in/alexeykamenev
