
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS

Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed in the context of GPU-enabled Kubernetes clusters.

Published in: Software

  1. John Zedlewski, Devin Robison: Accelerating Machine Learning with RAPIDS and MLflow
  2. Outline: ● RAPIDS for accelerated data science ● Why RAPIDS + MLflow? ● Example integration ● Training and deployment as an MLproject
  3. What is RAPIDS?
  4. Open Standards Data Science Ecosystem: traditional Python APIs on CPU. [Diagram: data preparation, model training, and visualization stages in CPU memory, using Pandas (analytics), Scikit-Learn (machine learning), NetworkX (graph analytics), PyTorch/TensorFlow/MXNet (deep learning), and Matplotlib (visualization), scaled out with Dask]
  5. RAPIDS: end-to-end GPU-accelerated data science. [Diagram: the same stages in GPU memory, using cuDF/cuIO (analytics), cuML and XGBoost (machine learning), cuGraph (graph analytics), PyTorch/TensorFlow/MXNet (deep learning), and cuxfilter/pyViz/plotly (visualization), scaled out with Dask]
  6. RAPIDS ETL: GPU-accelerated data wrangling and feature engineering. [Diagram: as above, with cuDF and cuIO highlighted as the ETL layer]
  7. Data Processing Evolution: faster data access, less data movement. Hadoop processing, reading from disk: HDFS read, query, HDFS write, HDFS read, ETL, HDFS write, HDFS read, ML train. Spark in-memory processing: HDFS read, query, ETL, ML train (25-100x improvement, less code, language flexible, primarily in-memory). Traditional GPU processing: HDFS read, then GPU read / query / CPU write, GPU read / ETL / CPU write, GPU read / ML train (5-10x improvement, more code, language rigid, substantially on GPU).
  8. Data Processing Evolution, continued. RAPIDS: Arrow read, query, ETL, ML train (50-100x improvement, same code, language flexible, primarily on GPU).
  9. GPU-Accelerated ETL: the average data scientist spends 90+% of their time in ETL as opposed to training models.
  10. ETL Technology Stack. [Diagram, top to bottom: Dask cuDF / cuDF / Pandas at the Python layer, Cython bindings, the cuDF C++ core, CUDA libraries (Thrust, CUB, Jitify), and CUDA itself]
  11. ETL, the Backbone of Data Science. cuDF is… ▸ a Python library for manipulating GPU DataFrames following the Pandas API ▸ a Python interface to a CUDA C++ library with additional functionality ▸ able to create GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables ▸ JIT compilation of user-defined functions (UDFs) using Numba ▸ support for the most common formats: CSV, Parquet, ORC, JSON, Avro, HDF5, and more…
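The pandas-API parity described on slide 11 can be sketched in a few lines. This example is illustrative and not from the deck: the data is invented, and the sketch falls back to pandas when cuDF is not installed, since both libraries expose the same DataFrame surface for this code.

```python
# Illustrative sketch (not from the deck): cuDF follows the pandas API,
# so the same ETL code runs on GPU or CPU depending on which import wins.
try:
    import cudf as xdf  # GPU DataFrames
except ImportError:
    import pandas as xdf  # CPU fallback with the same API surface

df = xdf.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "delay": [5, 12, 0, 7, 30],
})
# Identical groupby/aggregation call in both libraries
mean_delay = df.groupby("carrier")["delay"].mean()
```

The try/except import is a common pattern for code that should run with or without a GPU present; everything below the import is library-agnostic.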
  12. Benchmarks: Single-GPU Speedup vs. Pandas (cuDF v0.13, Pandas 0.25.3). ▸ Running on an NVIDIA DGX-1: GPU, NVIDIA Tesla V100 32GB; CPU, Intel Xeon E5-2698 v4 @ 2.20GHz. ▸ Benchmark setup: RMM pool allocator enabled; DataFrames with 2x int32 key columns and 3x int32 value columns; merge: inner; GroupBy: count, sum, min, max calculated for each value column. [Chart: GPU speedup over CPU for Merge, Sort, and GroupBy at 10M and 100M rows, roughly 320x to 970x]
  13. Machine Learning with RAPIDS: more models, more problems. [Diagram: cuML joins the GPU-memory stack alongside cuDF/cuIO, cuGraph, the deep learning frameworks, and the visualization libraries]
  14. ML Technology Stack. [Diagram, top to bottom: Dask cuML / Dask cuDF / cuDF / NumPy at the Python layer, Cython bindings, cuML algorithms, cuML prims, CUDA libraries (Thrust, CUB, cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBlas), and CUDA]
  15. RAPIDS Matches Common Python APIs: CPU-based clustering with scikit-learn.

      from sklearn.datasets import make_moons
      import pandas

      X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
      X = pandas.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

      from sklearn.cluster import DBSCAN
      dbscan = DBSCAN(eps=0.3, min_samples=5)
      y_hat = dbscan.fit_predict(X)  # DBSCAN exposes fit_predict, not a separate predict()

  16. RAPIDS Matches Common Python APIs: GPU-accelerated clustering with cuDF and cuML; only the imports change.

      from sklearn.datasets import make_moons
      import cudf

      X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
      X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

      from cuml import DBSCAN
      dbscan = DBSCAN(eps=0.3, min_samples=5)
      y_hat = dbscan.fit_predict(X)
  17. Algorithms: GPU-accelerated Scikit-Learn (key: preexisting | NEW or enhanced for 0.15).
      Classification / regression: decision trees / random forests, linear/Lasso/Ridge/ElasticNet regression, logistic regression, k-nearest neighbors, support vector machine classification and regression, naive Bayes.
      Inference: random forest / GBDT inference (FIL).
      Clustering: k-means, DBSCAN, spectral clustering.
      Decomposition & dimensionality reduction: principal components, singular value decomposition, UMAP, spectral embedding, t-SNE.
      Time series: Holt-Winters, seasonal ARIMA / auto-ARIMA.
      Preprocessing: text vectorization (TF-IDF / count), target encoding.
      Hyper-parameter tuning: cross-validation / splitting.
      More to come!
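The scikit-learn compatibility this algorithm list implies can be shown concretely. A hypothetical sketch, not from the deck: the estimator import is the only GPU-specific line, the data points are invented, and the example falls back to scikit-learn when cuML (or a GPU) is unavailable.

```python
import numpy as np

# Illustrative sketch (not from the deck): cuML estimators mirror the
# scikit-learn fit/predict interface, so only the import differs.
try:
    from cuml.cluster import KMeans  # GPU path
except ImportError:
    from sklearn.cluster import KMeans  # CPU fallback, same API

# Two well-separated blobs of invented points
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
km = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(X)
# Points 0 and 1 land in one cluster, points 2 and 3 in the other
```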
  18. Benchmarks: Single-GPU cuML vs. Scikit-learn. 1x V100 vs. 2x 20-core CPUs (DGX-1, RAPIDS 0.15).
  19. Forest Inference: taking models from training to production. cuML's Forest Inference Library (FIL) accelerates prediction (inference) for random forests and boosted decision trees: ▸ works with existing saved models (XGBoost, LightGBM, scikit-learn RF, cuML RF) ▸ lightweight Python API ▸ a single V100 GPU can infer up to 34x faster than a dual-CPU XGBoost node ▸ over 100 million forest inferences/sec on a DGX-1V. [Chart: XGBoost CPU inference (40 cores) vs. FIL GPU (1x V100), 1000 trees, on the Bosch, Airline, Epsilon, and Higgs datasets, with 23x to 36x speedups]
  20. XGBoost + RAPIDS: Better Together. ▸ RAPIDS comes paired with XGBoost 1.2 (as of 0.15) ▸ XGBoost now builds on the GoAI interface standards to provide zero-copy data import from cuDF, CuPy, Numba, PyTorch, and more ▸ the official Dask API makes it easy to scale to multiple nodes or multiple GPUs ▸ the gpu_hist tree builder delivers huge performance gains ▸ memory usage when importing GPU data decreased by 2/3 or more ▸ new objectives support learning to rank on GPU. All RAPIDS changes are integrated upstream and provided to all XGBoost users, via PyPI or RAPIDS conda.
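In practice, the gpu_hist tree builder mentioned above is a single-parameter change. A minimal sketch: the xgb_params helper is hypothetical (not part of XGBoost), and only the tree_method values come from the slide; the resulting dict is what you would pass to xgboost.train().

```python
# Hypothetical helper (not part of XGBoost): pick the tree builder.
# "gpu_hist" (from the slide) selects the GPU histogram builder;
# "hist" is its CPU equivalent.
def xgb_params(use_gpu: bool) -> dict:
    return {
        "objective": "binary:logistic",
        "tree_method": "gpu_hist" if use_gpu else "hist",
        "max_depth": 8,
    }
```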
  21. Explore: RAPIDS Getting Started, Code, and Blogs (from intro to in-depth): https://rapids.ai | https://github.com/rapidsai | https://medium.com/rapids-ai
  22. RAPIDS Everywhere: The Next Phase of RAPIDS. Exactly as it sounds: our goal is to make RAPIDS as usable and performant as possible wherever data science is done. We will continue to work with more open source projects to further democratize acceleration and efficiency in data science.
  23. MLflow + RAPIDS
  24. MLflow: "... an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry." (mlflow.org) And it works with RAPIDS, out of the box!
  25. Why RAPIDS + MLflow? RAPIDS provides substantial speedups across a wide range of machine learning and ETL tasks, with a scikit-learn-compatible API. MLflow provides improved collaboration, experiment tracking, model storage, registration, and deployment. [Diagram: production/engineering loop of update, training, validate, good?, update]
  26. HPO Use Case: 100-Job Random Forest Airline Model. Huge speedups translate into a >7x TCO reduction. Based on sample random forest training code from the cloud-ml-examples repository, running on Azure ML: 10 concurrent workers with 100 total runs, 100M rows, 5-fold cross-validation per run. GPU nodes: 10x Standard_NC6s_v3 (1x V100 16GB, 6 vCPUs, 112GB memory, Xeon E5-2690 v4 Broadwell) at $3.366/hour. CPU nodes: 10x Standard_DS5_v2 (16 vCPUs, 56GB memory, Xeon E5-2673 v3 Haswell or v4 Broadwell) at $1.017/hour. [Chart: cost and time in hours for the two configurations]
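The >7x TCO figure follows from simple arithmetic over node prices and wall-clock times. The hourly prices and node counts below come from the slide; the runtimes are hypothetical placeholders (the real values appear only in the chart), chosen to show how roughly a 24x speedup at these prices works out to about a 7x cost reduction:

```python
# Node prices and counts from the slide; runtimes are HYPOTHETICAL.
GPU_NODE_PRICE = 3.366  # $/hour, Standard_NC6s_v3
CPU_NODE_PRICE = 1.017  # $/hour, Standard_DS5_v2
NODES = 10              # concurrent workers in both clusters

gpu_hours = 1.0         # placeholder GPU wall-clock time
cpu_hours = 24.0        # placeholder CPU wall-clock time

gpu_cost = GPU_NODE_PRICE * NODES * gpu_hours
cpu_cost = CPU_NODE_PRICE * NODES * cpu_hours
tco_reduction = cpu_cost / gpu_cost  # ~7.25x with these placeholders
```

The point of the arithmetic: even though the GPU nodes cost over 3x more per hour, a large enough speedup makes the total run far cheaper.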
  27. Integration and Training: Nested HPO Experiments. [Diagram: a parent experiment containing child HPO runs, each logging an accuracy metric, configuration parameters, and metadata tags]
  28. Component Overview: Some Terminology. [Diagram: a local file system hosting both the backend store (e.g. under /tmp/...) and the artifact store]
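The backend store and artifact store above can be pointed at concrete locations with the standard mlflow server flags. A configuration sketch with illustrative paths; the SQLite URI matches the one used later in the deck:

```shell
# Backend store: experiment/run metadata, here a local SQLite file.
# Artifact store: logged models and files, here a local directory.
mlflow server \
    --backend-store-uri sqlite:////tmp/mlflow-db.sqlite \
    --default-artifact-root /tmp/mlruns \
    --host 127.0.0.1 --port 5000
```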
  29. A Quick Example: Convert an Existing Project. Steps: conversion to RAPIDS and MLflow, add nesting+HPO and model logging, add project entry points, Anaconda and Docker training, deployment, and finally a trained model.
  30. Integration and Training: Basic Conversion (scikit-learn to cuML).

      Unmodified training code:

      from sklearn.ensemble import RandomForestClassifier

      def train(fpath, max_depth, max_features, n_estimators):
          X_train, X_test, y_train, y_test = load_data(fpath)
          mod = RandomForestClassifier(
              max_depth=max_depth, max_features=max_features,
              n_estimators=n_estimators
          )
          mod.fit(X_train, y_train)
          preds = mod.predict(X_test)
          accuracy = accuracy_score(y_test, preds)
          return mod, accuracy

      Augmented training code (swap in cuML, start an MLflow run, record parameters and performance metrics):

      from cuml.ensemble import RandomForestClassifier

      def train(fpath, max_depth, max_features, n_estimators):
          X_train, X_test, y_train, y_test = load_data(fpath)
          with mlflow.start_run(run_name="RAPIDS-MLFlow"):
              mlparams = {
                  "max_depth": str(max_depth),
                  "max_features": str(max_features),
                  "n_estimators": str(n_estimators),
              }
              mlflow.log_params(mlparams)
              mod = RandomForestClassifier(
                  max_depth=max_depth, max_features=max_features,
                  n_estimators=n_estimators
              )
              mod.fit(X_train, y_train)
              preds = mod.predict(X_test)
              accuracy = accuracy_score(y_test, preds)
              mlflow.log_metric("accuracy", accuracy)
              return mod
  31. Integration: Nesting+HPO and Model Logging.

      Import our HPO library and update the nested training function:

      from cuml.ensemble import RandomForestClassifier
      from your_hpo_library import HPO_Runner

      # Called by hpo_runner
      def hpo_train(params):
          X_train, X_test, y_train, y_test = load_data(params.fpath)
          with mlflow.start_run(run_name=f"Trial {params.trial}", nested=True):
              mod = RandomForestClassifier(
                  max_depth=params.max_depth,
                  max_features=params.max_features,
                  n_estimators=params.n_estimators
              )
              mod.fit(X_train, y_train)
              preds = mod.predict(X_test)
              accuracy = accuracy_score(y_test, preds)
              return mod, accuracy

      Add the HPO runner, log the runs, and register the best result:

      hpo_runner = HPO_Runner(hpo_train)
      with mlflow.start_run(run_name="RAPIDS-HPO", nested=True):
          search_space = [
              uniform("max_depth", 5, 20),
              uniform("max_features", 0.1, 1.0),
              uniform("n_estimators", 150, 1000),
          ]
          hpo_results = hpo_runner(fpath, search_space)

      artifact_path = "rapids-mlflow-example"
      with mlflow.start_run(run_name='Final Classifier', nested=True):
          mlflow.sklearn.log_model(hpo_results.best_model,
                                   artifact_path=artifact_path,
                                   registered_model_name="rapids-mlflow-example",
                                   conda_env='conda/conda.yaml')
  32. Integration and Training: Packaging Your Environment.

      Project layout:

      ./
      ├── airline_small.parquet
      ├── envs
      │   └── conda.yaml
      ├── Dockerfile.training
      ├── MLproject
      ├── README.md
      └── src
          ├── entrypoint.sh
          └── train.py

      MLproject, Docker/K8s path:

      name: rapids-mlflow
      docker_env:
        image: mlflow-rapids-example
      entry_points:
        hpo_run:
          parameters:
            fpath: {type: str}
            n_estimators: {type: int, default: 100}
            max_features: {type: float}
            max_depth: {type: int}
          command: "/bin/bash src/entrypoint.sh src/train.py --fpath={fpath} --n_estimators={n_estimators} --max_features={max_features} --max_depth={max_depth}"

      MLproject, Anaconda path:

      name: rapids-mlflow
      conda_env: envs/conda.yaml
      entry_points:
        hpo_run:
          parameters:
            fpath: {type: str}
            n_estimators: {type: int, default: 100}
            max_features: {type: float}
            max_depth: {type: int}
          command: "python src/train.py --fpath={fpath} --n_estimators={n_estimators} --max_features={max_features} --max_depth={max_depth}"

      Dockerfile.training:

      FROM rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8
      RUN source activate rapids && pip install mlflow

      Conda export:

      $ conda env export --name mlflow > envs/conda.yaml
  33. Integration and Training: Bringing Things Together.

      Anaconda:

      ## New conda environment
      $ conda create --name mlflow python=3.8
      ....
      $ conda activate mlflow

      ## Install mlflow libs/tools -- this gives us the mlflow util
      $ pip install mlflow

      ## Create a training run with 'mlflow run'
      $ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite

      ## Train in a custom conda environment
      $ mlflow run --experiment-name "RAPIDS-MLflow-Conda" --entry-point hpo_run ./
      ....
      Created version '10' of model 'rapids_mlflow_cli'.
      Model uri: ./mlruns/3/c20642df4137490fba2ca96a7b4431b0/artifacts/Airline-Demo
      2020/09/29 23:36:37 INFO mlflow.projects: === Run (ID 'c20642df4137490fba2ca96a7b4431b0') succeeded ===

      Docker (same mlflow environment setup as above):

      ## Build the training image so we can deploy later
      $ docker build --tag mlflow-rapids-example --file ./Dockerfile.training ./
      ....

      ## Create a training run with 'mlflow run'
      $ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
      $ mlflow run --experiment-name "RAPIDS-MLflow-Docker" --entry-point hpo_run ./

      Nvidia-Docker (/etc/docker/daemon.json):

      $ vi /etc/docker/daemon.json
      {
        "default-runtime": "nvidia",
        "runtimes": { "nvidia": { .... } }
      }
  34. Integration and Training: Nested HPO Experiments, revisited. [Diagram: the parent experiment with child HPO runs, each logging an accuracy metric, configuration parameters, and metadata tags]
  35. Model Deployment.

      Anaconda: serve a registered model (the -m argument can also be a storage path), then query it with a request:

      $ mlflow models serve -m models:/rapids_mlflow_cli/1 -p 56767
      2020/09/24 18:05:26 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
      2020/09/24 18:05:26 INFO mlflow.pyfunc.backend: === Running command 'gunicorn --timeout=60 -b 127.0.0.1:56767 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
      [2020-09-24 18:05:26 -0600] [17024] [INFO] Starting gunicorn 20.0.4
      [2020-09-24 18:05:26 -0600] [17024] [INFO] Listening at: http://127.0.0.1:56767 (17024)
      [2020-09-24 18:05:26 -0600] [17024] [INFO] Using worker: sync
      [2020-09-24 18:05:26 -0600] [17026] [INFO] Booting worker with pid: 17026
      [2020-09-24 18:05:28 -0600] [17024] [INFO] Handling signal: winch

      Docker serving (EXPERIMENTAL): build a serving image from a registered model:

      $ mlflow models build-docker -m models:/rapids_mlflow_cli/9 -n mlflow-rapids-example
      2020/09/24 16:43:18 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
      2020/09/24 16:43:18 INFO mlflow.models.docker_utils: Building docker image with name mlflow-rapids-example
      .... build process ....
      Successfully built 900f8e84b370
      Successfully tagged mlflow-rapids-example:latest
  36. Endpoint Inference.

      test_query.py:

      import json
      import requests

      host = 'localhost'
      port = '56767'

      headers = {
          "Content-Type": "application/json",
          "format": "pandas-split"
      }
      data = {
          "columns": ["Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
                      "CRSArrTime", "UniqueCarrier", "FlightNum",
                      "ActualElapsedTime", "Origin", "Dest", "Distance", "Diverted"],
          "data": [[1987, 10, 1, 4, 1, 556, 0, 190, 247, 202, 162, 1846, 0]]
      }

      resp = requests.post(url="http://%s:%s/invocations" % (host, port),
                           data=json.dumps(data), headers=headers)
      print('Classification: %s' % ("ON-Time" if resp.text == "[0.0]" else "LATE"))

      Shell:

      $ python src/rf_test/test_query.py
      Classification: ON-Time
  37. RAPIDS Cloud-ML Examples: https://github.com/rapidsai/cloud-ml-examples. RAPIDS cloud notebooks: Amazon AWS, Databricks, Microsoft Azure, Google GCP. RAPIDS platform integration: SageMaker, AzureML, Google AI Platform. RAPIDS framework integration: Dask, MLflow, Optuna, RayTune. RAPIDS + MLflow all-in-one deployments (coming soon!).
  38. Thank you! Find us on Twitter: @rapidsai
