Pandas on AWS - Let me count the ways.pdf

Chris Fregly (Principal Solution Architect, AI and Machine Learning at AWS) will give a brief presentation on the various ways to run Pandas, Modin, and Ray at scale on AWS. He will then answer questions from the audience and from the moderator, Alejandro Herrera of Ponder.

Chris Fregly is a Principal Solution Architect for AI and Machine Learning at Amazon Web Services (AWS) based in San Francisco, California. He is the organizer of the Global Data Science on AWS meetup. He is co-author of the O'Reilly Book, "Data Science on AWS."

Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com

  1. Scaling Pandas on AWS: Let me count the ways!
     Chris Fregly, Principal Solution Architect @ AWS
  2. Agenda
     ● Updates from re:Invent 2022 (this week!)
     ● Amazon CodeWhisperer
     ● Overview of AWS services for data, AI and machine learning
     ● AWS SDK for Pandas
     ● Serverless Ray on AWS
  3. Updates from AWS re:Invent 2022 (this week!)
     ● Focused on SageMaker usability, collaboration, and notebooks as jobs
     ● First-class support for large language and generative models like Stable Diffusion
     ● Introduces Serverless Ray (ray.io) to AWS services including SageMaker and Glue
  4. Agenda (repeat of slide 2)
  5. Amazon CodeWhisperer
     ● AWS SDK for Pandas
     ● Modin
     ● Pandas
     ● Spark
     ● Ray
     ● … Everything!
  6. DEMOs!
  7. Agenda (repeat of slide 2)
  8. Quick overview of AWS services for data and AI/ML
     ● AI and machine learning
     ● Data and analytics
  9. Agenda (repeat of slide 2)
  10. Let’s search the internet for “Pandas on AWS”
  11. AWS SDK for Pandas - Python library featuring Modin!
     ● Captures AWS best practices including concurrent reads/writes and security
     ● Developed and maintained by AWS Professional Services and Solution Architects
     ● Allows Pandas to scale to the cloud!
  12. Why does scale matter?
     ● Big data is, well, big!
     ● Requires a lot of RAM to analyze
     ● Data that doesn’t fit into a single server’s RAM needs to run on a cluster
     ● Preferably a dedicated, serverless cluster
     ● A dedicated cluster avoids contention with other users/jobs (long-running)
     ● Serverless reduces cost because we pay only for what we use - less idle time
     ● Fortunately, AWS has innovated on fast-start (<1 sec) dedicated serverless clusters
     ● Based on the Firecracker open source project: https://firecracker-microvm.github.io
  13. Which AWS services are supported by the AWS SDK for Pandas?
     ● Amazon S3
     ● AWS Glue Catalog
     ● Amazon Athena
     ● AWS Lake Formation
     ● Amazon Redshift
     ● PostgreSQL
     ● MySQL
     ● SQL Server
     ● Oracle
     ● Data API Redshift
     ● Data API RDS
     ● OpenSearch
     ● Amazon Neptune
     ● DynamoDB
     ● Amazon Timestream
     ● Amazon EMR
     ● Amazon CloudWatch Logs
     ● Amazon Chime
     ● Amazon QuickSight
     ● AWS STS
     ● AWS Secrets Manager
     ● Global Configurations
  14. Where can I run the AWS SDK for Pandas library?
     ● Local laptop - limited by RAM
     ● AWS Glue notebooks/jobs, interactive serverless clusters
     ● Amazon SageMaker Studio notebooks/jobs, interactive serverless clusters through Glue
     ● Amazon Elastic MapReduce (EMR) Studio notebooks/jobs - serverless clusters (2021)
  15. Lots of tutorials for the AWS SDK for Pandas library
     ● 001 - Introduction
     ● 002 - Sessions
     ● 003 - Amazon S3
     ● 004 - Parquet Datasets
     ● 005 - Glue Catalog
     ● 006 - Amazon Athena
     ● 007 - Databases (Redshift, MySQL, PostgreSQL)
     ● 008 - Redshift - Copy & Unload
     ● 009 - Redshift - Append, Overwrite and Upsert
     ● 010 - Parquet Crawler
     ● 011 - CSV Datasets
     ● 012 - CSV Crawler
     ● 013 - Merging Datasets on S3
     ● 014 - Schema Evolution
     ● 015 - EMR
     ● 016 - EMR & Docker
     ● 017 - Partition Projection
     ● 018 - QuickSight
     ● 019 - Athena Cache
     ● 020 - Spark Table Interoperability
     ● 021 - Global Configurations
     ● 022 - Writing Partitions Concurrently
     ● 023 - Flexible Partitions Filter
     ● 024 - Athena Query Metadata
     ● 025 - Redshift - Loading Parquet files with Spectrum
     ● 026 - Amazon Timestream
     ● 027 - Amazon Timestream 2
     ● 028 - Amazon DynamoDB
     ● 029 - S3 Select
     ● 030 - Data API
     ● 031 - OpenSearch
     ● 032 - Lake Formation Governed Tables
     ● 033 - Amazon Neptune
  16. DEMOs!
  17. Agenda (repeat of slide 2)
  18. Why Ray on AWS?
     ● Used by Amazon.com for some data-intensive use cases
     ● Better performance than Apache Spark in some cases
     ● Customers are asking for a unified Ray API for both data and AI/ML workloads
  19. Serverless Ray on AWS
     ● Scalable data transformations through Ray Datasets
     ● Scalable AI and machine learning through Ray AI Runtime (AIR)
     ● Serverless clusters through SageMaker + Glue Interactive Sessions integration
  20. Which AWS services support Ray?
  21. Scaling Ray from laptop to cluster
  22. Ray Datasets - group by and count
  23. Modin - group by and count
  24. Apache Spark - group by and count
  25. Ray use cases - data processing
     ● Large-scale data ingest and transform
     ● Change data capture
     ● Distributed shuffle
  26. Ray use cases - AI and machine learning
     ● Fast “last-mile” data loading to improve model-training resource usage
     ● Automated machine learning (AutoML) - find the best model and tuning parameters
     ● Hyper-parameter tuning - find the best tuning parameters for a given model
     ● Reinforcement learning - learn from repeated actions and results
     ● Model-ensemble predictions
  27. AWS SDK for Pandas uses Modin for distributed Pandas!
     ● Debugging and performance: https://github.com/aws/aws-sdk-pandas/discussions/1815#common-errors
  28. Lots of Ray tutorials
     ● https://github.com/aws-samples/aws-samples-for-ray
     ● https://docs.ray.io/en/latest/ray-core/examples/overview.html
  29. DEMOs!
  30. Scaling Pandas on AWS: Cheers!
     Chris Fregly, Principal Solution Architect @ AWS
  31. EXTRAS
  32. High-memory instance types on AWS
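
To make slide 11 more concrete, here is a minimal sketch of writing a DataFrame to Amazon S3 and reading it back with the AWS SDK for Pandas (the `awswrangler` package). It is not taken from the deck: the bucket name, the prefix, and the helper names `s3_uri` and `roundtrip_parquet` are hypothetical, and the round trip assumes valid AWS credentials and a bucket you own.

```python
def s3_uri(bucket: str, prefix: str) -> str:
    """Build an S3 URI such as s3://bucket/prefix/ (illustrative helper)."""
    return f"s3://{bucket}/{prefix.strip('/')}/"


def roundtrip_parquet(df, bucket: str, prefix: str):
    """Write df to S3 as a Parquet dataset, then read it back.

    awswrangler is imported lazily so that s3_uri can be used without
    the library installed or AWS credentials configured.
    """
    import awswrangler as wr  # pip install awswrangler

    path = s3_uri(bucket, prefix)
    # dataset=True writes a (possibly multi-file) Parquet dataset under path.
    wr.s3.to_parquet(df=df, path=path, dataset=True)
    # Read every Parquet file under the same prefix into one DataFrame.
    return wr.s3.read_parquet(path=path, dataset=True)


if __name__ == "__main__":
    # "my-example-bucket" is a placeholder, not a real bucket from the talk.
    print(s3_uri("my-example-bucket", "/reviews/"))
```

Only `s3_uri` runs without AWS access; `roundtrip_parquet` is meant to be called from one of the environments listed on slide 14.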
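
Slides 22 to 24 compare a group-by-and-count in Ray Datasets, Modin, and Apache Spark, but the code on those slides did not survive in this transcript. As a stand-in, not a reconstruction, the sketch below shows the operation itself on made-up rows: once in plain Python, and once through the Pandas API, where switching the import to `modin.pandas` is assumed to be the only change needed to run the same call on Modin. The sample data, column names, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows; the dataset used in the talk is not shown here.
REVIEWS = [
    {"product_category": "Books", "star_rating": 5},
    {"product_category": "Toys", "star_rating": 4},
    {"product_category": "Books", "star_rating": 3},
    {"product_category": "Music", "star_rating": 5},
    {"product_category": "Toys", "star_rating": 2},
    {"product_category": "Books", "star_rating": 4},
]


def count_by_category(rows, key="product_category"):
    """Pure-Python baseline: group rows by `key` and count each group."""
    return dict(Counter(row[key] for row in rows))


def count_by_category_dataframe(rows, key="product_category", use_modin=False):
    """The same group-by-and-count through the Pandas API.

    With use_modin=True the identical call goes through Modin instead of
    Pandas. Imports are deferred so the baseline above works without
    either library installed.
    """
    if use_modin:
        import modin.pandas as pd  # pip install "modin[ray]"
    else:
        import pandas as pd

    df = pd.DataFrame(rows)
    # size() counts the rows in each group; to_dict() maps group -> count.
    return df.groupby(key).size().to_dict()


if __name__ == "__main__":
    print(count_by_category(REVIEWS))
```

Keeping the DataFrame code behind a single import switch mirrors the point of slide 27: the calling code stays Pandas-shaped while the engine underneath changes.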
