Slide 1
High Performance Python on Apache Spark
Wes McKinney @wesmckinn
Spark Summit West -- June 7, 2016
© Cloudera, Inc. All rights reserved.

Slide 2: Me
•  Data Science Tools at Cloudera
•  Serial creator of structured data tools / user interfaces
•  Wrote bestseller Python for Data Analysis (2012)
   •  Working on an expanded and revised 2nd edition, coming 2017
•  Open source projects
   •  Python {pandas, Ibis, statsmodels}
   •  Apache {Arrow, Parquet, Kudu (incubating)}
•  Focused on C++, Python, and hybrid projects

Slide 3: Agenda
•  Why care about Python?
•  What does "high performance Python" even mean?
•  A modern approach to Python data software
•  Spark and Python: performance analysis and development directions

Slide 4: Why care about (C)Python?
•  Accessible, "swiss army knife" programming language
•  Highly productive for software engineering and data science alike
•  Has excelled as the agile "orchestration" or "glue" layer for application business logic
•  Easy to interface with C / C++ / Fortran code; well-designed Python C API

Slide 5: Defining "High Performance Python"
•  The end-user workflow involves primarily Python programming; programs can be invoked with "python app_entry_point.py ..."
•  The software uses system resources within an acceptable factor of an equivalent program developed completely in Java or C++
   •  Preferably 1-5x slower, not 20-50x
•  The software is suitable for interactive / exploratory computing on modestly large data sets (= gigabytes) on a single node

Slide 6
Building fast Python software means embracing certain limitations

Slide 7: Having a healthy relationship with the interpreter
•  The Python interpreter itself is "slow", as compared with hand-coded C or Java
   •  Each line of Python code may feature multiple internal C API calls, temporary data structures, etc.
•  Python built-in data structures (numbers, strings, tuples, lists, dicts, etc.) have significant memory and performance overhead (see the sketch below)
•  Threads performing concurrent CPU or IO work must take care not to block other threads
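The per-object overhead in the second bullet is easy to see directly. The following is a minimal sketch (not from the original slides) comparing the memory footprint of one million integers stored as a Python list of boxed objects versus a packed NumPy array; the exact byte counts depend on the CPython version.

```python
import sys
import numpy as np

n = 1000000
py_list = list(range(n))                  # one boxed PyLongObject per element, plus an array of pointers
np_array = np.arange(n, dtype=np.int64)   # one contiguous buffer of 8-byte integers

# The list stores pointers; each int object carries its own header (~28 bytes each on CPython 3.x)
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_array.nbytes             # data buffer only; the ndarray header adds ~100 bytes

print("list of ints : ~%.1f MB" % (list_bytes / 1e6))
print("int64 ndarray: ~%.1f MB" % (array_bytes / 1e6))
```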
  
Slide 8: Mantras for great success
•  Key question 1: Am I making the Python interpreter do a lot of work?
•  Key question 2: Am I blocking other interpreted code from executing?
•  Key question 3: Am I handling data (memory) in a "good" way?

Slide 9: Toy example: interpreted vs. compiled code

Slide 10: Toy example: interpreted vs. compiled code
•  Cython: 78x faster than interpreted

Slide 11: Toy example: interpreted vs. compiled code
•  NumPy: creating a full 80 MB temporary array + PyArray_Sum is only 35% slower than a fully inlined Cython (C) function
•  Interesting: ndarray.sum by itself is almost 2x faster than the hand-coded Cython function...
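The benchmark code itself appears only as a screenshot in the original deck; the snippet below is a hedged reconstruction of the same kind of experiment, summing roughly 80 MB of float64 values with an interpreted loop versus ndarray.sum. The Cython variant the slides measure would sit between these two timings and is not reproduced here.

```python
import time
import numpy as np

values = np.random.randn(10000000)        # 10M float64 values, ~80 MB
as_list = values.tolist()                 # the same data as boxed Python floats

def interpreted_sum(xs):
    # every iteration round-trips through the interpreter and a boxed float
    total = 0.0
    for x in xs:
        total += x
    return total

t0 = time.time()
interpreted_sum(as_list)
t_py = time.time() - t0

t0 = time.time()
values.sum()                              # a single C loop inside NumPy
t_np = time.time() - t0

print("interpreted loop: %.3fs   ndarray.sum: %.4fs" % (t_py, t_np))
```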
Slide 12
Submarines and Icebergs: metaphors for fast Python software

Slide 13: Suboptimal control flow
(diagram) Data arrives from elsewhere, is deserialized into Python data structures, flows through pure Python computation and back into Python data structures, possibly several times over, and is finally serialized back out as data for elsewhere. Time for a coffee break...

Slide 14: Better control flow
(diagram) Python app logic calls a C function, extension code (C / C++) works directly on native data, and another C function hands control back to Python app logic. Users only see the thin Python layer; the native layer is where the work happens. Zoom zoom! (if the extension code is good)

Slide 15: But it's much easier to write 100% Python!
•  Building hybrid C/C++ and Python systems adds a lot of complexity to the engineering process
   •  (but it's often worth it)
   •  See: Cython, SWIG, Boost.Python, Pybind11, and other "hybrid" software creation tools
•  BONUS: Python programs can orchestrate multi-threaded / concurrent systems written in C/C++ (no Python C API needed)
   •  The GIL only comes in when you need to "bubble up" data or control flow (e.g. Python callbacks) into the Python interpreter (a Cython sketch follows below)
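To make the GIL point concrete, here is a minimal Cython sketch (an assumption added for illustration, not code from the talk) of an extension function that releases the GIL while looping over a typed memoryview, so other Python threads can keep running; only the entry and exit of the function touch the interpreter. The module name is hypothetical.

```cython
# cython: boundscheck=False, wraparound=False
# Hypothetical module fastsum.pyx, built with cythonize()

def sum_doubles(double[:] values):
    """Sum a float64 buffer without holding the GIL in the hot loop."""
    cdef double total = 0.0
    cdef Py_ssize_t i
    cdef Py_ssize_t n = values.shape[0]
    with nogil:                    # no Python C API calls are allowed in this block
        for i in range(n):
            total += values[i]
    return total
```

From Python this would be called as, say, `fastsum.sum_doubles(np.asarray(data, dtype="float64"))`, assuming the module name above.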
  
Slide 16: A story of reading a CSV file

What the user writes:
    f = get_stream(...)
    df = pandas.read_csv(f, **csv_options)

Internally (pseudocode):
    while more_data():
        buffer = f.read()
        parse_bytes(buffer)
    df = type_infer_columns()

Concerns:
•  Uses PyString_FromStringAndSize, must hold the GIL for this
•  Synchronous or asynchronous with IO?
•  Type infer in parallel?
•  Data structures used?
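One user-side mitigation for the type-inference concern: if the schema is known, passing explicit dtypes to pandas.read_csv skips the inference pass over every parsed chunk. A small usage sketch; the file and column names are hypothetical.

```python
import pandas as pd

csv_options = {
    "usecols": ["id", "price", "symbol"],                             # parse only the needed columns
    "dtype": {"id": "int64", "price": "float64", "symbol": "object"}, # no type inference required
}
df = pd.read_csv("trades.csv", **csv_options)                         # "trades.csv" is hypothetical
```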
Slide 17: It's All About the Benjamins (Data Structures)
•  The hard currency of data software is: in-memory data structures
   •  How costly are they to send and receive?
   •  How costly to manipulate and munge in-memory?
   •  How difficult is it to add new proprietary computation logic?
•  In Python: NumPy established a gold standard for interoperable array data (see the sketch below)
   •  pandas is built on NumPy, and made it easy to "plug in" to the ecosystem
   •  (but there are plenty of warts still)
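A minimal illustration (not from the slides) of what that interoperability means in practice: a numeric pandas column is backed by a NumPy array, so compiled extensions that understand NumPy's buffer layout can consume the data without converting it to Python objects. Whether `.values` is a zero-copy view can depend on the pandas version and dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(5, dtype="float64")})
arr = df["x"].values                 # the column's underlying ndarray
print(type(arr), arr.dtype)          # <class 'numpy.ndarray'> float64

# Any C / C++ / Fortran extension that speaks NumPy's buffer protocol can now
# operate on `arr` directly, instead of iterating over boxed Python floats.
```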
  
Slide 18: What's this have to do with Spark?
•  Some known performance issues in PySpark:
   •  IO throughput
      •  Python to Spark
      •  Spark to Python (or Python extension code)
   •  Running interpreted Python code on RDDs / Spark DataFrames
      •  Lambda mappers / reducers (rdd.map(...))
      •  Spark SQL UDFs (registerFunction(...)) -- examples of both patterns below
	
  
Slide 19: Spark IO throughput to/from Python
•  1.15 MB/s in, 9.82 MB/s out
•  Spark 1.6.1 running on localhost
•  76 MB pandas.DataFrame
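A hedged sketch of how throughput numbers like these can be measured (an assumption, not the exact benchmark code; it reuses the `sqlContext` from the sketch above or a pyspark shell): round-trip a pandas.DataFrame of known size through Spark and divide bytes by wall-clock time.

```python
import time
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"x": np.random.randn(10000000)})   # ~80 MB of float64 data
mb = pdf.memory_usage(index=False).sum() / 1e6

t0 = time.time()
sdf = sqlContext.createDataFrame(pdf)                  # Python -> Spark ("in")
sdf.count()                                            # force the data to be materialized
t_in = time.time() - t0

t0 = time.time()
back = sdf.toPandas()                                  # Spark -> Python ("out")
t_out = time.time() - t0

print("in : %.2f MB/s" % (mb / t_in))
print("out: %.2f MB/s" % (mb / t_out))
```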
Slide 20: Spark IO throughput to/from Python
•  Unofficial improved toPandas: 25.6 MB/s out

Slide 21: Compared with HiveServer2 Thrift RPC fetch
•  Impala 2.5 + Parquet file on localhost
•  ibis + impyla: 41.46 MB/s read
•  hs2client (C++ / Python): 90.8 MB/s
•  Task benchmarked: Thrift TFetchResultsReq + deserialization + conversion to pandas.DataFrame

Slide 22: Back-of-envelope comparison with file formats
•  Feather: 1105 MB/s write; CSV (pandas): 6.2 MB/s write
•  Feather: 2414 MB/s read; CSV (pandas): 51.9 MB/s read
•  Disclaimer: warm NVMe / OS file cache
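A sketch of how such a back-of-envelope comparison can be run (an assumption, not the exact benchmark; the feather-format package of that era exposed write_dataframe and read_dataframe):

```python
import time
import feather            # pip install feather-format
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10),
                  columns=["c%d" % i for i in range(10)])        # ~80 MB of float64 data
mb = df.values.nbytes / 1e6

t0 = time.time(); feather.write_dataframe(df, "bench.feather"); t_fw = time.time() - t0
t0 = time.time(); df.to_csv("bench.csv", index=False);          t_cw = time.time() - t0
t0 = time.time(); feather.read_dataframe("bench.feather");      t_fr = time.time() - t0
t0 = time.time(); pd.read_csv("bench.csv");                     t_cr = time.time() - t0

print("write  Feather: %.0f MB/s   CSV: %.1f MB/s" % (mb / t_fw, mb / t_cw))
print("read   Feather: %.0f MB/s   CSV: %.1f MB/s" % (mb / t_fr, mb / t_cr))
```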
Slide 23: Aside: CSVs can be fast
•  See: https://github.com/wiseio/paratext

Slide 24: How Python lambdas work in PySpark
(diagram) The Spark RDD ships a data stream plus a pickled PyFunction to a pool of Python worker processes.
•  See: spark/api/python/PythonRDD.scala and python/pyspark/worker.py
•  The inner loop of RDD.map: map(f, iterator)
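Conceptually, the worker side of that diagram boils down to something like the sketch below. This is a loose illustration of the idea only (an assumption): the real python/pyspark/worker.py uses Spark's framed serializers and batching rather than bare pickle calls.

```python
import pickle

def run_worker(infile, outfile):
    """Loose sketch of a PySpark worker loop: unpickle the shipped function
    once, then apply it to a stream of deserialized records."""
    f = pickle.load(infile)                     # the pickled PyFunction from the JVM

    def records(stream):
        while True:
            try:
                yield pickle.load(stream)       # one record at a time off the socket
            except EOFError:
                return

    for result in map(f, records(infile)):      # the map(f, iterator) inner loop
        pickle.dump(result, outfile)            # serialize each result back to the JVM
```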
  
Slide 25: How Python lambdas perform
•  NumPy array-oriented operations are about 100x faster... but that's not the whole story
•  Disclaimer: this isn't a remotely "fair" comparison, but it helps illustrate the real pitfalls associated with introducing serialization and RPC/IPC into a computational process

Slide 26: How Python lambdas perform
•  (chart comparing 8 cores vs. 1 core)
•  Lessons learned: Python data analytics should not be based around scalar object iteration

Slide 27: Asides / counterpoints
•  Spark <-> Python IO may not be important -- can leave all of the data remote
•  Spark DataFrame operations have reduced the need for many types of lambda functions
•  Can use binary file formats as an alternate IO interface
   •  Parquet (Python support soon via apache/parquet-cpp)
   •  Avro (see cavro, fastavro, pyavroc)
   •  ORC (needs a Python champion)
   •  ...

Slide 28: Apache Arrow
•  http://arrow.apache.org
•  Some slides from the Strata-HW talk with Jacques Nadeau

Slide 29: Apache Arrow in a Slide
•  New top-level Apache Software Foundation project
   •  http://arrow.apache.org
•  Focused on columnar in-memory analytics:
   1.  10-100x speedup on many workloads
   2.  Common data layer enables companies to choose best-of-breed systems
   3.  Designed to work with any programming language
   4.  Support for both relational and complex data as-is
•  Oriented at collaboration amongst other OSS projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Slide 30: High Performance Sharing & Interchange
Today:
•  Each system has its own internal memory format
•  70-80% CPU wasted on serialization and deserialization
•  Similar functionality implemented in multiple projects
With Arrow:
•  All systems utilize the same memory format
•  No overhead for cross-system communication
•  Projects can share functionality (e.g., a Parquet-to-Arrow reader)

Slide 31: Arrow and PySpark
•  Build a C API level data protocol to move data between Spark and Python
•  Either:
   •  (Fast) Convert Arrow to/from pandas.DataFrame (see the sketch below)
   •  (Faster) Perform native analytics on Arrow data in-memory
•  Use Arrow:
   •  For efficiently handling nested Spark SQL data in-memory
   •  IO: pandas/NumPy data push/pull
   •  Lambda/UDF evaluation
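The "(Fast) Convert Arrow to/from pandas.DataFrame" path looks roughly like the sketch below, using the pyarrow API as it later stabilized (an assumption: these exact calls postdate this 2016 talk). The stream-writer part shows the shape of an IO interface a JVM process and Python workers could share.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"ints": np.arange(5), "strs": list("abcde")})

table = pa.Table.from_pandas(df)      # pandas -> Arrow columnar memory
roundtrip = table.to_pandas()         # Arrow -> pandas, column at a time

# Arrow record batches can be written to a byte stream, which is the kind of
# wire format a Spark executor could hand to a Python worker.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
serialized = sink.getvalue()          # an Arrow buffer ready to ship
```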
  
	
  
Slide 32: Arrow in action: Feather File Format for Python and R
•  Problem: a fast, language-agnostic binary data frame file format
•  Creators: Wes McKinney (Python) and Hadley Wickham (R)
•  Read speeds close to disk IO performance

Slide 33: More on Feather
(diagram) A Feather file is a sequence of arrays (array 0 ... array n - 1) followed by a METADATA block. The libfeather C++ library is wrapped by Rcpp on the R side (producing an R data.frame) and by Cython on the Python side (producing a pandas DataFrame).

Slide 34: Summary
•  It's essential to improve Spark's low-level data interoperability with the Python data ecosystem
•  I'm personally excited to work with the Spark + Arrow + PyData + other communities to help make this a reality

Slide 35: Thank you
Wes McKinney @wesmckinn
Views are my own
