
Data Science Languages and Industry Analytics

September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).

  1. Data Science Languages and Industry Analytics
     Wes McKinney, BIDS 2015-09-19
     © Cloudera, Inc. All rights reserved.
  2. Me
     • Serial creator of structured data tools / user interfaces
     • Mathematician, MIT '07
     • Professional SQL programmer 2007-2010 (@ AQR)
     • Created pandas, April 2008
     • Wrote Python for Data Analysis 2012
     • Founder of DataPad -> Cloudera
  3. A sample big data architecture
     [Diagram; labeled components: Application data, Kafka (x4), S3 or HDFS (JSON), Spark/MapReduce, Columnar storage, Analytic SQL Engine, User (SQL)]
  4. Big data architectures currently dominated by Java / JVM languages
     Python/R/Julia don't have much of a "seat at the table"
  5. Some simplistic generalizations
     Industry Analytics:
     • Heterogeneous data
        • Flat tables and JSON
     • Spark / MapReduce
     • SQL
     • DFS-friendly / streaming data formats
     • More physical machines
     Scientific Computing:
     • Homogeneous data
        • Multidimensional arrays
     • HPC tools
     • Linear algebra
     • Scientific data formats
     • Fewer physical machines
  6. Many interactive-speed SQL engines
     [Engine logos not captured in the transcript] … and more
  7. Ibis: not the direct subject of this talk
     • http://blog.ibis-project.org
     • Crafting a compelling Python-on-Hadoop user experience
        • Remove SQL-programming from user workflows
        • Develop high performance Python extension APIs
     • Pythonic composable DSL designed to target SQL semantics
     • Development roadmap targets Impala (C++ / LLVM) query engine
        • … but SQL compiler toolchain works well with other SQL dialects
  8. Enabling interoperability with big data systems
     • Distributed / MPP query engines: implemented in a host language
        • Typically C++, Java, or Scala
     • User-defined functions (UDFs) through various means
        • Implement in host language
        • Implement in user language through some external language protocol
     • External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  9. What are UDFs good for?
     • Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
     • Custom data transformations
     • Custom domain logic (date / time / data types)
     • Custom data types
     • Custom aggregations (incl. machine learning / statistics expressible as reductions)
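As an illustration of a statistic "expressible as a reduction", the sketch below computes a sample variance from per-partition partial states that are then merged. It uses Welford's update and the standard pairwise merge formula. The names `VarState`, `update`, `merge`, `finalize`, and `aggregate` are hypothetical, chosen for this example, and do not correspond to any particular engine's UDF API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class VarState:
    """Partial aggregation state: count, running mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def update(state: VarState, x: float) -> VarState:
    """Fold one value into a partial state (Welford's update)."""
    n = state.n + 1
    delta = x - state.mean
    mean = state.mean + delta / n
    m2 = state.m2 + delta * (x - mean)
    return VarState(n, mean, m2)


def merge(a: VarState, b: VarState) -> VarState:
    """Combine two partial states, e.g. from two partitions or workers."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return VarState(n, mean, m2)


def finalize(state: VarState) -> float:
    """Turn the merged state into the sample variance."""
    return state.m2 / (state.n - 1) if state.n > 1 else float("nan")


def aggregate(partitions) -> float:
    """Reduce each partition independently, then merge the partial states."""
    partials = [reduce(update, part, VarState()) for part in partitions]
    return finalize(reduce(merge, partials, VarState()))


result = aggregate([[1.0, 2.0], [3.0, 4.0, 5.0]])
```

Because `merge` is associative, the partial states can be combined in any grouping, which is what lets a distributed engine run such an aggregation across many machines.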
  10. Why are external UDFs slow?
      • Serialization / deserialization overhead
      • Scalar vs vectorized computations
      • RPC overhead
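The three costs listed above can be made concrete with a toy model. In the sketch below, `pickle` stands in for whatever wire format a real external-UDF protocol would use, and no actual process boundary is crossed; `add_one`, `call_scalar_udf`, and `call_vectorized_udf` are hypothetical names for illustration only. The point is simply to count how many serialize/deserialize round trips each calling convention needs.

```python
import pickle


def add_one(x):
    """The user's function, written in the 'user language'."""
    return x + 1


def call_scalar_udf(values):
    """Scalar protocol: one simulated round trip per row."""
    out = []
    round_trips = 0
    for v in values:
        request = pickle.dumps(v)                 # engine serializes one value
        result = add_one(pickle.loads(request))   # UDF side deserializes and computes
        response = pickle.dumps(result)           # UDF side serializes the result
        out.append(pickle.loads(response))        # engine deserializes it
        round_trips += 1
    return out, round_trips


def call_vectorized_udf(values):
    """Vectorized protocol: one simulated round trip for the whole batch."""
    request = pickle.dumps(values)
    batch = pickle.loads(request)
    response = pickle.dumps([add_one(v) for v in batch])
    return pickle.loads(response), 1


column = list(range(1000))
scalar_out, scalar_trips = call_scalar_udf(column)
vector_out, vector_trips = call_vectorized_udf(column)
```

Both calls produce the same output, but the scalar path pays the fixed per-call cost once per row, while the vectorized path pays it once per batch.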
  11. How to make them fast?
      • Common runtime memory representation for tabular data
      • Shared-memory (zero-copy or memcpy-only) external UDF protocol
      • Vectorized UDF interface (for interpreted languages)
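A rough single-process stand-in for the zero-copy idea: the "engine" owns a typed column buffer and hands the UDF a `memoryview` over it, so the UDF reads the input and writes the output in place without any serialization. This is only a sketch. A real protocol would share memory across processes or languages, and `vectorized_double` is a hypothetical UDF name.

```python
from array import array


def vectorized_double(col_in: memoryview, col_out: memoryview) -> None:
    """Vectorized UDF: processes a whole column and writes into a caller-owned buffer."""
    for i in range(len(col_in)):
        col_out[i] = col_in[i] * 2


# Columns owned by the "engine": contiguous 64-bit signed integers.
engine_column = array("q", [1, 2, 3, 4])
result_column = array("q", [0]) * len(engine_column)

# memoryview exposes the arrays' memory through the buffer protocol without copying.
vectorized_double(memoryview(engine_column), memoryview(result_column))

# Writes through a view are visible to the owner, showing that nothing was copied.
view = memoryview(engine_column)
view[0] = 10
first_value_seen_by_engine = engine_column[0]
```

The explicit loop is still interpreted Python; the design point is the calling convention, where the buffers cross the boundary by reference and a compiled kernel could operate on the same memory.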
  12. Memory representation
      • Many query engines are standardizing on in-memory columnar representation of materialized transient data
         • Apache Drill: https://drill.apache.org/faq/
         • Spark
         • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
      • Industry-standard serialization format: Apache Parquet
         • https://parquet.apache.org/
  13. Serialization vs in-memory
      • Serialization formats (e.g. Parquet)
         • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
         • Do not consider random access or in-memory analytics as a goal
      • No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
  14. One possible proposal
      • Standardize on an augmented variant of the Apache Drill in-memory columnar memory layout
         • https://drill.apache.org/docs/value-vectors/
      • Common / shared C implementation for R/Python/Julia
         • Currently all languages have poor support for JSON-like data
         • Make your needs known!
         • Enumerate required data types and other requirements
  15. More on the Drill layout
      persons = [
        {
          name: 'wes',
          addresses: [
             {number: 2, street: 'a'},
             {number: 3, street: 'bb'},
          ]
        },
        {
          name: 'mark',
          addresses: [
             {number: 4, street: 'ccc'},
             {number: 5, street: 'dddd'},
             {number: 6, street: 'f'},
          ]
        },
  16. Strings in Drill
      person.name
        offset: 0 3
        w e s m a r k
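The string layout on this slide can be sketched in a few lines of Python: all string bytes are concatenated into one buffer, and an offsets array records where each value starts. This sketch assumes the common convention of storing N+1 offsets, with a trailing entry marking the end of the last value (the transcript only shows the start offsets 0 and 3). `encode_strings` and `get_string` are hypothetical helper names, not Drill APIs.

```python
def encode_strings(values):
    """Pack strings into (offsets, data): one contiguous byte buffer plus N+1 offsets."""
    offsets = [0]
    data = bytearray()
    for s in values:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this value, which is the start of the next
    return offsets, bytes(data)


def get_string(offsets, data, i):
    """Random access to the i-th string using two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


name_offsets, name_data = encode_strings(["wes", "mark"])
second_name = get_string(name_offsets, name_data, 1)
```

With this layout, reading value `i` needs only two offset lookups and one slice, with no per-string objects or pointers.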
  17. Array<Struct> example
      person.addresses
        offset: 0 2 5
      person.addresses.street
        offset: 0 1 3 6 10
        a b b c c c d d d d f
      person.addresses.number
        2 3 4 5 6
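The following sketch "shreds" the `persons` example into the columns shown on this slide: a list-offsets array for `addresses`, a flat `number` column, and a string column (offsets plus bytes) for `street`. It assumes N+1 offsets with a trailing end marker, so the street offsets it produces include a final 11 that the transcript does not show. `shred_addresses` and `addresses_of` are hypothetical names for illustration.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]


def shred_addresses(records):
    """Flatten Array<Struct{number, street}> into offset and value columns."""
    list_offsets = [0]       # where each person's addresses start in the child columns
    numbers = []             # person.addresses.number, flattened
    street_offsets = [0]     # person.addresses.street string offsets
    street_data = bytearray()
    for person in records:
        for addr in person["addresses"]:
            numbers.append(addr["number"])
            street_data += addr["street"].encode("utf-8")
            street_offsets.append(len(street_data))
        list_offsets.append(len(numbers))
    return list_offsets, numbers, street_offsets, bytes(street_data)


def addresses_of(i, list_offsets, numbers, street_offsets, street_data):
    """Rebuild the i-th person's address list from the columns."""
    return [
        {"number": numbers[j],
         "street": street_data[street_offsets[j]:street_offsets[j + 1]].decode("utf-8")}
        for j in range(list_offsets[i], list_offsets[i + 1])
    ]


columns = shred_addresses(persons)
mark_addresses = addresses_of(1, *columns)
```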
  18. Array<Array<Int32>> example
      persons = [
        {
          name: 'wes',
          fav_sequences: [
            [0, 1, 2],
            [2, 3]
          ]
        },
        {
          name: 'mark',
          fav_sequences: [
            [3],
            [4, 5],
            [6, 7]
          ]
        },
      person.fav_sequences
        offset: 0 2 5
      person.fav_sequences/values
        offset: 0 3 5 6 8
        0 1 2 2 3 3 4 5 6 7
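Nested lists follow the same pattern with two levels of offsets: an outer array that indexes into the inner offsets, and an inner array that indexes into one flat values buffer. The sketch below reproduces the numbers on this slide, again assuming N+1 offsets with a trailing end marker (so the inner offsets end in 10, which the transcript omits). `shred_nested` and `rebuild` are hypothetical names, and `array("i")` is used as a stand-in for an Int32 buffer on typical platforms.

```python
from array import array


def shred_nested(rows):
    """Flatten Array<Array<Int32>> into (outer_offsets, inner_offsets, values)."""
    outer_offsets = [0]   # per person: range of inner lists
    inner_offsets = [0]   # per inner list: range of values
    values = array("i")   # flat buffer of C ints
    for sequences in rows:
        for seq in sequences:
            values.extend(seq)
            inner_offsets.append(len(values))
        outer_offsets.append(len(inner_offsets) - 1)
    return outer_offsets, inner_offsets, values


def rebuild(outer_offsets, inner_offsets, values, i):
    """Reconstruct the i-th person's list of sequences from the three buffers."""
    return [
        list(values[inner_offsets[j]:inner_offsets[j + 1]])
        for j in range(outer_offsets[i], outer_offsets[i + 1])
    ]


fav_sequences = [
    [[0, 1, 2], [2, 3]],       # wes
    [[3], [4, 5], [6, 7]],     # mark
]
outer, inner, flat = shred_nested(fav_sequences)
mark_sequences = rebuild(outer, inner, flat, 1)
```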
  19. Thank you
      Wes McKinney @wesmckinn
      Views are my own
