On June 11, Thomas Dinsmore gave a useful overview of the tools and technologies available for analytics in Hadoop. It is a must-watch for anyone interested in what advanced analytics in Hadoop can deliver.
Please find video and slides below.
Synopsis
What is the state of play for advanced analytics in Hadoop? A year ago, options included "roll your own" and little else; today there are a number of serious open source and commercial options available, with new capabilities announced daily.
In this presentation, we begin with a brief overview of use cases for advanced analytics and a discussion of what types of analytics must run in Hadoop. We continue with an overview of available architectures. The presentation concludes with a hype-free survey of available open source and commercial software for advanced analytics in Hadoop.
Bio
Thomas W. Dinsmore is Director of Product Management for Revolution Analytics, a company that provides commercial support and services for open source R. In this role, Mr. Dinsmore closely tracks the market for commercial and open source software on all platforms, including Hadoop. Prior to joining Revolution Analytics, Mr. Dinsmore served as an Analytics Solution Architect for IBM Big Data, and as a Principal Consultant for Razorfish and SAS.
Mr. Dinsmore has hands-on experience with leading commercial and open source tools for advanced analytics, including SAS, SPSS, R and Oracle Data Mining, across a range of platforms including Hadoop, Netezza, Teradata and Oracle. He is certified in SAS 9.
In his career, Mr. Dinsmore has worked with more than 500 enterprises in the United States, Canada, Mexico, Venezuela, Chile, Brazil, the United Kingdom, Belgium, Italy, Turkey, Israel, Malaysia and Singapore.
19. Apache Mahout
• Apache incubator project (2007)
• Machine learning library
• Included in most distributions
• Thin acceptance, few contributors
• Diverse architecture
  • Single-node
  • MapReduce
  • New algos run on Spark
• Recently cleaned up
20. Apache Giraph
• Apache top-level project
• Runs in MapReduce
• Dedicated graph engine
• Used by Facebook, few others
• Dead in the water
  • No presence in leading distros
  • No significant commercial support
  • No releases in 13 months
  • No recent code commits on Git
33. Summary: Open Source
• Giraph is toast
• Mahout may be recovering from roadkill status
• GraphLab outperforms Spark GraphX today in graph analytics
• 0xdata H2O outperforms Spark MLlib today in machine learning
• Spark catching up fast
  • More resources and distribution
  • Integrated platform for ML and graph analysis
35. Alpine
• Business user interface
• Collaboration environment
• Broad library of techniques
• Strong cloud offering
• Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
• Push-down MapReduce
• Certified on Spark
• Small but growing customer base
36. IBM SPSS Analytics Server
• Introduced 2013
• Serves as “back end” for SPSS Modeler
• Uses push-down MR
• Limited analytic feature set
• IBM supports on multiple Hadoop distros
• Customer acceptance unknown
37. Revolution Analytics ScaleR
• ScaleR library of distributed statistics, machine learning functions
• Tools to distribute arbitrary R functions
• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
• Hadoop edition uses MR push-down
• Tools simplify installation in large clusters
• R interface
• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
38. Skytree Server
• Georgia Tech’s FastLab project, repurposed as commercial software
• Distributed machine learning platform
• Very opaque about technical details
• User interface is an API
• Co-located in Hadoop under YARN
• Just certified by Hortonworks
• Customer acceptance unknown
• No new public references in a year
• Used by leading credit card company
39. SAS High-Performance Analytics
• Distributed in-memory analytics
• Designed to run in special-purpose appliances (2011)
• Repurposed to run in Hadoop (2013)
• Co-exists poorly: cannot run SAS and MapReduce at the same time
• Reads entire dataset into memory
• Uses MPI to communicate among nodes
• Requires upgrades from standard Hadoop infrastructure
• Customer acceptance unknown
• No public references
• Generic success stories missing from Strata presos
40. SAS LASR Server
• SAS’ “other” distributed in-memory platform
• Back end for several end-user products
  • SAS Visual Analytics (2012)
  • SAS Visual Statistics (New)
  • SAS In-Memory Statistics for Hadoop (New)
• Recently added statistics and machine learning
• Does not read raw HDFS; must be transformed to proprietary SASHDAT
• Like HPA, reads entire dataset into memory
  • 16-core, 256 GB node can load a 75 GB table
• Runs DS2 programs, not legacy SAS programs
• Fast, but with limited feature set
• SAS claims 1,400 “sites” for Visual Analytics
  • Many of those are standalone boxes
41. Summary: Commercial
• Alpine’s interface is compelling to business users
• IBM Analytics Server is a good first release
• RRE ScaleR appeals to R users, plays well in Hadoop sandbox
• Skytree Server: strong in prediction
• SAS: why two competing memory-centric architectures?
42. Progress
• Spark: blindingly fast maturity
• Rapidly expanding library of analytic features
• Growing developer community, ecosystem
• Commercial: from zero to many
43. Interesting Questions
• Will Mahout get a second wind?
• Will Spark MLlib displace 0xdata?
• Will Spark GraphX catch up to GraphLab?
• Can Spark Streaming compete with Storm and commercial entrants?
• How quickly will customers adopt memory-centric architecture for analytics?
• What will Alpine and MicroStrategy do with Spark?
• Will IBM distribute Spark in BigInsights?
• When will SAS announce a reference customer for HPA/LASR in Hadoop?