A bit like Haiku:
Limited expressivity
But can be used to approach diverse problem domains
* MapReduce's general-purpose design makes it difficult to optimize performance for individual systems
Google has used both approaches extensively in-house, and the future will include both
MPI significantly predates MapReduce
Special-purpose systems that solve one problem domain well.
Ex: Giraph / GraphLab (graph processing), Impala (interactive SQL)
Generalize the capabilities of MapReduce to provide a richer foundation to solve problems.
Ex: MPI, Hama (BSP), Dryad (arbitrary DAGs)
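To make the BSP model concrete (the model Hama exposes), here is a minimal single-process sketch in Python; the function name and graph are illustrative, not Hama's actual API. In each superstep every vertex computes from its incoming messages, sends messages to its neighbors, and a barrier synchronizes all workers before the next superstep:

```python
# Minimal single-process sketch of Bulk Synchronous Parallel (BSP) --
# illustrative only, not the Hama API. In each superstep every vertex
# computes from its inbox and sends messages to its neighbors; the
# barrier between supersteps is modeled by swapping message buffers.
def bsp_max_value(neighbors, values):
    values = dict(values)
    # Superstep 0: every vertex broadcasts its own value to its neighbors.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbors[v]:
            inbox[n].append(values[v])
    # Later supersteps: adopt the max seen so far; re-broadcast on change.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbors[v]:
                    outbox[n].append(values[v])
        inbox = outbox  # barrier: next superstep reads this round's messages
    return values

# A chain 0-1-2: the largest value (7, at vertex 2) reaches every vertex.
chain = {0: [1], 1: [0, 2], 2: [1]}
result = bsp_max_value(chain, {0: 3, 1: 1, 2: 7})
```

Iterative graph algorithms (PageRank, shortest paths) fit this superstep pattern naturally, which is why BSP systems handle them better than chains of MapReduce jobs.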
Interactive exploration of data for data scientists – no need to develop “applications”
Developers can prototype application on live system as they build application
Cloudera Spark Committers: 4
Intel (our close partner): 1
Hortonworks: 1
IBM: 0
MapR: 0
8 full-time engineers working on Spark
Contributed over 370 patches and 43,000 lines of code to Spark
Compare to Hortonworks (4), IBM (12), and MapR (1)
Spark is well suited to iterative workloads such as training ML models, and it is fast becoming good at general-purpose computational workloads, with more integrations coming down the road with frameworks like HBase, Solr, etc.
MapReduce is suited for I/O intensive workloads where a high level of fault tolerance and scale is required.
As Spark matures, it is slowly eating into MapReduce workloads.
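For contrast, the MapReduce programming model can be sketched in a few lines of plain Python (an illustrative single-process toy, not the Hadoop API). Each round maps records to key/value pairs, shuffles by key, and reduces each group; on a real cluster every round also writes its output back to distributed storage, which is what makes iterative algorithms expensive compared to Spark's in-memory caching:

```python
from collections import defaultdict

# Toy single-process sketch of one MapReduce round -- illustrative only,
# not the Hadoop API. Map each record, shuffle by key, reduce each group.
# On a cluster, each round also persists its output to distributed storage,
# so an iterative algorithm pays that I/O cost between every round.
def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                # map phase
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, vals)       # reduce phase (after the shuffle)
            for key, vals in groups.items()}

# Classic word count as a single round.
docs = ["spark and mapreduce", "spark streaming"]
counts = map_reduce(
    docs,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
```

A one-pass job like word count fits this model perfectly; an ML training loop that runs dozens of such rounds over the same input is where Spark's ability to cache data in memory pays off.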
Problems:
Management
Spark is difficult to deploy and operate in production, especially at scale. Good management tooling is critical.
Security
Strong security is a must for any enterprise tool. Spark needs fine-grained access controls, data access tracking and reporting, and data privacy.
Scale
Spark outperforms MapReduce on latency and throughput, but can’t compete on scale. Spark needs to work on the largest multi-tenant clusters.
Streaming
Streaming workloads are increasing and Spark needs to be able to support all of the common stream processing use cases with production-grade guarantees.
What’s Required:
Management
Resource Management (YARN integration); Automation; Metrics and visibility; Accessibility
Security
Authentication; Authorization; Governance Visibility (Audit/Lineage); Data Protection (Encryption)
Scale
Handle jobs with thousands of executors each, running simultaneously on large multi-tenant clusters of over 10,000 nodes (Fault Tolerance; Stability; Performance)
Streaming
Zero data loss; ingest integration; zero downtime management; performance; accessibility
In response, many organizations have turned to a new architecture – an enterprise data hub – to complement and extend existing investments.
An enterprise data hub can store unlimited data, cost-effectively and reliably, for as long as you need, and lets users access that data in a variety of ways. Data can be collected, stored, processed, explored, modeled, and served in one unified platform. It’s connected to the systems you already rely on.
Cloudera’s enterprise data hub, powered by Apache Hadoop, the popular open source distributed data platform, is differentiated in several crucial areas. We provide:
Leading query performance.
The enterprise management and governance that you require of all of your mission-critical infrastructure.
Comprehensive, transparent, compliance-ready security at the core.
An open source platform that is also built on open standards – projects that are supported by multiple vendors to ensure sustainability, portability, and compatibility.
Our platform runs in your choice of environment, whether on-premises or in the cloud.
===
Cheat Sheet version: Our enterprise data hub is:
One place for unlimited data
Accessible to anyone
Connected to the systems you already depend on
Secure, governed, managed & compliant
Built on open source and open standards
Deployed however you want
Coupled with the support and enablement you need to succeed.
Important Note: Our EDH emphasizes “unified analytics” over “unified data”: it is neither practical nor likely that customers will actually unify all their data. Much of it lives in the cloud, on storage appliances (e.g. Isilon), or in remote datacenters; is of uncertain value relative to the cost of moving it to a hub; or is subject to security mandates that preclude colocation. We enable customers to gather unlimited data while bringing diverse processing and analytics to that data.