First, the industry has embraced polyglot persistence – accepting that data should be stored and managed in a “fit for purpose” way, optimized for the data at hand.
Second, we can parallelize our MPP data foundation for both speed and size; this is crucial for next-gen data services and analytics that must scale to any latency and size requirements.
Third, MPP data pipelines allow us to treat data events in moving time windows at variable latencies; in the long run this will change how we do ETL for most use cases.
The value of applying a fast data strategy as an end-to-end solution has become more apparent across many industry segments; it’s also becoming more mainstream now. We’re seeing many Oracle customers adopting solutions that address these fast data requirements; for example:
Transportation: Monitoring airline operational events to eliminate flight delays
Retail: Customer service centers using fast data for click-stream analysis and customer experience management
Telco: Location-based offers and intelligent network management to drive new services and lower costs
Healthcare: Monitoring medical device data to help save lives
Manufacturing: Real-time corrective action to reduce maintenance costs and the risk of outages
Oracle Data Integration (Oracle GoldenGate and Oracle Data Integrator) provides best-in-class real-time capture and big data transformation:
Capture changed data and events in real time
Apply basic filtering and transformation at capture point
Transform and load structured or unstructured data for analysis
Improve the quality and trustworthiness of information
In-memory analytics to provide actionable insight from large amounts of data - fast.
Oracle TimesTen In-Memory Database for Oracle Exalytics has also been certified to work with both Oracle GoldenGate real-time replication technology and Oracle Data Integrator, giving customers more flexibility to report on events as they happen. With these certifications, Oracle Exalytics data can be updated either via replication or via incremental updates, making refreshes quicker and more efficient.
Trust that your data model is current and accurate: customers benefit from a common enterprise information model that provides centralized metadata management, common query generation and data access, and a rich spectrum of visualization, collaboration, and search features.
Quickly organize and explore diverse and unstructured big data from inside and outside your organization – Endeca allows business users to freely explore and discover meaningful new insights from both structured and unstructured sources, helping to identify root causes and new associations.
In general, I see three main categories: those based on the HDFS ecosystem, those built on other ecosystems (or a mix of both), and composite systems.
HDFS based does not mean that it *must* be HDFS based; Spark and H2O can both run outside of HDFS. However, wherever you find them, HDFS is usually running nearby as well.
An effort to make R both more scalable and faster.
Can run on top of HDFS, but on other platforms as well
YARN-based processing
Includes Malhar, a “lego box” of pre-made operators and widgets to speed adoption
An in-memory OLAP store, designed to ingest event/log data, chunking and compressing that data into column-based, queryable segments.
Data Ingestion
Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming datasource. Batch-loaded data formats can be JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of data in the cluster.
Real-time nodes chunk data into segments, and are designed to frequently move these segments out to deep storage. To maintain cluster awareness of the location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
Query Management
Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
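To make that query path concrete, here is a hedged sketch in Java that POSTs a native Druid timeseries query to a broker node; the broker then fans the query out to the historical and real-time nodes holding the relevant segments and merges their partial results. The dataSource name, interval, and broker address/port are illustrative assumptions, not values prescribed by Druid.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class DruidTimeseriesQuery {
    public static void main(String[] args) throws Exception {
        // A native Druid timeseries query; the dataSource name and interval are made up.
        String query = "{"
            + "\"queryType\": \"timeseries\","
            + "\"dataSource\": \"events\","
            + "\"granularity\": \"minute\","
            + "\"intervals\": [\"2015-09-01/2015-09-02\"],"
            + "\"aggregations\": [{\"type\": \"count\", \"name\": \"rows\"}]"
            + "}";

        // The query goes to a broker node, which locates the segments on historical
        // and real-time nodes, merges their partial results, and returns the answer.
        URL broker = new URL("http://localhost:8082/druid/v2/?pretty");
        HttpURLConnection conn = (HttpURLConnection) broker.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner in = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}
```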
Cluster Management
Operations relating to data management in historical nodes are overseen by coordinator nodes, which are the prime users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
Features
Low latency (real-time) data ingestion
Arbitrary slice and dice data exploration
Sub-second analytic queries
Approximate and exact computations
Flink is intended to be a framework for unified stream and batch processing (Kappa architecture). Flink can also handle backpressure on queues more gracefully than other platforms (::cough:: Storm).
Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.
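As a rough illustration of that dataflow model, the sketch below uses Flink's DataStream API to count words arriving on a socket; the host/port source and the job name are assumptions made for the example, and a real pipeline would typically read from Kafka or files instead.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Text lines arriving on a local socket stand in for a real stream source.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            // Split each line into (word, 1) pairs.
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            // Partition by word and keep a running sum of the counts.
            .keyBy(pair -> pair.f0)
            .sum(1)
            .print();

        // Builds the dataflow graph above and hands it to Flink's pipelined runtime.
        env.execute("Streaming WordCount");
    }
}
```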
Storm is a stream processor only; topologies are wired together from spouts (stream sources) and bolts (processing steps).
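For comparison, here is a minimal, hedged sketch of how a Storm topology wires spouts to bolts. It assumes the org.apache.storm package layout (Storm 1.x and later) and uses TestWordSpout, a sample spout bundled with storm-core; the bolt, component names, and topology name are made up for illustration.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordLengthTopology {

    // A bolt consumes tuples emitted by spouts (or other bolts) and emits new tuples.
    public static class LengthBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            collector.emit(new Values(word, word.length()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "length"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Spouts are the sources of streams; bolts are the processing steps.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);
        builder.setBolt("lengths", new LengthBolt(), 2).shuffleGrouping("words");

        // Run in-process for demonstration; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-length", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```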
Samza’s goal is to provide a lightweight framework for continuous data processing.
Samza continuously computes results as data arrives which makes sub-second response times possible.
It’s unlike batch processing systems such as Hadoop, which typically have high-latency responses that can sometimes take hours.
Samza might help you to update databases, compute counts or other aggregations, transform messages, or any number of other operations. It’s been in production at LinkedIn for several years and currently runs on hundreds of machines across multiple data centers.
LinkedIn’s largest Samza job processes more than one million messages per second during peak traffic hours.
Architecture & Concepts
Streams and Jobs are the building blocks of a Samza application:
A stream is composed of immutable sequences of messages of a similar type or category. In order to scale the system to handle large-scale data, we break down each stream into partitions. Within each partition, the sequence of messages is totally ordered and each message’s position is uniquely identified by its offset. At LinkedIn, streams are provided by Apache Kafka.
A job is the code that consumes and processes a set of input streams. In order to scale the throughput of the stream processor, jobs are broken into smaller units of execution called tasks. Each task consumes data from one or more partitions of each of the job’s input streams. Since there is no defined ordering of messages across partitions, tasks can operate independently.
Samza assigns groups of tasks to be executed inside one or more containers – UNIX processes running a JVM that execute a set of Samza tasks for a single job. Samza’s container code is single threaded (when one task is processing a message, no other task in the container is active), and is responsible for managing the startup, execution, and shutdown of one or more tasks.
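As a rough sketch of what one of those tasks looks like, the class below implements Samza's low-level StreamTask API and keeps a running count for the partitions assigned to that task instance. The class name, the "kafka" system name, and the output stream name are illustrative assumptions; in a real job they come from the job's configuration, not the code.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounterTask implements StreamTask {
    // Output destination: the "kafka" system and stream name are illustrative only.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "page-view-counts");

    private int count = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Each task instance consumes one or more partitions of the input stream,
        // so this counter covers only the partitions assigned to this task.
        count++;
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), count));
    }
}
```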
SPARK – fast, general-purpose engine for distributed processing
MESOS – cluster management and resource isolation
AKKA – runtime for highly concurrent, distributed, message-driven applications
CASSANDRA – distributed NoSQL store optimized for reads
KAFKA – high-throughput, low-latency pub/sub messaging system