
Big data architecture: Technologies (Part 3)



All you wanted to know about big data, Hadoop technologies, streaming, etc.



  1. 1. (Big-)Data Architecture (Re-)Invented Part 3: Big Data Technologies William El Kaim May 2018 – V 4.0
  2. 2. This Presentation is part of the Enterprise Architecture Digital Codex © William El Kaim 2018 2
  3. 3. • Hadoop from V1 to V3 • Encoding Technologies • Ingestion Technologies • Storage Technologies • Processing Technologies • Big Data Fabric Copyright © William El Kaim 2018 3
  4. 4. Hadoop: Open Source Bazaar Style Dev. • Hadoop was first conceived at Yahoo as a distributed file system (HDFS) and a processing framework (MapReduce) for indexing the Internet. • It worked so well that other Internet firms in Silicon Valley started using the open source software too. • Apache Hadoop, by all accounts, has been a huge success on the open source front. • The Hadoop project has spawned dozens of Apache projects • Hive, Impala, Spark, HBase, Cassandra, Pig, Tez, etc. Copyright © William El Kaim 2018 4
  5. 5. Is there a Hadoop Standard? • The Apache Software Foundation (ASF) manages Apache Hadoop • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. • Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. • Source: Apache Software Foundation Copyright © William El Kaim 2018 5
  6. 6. Open Data Platform Initiative • ODPi defines itself as "a shared industry effort focused on promoting and advancing the state of Apache Hadoop and big data technologies for the enterprise." • The group has grown its membership steadily since launching in February 2015 under the name Open Data Platform Alliance: • Ampool, Altiscale, Capgemini, DataTorrent, EMC, GE, Hortonworks, IBM, Infosys, Linaro, NEC, Pivotal, PLDT, SAS Institute Inc, Splunk, Squid Solutions, SyncSort, Telstra, Teradata, Toshiba, UNIFi, Verizon, VMware, WANdisco, Xiilab, zData and Zettaset. • ODPi took a major step forward by securing official endorsement from the Linux Foundation, turning it into a Linux Foundation collaborative project. • Major companies that have stayed out of ODPi include Amazon, Cloudera, and MapR • Specifications • ODPi Runtime specification (March 2016) and ODPi Operations Copyright © William El Kaim 2018 Source: Odpi 6
  7. 7. Open Data Platform Initiative • Objectives are: • Reinforces the role of the Apache Software Foundation (ASF) in the development and governance of upstream projects. • Accelerates the delivery of Big Data solutions by providing a well-defined core platform to target. • Defines, integrates, tests, and certifies a standard "ODPi Core" of compatible versions of select Big Data open source projects. • Provides a stable base against which Big Data solution providers can qualify solutions. • Produces a set of tools and methods that enable members to create and test differentiated offerings based on the ODPi Core. • Contributes to ASF projects in accordance with ASF processes and Intellectual Property guidelines. • Supports community development and outreach activities that accelerate the rollout of modern data architectures that leverage Apache Hadoop®. • Will help minimize the fragmentation and duplication of effort within the industry. Source: Odpi Copyright © William El Kaim 2018 7
  8. 8. Hadoop History Copyright © William El Kaim 2018 8
  9. 9. Hadoop V1 Architecture [Diagram: batch & scheduled integration via Sqoop, ODBC/JDBC and data integration tools (Talend, Informatica); near real-time integration via Flume; HDFS running MapReduce with Pig, Hive, HBase, HCatalog and WebHDFS/REST; connecting existing infrastructure: databases & warehouses, applications & spreadsheets, visualization & intelligence, logs & files] Source: HortonWorks Copyright © William El Kaim 2018 9
  10. 10. Hadoop V1 Architecture • Hive - A data warehouse infrastructure that runs on top of Hadoop. Hive supports SQL queries, star schemas, partitioning, join optimizations, caching of data, etc. • Pig - A scripting language for processing Hadoop data in parallel. • MapReduce - Java applications that can process data in parallel. • Ambari - An open source management interface for installing, monitoring and managing a Hadoop cluster. Ambari has also been selected as the management interface for OpenStack. • HBase - A NoSQL columnar database providing extremely fast scanning of column data for analytics. • Sqoop, Flume - Tools providing large-scale data ingestion for Hadoop using SQL, streaming and REST API interfaces. • Oozie - A workflow manager and scheduler. • Zookeeper - A coordination infrastructure. • Mahout - A machine learning library supporting recommendation, clustering, classification and frequent itemset mining. • Hue - A Web interface that contains a file browser for HDFS, a Job Browser for YARN, an HBase Browser, Query Editors for Hive, Pig and Sqoop and a Zookeeper browser. Copyright © William El Kaim 2018 10
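The map/shuffle/reduce flow that MapReduce popularized can be sketched in a few lines of plain Python. This is an illustrative word count only, not the Hadoop Java API; function names are hypothetical:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every input split
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key,
    # as Hadoop does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["the"] == 2
```

In real Hadoop the same three phases run distributed: mappers on the nodes holding the input splits, a network shuffle, then reducers writing back to HDFS.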
  11. 11. Hadoop V1 Architecture Copyright © William El Kaim 2018 11
  12. 12. Hadoop V1 Issues • Availability • The Hadoop 1.0 architecture had a single point of failure: the Job Tracker. If the Job Tracker fails, all jobs have to restart. • Scalability • The Job Tracker runs on a single machine performing various tasks such as monitoring, job scheduling, task scheduling and resource management. • In spite of the presence of several machines (Data Nodes), they were not being utilized in an efficient manner, thereby limiting the scalability of the system. • Multi-Tenancy • The major issue with Hadoop MapReduce that paved the way for the advent of Hadoop YARN was multi-tenancy. With the increase in the size of clusters in Hadoop systems, the clusters can be employed for a wide range of models. • Cascading Failure • In Hadoop MapReduce, when the number of nodes in a cluster exceeds 4000, some instability is observed. • The most common failure observed was the cascading failure, which could cause the overall cluster to deteriorate when trying to overload the nodes or replicate data via network flooding. Source: Dezyre Copyright © William El Kaim 2018 12
  13. 13. Hadoop V2 • Hadoop 1 popularized MapReduce programming for batch jobs and demonstrated the potential value of large scale, distributed processing. • I/O intensive, not suitable for interactive analysis, and constrained in support for graph, machine learning and other memory intensive algorithms. • In Hadoop V2, developers improved HDFS and created a new resource management layer, known as YARN (Yet Another Resource Negotiator) Copyright © William El Kaim 2018 Source: HortonWorks Video 13
  14. 14. Hadoop V2: YARN • Foundation layer for parallel processing in Hadoop. • Scalable to 10,000+ data node systems. • Highly scalable parallel-processing operating system that supports many different types of workloads • Supports batch processing providing high throughput performing sequential read scans. • Supports real-time interactive queries (Tez), with low latency and random reads. • Graph data, in-memory processing, messaging systems, streaming video, etc. Copyright © William El Kaim 2018 14
  15. 15. Hadoop V2: YARN applications • Apache™ Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks. • By eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, Tez speeds up data processing across both small-scale, low-latency and large-scale, high-throughput workloads. • Apache™ Slider is an engine that runs other applications in a YARN environment. • With Slider, distributed applications that aren’t YARN-aware can now participate in the YARN ecosystem – usually with no code modification. • Slider allows applications to use Hadoop’s data and processing resources, as well as the security, governance, and operations capabilities of enterprise Hadoop. Copyright © William El Kaim 2018 15
  16. 16. Hadoop V2: Spark Revolution • Spark replaces MapReduce. • MapReduce is inefficient at handling iterative algorithms as well as interactive data mining tools. • Spark is fast: uses memory differently and efficiently • Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk • Spark excels at programming models • involving iterations, interactivity (including streaming) and more. • Spark offers over 80 high-level operators that make it easy to build parallel apps • Spark runs Everywhere • Runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3. Copyright © William El Kaim 2018 16
  17. 17. Hadoop V2: Spark Stack Evolutions (2015) Source: Databricks Goal: unified engine across data sources, workloads and environments DataFrame is a distributed collection of data organized into named columns ML pipeline to define a sequence of data pre-processing, feature extraction, model fitting, and validation stages Copyright © William El Kaim 2018 17
  18. 18. Hadoop V2: Spark V2 • Spark programming originally revolved around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. • The original Spark core API did not always feel natural for the larger population of data analysts and data engineers, who worked mainly with SQL and statistical languages such as R. • Today, Spark provides higher level APIs for advanced analytics and data science, and supports five different languages: SQL, Scala, Java, Python and R. • What makes Spark quite special in the distributed computing arena is the fact that different techniques such as SQL queries and machine learning can be mixed and combined together, even within the same script. • By using Spark, data scientists and engineers do not have to switch to different environments and tools for data pre-processing, SQL queries or machine learning algorithms. This boosts the productivity of data professionals and delivers better and simpler data processing solutions. Source: Databricks Copyright © William El Kaim 2018 18
  19. 19. Hadoop V1 vs. V2 • YARN has taken over the cluster management responsibilities from MapReduce • Now MapReduce just takes care of data processing, and the other responsibilities are taken care of by YARN. Copyright © William El Kaim 2018 19
  20. 20. Hadoop V2 Full Architecture Source: HortonWorks Copyright © William El Kaim 2018 20
  21. 21. Hadoop V3: Released in 2018 • Agility & Time to Market • Although Hadoop 2 uses containers, Hadoop 3 containerization brings the agility and package isolation of Docker. • A container-based service makes it possible to build apps quickly and roll one out in minutes. It also brings faster time to market for services. • Total Cost of Ownership • Hadoop 2 has a lot more storage overhead than Hadoop 3. • Erasure coding in Hadoop 3 halves the storage cost of HDFS while also retaining data durability: 3x replication stores two extra copies of every block (200% overhead), while erasure coding stores roughly half a block of parity data instead, reducing overhead from 200% to 50%. • Extensible resource-types • Hadoop 3.0.0 extends YARN, the compute platform piece, to have an extensible framework for managing additional resource-types beyond the memory and CPU that YARN supports today. • One use-case for this extensible framework is to enable bringing machine learning and deep learning workloads to your Hadoop cluster by pooling GPU and FPGA resources and elastically sharing them between different business units and users from different parts of the organization. Copyright © William El Kaim 2018 Source: HortonWorks 21
  22. 22. Hadoop V3: Scalability & Availability • Hadoop 2 and 1 only use a single NameNode to manage all namespaces. • Hadoop 3 has multiple NameNodes for multiple namespaces (NameNode federation), which improves scalability. • In Hadoop 2, there is only one standby NameNode. • Hadoop 3 supports multiple standby NameNodes. If one standby node goes down, you have the benefit of other standby NameNodes so the cluster can continue to operate. This feature gives you a longer servicing window. • Hadoop 2 uses an old timeline service which has scalability issues. • Hadoop 3 introduces timeline service v2, which improves the scalability and reliability of the timeline service. • Hadoop 2 cannot accommodate intra-node disk balancing. • Hadoop 3 has intra-node disk balancing. If you are repurposing or adding new storage to an existing server with older capacity drives, this leads to unevenly distributed disk space in each server. With intra-node disk balancing, the space in each disk is evenly distributed. • Hadoop 2 has only inter-queue preemption across queues. • Hadoop 3 introduces intra-queue preemption, which goes a step further by allowing preemption between applications within a single queue. This means that you can prioritize jobs within the queue based on user limits and/or application priority. Copyright © William El Kaim 2018 Source: HortonWorks 22
  23. 23. Hadoop V3: Scalability & Availability • Cloud storage support • Hadoop 3.0.0 also brings a host of improvements for cloud storage systems such as Amazon S3, Microsoft Azure Data Lake, and Aliyun Object Storage System. • YARN Federation • In 3.0.0, Hadoop now supports federation of YARN clusters. Until now, users set up independent YARN clusters and ran workloads on them, with each cluster completely oblivious to the others. YARN federation puts an over-arching layer on top of multiple clusters to solve one primary use-case – scale. Federation enables YARN to be scaled to hundreds of thousands of nodes, far beyond the original design goal of 10,000 machines. Copyright © William El Kaim 2018 23
  24. 24. • Hadoop from V1 to V3 • Encoding Technologies • Ingestion Technologies • Storage Technologies • Processing Technologies • Big Data Fabric Copyright © William El Kaim 2018 24
  25. 25. Encoding is Key! • A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated with the difficulties managing large datasets, such as evolving schemas, or storage constraints. • Choosing an appropriate file format can have some significant benefits: • Faster read times • Faster write times • Splittable files (so you don’t need to read the whole file, just a part of it) • Schema evolution support (allowing you to change the fields in a dataset) • Advanced compression support (compress the files with a compression codec without sacrificing these features) Copyright © William El Kaim 2018 Source: Matthew Rathbone 25
  26. 26. Encoding is Key! • The format of the files you can store on HDFS, like any file system, is entirely up to you. • However, unlike a regular file system, HDFS is best used in conjunction with a data processing toolchain like MapReduce or Spark. • These processing systems typically (although not always) operate on some form of textual data like webpage content, server logs, or location data. • If you’re just getting started with Hadoop, HDFS and Hive and wondering what file format you should begin with, then use tab-delimited files for your prototyping (and first production jobs). • They’re easy to debug (because you can read them), they are the default format of Hive, and they’re easy to create and reason about. • Once you have a production MapReduce or Spark job regularly generating data, come back and pick something better. Copyright © William El Kaim 2018 Source: Matthew Rathbone 26
  27. 27. Encoding Technologies Copyright © William El Kaim 2018 Source: Matthew Rathbone 27
  28. 28. Encoding Technologies • Text Files (e.g. CSV, TSV) • Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical Unix fashion. • Text files are inherently splittable (just split on \n characters!), but if you want to compress them you’ll have to use a file-level compression codec that supports splitting, such as BZIP2. • Because these files are just text files you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS. • Sequence files were originally designed for MapReduce. • They encode a key and a value for each record and nothing more. Records are stored in a binary format that is smaller than a text-based format would be. Like text files, the format does not encode the structure of the keys and values, so if you make schema migrations they must be additive. • Sequence files by default use Hadoop’s Writable interface in order to figure out how to serialize and de-serialize classes to the file. Typically if you need to store complex data in a sequence file you do so in the value part while encoding the id in the key. The problem with this is that if you add or change fields in your Writable class it will not be backwards compatible with the data stored in the sequence file. • One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. • Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and represent the easiest next step away from text files. Copyright © William El Kaim 2018 28
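The JSON-per-line pattern described above can be sketched in plain Python. Because records are newline-delimited, the resulting text stays splittable; the field names here are made up for illustration:

```python
import json

records = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": "Tokyo"},
]

# Encode: one JSON document per line, so the file can be
# split on "\n" and each split parsed independently
encoded = "\n".join(json.dumps(r) for r in records)

# Decode: every line is a self-contained record
decoded = [json.loads(line) for line in encoded.split("\n")]
# decoded == records
```

The trade-off the slide mentions is visible here: the `"id"` and `"city"` keys are repeated on every line, wasting space in exchange for self-describing records.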
  29. 29. Encoding Technologies • Avro • Avro is not really a file format, it’s a file format plus a serialization and deserialization framework. It encodes the schema of its contents directly in the file, which allows complex objects to be stored natively. • Avro provides: • Rich data structures. • A compact, fast, binary data format. • A container file, to store persistent data. • Remote procedure call (RPC). • Simple integration with dynamic languages. • Avro: • defines file data schemas in JSON (for interoperability), allows for schema evolution (remove a column, add a column), and multiple serialization/deserialization use cases. • supports block-level compression. • For most Hadoop-based use cases Avro is a really good choice. • Columnar File Formats • The latest evolution concerning file formats for Hadoop • Instead of just storing rows of data adjacent to one another you also store column values adjacent to each other. So datasets are partitioned both horizontally and vertically. • One huge benefit of columnar-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar). • When to use: • If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application. • If you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required. • Formats: • Apache Parquet seems to have the most community support. • RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters; the best known successor is Apache ORC. • Apache CarbonData is an indexed columnar data format for fast analytics on big data platforms. Copyright © William El Kaim 2018 29
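The compression benefit of the columnar layout can be illustrated in plain Python. This is a toy comparison using JSON and zlib, not an actual Parquet/ORC implementation; the column names and data are invented:

```python
import json
import zlib

rows = [{"country": "FR", "amount": i} for i in range(1000)]

# Row-oriented layout: whole records stored one after another
row_bytes = json.dumps(rows).encode()

# Column-oriented layout: all values of one column stored adjacently
columns = {
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
col_bytes = json.dumps(columns).encode()

# The repeated "country" values sit next to each other in the
# columnar layout, so a generic codec compresses it much better
row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))
```

Real columnar formats go further than this sketch: they use per-column encodings (dictionary, run-length) chosen to match each column's type and value distribution.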
  30. 30. Serialization Technologies • Elastic Grok: to parse unstructured log data into something structured and queryable. • JSON: JavaScript Object Notation is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). • Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. • The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. • The rich type system provides unambiguous semantics for long-term preservation of business data which can survive multiple generations of software evolution. • RegexSerDe uses regular expressions (regex) to serialize/deserialize. • It can deserialize the data using regex and extract groups as columns. It can also serialize the row object using a format string. Copyright © William El Kaim 2018 Source: Matthew Rathbone 30
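The RegexSerDe/Grok idea, deserializing a raw line into named columns with a regular expression, can be sketched with Python's re module. The log line, pattern, and column names below are hypothetical, not the Hive RegexSerDe syntax:

```python
import re

# A hypothetical Apache-style access log line
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36] "GET /index.html HTTP/1.0" 200 2326'

# Named groups play the role of a SerDe: each group becomes a column
pattern = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+)'
)

row = pattern.match(line).groupdict()
# row["host"] == "127.0.0.1", row["status"] == "200"
```

Once the unstructured line is deserialized into named columns like this, it can be queried as if it were a structured table.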
  31. 31. File compression is done in two ways • File-Level Compression • Compress entire files regardless of the file format, the same way you would compress a file in Linux. Some of these formats are splittable (e.g. bzip2, or LZO if indexed). • Block-Level Compression • Internal to the file format, so individual blocks of data within the file are compressed. • This means that the file remains splittable even if you use a non-splittable compression codec • Snappy is a great balance of speed and compression ratio. Copyright © William El Kaim 2018 31
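Why block-level compression keeps a file splittable can be sketched in plain Python: each block is compressed independently, so a reader can start at any block boundary. This is a toy illustration with zlib, not an actual HDFS container format; the block size is arbitrary:

```python
import zlib

data = b"log line\n" * 10_000
BLOCK_SIZE = 16_384

# Block-level compression: each block compressed on its own,
# so any block can be decompressed without reading the others
blocks = [
    zlib.compress(data[i:i + BLOCK_SIZE])
    for i in range(0, len(data), BLOCK_SIZE)
]

# A map task assigned one block only needs that block
first_block = zlib.decompress(blocks[0])

# And the blocks concatenate back to the original file
restored = b"".join(zlib.decompress(b) for b in blocks)
```

With file-level compression using a non-splittable codec (e.g. plain gzip), decompression must start from byte zero, so one map task ends up reading the whole file.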
  32. 32. • Hadoop from V1 to V3 • Encoding Technologies • Ingestion Technologies • Storage Technologies • Processing Technologies • Big Data Fabric Copyright © William El Kaim 2018 32
  33. 33. Big Data Technologies Copyright © William El Kaim 2018 33
  34. 34. Streaming Solutions Landscape Copyright © William El Kaim 2018 34 Source: Baqend
  35. 35. Stream Processing Unifies Data Processing, Analytics, and Applications Copyright © William El Kaim 2018 35 Source: Baqend
  36. 36. Streaming Semantics Copyright © William El Kaim 2018 36
  37. 37. Data Ingestion Frameworks • Amazon Kinesis – real-time processing of streaming data at massive scale. • Apache Chukwa – data collection system. • Apache Flume – service to manage large amounts of log data. • Apache Kafka – distributed publish-subscribe messaging system. • Apache Sqoop – tool to transfer data between Hadoop and a structured datastore. • Cloudera Morphlines – framework that helps ETL to Solr, HBase and HDFS. • Facebook Scribe – streamed log data aggregator. • Fluentd – tool to collect events and logs. • Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency. • Heka – open source stream processing software system. • HIHO – framework for connecting disparate data sources with Hadoop. • Kestrel – distributed message queue system. • LinkedIn Databus – stream of change capture events for a database. • LinkedIn Kamikaze – utility package for compressing sorted integer arrays. • LinkedIn White Elephant – log aggregator and dashboard. • Logstash – a tool for managing events and logs. • Netflix Suro – log aggregator like Storm and Samza, based on Chukwa. • Pinterest Secor – a service implementing Kafka log persistence. • LinkedIn Gobblin – LinkedIn’s universal data ingestion framework. Copyright © William El Kaim 2018 37 Source: Big Data Ingestion and Streaming Tools
  38. 38. Ingestion Technologies Apache Flume • Apache Flume is a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS (especially “logs”). • Data is pushed to the destination (push mode). • Flume does not replicate events - in case of a Flume agent failure, you will lose the events in the channel Copyright © William El Kaim 2018 38
  39. 39. Ingestion Technologies Apache Kafka • Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, developed by LinkedIn, that persists messages to disk (pull mode) • Designed for high throughput, Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability, and replication. • Uses topics which many listeners can subscribe to, so processing of messages can happen in parallel on various channels • High availability of events (recoverable in case of failures) Copyright © William El Kaim 2018 39
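The pull-mode, persistent-log idea behind Kafka topics can be sketched in plain Python. This is a toy in-memory stand-in, not the real Kafka API; class and method names are invented:

```python
class Topic:
    """Toy stand-in for a Kafka topic: an append-only log that
    consumers pull from at their own offset."""

    def __init__(self):
        self.log = []       # ordered, persisted messages
        self.offsets = {}   # consumer name -> next position to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, consumer):
        # Pull mode: each subscriber advances independently, so the
        # same messages can be consumed in parallel by many listeners
        pos = self.offsets.get(consumer, 0)
        batch = self.log[pos:]
        self.offsets[consumer] = len(self.log)
        return batch

topic = Topic()
topic.publish("click:1")
topic.publish("click:2")
a = topic.poll("analytics")   # ["click:1", "click:2"]
topic.publish("click:3")
b = topic.poll("analytics")   # ["click:3"]
c = topic.poll("audit")       # ["click:1", "click:2", "click:3"]
```

Because the log is retained rather than deleted on delivery, a late subscriber ("audit" above) still sees every event, which is the recoverability property the slide mentions.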
  40. 40. Ingestion Technologies Apache Storm • Apache Storm, developed by BackType (bought by Twitter), is a reliable system for processing streaming data in real time (and generating new streams). • Designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. • One strength is the catalogue of available spouts specialized for receiving data from all types of sources. • Storm topologies run on clusters and the Storm scheduler distributes work to nodes around the cluster, based on the topology configuration. Copyright © William El Kaim 2018 40
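The spout/bolt wiring can be sketched in plain Python, with generators standing in for stream connections. This is an illustrative word-count topology, not the Storm API; the function names are invented:

```python
def word_spout():
    # Spout: the source of the input stream
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(stream):
    # Bolt: consume sentence tuples, emit word tuples
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: keep a running count per word
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring spout -> bolt -> bolt forms a (here linear) topology DAG
counts = count_bolt(split_bolt(word_spout()))
# counts["streams"] == 2
```

In real Storm each spout and bolt runs as many parallel tasks spread across the cluster, with the scheduler deciding placement; the dataflow shape is the same.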
  41. 41. Ingestion Technologies Storm: Example • Twitter streams, counting words, and storing them in a NoSQL database Source: Trivadis Copyright © William El Kaim 2018 41
  42. 42. Ingestion Technologies Storm: Example Source: Trivadis Copyright © William El Kaim 2018 42
  43. 43. Ingestion Technologies Twitter Heron • Twitter dropped Apache Storm in production in 2015 and replaced it with a homegrown data processing system, named Heron. • Apache Storm was the original solution to Twitter's problems. • Storm was reputedly hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it has been challenged by other projects, including Apache Spark and its revised streaming framework. • Heron was built from scratch with a container- and cluster-based design, outlined in a research paper. • The user creates Heron jobs, or "topologies," and submits them to a scheduling system, which launches the topology in a series of containers. • The scheduler can be any of a number of popular schedulers, like Apache Mesos or Apache Aurora. Storm, by contrast, has to be manually provisioned on clusters to add scale. • In May 2016 Twitter released Heron under an open source license Source: Infoworld Copyright © William El Kaim 2018 43
  44. 44. Ingestion Technologies Twitter Heron • Heron is backward-compatible with Storm's API. • Storm spouts and bolts can be reused in Heron • Gives existing Storm users some incentive to check out Heron. • Heron • Code is written in Java (or Scala) • The web-based UI components are written in Python • The critical parts of the framework, the code that manages the topologies and network communications, are written in C++. • Twitter claims it's been able to gain anywhere from two to five times an improvement in "efficiency" (basically, lower opex and capex) with Heron. Source: Infoworld Copyright © William El Kaim 2018 44
  45. 45. Ingestion Technologies Apache Spark • Spark supports real-time distributed computation and stream-oriented processing, but it's more of a general-purpose distributed computing platform. • In-memory data storage for very fast iterative processing • Replacement for the MapReduce functions of Hadoop, running on top of an existing Hadoop cluster, relying on YARN for resource scheduling. • Spark can also layer on top of Mesos for scheduling or run as a stand-alone cluster using its built-in scheduler. • Where Spark shines is in its support for multiple processing paradigms and the supporting libraries Copyright © William El Kaim 2018 45
  46. 46. Ingestion Technologies Apache Spark Source: Ippon Source: Databricks Copyright © William El Kaim 2018 46
  47. 47. Ingestion Technologies Apache Spark • Spark Core • General execution engine for the Spark platform • In-memory computing capabilities deliver speed • General execution model supports wide variety of use cases • Spark Streaming • Run a streaming computation as a series of very small, deterministic batch jobs • Batch size as low as ½ sec, latency of about 1 sec • Exactly-once semantics • Potential for combining batch and streaming processing in same system Copyright © William El Kaim 2018 47
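The "series of very small, deterministic batch jobs" model of Spark Streaming can be sketched in plain Python. This is a toy simulation, not the Spark API; the micro_batches helper and per-batch sum are invented for illustration:

```python
import itertools

def micro_batches(stream, batch_size):
    # Chop an unbounded stream into small, deterministic batches
    # and run ordinary batch code on each one
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield sum(batch)  # any batch computation; here, a per-batch sum

totals = list(micro_batches(range(10), 3))  # [3, 12, 21, 9]
```

Because each batch is a deterministic job over a known slice of input, a failed batch can simply be re-run, which is what makes exactly-once semantics tractable in this model.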
  48. 48. Ingestion Technologies Apache Spark • At the core of Apache Spark is the notion of data abstraction as a distributed collection of objects called a Resilient Distributed Dataset (RDD) • RDDs allow you to write programs that transform these distributed datasets. • RDDs are immutable, recomputable, and fault-tolerant distributed collections of objects (partitions) spread across a cluster of machines • Data can be stored in memory or on disk (local). • RDDs enable parallel processing on data sets • Data is partitioned across machines in a cluster and can be operated on in parallel with a low-level API that offers transformations and actions. • RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure. • Contains transformation history (“lineage”) for the whole data set • Operations • Stateless Transformations (map, filter, groupBy) • Actions (count, collect, save) Source: Trivadis Copyright © William El Kaim 2018 48
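The lazy-transformation/eager-action split and the lineage idea can be sketched in plain Python. This is a single-machine toy, not the real RDD API (no partitioning or fault tolerance); the class name is invented:

```python
class ToyRDD:
    """Minimal stand-in for an RDD: immutable, with lazy
    transformations recorded as lineage and eager actions."""

    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage  # recorded transformation history

    # Transformations are lazy: return a new ToyRDD, record the step
    def map(self, f):
        return ToyRDD(self._data, self.lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._data, self.lineage + (("filter", f),))

    # Actions force evaluation by replaying the lineage
    def collect(self):
        out = list(self._data)
        for op, f in self.lineage:
            if op == "map":
                out = [f(x) for x in out]
            elif op == "filter":
                out = [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has executed yet; collect() replays the recorded lineage
result = evens_squared.collect()   # [0, 4, 16, 36, 64]
```

The lineage tuple is the point: if a partition were lost, Spark replays exactly this recorded history against the source data to rebuild it, rather than replicating every intermediate result.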
  49. 49. Ingestion Technologies Apache Spark • DataFrame is an immutable distributed collection of data (like RDD) • Unlike an RDD, data is organized into named columns, like a table in a relational database. • Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction. • It provides a domain specific language API to manipulate your distributed data • makes Spark accessible to a wider audience, beyond specialized data engineers. • Datasets • Introduced in Spark 1.6, the goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and benefits of the robust Spark SQL execution engine. • In Spark 2.0, the DataFrame APIs will merge with Datasets APIs, unifying data processing capabilities across all libraries. Copyright © William El Kaim 2018 49
  50. 50. Ingestion Technologies Apache Spark Machine Learning • MLlib is Apache Spark's general machine learning library • Allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, etc.). • The data engineers can focus on distributed systems engineering using Spark’s easy-to-use APIs, while the data scientists can leverage the scale and speed of Spark core. • ML Pipelines • Running machine learning algorithms involves executing a sequence of tasks including pre-processing, feature extraction, model fitting, and validation stages. • High-level API for MLlib that lives under a dedicated package. • A pipeline consists of a sequence of stages. There are two basic types of pipeline stages: Transformer and Estimator. • A Transformer takes a dataset as input and produces an augmented dataset as output. • An Estimator must first be fit on the input dataset to produce a model, which is a Transformer that transforms the input dataset. Copyright © William El Kaim 2018 50
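The Transformer/Estimator relationship described above can be sketched in plain Python. This mirrors the ML Pipelines idea only; it is not the Spark API, and the scaler stage is an invented example:

```python
class Scaler:
    """Toy Estimator: fit() learns a parameter from the dataset
    and produces a Transformer (the fitted model)."""

    def fit(self, dataset):
        max_value = max(dataset)
        return ScalerModel(max_value)

class ScalerModel:
    """Toy Transformer produced by the Estimator: transform()
    maps an input dataset to an augmented/derived dataset."""

    def __init__(self, max_value):
        self.max_value = max_value

    def transform(self, dataset):
        return [x / self.max_value for x in dataset]

train = [2.0, 4.0, 8.0]
model = Scaler().fit(train)        # Estimator -> fitted Transformer
scaled = model.transform(train)    # [0.25, 0.5, 1.0]
```

A pipeline is then just a list of such stages: Transformers are applied directly, while each Estimator is fit first and its resulting model is applied downstream.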
  51. 51. Ingestion Technologies Google Cloud Dataflow • Fully-managed cloud service and programming model for batch and streaming big data processing. • Used for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. • Cloud Dataflow “frees” users from operational tasks like resource management and performance optimization. • The open source Java-based Cloud Dataflow SDK enables developers to implement custom extensions and to extend Dataflow to alternate service environments Source: Google Copyright © William El Kaim 2018 51
  52. 52. Ingestion Technologies Google Dataflow vs. Spark • Dataflow is clearly faster than Spark. • But Spark has an ace up its sleeve in the form of REPL, or its “read evaluate print loop” functionality, which enables users to iterate on their problems quickly and easily. • “If you have a bunch of data scientists and you’re trying to figure out what they want to do, and they need to play around a lot, then Spark may be a better solution for those sorts of cases,” Oliver says. • While Spark maintains an edge among data scientists looking to iterate quickly, Google Cloud Dataflow seems to hold the advantage in the operations department, thanks to all the work that Google has done over the years to optimize queries at scale. • “Google Cloud Dataflow has some key advantages, in particular if you have a well thought out process that you’re trying to implement, and you’re trying to do it cost effectively…then Google Cloud Dataflow is an excellent option for doing it at scale and at a lower cost,” Oliver says. Source: DatanamiCopyright © William El Kaim 2018 52
53. 53. Ingestion Technologies Apache Beam • Apache Beam is an open source, unified programming model used to create a data processing pipeline. • Start by building a program that defines the pipeline using one of the open source Beam SDKs. • The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow. • Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. • Beam can also be used for Extract, Transform, and Load (ETL) tasks and pure data integration. Copyright © William El Kaim 2018 53
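The "embarrassingly parallel" pattern Beam exploits can be sketched in plain Python. This is an illustration of the idea only; the function names below are invented for the sketch and are not part of Beam's SDK.

```python
# Sketch of the embarrassingly parallel pattern: split the input into
# independent bundles, process each bundle with no shared state, then
# merge the results. Illustrative only, not Beam's actual API.

def split_into_bundles(elements, bundle_size):
    """Decompose the input into independent bundles."""
    return [elements[i:i + bundle_size]
            for i in range(0, len(elements), bundle_size)]

def process_bundle(bundle):
    """A per-bundle transform; each bundle could run on a different worker."""
    return [x * x for x in bundle]

def run_pipeline(elements, bundle_size=3):
    bundles = split_into_bundles(elements, bundle_size)
    # In Beam, a runner (Flink, Spark, Cloud Dataflow) schedules these
    # bundles in parallel; here they run sequentially for clarity.
    results = map(process_bundle, bundles)
    return [x for bundle in results for x in bundle]

print(run_pipeline(list(range(7))))  # [0, 1, 4, 9, 16, 25, 36]
```

Because the bundles share no state, the runner is free to execute them in any order and on any worker, which is what makes this class of problems a natural fit for Beam.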
  54. 54. Ingestion Technologies Concord © William El Kaim 2018 54
  55. 55. Ingestion Technologies Apache Flink • Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. • Flink includes several APIs for creating applications that use the Flink engine: • DataStream API for unbounded streams embedded in Java and Scala, and • DataSet API for static data embedded in Java, Scala, and Python, • Table API with a SQL-like expression language embedded in Java and Scala. • Flink also bundles libraries for domain-specific use cases: • CEP, a complex event processing library, • Machine Learning library, and • Gelly, a graph processing API and library. Copyright © William El Kaim 2018 55
  56. 56. Ingestion Technologies Apache Flink Source: Ippon Source: Apache Copyright © William El Kaim 2018 56
  57. 57. Ingestion Technologies Apache Flink Commercial Support Data ArtisansCopyright © William El Kaim 2018 57
58. 58. Ingestion Technologies Spark vs. Flink • Flink is: • optimized for cyclic or iterative processes by using iterative transformations on collections. • This is achieved by an optimization of join algorithms, operator chaining and reusing of partitioning and sorting. • However, Flink is also a strong tool for batch processing. • Spark is: • based on resilient distributed datasets (RDDs). • This (mostly) in-memory data structure underpins Spark's functional programming paradigm. It is capable of big batch calculations by pinning memory. Source: Zalando Source: Quora Copyright © William El Kaim 2018 58
59. 59. Ingestion Technologies Apache APEX • Apache Apex is a YARN-native integrated platform that unifies stream and batch processing. • It processes big data in-motion in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, and distributed. • Github • Comparisons to others • Spark and Storm are considered difficult to use. They’re built on batch engines, rather than true streaming architecture, and don’t natively support stateful computation. • They can’t do low-latency processing that Apex and Flink can, and will suffer a latency overhead for having to schedule batches repeatedly, no matter how quickly that occurs. • Use cases • GE’s Predix IoT cloud platform uses Apex for industrial data and analytics • Capital One for real-time decisions and fraud detection. Source: ASFCopyright © William El Kaim 2018 59
  60. 60. Ingestion Technologies • Apache Samza • Samza is a distributed stream-processing framework that is based on Apache Kafka and YARN. • It provides a simple callback-based API that’s similar to MapReduce, and it includes snapshot management and fault tolerance in a durable and scalable way. • Amazon Kinesis • Kinesis is Amazon’s service for real-time processing of streaming data on the cloud. • Deeply integrated with other Amazon services via connectors, such as S3, Redshift, and DynamoDB, for a complete Big Data architecture. • Apache Pulsar • Pulsar Functions allows developers to create fast, flexible data processing tasks that operate on data in motion as it passes through Pulsar, without requiring external systems or add-ons. Copyright © William El Kaim 2018 60
61. 61. Ingestion Technologies • NFS Gateway • The NFS Gateway supports NFSv3 and allows HDFS to be mounted as part of the client’s local file system. • Apache Sqoop • Tool designed for efficiently transferring bulk data between Hadoop and structured data stores (and vice-versa). • Import data from external structured data stores into HDFS • Extract data from Hadoop and export it to external structured data stores like relational databases and enterprise data warehouses. Copyright © William El Kaim 2018 61
  62. 62. Ingestion Technologies Sqoop Example Source: Rubén Casado TejedorCopyright © William El Kaim 2018 62
  63. 63. Ingestion Technologies: Streaming Platforms Copyright © William El Kaim 2018 63 • Other Platforms • InsightEdge • Lenses • StreamAnalytix • Streamlio • Streamsets • Streamtools (from NYT Labs) • Talend Data Streams (on AWS)
  64. 64. Streaming PaaS Example: StreamAnalytix © William El Kaim 2018 64
  65. 65. Streaming PaaS Exemple: InsightEdge © William El Kaim 2018 65
  66. 66. • Hadoop from V1 to V3 • Encoding Technologies • Ingestion Technologies • Storage Technologies • Processing Technologies • Big Data Fabric Copyright © William El Kaim 2018 66
  67. 67. Big Data Technologies Copyright © William El Kaim 2018 67
  68. 68. Hadoop Storage & Processing Processing Hadoop Distributed Storage Distributed FS Local FS NoSQL datastores GlusterFS HDFS S3 CephCassandra RingDynamoDB OLAP OLTP Machine Learning HBase Impala Hawq Map Reduce / Tez Map Reduce / Tez R, Python,… MahoutStreaming Cascading R, Python,… Hive Pig StreamingCascading Spark Spark Openstack SwiftIsilon Scalding Giraph Hama SciKit Stinger MapR Source: Octo TechnologyCopyright © William El Kaim 2018 68
  69. 69. Rise of the Immutable Datastore • In a relational database, files are mutable, which means a given cell can be overwritten when there are changes to the data relevant to that cell. • New architectures offer accumulate- only file system that overwrites nothing. Each file is immutable, and any changes are recorded as separate timestamped files. • The method lends itself not only to faster and more capable stream processing, but also to various kinds of historical time-series analysis. Copyright © William El Kaim 2018 69Source: PWC
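The accumulate-only model described above can be sketched in plain Python. This is a minimal illustration of the idea, not the design of any particular product: updates never overwrite, each change is appended as a new timestamped record, and historical state stays reconstructible.

```python
# Sketch of an accumulate-only store: put() appends a new timestamped
# version instead of overwriting, so any past state can be read back.
import time

class ImmutableStore:
    def __init__(self):
        self._log = []  # append-only list of (timestamp, key, value)

    def put(self, key, value, ts=None):
        # Append a new version; the old one is never mutated.
        self._log.append((ts if ts is not None else time.time(), key, value))

    def get(self, key, as_of=None):
        # Latest value for key at or before `as_of` (a "time travel" read).
        versions = [(ts, v) for ts, k, v in self._log
                    if k == key and (as_of is None or ts <= as_of)]
        return max(versions)[1] if versions else None

    def history(self, key):
        # The full audit trail for one key, useful for forensics.
        return [(ts, v) for ts, k, v in self._log if k == key]
```

Because the log only grows, writers never contend with readers over a cell, and the complete history remains available for time-series analysis and audits.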
70. 70. Why is immutability so important? • Fewer dependencies & Higher-volume data handling and improved site-response capabilities • Immutable files reduce dependencies or resource contention, which means one part of the system doesn’t need to wait for another to do its thing. That’s a big deal for large, distributed systems that need to scale and evolve quickly. • More flexible reads and faster writes • Writing data without structuring it beforehand means that you can have both fast reads and writes, as well as more flexibility in how you view the data. • Compatibility with Hadoop & log-based messaging protocols • A popular method of distributed storage for less-structured data. • Ex: Apache Samza and Apache Kafka are symbiotic and compatible with the Hadoop Distributed File System (HDFS). • Suitability for auditability and forensics • Log-centric databases and the transactional logs of many traditional databases share a common design approach that stresses consistency and durability (the C and D in ACID). • But only the fully immutable shared log systems preserve the history that is most helpful for audit trails and forensics. Copyright © William El Kaim 2018 70
  71. 71. Databases Evolutions Source: PWCCopyright © William El Kaim 2018 71
  72. 72. Storage Technologies: Cost & Speed Copyright © William El Kaim 2018 72
  73. 73. • HDFS: Distributed File System for Hadoop • A Java-based filesystem that provides scalable and reliable data storage. Designed to span large clusters of commodity servers. • Master-Slaves Architecture (NameNode – DataNodes) • NameNode: Manage the directory tree and regulates access to files by clients • DataNodes: Store the data. Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes • Apache Hive • An open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. • Abstraction layer on top of MapReduce • SQL-like language called HiveQL. • Metastore: Central repository of Hive metadata. Storage Technologies Copyright © William El Kaim 2018 73
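The HDFS storage model above (files split into same-size blocks, each block replicated across DataNodes) can be illustrated with a toy sketch. The round-robin placement below is a made-up stand-in for the NameNode's actual placement policy, which also accounts for racks and node health.

```python
# Toy sketch of HDFS-style storage: split a file into fixed-size blocks,
# then assign each block to several DataNodes. Illustrative only.

def split_into_blocks(data, block_size):
    """Split a file's bytes into fixed-size blocks (last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    # Round-robin placement stands in for the NameNode's real policy.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)
plan = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
# 300 bytes at a 128-byte block size yields 3 blocks, each on 3 DataNodes.
```

Replication is what lets HDFS survive DataNode failures: as long as one replica of each block remains, the file stays readable while the NameNode re-replicates.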
  74. 74. Storage Technologies • Apache KUDU • Kudu is an innovative new storage engine that is designed from the ground up to overcome the limitations of various storage systems available today in the Hadoop ecosystem. • For the very first time, Kudu enables the use of the same storage engine for large scale batch jobs and complex data processing jobs that require fast random access and updates. • As a result, applications that require both batch as well as real-time data processing capabilities can use Kudu for both types of workloads. • With Kudu’s ability to handle atomic updates, you no longer need to worry about boundary conditions relating to late-arriving or out-of-sequence data. • In fact, data with inconsistencies can be fixed in place in almost real time, without wasting time deleting or refreshing large datasets. • Having one system of record that is capable of handling fast data for both analytics and real- time workloads greatly simplifies application design and implementation. Copyright © William El Kaim 2018 74
  75. 75. Storage Technologies • HBase • An open source, non-relational, distributed column-oriented database written in Java. • Modeled after Google’s BigTable and developed as part of Apache Hadoop project, it runs on top of HDFS. • Random, real time read/write access to the data. • Very light «schema», Rows are stored in sorted order. • MapR DB • An enterprise-grade, high performance, in-Hadoop No-SQL database management system, MapR is used to add real-time operational analytics capabilities to Hadoop. • Pivotal HDB • Hadoop Native SQL Database powered by Apache HAWQ Copyright © William El Kaim 2018 75
  76. 76. Storage Technologies • Apache Impala • Open source MPP analytic database built to work with data stored on open, shared data platforms like Apache Hadoop’s HDFS filesystem, Apache Kudu’s columnar storage, and object stores like S3. • By being able to query data from multiple sources stored in different, open formats like Apache Parquet, Apache Avro, and text, Impala decouples data and compute and lets users query data without having to move/load data specifically into Impala clusters. • In the cloud, this capability is especially useful as you can create transient clusters with Impala to run your reports/analytics and shut down the cluster when you are done or elastically scale compute power to support peak demands, letting you save on cluster-hosting costs. • Impala is designed to run efficiently on large datasets, and scales to hundreds of nodes and hundreds of users. • You can learn more about the unique use cases Impala on S3 delivers in this blog post. Copyright © William El Kaim 2018 76
77. 77. Storage Technologies • MemSQL • MemSQL unveiled its “Spark Streamliner” initiative, in which it incorporated Apache Spark Streaming as a middleware component to buffer the parallel flow of data coming from Kafka before it’s loaded into MemSQL’s consistent storage. • This enabled customers like Pinterest to eliminate batch processing and move to continuous processing of data. • The exactly-once semantics is available through the “Create Pipeline” command in MemSQL version 5.5. • The command will automatically extract data from the Kafka source, perform some type of transformation, and then load it into the MemSQL database’s leaf nodes (as opposed to loading them in MemSQL’s aggregator nodes first, as it did with Streamliner). • The database can work on multiple, simultaneous streams while adhering to exactly-once semantics. Copyright © William El Kaim 2018 77
  78. 78. Storage Technology Landscape Copyright © William El Kaim 2018 78Source: Octo Technology
  79. 79. • Hadoop from V1 to V3 • Encoding Technologies • Ingestion Technologies • Storage Technologies • Processing Technologies • Big Data Fabric Copyright © William El Kaim 2018 79
  80. 80. Big Data Technologies Copyright © William El Kaim 2018 80
  81. 81. Hadoop Processing Paradigms Evolutions Source: Rubén Casado TejedorCopyright © William El Kaim 2018 81
  82. 82. Hadoop Processing Paradigms Evolutions Batch processing • Large amount of statics data • Generally incurs a high-latency / Volume Real-time processing • Compute streaming data • Low latency • Velocity Hybrid computation • Lambda Architecture • Volume + Velocity Source: Rubén Casado & ClouderaCopyright © William El Kaim 2018 82
83. 83. Hadoop Processing Paradigms & Time Copyright © William El Kaim 2018 83 • Example: Apache OMID • Contributed to ASF by Yahoo • Omid provides a high-performance ACID transactional framework with Snapshot Isolation guarantees on top of HBase, being able to scale to thousands of clients triggering transactions on application data. • It’s one of the few open-source transactional frameworks that can scale beyond 100K transactions per second on mid-range hardware while incurring minimal impact on the latency of accessing the datastore.
  84. 84. Processing Technologies • Apache Drill • Called the Omni-SQL: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage • An open-source software framework that supports data intensive distributed applications for interactive analysis of large-scale datasets • Example of how to use Drill here. • MapReduce • A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. • Apache Hive • Provides a mechanism to project structure onto large data sets and to query the data using a SQL-like language called HiveQL. • Apache Pig • Platform for analyzing large data sets • High-level procedural language for expressing data analysis programs. • Pig Latin: Data flow programming language. Source: Dataiku Copyright © William El Kaim 2018 84
  85. 85. Hadoop Batch processing • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Volume Source: Rubén CasadoCopyright © William El Kaim 2018 85
  86. 86. • MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. • Key Terminology • Job: A “full program” - an execution of a Mapper and Reducer across a data set • Task: An execution of a Mapper or a Reducer on a slice of data – a.k.a. Task- In-Progress (TIP) • Task Attempt: A particular instance of an attempt to execute a task on a machine Hadoop Batch Processing: Map Reduce Copyright © William El Kaim 2018 86
  87. 87. Source: HadooperCopyright © William El Kaim 2018 87
  88. 88. • Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). • MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to reduce the distance over which it must be transmitted. • "Map" step • Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. • A master node ensures that only one copy of redundant input data is processed. • "Shuffle" step • Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. • "Reduce" step • Worker nodes now process each group of output data, per key, in parallel. Hadoop Batch Processing: Map Reduce Copyright © William El Kaim 2018 88
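The three steps can be walked through in pure Python with the classic word-count example. This only illustrates the programming model; real MapReduce distributes each phase across worker nodes and spills intermediate data to disk.

```python
# Pure-Python walk-through of the Map, Shuffle, and Reduce steps (word count).
from collections import defaultdict

def map_phase(records):
    # "Map" step: each input record emits (key, value) pairs.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # "Shuffle" step: all values belonging to one key are grouped together
    # (in real MapReduce, routed to the same worker node).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # "Reduce" step: each key group is processed in parallel (here, summed).
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

The same three-phase skeleton underlies every MapReduce job; only the map and reduce functions change from one job to the next.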
  89. 89. Hadoop Batch Processing: Map Reduce Copyright © William El Kaim 2018 89
  90. 90. Batch Processing Technologies Source: Rubén CasadoCopyright © William El Kaim 2018 90
  91. 91. Batch Processing Architecture Example Source: Helena EdelsonCopyright © William El Kaim 2018 91
  92. 92. Real-time Processing • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Velocity Source: Rubén CasadoCopyright © William El Kaim 2018 92
  93. 93. Real-time Processing Technologies Source: Rubén CasadoCopyright © William El Kaim 2018 93
  94. 94. • Computational model and Infrastructure for continuous data processing, with the ability to produce low-latency results • Data collected continuously is naturally processed continuously (Event Processing or Complex Event Processing -CEP) • Stream processing and real-time analytics are increasingly becoming where the action is in the big data space. • As real-time streaming architectures like Kafka continue to gain steam, companies that are building next-generation applications upon them will debate the merits of the unified and the federated approaches Real-time (Stream) Processing Source: TrivadisCopyright © William El Kaim 2018 94
  95. 95. Real-time (Stream) Processing Source: TrivadisCopyright © William El Kaim 2018 95
  96. 96. Real-time (Stream) Processing Arch. Pattern Source: ClouderaCopyright © William El Kaim 2018 96
97. 97. Real-time (Stream) Processing • (Event-) Stream Processing • A one-at-a-time processing model • A datum is processed as it arrives • Sub-second latency • Difficult to process state data efficiently • Micro-Batching • A special case of batch processing with very small batch sizes (tiny) • A nice mix between batching and streaming • At the cost of latency • Gives stateful computation, making windowing an easy task Source: TrivadisCopyright © William El Kaim 2018 97
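The contrast between the two models can be sketched as follows. This is a toy illustration, not any engine's API: per-event processing handles each datum the moment it arrives, while micro-batching accumulates arrivals into tiny batches, trading a little latency for easy windowed computation.

```python
# One-at-a-time vs micro-batching, sketched over a finite event list.

def process_per_event(stream, handler):
    # Event-at-a-time: each datum is handled immediately (lowest latency).
    return [handler(event) for event in stream]

def micro_batches(stream, batch_size):
    # Micro-batching: accumulate events into small batches before processing.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = [1, 2, 3, 4, 5]
per_event = process_per_event(events, lambda e: e * 10)
batched = [sum(b) for b in micro_batches(events, batch_size=2)]
print(per_event)  # [10, 20, 30, 40, 50]
print(batched)    # [3, 7, 5] -- windowed aggregation comes almost for free
```

The batch boundaries are exactly why micro-batching makes windowed, stateful aggregation simple, and also why it cannot react to a single event any faster than one batch interval.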
  98. 98. Hybrid Computation Model • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results Volume Velocity Source: Rubén CasadoCopyright © William El Kaim 2018 98
  99. 99. Hybrid Computation: Lambda Architecture • Data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. • A system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. • This approach to architecture attempts to balance latency, throughput, and fault- tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. • The two view outputs may be joined before presentation. • Lambda Architecture case stories via Source: KrepsCopyright © William El Kaim 2018 99
  100. 100. • Batch layer • Receives arriving data, combines it with historical data and recomputes results by iterating over the entire combined data set. • The batch layer has two major tasks: • managing historical data; and recomputing results such as machine learning models. • Operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time. • The speed layer • Is used in order to provide results in a low-latency, near real-time fashion. • Receives the arriving data and performs incremental updates to the batch layer results. • Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced. • The serving layer enables various queries of the results sent from the batch and speed layers. Hybrid Computation: Lambda Architecture Source: KrepsCopyright © William El Kaim 2018 100
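The three layers can be sketched with a toy counting example. This is illustrative only: the batch layer recomputes from the full master dataset, the speed layer keeps a low-latency incremental view of data the batch layer has not yet absorbed, and the serving layer merges both at query time.

```python
# Toy Lambda architecture: batch view + real-time view, merged on query.

def batch_layer(master_dataset):
    # Full recomputation over all historical data (accurate, high latency).
    view = {}
    for key, value in master_dataset:
        view[key] = view.get(key, 0) + value
    return view

def speed_layer(recent_events):
    # Incremental view over events not yet absorbed by the batch view.
    view = {}
    for key, value in recent_events:
        view[key] = view.get(key, 0) + value
    return view

def serving_layer(batch_view, realtime_view, key):
    # Merge the two views when answering a query.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

history = [("clicks", 5), ("clicks", 3)]   # already in the master dataset
recent = [("clicks", 2)]                   # arrived since the last batch run
total = serving_layer(batch_layer(history), speed_layer(recent), "clicks")
print(total)  # 10
```

When the next batch run absorbs the recent events, the speed layer's view for that period is discarded, which is how the architecture keeps accuracy without sacrificing freshness.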
  101. 101. Hybrid computation: Lambda Architecture Source: MaprCopyright © William El Kaim 2018 101
  102. 102. Hybrid computation: Lambda Architecture DataTorrent • DataTorrent RTS Core • Open source enterprise-grade unified stream and batch platform • High performing, fault tolerant, scalable, Hadoop-native in-memory platform • Supports Kafka, HDFS, AWS S3n, NFS, (s)FTP, JMS • dtManage - DataTorrent Management Console • Hadoop-integrated application that provides an intuitive graphical management interface for Devops teams • manage, monitor, update and troubleshoot the DataTorrent RTS system and applications Source: DataTorrentCopyright © William El Kaim 2018 102
103. 103. Ex: Hybrid Computation (ex. Lambdoop) Batch Real-Time Hybrid Source: Novelti.ioCopyright © William El Kaim 2018 103
  104. 104. Ex: Lambda Architecture Source: DatastaxCopyright © William El Kaim 2018 104
  105. 105. Ex: Lambda Architecture Stacks Source: Helena EdelsonCopyright © William El Kaim 2018 105
106. 106. Different Streaming Architecture Vision • Major Hadoop distributors have different views on how streaming fits into traditional Hadoop architectures. • Hortonworks has taken a data plane approach (with HDP) • that seeks to virtually connect multiple data repositories in a federated manner • to unify the security and governance of data existing in different places (on- and off- premise data lakes like HDP and streaming data platforms like HDF). • Specifically, it’s building hooks between Apache Atlas (the data governance component) and Apache Knox (the security tool) that give customers a single view of their data. • MapR is going all-in on the converged approach that stresses the importance of a single unified data repository. • Cloudera, meanwhile, sits somewhere in the middle (although it’s probably closer to MapR). Source: DatanamiCopyright © William El Kaim 2018 106
107. 107. Ex: Lambda Architecture Cloudera Vision • Kafka as the piece of a larger real-time or near real-time architecture • Combination of Kafka and Spark Streaming for the so-called speed layer. • In conjunction with a batch layer, leading to the use of lambda architecture • Because people want to operate with a larger history of events • Kudu project as the real optimized store for Lambda architectures because • KUDU offers a happy medium between the scan performance of HDFS and the record-level updating capability of HBase. • It enables real-time response to single events and can be the speed layer and batch layer for a single store Source: DatanamiCopyright © William El Kaim 2018 107
  108. 108. Hybrid computation: Kappa Architecture • Proposal from Jay Kreps (LinkedIn) in this article. • Then talk “Turning the database inside out with Apache Samza” by Martin Kleppmann • Main objective • Avoid maintaining two separate code bases for the batch and speed layers (lambda). • Key benefits • Handle both real-time data processing and continuous data reprocessing using a single stream processing engine. • Data reprocessing is an important requirement for making visible the effects of code changes on the results. Source: KrepsCopyright © William El Kaim 2018 108
109. 109. Hybrid computation: Kappa Architecture • Architecture is composed of only two layers: • The stream processing layer runs the stream processing jobs. • Normally, a single stream processing job is run to enable real-time data processing. • Data reprocessing is only done when some code of the stream processing job needs to be modified. • This is achieved by running another modified stream processing job and replaying all previous data. • The serving layer is used to query the results (like the Lambda architecture). Source: O’ReillyCopyright © William El Kaim 2018 109
110. 110. Hybrid computation: Kappa Architecture • Intrinsically, there are four main principles in the Kappa architecture: • Everything is a stream: Batch operations become a subset of streaming operations. Hence, everything can be treated as a stream. • Immutable data sources: Raw data (data source) is persisted and views are derived, but a state can always be recomputed as the initial record is never changed. • Single analytics framework: Keep it short and simple (KISS) principle. A single analytics engine is required. Code, maintenance, and upgrades are considerably reduced. • Replay functionality: Computations and results can evolve by replaying the historical data from a stream. • The data pipeline must guarantee that events stay in order from generation to ingestion. This is critical to guarantee consistency of results, as this guarantees deterministic computation results. Running the same data twice through a computation must produce the same result. Source: MapRCopyright © William El Kaim 2018 110
  111. 111. Hybrid computation: Kappa Architecture • Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. • For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days. • When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table. • When the second job has caught up, switch the application to read from the new table. • Stop the old version of the job, and delete the old output table. Source: KrepsCopyright © William El Kaim 2018 111
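The replay procedure above can be simulated in a few lines of Python. This is a toy stand-in for Kafka plus a stream processing job: the retained log is a list, a "job" is a fold over it, and switching readers to the new output table replaces the old version.

```python
# Toy Kappa-style reprocessing: replay the retained log through a
# modified job into a new output table, then switch readers over.

retained_log = [("page", 1), ("page", 1), ("page", 1)]  # e.g. a Kafka topic

def run_job(log, transform):
    # A "stream job" here is just a fold over the retained log.
    table = {}
    for key, value in log:
        table[key] = table.get(key, 0) + transform(value)
    return table

output_v1 = run_job(retained_log, lambda v: v)       # original job's table
output_v2 = run_job(retained_log, lambda v: v * 2)   # modified job, replayed
serving_table = output_v2   # once v2 catches up: switch readers, drop v1
print(output_v1, serving_table)  # {'page': 3} {'page': 6}
```

Because the log is immutable and replay is deterministic, the new job's table is guaranteed to be what the old table would have been had the new code run from the start, which is the whole point of the Kappa approach.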
  112. 112. Hybrid computation: Kappa Archi. Example Source: TrivadisCopyright © William El Kaim 2018 112
  113. 113. Hybrid computation: Lambda vs. Kappa Lambda Kappa Source: Kreps Used to value all data in a unique treatment chain Used to provide the freshest data to customers Copyright © William El Kaim 2018 113
  114. 114. Hybrid computation: Lambda vs. Kappa Lambda Kappa Source: Ericsson Copyright © William El Kaim 2018 114
  115. 115. • Hadoop from V1 to V3 • Encoding Technologies • Ingestion Technologies • Storage Technologies • Processing Technologies • Big Data Fabric Copyright © William El Kaim 2018 115
116. 116. What is a Big Data Fabric? • Definition: • Bringing together disparate big data sources automatically, intelligently, and securely, and processing them in a big data platform technology, such as Hadoop and Apache Spark, to deliver a unified, trusted, and comprehensive view of customer and business data. • Big data fabric focuses on automating the process of ingestion, curation, and integrating big data sources to deliver intelligent insights that are critical for businesses to succeed. • The platform minimizes complexity by automating processes, generating big data technology and platform code automatically, and integrating workflows to simplify the deployment. • Big data fabric is not just about Hadoop or Spark — it comprises several components, all of which must work in tandem to deliver a flexible, integrated, secure, and scalable platform. • Big data fabric architecture has six core layers Source: ForresterCopyright © William El Kaim 2018 116
  117. 117. Big Data Fabric Architecture Source: Eckerson Group Source: Forrester Copyright © William El Kaim 2018 117
  118. 118. Big Data Fabric Six core Architecture Layers • Data ingestion layer. • The data ingestion layer deals with getting the big data sources connected, ingested, streamed, and moved into the data fabric. • Big data can come from devices, sensors, logs, clickstreams, databases, applications, and various cloud sources, in the form of structured or unstructured data. • Processing and persistence layer. • This layer uses Hadoop, Spark, and other Hadoop ecosystem components such as Kafka, Flume, and Hive to process and persist big data for use within the big data fabric framework. • Orchestration layer. • The orchestration layer is a critical layer of the big data fabric that transforms, integrates, and cleans data to support various use cases in real time or near real time. • It can transform data inside Hadoop to enable integration, or it can match and clean data dynamically. • Data discovery layer. • This layer automates the discovery of new internal or external big data sources and presents them as a new data asset for consumption by business users. • Dynamic discovery includes several components such as data modeling, data preparation, curation, and virtualization to deliver a flexible big data platform to support any use case. • Data management and intelligence layer. • This layer enables end-to-end data management capabilities that are essential to ensuring the reliability, security, integration, and governance of data. • Its components include data security, governance, metadata management, search, data quality, and lineage. • Data access layer. • This layer includes caching and in-memory technologies, self-service capabilities and interactions, and fabric components that can be embedded in analytical solutions, tools, and dashboards. Source: ForresterCopyright © William El Kaim 2018 118
  119. 119. Big Data Fabric Adoption Is In Its Infancy • Most enterprises that have a big data fabric platform are building it themselves by integrating various core open source technologies • In addition, they are supporting the platform with commercial products for data integration, security, governance, machine learning, SQL-on-Hadoop, and data preparation technologies. • However, organizations are realizing that creating a custom technology stack to support a big data fabric implementation (and then customizing it to meet business requirements) requires significant time and effort. • Solutions are starting to emerge from vendors. Source: ForresterCopyright © William El Kaim 2018 119
  120. 120. Big Data Fabric Examples • Arcadia • BlueData EPIC • Google Cloud • InfoWorks • InsightEdge • Kx Data Refinery • LightBend Fast Data Platform • Microsoft HDInsight • Octopeek • Qbole • SAP Data Hub • Splunk • StreamSets Data Operations Platform Copyright © William El Kaim 2018 120
  121. 121. Copyright © William El Kaim 2018 121