BIG DATA AND ORACLE
Sourabh Saxena
Tech Lead – Infosys
The Big Data architecture matters because it enables organizations to:
» Translate the vision and business goals into incremental projects that characterize an agile approach to development
» Be proactive in using Big Data technology and all the relevant internal and external data to drive better decisions or to monetize the data externally
» Provide a balance between long-term stability of Big Data assets and the need for innovation to leverage the latest and greatest technology, people, processes, and data
One of the building blocks of Big Data solutions for any organization is the Big Data architecture, which needs to
account for a mix of technology to address three primary use cases:
Operational intelligence. With a focus on real-time operations, this use case addresses the high-velocity attribute
of Big Data, with streaming data technology and event processing to enable decisions "on the fly." Architecturally,
we can think of data flowing simultaneously along two paths: real-time pipelining to facilitate real-time operations
and batch-oriented processes that capture and stage data at various states.
Exploration and discovery. Exploratory work looks to discover signals and nonobvious relationships or patterns among broader data sets to uncover new insights that can positively impact decision making and organizational performance. Once discovered and validated, new signals are operationalized to improve operational intelligence and monitored for ongoing quality using performance management processes. Discovery can be visual, collaborative, and algorithmic. High-volume data sets require rapid pivoting for exploration, sometimes along orthogonal dimensions. Discovery requires data processing optimization, flexible schemas, statistical modeling, fast processors, and lots of memory.
Performance management. This use case addresses strategic decisions about past performance as well as planning
functions. This is the traditional domain of the data warehouse for descriptive analytics and reporting. Adding Big
Data can increase the timeliness of business reporting and nuanced precision with larger data sets and new data
sources. Examples include the addition of reputation risk assessment, based on social media monitoring, into
broader risk metrics or the move toward real-time performance management.
Source: (White Paper) Six Patterns of Big Data and Analytics Adoption: The Importance of the Information
Architecture
What is Big Data?
Big Data describes a holistic information management strategy that includes and integrates many new types of data and
data management alongside traditional data. While many of the techniques to process and analyze these data types have
existed for some time, it has been the massive proliferation of data and the lower cost computing models that have
encouraged broader adoption.
Big Data has also been defined by the four “V”s: Volume, Velocity, Variety, and Value. These become a reasonable test to
determine whether you should add Big Data to your information architecture.
» Volume. The amount of data. While volume indicates more data, it is the granular nature of the data that is unique. Big
Data requires processing high volumes of low-density data, that is, data of unknown value, such as Twitter data feeds, clicks
on a web page, network traffic, sensor-enabled equipment capturing data at the speed of light, and many more. It is the task
of Big Data to convert low-density data into high-density data, that is, data that has value. For some companies, this might
be tens of terabytes, for others it may be hundreds of petabytes.
» Velocity. The fast rate at which data is received and perhaps acted upon. The highest velocity data normally streams directly into
memory versus being written to disk. Some Internet of Things (IoT) applications have health and safety ramifications that
require real-time evaluation and action. Other internet-enabled smart products operate in real-time or near real-time. As an
example, consumer eCommerce applications seek to combine mobile device location and personal preferences to make
time sensitive offers. Operationally, mobile application experiences have large user populations, increased network traffic,
and the expectation for immediate response.
» Variety. New unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional processing both to derive meaning and to build the supporting metadata. Once understood, unstructured data has many of
the same requirements as structured data, such as summarization, lineage, auditability, and privacy. Further complexity
arises when data from a known source changes without notice. Frequent or real-time schema changes are an enormous
burden for both transaction and analytical environments.
» Value. Data has intrinsic value, but it must be discovered. There is a range of quantitative and investigative techniques to derive value from data, from discovering a consumer preference or sentiment, to making a relevant offer by location, to identifying a piece of equipment that is about to fail. The technological breakthrough is that the cost of data storage and
compute has exponentially decreased, thus providing an abundance of data from which statistical sampling and other
techniques become relevant, and meaning can be derived. However, finding value also requires new discovery processes
involving clever and insightful analysts, business users, and executives. The real Big Data challenge is a human one, which
is learning to ask the right questions, recognizing patterns, making informed assumptions, and predicting behavior.
Big Data Reference Architecture Overview:
Traditional Information Architecture Components
In the illustration, you see two data sources that use integration (ELT/ETL/Change Data Capture) techniques to transfer data
into a DBMS data warehouse or operational data store, and then offer a wide variety of analytical capabilities to reveal the
data.
Adding Big Data Capabilities:
Big Data Information Architecture Components
The defining processing capabilities for big data architecture are to meet the volume, velocity, variety, and value
requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data
sets. There are differing technology strategies for real-time and batch processing storage requirements. For real-time, key-
value data stores, such as NoSQL, allow for high-performance, index-based retrieval. For batch processing, a technique known as MapReduce filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured or semi-structured databases, sent to mobile devices, or merged into a traditional data warehousing environment and correlated with structured data.
In addition to the new components, new architectures are emerging to efficiently accommodate new storage, access,
processing, and analytical requirements.
Big Data Architecture Patterns
A polyglot strategy suggests that big-data-oriented architectures will deploy multiple types of data stores.
With a Lambda-based architecture we can address the fast data required in an Internet of Things scenario: the Lambda architecture combines a real-time (speed) layer with a batch layer over the same incoming data and merges the two at query time.
The Kappa architecture performs all data processing in near-real-time or streaming mode; in simple terms, removing the batch layer from a Lambda architecture yields a Kappa architecture.
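To make the contrast concrete, the sketch below is a minimal, purely illustrative Python model of the Lambda idea; all class and variable names are hypothetical and not part of any product. A batch layer periodically recomputes a complete view from the master data set, a speed layer keeps incremental counts for recent events, and the serving layer merges both at query time. Dropping the batch layer and keeping only the streaming path gives the Kappa variant.

from collections import Counter

class LambdaCounts:
    """Toy Lambda-style view: a batch view merged with a real-time delta."""

    def __init__(self):
        self.batch_view = Counter()   # recomputed from the master data set
        self.speed_view = Counter()   # incremental counts for recent events

    def recompute_batch(self, master_events):
        # Batch layer: rebuild the view from all historical events, then
        # discard the speed-layer counts that the new batch view supersedes.
        self.batch_view = Counter(master_events)
        self.speed_view.clear()

    def on_event(self, event):
        # Speed layer: update the real-time view as each event arrives.
        self.speed_view[event] += 1

    def query(self, key):
        # Serving layer: merge batch and real-time views at query time.
        return self.batch_view[key] + self.speed_view[key]

# In a Kappa architecture the batch layer disappears: on_event() is the only
# write path, and historical views are rebuilt by replaying the event log.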
Unified Reference Architecture from Oracle
Conceptual model of the Oracle Big Data Platform for unified information management
A description of these primary components:
» Fast Data: Components which process data in-flight (streams) to identify actionable events, determine the next best action based on decision context and event profile data, and persist the data in a durable storage system. The decision context relies on data in the data reservoir or other enterprise information stores.
» Reservoir: Economical, scale-out storage and parallel processing for data which does not have stringent requirements for
data formalization or modelling. Typically manifested as a Hadoop cluster or staging area in a relational database.
» Factory: Management and orchestration of data into and between the Data Reservoir and Enterprise Information Store as
well as the rapid provisioning of data into the Discovery Lab for agile discovery.
» Warehouse: Large scale formalized and modelled business critical data store, typically manifested by a Data Warehouse
or Data Marts.
» Data Lab: A set of data stores, processing engines, and analysis tools separate from the data management activities to
facilitate the discovery of new knowledge. Key requirements include rapid data provisioning and subsetting, data
security/governance, and rapid statistical processing for large data sets.
» Business Analytics: A range of end-user and analytic tools for business intelligence, faceted navigation, and data mining, including dashboards, reports, and mobile access for timely and accurate reporting.
» Apps: A collection of prebuilt adapters and application programming interfaces that enable all data sources and
processing to be directly integrated into custom or packaged business applications.
The interplay of these components and their assembly into solutions can be further simplified by dividing the flow of data into
execution (tasks which support and inform daily operations) and innovation (tasks which drive new insights back to the business). Arranging solutions on either side of this division (as shown by the horizontal line) helps inform system
requirements for security, governance, and timeliness.
Enterprise Information Management Capabilities
Drilling a little deeper into the unified information management platform, here is Oracle’s holistic capability map
Oracle’s Unified Information Management Capabilities
A brief overview of these capabilities appears beginning on the left hand side of the diagram.
As various data types are ingested (under Acquire), they can either be written directly (real-time) into memory processes or
can be written to disk as messages, files, or database transactions. Once received, there are multiple options on where to
persist the data. It can be written to the file system, a traditional RDBMS, or distributed-clustered systems such as NoSQL
and Hadoop Distributed File System (HDFS). The primary techniques for rapid evaluation of unstructured data are batch MapReduce (Hadoop) and in-memory processing (Spark). Additional evaluation options are available for real-time
streaming data.
The integration layer in the middle (under Organize) is extensive and enables an open ingest, data reservoir, data
warehouse, and analytic architecture. It extends across all of the data types and domains, and manages the bi-directional
gap between the traditional and new data acquisition and processing environments. Most importantly, it meets the
requirements of the four Vs: extreme volume and velocity, variety of data types, and finding value wherever your analytics
operate. In addition, it provides data quality services, maintains metadata, and tracks transformation lineage.
The Big Data processing output, having been converted from low-density to high-density data, is loaded into a foundation data layer, data warehouse, data marts, data discovery labs, or back into the reservoir. Of note, the discovery lab requires fast connections to the data reservoir, event processing, and the data warehouse. For all of these reasons, a high-speed network such as InfiniBand provides the data transport.
The next layer (under Analyze) is where the “reduction-results” are loaded from Big Data processing output into your data
warehouse for further analysis. You will notice that the reservoir and the data warehouse both offer ‘in-place’ analytics which
means that analytical processing can occur on the source system without an extra step to move the data to another
analytical environment. The SQL analytics capability allows simple and complex analytical queries to run optimally at each data store independently and on separate systems, as well as to combine results in a single query.
There are many options at this layer that can improve performance by orders of magnitude. By leveraging Oracle Exadata for your data warehouse, processing can be enhanced with flash memory, columnar databases, in-memory databases, and more. Also, a critical capability for the discovery lab is fast, high-powered search, known as faceted navigation, to support a responsive investigative environment.
The Business Intelligence layer (under Decide) is equipped with interactive, real-time, and data modeling tools. These tools
are able to query, report and model data while leaving the large volumes of data in place. These tools include advanced
analytics, in-database and in-reservoir statistical analysis, and advanced visualization, in addition to the traditional
components such as reports, dashboards, alerts and queries.
Governance, security, and operational management also cover the entire data and information landscape at the enterprise level.
With a unified architecture, the business and analytical users can rely on richer, high quality data. Once ready for
consumption, the data and analysis flow is seamless as users navigate through various data and information sets, test hypotheses, analyze patterns, and make informed decisions.
Oracle Open Integration Architecture
Oracle’s family of Integration Products supports nearly all of the Apache Big Data technologies as well as many non-Oracle
products. Core integration capabilities support an entire infrastructure around data movement and transformation that
include integration orchestration, data quality, data lineage, and data governance. The modern data warehouse is no longer
confined to a single physical solution, so complementing technologies that enable this new logical data warehouse are more
important than ever.
Ingest Capability
There are a number of methods to introduce data into a Big Data platform.
Apache Flume
» Flume provides a distributed, reliable, and available service for efficiently moving large amounts of log data and other data.
It captures and processes data asynchronously. A data event will capture data in a queue (channel) and then a consumer
will dequeue the event (sink) on demand. Once consumed, the data in the original queue is removed, which forces writing of the data to another log or HDFS for archival purposes. Data can be reliably advanced through multiple states by linking queues (sinks to channels) with 100% recoverability. Data can be processed in the file system or in memory; however, in-memory processing is not recoverable.
Apache Storm
» Storm provides a distributed real-time, parallelized computation system that runs across a cluster of nodes. The topology
is designed to consume streams of data and process those streams in arbitrarily complex ways, repartitioning the streams
between each stage of the computation. Use cases can include real-time analytics, on-line machine learning, continuous
computation, distributed RPC, ETL, and more.
Apache Kafka
» Kafka is an Apache publish-subscribe messaging system where messages are immediately written to the file system and replicated within the cluster to prevent data loss. Messages are not deleted when they are read but are retained under a configurable SLA. A single cluster serves as the central data backbone and can be elastically expanded without downtime.
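As an illustration, the sketch below uses the open source kafka-python client; the broker address, topic name, and message payload are placeholders rather than anything prescribed by the source material.

from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "clickstream"       # hypothetical topic

# Producer: messages are appended to the broker's log and replicated.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b'{"user": 42, "page": "/checkout"}')
producer.flush()

# Consumer: reading does not delete messages; retention is policy driven.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value)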
Apache Spark Streaming:
» Spark Streaming extends Spark to large-scale stream processing, scaling to hundreds of nodes while achieving second-scale latencies. Spark Streaming supports both Java and Scala, which
makes it easy for users to map, filter, join, and reduce streams (among other operations) using functions in the Scala/Java
programming language. It integrates with Spark’s batch and interactive processing while maintaining fault tolerance similar
to batch systems that can recover from both outright failures and stragglers. In addition, Spark Streaming provides support
for applications with requirements for combining data streams with historical data computed through batch jobs or ad-hoc
queries, providing a powerful real-time analytics environment.
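The Python API mirrors the Scala/Java operations described above. The following sketch counts words arriving on a TCP socket in five-second micro-batches; the host, port, and application name are illustrative assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 5)   # 5-second micro-batches

# Hypothetical source: text lines arriving on a local TCP socket.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()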
Oracle Stream Explorer
» Stream Explorer can process multiple event streams, detecting patterns and trends in real time, and then initiating an
action. It can be deployed as standalone, integrated in the SOA stack, or in a lightweight fashion on embedded Java. Stream
Explorer can ensure that downstream applications and service-oriented and event-driven architectures are driven by true, real-time intelligence.
Oracle GoldenGate
» GoldenGate enables log-based change data capture, distribution, transformation, and delivery. It supports heterogeneous data management systems and operating systems and provides bidirectional replication without distance limitation. GoldenGate can ensure transactional integrity, reliable data delivery, and fast recovery after interruptions.
Distributed File System Capability
Hadoop Distributed File System (HDFS):
» HDFS is an Apache open source distributed file system that runs on high-performance commodity hardware and on appliances built with such hardware (e.g., Oracle Big Data Appliance). It is designed to be deployed on highly scalable nodes and associated storage. HDFS provides automatic data replication (usually deployed as triple replication) for fault tolerance.
Most organizations deploy data and manipulate it directly in HDFS for write once, read many applications such as those
common in analytics.
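For illustration, a write-once, read-many interaction with HDFS might look like the sketch below, which assumes the open source Python hdfs (WebHDFS) client and placeholder NameNode address, user, and paths.

from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Hypothetical NameNode WebHDFS endpoint and HDFS user.
client = InsecureClient("http://namenode:50070", user="hdfs")

# Write once: land a raw file in the reservoir...
client.write("/data/raw/clicks/2016-03-01.json",
             data=b'{"user": 42, "page": "/checkout"}', overwrite=True)

# ...then read it many times for analytics.
with client.read("/data/raw/clicks/2016-03-01.json") as reader:
    print(reader.read())

print(client.list("/data/raw/clicks"))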
Cloudera Manager:
» Cloudera Manager is an end-to-end management application for Cloudera’s Distribution of Apache Hadoop (CDH).
» Cloudera Manager gives a cluster-wide, real-time view of nodes and services running; provides a single, central place to
enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help
optimize cluster performance and utilization.
Data Management Capability
Apache HBase:
» Apache HBase is designed to provide random read/write access to very large non-relational tables deployed in Hadoop.
Among the features are linear and modular scalability, strictly consistent reads and writes, automatic and configurable
sharding, automatic failover between Region Servers, base classes for backing Hadoop MapReduce jobs with Apache HBase tables,
Java API client access, and a REST-ful web service.
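A minimal sketch of random reads and writes against HBase using the open source happybase client, which talks to the HBase Thrift gateway (one access path alongside the Java API and REST service mentioned above); the host, table name, column family, and row keys are hypothetical.

import happybase  # pip install happybase; connects to the HBase Thrift server

connection = happybase.Connection("hbase-thrift-host")  # assumed gateway host
table = connection.table("clicks")                       # hypothetical table

# Random write keyed by row key (user id plus timestamp in this example).
table.put(b"user42#20160301T1203", {b"d:page": b"/checkout", b"d:ms": b"187"})

# Random read and prefix scan against the same table.
print(table.row(b"user42#20160301T1203"))
for key, data in table.scan(row_prefix=b"user42#"):
    print(key, data)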
Apache Kudu:
» Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic
workloads across a single storage layer. As a more recent complement to HDFS and Apache HBase, Kudu gives architects
the flexibility to address a wider variety of use cases without exotic workarounds. For example, Kudu can be used in
situations that require fast analytics on fast (rapidly changing) data. Kudu promises to lower query latency significantly for
Apache Impala and Apache Spark initially, with other execution engines to come.
Oracle NoSQL Database:
» For high-transaction environments (not just append-only workloads), where data models call for table-based key-value pairs, consistency is defined by policies, and superior availability is required, Oracle NoSQL Database excels in web scale-out and click-stream type low-latency environments.
» Oracle NoSQL Database is designed as a highly scalable, distributed database based on Oracle Berkeley DB (originally developed by Sleepycat Software). Oracle NoSQL Database is a general-purpose, enterprise-class key-value store that adds an intelligent driver on
top of an enhanced distributed Berkeley database. This intelligent driver keeps track of the underlying storage topology,
understands and uses data shards where necessary, and knows where data can be placed in a clustered environment for
the lowest possible latency. Unlike competitive solutions, Oracle NoSQL Database is easy to install, configure and manage.
It supports a broad set of workloads and delivers enterprise-class reliability backed by enterprise-class Oracle support.
» Using Oracle NoSQL Database allows data to be more efficiently acquired, organized and analyzed. Primary use cases
include low latency capture plus fast querying of the same data as it is being ingested, most typically by key-value lookup.
Examples of such use cases include credit card transaction environments; high-velocity, low-latency embedded device data capture; and high-volume stock market trading applications. Oracle NoSQL Database can also provide a near-consistent
Oracle Database table copy of key-value pairs where a high rate of updates is required. It can serve as a target for Oracle
GoldenGate change data capture and be used in conjunction with event processing via Oracle Stream Explorer and Oracle
Real Time Decisions. The product is available in both an open source community edition and an enterprise edition for large
distributed data centers. The latter version is part of the Big Data Appliance.
Processing Capability
Apache Hadoop:
» The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across
clusters of nodes using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Processing capabilities for query and analysis are primarily
delivered using programs and utilities that leverage MapReduce and Spark. Other key technologies include the Hadoop
Distributed File System (HDFS) and YARN (a framework for job scheduling and cluster resource management).
MapReduce:
» MapReduce relies on a linear dataflow structure in which work is evenly allocated across highly distributed programs. Apache provides MapReduce in Hadoop as a programming model and an implementation designed to process large data sets in parallel where the data resides on disks in a cluster.
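As a concrete, if simplified, instance of the model, the following Hadoop Streaming style mapper and reducer implement the classic word count in Python. Both phases read stdin and write stdout; the file name and invocation are illustrative assumptions, and such scripts are normally submitted through the Hadoop Streaming jar, which handles input splits and the sort-and-shuffle between the two phases.

#!/usr/bin/env python
# wordcount.py: run as "wordcount.py map" for the map phase, otherwise reduce.
import sys

def mapper():
    # Map: emit (word, 1) for every word read from the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce: input arrives sorted by key, so sum runs of identical words.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()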
Apache Spark:
» Spark provides programmers with an application programming interface centered on a data structure called the resilient
distributed dataset (RDD). Spark's RDDs function as a fast working set or cache for distributed programs that in essence
provide a form of distributed shared memory. The speed of access and availability of this working dataset facilitates high
performance implementations of two common algorithm paradigms: (1) iterative algorithms, which must access and reuse data multiple times, and (2) newer analytics and exploratory types of processing that are akin to the processing and query models found in traditional databases. Leading the class in the category of iterative algorithms is machine learning.
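A small PySpark sketch of the iterative pattern: the RDD is cached once and then reused on every pass of a toy gradient loop instead of being re-read from disk. The input path and the one-label, one-feature data layout are assumptions made for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="IterativeRDD")

# Hypothetical input: each line is "label,feature".
points = (sc.textFile("hdfs:///data/labelled_points.csv")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())   # keep the working set in memory across iterations

# Iterative access: the cached RDD is reused on every pass, which is what
# makes machine-learning style loops fast on Spark.
weight = 0.0
for _ in range(10):
    w = weight  # snapshot the current value for the closure sent to workers
    gradient = points.map(lambda p: (p[1] * w - p[0]) * p[1]).mean()
    weight -= 0.1 * gradient

print(weight)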
Data Integration Capability
Oracle Big Data Connectors - Oracle Loader for Hadoop, Oracle Data Integrator:
» The Oracle Loader for Hadoop enables parallel high-speed loading of data from Hadoop into an Oracle Database. Oracle
Data Integrator Enterprise Edition in combination with the Big Data Connectors enables high performance data movement
and deployment of data transformations in Hadoop. Other features in the Big Data Connectors include an Oracle SQL
Connector for HDFS, Oracle R Advanced Analytics for Hadoop, and Oracle XQuery for Hadoop.
SQL Data Access
Oracle Big Data SQL
» Big Data SQL enables Oracle SQL queries to be initiated against data also residing in Apache Hadoop clusters and
NoSQL databases. Oracle Database 12c provides the means to query this data using external tables. Smart Scan
capabilities deployed on the other data sources minimize data movement and maximize performance. Because queries
are initiated through the Oracle Database, advanced security, data redaction, and virtual private database capabilities are
extended to Hadoop and NoSQL databases.
Apache Hive:
» Hive provides a mechanism to project structure onto Hadoop data sets and query the data using a SQL-like language
called HiveQL. The language also enables traditional MapReduce programmers to plug in their custom mappers and
reducers when it is inconvenient or inefficient to express this logic in HiveQL. It only contains metadata that describes data
access in Apache HDFS and Apache HBase, not the data itself. More recently, HiveQL query execution is commonly used
with Spark for better performance.
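A hedged example of projecting structure onto files already sitting in HDFS and querying them with HiveQL, here issued through a Hive-enabled Spark session to match the Hive-on-Spark usage noted above; the table name, schema, and HDFS location are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveQLOverHDFS")
         .enableHiveSupport()
         .getOrCreate())

# Metadata only: Hive records the schema and location, it does not copy data.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS web_clicks (
    user_id BIGINT, page STRING, ts STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/raw/clicks'
""")

# Ordinary SQL over the projected structure, executed by Spark.
spark.sql("""
  SELECT page, COUNT(*) AS hits
  FROM web_clicks
  GROUP BY page
  ORDER BY hits DESC
  LIMIT 10
""").show()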
Apache Impala - Cloudera
» With Impala, you can query data, whether stored in HDFS or Apache HBase, using typical SQL operations (select, join, and various aggregate functions) an order of magnitude faster than with Hive on MapReduce. To avoid latency and
improve application speed, Impala circumvents MapReduce to directly access the data through a specialized distributed
query engine that is very similar to those found in commercial parallel RDBMSs.
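For comparison, the same kind of query can be sent straight to the Impala daemons, for example through the open source impyla DB-API client; the host name is a placeholder, and 21050 is the usual HiveServer2-protocol port exposed by Impala.

from impala.dbapi import connect  # pip install impyla

conn = connect(host="impala-daemon-host", port=21050)  # assumed endpoint
cur = conn.cursor()

# Same table as the Hive example, but executed by Impala's MPP engine
# rather than MapReduce, so interactive queries return in seconds.
cur.execute("""
  SELECT page, COUNT(*) AS hits
  FROM web_clicks
  GROUP BY page
  ORDER BY hits DESC
  LIMIT 10
""")
for row in cur.fetchall():
    print(row)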
Statistical Analysis Capability
Open source Project R and Oracle R Enterprise (part of Oracle Advanced Analytics):
» R is a programming language for statistical analysis. Oracle R Enterprise enables R algorithms to run in parallel inside the Oracle Database, without the need to move data out of the data store. Oracle R Advanced Analytics for Hadoop (ORAAH) is bundled into Oracle’s Big Data Connectors and provides high-performance in-Hadoop statistical analysis capabilities leveraging Spark and MapReduce.
Spatial & Graph Capability
Oracle Big Data Spatial and Graph:
» Oracle Big Data Spatial and Graph provides analytic services and data models supporting Big Data workloads on Apache
Hadoop and NoSQL database technologies. Oracle Big Data Spatial and Graph includes two main components: a property graph database with 35 built-in graph analytics that discover relationships, recommendations, and other graph patterns in big data; and a wide range of spatial analysis functions and services to evaluate data based on how near or far things are to one another, whether something falls within a boundary or region, and to process and visualize geospatial map data and imagery.
Information Discovery
Oracle Big Data Discovery
» Oracle Big Data Discovery provides an interactive interface into Apache Hadoop (e.g. Cloudera, Hortonworks) to easily
find and explore data, quickly transform and enrich data, intuitively discover new insights by combining data sets, and share
the results through highly visual interfaces.
Business Intelligence
Oracle Business Intelligence Suite
» Oracle Business Intelligence Suite provides a single platform for ad-hoc query and analysis, report publishing, and data
visualization from your workstation or mobile device. Direct access to Hadoop is supported via Hive and Impala.
Real-Time Recommendation Engine
Oracle Real-Time Decisions
» Using Oracle RTD, business logic can be expressed in the form of rules and self-learning predictive models that can
recommend optimal courses of action with very low latency using specific performance objectives. This is accomplished by
using data from different “channels”. Whether it’s on the web in the form of click-stream data, in a call center where
computer telephony integration provides valuable insight to call center agents, or at the point-of-sale, Oracle RTD can be
combined with the complex event processing of Oracle Stream Explorer to create a complete event-based Decision
Management System.
Source: An Enterprise Architect’s Guide to Big Data Reference Architecture Overview
ORACLE ENTERPRISE ARCHITECTURE WHITE PAPER | MARCH 2016
How Big Data is designed for cloud offerings
Big Data Administrator, Data Architect
» Manages the Big Data clusters
» Monitors data and network traffic, prevents glitches
» Integrates, centralizes, protects, and maintains data sources
» Grants and revokes permissions to various clients and nodes
Roadmap to Big Data Architect
» Get your hands on a Big Data platform, cloud services, or VMs, e.g., Oracle BDALite Vbox, Oracle Big Data Cloud Services, Oracle BDA
» Leverage your Oracle background and notions: clusters, nodes, Oracle SQL (Big Data SQL), Big Data Connectors (e.g., Oracle Datasource for Hadoop)
» Get familiar with Big Data databases and storage: HDFS, NoSQL, DBMSes
» Get familiar with key Big Data frameworks: Hadoop, Spark, Flink, Beam, streaming frameworks (Kafka, Storm), and their integration with Oracle (OD4H, OD4S, and so on)
» Get familiar with Big Data tools and programming: Hive SQL, Spark SQL, visualization tools, R, Java, Scala, and so on
» Read, practice, and get involved in Big Data projects
Resources
• Apache Projects
http://hadoop.apache.org/
http://spark.apache.org/
http://flink.apache.org/
https://beam.apache.org/
• Big Data in the Cloud, Big Data Compute Edition
https://cloud.oracle.com/en_US/big-data
https://cloud.oracle.com/en_US/big-data-compute-edition
• Big Data Connectors
https://www.oracle.com/database/big-data-connectors/index.html
Más contenido relacionado

La actualidad más candente

What is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesWhat is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesTony Pearson
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
DOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEDOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEijsptm
 
Performance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and morePerformance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and moreDenodo
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsDenodo
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
Making Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsMaking Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsRackspace
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?Thanakrit Lersmethasakul
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Denodo’s Data Catalog: Bridging the Gap between Data and Business
Denodo’s Data Catalog: Bridging the Gap between Data and BusinessDenodo’s Data Catalog: Bridging the Gap between Data and Business
Denodo’s Data Catalog: Bridging the Gap between Data and BusinessDenodo
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIMC Institute
 
Unit i big data introduction
Unit  i big data introductionUnit  i big data introduction
Unit i big data introductionSujaMaryD
 
GDPR Noncompliance: Avoid the Risk with Data Virtualization
GDPR Noncompliance: Avoid the Risk with Data VirtualizationGDPR Noncompliance: Avoid the Risk with Data Virtualization
GDPR Noncompliance: Avoid the Risk with Data VirtualizationDenodo
 

La actualidad más candente (19)

What is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesWhat is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use Cases
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligence
 
DOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEDOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCE
 
Performance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and morePerformance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and more
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural Components
 
Big data
Big dataBig data
Big data
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Making Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsMaking Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High Expectations
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Denodo’s Data Catalog: Bridging the Gap between Data and Business
Denodo’s Data Catalog: Bridging the Gap between Data and BusinessDenodo’s Data Catalog: Bridging the Gap between Data and Business
Denodo’s Data Catalog: Bridging the Gap between Data and Business
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
 
Unit i big data introduction
Unit  i big data introductionUnit  i big data introduction
Unit i big data introduction
 
GDPR Noncompliance: Avoid the Risk with Data Virtualization
GDPR Noncompliance: Avoid the Risk with Data VirtualizationGDPR Noncompliance: Avoid the Risk with Data Virtualization
GDPR Noncompliance: Avoid the Risk with Data Virtualization
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 

Similar a Big data and oracle

Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfrajsharma159890
 
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to LifeEvolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to LifeSG Analytics
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfPridesys IT Ltd.
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data MiningIOSR Journals
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfPridesys IT Ltd.
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345AkhilSinghal21
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Scienceijtsrd
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationDenodo
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefitsRicky Barron
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Nathan Bijnens
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond Rajesh Kumar
 
An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingPim Piepers
 

Similar a Big data and oracle (20)

Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to LifeEvolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdf
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data Mining
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdf
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
 
Data Science
Data ScienceData Science
Data Science
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
1
11
1
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt Thearling
 

Último

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Big data and oracle

  • 1. BIG DATA AND ORACLE Sourabh Saxena Tech Lead – Infosys The Big Data architecture matters because it enables organizations to:  Translate the vision and business goals into incremental projects that characterize an agile approach to development  Be proactive in using Big Data technology and all the relevant internal and external data to drive better decisions or to monetize the data externally  Provide a balance between long-term stability of Big Data assets and the need for innovation to leverage the latest and greatest technology, people, processes, and data One of the building blocks of Big Data solutions for any organization is the Big Data architecture, which needs to account for a mix of technology to address three primary use cases Operational intelligence. With a focus on real-time operations, this use case addresses the high-velocity attribute of Big Data, with streaming data technology and event processing to enable decisions "on the fly." Architecturally, we can think of data flowing simultaneously along two paths: real-time pipelining to facilitate real-time operations and batch-oriented processes that capture and stage data at various states. Exploration and discovery. Exploratory work looks to discover signals and nonobvious relationships or patterns among broader data sets to uncover new insights that can positively impact decision making and organizational performance Once discovered and validated, newly discovered signals are operationalized to improve operational intelligence and monitored for ongoing quality using performance management processes. Discovery can be visual, collaborative, and algorithmic. High-volume data sets require rapid pivoting for exploration — sometimes orthogonally connected. Discovery requires data processing optimization, flexible schemas, statistical modeling, fast processors, and lots of memory.
  • 2. Performance management. This use case addresses strategic decisions about past performance as well as planning functions. This is the traditional domain of the data warehouse for descriptive analytics and reporting. Adding Big Data can increase the timeliness of business reporting and nuanced precision with larger data sets and new data sources. Examples include the addition of reputation risk assessment, based on social media monitoring, into broader risk metrics or the move toward real-time performance management. Source: (White Paper) Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture What is Big Data? Big Data describes a holistic information management strategy that includes and integrates many new types of data and data management alongside traditional data. While many of the techniques to process and analyze these data types have existed for some time, it has been the massive proliferation of data and the lower cost computing models that have encouraged broader adoption. Big Data has also been defined by the four “V”s: Volume, Velocity, Variety, and Value. These become a reasonable test to determine whether you should add Big Data to your information architecture. » Volume. The amount of data. While volume indicates more data, it is the granular nature of the data that is unique. Big Data requires processing high volumes of low-density data, that is, data of unknown value, such as twitter data feeds, clicks on a web page, network traffic, sensor-enabled equipment capturing data at the speed of light, and many more. It is the task of Big Data to convert low-density data into high-density data, that is, data that has value. For some companies, this might be tens of terabytes, for others it may be hundreds of petabytes.
  • 3. » Velocity. A fast rate that data is received and perhaps acted upon. The highest velocity data normally streams directly into memory versus being written to disk. Some Internet of Things (IoT) applications have health and safety ramifications that require real-time evaluation and action. Other internet-enabled smart products operate in real-time or near real-time. As an example, consumer eCommerce applications seek to combine mobile device location and personal preferences to make time sensitive offers. Operationally, mobile application experiences have large user populations, increased network traffic, and the expectation for immediate response. » Variety. New unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video require additional processing to both derive meaning and the supporting metadata. Once understood, unstructured data has many of the same requirements as structured data, such as summarization, lineage, auditability, and privacy. Further complexity arises when data from a known source changes without notice. Frequent or real-time schema changes are an enormous burden for both transaction and analytical environments. » Value. Data has intrinsic value—but it must be discovered. There are a range of quantitative and investigative techniques to derive value from data – from discovering a consumer preference or sentiment, to making a relevant offer by location, or for identifying a piece of equipment that is about to fail. The technological breakthrough is that the cost of data storage and compute has exponentially decreased, thus providing an abundance of data from which statistical sampling and other techniques become relevant, and meaning can be derived. However, finding value also requires new discovery processes involving clever and insightful analysts, business users, and executives. The real Big Data challenge is a human one, which is learning to ask the right questions, recognizing patterns, making informed assumptions, and predicting behavior. Big Data Reference Architecture Overview: Traditional Information Architecture Components In the illustration, you see two data sources that use integration (ELT/ETL/Change Data Capture) techniques to transfer data into a DBMS data warehouse or operational data store, and then offer a wide variety of analytical capabilities to reveal the data.
• 4. Adding Big Data Capabilities: Big Data Information Architecture Components
The defining processing capabilities for a big data architecture are to meet the volume, velocity, variety, and value requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data sets, and there are differing technology strategies for real-time and batch processing storage requirements. For real-time processing, key-value data stores, such as NoSQL databases, allow for high-performance, index-based retrieval. For batch processing, a technique known as MapReduce filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured or semi-structured databases, sent to mobile devices, or merged into a traditional data warehousing environment and correlated to structured data.
In addition to the new components, new architectures are emerging to efficiently accommodate new storage, access, processing, and analytical requirements.
Big Data Architecture Patterns
A polyglot strategy suggests that big-data-oriented architectures will deploy multiple types of data stores.
The Lambda architecture combines a real-time (speed) layer with a batch layer, giving a single solution for both fast data, such as that produced in an Internet of Things architecture, and complete views recomputed from the full batch data set.
The Kappa architecture performs all data processing in near-real-time or streaming mode; in simple terms, removing the batch layer from a Lambda architecture yields a Kappa architecture.
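To make the Lambda idea concrete, here is a minimal, self-contained Scala sketch (not tied to any particular product) in which a serving query merges a precomputed batch view with increments accumulated by the speed layer; the page names and counts are made-up illustration data.

```scala
// Minimal Lambda-style serving sketch: a query merges a periodically rebuilt
// batch view with increments accumulated by the speed (streaming) layer.
object LambdaServingSketch {

  // Batch layer output: page-view counts recomputed from the master data set.
  val batchView: Map[String, Long] = Map("home" -> 10000L, "checkout" -> 2500L)

  // Speed layer output: counts for events that arrived after the last batch run.
  private val realtimeView =
    scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)

  def recordEvent(page: String): Unit =        // called as each click streams in
    realtimeView(page) += 1

  def query(page: String): Long =              // serving layer merges both views
    batchView.getOrElse(page, 0L) + realtimeView(page)

  def main(args: Array[String]): Unit = {
    Seq("home", "home", "checkout").foreach(recordEvent)
    println(s"home views = ${query("home")}")  // prints 10002
  }
}
```

In a Kappa architecture the batch view would disappear: the same counts would be rebuilt whenever needed by replaying the event log through the streaming layer alone.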
• 5. Unified Reference Architecture from Oracle
Conceptual model for the Oracle Big Data Platform for unified information management. A description of these primary components:
» Fast Data: Components which process data in flight (streams) to identify actionable events, determine the next-best action based on decision context and event profile data, and persist the results in a durable storage system. The decision context relies on data in the data reservoir or other enterprise information stores.
» Reservoir: Economical, scale-out storage and parallel processing for data which does not have stringent requirements for data formalization or modelling. Typically manifested as a Hadoop cluster or a staging area in a relational database.
» Factory: Management and orchestration of data into and between the Data Reservoir and Enterprise Information Store, as well as the rapid provisioning of data into the Discovery Lab for agile discovery.
» Warehouse: Large-scale, formalized and modelled business-critical data store, typically manifested by a Data Warehouse or Data Marts.
» Data Lab: A set of data stores, processing engines, and analysis tools, kept separate from the data management activities, that facilitates the discovery of new knowledge. Key requirements include rapid data provisioning and subsetting, data security/governance, and rapid statistical processing for large data sets.
» Business Analytics: A range of end-user and analytic tools for business intelligence, faceted navigation, and data mining, including dashboards, reports, and mobile access for timely and accurate reporting.
» Apps: A collection of prebuilt adapters and application programming interfaces that enable all data sources and processing to be directly integrated into custom or packaged business applications.
The interplay of these components and their assembly into solutions can be further simplified by dividing the flow of data into execution (tasks which support and inform daily operations) and innovation (tasks which drive new insights back to the business). Arranging solutions on either side of this division (as shown by the horizontal line) helps inform system requirements for security, governance, and timeliness.
Enterprise Information Management Capabilities
Drilling a little deeper into the unified information management platform, here is Oracle's holistic capability map.
• 6. Oracle’s Unified Information Management Capabilities
A brief overview of these capabilities appears beginning on the left-hand side of the diagram.
As various data types are ingested (under Acquire), they can either be written directly (in real time) into memory processes or written to disk as messages, files, or database transactions. Once received, there are multiple options for where to persist the data. It can be written to the file system, a traditional RDBMS, or distributed, clustered systems such as NoSQL databases and the Hadoop Distributed File System (HDFS). The primary techniques for rapid evaluation of unstructured data are running MapReduce (Hadoop) in batch or Spark in memory. Additional evaluation options are available for real-time streaming data.
The integration layer in the middle (under Organize) is extensive and enables an open ingest, data reservoir, data warehouse, and analytic architecture. It extends across all of the data types and domains, and manages the bidirectional gap between the traditional and new data acquisition and processing environments. Most importantly, it meets the requirements of the four Vs: extreme volume and velocity, variety of data types, and finding value wherever your analytics operate. In addition, it provides data quality services, maintains metadata, and tracks transformation lineage. The Big Data processing output, having been converted from low-density to high-density data, is loaded into a foundation data layer, data warehouse, data marts, data discovery labs, or back into the reservoir. Of note, the discovery lab requires fast connections to the data reservoir, event processing, and the data warehouse; for all of these reasons a high-speed network, such as InfiniBand, provides the data transport.
The next layer (under Analyze) is where the “reduction results” from the Big Data processing output are loaded into your data warehouse for further analysis. You will notice that the reservoir and the data warehouse both offer ‘in-place’ analytics, which means that analytical processing can occur on the source system without an extra step to move the data to another analytical environment. The SQL analytics capability allows simple and complex analytical queries to run optimally at each data store independently and on separate systems, as well as combining results in a single query. There are many performance options at this layer which can improve performance by orders of magnitude. By leveraging Oracle Exadata for your data warehouse, processing can be enhanced with flash memory, columnar databases, in-memory databases, and more. Also, a critical capability for the discovery lab is fast, high-powered search, known as faceted navigation, to support a responsive investigative environment.
The Business Intelligence layer (under Decide) is equipped with interactive, real-time, and data modeling tools. These tools are able to query, report, and model data while leaving the large volumes of data in place. They include advanced analytics, in-database and in-reservoir statistical analysis, and advanced visualization, in addition to the traditional components such as reports, dashboards, alerts, and queries.
Governance, security, and operational management also cover the entire spectrum of the data and information landscape at the enterprise level. With a unified architecture, business and analytical users can rely on richer, higher-quality data. Once the data is ready for consumption, the flow of data and analysis is seamless as users navigate through various data and information sets, test hypotheses, analyze patterns, and make informed decisions.
Oracle Open Integration Architecture
Oracle’s family of integration products supports nearly all of the Apache Big Data technologies as well as many non-Oracle products. Core integration capabilities support an entire infrastructure around data movement and transformation, including integration orchestration, data quality, data lineage, and data governance. The modern data warehouse is no longer confined to a single physical solution, so complementing technologies that enable this new logical data warehouse are more important than ever.
Ingest Capability
There are a number of methods to introduce data into a Big Data platform.
Apache Flume
» Flume provides a distributed, reliable, and available service for efficiently moving large amounts of log data and other data. It captures and processes data asynchronously: a data event places data in a queue (channel), and a consumer (sink) then dequeues the event on demand. Once consumed, the data in the original queue is removed, which forces the data to be written to another log or to HDFS for archival purposes. Data can be reliably advanced through multiple stages by linking sinks to channels, with 100% recoverability. Data can be processed in the file system or in memory; however, in-memory processing is not recoverable.
Apache Storm
» Storm provides a distributed, real-time, parallelized computation system that runs across a cluster of nodes. A topology is designed to consume streams of data and process those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation. Use cases include real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
Apache Kafka
» Kafka is an Apache publish-subscribe messaging system in which messages are immediately written to the file system and replicated within the cluster to prevent data loss. Messages are not deleted when they are read but retained according to a configurable SLA. A single cluster serves as the central data backbone and can be elastically expanded without downtime.
Apache Spark Streaming
» Spark Streaming is an extension of Spark for large-scale stream processing; it is capable of scaling to hundreds of nodes and achieves second-scale latencies. Spark Streaming supports both Java and Scala, which makes it easy for users to map, filter, join, and reduce streams (among other operations) using functions in the Scala/Java programming languages. It integrates with Spark’s batch and interactive processing while maintaining fault tolerance similar to batch systems, recovering from both outright failures and stragglers. In addition, Spark Streaming supports applications that need to combine data streams with historical data computed through batch jobs or ad hoc queries, providing a powerful real-time analytics environment.
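As an illustration of the streaming style described above, here is a hedged Scala sketch that consumes a Kafka topic with Spark Streaming and counts clicks per page in five-second micro-batches. It assumes the spark-streaming-kafka-0-10 integration artifact is on the classpath; the broker address, topic name, and consumer group id are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object ClickStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",           // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "clickstream-demo",
      "auto.offset.reset"  -> "latest")

    // Subscribe to a hypothetical "page-clicks" topic.
    val clicks = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("page-clicks"), kafkaParams))

    // Map / reduce over the stream: count clicks per page in each micro-batch.
    clicks.map(record => (record.value, 1L))
          .reduceByKey(_ + _)
          .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```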
Oracle Stream Explorer
» Stream Explorer can process multiple event streams, detect patterns and trends in real time, and then initiate an action. It can be deployed standalone, integrated in the SOA stack, or in a lightweight fashion on embedded Java. Stream Explorer ensures that downstream applications and service-oriented and event-driven architectures are driven by true, real-time intelligence.
Oracle GoldenGate
» GoldenGate enables log-based change data capture, distribution, transformation, and delivery. It supports heterogeneous data management systems and operating systems and provides bidirectional replication without distance limitation. GoldenGate ensures transactional integrity, reliable data delivery, and fast recovery after interruptions.
Distributed File System Capability
Hadoop Distributed File System (HDFS)
» HDFS is an Apache open source distributed file system that runs on high-performance commodity hardware and on appliances built with such hardware (e.g., the Oracle Big Data Appliance). It is designed to be deployed on highly scalable nodes and associated storage. HDFS provides automatic data replication (usually deployed as triple replication) for fault tolerance. Most organizations deploy data and manipulate it directly in HDFS for write-once, read-many applications such as those common in analytics.
Cloudera Manager
» Cloudera Manager is an end-to-end management application for Cloudera’s Distribution of Apache Hadoop. It gives a cluster-wide, real-time view of the nodes and services running; provides a single, central place to enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization.
Data Management Capability
Apache HBase
» Apache HBase is designed to provide random read/write access to very large non-relational tables deployed in Hadoop. Among its features are linear and modular scalability, strictly consistent reads and writes, automatic and configurable sharding, automatic failover between Region Servers, base classes for Hadoop MapReduce jobs over Apache HBase tables, Java API client access, and a RESTful web service.
Apache Kudu
» Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a more recent complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds. For example, Kudu can be used in situations that require fast analytics on fast (rapidly changing) data. Kudu promises to lower query latency significantly for Apache Impala and Apache Spark initially, with other execution engines to come.
Oracle NoSQL Database
» For high-transaction environments (not just append), where data models call for table-based key-value pairs and consistency is defined by configurable policies alongside superior availability, Oracle NoSQL Database excels in web scale-out and click-stream type low-latency environments.
» Oracle NoSQL Database is designed as a highly scalable, distributed database based on Oracle Berkeley DB (originally developed by Sleepycat Software). It is a general-purpose, enterprise-class key-value store that adds an intelligent driver on top of an enhanced, distributed Berkeley database. This intelligent driver keeps track of the underlying storage topology, understands and uses data shards where necessary, and knows where data can be placed in a clustered environment for the lowest possible latency. Unlike competitive solutions, Oracle NoSQL Database is easy to install, configure, and manage. It supports a broad set of workloads and delivers enterprise-class reliability backed by enterprise-class Oracle support.
» Using Oracle NoSQL Database allows data to be acquired, organized, and analyzed more efficiently. Primary use cases include low-latency capture plus fast querying of the same data as it is being ingested, most typically by key-value lookup. Examples include credit card transaction environments; high-velocity, low-latency embedded device data capture; and high-volume stock market trading applications. Oracle NoSQL Database can also provide a near-consistent copy of an Oracle Database table as key-value pairs where a high rate of updates is required. It can serve as a target for Oracle GoldenGate change data capture and can be used in conjunction with event processing in Oracle Stream Explorer and Oracle Real-Time Decisions. The product is available in both an open source community edition and an enterprise edition for large distributed data centers; the latter version is part of the Big Data Appliance.
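To illustrate the key-value lookup pattern described above, here is a rough Scala sketch against the classic oracle.kv Java driver; the class and method names are recalled from that driver and should be verified against the current documentation, and the store name, host:port, key path, and record content are all placeholders.

```scala
import java.util.Arrays
import oracle.kv.{KVStoreConfig, KVStoreFactory, Key, Value}

object KeyValueLookupSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder store name and helper host:port (typical of a local kvlite setup).
    val store = KVStoreFactory.getStore(new KVStoreConfig("kvstore", "localhost:5000"))

    // The major key path groups related records: here, a hypothetical user profile.
    val key   = Key.createKey(Arrays.asList("users", "1234", "profile"))
    val value = Value.createValue("""{"name":"Avery","segment":"gold"}""".getBytes("UTF-8"))

    store.put(key, value)            // low-latency write as the record is ingested
    val fetched = store.get(key)     // key-value lookup of the same record
    if (fetched != null)
      println(new String(fetched.getValue.getValue, "UTF-8"))

    store.close()
  }
}
```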
• 9. Processing Capability
Apache Hadoop
» The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of nodes using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Processing capabilities for query and analysis are primarily delivered using programs and utilities that leverage MapReduce and Spark. Other key technologies include the Hadoop Distributed File System (HDFS) and YARN (a framework for job scheduling and cluster resource management).
MapReduce
» MapReduce relies on a linear dataflow structure evenly allocated as highly distributed programs. Apache provides MapReduce in Hadoop as a programming model and an implementation designed to process large data sets in parallel where the data resides on disks in a cluster.
Apache Spark
» Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD). Spark's RDDs function as a fast working set or cache for distributed programs, in essence providing a form of distributed shared memory. The speed of access and availability of this working data set facilitate high-performance implementations of two common algorithm paradigms: (1) iterative algorithms that must reuse and access data multiple times, and (2) newer analytic and exploratory processing that is akin to the processing and query models found in traditional databases. Leading the category of iterative algorithms is machine learning.
Data Integration Capability
Oracle Big Data Connectors - Oracle Loader for Hadoop, Oracle Data Integrator (Click here for Oracle Data Integration and Big Data)
» The Oracle Loader for Hadoop enables parallel, high-speed loading of data from Hadoop into an Oracle Database. Oracle Data Integrator Enterprise Edition, in combination with the Big Data Connectors, enables high-performance data movement and the deployment of data transformations in Hadoop. Other features in the Big Data Connectors include the Oracle SQL Connector for HDFS, Oracle R Advanced Analytics for Hadoop, and Oracle XQuery for Hadoop.
SQL Data Access
Oracle Big Data SQL
» Big Data SQL enables Oracle SQL queries to be initiated against data also residing in Apache Hadoop clusters and NoSQL databases. Oracle Database 12c provides the means to query this data using external tables. Smart Scan capabilities deployed on the other data sources minimize data movement and maximize performance. Because queries are initiated through the Oracle Database, advanced security, data redaction, and virtual private database capabilities are extended to Hadoop and NoSQL databases.
Apache Hive
» Hive provides a mechanism to project structure onto Hadoop data sets and query the data using a SQL-like language called HiveQL. The language also enables traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. Hive itself contains only metadata that describes data stored in Apache HDFS and Apache HBase, not the data itself. More recently, HiveQL query execution is commonly run on Spark for better performance.
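Since HiveQL increasingly runs on Spark, here is a hedged Scala sketch that issues a HiveQL-style query through a SparkSession with Hive support enabled. It assumes a reachable Hive metastore and a hypothetical web_logs table that Hive has already projected onto files in HDFS.

```scala
import org.apache.spark.sql.SparkSession

object HiveOnSparkSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark read table definitions from the Hive metastore,
    // so the query runs over structure that Hive has projected onto HDFS files.
    val spark = SparkSession.builder()
      .appName("hiveql-on-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table: one row per page view.
    val topPages = spark.sql(
      """SELECT page, COUNT(*) AS views
        |FROM web_logs
        |WHERE dt = '2016-03-01'
        |GROUP BY page
        |ORDER BY views DESC
        |LIMIT 10""".stripMargin)

    topPages.show()
    spark.stop()
  }
}
```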
Apache Impala - Cloudera
» With Impala, you can query data, whether stored in HDFS or Apache HBase, using typical SQL constructs such as select, join, and aggregate functions, an order of magnitude faster than using Hive with MapReduce. To avoid latency and improve application speed, Impala circumvents MapReduce and accesses the data directly through a specialized distributed query engine very similar to those found in commercial parallel RDBMSs.
Statistical Analysis Capability
Open source Project R and Oracle R Enterprise (part of Oracle Advanced Analytics)
» R is a programming language for statistical analysis (Click here for Project R). Oracle R Enterprise first enabled R algorithms to run in parallel inside the Oracle Database, without the need to move data out of the data store (Click here for Oracle R Enterprise). Oracle R Advanced Analytics for Hadoop (ORAAH) is bundled into Oracle's Big Data Connectors and provides high-performance in-Hadoop statistical analysis capabilities leveraging Spark and MapReduce.
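ORAAH itself is an R interface, but the underlying idea of computing statistics in parallel where the data already lives can be illustrated with a short Scala/Spark sketch. This is a stand-in for that idea, not the ORAAH API; the HDFS path is hypothetical and is assumed to hold one numeric reading per line.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InPlaceStatsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("in-place-stats-sketch").setMaster("local[*]"))

    // Hypothetical reservoir data set: one sensor reading per line.
    val readings = sc.textFile("hdfs:///reservoir/sensor_readings.txt")
      .map(_.trim.toDouble)
      .cache()                         // keep the working set in memory for repeated passes

    // One distributed pass produces the summary statistics where the data lives.
    val summary = readings.stats()     // count, mean, stdev, min, max
    println(f"n=${summary.count}, mean=${summary.mean}%.3f, stdev=${summary.stdev}%.3f")

    sc.stop()
  }
}
```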
• 10. Spatial & Graph Capability
Oracle Big Data Spatial and Graph
» Oracle Big Data Spatial and Graph provides analytic services and data models supporting Big Data workloads on Apache Hadoop and NoSQL database technologies. It includes two main components: a property graph database with 35 built-in graph analytics that discover relationships, recommendations, and other graph patterns in big data; and a wide range of spatial analysis functions and services to evaluate data based on how near or far something is to something else, determine whether something falls within a boundary or region, and process and visualize geospatial map data and imagery.
Information Discovery
Oracle Big Data Discovery
» Oracle Big Data Discovery provides an interactive interface into Apache Hadoop (e.g., Cloudera, Hortonworks) to easily find and explore data, quickly transform and enrich data, intuitively discover new insights by combining data sets, and share the results through highly visual interfaces.
Business Intelligence
Oracle Business Intelligence Suite
» Oracle Business Intelligence Suite provides a single platform for ad hoc query and analysis, report publishing, and data visualization from your workstation or mobile device. Direct access to Hadoop is supported via Hive and Impala.
Real-Time Recommendation Engine
Oracle Real-Time Decisions
» Using Oracle RTD, business logic can be expressed in the form of rules and self-learning predictive models that can recommend optimal courses of action with very low latency against specific performance objectives. This is accomplished by using data from different “channels”: whether it is click-stream data on the web, computer telephony integration providing valuable insight to call center agents, or the point of sale, Oracle RTD can be combined with the complex event processing of Oracle Stream Explorer to create a complete event-based Decision Management System.
Source: An Enterprise Architect's Guide to Big Data - Reference Architecture Overview, ORACLE ENTERPRISE ARCHITECTURE WHITE PAPER | MARCH 2016
How Big Data Is Designed for the Cloud Offering
• 11. (Diagram: Big Data in the Oracle cloud offering)
• 12. Big Data Administrator, Data Architect
» Manages the Big Data clusters
» Monitors data and network traffic, prevents glitches
» Integrates, centralizes, protects, and maintains data sources
» Grants and revokes permissions to various clients and nodes
Roadmap to Big Data Architect
» Get your hands on a Big Data platform, cloud services, or VMs, e.g., Oracle BDALite Vbox, Oracle Big Data Cloud Services, Oracle BDA
» Leverage your Oracle background and notions: clusters, nodes, Oracle SQL (Big Data SQL), Big Data Connectors (e.g., Oracle Datasource for Hadoop)
» Get familiar with Big Data databases and storage: HDFS, NoSQL, DBMSes
» Get familiar with key Big Data frameworks: Hadoop, Spark, Flink, Beam, streaming frameworks (Kafka, Storm), and their integration with Oracle (OD4H, OD4S, and so on)
» Get familiar with Big Data tools and programming: Hive SQL, Spark SQL, visualization tools, R, Java, Scala, and so on
» Read, practice, and get involved in Big Data projects
Resources
• Apache Projects
http://hadoop.apache.org/
http://spark.apache.org/
http://flink.apache.org/
https://beam.apache.org/
• Big Data in the Cloud, Big Data Compute Edition
https://cloud.oracle.com/en_US/big-data
https://cloud.oracle.com/en_US/big-data-compute-edition
• Big Data Connectors
https://www.oracle.com/database/big-data-connectors/index.html