2012
Big Data – Insights & Challenges




                       Rupen Momaya
                       WEschool Part Time Masters Program
                       8/28/2012

Table of Contents

Executive Summary ......................................................................................................................... 4
Introduction .................................................................................................................................... 5
  1.0 Big Data Basics ...................................................................................................................... 5
  1.1 What is Big Data ?.................................................................................................................. 5
  1.2 Big Data Steps, Vendors & Technology Landscape .............................................................. 6
  1.3 What Business Problems are being targeted ? ..................................................................... 7
  1.4 Key Terms ............................................................................................................................ 8
     1.4.1 Data Scientists ................................................................................................................ 8
     1.4.2 Massive Parallel Processing (MPP) ................................................................................ 8
     1.4.3 In Memory analytics ...................................................................................................... 8
     1.4.4 Structured, Semi-Structured & UnStructured Data........................................................ 9
2.0. Big Data Infrastructure ........................................................................................................... 10
  2.1 Storage................................................................................................................................. 10
     2.1.1 Why RAID Fails at Scale ................................................................................................ 10
     2.1.2 Scale up vs Scale out NAS ............................................................................................ 10
     2.1.3 Object Based Storage.................................................................................................... 11
  2.2 Apache Hadoop ................................................................................................................... 12
  2.3 Data Appliances .................................................................................................................. 13
     2.3.1 HP Vertica .................................................................................................................... 13
      2.3.2 Teradata Aster ............................................................................................ 14
3.0 Domain Wise Challenges in Big Data Era ................................................................................ 16
  3.1 Log Management................................................................................................................. 16
  3.2 Data Integrity & Reliability in the Big Data Era ................................................................... 16
  3.3 Backup Management in Big Data Era ................................................................................. 17
  3.4 Database Management in Big Data Era .............................................................................. 19
4.0 Big Data Use Cases ................................................................................................................. 21
  4.1 Potential Use Cases ............................................................................................................. 21
  4.2 Big Data Actual Use Cases .................................................................................................. 24
Bibliography ................................................................................................................................. 29




Table of Figures
Figure 1 - Big Data Statistics InfoGraphic ____________________________________________________________ 6
Figure 2 - Big Data Vendors & Technology Landscape __________________________________________________ 7
Figure 3 - HP Vertica Analytics Appliance ___________________________________________________________ 13
Figure 4 - Teradata Unified Big Data Architecture for the Enterprise ____________________________________ 15
Figure 5 - Framework for Choosing Teradata Aster Solution ____________________________________________ 15
Figure 6 - Potential Use Cases for Big Data _________________________________________________________ 21
Figure 7 - Big Data Analytics Business Model ________________________________________________________ 22
Figure 8 - Survey Results : Use of Open Source to Manage Big Data _____________________________________ 25
Figure 9 - Big Data Value Potential Index ___________________________________________________________ 28




Executive Summary :
       The Internet has made vast new sources of data available to business executives. Big
data comprises data sets too large to be handled by traditional systems. To remain
competitive, business executives need to adopt the new technologies and techniques emerging
because of big data.
       Big data includes structured, semi-structured and unstructured data. Structured
data are data formatted for use in a database management system. Semi-structured and
unstructured data include all types of unformatted data, including multimedia and social media
content. Big data are also produced by myriad hardware objects, including the sensors and
actuators embedded in physical objects, which together form the Internet of Things.
       Data storage techniques used include multiple clustered network attached storage
(NAS) and object-based storage. Clustered NAS deploys storage devices attached to a network.
Groups of storage devices attached to different networks are then clustered together. Object-
based storage systems distribute sets of objects over a distributed storage system.
       Hadoop, used to process unstructured and semi-structured big data, uses the map
paradigm to locate all relevant data and then selects only the data directly answering the
query. NoSQL, MongoDB, and Terrastore process structured big data. NoSQL data is
characterized by being basically available, soft state (changeable), and eventually consistent.
MongoDB and Terrastore are both NoSQL-related products used for document-oriented
applications.
       The advent of the age of big data poses opportunities and challenges for businesses.
Previously unavailable forms of data can now be saved, retrieved, and processed. However,
changes to hardware, software, and data processing techniques are necessary to employ this
new paradigm.




Introduction:
         The internet has grown tremendously in the last decade, from 304 million users in March
2000 to 2,280 million users in March 2012, according to Internet World Stats. Worldwide
information is more than doubling every two years, with 1.8 zettabytes (1.8 trillion gigabytes)
projected to be created and replicated in 2011 alone, according to a study conducted by the
research firm IDC.
      "Big Data" is a buzzword, or catch-phrase, used to describe a massive volume of both
structured and unstructured data that is so large it is difficult to process with traditional
database and software techniques. An example of Big Data might be petabytes (1,024 terabytes),
exabytes (1,024 petabytes) or even zettabytes of data consisting of billions to trillions of records
about millions of people -- all from different sources (e.g. blogs, social media, email, sensors, RFID
readers, photographs, videos, microphones, mobile data and so on). The data is typically
loosely structured, often incomplete and inaccessible.
     When dealing with large datasets, organizations face difficulties in creating,
manipulating, and managing Big Data. Scientists regularly encounter this problem in meteorology,
genomics, connectomics, complex physics simulations, biological and environmental
research, Internet search, finance and business informatics. Big data is a particular problem in
business analytics because standard tools and procedures are not designed to search and
analyze massive datasets. While the term may seem to reference the volume of data, that isn't
always the case. The term Big Data, especially when used by vendors, may refer to the
technology (the tools and processes) that an organization requires to handle the large amounts
of data and storage facilities.




1.0 Big Data Basics :
1.1 What is Big Data ?
       The infographic below depicts the expected market size of Big Data along with some key statistics.




Figure 1 - Big Data Statistics InfoGraphic




1.2 Big Data Steps, Vendors & Technology Landscape :

   • Data Acquisition: Data is collected from the data sources and distributed across
     multiple nodes -- often a grid -- each of which processes a subset of the data in parallel.
     Here we have technology providers like IBM and HP, data providers like Reuters and
     Salesforce, and social networking websites like Facebook, Google+ and LinkedIn.

   • Marshalling: In this domain, we have very large data warehousing and BI appliances,
     with actors like Actian, EMC² (Greenplum), HP (Vertica) and IBM (Netezza).

   • Analytics: In this phase, we have the predictive technologies (such as data mining),
     with vendors such as Adobe, EMC², GoodData and Hadoop MapReduce.

   • Action: Includes all the Data Acquisition providers plus the ERP, CRM and BPM actors,
     including Adobe, Eloqua and EMC². In both the Analytics and Action phases, BI tool
     vendors include GoodData, Google, HP (Autonomy) and IBM (Cognos suite).




   • Data Governance: An efficient master data management solution. As defined, data
     governance applies to each of the preceding stages of Big Data delivery. By
     establishing processes and guiding principles, it sanctions behaviors around data. In
     short, data governance ensures that the application of Big Data is useful and relevant.
     It's an insurance policy that the right questions are being asked, so that we won't be
     squandering the immense power of new Big Data technologies that make processing,
     storage and delivery speed more cost-effective and nimble than ever.




                                                       Figure 2 - Big Data Vendors & Technology Landscape

1.3 What Business Problems are being targeted ?
    World-class companies are targeting a new set of business problems that were hard to
solve before:

   • Modeling true risk
   • Customer churn analysis
   • Flexible supply chains
   • Loyalty pricing
   • Recommendation engines
   • Ad targeting
   • Precision targeting
   • PoS transaction analysis
   • Threat analysis
   • Trade surveillance
   • Search quality fine-tuning
   • Mashups such as location + ad targeting

Data growth curve: Terabytes -> Petabytes -> Exabytes -> Zettabytes -> Yottabytes ->
Brontobytes -> Geopbytes. It is getting more interesting.

Analytical Infrastructure curve: Databases -> Datamarts -> Operational Data Stores (ODS) ->
Enterprise Data Warehouses -> Data Appliances -> In-Memory Appliances -> NoSQL Databases -
> Hadoop Clusters


1.4    Key Terms :
  1.4.1 Data Scientists :
      A data scientist represents an evolution from the business or data analyst role. Data
  scientists, also known as data analysts, are professionals with a core statistics or
  mathematics background coupled with good knowledge of analytics and data software tools.
  A McKinsey study on Big Data states, “India will need nearly 100,000 data scientists in the
  next few years.”

        A data scientist is a fairly new role, defined by Hilary Mason of bit.ly as someone who
  can obtain, scrub, explore, model and interpret data, blending hacking, statistics and
  machine learning to cull information from data. Data scientists take a blend of the
  hacker's arts, statistics and machine learning, and apply their expertise in mathematics and
  their understanding of the domain of the data -- where the data originated -- to process the
  data into useful information. This requires the ability to make creative decisions about the
  data and the information created, and to maintain a perspective that goes beyond ordinary
  scientific boundaries.

  1.4.2 Massive Parallel Processing (MPP) :
       MPP is the coordinated processing of a program by multiple processors that work on
  different parts of the program, with each processor using its own operating
  system and memory. An MPP system is considered better than a symmetric multiprocessing
  (SMP) system for applications that allow a number of databases to be searched in parallel,
  such as decision support system and data warehouse applications.
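  To make the idea concrete, here is a minimal, single-machine sketch of the MPP pattern in
  Python: each worker process scans only its own partition of the data, and the partial
  results are combined at the end. The data, shard layout and query are invented for
  illustration; a real MPP database spreads the partitions across separate machines, each
  with its own operating system and memory.

    from multiprocessing import Pool

    def scan_partition(partition):
        # Each "processor" searches only its own shard of the data.
        return [row for row in partition if row["amount"] > 100]

    if __name__ == "__main__":
        # Four shards standing in for four independent MPP nodes.
        shards = [
            [{"id": i, "amount": i * 7} for i in range(n, n + 100)]
            for n in (0, 100, 200, 300)
        ]
        with Pool(processes=4) as pool:
            partials = pool.map(scan_partition, shards)  # parallel scan
        matches = [row for part in partials for row in part]
        print(len(matches), "rows matched across all partitions")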

  1.4.3 In Memory analytics :
       The key difference between conventional BI tools and in-memory products is that the
  former query data on disk while the latter query data in random access memory (RAM). When
  a user runs a query against a typical data warehouse, the query normally goes to a database
  that reads the information from multiple tables stored on a server's hard disk. With a server-
  based in-memory database, all information is initially loaded into memory. Users then query
  and interact with the data loaded into the machine's memory.
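  The contrast can be illustrated with sqlite3 from Python's standard library, which supports
  both file-backed and fully in-memory databases. This is only a toy sketch (the table and
  data are invented); commercial in-memory analytics platforms add distributed memory pools
  and far more sophisticated engines.

    import os
    import sqlite3
    import tempfile

    def build(conn):
        # Identical schema and data in both databases.
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)",
                         [("east", 10.0), ("west", 20.0), ("east", 5.0)])
        conn.commit()

    disk_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
    disk = sqlite3.connect(disk_path)   # tables read from the hard disk
    mem = sqlite3.connect(":memory:")   # all data loaded into RAM first
    for conn in (disk, mem):
        build(conn)
        print(conn.execute("SELECT region, SUM(amount) FROM sales "
                           "GROUP BY region").fetchall())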

Does an in-memory analytics platform replace or augment traditional in-database
approaches?




The answer is that it is quite complementary. In-database approaches put a large focus on
the data preparation and scoring portions of the analytic process. The value of in-database
processing is the ability to handle terabytes or petabytes of data effectively. Much of the
processing may not be highly sophisticated, but it is critical.

   The new in-memory architectures use a massively parallel platform to enable the multiple
terabytes of system memory to be utilized (conceptually) as one big pool of memory. This
means that samples can be much larger, or even eliminated. The number of variables tested
can be expanded immensely.

    In-memory approaches fit best in situations where there is a need for:

   • High volume & speed: it is necessary to run many, many models quickly.
   • High width & depth: it is desired to test hundreds or thousands of metrics across tens
     of millions of customers (or other entities).
   • High complexity: it is critical to run processing-intensive algorithms on all this data and
     to allow for many iterations to occur.

   There are a number of in-memory analytics tools and technologies with different
architectures. Boris Evelson (Forrester Research) defines the following five types of business
intelligence in-memory analytics:

   • In-memory OLAP: classic MOLAP (Multidimensional Online Analytical Processing) cube
     loaded entirely into memory.
   • In-memory ROLAP: relational OLAP metadata loaded entirely into memory.
   • In-memory inverted index: index, with data, loaded into memory.
   • In-memory associative index: an array/index with every entity/attribute correlated to
     every other entity/attribute.
   • In-memory spreadsheet: a spreadsheet-like array loaded entirely into memory.


  1.4.4   Structured, Semi-Structured & UnStructured Data :

   Structured data is the type that fits neatly into a standard relational database
management system (RDBMS) and lends itself to that type of processing.
   Semi-structured data has some level of commonality but does not fit the structured
data type.
   Unstructured data varies in its content and can change from entry to entry.

Structured Data                 Semi-Structured Data             Unstructured Data
Customer Records                Web Logs                         Pictures
Point of Sale Data              Social Media                     Video Editing Data
Inventory                       E-Commerce                       Productivity (Office docs)
Financial Records                                                Geological Data


The table above gives examples of each type.
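A small, hypothetical Python snippet makes the three shapes tangible: the same commerce
event as a fixed-column record, as a self-describing JSON document, and as free text whose
structure would have to be extracted downstream. All names and values are invented.

    import json

    # Structured: fixed columns, ready for an RDBMS table.
    structured_row = ("C1042", "2012-08-28", 249.99)  # (customer, date, amount)

    # Semi-structured: self-describing JSON from a web log; fields can
    # vary between records, but there are common keys to anchor on.
    semi_structured = json.loads(
        '{"ts": "2012-08-28T10:15:00", "user": "C1042", '
        '"event": "checkout", "items": [{"sku": "A7", "qty": 2}]}')
    print(semi_structured["event"], len(semi_structured["items"]))

    # Unstructured: free text; any schema must be inferred by code.
    unstructured = "Order arrived late but the packaging was great!"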



2.0 Big Data Infrastructure
2.1 Storage
2.1.1 Why RAID Fails at Scale :
        RAID schemes are based on parity, and at their root, if more than two drives fail
simultaneously, data is not recoverable. The statistical likelihood of multiple drive failures has
not been an issue in the past. However, as drive capacities continue to grow beyond the
terabyte range and storage systems continue to grow to hundreds of terabytes and petabytes,
the likelihood of multiple drive failures is now a reality.

        Further, drives aren't perfect, and typical SATA drives have a published bit rate error
(BRE) of 1 in 10^14, meaning that once in every 100,000,000,000,000 bits read, there will be a
bit that is unrecoverable. Doesn't seem significant? In today's big data storage systems, it is.
The likelihood of having one drive fail, and then encountering a bit rate error when rebuilding
from the remaining RAID set, is highly probable in real-world scenarios. To put this into
perspective: when reading 10 terabytes, encountering an unreadable bit is likely (56%), and
when reading 100 terabytes, it is nearly certain (99.97%).
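The figures above can be checked with a few lines of back-of-the-envelope Python (a sketch;
it assumes every bit is read exactly once and that bit errors are independent). The results
land within a percentage point of the numbers quoted above:

    import math

    def p_unreadable(bytes_read, bre_bits=1e14):
        """Probability of at least one unrecoverable bit in a full read."""
        bits = bytes_read * 8
        return -math.expm1(bits * math.log1p(-1.0 / bre_bits))

    TB = 10 ** 12
    print(f"10 TB : {p_unreadable(10 * TB):.0%}")    # roughly 55%
    print(f"100 TB: {p_unreadable(100 * TB):.2%}")   # roughly 99.97%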
2.1.2 Scale up vs Scale out NAS :
         A traditional scale-up system provides a small number of access points, or data
servers, that sit in front of a set of disks protected with RAID. As these systems needed
to provide more data to more users, the storage administrator would add more disks to the
back end, but this only turned the data servers into a choke point. Larger and faster
data servers could be created using faster processors and more memory, but this architecture
still had significant scalability issues.

        Scale-out takes the approach of more of everything: instead of adding drives behind a
pair of servers, it adds servers, each with its own processor, memory, network interfaces and
storage capacity. As capacity needs grow, a new node with all these resources is inserted into
the grid -- the scale-out version of an array. This architecture required a number of things to
make it work, from both a technology and a financial standpoint. Some of these factors include:

   • Clustered architecture – for this model to work, the entire grid needs to operate as a
     single entity, and each node in the grid must be able to pick up a portion of the
     function of any other node that fails.
   • Distributed/parallel file system – the file system must allow a file to be accessed
     from any one, or any number, of nodes and sent to the requesting system. This
     requires different mechanisms underlying the file system: distribution of data across
     multiple nodes for redundancy, a distributed metadata or locking mechanism, and data
     scrubbing/validation routines.
   • Commodity hardware – for these systems to be affordable, they must rely on
     commodity hardware that is inexpensive and easily accessible instead of purpose-built
     systems.
   Benefits of Scale Out :
              There are a number of significant benefits to these new scale out systems that
       meet the needs of big data challenges.




   • Manageability – when data can grow in a single file system namespace, the
     manageability of the system increases significantly: a single data administrator can
     now manage a petabyte or more of storage, versus 50 or 100 terabytes on a scale-up
     system.
   • Elimination of stovepipes – since these systems scale linearly and do not have the
     bottlenecks that scale-up systems create, all data is kept in a single file system in a
     single grid, eliminating the stovepipes introduced by the multiple arrays and file
     systems otherwise required.
   • Just-in-time scalability – as storage needs grow, an appropriate number of nodes can
     be added at the time they are needed. With scale-up arrays, administrators had to
     guess the final size their data might reach on that array, which often led to the
     purchase of large data servers with only a few disks behind them initially, so as not to
     hit a bottleneck in the data server as disks were added.
   • Increased utilization rates – since the data servers in these scale-out systems can
     address the entire pool of storage, there is no stranded capacity.

     There are five core tenets of scale-out NAS: it should be simple to scale, offer
predictable performance, be efficient to operate, be always available, and be proven to work
in a large enterprise.
EMC Isilon :
EMC Isilon is the scale-out platform that delivers ideal storage for Big Data. Powered by the
OneFS operating system, Isilon nodes are clustered to create a high-performing, single pool of
storage.

        EMC Corporation announced in May 2011, the world’s largest single file system with
the introduction of EMC Isilon’s new IQ 108NL scale-out NAS hardware product. Leveraging
three terabyte (TB) enterprise-class Hitachi Ultrastar drives in a 4U node, the 108NL scales to
more than 15 petabytes (PB) in a single file system and single volume, providing the storage
foundation for maximizing the big data opportunity. EMC also announced Isilon’s new
SmartLock data retention software application, delivering immutable protection for big data to
ensure the integrity and continuity of big data assets from initial creation to archival.
2.1.3 Object Based Storage
       Object storage is based on a single, flat address space that enables the automatic
routing of data to the right storage systems, and the right tier and protection levels within
those systems according to its value and stage in the data life cycle.

Better Data Availability than RAID : In a properly configured object storage system, content is
replicated so that a minimum of two replicas assure continuous data availability. If a disk dies,
all other disks in the cluster join in to replace the lost replicas while the system still runs at
nearly full speed. Recovery takes only minutes, with no interruption of data availability and no
noticeable performance degradation.

Provides Unlimited Capacity and Scalability
        In object storage systems, there is no directory hierarchy (or "tree"), and the object's
location does not have to be specified in the way a directory path must be known in
order to retrieve a file. This enables object storage systems to scale to petabytes and beyond
without limits on the number of files (objects), file size or file system capacity, such as the 2-
terabyte restriction that is common for Windows and Linux file systems.

Backups Are Eliminated
       With a well-designed object storage system, backups are not required. Multiple replicas
ensure that content is always available, and an offsite disaster recovery replica can be
created automatically if desired.

Automatic Load Balancing
       A well-designed object storage cluster is totally symmetrical, which means that each
node is independent, provides an entry point into the cluster and runs the same code.

      Companies that provide object storage include Cleversafe, Compuverde, Amplidata, Caringo,
EMC (Atmos), Hitachi Data Systems (Hitachi Content Platform), NetApp (StorageGRID) and Scality.
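The flat address space idea can be sketched in a few lines of Python: the object's identifier
alone determines which nodes hold its replicas, so no directory tree is walked to locate it.
The node names, replica count and hashing scheme here are invented for illustration, not any
vendor's actual placement algorithm.

    import hashlib

    NODES = ["node-a", "node-b", "node-c", "node-d"]
    REPLICAS = 2  # a minimum of two replicas, as described above

    def replica_nodes(object_id):
        # Hash the flat object ID and pick REPLICAS distinct nodes.
        digest = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
        start = digest % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

    # Any node can answer "where does this object live?" by pure math.
    print(replica_nodes("video/2012/cam42/frame-000123"))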



2.2 Apache Hadoop
     Apache Hadoop has been the driving force behind the growth of the big data industry. It is
a framework for running applications on large clusters built of commodity hardware. The
Hadoop framework transparently provides applications with both reliability and data motion.

MapReduce : The core of Hadoop. Created at Google in response to the problem of creating
web search indexes, the MapReduce framework is the powerhouse behind most of today’s big
data processing. In addition to Hadoop, you’ll find MapReduce inside MPP and NoSQL
databases, such as Vertica or MongoDB. The important innovation of MapReduce is the ability
to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing
the computation solves the issue of data too large to fit onto a single machine. Combine this
technique with commodity Linux servers and you have a cost-effective alternative to massive
computing arrays.
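Here is a minimal, single-process Python sketch of the pattern (word counting, the canonical
example; the input chunks stand in for data held on different nodes). Real Hadoop jobs run
the same map, shuffle and reduce steps across many machines.

    from collections import defaultdict

    def map_phase(chunk):
        # Emit (key, value) pairs from one node's share of the input.
        return [(word.lower(), 1) for word in chunk.split()]

    def shuffle(pairs):
        # Group intermediate values by key, as the framework would.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Combine each key's values into the final answer.
        return {key: sum(values) for key, values in groups.items()}

    chunks = ["big data is big", "data about data"]  # one chunk per "node"
    intermediate = [pair for c in chunks for pair in map_phase(c)]
    print(reduce_phase(shuffle(intermediate)))
    # {'big': 2, 'data': 3, 'is': 1, 'about': 1}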
HDFS : We discussed above the ability of MapReduce to distribute computation over multiple servers.
For that computation to take place, each server must have access to the data. This is the role of
HDFS, the Hadoop Distributed File System.

        HDFS and MapReduce are robust. Servers in a Hadoop cluster can fail and not abort the
computation process. HDFS ensures data is replicated with redundancy across the cluster. On
completion of a calculation, a node will write its results back into HDFS. There are no
restrictions on the data that HDFS stores. Data may be unstructured and schemaless. By
contrast, relational databases require that data be structured and schemas be defined before
storing the data. With HDFS, making sense of the data is the responsibility of the developer’s
code.

Why would a company be interested in Hadoop?
        The number one reason is that the company is interested in taking advantage of
unstructured or semi-structured data. This data will not fit well into a relational database, but
Hadoop offers a scalable and relatively easy-to-program way to work with it. This category
includes emails, web server logs, instrumentation of online stores, images, video and external
data sets (such as a list of small businesses organized by geographical area). All this data can
contain information that is critical to the business and should reside in your data warehouse,



but it needs a lot of pre-processing, and this pre-processing will not happen in Oracle RDBMS
(for example).
        The other reason to look into Hadoop is for information that exists in the database but
can't be efficiently processed within the database. This is a wide use case, and it is usually
labelled “ETL” because the data is going out of an OLTP system and into a data warehouse. You
use Hadoop when 99% of the work is in the “T” of ETL: transforming the data into useful
information.

2.3 Data Appliances :
        Purpose-built solutions like Teradata, IBM/Netezza, EMC/Greenplum, SAP HANA (High-
Performance Analytic Appliance), HP Vertica and Oracle Exadata are forming a new category. Data
appliances are one of the fastest growing categories in Big Data. They integrate database,
processing, and storage in an integrated system optimized for analytics, offering:

   • Processing close to the data source
   • Appliance simplicity (ease of procurement; limited consulting)
   • Massively parallel architecture
   • A platform for advanced analytics
   • Flexible configurations and extreme scalability


2.3.1 HP Vertica :




                                                                      Figure 3 - HP Vertica Analytics Appliance

        The Vertica Analytics Platform is purpose built from the ground up to enable companies
to extract value from their data at the speed and scale they need to thrive in today’s economy.


Vertica has been designed and built for today's most demanding analytic workloads since its
inception, and each Vertica component is able to take full advantage of the others by design.

Key Features of the Vertica Analytics Platform :
   o   Real-Time Query & Loading » Capture the time value of data by continuously loading
       information, while simultaneously allowing immediate access for rich analytics.
   o   Advanced In-Database Analytics » Ever growing library of features and functions to
       explore and process more data closer to the CPU cores without the need to extract.
   o   Database Designer & Administration Tools » Powerful setup, tuning and control with
       minimal administration effort. Can make continual improvements while the system
       remains online.
   o   Columnar Storage & Execution » Perform queries 50x-1000x faster by eliminating costly
       disk I/O, without the hassle and overhead of indexes and materialized views (see the
       sketch after this list).
   o   Aggressive Data Compression » Accomplish more with less CAPEX, while delivering
       superior performance with an engine that operates directly on compressed data.
   o   Scale-Out MPP Architecture » Vertica automatically scales linearly and limitlessly by just
       adding industry-standard x86 servers to the grid.
   o   Automatic High Availability » Runs non-stop with automatic redundancy, failover and
       recovery optimized to deliver superior query performance as well.
   o   Optimizer, Execution Engine & Workload Management » Get maximum performance
       without worrying about the details of how it gets done. Users just think about
       questions, we deliver answers, fast.
   o   Native BI, ETL, & Hadoop/MapReduce Integration » Seamless integration with a robust and
       ever growing ecosystem of analytics solutions.
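To illustrate the columnar storage point above, here is a toy Python contrast between a row
layout and a column layout (invented data; real column stores add compression, on-disk
layouts and vectorized execution). An aggregate over one column only touches that column's
array instead of every field of every row:

    rows = [  # row store: whole records stored together
        ("C1", "east", 10.0),
        ("C2", "west", 20.0),
        ("C3", "east", 5.0),
    ]

    columns = {  # column store: each attribute stored contiguously
        "customer": ["C1", "C2", "C3"],
        "region":   ["east", "west", "east"],
        "amount":   [10.0, 20.0, 5.0],
    }

    # SELECT SUM(amount): the row store touches every field of every row...
    total_row_store = sum(r[2] for r in rows)
    # ...while the column store reads exactly one (compressible) array.
    total_col_store = sum(columns["amount"])
    assert total_row_store == total_col_store == 35.0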


2.3.2 Teradata Aster :
   To gain business insight using MapReduce and Apache Hadoop with SQL-based analytics,
below is a summary of a unified big data architecture that blends the best of Hadoop and
SQL, allowing users to:

   • Capture and refine data from a wide variety of sources
   • Perform necessary multi-structured data preprocessing
   • Develop rapid analytics
   • Process embedded analytics, analyzing both relational and non-relational data
   • Produce semi-structured data as output, often with metadata and heuristic analysis
   • Solve new analytical workloads with reduced time to insight
   • Use massively parallel storage in Hadoop to efficiently store and retain data




Figure 4 - Teradata Unified Big Data Architecture for the Enterprise

When to choose which solution (Teradata, Aster or Hadoop) ?
        The figure below offers a framework to help enterprise architects most effectively use each
part of a unified big data architecture. This framework allows a best-of-breed approach that
you can apply to each schema type, helping you achieve maximum performance, rapid
enterprise adoption, and the lowest TCO.




                                                    Figure 5 - Framework for Choosing Teradata Aster Solution




3.0 Domain Wise Challenges in Big Data Era
3.1 Log Management
        Log data does not fall into the convenient schemas required by relational databases.
Log data is, at its core, unstructured, or, in fact, semi-structured, which leads to a deafening
cacophony of formats; the sheer variety in which logs are generated presents a major
problem in how they are analyzed. The emergence of Big Data has been driven not only by the
increasing amount of unstructured data to be processed in near real-time, but also by the
availability of new toolsets to deal with these challenges.
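A small sketch of that format problem: two log lines that carry roughly the same information
arrive in different shapes, and a normalizer maps both into one common record. Both formats
and field names below are invented for illustration.

    import json
    import re

    ACCESS = re.compile(r'(?P<host>\S+) .* "(?P<method>\w+) (?P<path>\S+)')

    def normalize(line):
        if line.lstrip().startswith("{"):      # JSON application log
            rec = json.loads(line)
            return {"host": rec.get("ip"), "path": rec.get("url")}
        m = ACCESS.match(line)                 # Apache-style access log
        if m:
            return {"host": m.group("host"), "path": m.group("path")}
        return {"raw": line}                   # unknown format: keep as-is

    print(normalize('10.0.0.5 - - "GET /index.html HTTP/1.1"'))
    print(normalize('{"ip": "10.0.0.9", "url": "/checkout"}'))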

       There are two things that don't receive enough attention in the log management space.
The first is real scalability, which means thinking beyond what data centers can do. That
inevitably leads to ambient cloud models for log management. Splunk has done an amazing job
of pioneering an ambient cloud model with its eventual consistency model, which allows you
to make a query and get a “good enough” answer quickly, or a perfect answer in more time.

        The second thing is security. Log data is next to useless if it is not non-repudiable.
Basically, all the log data in the world is not useful as evidence unless you can prove that
nobody changed it.

       Sumo Logic, Loggly and Splunk are the primary companies that currently have products
around log management.



3.2 Data Integrity & Reliability in the Big Data Era
        Consider standard business practices and how nearly all physical forms of
documentation and transactions have evolved into digitized versions. With them come the
inherent challenges of validating not just the authenticity of their contents but also
the impact of acting upon an invalid data set, something which is highly possible in today's
high-velocity, big data business environment. With this view, we can begin to identify the
scale of the challenge. With cybercrime and insider threats clearly emerging as a much more
profitable (and safer) business for the criminal element, the need to validate and verify is
going to become critical to all business documentation and related transactions,
even within existing supply chains.
        Keyless signature technology is a relatively new concept in the market and will require
a different set of perspectives when put under consideration. A keyless signature provides an
alternative method to key-based technologies by providing proof and non-repudiation of
electronic data using only hash functions for verification. The implementation of keyless
signature is done via a globally distributed machine, taking hash values of data as inputs and
returning keyless signatures that prove the time, integrity, and origin (machine, organization,
individual) of the input data.
        A primary goal of keyless signature technology is to provide mass-scale, non-
expiring data validation while eliminating the need for secrets or other forms of trust, thereby
reducing or even eliminating the need for more complex certificate-based solutions, which
are rife with certificate management issues, including expiration and revocation.
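The primitive underneath this approach is plain hash-based verification, which needs no
secret keys: anyone can recompute a hash and compare it to the published value. The sketch
below shows only that core idea (real keyless signature systems link such hashes into a
globally distributed, time-stamped structure; the log entry is invented):

    import hashlib

    def fingerprint(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    log_entry = b"2012-08-28 10:15:00 user=admin action=delete-table"
    stored_hash = fingerprint(log_entry)  # published at write time

    # Later, anyone can prove the entry was not altered...
    assert fingerprint(log_entry) == stored_hash
    # ...and any tampering becomes detectable.
    tampered = b"2012-08-28 10:15:00 user=admin action=read-table"
    assert fingerprint(tampered) != stored_hash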
        As more organizations are affected by the Big Data phenomenon, the clear implication
is that many businesses will potentially be making business decisions based on massive amounts
of internal and third-party data. Consequently, the demand for novel and trusted approaches
to validating data will grow. Extend this concept to the ability to validate a virtual machine,
switch logs or indeed security logs, then multiply by the clear advantages that cloud
computing (public or private) has over the traditional datacenter design, and we begin to
understand why keyless data integrity technology that can ensure self-validating
data is likely to experience swift adoption.
        The ability to move away from reliance on a third-party certification authority will
be welcomed by many, although this move from the traditionally accepted approach to verifying
data integrity needs to be more widely communicated and understood before mass-market
adoption and acceptance.
        Another solution for monitoring the stability, performance and security of your big data
environment is from a company called Gazzang. Enterprises and SaaS solution providers have
new needs that are driven by the new infrastructures and opportunities of cloud computing.
For example, business intelligence analysis uses big data stores such as MongoDB, Hadoop and
Cassandra. The data is spread across hundreds or thousands of servers in order to optimize
processing time and return business insight to the user. Leveraging its extensive experience
with cloud architectures and big data platforms, Gazzang is delivering a SaaS solution for the
capture, management and analysis of massive volumes of IT data. Gazzang zOps is purpose-
built for monitoring big data platforms and multiple cloud environments. The powerful engine
collects and correlates vast amounts of data from numerous sources in a variety of forms.


3.3 Backup Management in Big Data Era :

       For protection against user or application error, Ashar Baig, a senior analyst and
consultant with the Taneja Group, said snapshots can help with big data backups.

       Baig also recommends a local disk-based system for quick and simple first-level data-
recovery problems. “Look for a solution that provides you an option for local copies of data so
that you can do local restores, which are much faster,” he said. “Having a local copy, and having
an image-based technology to do fast, image-based snaps and replications, does speed it up
and takes care of the performance concern.”

Faster scanning needed :

        One of the issues big data backup systems face is scanning each time the backup and
archiving solutions start their jobs. Legacy data protection systems scan the file system each
time a backup job is run, and each time an archiving job is run. For file systems in big data
environments, this can be time-consuming.

        Commvault’s solution for the scanning issue in its Simpana data protection software is
its OnePass feature. According to Commvault, OnePass is an object-level converged process for
collecting backup, archiving and reporting data. The data is collected and moved off the
primary system to a ContentStore virtual repository for completing the data protection
operations.

       Once a complete scan has been accomplished, the Commvault software places an agent
on the file system to report on incremental backups, making the process even more efficient.


Casino doesn’t want to gamble on backups

       Pechanga Resort & Casino in Temecula, Calif., went live with a cluster of 50 EMC Isilon
X200 nodes in February to back up data from its surveillance cameras. The casino has 1.4 PB of
usable Isilon storage to keep the data, which is critical to operations because the casino must
shut down all gaming operations if its surveillance system is interrupted.

       “In gaming, we’re mandated to have surveillance coverage,” said Michael Grimsley,
director of systems for Pechanga Technology Solutions Group. “If surveillance is down, all
gaming has to stop.”

       If a security incident occurs, the IT team pulls footage from the X200 nodes and moves
it to WORM-compliant storage and backs it up with NetWorker software to EMC Data Domain
DD860 deduplication target appliances. The casino doesn’t need tape for WORM capability
because WORM is part of Isilon’s SmartLock software.

         “It’s mandatory that part of our storage includes a WORM-compliant section,” Grimsley
said. “Any time an incident happens, we put that footage in the vault. We have policies in place
so it’s not deleted.” The casino keeps 21 days’ worth of video on Isilon before recording over
the video.

       Grimsley said he is looking to expand the backup for the surveillance camera data. He’s
considering adding a bigger Data Domain device to do day-to-day backup of the data. “We have
no requirements for day-to-day backup, but it’s something we would like to do,” he said.

       Another possibility is adding replication to a DR site so the casino can recover quickly if
the surveillance system goes down.

Scale-out systems :

       Another option to solving the performance and capacity issues is using a scale-out
backup system, one similar to scale-out NAS, but built for data protection. You add nodes with
additional performance and capacity resources as the amount of protected data grows.

        “Any backup architecture, especially for the big data world, has to balance the
performance and the capacity properly,” said Jeff Tofano, Sepaton Inc.’s chief technology
officer. “Otherwise, at the end of the day, it’s not a good solution for the customer and is a
more expensive solution than it should be.”

        Sepaton's S2100-ES2 modular virtual tape library (VTL) was built for data-intensive large
enterprises. According to the company, its 64-bit processor nodes back up data at up to 43.2 TB
per hour, regardless of the data type, and can store up to 1.6 PB. You can add up to eight
performance nodes per cluster as your needs require, and add disk shelves to add capacity.




3.4 Database Management in Big Data Era :

       There are currently three trends in the industry:

   • the NoSQL databases, designed to meet the scalability requirements of distributed
     architectures and/or schemaless data management requirements;
   • the NewSQL databases, designed to meet the requirements of distributed architectures
     or to improve performance such that horizontal scalability is no longer needed;
   • the data grid/cache products, designed to store data in memory to increase application
     and database performance.

      The comparison below assesses the drivers behind the development and adoption of NoSQL
and NewSQL databases, as well as data grid/caching technologies.

NoSQL :
   o New breed of non-relational database products.
   o Rejection of fixed table schemas and join operations.
   o Designed to meet the scalability requirements of distributed architectures, and/or
     schemaless data management requirements.
   o Big tables – data mapped by row key, column key and time stamp.
   o Key-value stores – store keys and associated values.
   o Document stores – store all data as a single document.
   o Graph databases – use nodes, properties and edges to store data and the
     relationships between entries.

NewSQL :
   o New breed of relational database products.
   o Retain SQL and ACID.
   o Designed to meet the scalability requirements of distributed architectures, or to
     improve performance so that horizontal scalability is no longer a necessity.
   o MySQL storage engines – scale up and scale out.
   o Transparent sharding – reduces the manual effort required to scale.
   o Appliances – take advantage of improved hardware performance and solid state drives.
   o New databases – designed specifically for scale-out.

.. And Beyond :
   o In-memory data grid/cache products – a potential primary platform for distributed
     data management.
   o Data grid/cache – a spectrum of data management capabilities, from non-persistent
     data caching to persistent caching, replication, and distributed data and compute grids.
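As a concrete taste of the schemaless, document-oriented model in the NoSQL list above, here
is a hypothetical sketch using the pymongo driver for MongoDB (it assumes a MongoDB server
on localhost and the pymongo package installed; the database and field names are invented):

    from pymongo import MongoClient

    db = MongoClient("localhost", 27017).shop

    # No fixed table schema: two documents in one collection may
    # carry different fields.
    db.orders.insert_one({"user": "C1042", "items": ["A7", "B3"],
                          "total": 42.5})
    db.orders.insert_one({"user": "C2001", "total": 9.99,
                          "coupon": "SPRING"})

    # Query by any field; data is denormalized, so no join operations.
    for order in db.orders.find({"total": {"$gt": 10}}):
        print(order["user"], order["total"])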

      ComputerWorld's Tam Harbert explored the skills organizations are searching for in the
quest to manage the Big Data challenge, and identified five job titles emerging in the Big
Data world. Along with Harbert's findings, here are seven new types of jobs being created
by Big Data:



1. Data scientists: This emerging role is taking the lead in processing raw data and
    determining what types of analysis would deliver the best results.
 2. Data architects: Organizations managing Big Data need professionals who will be able to
    build a data model, and plan out a roadmap of how and when various data sources and
    analytical tools will come online, and how they will all fit together.
 3. Data visualizers: These days, a lot of decision-makers rely on information that is presented
    to them in a highly visual format — either on dashboards with colorful alerts and dials, or
    in quick-to-understand charts and graphs. Organizations need professionals who can
    harness the data and put it in context, in layman's language, exploring what the data
    means and how it will impact the company.
 4. Data change agents: Every forward-thinking organization needs change agents — usually
    an informal role — who can evangelize and marshal the necessary resources for new
    innovation and ways of doing business. Harbert predicts that data change agents may be
    more of a formal job title in the years to come, driving changes in internal operations and
    processes based on data analytics. They need to be good communicators, and a Six Sigma
    background — meaning they know how to apply statistics to improve quality on a
    continuous basis — also helps.
 5. Data engineer/operators: These are the people that make the Big Data infrastructure hum
    on a day-to-day basis. They develop the architecture that helps analyze and supply data in
    the way the business needs, and make sure systems are performing smoothly, says
    Harbert.
 6. Data stewards: Not mentioned in Harbert’s list, but essential to any analytics-driven
    organization, is the emerging role of data steward. Every bit and byte of data across the
    enterprise should be owned by someone — ideally, a line of business. Data stewards
    ensure that data sources are properly accounted for, and may also maintain a centralized
    repository as part of a Master Data Management approach, in which there is one gold
    copy of enterprise data to be referenced.
 7. Data virtualization/cloud specialists: Databases themselves are no longer as unique as
    they used to be. What matters now is the ability to build and maintain a virtualized data
    service layer that can draw data from any source and make it available across
    organizations in a consistent, easy-to-access manner. Sometimes this is called Database-
    as-a-Service. No matter what it's called, organizations need professionals who can
    build and support these virtualized layers or clouds.
       The insights above help visualize what future world-class organizations will need in order
to manage their data.




4.0 Big Data Use Cases :
4.1 Potential Use Cases
        The key to exploiting Big Data analytics is focusing on a compelling business opportunity
as defined by a use case: WHAT exactly are we trying to do, and WHAT value is there in
proving a hypothesis?
        Use cases are emerging in a variety of industries that illustrate different core
competencies around analytics. The figure below plots some use cases along two dimensions:
data velocity and variety. A use case provides the context for a value chain:

        Raw Data -> Aggregated Data -> Intelligence -> Insights -> Decisions -> Operational Impact ->
Financial Outcomes -> Value Creation.


Source : SAS & IDC




                                               Figure 6 - Potential Use Cases for Big Data



   •   Insurance — Individualize auto-insurance policies based on newly captured vehicle
       telemetry data. The insurer gains insight into customers' driving habits, delivering:
       (1) more accurate assessments of risk; (2) individualized pricing based on actual
       individual customer driving habits; (3) the ability to influence and motivate individual
       customers to improve their driving habits.


   •   Travel — Optimize the buying experience through web log and social media data analysis:
       (1) the travel site gains insight into customer preferences and desires; (2) up-sell
       products by correlating current sales with subsequent browsing behavior, increasing
       browse-to-buy conversions via customized offers and packages; (3) deliver personalized
       travel recommendations based on social media data.
   •   Gaming — Collect gaming data to optimize spend within and across games: (1) the games
       company gains insight into the likes, dislikes and relationships of its users; (2) enhance
       games to drive customer spend within games; (3) recommend other content based on
       analysis of player connections and similar likes, and create special offers or packages
       based on browsing and (non-)buying behaviour.




                                                             Figure 7 - Big Data Analytics Business Model




E-tailing / E-Commerce / Online Retailing :
   • Recommendation engines — increase average order size by recommending
     complementary products based on predictive analysis for cross-selling.
   • Cross-channel analytics — sales attribution, average order value, lifetime value (e.g.,
     how many in-store purchases resulted from a particular recommendation,
     advertisement or promotion).
   • Event analytics — what series of steps (the golden path) led to a desired outcome (e.g.,
     purchase, registration).

Retail / Consumer Products :
   • Merchandizing and market basket analysis.
   • Campaign management and customer loyalty programs — marketing departments
     across industries have long used technology to monitor and determine the
     effectiveness of marketing campaigns. Big Data allows marketing teams to incorporate
     higher volumes of increasingly granular data, like click-stream data and call detail
     records, to increase the accuracy of analysis.
   • Supply-chain management and analytics.
   • Event- and behavior-based targeting.
   • Market and consumer segmentations.

Financial Services :
   • Compliance and regulatory reporting.
   • Risk modelling and management — financial firms, banks and others use Hadoop and
     next-generation data warehouses to analyze large volumes of transactional data to
     determine the risk and exposure of financial assets, to prepare for potential what-if
     scenarios based on simulated market behavior, and to score potential clients for risk.
   • Fraud detection and security analytics — credit card companies, for example, use Big
     Data technologies to identify transactional behavior that indicates a high likelihood of
     a stolen card.
   • CRM and customer loyalty programs.
   • Credit risk, scoring and analysis.
   • High-speed arbitrage trading.
   • Trade surveillance.
   • Abnormal trading pattern analysis.

Web & Digital Media Services :
   • Large-scale clickstream analytics.
   • Ad targeting, analysis, forecasting and optimization.
   • Abuse and click-fraud prevention.
   • Social graph analysis and profile segmentation — in conjunction with Hadoop, and
     often next-generation data warehousing, social networking data is mined to determine
     which customers pose the most influence over others inside social networks. This helps
     enterprises determine their most important customers, who are not always those that
     buy the most products or spend the most, but those that tend to influence the buying
     behavior of others the most.
   • Campaign management and loyalty programs.

Government :
   • Fraud detection and cybersecurity.
   • Compliance and regulatory analysis.
   • Energy consumption and carbon footprint management.

New Applications :
   • Sentiment analytics — used in conjunction with Hadoop, advanced text analytics tools
     analyze the unstructured text of social media and social networking posts, including
     tweets and Facebook posts, to determine the user sentiment related to particular
     companies, brands or products.
   • Mashups — mobile user location + precision targeting.
   • Machine-generated data, the exhaust fumes of the Web.

Health & Life Sciences :
   • Health insurance fraud detection.
   • Campaign and sales program optimization.
   • Brand management.
   • Patient care quality and program analysis.
   • Supply-chain management.
   • Drug discovery and development analysis.

Telecommunications :
   • Revenue assurance and price optimization.
   • Customer churn analysis — enterprises use Hadoop and Big Data technologies to
     analyse customer behavior data to identify patterns that indicate which customers are
     most likely to leave for a competing vendor or service.
   • Campaign management and customer loyalty.
   • Call Detail Record (CDR) analysis.
   • Network performance and optimization.
   • Mobile user location analysis.

   •   Smart meters in the utilities industry — the rollout of smart meters as part of Smart
Grid adoption by utilities everywhere has resulted in a deluge of data flowing at unprecedented
levels. Most utilities are ill-prepared to analyze the data once the meters are turned on.
4.2 Big Data Actual Use Cases :

        The graphic below shows the results of a survey undertaken by InformationWeek, which
indicates the percentage of respondents who would opt for open source solutions for Big Data.




                                                                                                   24
Figure 8 - Survey Results : Use of Open Source to Manage Big Data

 Interesting Use Case – Amazon Will Pay Shoppers $5 to Walk Out of Stores Empty-Handed
An interesting use of consumer data entry to power next-generation retail price competition: Amazon is offering consumers up to $5 off on purchases if they compare prices using its mobile phone application in a store. The promotion serves as a way for Amazon to increase usage of its bar-code-scanning application while also collecting intelligence on prices in the stores.

           Amazon’s Price Check app, which is available for iPhone and Android, allows shoppers
to scan a bar code, take a picture of an item or conduct a text search to find the lowest prices.
Amazon is also asking consumers to submit the prices of items with the app, so it knows whether it is still offering the best prices: a great way to feed data from brick-and-mortar retailers into its learning engine.

This is a trend that should worry brick-and-mortar retailers: while real-time everyday-low-price information empowers consumers, it leaves retailers feeling increasingly like showrooms, where shoppers come in to check out the merchandise but ultimately walk out and buy online instead.
    Smart Meters :
a. Because of smart meters, electricity providers can read the meter once every 15 minutes rather than once a month. This not only eliminates the need to send someone to read the meter, but because the meter is read every fifteen minutes, electricity can be priced differently for peak and off-peak hours. Pricing can then be used to shape the demand curve during peak hours, eliminating the need to build additional generating capacity just to meet peak demand and saving electricity providers millions of dollars in generating capacity and plant maintenance costs.


b. A residence in Texas has a smart electric meter, and one of the electricity providers in the area (TXU Energy) is using smart meter technology to shape the demand curve by offering free night-time energy charges: all night, every night, all year long.

In fact, they promote the service as “Do your laundry or run the dishwasher at night, and pay nothing for your Energy Charges”. What TXU Energy is trying to do here is re-shape energy demand using pricing so as to manage peak-time demand, resulting in savings for both TXU and its customers. This wouldn’t have been possible without smart electric meters.
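As a rough illustration of how 15-minute interval reads make plans like TXU's free nights possible, here is a minimal time-of-use billing sketch in Python; the rates and the night window below are invented for the example, not TXU's actual tariff.

```python
from datetime import datetime

# Hypothetical time-of-use tariff: free energy charges at night (9pm-6am),
# a flat rate during the day. All numbers are illustrative assumptions.
NIGHT_START, NIGHT_END = 21, 6   # hours bounding the "free nights" window
DAY_RATE_PER_KWH = 0.12          # assumed day rate in $/kWh
NIGHT_RATE_PER_KWH = 0.0         # "pay nothing for your Energy Charges"

def bill(readings):
    """readings: iterable of (timestamp, kwh) pairs, one per 15-minute read."""
    total = 0.0
    for ts, kwh in readings:
        is_night = ts.hour >= NIGHT_START or ts.hour < NIGHT_END
        total += kwh * (NIGHT_RATE_PER_KWH if is_night else DAY_RATE_PER_KWH)
    return total

# 1 kWh at 10pm is free; 1 kWh at 2pm is billed at the day rate.
print(bill([(datetime(2012, 8, 1, 22, 0), 1.0),
            (datetime(2012, 8, 1, 14, 0), 1.0)]))  # -> 0.12
```

Shaping the demand curve then amounts to choosing these rates so that enough load shifts out of the peak window, which is impossible with a single monthly read.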

 T-Mobile USA has integrated Big Data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections. By leveraging social media data (Big Data) along with transaction data from CRM and billing systems, T-Mobile USA was able to cut customer defections in half in a single quarter.
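The report does not describe T-Mobile's actual model, but the general pattern of churn prediction from combined CRM, billing and social data can be sketched with a simple classifier. The features and numbers below are hypothetical.

```python
# Illustrative churn model, not T-Mobile's system. Features are invented:
# monthly bill, dropped calls, and a social-sentiment score such as the one
# produced by the sentiment mapper sketched earlier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [monthly_bill, dropped_calls, sentiment_score]
X = np.array([[55.0, 1,  0.8],
              [80.0, 9, -0.6],
              [42.0, 0,  0.3],
              [95.0, 7, -0.9]])
y = np.array([0, 1, 0, 1])  # 1 = the customer defected

model = LogisticRegression().fit(X, y)

# Score a current customer: estimated probability of defection.
print(model.predict_proba([[78.0, 6, -0.4]])[0, 1])
```

Customers with a high predicted probability of defection can then be routed into retention campaigns before they leave.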



 US Xpress, a provider of a wide variety of transportation solutions, collects about a thousand data elements, ranging from fuel usage to tire condition to truck engine operations to GPS information, and uses this data for optimal fleet management and to drive productivity, saving millions of dollars in operating costs.
 McLaren’s Formula One racing team uses real-time car sensor data during races, identifies issues with its racing cars using predictive analytics, and takes corrective action proactively, before it is too late (for more on the T-Mobile USA, US Xpress and McLaren F1 case studies, refer to this article on FT.com).
    How Morgan Stanley uses Hadoop :
Gary Bhattacharjee, executive director of enterprise information management at the firm, had worked with Hadoop as early as 2008 and thought that it might provide a solution, so the IT department hooked up some old servers.

At the Fountainhead conference on Hadoop in Finance in New York, Bhattacharjee said the investment bank started by stringing together 15 end-of-life boxes: “It allowed us to bring really cheap infrastructure into a framework and install Hadoop and let it run.”
One area that Bhattacharjee would talk about was IT and log analysis. A typical approach would be to look at web logs and database logs separately to spot problems, but no single log would show whether a web delay was caused by a database problem. “We dumped every log we could get, including web and all the different database logs, put them into Hadoop and ran time-based correlations.” Now the bank can see market events and how they correlate with web issues and database read-write problems.
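Bhattacharjee gave no implementation details, but the core idea of a time-based correlation is simple: normalize timestamps from heterogeneous logs into shared buckets so that a reducer sees every source's events for the same minute side by side. The sketch below assumes a made-up line format of "<source> <ISO-8601 timestamp> <message>".

```python
#!/usr/bin/env python
# log_bucket_mapper.py -- illustrative mapper for time-based log correlation
# (a sketch of the idea, not Morgan Stanley's code). Example input line:
#   weblog 2012-08-28T10:15:03 GET /quotes 200 4210ms
import sys
from datetime import datetime

for line in sys.stdin:
    parts = line.rstrip("\n").split(" ", 2)
    if len(parts) < 3:
        continue  # skip malformed lines
    source, ts_str, message = parts
    try:
        ts = datetime.strptime(ts_str, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        continue
    # Truncate to the minute so that events from different systems landing
    # in the same 60-second window share a key.
    bucket = ts.replace(second=0, microsecond=0)
    print("%s\t%s %s" % (bucket.isoformat(), source, message))
```

A reducer grouping on the bucket key then receives the web, database and market-data events for each minute together, which is exactly what makes a web delay traceable to a database problem.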


    Big Data at Ford
   With analytics now embedded into the culture of Ford, the rise of Big Data analytics has
created a whole host of new possibilities for the automaker.

“We recognize that the volumes of data we generate internally -- from our business operations and also from our vehicle research activities, as well as the universe of data that our customers live in and that exists on the Internet -- all of those things are huge opportunities for us that will likely require some new specialized techniques or platforms to manage,” said Ginder. “Our research organization is experimenting with Hadoop, and we're trying to combine all of these various data sources that we have access to. We think the sky is the limit. We recognize that we're just kind of scraping the tip of the iceberg here.”

   The other major asset that Ford has going for it when it comes to Big Data is that the
company is tracking enormous amounts of useful data in both the product development
process and the products themselves.

 Ginder noted, “Our manufacturing sites are all very well instrumented. Our vehicles are very well instrumented. They're closed-loop control systems. There are many, many sensors in each vehicle… Until now, most of that information was [just] in the vehicle, but we think there's an opportunity to grab that data, understand better how the car operates and how consumers use the vehicles, feed that information back into our design process, and help optimize the user's experience in the future as well.”

 Of course, Big Data is about a lot more than just harnessing all of the runaway data sources that most companies are trying to grapple with. It is about structured data plus unstructured data. Structured data is the traditional content most companies hold in their databases (as well as data like the readings from the sensors in Ford's vehicles and assembly lines). Unstructured data is what is now freely available across the Internet, from public data exposed by governments on sites such as data.gov in the U.S. to treasure troves of consumer intelligence such as Twitter. Mixing the two and coming up with new analysis is what Big Data is all about.

 “The fundamental assumption of Big Data is that the amount of data is only going to grow, and there's an opportunity for us to combine that external data with our own internal data in new ways,” said Ginder. “For better forecasting or for better insights into product design, there are many, many opportunities.”

 Ford is also digging into the consumer intelligence aspect of unstructured data. Ginder said, “We recognize that the data on the Internet is potentially insightful for understanding what our customers or our potential customers are looking for [and] what their attitudes are, so we do some sentiment analysis around blog posts, comments, and other types of content on the Internet.”

That kind of thing is pretty common, and a lot of Fortune 500 companies are doing similar things. However, there is another way Ford is using unstructured data from the Web that is more distinctive, and it has changed the way the company predicts future sales of its vehicles.

“We use Google Trends, which measures the popularity of search terms, to help inform our own internal sales forecasts,” Ginder explained. “Along with other internal data we have, we use that to build a better forecast. It's one of the inputs for our sales forecast. In the past, it would just be what we sold last week. Now it's what we sold last week plus the popularity of the search terms... Again, I think we're just scratching the surface. There's a lot more I think we'll be doing in the future.”
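Ford has not published its forecasting model; as a toy illustration of blending last week's sales with a search-popularity index, here is a minimal least-squares sketch in which every number is invented.

```python
# Toy sales forecast combining lagged sales with a Google Trends-style search
# index (illustrative only; Ford's actual model is not public).
import numpy as np

# Columns: [units_sold_last_week, search_popularity_index (0-100)]
X = np.array([[1200.0, 55.0],
              [1350.0, 62.0],
              [1100.0, 48.0],
              [1500.0, 71.0]])
y = np.array([1250.0, 1400.0, 1120.0, 1580.0])  # units sold this week

# Ordinary least squares with an intercept term.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Forecast: last week's sales plus the current search-popularity index.
print(coef @ [1.0, 1300.0, 60.0])  # point forecast for a new week
```

The point is not the particular model but the extra input: a forecast that used to be "what we sold last week" now also sees how often people are searching for the vehicle.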




Figure 9 - Big Data Value Potential Index




The computer and electronic products and information sectors (Cluster A), which are traded globally, stand out as sectors that have already been experiencing very strong productivity growth and are poised to gain substantially from the use of big data.

       Two services sectors (Cluster B)—finance and insurance and government—are
positioned to benefit very strongly from big data as long as barriers to its use can be overcome.

        Several sectors (Cluster C) have experienced negative productivity growth, probably
indicating that these sectors face strong systemic barriers to increasing productivity. Among
the remaining sectors, we see that globally traded sectors (mostly Cluster D) tend to have
experienced higher historical productivity growth, while local services (mainly Cluster E) have
experienced lower growth.

While all sectors will have to overcome barriers to capture value from the use of big data, barriers are structurally higher for some than for others (Exhibit 3). For example, the public sector, including education, faces higher hurdles because of a lack of a data-driven mindset and of available data. Capturing value in health care is challenging given the relatively low IT investment made so far. Sectors such as retail, manufacturing, and professional services may have relatively lower barriers to overcome, for precisely the opposite reasons.




Bibliography :

John Webster, “Understanding Big Data Analytics”, Aug 2011, searchstorage.techtarget.com.

Bill Franks, “What’s Up With In-Memory Analytics?”, May 7, 2012, iianalytics.com.

Pankaj Maru, “Data scientist: The new kid on the IT block!”, Sep 3, 2012, CIOL.com.

Yellowfin White Paper, “In-Memory Analytics”, www.yellowfin.bi.

“Morgan Stanley Takes On Big Data With Hadoop”, March 30, 2012, Forbes.com.

Ravi Kalakota, “New Tools for New Times – Primer on Big Data, Hadoop and ‘In-memory’ Data Clouds”, May 15, 2011, practicalanalytics.wordpress.com.

McKinsey Global Institute, “Big data: The next frontier for innovation, competition, and productivity”, June 2011.

Harish Kotadia, “4 Excellent Big Data Case Studies”, July 2012, hkotadia.com.

Jeff Kelly, “Big Data: Hadoop, Business Analytics and Beyond”, Aug 27, 2012, Wikibon.org.

Joe McKendrick, “7 new types of jobs created by Big Data”, Sep 20, 2012, Smartplanet.com.

Jean-Jacques Dubray, “NoSQL, NewSQL and Beyond”, Apr 19, 2011, InfoQ.com.

Más contenido relacionado

La actualidad más candente

Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its ChallengesKathirvel Ayyaswamy
 
Big Data and Computer Science Education
Big Data and Computer Science EducationBig Data and Computer Science Education
Big Data and Computer Science EducationJames Hendler
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
Big data issues and challenges
Big data issues and challengesBig data issues and challenges
Big data issues and challengesDilpreet kaur Virk
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 
Big Data : Risks and Opportunities
Big Data : Risks and OpportunitiesBig Data : Risks and Opportunities
Big Data : Risks and OpportunitiesKenny Huang Ph.D.
 
Big Data Information Architecture PowerPoint Presentation Slide
Big Data Information Architecture PowerPoint Presentation SlideBig Data Information Architecture PowerPoint Presentation Slide
Big Data Information Architecture PowerPoint Presentation SlideSlideTeam
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of thingsBig Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of thingsRamakant Gawande
 
Big Data Characteristics And Process PowerPoint Presentation Slides
Big Data Characteristics And Process PowerPoint Presentation SlidesBig Data Characteristics And Process PowerPoint Presentation Slides
Big Data Characteristics And Process PowerPoint Presentation SlidesSlideTeam
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A reviewShilpa Soi
 
Big data
Big dataBig data
Big datahsn99
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesT.S. Lim
 
On Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesOn Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesPetteri Alahuhta
 
Big data ppt
Big data pptBig data ppt
Big data pptYash Raj
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementationSandip Tipayle Patil
 

La actualidad más candente (20)

Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its Challenges
 
Big Data and Computer Science Education
Big Data and Computer Science EducationBig Data and Computer Science Education
Big Data and Computer Science Education
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Big data
Big dataBig data
Big data
 
Big data issues and challenges
Big data issues and challengesBig data issues and challenges
Big data issues and challenges
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
Big Data : Risks and Opportunities
Big Data : Risks and OpportunitiesBig Data : Risks and Opportunities
Big Data : Risks and Opportunities
 
Big Data Information Architecture PowerPoint Presentation Slide
Big Data Information Architecture PowerPoint Presentation SlideBig Data Information Architecture PowerPoint Presentation Slide
Big Data Information Architecture PowerPoint Presentation Slide
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of thingsBig Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
 
Big Data Characteristics And Process PowerPoint Presentation Slides
Big Data Characteristics And Process PowerPoint Presentation SlidesBig Data Characteristics And Process PowerPoint Presentation Slides
Big Data Characteristics And Process PowerPoint Presentation Slides
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A review
 
Big data
Big dataBig data
Big data
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in Businesses
 
On Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesOn Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challenges
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementation
 
Big data mining
Big data miningBig data mining
Big data mining
 

Destacado

Big data trends challenges opportunities
Big data trends challenges opportunitiesBig data trends challenges opportunities
Big data trends challenges opportunitiesMohammed Guller
 
Cards of The Life - Vera Ema Tataro
Cards of The Life - Vera Ema TataroCards of The Life - Vera Ema Tataro
Cards of The Life - Vera Ema TataroTataro
 
Xp1-stair-test-2-10-15-model
 Xp1-stair-test-2-10-15-model Xp1-stair-test-2-10-15-model
Xp1-stair-test-2-10-15-modelRima Kapel
 
Gamification - Reputation System
Gamification - Reputation SystemGamification - Reputation System
Gamification - Reputation SystemJane Vita
 
GSEEM 2012 (int.week_malardalen_may2012)
GSEEM 2012 (int.week_malardalen_may2012)GSEEM 2012 (int.week_malardalen_may2012)
GSEEM 2012 (int.week_malardalen_may2012)Henry Muccini
 
Resilience: a brief view on the state of the art
Resilience: a brief view on the state of the artResilience: a brief view on the state of the art
Resilience: a brief view on the state of the artHenry Muccini
 
Blockchain en smart contracts #pbdag 8 2016 06-27
Blockchain en smart contracts #pbdag 8 2016 06-27Blockchain en smart contracts #pbdag 8 2016 06-27
Blockchain en smart contracts #pbdag 8 2016 06-27Lykle de Vries
 
US Mid-Market Enterprises:Confident in overseas investments in 2016
US Mid-Market Enterprises:Confident in overseas investments in 2016US Mid-Market Enterprises:Confident in overseas investments in 2016
US Mid-Market Enterprises:Confident in overseas investments in 2016The Economist Media Businesses
 
Planificacion De La Gestion Escolar
Planificacion De La Gestion EscolarPlanificacion De La Gestion Escolar
Planificacion De La Gestion Escolarguest21418b
 
Captains of Industry
Captains of IndustryCaptains of Industry
Captains of IndustryIpsos UK
 
9 enterprise tech trends for 2016 and beyond
9 enterprise tech trends for 2016 and beyond9 enterprise tech trends for 2016 and beyond
9 enterprise tech trends for 2016 and beyondJon Cohn
 
Collection Cards of The Life - Vera Ema Tataro
Collection Cards of The Life - Vera Ema Tataro Collection Cards of The Life - Vera Ema Tataro
Collection Cards of The Life - Vera Ema Tataro Tataro
 
Planificación de ciencias naturales de 4° Año
Planificación de ciencias naturales de 4° Año Planificación de ciencias naturales de 4° Año
Planificación de ciencias naturales de 4° Año Micky Arias
 
Duik van de onderzeeboot Trieste (opdracht PenO)
Duik van de onderzeeboot Trieste (opdracht PenO)Duik van de onderzeeboot Trieste (opdracht PenO)
Duik van de onderzeeboot Trieste (opdracht PenO)Joran Michiels
 
Opening up Data - the benefits and value from a community and funding perspec...
Opening up Data - the benefits and value from a community and funding perspec...Opening up Data - the benefits and value from a community and funding perspec...
Opening up Data - the benefits and value from a community and funding perspec...Simon Tanner
 
Roofing in Wixom Michigan USA - Twelve Oaks Roofing
Roofing in Wixom Michigan USA - Twelve Oaks RoofingRoofing in Wixom Michigan USA - Twelve Oaks Roofing
Roofing in Wixom Michigan USA - Twelve Oaks RoofingChristos Pittis
 
The Return of The Sun - Vera Ema Tataro
The Return of The Sun - Vera Ema TataroThe Return of The Sun - Vera Ema Tataro
The Return of The Sun - Vera Ema TataroTataro
 

Destacado (20)

Big data trends challenges opportunities
Big data trends challenges opportunitiesBig data trends challenges opportunities
Big data trends challenges opportunities
 
The evolution of Business Intelligence
The evolution of Business IntelligenceThe evolution of Business Intelligence
The evolution of Business Intelligence
 
Nuevas profesiones
Nuevas profesionesNuevas profesiones
Nuevas profesiones
 
Cards of The Life - Vera Ema Tataro
Cards of The Life - Vera Ema TataroCards of The Life - Vera Ema Tataro
Cards of The Life - Vera Ema Tataro
 
Xp1-stair-test-2-10-15-model
 Xp1-stair-test-2-10-15-model Xp1-stair-test-2-10-15-model
Xp1-stair-test-2-10-15-model
 
Gamification - Reputation System
Gamification - Reputation SystemGamification - Reputation System
Gamification - Reputation System
 
GSEEM 2012 (int.week_malardalen_may2012)
GSEEM 2012 (int.week_malardalen_may2012)GSEEM 2012 (int.week_malardalen_may2012)
GSEEM 2012 (int.week_malardalen_may2012)
 
Resilience: a brief view on the state of the art
Resilience: a brief view on the state of the artResilience: a brief view on the state of the art
Resilience: a brief view on the state of the art
 
Blockchain en smart contracts #pbdag 8 2016 06-27
Blockchain en smart contracts #pbdag 8 2016 06-27Blockchain en smart contracts #pbdag 8 2016 06-27
Blockchain en smart contracts #pbdag 8 2016 06-27
 
US Mid-Market Enterprises:Confident in overseas investments in 2016
US Mid-Market Enterprises:Confident in overseas investments in 2016US Mid-Market Enterprises:Confident in overseas investments in 2016
US Mid-Market Enterprises:Confident in overseas investments in 2016
 
Planificacion De La Gestion Escolar
Planificacion De La Gestion EscolarPlanificacion De La Gestion Escolar
Planificacion De La Gestion Escolar
 
Captains of Industry
Captains of IndustryCaptains of Industry
Captains of Industry
 
Estrés
EstrésEstrés
Estrés
 
9 enterprise tech trends for 2016 and beyond
9 enterprise tech trends for 2016 and beyond9 enterprise tech trends for 2016 and beyond
9 enterprise tech trends for 2016 and beyond
 
Collection Cards of The Life - Vera Ema Tataro
Collection Cards of The Life - Vera Ema Tataro Collection Cards of The Life - Vera Ema Tataro
Collection Cards of The Life - Vera Ema Tataro
 
Planificación de ciencias naturales de 4° Año
Planificación de ciencias naturales de 4° Año Planificación de ciencias naturales de 4° Año
Planificación de ciencias naturales de 4° Año
 
Duik van de onderzeeboot Trieste (opdracht PenO)
Duik van de onderzeeboot Trieste (opdracht PenO)Duik van de onderzeeboot Trieste (opdracht PenO)
Duik van de onderzeeboot Trieste (opdracht PenO)
 
Opening up Data - the benefits and value from a community and funding perspec...
Opening up Data - the benefits and value from a community and funding perspec...Opening up Data - the benefits and value from a community and funding perspec...
Opening up Data - the benefits and value from a community and funding perspec...
 
Roofing in Wixom Michigan USA - Twelve Oaks Roofing
Roofing in Wixom Michigan USA - Twelve Oaks RoofingRoofing in Wixom Michigan USA - Twelve Oaks Roofing
Roofing in Wixom Michigan USA - Twelve Oaks Roofing
 
The Return of The Sun - Vera Ema Tataro
The Return of The Sun - Vera Ema TataroThe Return of The Sun - Vera Ema Tataro
The Return of The Sun - Vera Ema Tataro
 

Similar a Big Data - Insights & Challenges

Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Taniya Fansupkar
 
Big data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightBig data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightJyrki Määttä
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISCASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISIRJET Journal
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Thingspateelhs
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 
20211011112936_PPT01-Introduction to Big Data.pptx
20211011112936_PPT01-Introduction to Big Data.pptx20211011112936_PPT01-Introduction to Big Data.pptx
20211011112936_PPT01-Introduction to Big Data.pptxSyauqiAsyhabira1
 
Oea big-data-guide-1522052
Oea big-data-guide-1522052Oea big-data-guide-1522052
Oea big-data-guide-1522052kavi172
 
Oea big-data-guide-1522052
Oea big-data-guide-1522052Oea big-data-guide-1522052
Oea big-data-guide-1522052Gilbert Rozario
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfvvpadhu
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.saranya270513
 
Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big DataIRJET Journal
 
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...Happiest Minds Technologies
 
Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
 Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Mindshappiestmindstech
 
Ab cs of big data
Ab cs of big dataAb cs of big data
Ab cs of big dataDigimark
 
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptxUnit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptxYashiBatra1
 

Similar a Big Data - Insights & Challenges (20)

Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
 
Big data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sightBig data-comes-of-age ema-9sight
Big data-comes-of-age ema-9sight
 
1
11
1
 
Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISCASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Things
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
20211011112936_PPT01-Introduction to Big Data.pptx
20211011112936_PPT01-Introduction to Big Data.pptx20211011112936_PPT01-Introduction to Big Data.pptx
20211011112936_PPT01-Introduction to Big Data.pptx
 
Oea big-data-guide-1522052
Oea big-data-guide-1522052Oea big-data-guide-1522052
Oea big-data-guide-1522052
 
Oea big-data-guide-1522052
Oea big-data-guide-1522052Oea big-data-guide-1522052
Oea big-data-guide-1522052
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Kartikey tripathi
Kartikey tripathiKartikey tripathi
Kartikey tripathi
 
Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big Data
 
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
 
Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
 Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
 
The ABCs of Big Data
The ABCs of Big DataThe ABCs of Big Data
The ABCs of Big Data
 
Ab cs of big data
Ab cs of big dataAb cs of big data
Ab cs of big data
 
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptxUnit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
 

Último

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Último (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Big Data - Insights & Challenges

  • 1. 2012 Big Data – Insights & Challenges Rupen Momaya WEschool Part Time Masters Program 8/28/2012 1
  • 2. Table of Contents Executive Summary ......................................................................................................................... 4 Introduction .................................................................................................................................... 5 1.0 Big Data Basics ...................................................................................................................... 5 1.1 What is Big Data ?.................................................................................................................. 5 1.2 Big Data Steps, Vendors & Technology Landscape .............................................................. 6 1.3 What Business Problems are being targeted ? ..................................................................... 7 1.4 Key Terms ............................................................................................................................ 8 1.4.1 Data Scientists ................................................................................................................ 8 1.4.2 Massive Parallel Processing (MPP) ................................................................................ 8 1.4.3 In Memory analytics ...................................................................................................... 8 1.4.4 Structured, Semi-Structured & UnStructured Data........................................................ 9 2.0. Big Data Infrastructure ........................................................................................................... 10 2.1 Storage................................................................................................................................. 10 2.1.1 Why RAID Fails at Scale ................................................................................................ 10 2.1.2 Scale up vs Scale out NAS ............................................................................................ 10 2.1.3 Object Based Storage.................................................................................................... 11 2.2 Apache Hadoop ................................................................................................................... 12 2.3 Data Appliances .................................................................................................................. 13 2.3.1 HP Vertica .................................................................................................................... 13 2.3.2 Terradata Aster ............................................................................................................ 14 3.0 Domain Wise Challenges in Big Data Era ................................................................................ 16 3.1 Log Management................................................................................................................. 16 3.2 Data Integrity & Reliability in the Big Data Era ................................................................... 16 3.3 Backup Management in Big Data Era ................................................................................. 17 3.4 Database Management in Big Data Era .............................................................................. 19 4.0 Big Data Use Cases ................................................................................................................. 
21 4.1 Potential Use Cases ............................................................................................................. 21 4.2 Big Data Actual Use Cases .................................................................................................. 24 Bibliography ................................................................................................................................. 29 2
  • 3. Table of Figures Figure 1 - Big Data Statistics InfoGraphic ____________________________________________________________ 6 Figure 2 - Big Data Vendors & Technology Landscape __________________________________________________ 7 Figure 3 - HP Vertica Analytics Appliance ___________________________________________________________ 13 Figure 4 - Terradata Unified Big Data Architecture for the Enterprise ____________________________________ 15 Figure 5 - Framework for Choosing Teradata Aster Solution ____________________________________________ 15 Figure 6 - Potential Use Cases for Big Data _________________________________________________________ 21 Figure 7 - Big Data Analytics Business Model ________________________________________________________ 22 Figure 8 - Survey Results : Use of Open Source to Manage Big Data _____________________________________ 25 Figure 9 - Big Data Value Potential Index ___________________________________________________________ 28 3
  • 4. Executive Summary : The Internet has made new sources of vast amount of data to business executives. Big data is comprised of data sets too large to be handled by traditional systems. To remain competitive, business executives need to adopt new technologies & techniques emerging due to big data. Big data includes structured data, semistructured and unstructured data. Structured data are those data formatted for use in a database management system. Semistructured and unstructured data include all types of unformatted data including multimedia and social media content. Big data are also provided by myriad hardware objects, including sensors & actuators embedded in physical objects, which are termed the Internet of things. Data storage techniques used include multiple clustered network attached storage (NAS) and object-based storage. Clustered NAS deploys storage devices attached to a network. Groups of storage devices attached to different networks are then clustered together. Object- based storage systems distribute sets of objects over a distributed storage system. Hadoop, used to process unstructured and semistructured big data, uses the map paradigm to locate all relevant data then select only the data directly answering the query. NoSQL, MongoDB, and TerraStore process structured big data. NoSQL data is characterized by being basically available, soft state (changeable), and eventually consistent. MongoDB and TerraStore are both NoSQL-related products used for document – oriented applications. The advent of the age of big data poses opportunities and challenges for businesses. Previously unavailable forms of data can now be saved, retrieved, and processed. However, changes to hardware, software, and data processing techniques are necessary to employ this new paradigm. 4
  • 5. Introduction: The internet has grown tremendously in the last decade, from 304 million users in Mar 2000 to 2280 million users in Mar 2012 according to Internet Worlds stats. Worldwide information is more than doubling every two years, with 1.8 zettabytes or 1.8 trillion gigabytes projected to be created and replicated in 2011 according to the study conducted by research firm IDC. A buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process with traditional database and software techniques is "Big Data". An example of Big Data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) and zettabytes of data consisting of billions to trillions of records of millions of people -- all from different sources (e.g. blogs, social media, email, sensors, RFID readers, photographs, videos, microphones, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible. When dealing with larger datasets, organizations face difficulties in being able to create, manipulate, and manage Big Data. Scientists regularly encounter this problem in meteorology, genomics, connectomics, complex physics simulations, biological and environmental research,Internet search, finance and business informatics. Big data is particularly a problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets. While the term may seem to reference the volume of data, that isn't always the case. The term Big Data, especially when used by vendors, may refer to the technology (the tools and processes) that an organization requires to handle the large amounts of data and storage facilities. 1.0 Big Data Basics : 1.1 What is Big Data ? Below infographic depicts the expected market size of Big data and some statistics. 5
  • 6. Figure 1 - Big Data Statistics InfoGraphic 1.2 Big Data Steps, Vendors & Technology Landscape :  Data Acquisition: Data is collected from the data sources and distributed across multiple nodes -- often a grid -- each of which processes a subset of data in parallel. Here we have technological providers like IBM, HP etc.. and data providers like Reuters, Salesforce etc.. and social network websites like Facebook, Google+, LinkedIn etc..  Marshalling : In this domain, we have Very Large Data Warehousing and BI Appliances, actors like Actian, EMC² (Greenplum), HP (Vertica), IBM (Netezza) etc.  Analytics : In this phase, we have the predictive technologies (such as data mining) and vendors which are Adobe, EMC², GoodData, Hadoop Map Reduce etc.  Action : Includes all the Data Acquisition providers plus the ERP, CRM and BPM actors, including Adobe, Eloqua, EMC² etc.. Both in Analytical and Action phases, BI tools vendors are GoodData, Google, HP (Autonomy), IBM (Cognos suite) etc.. 6
  • 7.  Data Governance : An efficient Master data management solution. As defined, data governance applies to each of the six preceding stages of Big Data delivery. By establishing processes and guiding principles it sanctions behaviors around data. In short, data governance means that the application of Big Data is useful and relevant. It's an insurance policy that the right questions are being asked. So we won't be squandering the immense power of new Big Data technologies that make processing, storage and delivery speed more cost-effective and nimble than ever. Figure 2 - Big Data Vendors & Technology Landscape 1.3 What Business Problems are being targeted ? World-class companies are targeting a new set of business problems that were hard to solve before –  Modeling true risk  Customer churn analysis,  Flexible supply chains,  Loyalty pricing,  Recommendation engines,  Ad targeting,  Precision targeting, 7
  • 8. PoS transaction analysis,  Threat analysis,  Trade surveillance,  Search quality fine tuning and  Mashups such as location + ad targeting. Data growth curve: Terabytes -> Petabytes -> Exabytes -> Zettabytes -> Yottabytes -> Brontobytes -> Geopbytes. It is getting more interesting. Analytical Infrastructure curve: Databases -> Datamarts -> Operational Data Stores (ODS) -> Enterprise Data Warehouses -> Data Appliances -> In-Memory Appliances -> NoSQL Databases - > Hadoop Clusters 1.4 Key Terms : 1.4.1 Data Scientists : A data scientist represents an evolution from the business or data analyst role. Data scientists, also known as data analysts -- are professionals with core statistics or mathematics background coupled with good knowledge in analytics and data software tools. A McKinsey study on Big Data states, “India will need nearly 1,00,000 data scientists in the next few years.” A Data Scientist is a fairly new role defined by Hillary Mason of Bit.ly as someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning who culls information from data. These data scientists take a blend of the hackers’ arts, statistics, and machine learning and apply their expertise in mathematics and understanding the domain of the data—where the data originated—to process the data into useful information. This requires the ability to make creative decisions about the data and the information created and maintaining a perspective that goes beyond ordinary scientific boundaries. 1.4.2 Massive Parallel Processing (MPP) : MPP is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. An MPP system is considered better than a symmetrically parallel system ( SMP ) for applications that allow a number of databases to be searched in parallel. These include decision support system and data warehouse applications. 1.4.3 In Memory analytics : The key difference between conventional BI tools and in-memory products is that the former query data on disk while the latter query data in random access memory(RAM). When a user runs a query against a typical data warehouse, the querynormally goes to a database that reads the information from multiple tables stored on a server’shard disk. With a server- based inmemory database, all information is initially loaded into memory. Users then query and interact with the data loaded into the machine’s memory. Does an in-memory analytics platform replace or augment traditional in-database approaches? 8
  • 9. The answer is that it is quite complementary. In-database approaches put a large focus on the data preparation and scoring portions of the analytic process. The value of in-database processing is the ability to handle terabytes or petabytes of data effectively. Much of the processing may not be highly sophisticated, but it is critical. The new in-memory architectures use a massively parallel platform to enable the multiple terabytes of system memory to be utilized (conceptually) as one big pool of memory. This means that samples can be much larger, or even eliminated. The number of variables tested can be expanded immensely. In-memory approaches fit best in situations where there is a need for:  High Volume & Speed: It is necessary to run many, many models quickly  High Width & Depth: It is desired to test hundreds or thousands of metrics across tens of millions customers (or other entities)  High Complexity: It is critical to run processing-intensive algorithms on all this data and to allow for many iterations to occur. There are a number of in-memory analytics tools and technologies with different architectures. Boris Evelson (Forrester Research) defines the following five types of business intelligence in-memory analytics:  In-memory OLAP: Classic MOLAP (Multidimensional Online Analytical Processing) cube loaded entirely in memory.  In-memory ROLAP: Relational OLAP metadata loaded entirely in memory.  In-memory inverted index: Index, with data, loaded into memory.  In-memory associative index: An array/index with every entity/attribute correlated to every other entity/attribute.  In-memory spreadsheet: Spreadsheet like array loaded entirely into memory. 1.4.4 Structured, Semi-Structured & UnStructured Data : Structured Data is the type that would fit neatly into a standard Relational Data Base Management System, RDBMS, and lend itself to that type of processing. Semi-structured Data is that which has some level of commonality but does not fit the structured data type. Unstructured Data is the type that varies in its content and can change from entry to entry. Structured Data Semi Structure Data UnStructured Data Customer Records Web Logs Pictures Point of Sale data Social Media Video Editing Data Inventory E-Commerce Productivity (Office docs) Financial Records Geological Data Above table depicts the examples of each of them. 9
  • 10. 2. 0 Big Data Infrastructure 2.1 Storage 2.1.1 Why RAID Fails at Scale : RAID schemes are based on parity, and at its root, if more than two drives fail simultaneously, data is not recoverable. The statistical likelihood of multiple drive failures has not been an issue in the past. However, as drive capacities continue to grow beyond the terabyte range and storage systems continue to grow to hundreds of terabytes and petabytes, the likelihood of multiple drive failures is now a reality. Further, drives aren’t perfect, and typical SATA drives have a published bit rate error (BRE) of 1014 , meaning that once every 100,000,000,000,000 bits, there will be a bit that is unrecoverable. Doesn’t seem significant? In today’s big data storage systems, it is. The likelihood of having one drive fail, and encountering a bit rate error when rebuilding from the remaining RAID set is highly probable in real world scenarios. To put this into perspective, when reading 10 terabytes, the probability of an unreadable bit is likely (56%), and when reading 100 terabytes, it is nearly certain (99.97%). 2.1.2 Scale up vs Scale out NAS : Traditional Scale up system would provide a small number of access points, or data servers, that would sit in front of a set of disks protected with RAID. As these systems needed to provide more data to more users the storage administrator would add more disks to the back end but this only caused to create the data servers as a choke point. Larger and faster data servers could be created using faster processor and more memory but this architecture still had significant scalability issues. Scale out uses the approach of more of everything—instead of adding drives behind a pair of servers, it adds servers each with processor, memory, network interfaces and storage capacity. As I need to add capacity to a grid—the scale out version of an array—I insert a new node with all the available resources. This architecture required a number of things to make it work from both a technology and financial aspect. Some of these factors include:  Clustered architecture – for this model to work the entire grid needed to work as a single entity and each node in the grid would need to be able to pick up a portion of the function of any other node that may fail.  Distributed/parallel file system – the file system must allow for a file to be accessed from any one or any number of nodes to be sent to the requesting system. This required different mechanisms underlying the file system: distribution of data across multiple nodes for redundancy, a distributed metadata or locking mechanism, and data scrubbing/validation routines.  Commodity hardware – for these systems to be affordable they must rely on commodity hardware that is inexpensive and easily accessible instead of purpose built systems. Benefits of Scale Out : There are a number of significant benefits to these new scale out systems that meet the needs of big data challenges. 10
2.1.2 Scale up vs Scale out NAS

A traditional scale-up system provides a small number of access points, or data servers, sitting in front of a set of disks protected with RAID. As these systems needed to serve more data to more users, the storage administrator would add more disks to the back end, but this only turned the data servers into a choke point. Larger and faster data servers could be built using faster processors and more memory, but the architecture still had significant scalability issues.

Scale-out takes the approach of more of everything: instead of adding drives behind a pair of servers, it adds servers, each with its own processor, memory, network interfaces and storage capacity. To add capacity to a grid (the scale-out version of an array), you insert a new node with all of those resources. Making this architecture work, from both a technology and a financial perspective, required several things:

• Clustered architecture – for this model to work, the entire grid must operate as a single entity, and each node in the grid must be able to pick up a portion of the function of any other node that fails.
• Distributed/parallel file system – the file system must allow a file to be accessed from any one node, or any number of nodes, and sent to the requesting system. This requires different mechanisms underlying the file system: distribution of data across multiple nodes for redundancy, a distributed metadata or locking mechanism, and data scrubbing/validation routines.
• Commodity hardware – for these systems to be affordable, they must rely on inexpensive, easily accessible commodity hardware instead of purpose-built systems.

Benefits of Scale Out

There are a number of significant benefits to these new scale-out systems that meet the needs of big data challenges:

• Manageability – when data can grow in a single file-system namespace, manageability improves significantly: a single data administrator can now manage a petabyte or more of storage, versus 50 or 100 terabytes on a scale-up system.
• Elimination of stovepipes – since these systems scale linearly and avoid the bottlenecks that scale-up systems create, all data is kept in a single file system in a single grid, eliminating the stovepipes introduced by multiple arrays and file systems.
• Just-in-time scalability – as storage needs grow, an appropriate number of nodes can be added at the time they are needed. With scale-up arrays, administrators had to guess the final size their data might reach, which often led to buying large data servers with only a few disks behind them initially, to avoid hitting a data-server bottleneck as disks were added.
• Increased utilization rates – since the data servers in these scale-out systems can address the entire pool of storage, there is no stranded capacity.

There are five core tenets of scale-out NAS: it should be simple to scale, offer predictable performance, be efficient to operate, be always available, and be proven to work in a large enterprise.

EMC Isilon

EMC Isilon is a scale-out platform that delivers storage for big data. Powered by the OneFS operating system, Isilon nodes are clustered to create a high-performing, single pool of storage. In May 2011, EMC Corporation announced the world's largest single file system with the introduction of EMC Isilon's IQ 108NL scale-out NAS hardware product. Leveraging three-terabyte (TB) enterprise-class Hitachi Ultrastar drives in a 4U node, the 108NL scales to more than 15 petabytes (PB) in a single file system and single volume, providing a storage foundation for maximizing the big data opportunity. EMC also announced Isilon's SmartLock data-retention software application, delivering immutable protection for big data to ensure the integrity and continuity of big data assets from initial creation to archival.

2.1.3 Object Based Storage

Object storage is based on a single, flat address space that enables the automatic routing of data to the right storage systems, and to the right tier and protection levels within those systems, according to its value and stage in the data life cycle.
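Because objects live in a flat namespace, they are located by a unique identifier rather than by a directory path. The minimal Python sketch below is an illustration of that idea only, not any vendor's API; it derives the identifier from the content itself, as many flat-address-space designs do:

```python
import hashlib

class FlatObjectStore:
    """Toy flat-namespace store: objects are addressed by ID, not by path."""

    def __init__(self):
        self._objects = {}  # object_id -> bytes; no directory tree at all

    def put(self, data: bytes) -> str:
        # Derive the ID from the content itself (content addressing).
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

store = FlatObjectStore()
oid = store.put(b"surveillance frame 0001")
print(oid[:16], store.get(oid) == b"surveillance frame 0001")
```

Since any node that knows the ID can route the request, there is no central directory to outgrow, which is what allows the scaling properties described next.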
Better Data Availability than RAID

In a properly configured object storage system, content is replicated so that a minimum of two replicas assures continuous data availability. If a disk dies, all other disks in the cluster join in to replace the lost replicas while the system continues to run at nearly full speed. Recovery takes only minutes, with no interruption of data availability and no noticeable performance degradation.

Provides Unlimited Capacity and Scalability

In object storage systems there is no directory hierarchy (or "tree"), and an object's location does not have to be specified the way a directory path must be known in order to retrieve a file. This enables object storage systems to scale to petabytes and beyond, without limits on the number of files (objects), file size or file-system capacity, such as the 2-terabyte restriction that is common for Windows and Linux file systems.

Backups Are Eliminated

With a well-designed object storage system, backups are not required. Multiple replicas ensure that content is always available, and an offsite disaster-recovery replica can be created automatically if desired.

Automatic Load Balancing

A well-designed object storage cluster is totally symmetrical, meaning each node is independent, provides an entry point into the cluster and runs the same code. Companies that provide object storage include CleverSafe, Compuverde, Amplidata, Caringo, EMC (Atmos), Hitachi Data Systems (Hitachi Content Platform), NetApp (StorageGRID) and Scality.

2.2 Apache Hadoop

Apache Hadoop has been the driving force behind the growth of the big data industry. It is a framework for running applications on large clusters built of commodity hardware; the framework transparently provides applications with both reliability and data motion.

MapReduce is the core of Hadoop. Created at Google in response to the problem of building web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. In addition to Hadoop, you will find MapReduce inside MPP and NoSQL databases such as Vertica or MongoDB. The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the problem of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays.

HDFS, the Hadoop Distributed File System, supplies the data side of that equation: for the computation to take place, each server must have access to the data. HDFS and MapReduce are robust. Servers in a Hadoop cluster can fail without aborting the computation process, and HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node writes its results back into HDFS. There are no restrictions on the data that HDFS stores: it may be unstructured and schemaless, whereas relational databases require that data be structured and schemas be defined before storing the data. With HDFS, making sense of the data is the responsibility of the developer's code.
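As a concrete, deliberately tiny illustration of the divide-and-run-in-parallel idea, here is word counting, the canonical MapReduce example, sketched in plain Python rather than against the actual Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit (key, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(key, values):
    """Reduce phase: combine all counts that share a key."""
    return key, sum(values)

lines = ["big data needs big clusters", "data moves to the code"]

# Shuffle: group every mapper's output by key (Hadoop does this between phases).
groups = defaultdict(list)
for key, value in chain.from_iterable(mapper(l) for l in lines):
    groups[key].append(value)

print(dict(reducer(k, v) for k, v in groups.items()))
# {'big': 2, 'data': 2, ...} -- each reducer call could run on a different node
```

In Hadoop, each `mapper` call runs where its input split already resides in HDFS, and each `reducer` call can run on a different node, which is how the work spreads across the cluster.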
Why would a company be interested in Hadoop? The number one reason is to take advantage of unstructured or semi-structured data. This data will not fit well into a relational database, but Hadoop offers a scalable and relatively easy-to-program way to work with it. The category includes emails, web server logs, instrumentation of online stores, images, video and external data sets (such as a list of small businesses organized by geographical area). All of this data can contain information that is critical to the business and should reside in your data warehouse, but it needs a great deal of pre-processing, and that pre-processing will not happen in an Oracle RDBMS (for example). The other reason to look into Hadoop is for information that exists in the database but cannot be efficiently processed within the database. This is a wide use case, usually labelled "ETL" because the data is going out of an OLTP system and into a data warehouse: you use Hadoop when 99% of the work is in the "T" of ETL, transforming the data into useful information.

2.3 Data Appliances

Purpose-built solutions such as Teradata, IBM/Netezza, EMC/Greenplum, SAP HANA (High-Performance Analytic Appliance), HP Vertica and Oracle Exadata are forming a new category, one of the fastest growing in big data. Data appliances integrate database, processing and storage in a single system optimized for analytics, offering:

• Processing close to the data source
• Appliance simplicity (ease of procurement; limited consulting)
• Massively parallel architecture
• A platform for advanced analytics
• Flexible configurations and extreme scalability

2.3.1 HP Vertica

Figure 3 - HP Vertica Analytics Appliance

The Vertica Analytics Platform is purpose-built from the ground up to enable companies to extract value from their data at the speed and scale they need to thrive in today's economy.
Vertica has been designed and built since its inception for today's most demanding analytic workloads, and each Vertica component is able to take full advantage of the others by design. Key features of the Vertica Analytics Platform (the columnar storage point is illustrated in the sketch after this list):

o Real-Time Query & Loading » Capture the time value of data by continuously loading information, while simultaneously allowing immediate access for rich analytics.
o Advanced In-Database Analytics » An ever-growing library of features and functions to explore and process more data closer to the CPU cores, without the need to extract it.
o Database Designer & Administration Tools » Powerful setup, tuning and control with minimal administration effort; continual improvements can be made while the system remains online.
o Columnar Storage & Execution » Perform queries 50x-1000x faster by eliminating costly disk I/O, without the hassle and overhead of indexes and materialized views.
o Aggressive Data Compression » Accomplish more with less CAPEX, while delivering superior performance with an engine that operates directly on compressed data.
o Scale-Out MPP Architecture » Vertica scales linearly and without limit by simply adding industry-standard x86 servers to the grid.
o Automatic High Availability » Runs non-stop with automatic redundancy, failover and recovery, optimized to deliver superior query performance as well.
o Optimizer, Execution Engine & Workload Management » Maximum performance without worrying about the details of how it gets done; users just ask questions, and answers come back fast.
o Native BI, ETL, & Hadoop/MapReduce Integration » Seamless integration with a robust and ever-growing ecosystem of analytics solutions.
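To see why column orientation and compression reinforce each other, consider a toy example in Python (illustrative only, not Vertica's actual storage engine): storing a table column-wise lets a query touch only the columns it needs, and columns of similar values compress very well with run-length encoding.

```python
from itertools import groupby

# Row store: every query drags whole rows off disk.
rows = [("east", 2012, 100), ("east", 2012, 140), ("west", 2012, 90)]

# Column store: the same table, kept as one sequence per column.
columns = {
    "region": ["east", "east", "west"],
    "year":   [2012, 2012, 2012],
    "sales":  [100, 140, 90],
}

def rle(values):
    """Run-length encode a column: repeated values collapse to (value, count)."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# SUM(sales) reads one column; region and year never leave disk.
print(sum(columns["sales"]))   # 330
print(rle(columns["region"]))  # [('east', 2), ('west', 1)]
print(rle(columns["year"]))    # [(2012, 3)] -- sorted data compresses hard
```

The engine operating "directly on compressed data" means aggregates can often be computed from the (value, count) pairs themselves, without decompressing first.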
2.3.2 Teradata Aster

To gain business insight using MapReduce and Apache Hadoop alongside SQL-based analytics, Teradata Aster offers a unified big data architecture that blends the best of Hadoop and SQL, allowing users to:

• Capture and refine data from a wide variety of sources
• Perform necessary multi-structured data preprocessing
• Develop analytics rapidly
• Process embedded analytics, analyzing both relational and non-relational data
• Produce semi-structured data as output, often with metadata and heuristic analysis
• Solve new analytical workloads with reduced time to insight
• Use massively parallel storage in Hadoop to efficiently store and retain data

Figure 4 - Teradata Unified Big Data Architecture for the Enterprise

When should each solution (Teradata, Aster or Hadoop) be chosen? The figure below offers a framework to help enterprise architects use each part of a unified big data architecture most effectively. The framework allows a best-of-breed approach that can be applied to each schema type, helping achieve maximum performance, rapid enterprise adoption, and the lowest TCO.

Figure 5 - Framework for Choosing the Teradata Aster Solution
3.0 Domain Wise Challenges in Big Data Era

3.1 Log Management

Log data does not fall into the convenient schemas required by relational databases. Log data is, at its core, unstructured, or in fact semi-structured, which leads to a deafening cacophony of formats; the sheer variety in which logs are generated presents a major problem in how they are analyzed. The emergence of big data has been driven not only by the increasing amount of unstructured data to be processed in near real time, but also by the availability of new toolsets to deal with these challenges.

Two things do not receive enough attention in the log management space. The first is real scalability, which means thinking beyond what data centers can do; that inevitably leads to ambient cloud models for log management. Splunk has done an amazing job of pioneering an ambient cloud model with its eventual-consistency design, which lets a query return a "good enough" answer quickly, or a perfect answer in more time. The second is security. Log data is next to useless if it is not non-repudiable: all the log data in the world is not useful as evidence unless you can prove that nobody changed it. Sumo Logic, Loggly and Splunk are the primary companies that currently have products around log management.

3.2 Data Integrity & Reliability in the Big Data Era

Consider standard business practices and how nearly all physical forms of documentation and transactions have evolved into digitized versions. With them come the inherent challenges of validating not just the authenticity of their contents but also the impact of acting upon an invalid data set, something highly possible in today's high-velocity, big data business environment. With this view, we can begin to identify the scale of the challenge. With cybercrime and insider threats clearly emerging as a much more profitable (and safer) business for the criminal element, the need to validate and verify is going to become critical to all business documentation and related transactions, even within existing supply chains.

Keyless signature technology is a relatively new concept in the market and requires a different set of perspectives when put under consideration. A keyless signature provides an alternative to key-based technologies by providing proof and non-repudiation of electronic data using only hash functions for verification. Keyless signatures are implemented via a globally distributed machine that takes hash values of data as inputs and returns keyless signatures proving the time, integrity and origin (machine, organization, individual) of the input data. A primary goal of the technology is to provide mass-scale, non-expiring data validation while eliminating the need for secrets or other forms of trust, thereby reducing or even eliminating the need for more complex certificate-based solutions, which are rife with certificate-management issues, including expiration and revocation.
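The report does not spell out the keyless-signature protocol itself, but the core idea, integrity from hash functions alone, can be illustrated with a simple hash chain over log entries. This is a sketch of the principle, not the actual globally distributed scheme: each entry's digest commits to the previous digest, so altering any past entry invalidates every digest after it.

```python
import hashlib

def chain(entries):
    """Hash-chain log entries: each digest commits to all prior entries."""
    digests, prev = [], b""
    for entry in entries:
        prev = hashlib.sha256(prev + entry.encode()).digest()
        digests.append(prev.hex())
    return digests

log = ["10:01 login alice", "10:02 read fileA", "10:03 logout alice"]
original = chain(log)

log[1] = "10:02 read fileB"   # an attacker edits one historical entry
tampered = chain(log)

# The mismatch starts at the edited entry and propagates to every later one.
print([a == b for a, b in zip(original, tampered)])  # [True, False, False]
```

Publishing the latest digest somewhere widely witnessed is what makes the history non-repudiable: no single party can silently rewrite it afterwards.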
As more organizations are affected by the big data phenomenon, the clear implication is that many businesses will be making decisions based on massive amounts of internal and third-party data. Consequently, the demand for novel and trusted approaches to validating data will grow. Extend this concept to the ability to validate a virtual machine, switch logs or indeed the security logs, multiply by the clear advantages that cloud computing (public or private) has over the traditional datacenter design, and we begin to understand why keyless data-integrity technology that can ensure self-validating data is likely to see swift adoption. The ability to move away from reliance on a third-party certification authority will be welcomed by many, although this departure from the traditionally accepted approach to verifying data integrity needs to be more widely publicized and understood before mass-market adoption and acceptance.

Another solution for monitoring the stability, performance and security of a big data environment comes from a company called Gazzang. Enterprises and SaaS solution providers have new needs driven by the new infrastructures and opportunities of cloud computing. For example, business intelligence analysis uses big data stores such as MongoDB, Hadoop and Cassandra, with the data spread across hundreds or thousands of servers to optimize processing time and return business insight to the user. Leveraging its extensive experience with cloud architectures and big data platforms, Gazzang delivers a SaaS solution for the capture, management and analysis of massive volumes of IT data. Gazzang zOps is purpose-built for monitoring big data platforms and multiple cloud environments; its engine collects and correlates vast amounts of data from numerous sources in a variety of forms.

3.3 Backup Management in Big Data Era

For protection against user or application error, Ashar Baig, a senior analyst and consultant with the Taneja Group, said snapshots can help with big data backups. Baig also recommends a local disk-based system for quick and simple first-level data recovery. "Look for a solution that provides you an option for local copies of data so that you can do local restores, which are much faster," he said. "Having a local copy, and having an image-based technology to do fast, image-based snaps and replications, does speed it up and takes care of the performance concern."

Faster scanning needed

One of the issues big data backup systems face is scanning each time the backup and archiving jobs start. Legacy data-protection systems scan the file system every time a backup job is run, and again every time an archiving job is run; for file systems in big data environments, this can be very time-consuming. Commvault's answer to the scanning issue in its Simpana data-protection software is its OnePass feature. According to Commvault, OnePass is an object-level converged process for collecting backup, archiving and reporting data. The data is collected and moved off the primary system to a ContentStore virtual repository, where the data-protection operations are completed. Once a complete scan has been accomplished, the Commvault software places an agent on the file system to report on incremental backups, making the process even more efficient.
Casino doesn't want to gamble on backups

Pechanga Resort & Casino in Temecula, Calif., went live with a cluster of 50 EMC Isilon X200 nodes in February to back up data from its surveillance cameras. The casino has 1.4 PB of usable Isilon storage to keep the data, which is critical to operations because the casino must shut down all gaming operations if its surveillance system is interrupted. "In gaming, we're mandated to have surveillance coverage," said Michael Grimsley, director of systems for Pechanga Technology Solutions Group. "If surveillance is down, all gaming has to stop."

If a security incident occurs, the IT team pulls footage from the X200 nodes, moves it to WORM-compliant storage and backs it up with NetWorker software to EMC Data Domain DD860 deduplication target appliances. The casino doesn't need tape for WORM capability because WORM is part of Isilon's SmartLock software. "It's mandatory that part of our storage includes a WORM-compliant section," Grimsley said. "Any time an incident happens, we put that footage in the vault. We have policies in place so it's not deleted." The casino keeps 21 days' worth of video on Isilon before recording over it.

Grimsley said he is looking to expand the backup for the surveillance-camera data. He is considering adding a bigger Data Domain device to do day-to-day backup of the data. "We have no requirements for day-to-day backup, but it's something we would like to do," he said. Another possibility is adding replication to a DR site so the casino can recover quickly if the surveillance system goes down.

Scale-out systems

Another option for solving the performance and capacity issues is a scale-out backup system, similar to scale-out NAS but built for data protection. You add nodes with additional performance and capacity resources as the amount of protected data grows. "Any backup architecture, especially for the big data world, has to balance the performance and the capacity properly," said Jeff Tofano, Sepaton Inc.'s chief technology officer. "Otherwise, at the end of the day, it's not a good solution for the customer and is a more expensive solution than it should be." Sepaton's S2100-ES2 modular virtual tape library (VTL) was built for data-intensive large enterprises. According to the company, its 64-bit processor nodes back up data at up to 43.2 TB per hour, regardless of the data type, and can store up to 1.6 PB. You can add up to eight performance nodes per cluster as your needs require, and add disk shelves to add capacity.
3.4 Database Management in Big Data Era

There are currently three trends in the industry:

• NoSQL databases, designed to meet the scalability requirements of distributed architectures and/or schemaless data-management requirements;
• NewSQL databases, designed to meet the requirements of distributed architectures, or to improve performance such that horizontal scalability is no longer needed;
• Data grid/cache products, designed to store data in memory to increase application and database performance.

The comparison below assesses the drivers behind the development and adoption of NoSQL and NewSQL databases, as well as data grid/caching technologies.

NoSQL
o New breed of non-relational database products
o Rejection of fixed table schemas and of join operations
o Designed to meet scalability requirements of distributed architectures, and/or schemaless data-management requirements
o Big tables – data mapped by row key, column key and time stamp
o Key-value stores – store keys and associated values
o Document stores – store all data as a single document
o Graph databases – use nodes, properties and edges to store data and the relationships between entries

NewSQL
o New breed of relational database products
o Retain SQL and ACID
o Designed to meet scalability requirements of distributed architectures, or to improve performance so horizontal scalability is no longer a necessity
o MySQL storage engines – scale-up and scale-out
o Transparent sharding – reduces the manual effort required to scale
o Appliances – take advantage of improved hardware performance and solid-state drives
o New databases – designed specifically for scale-out

... and beyond: Data grid/cache
o In-memory data grid/cache products
o A spectrum of data-management capabilities, from non-persistent data caching to persistent caching, replication, and distributed data and compute grids
o A potential primary platform for distributed data management

A small sketch contrasting the key-value and document models follows below.
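To make the first two NoSQL categories concrete, here is the same customer record in a key-value style and a document style; these are plain Python stand-ins for the two models, not any specific product's API.

```python
import json

# Key-value store: an opaque value addressed by key; the store knows
# nothing about what is inside the value.
kv_store = {}
kv_store["customer:42"] = json.dumps({"name": "Asha", "city": "Mumbai"})

# Document store: the same record kept as a document whose fields the
# store can index and query individually.
doc_store = {"customers": [{"_id": 42, "name": "Asha", "city": "Mumbai",
                            "orders": [{"sku": "A1", "qty": 2}]}]}

# Key-value lookup is by key only; document stores can filter on fields.
print(json.loads(kv_store["customer:42"])["name"])
print([d for d in doc_store["customers"] if d["city"] == "Mumbai"])
```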
ComputerWorld's Tam Harbert explored the skills organizations are searching for in the quest to manage the big data challenge, and identified five job titles emerging in the big data world. Along with Harbert's findings, here are seven new types of jobs being created by big data:

1. Data scientists: This emerging role takes the lead in processing raw data and determining what types of analysis would deliver the best results.
2. Data architects: Organizations managing big data need professionals who can build a data model, plan a roadmap of how and when the various data sources and analytical tools will come online, and work out how they will all fit together.
3. Data visualizers: Many decision-makers rely on information presented in a highly visual format, either on dashboards with colorful alerts and dials, or in quick-to-understand charts and graphs. Organizations need professionals who can harness the data and put it in context, in layman's language, exploring what the data means and how it will impact the company.
4. Data change agents: Every forward-thinking organization needs change agents, usually an informal role, who can evangelize and marshal the necessary resources for new innovation and new ways of doing business. Harbert predicts that data change agent may become a formal job title in the years to come, driving changes in internal operations and processes based on data analytics. They need to be good communicators, and a Six Sigma background, meaning they know how to apply statistics to improve quality on a continuous basis, also helps.
5. Data engineers/operators: These are the people who make the big data infrastructure hum on a day-to-day basis. They develop the architecture that helps analyze and supply data in the way the business needs it, and make sure systems are performing smoothly, says Harbert.
6. Data stewards: Not mentioned in Harbert's list, but essential to any analytics-driven organization, is the emerging role of data steward. Every bit and byte of data across the enterprise should be owned by someone, ideally a line of business. Data stewards ensure that data sources are properly accounted for, and may also maintain a centralized repository as part of a Master Data Management approach, in which there is one gold copy of enterprise data to be referenced.
7. Data virtualization/cloud specialists: Databases themselves are no longer as unique as they used to be. What matters now is the ability to build and maintain a virtualized data service layer that can draw data from any source and make it available across the organization in a consistent, easy-to-access manner, sometimes called Database-as-a-Service. Whatever it is called, organizations need professionals who can build and support these virtualized layers or clouds.

These insights help visualize what future world-class organizations will need in order to manage their data.
4.0 Big Data Use Cases

4.1 Potential Use Cases

The key to exploiting big data analytics is focusing on a compelling business opportunity as defined by a use case: WHAT exactly are we trying to do, and WHAT value is there in proving a hypothesis? Use cases are emerging in a variety of industries that illustrate different core competencies around analytics. The figure below positions some use cases along two dimensions: data velocity and variety. A use case provides the context for a value chain: Raw Data -> Aggregated Data -> Intelligence -> Insights -> Decisions -> Operational Impact -> Financial Outcomes -> Value Creation.

Source: SAS & IDC
Figure 6 - Potential Use Cases for Big Data

Insurance – Individualize auto-insurance policies based on newly captured vehicle telemetry data. The insurer gains insight into the customer's driving habits, delivering: (1) more accurate assessments of risk; (2) individualized pricing based on actual individual driving habits; (3) the ability to influence and motivate individual customers to improve their driving habits.
Travel – Optimize the buying experience through web log and social media data analysis: (1) the travel site gains insight into customer preferences and desires; (2) up-sell products by correlating current sales with subsequent browsing behavior, increasing browse-to-buy conversions via customized offers and packages; (3) deliver personalized travel recommendations based on social media data.

Gaming – Collect gaming data to optimize spend within and across games: (1) the games company gains insight into the likes, dislikes and relationships of its users; (2) enhance games to drive customer spend within games; (3) recommend other content based on analysis of player connections and similar likes, and create special offers or packages based on browsing and (non-)buying behaviour.

Figure 7 - Big Data Analytics Business Model

E-tailing / E-Commerce / Online Retail
• Recommendation engines – increase average order size by recommending complementary products based on predictive analysis for cross-selling.
• Cross-channel analytics – sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).
• Event analytics – what series of steps (the "golden path") led to a desired outcome (e.g., purchase, registration).

Retail / Consumer Products
• Merchandizing and market basket analysis.
• Campaign management and customer loyalty programs – marketing departments across industries have long used technology to monitor and determine the effectiveness of marketing campaigns. Big data allows marketing teams to incorporate higher volumes of increasingly granular data, like click-stream data and call detail records, to increase the accuracy of analysis.
• Supply-chain management and analytics.
• Event- and behavior-based targeting.
• Market and consumer segmentations.

Financial Services
• Compliance and regulatory reporting.
• Risk modelling and management – financial firms, banks and others use Hadoop and next-generation data warehouses to analyze large volumes of transactional data to determine risk and exposure of financial assets, to prepare for potential what-if scenarios based on simulated market behavior, and to score potential clients for risk.
• Fraud detection and security analytics – credit card companies, for example, use big data technologies to identify transactional behavior that indicates a high likelihood of a stolen card.
• CRM and customer loyalty programs.
• Credit risk, scoring and analysis.
• High-speed arbitrage trading.
• Trade surveillance.
• Abnormal trading pattern analysis.

Web & Digital Media Services
• Large-scale clickstream analytics.
• Ad targeting, analysis, forecasting and optimization.
• Abuse and click-fraud prevention.
• Social graph analysis and profile segmentation – in conjunction with Hadoop, and often next-generation data warehousing, social networking data is mined to determine which customers pose the most influence over others inside social networks. This helps enterprises identify their most important customers, who are not always those that buy the most products or spend the most, but those that tend to influence the buying behavior of others the most.
• Campaign management and loyalty programs.

Government
• Fraud detection and cybersecurity.
• Compliance and regulatory analysis.
• Energy consumption and carbon footprint management.

New Applications
• Sentiment analytics – used in conjunction with Hadoop, advanced text analytics tools analyze the unstructured text of social media and social networking posts, including tweets and Facebook posts, to determine user sentiment related to particular companies, brands or products.
• Mashups – mobile user location + precision targeting.
• Machine-generated data, the exhaust fumes of the Web.

Health & Life Sciences
• Health insurance fraud detection.
• Campaign and sales program optimization.
• Brand management.
• Patient care quality and program analysis.
• Supply-chain management.
• Drug discovery and development analysis.

Telecommunications
• Revenue assurance and price optimization.
• Customer churn analysis – enterprises use Hadoop and big data technologies to analyse customer behavior data and identify patterns that indicate which customers are most likely to leave for a competing vendor or service.
• Campaign management and customer loyalty.
• Call Detail Record (CDR) analysis.
• Network performance and optimization.
• Mobile user location analysis.

Utilities – Smart meters: the rollout of smart meters as part of Smart Grid adoption by utilities everywhere has resulted in a deluge of data flowing at unprecedented levels, and most utilities are ill-prepared to analyze the data once the meters are turned on.

4.2 Big Data Actual Use Cases

The graphic below shows the results of a survey undertaken by InformationWeek, indicating the percentage of respondents who would opt for open source solutions for big data.
Figure 8 - Survey Results: Use of Open Source to Manage Big Data

Interesting use case – Amazon will pay shoppers $5 to walk out of stores empty-handed

This is an interesting use of consumer data entry to power next-generation retail price competition. Amazon is offering consumers up to $5 off on purchases if they compare prices using its mobile phone application in a store. The promotion serves as a way for Amazon to increase usage of its bar-code-scanning application, while also collecting intelligence on prices in the stores. Amazon's Price Check app, available for iPhone and Android, allows shoppers to scan a bar code, take a picture of an item or conduct a text search to find the lowest prices. Amazon is also asking consumers to submit the prices of items with the app, so Amazon knows whether it is still offering the best prices: a great way to feed data into its learning engine from brick-and-mortar retailers. This is a trend that should terrify brick-and-mortar retailers. While real-time everyday-low-price information empowers consumers, it terrifies retailers, who increasingly feel like showrooms: shoppers come in to check out the merchandise but ultimately decide to walk out and buy online instead.

Smart meters

a. Because of smart meters, electricity providers can read the meter once every 15 minutes rather than once a month. This not only eliminates the need to send someone out for meter reading, but, with a reading every fifteen minutes, electricity can be priced differently for peak and off-peak hours. Pricing can then be used to shape the demand curve during peak hours, eliminating the need to build additional generating capacity just to meet peak demand, and saving electricity providers millions of dollars' worth of investment in generating capacity and plant maintenance costs.
b. One electricity provider serving smart-metered residences in Texas (TXU Energy) is using the smart meter technology to shape the demand curve by offering free night-time energy charges: all night, every night, all year long. In fact, it promotes the service as "Do your laundry or run the dishwasher at night, and pay nothing for your Energy Charges." What TXU Energy is trying to do is re-shape energy demand using pricing so as to manage peak-time demand, resulting in savings for both TXU and its customers. This would not have been possible without smart electric meters.

• T-Mobile USA has integrated big data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections. By leveraging social media data along with transaction data from CRM and billing systems, T-Mobile USA has been able to cut customer defections in half in a single quarter.

• US Xpress, a provider of a wide variety of transportation solutions, collects about a thousand data elements, ranging from fuel usage to tire condition to truck engine operations to GPS information, and uses this data for optimal fleet management and to drive productivity, saving millions of dollars in operating costs.

• McLaren's Formula One racing team uses real-time car sensor data during races, identifies issues with its racing cars using predictive analytics, and takes corrective action proactively before it is too late. (For more on the T-Mobile USA, US Xpress and McLaren F1 case studies, refer to the article on FT.com.)

• How Morgan Stanley uses Hadoop: Gary Bhattacharjee, executive director of enterprise information management at the firm, had worked with Hadoop as early as 2008 and thought that it might provide a solution, so the IT department hooked up some old servers. At the Fountainhead conference on Hadoop in Finance in New York, Bhattacharjee said the investment bank started by stringing together 15 end-of-life boxes: "It allowed us to bring really cheap infrastructure into a framework and install Hadoop and let it run." One area Bhattacharjee would talk about was IT and log analysis. A typical approach would be to look at web logs and database logs separately to find problems, but one log alone would not show whether a web delay was caused by a database problem. "We dumped every log we could get, including web and all the different database logs, put them into Hadoop and ran time-based correlations." Now the bank can see market events and how they correlate with web issues and database read-write problems.

• Big data at Ford: With analytics now embedded in the culture of Ford, the rise of big data analytics has created a whole host of new possibilities for the automaker. "We recognize that the volumes of data we generate internally – from our business operations and also from our vehicle research activities, as well as the universe of data that our customers live in and that exists on the Internet – all of those things are huge opportunities for us that will likely require some new specialized techniques or platforms to manage," said Ginder. "Our research organization is experimenting with Hadoop and we're trying to combine all of these various data sources that we have access to. We think the sky is the limit. We recognize that we're just kind of scraping the tip of the iceberg here."
The other major asset Ford has going for it when it comes to big data is that the company tracks enormous amounts of useful data in both the product development process and the products themselves. Ginder noted: "Our manufacturing sites are all very well instrumented. Our vehicles are very well instrumented. They're closed loop control systems. There are many, many sensors in each vehicle. Until now, most of that information was [just] in the vehicle, but we think there's an opportunity to grab that data and understand better how the car operates and how consumers use the vehicles, and feed that information back into our design process and help optimize the user's experience in the future as well."

Of course, big data is about a lot more than just harnessing all of the runaway data sources that most companies are trying to grapple with. It is about structured data plus unstructured data. Structured data is all the traditional material most companies have in their databases (as well as data like Ford's, from sensors in its vehicles and assembly lines). Unstructured data is the material now freely available across the Internet, from public data being exposed by governments on sites such as data.gov in the U.S. to treasure troves of consumer intelligence such as Twitter. Mixing the two and coming up with new analysis is what big data is all about. "The fundamental assumption of Big Data is the amount of that data is only going to grow, and there's an opportunity for us to combine that external data with our own internal data in new ways," said Ginder. "For better forecasting or for better insights into product design, there are many, many opportunities."

Ford is also digging into the consumer-intelligence aspect of unstructured data. Ginder said: "We recognize that the data on the Internet is potentially insightful for understanding what our customers or our potential customers are looking for [and] what their attitudes are, so we do some sentiment analysis around blog posts, comments, and other types of content on the Internet. That kind of thing is pretty common and a lot of Fortune 500 companies are doing similar kinds of things."

However, there is another way Ford uses unstructured data from the Web that is a little more unique, and it has impacted the way the company predicts future sales of its vehicles. "We use Google Trends, which measures the popularity of search terms, to help inform our own internal sales forecasts," Ginder explained. "Along with other internal data we have, we use that to build a better forecast. It's one of the inputs for our sales forecast. In the past, it would just be what we sold last week. Now it's what we sold last week plus the popularity of the search terms... Again, I think we're just scratching the surface. There's a lot more I think we'll be doing in the future."
Figure 9 - Big Data Value Potential Index

Computer and electronic products and information sectors (Cluster A), traded globally, stand out as sectors that have already been experiencing very strong productivity growth and that are poised to gain substantially from the use of big data. Two services sectors (Cluster B), finance and insurance and government, are positioned to benefit very strongly from big data as long as barriers to its use can be overcome. Several sectors (Cluster C) have experienced negative productivity growth, probably indicating that these sectors face strong systemic barriers to increasing productivity. Among the remaining sectors, globally traded sectors (mostly Cluster D) tend to have experienced higher historical productivity growth, while local services (mainly Cluster E) have experienced lower growth.

While all sectors will have to overcome barriers to capture value from the use of big data, barriers are structurally higher for some than for others (Exhibit 3). For example, the public sector, including education, faces higher hurdles because of a lack of a data-driven mindset and of available data. Capturing value in health care faces challenges given the relatively low IT investment made so far. Sectors such as retail, manufacturing and professional services may have relatively lower barriers to overcome, for precisely the opposite reasons.
Bibliography

John Webster – "Understanding Big Data Analytics", Aug 2011, searchstorage.techtarget.com.
Bill Franks – "What's Up With In-Memory Analytics?", May 7, 2012, iianalytics.com.
Pankaj Maru – "Data scientist: The new kid on the IT block!", Sep 3, 2012, CIOL.com.
Yellowfin White Paper – "In-Memory Analytics", www.yellowfin.bi.
"Morgan Stanley Takes On Big Data With Hadoop", March 30, 2012, Forbes.com.
Ravi Kalakota – "New Tools for New Times – Primer on Big Data, Hadoop and 'In-memory' Data Clouds", May 15, 2011, practicalanalytics.wordpress.com.
McKinsey Global Institute – "Big data: The next frontier for innovation, competition, and productivity", June 2011.
Harish Kotadia – "4 Excellent Big Data Case Studies", July 2012, hkotadia.com.
Jeff Kelly – "Big Data: Hadoop, Business Analytics and Beyond", Aug 27, 2012, Wikibon.org.
Joe McKendrick – "7 new types of jobs created by Big Data", Sep 20, 2012, SmartPlanet.com.
Jean-Jacques Dubray – "NoSQL, NewSQL and Beyond", Apr 19, 2011, InfoQ.com.