TERADATA
ASTER
DISCOVERY
PLATFORM
WHITEPAPER
01.13
TERADATA ASTER
DISCOVERY PLATFORM
Extending Analytics to
Non-Relational Data
Copyright © 2012-2013 by Teradata Corporation
CONTENTS
TERADATA ASTER DISCOVERY PLATFORM
Tabular and non-tabular data
What are the defining characteristics of this non-relational data?
Size
Structure and Defining “Multi-Structured” Data
Isn’t it really tabular?
So why is the world interested in storing and manipulating multi-structured data?
Integration with relational data
The historical solution
ROLAP
MOLAP
“History is bunk”
Design philosophy
How does it work?
The engine and the processing layer
So what is MapReduce?
The analytical function library
Using Aster for real
Summary
LEARN MORE
Teradata Aster Discovery Platform
Teradata specializes in storing and analyzing structured, relational data. It has recently purchased Aster Data
Systems, Inc. in order to extend its platform to include the capability of handling what is often called ‘big’,
‘semi-structured’ or ‘multi-structured’ (see below) data. This paper explains how the Teradata Aster solution
works, specifically drilling into how its design philosophy enables it to cope not only with the myriad
types of big data that exist today but also with those that will arise in the future.
TABULAR AND NON-TABULAR DATA
‘Structured’ data is a term that refers to data that fits neatly into tables. For instance an employee table has
columns like date of birth, marital status and so on, and each row contains all the data about a single employee.
Table 1
Each table has a name and a number of rows and columns. Each column has a unique name and each row has a unique
identifier. So, using the name of the table, the column name and the row ID we can reach any piece of data within the
database. The data found there should be very simple – the term often used is ‘atomic’ which implies that the data is
so simple it cannot be meaningfully sub-divided.
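The addressing scheme described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the employee table, its column names and the two sample rows are invented here, and the helper function fetch_value is hypothetical.

```python
import sqlite3

# Hypothetical employee table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " employee_id INTEGER PRIMARY KEY,"  # the unique row identifier
    " name TEXT,"
    " date_of_birth TEXT,"
    " marital_status TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ann Smith", "1975-03-14", "married"),
        (2, "Bob Jones", "1982-11-02", "single"),
    ],
)


def fetch_value(table, column, row_id):
    """Reach one atomic value from table name, column name and row ID."""
    # Identifiers cannot be bound as parameters, so check them against a whitelist.
    allowed = {"employee": {"employee_id", "name", "date_of_birth", "marital_status"}}
    if column not in allowed.get(table, set()):
        raise ValueError("unknown table or column")
    row = conn.execute(
        f"SELECT {column} FROM {table} WHERE employee_id = ?", (row_id,)
    ).fetchone()
    return None if row is None else row[0]


birth_date = fetch_value("employee", "date_of_birth", 2)
```

Each value returned this way is atomic in the sense used above: a single date or name that the database treats as indivisible.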
A great deal of business data is tabular and we frequently store it in relational databases so we often use the term
‘relational’ data to describe this kind of highly structured data.
However, the world has become increasingly interested in storing and manipulating data that does not easily fit into
relational tables – data such as images, text files, .PDFs, sensor data, Word documents, click-stream data, and so on.
WHAT ARE THE DEFINING CHARACTERISTICS OF THIS NON-RELATIONAL DATA?
Size
This kind of data is often also referred to as ‘big’ data. The term is appropriate for two reasons.
1.	 Whilst each piece of tabular data is usually small and indivisible (atomic), each piece of non-tabular data is often
very large. Image files from modern cameras can easily be 7-8 Mbytes each. Part of my research work involves mass
spectrometers which produce between 4 and 6 GBytes of data in a single run. Compared with, say, a name or a
date of birth, these are large chunks of data.
2.	 And, not only is each piece of semi-structured data big, we often collect a great number of individual pieces.
How many new or modified emails, Word documents and Excel worksheets are produced by your company every
day? And every user of your website is creating a click-stream trail, every temperature sensor in your building is
streaming data out second by second by second…
STRUCTURE AND DEFINING “MULTI-STRUCTURED” DATA
People often refer to this kind of data as ‘semi-structured’, but the term is really a misnomer. As described above,
relational data is very precisely structured, but then so is a .PDF file. Calling such data semi-structured suggests
that it is only partially structured and therefore second class in some way. In fact text files have structure, as do
.JPGs; their structures are simply different from relational data structures. The term ‘semi-structured’ also tends to
imply that all non-relational data is the same, whereas one of the defining characteristics of this data is its
diversity.
So a much better general term for all members of this new class of data is ‘multi-structured’. The name reflects
what is actually true: there are many different classes of data, all of which are highly structured, and the structure
simply differs depending on the file type.
ISN’T IT REALLY TABULAR?
So why can’t we treat this data in the same way as tabular data? Well, in a sense, we can. Any and all digital
data is stored as bits and bytes. If we are dogged enough we can break any data into a long string of bits or bytes
and store these as one column in a table with a huge number of rows. In that sense, all data can be tabular.
We can also store data such as images in tables by creating specialized data types such as BLOBs
(Binary Large Objects); some relational database engines have been able to do this for years.
The problem is that while both of these solutions allow us to store the data, they both miss the point that
our main interest in this data is to dig inside it and extract the useful information that it contains.
SO WHY IS THE WORLD INTERESTED IN STORING AND MANIPULATING MULTI-STRUCTURED DATA?
This kind of data can have huge commercial value locked up within it. Think about a company like eBay. In many ways,
when it started, eBay was simply a huge tabular database. You and I may buy and sell items on eBay, but the company
itself never sees or handles the items or the cash; as far as it is concerned, we are simply carrying out transactions
against a set of tabular data. But after a while eBay also became interested in the behavior of its customers. The
tabular data was storing our purchases but our behavior (which buttons we clicked, in which order and when) was
in the click-stream data - which is classic multi-structured data.
Then there is Google’s spell checker. Microsoft reportedly spent several million dollars over 20 years developing its
spell checker. Google realized that if it tracked what users typed in:
“Ferari”
and what they ended up viewing:
www.ferrari.com
then it could map the strings of characters that people actually typed to the strings they wanted. Not only did Google
immediately gain a multi-lingual spell checker, it gained a very, very effective one. A spell checker that learns over
time and is created effectively for free from the data that other people would throw away – so-called data exhaust.
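The idea of mapping what people typed to what they wanted can be sketched as a simple frequency count. This is a toy illustration only, not a description of how Google's system works: the search log, the site names and the function most_likely_target are all invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical search log: (what the user typed, the site they ended up viewing).
search_log = [
    ("ferari", "www.ferrari.com"),
    ("ferari", "www.ferrari.com"),
    ("ferrary", "www.ferrari.com"),
    ("ferari", "www.ferries.example"),
    ("ferrari", "www.ferrari.com"),
]

# Count, for every typed string, how often each destination followed it.
destinations = defaultdict(Counter)
for typed, viewed in search_log:
    destinations[typed][viewed] += 1


def most_likely_target(typed):
    """Return the destination most often reached after this typed string."""
    counts = destinations.get(typed)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


suggestion = most_likely_target("ferari")
```

The more log entries that accumulate, the more reliable the counts become, which is the sense in which such a checker learns over time.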
Now think about sensors in a factory – they might record noise and light levels, temperature, pressure and so on.
Every now and then the production process produces a bad batch. Locked in the data from the sensors may be
the information about the conditions that lead to failure.
INTEGRATION WITH RELATIONAL DATA
So, multi-structured data is here to stay, and we need a solution that can not only store and manipulate it but also
allow it to be analyzed seamlessly with the relational data. At first sight, and particularly from a technical point of
view, this seems like an odd assertion. Multi-structured data is fundamentally different from relational, so surely it
makes sense to query them independently. The problem with this line of argument is that it makes
no sense at all from the business perspective. Business users may well be entirely unaware of the technical
differences in data structures; all they know is that there is a new source of data and they want to be able to
understand it in relation (if you will pardon the pun) to their existing data. Whatever solution we adopt must allow
analysis across not only all the different types of multi-structured data but also the relational data.
THE HISTORICAL SOLUTION
Historically multi-structured data has been handled in one of two ways, neither of which is entirely satisfactory:
1.	 You can force it into a relational structure, either as a BLOB or by ‘shredding’ it into atomic data. These
solutions have the advantage that you can store it in your existing relational engine and, if it is shredded,
you can run SQL against it. The disadvantage is that this tends to be very inefficient, slow and unwieldy.
2.	 You can create a new database engine specifically for that class of multi-structured data and even develop
a new language for querying and manipulating it. This gives very efficient storage and manipulation.
The problem is that there are already many types of multi-structured data out there and, as we move
forward, more will arise. We can’t go on and on creating new engines for each new type.
A good example of a type of multi-structured data that is handled in both of these ways is dimensional data.
Dimensional data is primarily used for On-Line Analytical Processing (OLAP) and consists of a set of measures
which can be sliced by a number of dimensions. It is traditionally handled either in a relational (ROLAP) or a
dimensional (MOLAP) engine.
ROLAP
The dimensional data is essentially rendered down into two-dimensional tables. The measures go into a fact table,
the dimension data into dimension tables and thus you have a ROLAP solution (Relational On-Line Analytical
Processing). The good news is that this uses existing technology and skills; the bad news is that it is inefficient.
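A minimal sketch of the ROLAP layout, using Python's built-in sqlite3 module, may make the fact-table and dimension-table split concrete. The table names, columns and figures below are assumptions made for illustration; they do not come from any Teradata schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'bikes'), (2, 'helmets');
    INSERT INTO dim_date VALUES (10, 2012), (11, 2013);
    INSERT INTO fact_sales VALUES
        (1, 10, 500.0), (1, 11, 700.0), (2, 10, 40.0), (2, 11, 60.0), (2, 11, 20.0);
    """
)

# Slice the 'amount' measure by two dimensions: product category and year.
sales_by_category_and_year = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
```

Every slice of the measure requires joins back to the dimension tables, which hints at where the inefficiency mentioned above comes from as the data grows.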
MOLAP
The alternative is to create an entirely new class of database engine, in this case a multi-dimensional database engine
in which to store the data. The advantage is that you can use an analytical language like MDX (Multi-Dimensional
eXpressions) and run it natively against that engine. The downside is that you’ve had to create an entirely new
engine and an entirely new language in order to handle just one of your many multi-structured flavors of data.
“HISTORY IS BUNK”
To paraphrase Henry Ford, the historical solutions to this problem are bunk; neither is realistically sustainable
for multi-structured data. The former is always inefficient, the latter produces an ever-increasing set of
database engines, which makes integrating the different types of multi-structured data a nightmare.
DESIGN PHILOSOPHY
Part of the philosophy of the Teradata Aster solution is based on a simple observation. When people analyze
data (multi-structured or relational) the typical output they want to see is a graph, a grid (as in a spreadsheet) or a
report. In this case ‘graph’ is a very broad term: it might be a bar chart, a pie chart or a map of the US with states
color coded. The bottom line is that these three are the fundamental ways in which people like to visualize the
information that is locked up in raw data. And it further turns out that the data required to produce any graph, grid
or report can always be produced as a table of data.
This is such a fundamental principle of analysis that it is enshrined in the relational model itself as a principle known
as ‘closure’: all queries must produce as their output a table of data. It ensures, amongst other things, that queries
can be chained: the output from one query can always serve as the input to another.
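Closure can be seen in a small sketch with Python's built-in sqlite3 module, where the result of one query is used as the input table of a second. The orders table and its values are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ann', 10.0), ('ann', 30.0), ('bob', 5.0), ('cat', 50.0);
    """
)

# First query: produces a table of totals per customer.
totals_query = "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"

# Second query: treats the first query's output as just another table.
big_spenders = conn.execute(
    f"SELECT customer FROM ({totals_query}) WHERE total >= 40 ORDER BY customer"
).fetchall()
```

The second query neither knows nor cares that its input was computed rather than stored, which is exactly what closure guarantees.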
So a core part of Teradata Aster’s approach was to ensure that all output from querying the data was tabular,
irrespective of whether the initial data was relational or multi-structured.
HOW DOES IT WORK?
Aster essentially comprises three parts: the storage engine, a processing layer and an analytical function library.
Figure 1: Aster Discovery Platform
THE ENGINE AND THE PROCESSING LAYER
The storage engine holds the data either as relational tables (using either row or column storage)
or as de-serialized objects (you can think of these latter as BLOBs, Binary Large OBjects). In the processing layer
there is a SQL engine extended to include MapReduce functionality, known as SQL-MapReduce®.
If your data is stored as relational tables, it can be queried using the SQL functions in the engine;
if it’s stored as BLOBs, it can be queried using the engine’s MapReduce functions.
SO WHAT IS MAPREDUCE?
Before we start on the functions, what is MapReduce itself? The name reflects the fact that it is built on two
programming functions, Map and Reduce. Map applies a given function to every member of a list; Reduce
combines the results that Map produces. So, if the data to be analyzed can be rendered into a large number of lists
on different nodes, Map can process these in parallel and Reduce can pull the answers together. To put that
another way, MapReduce is a programming model for writing applications that handle vast volumes of data and
process it in parallel. It can run happily on a single server but, because one of its major strengths is its ability to
scale elegantly, it is usually implemented on large clusters of hardware which process any MapReduce
job in parallel. Many terabytes can be processed in a single job running on hundreds, if not thousands, of nodes.
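The two functions can be illustrated with plain Python, using the built-in map and functools.reduce. This is a single-machine sketch of the programming model only; it is not SQL-MapReduce syntax, and the partitions and helper names (count_words, merge_counts) are invented for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each inner list stands in for the data held on one node.
partitions = [
    ["error", "ok", "ok"],
    ["ok", "error", "error"],
    ["ok"],
]


def count_words(words):
    """Map step: summarize one partition independently of the others."""
    return Counter(words)


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# On a cluster the map calls would run in parallel, one per node;
# here they simply run one after another.
partial_counts = map(count_words, partitions)
total_counts = reduce(merge_counts, partial_counts, Counter())
```

Because each call to count_words looks only at its own partition, the map calls can be spread across as many nodes as there are partitions, and only the small partial results need to be brought together.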
What’s extraordinary, given that MapReduce is used with enormous data sets, is that it looks at
everything (or almost everything) every time it is run. It hardly sounds like an optimal approach and
indeed it isn’t for repetitive similar searches. Its strength is in letting us inspect huge data sets and see
results in a realistic time, answering questions that were previously too time-consuming to even ask and
enabling ‘train of thought’ analysis that can produce valuable information from acres of data.
THE ANALYTICAL FUNCTION LIBRARY
Above the processing layer is an analytical layer where you find a function library and it is in this layer
that we find the adaptability that allows Aster to handle a myriad different data structures. Here we can
write whatever functions we like, and as many as we like, to deal with any new structural data type. Any
new data type will almost certainly be stored as a BLOB, queried using MapReduce and the capability
to manipulate and analyze it will be manifest as one or more functions in the function library.
Now this may sound like technobabble but an example should make it clearer.
Suppose we want to store very simple .TXT files and we want to be able to query them and find
particular strings within the text.
The .TXT files are clearly not relational so they will be stored as a BLOB and they will be queried using MapReduce.
What we have to do in the top layer is to write a function that searches for specific strings within longer strings.
If we want other functions, perhaps to count the occurrences of particular words, we write them as well.
At that point, Teradata Aster is fulfilling one of its promises – it is storing multi-structured data and allowing us
to query it. So far, so good.
Now further suppose we want to work with .PDFs. They will be stored as BLOBs and manipulated with the MapReduce
engine (just like the .TXT files) and we write functions to do whatever we require: maybe one will extract the text from
the .PDF, another will count the number of pages and so on.
So, Teradata Aster’s architecture has already addressed the broad question of how to store multiple structural
types but there is another hugely important implication of this approach which makes the Teradata Aster solution
incredibly versatile.
Most traditional relational engines are basically built to perform queries: a query is sent to the engine, it runs and
produces an answer. As we’ve said above, a fundamental principle of the relational model, called closure, says the
output of a query is an answer table and that table must look, feel and smell just like any other table in the database.
Closure provides the capability to chain queries together. In Aster the principle of closure is very important and an
absolutely fundamental part of the whole philosophy is that the output from every single function is a table. No matter
how the data is originally stored (BLOB or table) the output from every function has to be a table. And, just as with
closure in the relational world, the output from one function can act as the input to another. In other words, all Aster
functions have to be able to accept a table as input.
Figure 2: Aster Analytics Portfolio
The implications of these simple concepts are highly significant. For a start it means that functions written for one type
of multi-structured data can be used for another. Take our .TXT and .PDF examples. Remember that text string-finding
function? Suppose we want to use it against a .PDF. We already have a function that extracts the text from a .PDF. The
output from that function has to be a table, maybe with one column called EntireTextOfPDF which has a row for every
.PDF file. We can pass this output table to the string-finding function we wrote for text files: that function will accept a
table as its input and is therefore entirely happy.
This means we can query across all the different data structures by chaining functions because the Teradata Aster
solution elegantly uses the table structure as the lingua franca at the top end. Whatever you do, you get a table and
you can continue to do table stuff with it.
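The table-in, table-out chaining described above can be sketched in plain Python, representing a table as a list of rows. This is an illustration of the idea rather than Aster's function interface: the function names, the column names and the documents are assumptions, and the text extraction is a stand-in that merely decodes bytes instead of parsing a real .PDF.

```python
# A "table" here is just a list of rows, and each row is a dict of column -> value.

def extract_text(blob_table):
    """Table in, table out: turn each stored document into a row of text.

    Real PDF parsing is out of scope; the stand-in below just decodes bytes.
    """
    return [
        {"file_name": row["file_name"], "entire_text": row["content"].decode("utf-8")}
        for row in blob_table
    ]


def find_string(text_table, text_column, needle):
    """Table in, table out: keep rows whose text column contains the needle."""
    return [
        {"file_name": row["file_name"], "position": row[text_column].find(needle)}
        for row in text_table
        if needle in row[text_column]
    ]


# Hypothetical stored documents (BLOB-like bytes); contents are invented.
documents = [
    {"file_name": "a.pdf", "content": b"quarterly sales rose"},
    {"file_name": "b.pdf", "content": b"nothing to report"},
]

# Chain the two functions: the output table of one is the input table of the next.
matches = find_string(extract_text(documents), "entire_text", "sales")
```

Note that find_string knows nothing about .PDFs; it only requires a table with a text column, which is why a function written for one data type can be reused for another.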
USING ASTER FOR REAL
This new way of analyzing data has the potential to be incredibly powerful, and Teradata Aster is already unlocking
that power to analyze click-stream data. Click-stream data is increasingly seen as a source of valuable information
about the behavior of web site visitors – which pages hold their attention, which do they skip through, is there a page
where they stall and then fail to purchase? Teradata Aster is addressing this need with its Apache web log parser and
some clever built-in functions.
Raw click-stream log data can be imported (very rapidly given Teradata’s parallel processing architecture) and
restructured for analytical purposes by the parser. It is then ready for analysis using several specific SQL-MapReduce
functions, one of which is Aster nPath. Using nPath it is possible to frame questions like “How many users start at the
home page, click on a hotel, read the reviews and book a stay?” The query is answered in a single pass and the results
are returned blisteringly fast.
This function is ideal for complex sequential analysis on time-series data and for behavioral pattern analysis:
click-stream data is one such source; financial transaction and market basket data are others.
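The kind of question quoted above can be sketched in plain Python. This is not nPath syntax and says nothing about how nPath is implemented; it only shows the shape of the sequential analysis. The click rows, page names and the helper follows_pattern are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical click-stream rows: (user, timestamp, page). All values are invented.
clicks = [
    ("u1", 1, "home"), ("u1", 2, "hotel"), ("u1", 3, "reviews"), ("u1", 4, "book"),
    ("u2", 1, "home"), ("u2", 2, "hotel"), ("u2", 3, "home"),
    ("u3", 1, "hotel"), ("u3", 2, "reviews"), ("u3", 3, "book"),
    ("u4", 2, "hotel"), ("u4", 1, "home"), ("u4", 4, "book"), ("u4", 3, "reviews"),
]

pattern = ["home", "hotel", "reviews", "book"]

# Group clicks by user so that each user's path can be examined on its own.
paths = defaultdict(list)
for user, timestamp, page in clicks:
    paths[user].append((timestamp, page))


def follows_pattern(events, wanted):
    """True if the time-ordered pages start with exactly the wanted sequence."""
    pages = [page for _, page in sorted(events)]
    return pages[: len(wanted)] == wanted


matching_users = sorted(u for u, events in paths.items() if follows_pattern(events, pattern))
```

Here len(matching_users) answers the "how many users" question; the rows for user u4 arrive out of order, which is why each path is sorted by timestamp before it is compared with the pattern.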
Figure 3: Sequential analysis on time-series data with Aster nPath analytic function
999 Skyway Blvd. Suite 100, San Carlos CA | teradataaster.com
SQL-H and The Best Decision Possible are trademarks, and Aster, SQL-MapReduce, Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the
U.S. or worldwide. Intel, the Intel logo, and Xeon are registered trademarks of Intel Corporation. SUSE is a registered trademark of Novell, Inc. Teradata continually improves products as new
technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described
herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.
Copyright © 2012-2013 by Teradata Corporation    All Rights Reserved.    Produced in U.S.A.
SUMMARY
Traditional solutions are always caught on the horns of a dilemma: do you want inefficiency or huge complexity?
The Teradata Aster solution slips elegantly between the horns. It solves the problem in a novel way, provides
very high efficiency very simply and, as a bonus, is engineered so that integration of the different types of
multi-structured data with relational data is a natural outcome of the solution.
Learn More
For more information about how the Teradata Aster
Big Analytics Appliance can bring value to your
organization, contact your Teradata or Teradata
Aster representative or visit us on the web at:
http://www.asterdata.com/product/index.php
ABOUT TERADATA ASTER
Teradata Aster, a division of Teradata, is a market
leader in big data analytics, enabling advanced
analytics on big data with richer, deeper data
processing at ultra-fast speeds, massive but cost-
effective scaling, and the ability to seamlessly
manage diverse workloads. From applications like
fraud detection, customer intelligence, trending
& forecasting to scenario modeling, customer
personalization and targeting, and click stream
analysis – it is evident that enabling big analytics
and discovery has a material impact on the business.
The Teradata Aster MapReduce Platform utilizes
Aster’s patented SQL-MapReduce® to parallelize the
processing of data and applications and deliver rich
analytic insights at scale.
www.teradataaster.com

 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
 
What is big data
What is big dataWhat is big data
What is big data
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Database Essay
Database EssayDatabase Essay
Database Essay
 
Data science unit2
Data science unit2Data science unit2
Data science unit2
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data
 

Último

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
