As integrated web analytics evolves toward a service-oriented, event-based model, the emphasis will shift to event-based analytics. Business analytics is moving from pure counts to time-series, relationship, and usage analytics. Examples of web analytics that can take advantage of this architecture are conversion analytics and cross-channel marketing.
The advantage of storing raw event data is that you have maximum flexibility for analysis. For example, you can trace the sequence of pages that one person visited over the course of their session. You can’t do that if you’ve squashed all the events into e.g. counters. That sort of analysis is really important for some offline processing tasks, such as training a recommender system (“people who bought X also bought Y”, that sort of thing). For such use cases, it’s best to simply keep all the raw events, so that you can later feed them all into your shiny new machine learning system.
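The kind of session-level analysis described above can be sketched in a few lines. This is a minimal illustration, assuming raw events are dicts with hypothetical `session_id`, `timestamp`, and `page` fields:

```python
from collections import defaultdict

def page_sequences(events):
    """Group raw page-view events by session and return, per session,
    the ordered list of pages visited -- exactly the information that is
    lost once events are collapsed into counters."""
    sessions = defaultdict(list)
    for e in events:
        sessions[e["session_id"]].append(e)
    return {
        sid: [e["page"] for e in sorted(evts, key=lambda e: e["timestamp"])]
        for sid, evts in sessions.items()
    }

events = [
    {"session_id": "s1", "timestamp": 2, "page": "/checkout"},
    {"session_id": "s1", "timestamp": 1, "page": "/cart"},
    {"session_id": "s2", "timestamp": 1, "page": "/home"},
]
print(page_sequences(events))  # {'s1': ['/cart', '/checkout'], 's2': ['/home']}
```

Because the raw events are retained, the same input can later be re-fed into, say, a recommender-training job without re-instrumenting anything.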
In this session we elaborate on using Kafka, an event processing framework (e.g. Storm or Spark Streaming), and either Hadoop or an EDW for building an event-driven architecture.
15.
Teradata Listener
Common Data Integration Platform
for Streaming Data
Simplifies data integration across the enterprise
Provides a platform for (near) real-time applications
Scalable and reliable to support the entire enterprise
Open and API based to encourage use
Teradata Confidential
16.
Listener Data Flow
Data flow sequence from sources to target systems
SOURCES → INGEST → FIREHOSE → ROUTER → STREAMS → WRITERS → SYSTEMS
Multiple sources write to the Firehose via Ingest
The Router reads mini-batches from the Firehose and writes to Streams
Writers read mini-batches from Streams and write tuples to the target systems
18.
FIREHOSE
Regulating Pressure
Distributed write-ahead logging allows bursts of data without impacting systems
Apache Kafka
Resilient and durable;
Horizontally scalable;
Built for maximum throughput.
Asynchronous Reads & Writes
Producers append to the log at their own pace;
Consumers read at their own pace;
Latest data is always in memory.
19.
Routing Data
Demuxing data into streams based on rules
[Diagram: the ROUTER consumes messages from the FIREHOSE and routes them into STREAM 1 … STREAM n]
Built on Apache Spark
Resilient and durable;
Horizontally scalable.
Rule Based
Initially based on API key (per registered data source);
Eventually enable combined streams.
20.
Writing Data
Write to target systems through Spark jobs
[Diagram: STREAM 1 … STREAM n feed the TERADATA WRITER, ASTER WRITER, and HDFS WRITER]
Built on Apache Spark
Resilient and durable;
Horizontally scalable.
Writer Options
Initial support for Teradata, Aster and HDFS;
Initially leverage JDBC batch writes for Teradata;
Exploring rate limiting writers and other systems.
Over the years we’ve gotten good at building platforms to answer the What, Who, When and Where questions. Most of the work has been about making them more cost-efficient, a bit more flexible, and improving data quality. For me that is important but not game-changing for the business.
Event analytics is about focusing on the WHY and HOW questions.
How are things (events) related or how are customers related by their experiences?
In my view of the world, the business user should be able to answer the “why” questions just as easily as they can answer the “what” questions today.
Behavioral analytics is a subset of business analytics that focuses on how and why users of eCommerce platforms, online games, and web applications behave. While business analytics has a broader focus on the who, what, where and when of business intelligence, behavioral analytics narrows that scope, allowing seemingly unrelated data points to be used to extrapolate, predict, and determine errors and future trends. It takes a more holistic and human view of data, connecting individual data points to tell us not only what is happening, but also how and why it is happening.
Behavioral analytics utilizes user data captured while the web application, game, or website is in use by analytics platforms like Google Analytics. Platform traffic data like navigation paths, clicks, social media interactions, purchasing decisions, and marketing responsiveness is all recorded, as are more specific advertising metrics like click-to-conversion time and comparisons between metrics such as the monetary value of an order and the amount of time spent on the site.[1] These data points are then compiled and analyzed, whether by looking at the timeline progression from when a user first entered the platform until a sale was made, or at what other products a user bought or looked at before this purchase. Behavioral analysis allows future actions and trends to be predicted based on all the data collected.
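As a minimal sketch of one such metric, click-to-conversion time can be computed directly from raw events. The `user_id`, `type`, and `timestamp` fields below are assumptions for illustration:

```python
def click_to_conversion_times(events):
    """For each user, the elapsed time between their first ad click
    and their first subsequent purchase (users who never convert
    simply do not appear in the result)."""
    first_click, times = {}, {}
    for e in sorted(events, key=lambda e: e["timestamp"]):
        uid = e["user_id"]
        if e["type"] == "click" and uid not in first_click:
            first_click[uid] = e["timestamp"]
        elif e["type"] == "purchase" and uid in first_click and uid not in times:
            times[uid] = e["timestamp"] - first_click[uid]
    return times

events = [
    {"user_id": "u1", "type": "click", "timestamp": 100},
    {"user_id": "u1", "type": "purchase", "timestamp": 160},
    {"user_id": "u2", "type": "click", "timestamp": 110},
]
print(click_to_conversion_times(events))  # {'u1': 60}
```

Note that this only works because the individual click and purchase events were kept; the metric cannot be recovered from pre-aggregated counts.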
How the business looks to the customer
The customer experiences the company across the entire set of applications the company has developed and deployed; in that sense, the applications represent the company's brand.
Most applications are not designed to solicit and capture customer-experience data well. There are two major ways data is obtained from applications:
Web-site tagging
Very detailed logging data written by engineers for application development and operational performance
The first is too aggregated and difficult to administer; the second is too engineering-oriented.
Furthermore, applications are designed in isolation, with little thought given to experiences across other applications and channels. Stitching the customer experience together across multiple applications is difficult.
The problem is big
Seven sources per client
Ability to customize for the consumer
Ingestion: depending on the type of source, Teradata has IP. There are basically two types of sources: streaming and batch. For streaming, Teradata Listener will be the advocated solution; for batch, Teradata has two pieces of IP for ingestion: Light-Weight Ingestion (LWI) and Buffer Server.
Light-Weight Ingestion (LWI) is for large third-party files such as Omniture. Instead of having to FTP the Omniture file to a landing server, LWI connects directly to the FTP source, pulls the file, and lands it into HDFS in time partitions.
Buffer Server is a piece of IP designed to ingest large numbers of small files, concatenate them into larger files that are more Hadoop-friendly, and land them into HDFS time partitions.
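A toy sketch of these two ideas follows; the path layout and size threshold are assumptions for illustration, not the actual LWI or Buffer Server implementation:

```python
from datetime import datetime, timezone

def partition_path(source, ts):
    """Hypothetical time-partitioned HDFS landing path, e.g. /data/omniture/2015/06/01/13."""
    return "/data/{}/{:%Y/%m/%d/%H}".format(source, ts)

def buffer_files(small_files, target_size):
    """Concatenate many small payloads into fewer, larger,
    Hadoop-friendly blobs of at least target_size bytes."""
    batches, current = [], b""
    for payload in small_files:
        current += payload
        if len(current) >= target_size:
            batches.append(current)
            current = b""
    if current:
        batches.append(current)
    return batches

ts = datetime(2015, 6, 1, 13, tzinfo=timezone.utc)
print(partition_path("omniture", ts))           # /data/omniture/2015/06/01/13
print(len(buffer_files([b"x" * 40] * 5, 100)))  # 2
```

Batching small files before landing them matters because HDFS and MapReduce perform poorly with very large numbers of tiny files.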
Event Processing & Repository
Teradata has designed (but not yet implemented) two pieces of IP in this area:
Event Processing: built using MapReduce, it converts the incoming data sources into event objects. The three processing steps are: pre-pend an event header, pre-pend an event-type header, and resolve the incoming ID (cookie, GUID, customer, email address, etc.) to a specific customer. It populates event records into HBase. The Event Processing Engine processes both streaming and batch sources.
Event Repository is an HBase schema that is the central storage for all events.
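The three processing steps can be sketched as follows; the field names and the `id_map` lookup are assumptions for illustration, not the actual engine:

```python
import uuid

# Hypothetical mapping from incoming identifiers (cookie, GUID, email) to customer IDs.
id_map = {"cookie-abc": "cust-42", "guid-9f": "cust-42"}

def to_event(record, event_type):
    """Wrap a raw record as an event object: prepend an event header,
    prepend an event-type header, and resolve the incoming ID to a customer."""
    return {
        "event_header": {"event_id": str(uuid.uuid4()), "ingest_ts": record["ts"]},
        "event_type": event_type,
        "customer_id": id_map.get(record["incoming_id"], "unknown"),
        "payload": record,
    }

evt = to_event({"incoming_id": "cookie-abc", "ts": 1000, "page": "/cart"}, "page_view")
print(evt["customer_id"], evt["event_type"])  # cust-42 page_view
```

The ID-resolution step is what allows events arriving under different identifiers (cookie, GUID, email) to be stitched together into one customer's history.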
Dashboard Engine
Teradata has built IP that allows KPIs to be built quickly from the Event Repository. Using a UI, a developer can quickly aggregate metrics into an HBase schema on top of which tools like Tableau can run optimally.
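A minimal sketch of that kind of KPI aggregation, assuming events shaped like the hypothetical dicts below:

```python
from collections import Counter

def daily_kpi(events, event_type):
    """Aggregate raw events into a per-day count for one event type --
    the kind of pre-aggregated metric a dashboard tool can query cheaply."""
    return Counter(e["date"] for e in events if e["type"] == event_type)

events = [
    {"type": "purchase", "date": "2015-06-01"},
    {"type": "purchase", "date": "2015-06-01"},
    {"type": "page_view", "date": "2015-06-01"},
    {"type": "purchase", "date": "2015-06-02"},
]
print(daily_kpi(events, "purchase"))  # Counter({'2015-06-01': 2, '2015-06-02': 1})
```

The point of materializing such aggregates is that a visualization tool never has to scan the raw event store interactively.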
Guided, Metadata-driven Discovery Event Analytics
Teradata has a solution
Allows users to define the pattern they want to match
The UI guides the analyst through this
Steps:
Give me visibility into all the events available in this repository
This allows you to define the flow
Allows the desired patterns to be run against that flow
All the data and metadata behind this allows us to gain a greater view of the customer
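The pattern-matching step above can be illustrated with a simple subsequence match over a customer's ordered event types; the pattern representation here is an assumption for illustration, not the product's syntax:

```python
def matches_pattern(event_types, pattern):
    """True if `pattern` occurs as an ordered (not necessarily
    contiguous) subsequence of a customer's event types."""
    it = iter(event_types)
    # `step in it` advances the iterator, so each pattern step must be
    # found strictly after the previous one.
    return all(step in it for step in pattern)

flow = ["search", "page_view", "add_to_cart", "page_view", "purchase"]
print(matches_pattern(flow, ["search", "add_to_cart", "purchase"]))  # True
print(matches_pattern(flow, ["purchase", "search"]))                 # False
```

Running such a predicate over every customer's event flow is the kind of question the guided UI is meant to let an analyst pose without writing code.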
As we continue to advance our UDA, you will see us focus on RESTful APIs both for loading data and for accessing data across the UDA.
For loading, you will also see us advance the Listening Framework.
Listening Framework (RESTful API)
The Listening Framework will provide the ability to open database sessions, submit SQL requests and access responses, and access metadata. Requiring zero client install, the Listening Framework is ideal for web, mobile, and scripting languages. It does not replace traditional ETL; it provides a lightweight framework to subscribe to new data sources in an easy, frictionless model.
For access, you will see us extend AppCenter into an App Framework that goes beyond just Aster to the entire UDA.
App Framework (RESTful API)
AppCenter makes it easy, fast, and efficient for organizations to benefit from big data analytics and discovery by simplifying the app building and consumption experience. AppCenter is a framework to build, deploy, configure, and consume big data apps. Using AppCenter, customers or Teradata PS can build new big data apps or configure (and customize) an existing app to deliver faster value from big data analytics. Apps include analytical apps like customer churn analytics, product affinity analytics, and fraud analytics, and can be horizontal or industry-focused depending on the use case.
AppCenter provides the following capabilities:
For IT, it provides a visual, GUI- and standards-based app building and configuration experience where personas like data scientists, analysts, and developers can easily, quickly, and seamlessly build, configure, deploy, and share big data apps. AppCenter services include a web-based portal to deploy and consume apps, and common services like authentication, data access, user interface, visualization, app-building APIs and SDKs, and RESTful APIs.
For the business, it speeds up the process of pervasively deploying and benefiting from big data analytics. For business users and analysts, it provides an interactive, visual, web-based experience to analyze, view, and share the results of big data apps, helping organizations innovate with powerful big data insights.
While initially specific to Aster, this approach of RESTful-API-based integration of end-user access tools and applications will drive future capabilities for interacting with all components of the UDA.
Lightweight distributed RESTful endpoint for a message or messages;
Validate the API key for all messages sent;
All messages sent get a UUID;
All messages are wrapped with a lightweight metadata envelope and sent to the Firehose;
A future state can be rate-limited.
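The ingest steps above can be sketched as follows. The key registry and envelope field names are assumptions for illustration; they are not Listener's actual schema:

```python
import time
import uuid

# Hypothetical registry of API keys issued to registered data sources.
valid_keys = {"key-123": "clickstream-source"}

def ingest(api_key, message):
    """Validate the API key, assign a UUID, and wrap the message in a
    lightweight metadata envelope ready to be produced to the Firehose."""
    if api_key not in valid_keys:
        raise PermissionError("unknown API key")
    return {
        "uuid": str(uuid.uuid4()),
        "api_key": api_key,
        "received_at": time.time(),
        "payload": message,
    }

envelope = ingest("key-123", {"page": "/home"})
print(sorted(envelope))  # ['api_key', 'payload', 'received_at', 'uuid']
```

Keeping the envelope small and uniform is what lets the downstream Router treat every message the same way regardless of its source.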
Handling traffic spikes gracefully is a critical capability of a modern data ingestion platform, and especially of a self-service one. We are leveraging Apache Kafka as a distributed write-ahead log to help satisfy the need to handle bursts of data flowing through the system, as well as to keep operating smoothly through updates, failovers, and other anomalies. Because Kafka supports partitioned topics, we can scale our Firehose and individual Streams horizontally: adding capacity for any topic is as simple as ensuring an adequate number of running Kafka nodes and making a configuration change to add a new partition.
Using a write-ahead log as our data bus allows a great deal of asynchronicity between writers and readers. Kafka holds the history of a topic for a configurable amount of time and/or space. If the applications reading data from a topic fail or are unavailable for any reason, there is no impact on upstream parts of the system. This gives us a way to build up "pressure" in a very recoverable way. We measure the volume of data flowing through the various parts of the system, and with that information we can shift resources to where they are most needed, such as adding another node for the Router application if we notice the Firehose is consistently being written to faster than it is being read.
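The asynchronicity between writers and readers can be illustrated with a toy in-memory log where each consumer keeps its own offset, as Kafka consumers do. This sketches the concept only; it is not Kafka's API:

```python
class Log:
    """Toy append-only log: producers append at their own pace,
    each consumer reads from its own offset at its own pace."""
    def __init__(self):
        self.entries = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        self.entries.append(record)

    def poll(self, consumer, max_records):
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

log = Log()
for i in range(5):
    log.append(i)               # producer bursts ahead of the readers
print(log.poll("router", 2))    # [0, 1]  -- consumer lags safely ("pressure")
print(log.poll("router", 2))    # [2, 3]
print(log.poll("audit", 10))    # [0, 1, 2, 3, 4] -- independent offset
```

Because a slow or failed consumer only affects its own offset, a burst of writes builds recoverable backlog in the log rather than back-pressure on producers.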
The application responsible for reading data from the Firehose and sorting it into the relevant Streams is the Router. The Router is built on Apache Spark and is thus horizontally scalable. Which Streams an event lands in is determined by a set of rules managed through the Application Services. Initially these rules are static, and events are grouped into Streams based on the API key used to send them to the Ingest Services. In the future we plan to support more advanced options for grouping streams.
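A minimal sketch of that rule-based demuxing, with the API-key-to-stream rules shown as a hypothetical static dict rather than the Application Services:

```python
from collections import defaultdict

# Hypothetical static rules: API key -> target stream (one per registered source).
rules = {"key-123": "clickstream", "key-456": "transactions"}

def route(firehose_batch):
    """Sort a mini-batch read from the Firehose into its target streams."""
    streams = defaultdict(list)
    for envelope in firehose_batch:
        stream = rules.get(envelope["api_key"], "unrouted")
        streams[stream].append(envelope)
    return dict(streams)

batch = [
    {"api_key": "key-123", "payload": {"page": "/home"}},
    {"api_key": "key-456", "payload": {"amount": 9.99}},
    {"api_key": "key-123", "payload": {"page": "/cart"}},
]
print({s: len(evts) for s, evts in route(batch).items()})
# {'clickstream': 2, 'transactions': 1}
```

In the real system this function body would run inside a Spark job over each mini-batch, which is what makes the Router horizontally scalable.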