The document discusses using Teradata's Unified Data Architecture and SQL-MapReduce functions to analyze customer churn for a telecommunications company. It provides examples of creating views that join customer data from Teradata, Hadoop, and Aster sources. Graphing and visualization tools are used to identify patterns in customer reboot events and equipment issues that may lead to cancellations. The document demonstrates how to gain insights into customer behavior across multiple data platforms.
2. UDA IN PRACTICE
• Teradata and Big Data
• Customer Churn Example
> Examples of Code
> How the UDA works in Practice
• IPTV Example
> Data Science Workflow
> Real-life Example
4. Modern information management: year zero
In 1970, computer scientist and former war-time Royal Air Force pilot Ted Codd published a seminal academic paper that would change Information Management forever…
5. Lots of transactions, or lots of data to analyse?
…Codd had envisaged “large, shared data banks”, queried any-which-way; but the first RDBMS implementations had focused on providing support for on-line transaction processing…
6. Modern information management: year nine
…so in 1979, four academics and software engineers quit their day jobs, maxed out their credit cards – and built the world’s first MPP Relational Database Computer in a garage in California.
7. Teradata’s “shared nothing” hardware appliance model has since been widely emulated*…
[Timeline, 1980-2010: 1st Teradata implementation goes live at Wells Fargo; Kognitio (WhiteCross); IBM DB2 Parallel Edition; Netezza; NeoView; DATAllegro; Greenplum; Vertica; Aster Data; Oracle Exadata]
* But some are more Massively Parallel Processor than others!
8. “Teradata was Big Data before there was Big Data”
Total data volume under management: ~40 Exabytes
Largest single implementation: ~40 Petabytes
# customers in the Teradata PB club: 25
Largest hybrid system: 1,500 SSDs; 12,000 HDDs
9. Key takeaway: “Big Data” are typically non-relational or “multi-structured”
*I* didn’t say Bill was ugly.
I *didn’t* say Bill was ugly.
I didn’t *say* Bill was ugly.
I didn’t say *Bill* was ugly.
I didn’t say Bill *was* ugly.
I didn’t say Bill was *ugly*.
10. The Unified Data Architecture
Users: Engineers, Data Scientists, Quants, Business Analysts
Tools: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualization, etc.
Platforms: Discovery Platform; Integrated Data Warehouse; Capture, Store, Refine
Data sources: Audio/Video, Images, Text, Web & Social, Machine Logs, CRM, SCM, ERP
26. Churners – and data quality
26 4/8/2013 Teradata Confidential
27. What events lead up to a reboot?
Note the number of paths with a reboot following another reboot!

The nPath call that builds the paths of up to 5 events preceding a reboot:

CREATE DIMENSION TABLE wrk.npath_reboot_5events AS
SELECT path, COUNT(*) AS path_count
FROM nPath
    (ON wrk.w_event_f
     PARTITION BY srv_id
     ORDER BY evt_ts DESC
     MODE (NONOVERLAPPING)
     PATTERN ('X{0,5}.reboot')
     SYMBOLS (true AS X,
              evt_name = 'REBOOT' AS reboot)
     RESULT (FIRST(srv_id OF X) AS srv_id,
             ACCUMULATE (evt_name OF ANY (X, reboot)) AS path)
    )
GROUP BY 1;

The GraphGen call that renders those paths as a Sankey chart:

SELECT *
FROM GraphGen
    (ON (SELECT * FROM wrk.npath_reboot_5events
         ORDER BY path_count
         LIMIT 30)
     PARTITION BY 1
     ORDER BY path_count DESC
     item_format('npath')
     item1_col('path')
     score_col('path_count')
     output_format('sankey')
     justify('right'));
28. View events data in Tableau
Looks like an issue with the data on the 30th September and beyond; the Reboot data for October seems to have been aggregated and added to September the 30th.
29. Address data quality
• Remove paths with all reboots and exclude data from 30th September
It would appear that events with suffix 1 and 2 can be added together.
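A minimal sketch of the kind of clean-up this slide describes, assuming an events table wrk.w_event_f with evt_ts and evt_name columns as in the nPath example; the suffix-merge expression and the date literal (including the year) are illustrative, not from the original deck:

```sql
-- Hypothetical clean-up: exclude the bad 30th September data and
-- merge event variants whose names differ only by a 1/2 suffix.
CREATE TABLE wrk.w_event_clean AS
SELECT srv_id,
       evt_ts,
       -- strip a trailing '1' or '2' so e.g. ERR1/ERR2 count together
       regexp_replace(evt_name, '[12]$', '') AS evt_name
FROM   wrk.w_event_f
WHERE  evt_ts < DATE '2012-09-30';
```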
30. Visualise as a Graph using Aster GraphGen
Size of Node = number of customers
Width of Edge = number of errors
SELECT *
FROM graphgen
(ON
(SELECT DISTINCT dmt_act_dslam,
nra_id,
nbr_of_srvid,
errorspersrv,
nbr_of_dslam
FROM wrk.srvid_dslam_err)
PARTITION BY 1
ORDER BY errorspersrv
item_format('cfilter')
item1_col('dmt_act_dslam')
item2_col('nra_id')
score_col('errorspersrv')
cnt1_col('nbr_of_srvid')
cnt2_col('nbr_of_dslam')
output_format('sigma')
directed('false')
width_max(10)
width_min(1)
nodesize_max (3)
nodesize_min (1));
32. Error and Complaint rates by equipment type
33. Thank You. Any questions?
Editor’s notes
Slides from a real PoC using data from an IPTV network looking at Quality of Service and Churn
So eBay measured… Note latency: Hadoop is batch-oriented. Note parallel efficiency: if the unit cost of acquisition is relatively low, but I have to buy very many more units, total cost of acquisition is still higher. And total cost of acquisition is not TCO; we also have to factor in development, integration, sys admin, maintenance and other costs. Note also that Hadoop is an implementation of the MapReduce programming model, not a DBMS; the impact of, for example, the lack of indexes and of cost-based optimization is likely to be even more significant for more complex queries.
Slides from a real PoC using data from an IPTV network looking at Quality of Service and Churn
This scenario involves a Telco company that is experiencing an increased number of cancellations. They want to know what behaviors lead up to a cancellation, and until now have been unable to discover those reasons. The challenge has been twofold: first, their data is on multiple platforms; second, analysis has been so time-consuming that they have been unable to estimate and budget the effort. They have data on Hadoop, processed web logs on Aster, and store data housed on their Teradata EDW. All of this data needs to be combined and then analyzed in a timely fashion. This is a common situation today across many industries, and you may see a solution here to your own challenges. During this presentation we will see the real code behind the solution: a 3-way join of data across the three platforms, which had never been done before it was done for this demonstration. We will see the analytic results output by nPath, a SQL-MR function that comes with the Teradata Aster platform, and the visualization of those results using Tableau.
This is what the environment looks like. On the left is a Hadoop cluster storing data on HDFS: a large volume of call-center data originally stored as VRU files. After processing on Hadoop it is made available through SQL-H, a new product released with Aster Database 5 that allows SQL queries against Hadoop data. On the right is the Teradata EDW, which contains structured store transactions; it is accessed through our Teradata connector, also using SQL. In the middle is online web log data stored and pre-processed on Aster using our SQL-MR functions. All of these sources are pulled together in a single SQL query on Aster and processed through nPath to discover the customers’ behavior before cancellation.
Let’s walk through the code required to perform this analysis. First, we create an HCatalog entry for the table. This code shows what is done on the Hadoop machine in order to create a table called hive-callcenter in HCatalog. It is what you might expect for any table definition: drop the table if it exists, then create the structure. You can see that the data is actually stored as a text file in Hadoop, with the location being a directory hierarchy.
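The DDL itself is not reproduced in these notes. A hedged sketch of what such a Hive/HCatalog definition might look like; only the table name comes from the deck (written here with an underscore, since Hive identifiers do not allow hyphens), and all columns and the HDFS location are invented for illustration:

```sql
-- Illustrative Hive/HCatalog DDL; column names and location are assumptions.
DROP TABLE IF EXISTS hive_callcenter;
CREATE TABLE hive_callcenter (
  cust_id     STRING,
  evt_ts      STRING,
  channel     STRING,
  call_reason STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/telco/callcenter';
```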
Next, we create a view on Aster pointing to hive-callcenter on Hadoop. In this case we create a view; however, since a view is actually just SQL code, we could put the SELECT statement anywhere in our code. We are creating a permanent view since we will be using this table often. Notice that we called the view hcat_telco_callcenter; we’ll see this again later.
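The view body is not shown in these notes. A sketch of how such a SQL-H view might look, using Aster’s load_from_hcatalog connector function; the server name, username and Hive database here are assumptions:

```sql
-- Hypothetical SQL-H view over the Hive table; argument names follow
-- the Aster load_from_hcatalog connector, values are illustrative.
CREATE VIEW hcat_telco_callcenter AS
SELECT *
FROM load_from_hcatalog (
  ON mr_driver
  server('hadoop-namenode')
  username('beehive')
  dbname('default')
  tablename('hive_callcenter')
);
```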
This is how we created the view into Teradata for the store data and called it td_telco_store. Again, just plain old SQL.
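Again the view definition is not reproduced here; a hedged sketch using Aster’s Teradata connector function, with the connection details and source table name invented for illustration:

```sql
-- Hypothetical view over the Teradata EDW store transactions;
-- tdpid, credentials and the source table are placeholders.
CREATE VIEW td_telco_store AS
SELECT *
FROM load_from_teradata (
  ON mr_driver
  tdpid('tdprod')
  username('analyst')
  password('********')
  query('SELECT * FROM store_db.store_transactions')
);
```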
Here is where the 3-way join takes place. We create the view td_telco_multi using the views into Hadoop and Teradata along with data stored on Aster. Remember td_telco_store from the last page and hcat_telco_callcenter from the one before that; telco_online is the data stored on Aster. This is an ANSI-standard view created on Aster, and it has, quite literally, never been done before it was done for this presentation.
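The text of td_telco_multi is not included in these notes. A minimal sketch, assuming the three sources are combined into one event stream with a shared column layout; whether the real view uses UNION ALL or explicit joins, and the column names, are assumptions:

```sql
-- Hypothetical 3-way combination across Hadoop (via SQL-H),
-- Teradata (via the connector) and native Aster data.
CREATE VIEW td_telco_multi AS
SELECT cust_id, evt_ts, channel, evt_name FROM hcat_telco_callcenter
UNION ALL
SELECT cust_id, evt_ts, channel, evt_name FROM td_telco_store
UNION ALL
SELECT cust_id, evt_ts, channel, evt_name FROM telco_online;
```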
Here is another view of the views, from Aqua Data Studio. Look closely and you can see the views td_telco_store and hcat_telco_callcenter. telco_online is a regular table on Aster and is not seen here. Each of these tables/views has around a million rows; when we run a count on the view td_telco_multi, we see a little over 3 million rows returned for the time period. As Aqua Data Studio and Tableau demonstrate, these data sources are available to almost any BI tool, system tool, or application that understands ODBC/JDBC. So now that we have all of the views in place to bring this data together in real time, how do we supply it to nPath?
It’s actually very simple: there is the 3-way join supplied into nPath. Notice that this is just another SQL query. There is some very sophisticated MapReduce code running under the covers of nPath, but to the business user it is exposed as an external table function with replaceable parameters. This is what makes the very powerful SQL-MapReduce functions of Teradata Aster available to the business user without programming experience beyond SQL: it’s just replaceable parameters on the function. Being this straightforward is also what makes fast analytic iterations possible. The most important parts of this nPath function are the patterns searched for and the actions taken. They are very simple in this case: look for all events that end the session in a cancellation of service; if it is just an event, label it as such; if it is a cancellation of service, label it as Cancel Service. Getting the parameters right is the most challenging thing about using the SQL-MR functions. However, since no programming or projects are required, a business user can afford to try lots of different parameters and to experiment and explore the data.
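The nPath text itself is on the slide rather than in these notes; a sketch of what a first-pass call over the combined view might look like, modeled on the reboot example elsewhere in this deck (the PATTERN, SYMBOLS and column names are all illustrative):

```sql
-- Hypothetical first-pass nPath: any run of events ending in a
-- cancellation, accumulated into one path per customer.
SELECT path, COUNT(*) AS path_count
FROM nPath (
  ON td_telco_multi
  PARTITION BY cust_id
  ORDER BY evt_ts
  MODE (NONOVERLAPPING)
  PATTERN ('EVT*.CANCEL')
  SYMBOLS (evt_name <> 'CANCEL_SERVICE' AS EVT,
           evt_name =  'CANCEL_SERVICE' AS CANCEL)
  RESULT (ACCUMULATE (evt_name OF ANY (EVT, CANCEL)) AS path)
)
GROUP BY 1
ORDER BY path_count DESC;
```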
This is the visualization of the output data we looked at on the previous slide. It represents all customers who cancelled their service and the pathways they took for the 4 steps preceding cancellation. Starting at the left we see all of the channels through which a customer could have entered; there are 14 of them. They represent the call-center data on Hadoop, the online web logs on Aster, and the store transactional data on the Teradata EDW. This is what the first pass of analysis often looks like in the real world: it’s very busy, it’s the first attempt at exploration, and there is little or no filtering of data. As you may recall, the nPath statement we looked at was relatively simple. We can see from the thickness of the colored lines on the right side that there is a lot of activity around the call center and the store, but there is too much noise to determine what common behaviors exist that might be actionable. Following this there are numerous iterations of altering the nPath parameters to get to the final, quiet determination of common behavior.
This is the final nPath function that will show us a real Golden Pathway for customer cancellations. It is very similar to the first pass nPath. It’s only a few lines of SQL and some additional parameters. It uses the same 3-way join of data and will execute the next steps identically to the first nPath. Notice in the PATTERN parameters that there is more specificity, and that the actions are more granular. This is how noise was removed from the data. Again, this is the real code that creates the visualizations. Let’s take a look at what this data looks like.
Here it is: the Golden Path toward cancellation. It’s a lot cleaner and actually shows us what customers were doing before they cancelled, in a way that we can do something about. Starting from the left, we see that customers came in through the online channel and reviewed their contract, followed by at least one, and usually two, calls to the call center either disputing their bill or registering a service complaint. The thickness of the lines shows us that there were more disputes than service complaints. These calls were followed by a visit to the store with a dispute or complaint, and that is where the cancellations occurred. This is actionable. We can implement this model in our production systems by counting the online visits and calls to the call center: for the entire population of customers, if the number of online reviews is > 0 and the number of calls into the call center is > 1, then we have a customer with a higher probability of cancelling their service, who can be flagged for intervention on their next contact. This entire analysis took place over a few days. Let’s think about this for a moment. Imagine trying to come to this conclusion using traditional SQL, without the SQL-MR function of nPath and without the ability to join this data. The first challenge is pulling the data together, with the biggest part coming from the data on Hadoop: this currently requires a skilled engineer writing MapReduce code in a lower-level language just to pull the data out. The manipulation of the data once gathered together requires around 350-400 lines of complex, recursive SQL code. Neither the pulling of the Hadoop data nor the SQL development is trivial; both require skilled programmers and, most likely, several months of work. In most shops, this level of resource allocation and time requires that a project be scoped with detailed requirements, resourced, approved and budgeted.
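The production rule described above (online reviews > 0 and call-center calls > 1) can be sketched as a plain aggregate over the combined view; the channel values and column names are assumptions:

```sql
-- Hypothetical intervention flag: customers with at least one online
-- contract review and more than one call-center call.
SELECT cust_id
FROM   td_telco_multi
GROUP  BY cust_id
HAVING SUM(CASE WHEN channel = 'ONLINE'      THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN channel = 'CALL_CENTER' THEN 1 ELSE 0 END) > 1;
```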
As challenging, expensive, and time-consuming as that project might be, the real problem is that this analysis requires many iterations; in fact, an unknown number of iterations. Each of those iterations may require a separate project: you know, Phase 1, 2, and 3, etc. This actually took on the order of nine iterations through nPath over several days. So what really happens when organizations are confronted by analysis needs like this without nPath? I can tell you that it is usually nothing. You can’t pre-determine the number of iterations, so you can’t scope it; and if you can’t do that, you aren’t going to get approval to budget and resource a project that has no end-date in sight. The reality is that most organizations never get to an answer like this. However, using nPath, a business analyst, and a few days’ work, without ever having to approve a project, not only can one get to the answer, one can also formulate an action plan. That is the real value proposition here: difficult analysis done quickly by business analysts, without the need to budget expensive and in-demand resources.
Slides from a real PoC using data from an IPTV network looking at Quality of Service and Churn
First we looked at analysing the complaints data, which was text files stored in Hadoop, and got nowhere with this: the text analytics showed that the comments fields held standard phrases such as “No fault found” or “customer issue”, or were just blank. A good example of failing fast: if it isn’t going to work, realise this and stop doing it as quickly as possible.
We then looked at patterns in data usage prior to a customer closing their account. Here each line represents a customer; it appears that just prior to account closure there was a huge surge in usage. This turned out to be an error in the data (again!).
We decided to look at the number of home router reboots as a measure of quality of service. Here the pattern of 5 events preceding a reboot can be seen, along with the code used to generate the Sankey chart (now a native Aster format, viewed in a web browser).
As previous data issues had been found, we went back and used SQL and Tableau to check the data. We found an issue on September 30th, but as the data only needs to be “good enough” to run the analysis, we can safely ignore this day and just use the 1st to the 29th for our investigation.
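The sort of SQL sanity check described here is straightforward; a sketch against the events table from the nPath example (the date-truncation expression is an assumption):

```sql
-- Hypothetical data-quality check: daily reboot counts make the
-- 30th September aggregation anomaly stand out.
SELECT CAST(evt_ts AS DATE) AS evt_day,
       COUNT(*)             AS reboots
FROM   wrk.w_event_f
WHERE  evt_name = 'REBOOT'
GROUP  BY 1
ORDER  BY 1;
```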
The final pattern with some of the noise cleaned up… The high transmitted-blocks event doesn’t help much, because it just shows that if you use the service a lot then you are more likely to reboot. But the other 3 events show a thing called synchronisation speed errors, which can be detected on the network and lead to issues with the IPTV signal at the customer end.
Using Aster’s built-in graph visualization, we can now see the way the synchro errors affect users across the entire network in a single picture. Note the thick red line in the highlighted area, and another one down and to the right of it.
Talking to the network engineers, we found out that there are two different types of hub in use. The older ones are on the left and the newer ones on the right; you can see from the colours that the newer ones are reporting far more errors than the older ones.
Final chart. Blue = new hubs; orange = old hubs. The 4th chart shows that the customers connected to the new hubs are complaining more. The 3rd chart shows that complaints by customers connected to the new hubs take longer to resolve; these two charts show proof of the quality-of-service issues. The 2nd chart shows bandwidth (higher is better), so the new hubs are actually getting better bandwidth. The 1st chart shows synchro speed (higher is better), so the new hubs have worse synchro speed. It looks like the top two are mirror images: as the bandwidth increases, the synchro speed decreases, causing the QoS issue. This turned out to be a firmware issue and not faulty hubs at all.