1. Using Power View and Hive to Gain Business Insights
Finding Hidden Answers in Data
Joey D’Antoni, Comcast Cable
Stacia Misner, Data Inspirations
April 10-12 | Chicago, IL
3. About Us
Joey D’Antoni
• Principal Architect for SQL Server at Comcast Cable
• @jdanton on Twitter
• joedantoni.wordpress.com
Stacia Misner
• Principal Consultant at Data Inspirations
• @StaciaMisner on Twitter
• blog.datainspirations.com
4. Agenda
• Introducing Big Data
• Overview and Summary of Data Set
• Insights into the Data
• Conclusions
12. Extract, Transform, Load (ETL) Process
[Diagram: Your Database → Some Process Your Business Doesn’t Care About → Some Database]
Credit: Buck Woody, Microsoft
13. Our ETL Process
[Diagram: Collection Server → HDFS]
Hive is a data warehouse system that connects to Hadoop and allows SQL queries to be written against data sets stored in Hadoop.
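As a feel for what "scripting a Hive table" over delimited files in HDFS looks like, here is a minimal HiveQL sketch. The table name, columns, and HDFS path are assumptions for illustration, not the actual Comcast schema:

```sql
-- Assumed schema and HDFS path, for illustration only.
-- An EXTERNAL table projects structure onto files already in HDFS;
-- dropping the table does not delete the underlying data.
CREATE EXTERNAL TABLE stb_engagement (
  region            STRING,
  interval_start    TIMESTAMP,
  max_boxes         INT,
  viewing_seconds   BIGINT,
  potential_seconds BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/stb/engagement';
```

Because the table is EXTERNAL, loading data is just copying files into that directory; Hive applies the schema at query time.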
14. The Data Set
Set Top Box Engagement Times
• Max Set Top Boxes Viewing Channels
• Aggregate Viewing Seconds
• Potential Total Seconds Watched
• Recorded in 5-, 15-, and 60-minute aggregates
This data is from the week of July 11-17, 2012
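Given the fields on this slide, one natural summary is the share of potential viewing time actually watched. A hedged HiveQL sketch, where the table and column names are illustrative assumptions:

```sql
-- Share of potential viewing time actually watched, per region,
-- for the week covered by the data set. Names are assumed.
SELECT region,
       SUM(viewing_seconds) / SUM(potential_seconds) AS engagement_ratio,
       MAX(max_boxes)                                AS peak_boxes
FROM stb_engagement
WHERE interval_start >= '2012-07-11'
  AND interval_start <  '2012-07-18'
GROUP BY region;
```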
15. Preparation for Data Analysis
• Define question to answer
• Define ideal data set
• Find data
17. Diving into Data Analysis
• Cleanse
  • Reformat as needed
  • Decide what is usable
• Explore
  • Create summaries
  • Perform statistical analysis
  • Use visualizations
So why is this “new” territory? Don’t we have a handle on our data? Sure, we build DW and BI solutions, but honestly, the data we pull into this environment is just a fragment of the full range of data that we could explore. Therein lies the problem.

One type of data comes from cell phones: there are 4 billion of them in the world, each generating its own mountain of data. Or maybe we’re involved in scientific research, working with instruments. Did you know that the Large Hadron Collider generates 40 TB of data… per second? Even if your industry doesn’t deal with mobile or scientific data, surely you have email. Consider how much data exists there. Or maybe there are audio files that get stored, such as in a call center operation. Or video files.

Image credits:
http://crowinfodesign.com/2009/10/19/iphone-cost-analysis/
http://themuse.ca/articles/52015
http://www.mylearning.org/digital-storytelling--recording-equipment-and-editing-in-audacity/images/1-2155/
http://wherewhywhen.com/panasonic-hc-v10eb-k-hd-camcorder-review/
Another thing about classic data analysis is that it inherently imposes structure. Although we may not necessarily use data from relational sources, we generally use data that we can easily break down into records, which in turn break down into fields. We can store all of that data relationally, repackage it in OLAP form, and use it as a source for our reports, dashboards, and so on. In other words, we rely on structure.
Thinking about how we approach a Big Data solution, there are some key differences from traditional data warehousing. First, we can scale as needed with commodity hardware. Second, we don’t have to know in advance how to structure the data, which seems rather counterintuitive for those of us who have spent a lot of time learning how to model data to support BI. Third, we have something called BASE, which stands for Basically Available, Soft-state, Eventually consistent. This is diametrically opposed to ACID, which requires atomicity (all operations in a transaction must complete), consistency (at the beginning and end of the transaction), isolation (the transaction is independent of everything else), and durability (nothing is going to eliminate that transaction). BASE says things are fluid: something can fail in one partition without failing everything everywhere. (Think of an exchange of assets that has left one party but has yet to arrive at the other; the window may be too small for either to notice, but technically it creates an out-of-sync situation.)
Hadoop provides distributed storage through HDFS (the Hadoop Distributed File System) for high throughput, and distributed processing through MapReduce. You store, index, and process data in place. (A DW makes you move data before you can use it, which is heavy lifting; imagine pushing 1 PB through a 1 Gb pipe.) Instead, you move code to the data and send results back to the user. You no longer have to sample the data; you can actually use all of it (imagine putting away the magnifying glass on a subset). More data means better predictability.

• HBase – a column-store NoSQL database: a scalable, distributed database that supports structured data storage for large tables.
• Pig – a high-level data-flow language and execution framework for parallel computation. The language layer is called Pig Latin. You can combine commands into batches and use it to read and write data on parallel systems; for example, you can use it to find the frequency of search phrases stored in a log.
• Hive – a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL.
• Mahout – a scalable machine learning and data mining library.
• Sqoop – transfers data in bulk to and from an RDBMS.
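To give a feel for Hive’s SQL-like language, the phrase-frequency task described above for Pig could be written in HiveQL roughly like this; the `search_log` table and its `phrase` column are assumptions for illustration:

```sql
-- Frequency of search phrases in a log.
-- Assumes a table with one phrase per row; names are illustrative.
SELECT phrase, COUNT(*) AS freq
FROM search_log
GROUP BY phrase
ORDER BY freq DESC
LIMIT 20;
```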
Almost all data solutions live in some sort of database. To take that data and transform it into something practical that our business users can analyze, we have to go through what’s known as an ETL process. I’m sure many of you are familiar with SQL Server Integration Services, a very common ETL tool in our space. From an IT perspective that process can be pretty painful: we have to version-control packages, and there are limitations to what we can do in a package. Right, Stacia? (Stacia: fill in here with the Chicago story.)

So that brings us to our decisions on how to handle our data for this project. We had a pretty large number of files, but we weren’t exactly sure how we wanted to handle the data. We wanted to be able to do a wide variety of analysis and not really be confined. So that leads us into our ETL process…
So for our project, we are collecting data from set top boxes. It’s aggregated for an entire region for privacy purposes, and then loaded onto a collection server in the form of comma-delimited files. Part of our strategy at Comcast is to work toward using more open source solutions, so this seemed like the perfect time to leverage Hadoop. I’m not going to cover Hadoop 101, but if you don’t know what it is, it’s basically a distributed file system (there are a lot more components than that). Our ETL process is as simple as loading files into Hadoop, an O/S-level operation that happens really quickly, and then scripting a Hive table, which we’ve also automated. Then I hand off to Excel, where Stacia can work her magic using Power View. One interesting point is that we can create multiple data structures on the same set of data.

Hive design principles: scalable, extensible (via UDF and UDAF), fault tolerant, and loosely coupled with file formats.

What Hive is not: low-latency response times on queries.

What Hive is: a data warehousing framework on Hadoop. It imposes metadata and familiar-looking HiveQL, acts as a simple translation layer for MapReduce, is extensible via custom mappers and reducers, is loosely coupled with input formats, and enables analytics from high-level BI tools via ODBC.
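The point about multiple data structures over the same data follows from Hive’s loose coupling with file formats: several EXTERNAL tables can project different schemas onto the same HDFS directory. A sketch under assumed names and an assumed path:

```sql
-- Two views of the same delimited files: one treating each line
-- as a single raw string, one fully parsed into columns.
-- Table names, columns, and the path are illustrative assumptions.
CREATE EXTERNAL TABLE stb_raw (line STRING)
LOCATION '/data/stb/engagement';

CREATE EXTERNAL TABLE stb_parsed (
  region            STRING,
  interval_start    TIMESTAMP,
  max_boxes         INT,
  viewing_seconds   BIGINT,
  potential_seconds BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/stb/engagement';
```

Neither table owns the files, so both definitions can coexist, and dropping one leaves the data untouched.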
Do demo here to show the aggregate statistics about the data.