Using Power View and Hive
to Gain Business Insights
Finding Hidden Answers in Data
Joey D’Antoni, Comcast Cable
Stacia Misner, Data Inspirations




                                   April 10-12 | Chicago, IL
Please silence
cell phones

About Us
Joey D’Antoni
• Principal Architect for SQL Server at Comcast Cable
• @jdanton on Twitter
• Joedantoni.wordpress.com

Stacia Misner
• Principal Consultant at Data Inspirations
• @StaciaMisner on Twitter
• blog.datainspirations.com




Agenda

•   Introducing Big Data
•   Overview and Summary of Data Set
•   Insights into the Data
•   Conclusions


Classic Data Analysis




  Loading     Analyzing   Visualization




Classic Data Analysis …Uses Just a Subset

                  Data Warehouse &
                     BI Solutions



            ETL
Classic Data Analysis …Requires Structure

                   Data Warehouse &
                      BI Solutions



             ETL
Why Leave the RDBMS




Key Differences


• Scale out as needed with commodity hardware
• Impose schema on read
• BASE: Basically Available, Soft-state, Eventually consistent
Hadoop Ecosystem
Note: This is only a subset of the ecosystem!

• MapReduce
• HDFS
Hadoop and Hive Demo




Extract, Transform, Load (ETL) Process
• Some Database
• Some Process Your Business Doesn’t Care About

Credit: Buck Woody, Microsoft



Our ETL Process

Collection Server → HDFS

Hive is a data warehouse system that connects to Hadoop and allows SQL queries to be written against data sets in Hadoop.



The Data Set

Set Top Box Engagement Times
•   Max Set Top Boxes Viewing Channels
•   Aggregate Viewing Seconds
•   Potential Total Seconds Watched
•   Recorded in 5, 15 and 60 minute aggregates




This data is from the week of July 11-17, 2012

Preparation for Data Analysis

   • Define question to answer


   • Define ideal data set


   • Find data




Remember Legal and Privacy Issues




Diving into Data Analysis

    • Cleanse
      • Reformat as needed
      • Decide what is usable



    • Explore
      • Create summaries
      • Perform statistical analysis
      • Use visualizations




Aggregate Statistics on Data




Resources

Connecting Excel to Hive (Hive ODBC Driver, Excel Hive Add-in)
•   http://social.technet.microsoft.com/wiki/contents/articles/6226.how-to-connect-excel-to-hadoop-on-azure-via-hiveodbc.aspx
Connecting PowerPivot to Hadoop on Azure
•   http://dennyglee.com/2012/01/21/connecting-powerpivot-to-hadoop-on-azure-self-service-bi-to-big-data-in-the-cloud/
Connecting Power View to Hadoop on Azure
•   http://dennyglee.com/2012/02/10/connecting-power-view-to-hadoop-on-azurean-awesomesauce-way-to-view-big-data-in-the-cloud/



Thank you!
Diamond Sponsor






Editor's notes

  1. So why is this “new” territory? Don’t we have a handle on our data? Sure, we build DW and BI solutions, but honestly, the data we pull into this environment is just a fragment of the full range of data that we could explore. Therein lies the problem. One type of data comes from cell phones: there are 4 billion of them in the world, each generating its own mountain of data. Or maybe we’re involved in scientific research, working with instruments. Did you know that the Large Hadron Collider generates 40 TB of data per second? Even if your industry doesn’t deal with mobile or scientific data, surely you have email. Consider how much data exists there. Or maybe there are audio files that get stored, such as in a call center operation. Or video files.
     Sources:
     • http://crowinfodesign.com/2009/10/19/iphone-cost-analysis/
     • http://themuse.ca/articles/52015
     • http://www.mylearning.org/digital-storytelling--recording-equipment-and-editing-in-audacity/images/1-2155/
     • http://wherewhywhen.com/panasonic-hc-v10eb-k-hd-camcorder-review/
  2. Another thing about classic data analysis is that it inherently imposes structure. Although we may not necessarily use data from relational sources, we generally use data that we can easily break down into records, which in turn break down into fields. We can store all of that data relationally, repackage it in OLAP form, and use it as a source for our reports, dashboards, and so on. In other words, we rely on structure.
  3. Thinking about how we approach a Big Data solution, there are some key differences from traditional data warehousing. First, we can scale as needed with commodity hardware. Second, we don’t have to know in advance how to structure the data. This seems rather counterintuitive for those of us who have spent a lot of time learning how to model data to support BI. Third, we have something called BASE, which stands for Basically Available, Soft-state, Eventually consistent. This is diametrically opposed to ACID: atomicity (all operations in a transaction must complete), consistency (at the beginning and end of the transaction), isolation (the transaction is independent of everything else), and durability (nothing is going to eliminate that transaction). BASE says things are fluid: something can fail in one partition without failing everything everywhere. (Consider an exchange of assets: the asset leaves one party but has yet to arrive at the other party. The time period could be too small for either to notice, but technically it creates an out-of-sync situation.)
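The exchange-of-assets example in the note above can be sketched in a few lines of Python. This is only an illustration of the out-of-sync window, not how any particular BASE system is implemented; the `Partition` class, the account names, and the amounts are all hypothetical.

```python
from collections import deque

class Partition:
    """One independent store of account balances."""
    def __init__(self, balances):
        self.balances = dict(balances)

# Two partitions, each holding one party's account (hypothetical names and amounts).
left = Partition({"alice": 100})
right = Partition({"bob": 50})
in_flight = deque()  # credits that were sent but not yet applied on the receiving side

def total():
    """Sum of all balances as seen across both partitions right now."""
    return sum(left.balances.values()) + sum(right.balances.values())

def send(amount):
    """Debit the sender immediately; the credit travels as a message and is applied later."""
    left.balances["alice"] -= amount
    in_flight.append(("bob", amount))

def deliver_pending():
    """Apply every queued credit, bringing the two partitions back in sync."""
    while in_flight:
        account, amount = in_flight.popleft()
        right.balances[account] += amount

before = total()
send(30)
during = total()   # the asset has left one party but has not reached the other
deliver_pending()
after = total()    # the system is consistent again
print(before, during, after)
```

Between `send` and `deliver_pending` the totals disagree with the starting state, which is the "eventually consistent" window; an ACID transaction would not let a reader observe that intermediate state.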
  4. Hadoop provides distributed storage with HDFS (the Hadoop Distributed File System), which offers high throughput, and distributed processing through MapReduce.
     • Store, index, and process in place (DW = move data before you can use it, which is heavy lifting)
     • Imagine moving 1 PB through a 1 GB pipe
     • Instead, move the code to the data and send the results back to the user
     • No longer have to sample data; can actually use all of the data (imagine a magnifying glass on a subset)
     • More data means better predictability
     Other components:
     • HBase: a column-store NoSQL database; a scalable, distributed database that supports structured data storage for large tables.
     • Pig: a high-level data-flow language and execution framework for parallel computation. The language layer is called Pig Latin. Can combine commands into batches. Can use it to read and write data on parallel systems. Example: can use it to find the frequency of phrases used for search stored in a log.
     • Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
     • Mahout: a scalable machine learning and data mining library.
     • Sqoop: transfers data in bulk between Hadoop and an RDBMS.
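To make the map and reduce steps concrete, here is a minimal single-process sketch of the search-phrase frequency example mentioned above. It is plain Python, not Pig Latin or the Hadoop API, and the log contents and function names are made up for illustration. On a real cluster the map step would run on the nodes that hold the data and the framework would handle the grouping.

```python
from collections import defaultdict

# Hypothetical search log: one search phrase per line.
LOG_LINES = [
    "big data",
    "hive tutorial",
    "Big Data",
    "power view",
    "big data",
    "hive tutorial",
]

def map_phase(line):
    """Emit one (key, value) pair per record: the normalized phrase and a count of 1."""
    yield (line.strip().lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework does between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

mapped = [pair for line in LOG_LINES for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` looks at one line only, the work can be split across as many machines as hold pieces of the log, which is the "move the code to the data" idea.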
  5. Almost all data solutions live in some sort of database. In order to take that data and transform it into something practical that our business users can analyze, we have to go through what’s known as an ETL process. I’m sure many of you are familiar with SQL Server Integration Services, a very common ETL tool in our space. From an IT perspective that process can be pretty painful: we have to version control packages, and there are limitations to what we can do in a package. Right, Stacia? (Stacia: fill in here with the Chicago story.) So that brings us to our decisions on how to handle our data for this project. We had a pretty large number of files, but we weren’t exactly sure how we wanted to handle the data. We wanted to be able to do a wide variety of analysis and not really be confined. So that leads us into our ETL process…
  6. So for our project, we are collecting data from set top boxes. It’s aggregated for an entire region for privacy purposes, and then loaded onto a collection server in the form of comma-delimited files. Part of our strategy at Comcast is to work towards using more open source solutions, so this seemed like the perfect time to leverage Hadoop. I’m not going to cover Hadoop 101, but if you don’t know what it is, it’s basically a distributed file system, although there are a lot more components than that. Our ETL process is as simple as loading files into Hadoop (an O/S function that happens really quickly) and then scripting a Hive table, which we’ve also automated. Then I hand off to Excel, where Stacia can work her magic using PowerShell. One interesting point is that we can create multiple data structures on the same set of data.
     Hive design principles
     • Scalable, extensible (via UDF, UDAF), fault tolerant, and loose coupling with file formats.
     What Hive is not
     • Low latency response times on queries.
     Data warehousing framework on Hadoop
     • Imposes metadata / familiar-looking HiveQL
     • Simple translation layer for MapReduce
     • Extensible via custom mappers/reducers
     • Loose coupling with input formats
     • Enables analytics from high-level BI tools via ODBC
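The point about creating multiple data structures on the same set of data is the schema-on-read idea. The sketch below shows it in plain Python rather than Hive: the raw comma-delimited text stays untouched, and each reader applies its own column list when it reads. The file contents, column names, and the `read_with_schema` helper are assumptions made up for this example; the actual layout of the set top box files is not given in the deck.

```python
import csv
import io

# Hypothetical raw file contents: region, channel, interval start, viewing seconds.
RAW = """\
northeast,ESPN,2012-07-11 20:00,5400
northeast,HBO,2012-07-11 20:00,3600
midwest,ESPN,2012-07-11 20:00,1800
"""

def read_with_schema(raw_text, columns, converters=None):
    """Apply a column list (the 'schema') to raw delimited text at read time."""
    converters = converters or {}
    rows = []
    for values in csv.reader(io.StringIO(raw_text)):
        # zip() stops at the shorter sequence, so fields beyond the schema are ignored.
        row = dict(zip(columns, values))
        for name, convert in converters.items():
            row[name] = convert(row[name])
        rows.append(row)
    return rows

# Structure 1: every field is named, and viewing seconds are converted to integers.
detailed = read_with_schema(
    RAW,
    ["region", "channel", "interval_start", "viewing_seconds"],
    {"viewing_seconds": int},
)

# Structure 2: the same raw text, keeping only the region column.
by_region = read_with_schema(RAW, ["region"])

total_seconds = sum(row["viewing_seconds"] for row in detailed)
regions = sorted({row["region"] for row in by_region})
print(total_seconds, regions)
```

Neither structure required rewriting or reloading the underlying text, which is what makes it cheap to try a different shape when the analysis question changes.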
  7. Do demo here to show the aggregate statistics about the data.
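As a stand-in for the demo, here is a small sketch of the kind of aggregate statistics one might compute over the measures listed on the data set slide. The records, field names, and values are invented for illustration, and the "engagement" ratio (viewing seconds divided by potential seconds) is an assumed metric, not one the deck defines.

```python
from statistics import mean

# Hypothetical 60-minute aggregates; field names and values are invented for illustration.
records = [
    {"channel": "A", "max_boxes": 100, "viewing_seconds": 180_000, "potential_seconds": 360_000},
    {"channel": "A", "max_boxes": 120, "viewing_seconds": 216_000, "potential_seconds": 432_000},
    {"channel": "B", "max_boxes": 50, "viewing_seconds": 45_000, "potential_seconds": 180_000},
]

def summarize(rows):
    """Group rows by channel and compute simple summary statistics for each group."""
    by_channel = {}
    for row in rows:
        by_channel.setdefault(row["channel"], []).append(row)

    summary = {}
    for channel, group in by_channel.items():
        viewed = sum(r["viewing_seconds"] for r in group)
        potential = sum(r["potential_seconds"] for r in group)
        summary[channel] = {
            "peak_boxes": max(r["max_boxes"] for r in group),
            "mean_boxes": mean([r["max_boxes"] for r in group]),
            "total_viewing_seconds": viewed,
            # Assumed metric: share of the potential viewing time actually watched.
            "engagement": viewed / potential,
        }
    return summary

stats = summarize(records)
for channel, values in stats.items():
    print(channel, values)
```

In the talk's setup this grouping would more likely be expressed as a Hive query and explored in Excel; the Python version only shows the shape of the calculation.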