2. About
▪ Koray Kocabaş
▪ Data Platform (SQL Server) MVP
▪ Yemeksepeti Business Intelligence
▪ Bahcesehir University Instructor
▪ @koraykocabas
▪ https://tr.linkedin.com/in/koraykocabas
▪ Blog: http://www.misjournal.com
▪ E-Mail: koraykocabas@outlook.com
3. Evolution of Data
Internet of Things
Web 2.0
ERP/CRM
• Clickstream
• Sensors / RFID / Devices
• Log Files
• Spatial & GPS Coordinates
• Social Media
• Mobile
• Advertising
• eCommerce
• Digital Marketing
• Search Marketing
• Recommendations
• Payables
• Payroll
• Inventory
• Contacts
• Deal Tracking
• Sales Pipeline
Gigabytes Terabytes Petabytes Exabytes
4. Big Data Utility Gap
70% of data generated by customers
80% of data being stored
3% being prepared for analysis
0.5% being analyzed
< 0.5% being operationalized
5. How Does this Work in Practice
Store Everything
• Obsessively collect data
• Keep it forever
• Put the data in one place
Analyze Anything
• Cleanse, organize and manage your data
• Make the right tools available
• Use the resources wisely to compute, analyze and understand data
Build the Right Thing
• Use insights to iteratively improve your product
6. Big Data isn’t meaningful
Big Data is not just data
65+ million members
50 countries
1,000+ devices supported
~25 PB data warehouse on the cloud (10% read)
~550 billion events daily
7.
• 20 Million songs
• 24 Million Active Users
• 8 Million Daily Active Users
• 1TB of Compressed Data Generated From Users Per Day
• 700 node Hadoop Cluster
8. Big Data is not just data
Cannes Lions 2014 - Grand Prix - Titanium: Honda 'Sound of Honda Ayrton Senna 1989'
2,000+ sensors, 200 GB of data per race
9. Big Data is not just data
Boeing generates 20 TB of data per hour
10. ~10 billion rows processed (daily), ~750 million rows in the result set (daily)
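To put these daily volumes in perspective, here is a quick back-of-the-envelope calculation. It assumes the load is spread evenly over 24 hours, which the slide does not state, so treat the rate as an average rather than a peak.

```python
# Back-of-the-envelope arithmetic for the daily volumes quoted on the slide.
# Assumption (not from the slide): processing is spread evenly over 24 hours.
ROWS_PROCESSED_PER_DAY = 10_000_000_000   # ~10 billion rows
ROWS_IN_RESULT_PER_DAY = 750_000_000      # ~750 million rows
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_second(rows_per_day: int) -> float:
    """Average row rate if the daily volume were processed uniformly."""
    return rows_per_day / SECONDS_PER_DAY


def result_ratio(result_rows: int, processed_rows: int) -> float:
    """Fraction of processed rows that end up in the result set."""
    return result_rows / processed_rows


avg_rate = rows_per_second(ROWS_PROCESSED_PER_DAY)
ratio = result_ratio(ROWS_IN_RESULT_PER_DAY, ROWS_PROCESSED_PER_DAY)
print(f"{avg_rate:,.0f} rows/s on average")
print(f"{ratio:.1%} of processed rows reach the result set")
```

Under that assumption the platform averages roughly 116 thousand rows per second, and about 7.5% of the processed rows end up in the result set.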
11. New E-Commerce Big Data Flow
Purchase
User
Product
Data Warehouse
Store it All
12.
13. Overview (ETL to ELT)
• Demand
• Architecture
• Data Loading
• Data Preparation
• Analytics
• Validation
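The difference between ETL and ELT is where the transformation runs. A minimal sketch follows, using Python's built-in sqlite3 module as a stand-in for the data warehouse. The table names, columns and sample rows are hypothetical. Raw records are loaded unchanged first, and the preparation step runs afterwards as SQL inside the store.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows as-is, then transform them inside the database (ELT)."""
    conn = sqlite3.connect(":memory:")
    # Load: keep the raw data untouched, every field stored as text.
    conn.execute("CREATE TABLE raw_orders (user_id TEXT, amount TEXT, city TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)
    # Transform: cleanse and aggregate with SQL, after the load has finished.
    conn.execute(
        """
        CREATE TABLE orders_by_city AS
        SELECT UPPER(TRIM(city))         AS city,
               COUNT(*)                  AS order_count,
               SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        WHERE amount <> ''
        GROUP BY UPPER(TRIM(city))
        """
    )
    rows = conn.execute(
        "SELECT city, order_count, revenue FROM orders_by_city ORDER BY city"
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample data with inconsistent casing and one missing amount.
sample = [
    ("u1", "10.5", " istanbul"),
    ("u2", "4.5", "Istanbul "),
    ("u3", "", "ankara"),   # missing amount, dropped during preparation
    ("u4", "20", "Ankara"),
]
print(run_elt(sample))
```

Because the raw table is kept, the preparation query can be changed and re-run later without collecting the data again, which fits the "store it all" approach described earlier.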
16. Google Analytics & Adobe Omniture
Problem 1: How can we collect data?
Problem 2: How can we store data?
Problem 3: How can we visualize data?
Problem 4: How can we predict data?
20. MOOC
Big Data Analytics, Implementing Big Data Analysis, Big Data Analytics with HDInsight, Big Data and Business Analytics Immersion, Getting Started with Microsoft Azure Machine Learning
Real World Big Data in Azure, Big Data on Amazon Web Services, Reporting with MongoDB, Cloud Business Intelligence, HDInsight Deep Dive: Storm HBase and Hive, Data Science & Hadoop Workflows at Scale With Scalding, SQL on Hadoop - Analyzing Big Data with Hive
Introduction to Big Data Analytics, Machine Learning with Big Data, Big Data Analytics for Healthcare, Data Science at Scale, The Data Scientist's Toolbox, R Programming
Master Big Data and Hadoop Step by Step, Hadoop Essentials, Hadoop Starter Kit, Data Analytics using Hadoop eco system, Big Data: How Data Analytics Is Transforming the World, Applied Data Science with R, Hadoop Enterprise Integration
Data Science and Analytics in Context, Introduction to Big Data with Spark, Data Science and Machine Learning Essentials, Machine Learning for Data Science and Analytics, Statistical Thinking for Data Science and Analytics
23. One more cup of coffee
https://azure.microsoft.com/en-us/pricing/details/hdinsight/
https://azure.microsoft.com/en-us/pricing/calculator/#
24.
25. Developed by Facebook and later adopted by Apache as an open source project.
A data warehouse infrastructure built on top of Hadoop that provides data summarization, query and analysis
Integrates Hadoop with BI and visualization tools
Provides a SQL-like language called HiveQL to query data
Supports CREATE INDEX and partitioning
Often described as not supporting UPDATE (this isn't correct)
Provides users, groups and roles, but is not designed for high security
Interfaces: console (hive>), script, ODBC/JDBC, SQuirreL, HUE, web interface, etc.
Most popular Business Intelligence tools support Hive
26. Data Types
Primitive data types: int, bigint, float, double, boolean, decimal, string, timestamp, date, etc.
Complex data types: arrays, maps, structs
ARRAY<string>: workplace: istanbul, ankara
STRUCT<sex:string,age:int>: Female, 25
MAP<string,int>: SOLR: 92
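As a rough mental model, the three complex types map onto familiar containers. The sketch below is plain Python, not Hive itself. The `Person` record and the column names `demographics` and `skills` are hypothetical, built around the example values above. The comments note how each field would be accessed in HiveQL.

```python
from typing import Dict, List, NamedTuple


class Demographics(NamedTuple):
    """Stands in for STRUCT<sex:string,age:int>."""
    sex: str
    age: int


class Person(NamedTuple):
    """One row with an array, a struct and a map column (names are made up)."""
    workplace: List[str]          # ARRAY<string>
    demographics: Demographics    # STRUCT<sex:string,age:int>
    skills: Dict[str, int]        # MAP<string,int>


row = Person(
    workplace=["istanbul", "ankara"],
    demographics=Demographics(sex="Female", age=25),
    skills={"SOLR": 92},
)

first_workplace = row.workplace[0]   # HiveQL: workplace[0]
age = row.demographics.age           # HiveQL: demographics.age
solr_score = row.skills["SOLR"]      # HiveQL: skills['SOLR']

print(first_workplace, age, solr_score)
```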
Hive | RDBMS
SQL interface | SQL interface
Focus on analytics | May focus on online or analytics
No transactions | Transactions usually supported
Partition adds, no random inserts | Random insert and update supported
Distributed processing via map/reduce | Distributed processing varies by vendor (if available)
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Often built on proprietary hardware (especially when scaling out)
Low cost per petabyte | What's a petabyte? :) (note: are you sure?)
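The "partition adds" entry reflects how Hive lays out partitioned tables: each partition is a subdirectory named key=value under the table's directory, so new data typically arrives as whole partitions rather than as individual row inserts. Here is a minimal sketch of that path layout. The warehouse root, table name and partition keys are assumptions chosen for illustration.

```python
from pathlib import PurePosixPath


def partition_path(warehouse_root, table, partition_spec):
    """Build the directory that would hold one partition of a table.

    partition_spec is an ordered mapping of partition column to value,
    for example {"dt": "2015-01-01", "country": "TR"}.
    """
    path = PurePosixPath(warehouse_root) / table
    for key, value in partition_spec.items():
        # Each partition column adds one "key=value" directory level.
        path = path / f"{key}={value}"
    return str(path)


# Hypothetical table "orders" partitioned by date and country.
example = partition_path(
    "/user/hive/warehouse", "orders", {"dt": "2015-01-01", "country": "TR"}
)
print(example)
```

A query that filters on the partition columns only needs to read the matching directories, which is why partitioning matters so much for large tables.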
27. Hive Architecture
SQL on Hadoop Frameworks
• Apache Hive
• Impala
• Presto (Facebook)
• EMC/Pivotal HAWQ
• BigSQL by IBM
30. Originally developed at Yahoo! (with large contributions from Hortonworks and Twitter)
A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
Processes large semi-structured data sets using Hadoop MapReduce
Lets you write complex MapReduce jobs in a simple scripting language (Pig Latin)
Provides built-in aggregation functions (AVG, COUNT, SUM, MAX, MIN, etc.)
Developers can write user-defined functions (UDFs)
Interfaces: console (grunt), script, Java, HUE (Hadoop User Experience, by Cloudera)
Easy to use and efficient
32. Data Types
Simple data types: int, float, double, chararray (UTF-8), bytearray
Complex data types: map (key, value), tuple, bag (list of tuples)
Commands
Loading and storing: LOAD, STORE, DUMP
Filtering: FILTER, FOREACH, DISTINCT
Grouping: JOIN, GROUP, COGROUP, CROSS
Ordering: ORDER, LIMIT
Merging and splitting: UNION, SPLIT
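For readers new to Pig, the complex types can be pictured with plain Python containers. This is only an analogy with made-up sample values: a tuple is an ordered set of fields, a bag is a collection of tuples, and a map is a set of key/value pairs. The helper `group_by_first_field` is hypothetical and only mimics the shape of what GROUP returns.

```python
# Tuple: an ordered set of fields.
user = ("koray", 3, 9.5)

# Bag: a collection of tuples (duplicates are allowed).
clicks = [("koray", "home"), ("ayse", "cart"), ("koray", "cart")]

# Map: key/value pairs.
properties = {"city": "istanbul", "device": "mobile"}


def group_by_first_field(bag):
    """Mimic the shape of GROUP: one (group, bag of matching tuples) per key."""
    grouped = {}
    for t in bag:
        grouped.setdefault(t[0], []).append(t)
    return [(key, members) for key, members in grouped.items()]


grouped = group_by_first_field(clicks)
for key, members in grouped:
    print(key, members)
```

Each grouped record is itself a tuple whose second field is a bag, which is why nested expressions such as SUM over a bag field appear in Pig scripts.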
SQL script vs. Pig script
SQL: SELECT * FROM TABLE
Pig: A = LOAD 'DATA' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
SQL: SELECT col1+col2, col3 FROM TABLE
Pig: B = FOREACH A GENERATE col1+col2, col3;
SQL: SELECT col1+col2, col3 FROM TABLE WHERE col3>10
Pig: C = FILTER B BY col3>10;
SQL: SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2
Pig: D = GROUP A BY (col1, col2);
     E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
SQL: ... HAVING sum(col3) > 5
Pig: F = FILTER E BY $2>5;
SQL: ... ORDER BY col1
Pig: G = ORDER F BY $0;
SQL: SELECT DISTINCT col1 FROM TABLE
Pig: I = FOREACH A GENERATE col1;
     J = DISTINCT I;
SQL: SELECT col1, COUNT(DISTINCT col2) FROM TABLE GROUP BY col1
Pig: K = GROUP A BY col1;
     L = FOREACH K {M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M);}
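To make the GROUP / FOREACH / FILTER / ORDER chain concrete, here is a plain-Python sketch that mimics relations A through G, plus K and L, from the comparison above on a few made-up rows. It illustrates the data flow only, not how Pig executes the script on a cluster.

```python
from collections import defaultdict

# A: rows of (col1, col2, col3); the sample values are made up.
A = [(1, 1, 4), (1, 1, 3), (2, 1, 5), (2, 2, 6), (1, 1, 2)]

# D = GROUP A BY (col1, col2);
D = defaultdict(list)
for row in A:
    D[(row[0], row[1])].append(row)

# E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
E = [(col1, col2, sum(r[2] for r in rows)) for (col1, col2), rows in D.items()]

# F = FILTER E BY $2 > 5;
F = [t for t in E if t[2] > 5]

# G = ORDER F BY $0;
G = sorted(F, key=lambda t: t[0])

# K = GROUP A BY col1;
K = defaultdict(list)
for row in A:
    K[row[0]].append(row)

# L = FOREACH K { M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M); }
L = sorted((col1, len({r[1] for r in rows})) for col1, rows in K.items())

print(G)
print(L)
```

Each Pig statement names an intermediate relation, so the script reads as a step-by-step pipeline, where SQL expresses the same logic as a single nested statement.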
39. Case Study Klout
• Collects and normalizes more than 12 billion signals a day
• Hive data warehouse of more than 1 trillion rows
• Klout acquired for $200 million by Lithium Technologies