57. Intel AA Team
Advanced Analytics
Vision: Make analytics a competitive advantage for Intel
Mission:
• Solve strategic high value business line problems
• Leverage analytics to grow Intel revenue
About the team:
• ~100 employees - corporate ownership of advanced analytics
• Big data and Machine Learning are key focus areas
• Skills: Software Engineering / Decision Science / Business Acumen
• Value driven – ROI > $10M and/or a key corporate problem as defined by VPs
• Part of the Israel Academy Computational research center
Intel Confidential
58. AA Overview
Big Data Analytics Platform
Highly scalable, hybrid platform to support a range of business use cases
Prediction Module
MPP
High Speed Data Loader
Heterogeneous data, batch oriented
on advanced analytics
Rich advanced analytics and real-time, in-database data mining capabilities
59. Enterprise On the Cloud
Why Cloud?
• Known reasons
– Reduce cost
– Universal access
– Scale fast
• Additional reasons
– Flexible and agile platform – no need for the engineering team to certify each tool
– Development accelerator – the R&D team can start developing while the engineering teams implement the platform on premises
60. Use Case
Use Case
Characteristics:
– CPU behavior data
– Size: 30TB of data per month
– Type: Structured data
– Processing:
• Create aggregated facts and enable ad hoc analysis
• Create ML solutions
Current Status:
– Data is sampled and processed on SMP RDBMS
– Processing the entire data set takes almost 24 hours
Problem Statement
– Limited ability to analyze all the data
61. Enterprise On the Cloud
Platforms
On premise
– HBase – Hadoop platform exists
• HBase is not installed
– MPP DB – Exists with Machine Learning capabilities
• A lower-cost platform would need to be evaluated and purchased
Cloud
– HBase - EMR
– MPP DB - AWS Redshift
Go for POC on the Cloud
62. Enterprise On the Cloud
Evaluation Criteria
• Capabilities
– Create statistics calculations
• Cost of HW per TB
– Replication
– Compression
• Performance
– Load, transformation, querying
• Scalability
• Ability to execute
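The "cost of HW per TB" criterion combines replication and compression, which can be sketched as a small calculation. This is a minimal illustration, not the team's actual cost model: the function names and every input value below are hypothetical placeholders, and the formula assumes that replication multiplies the storage needed while compression divides it.

```python
def effective_cost_per_tb(cost_per_raw_tb, replication_factor, compression_ratio):
    """Cost to store 1 TB of user data, given hardware cost per raw TB.

    Assumption: each TB of user data occupies
    replication_factor / compression_ratio TB of raw disk.
    """
    if cost_per_raw_tb < 0:
        raise ValueError("cost_per_raw_tb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return cost_per_raw_tb * replication_factor / compression_ratio


def break_even_compression(cost_per_raw_tb, replication_factor, reference_cost_per_tb):
    """Compression ratio at which the effective cost equals a reference cost."""
    if reference_cost_per_tb <= 0:
        raise ValueError("reference_cost_per_tb must be positive")
    return cost_per_raw_tb * replication_factor / reference_cost_per_tb


# Placeholder inputs only; these are NOT figures from the evaluation.
example_cost = effective_cost_per_tb(
    cost_per_raw_tb=1000.0, replication_factor=2, compression_ratio=2.0
)
example_break_even = break_even_compression(
    cost_per_raw_tb=1000.0, replication_factor=2, reference_cost_per_tb=500.0
)
print(f"effective cost per TB: {example_cost:.2f}")
print(f"break-even compression ratio: {example_break_even:.2f}:1")
```

With real prices substituted for the placeholders, the second function gives the compression ratio a platform must reach to match a reference cost per TB.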
63. Use Case
Preliminary Results
Dataset example
– 34GB of compressed data divided into files
– ~1,500,000,000 records
– ~24B per record compressed, 240B per record uncompressed (~15 columns)
Performance & Scalability - 8 x 1XL nodes
– Load time – 2 hours when the data is split into 32 files (5 hours when split into 4 files)
– Table size – 202GB (compression rate ~1.5:1)
– SQL aggregation statements
• 38K records – 6 minutes
• 14M records – 7 minutes
• 66M records – 11 minutes ( on 4 x 1XL – 22 minutes )
• 939M records – 34 minutes ( on 4 x 1XL – 77 minutes )
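The two queries that were timed on both cluster sizes allow a rough scaling check. The sketch below uses only the timings listed above; the variable and function names are illustrative, and "efficiency" is defined here as speedup divided by the factor by which the node count grew (2x, from 4 to 8 nodes).

```python
# (label, record count, minutes on 8 x 1XL, minutes on 4 x 1XL), from the results above
timings = [
    ("66M", 66_000_000, 11, 22),
    ("939M", 939_000_000, 34, 77),
]

NODE_RATIO = 8 / 4  # the larger cluster has twice as many nodes


def speedup(minutes_small_cluster, minutes_large_cluster):
    """How many times faster the larger cluster ran the same query."""
    return minutes_small_cluster / minutes_large_cluster


results = {}
for label, records, minutes_8, minutes_4 in timings:
    s = speedup(minutes_4, minutes_8)
    results[label] = {
        "speedup": s,
        "efficiency": s / NODE_RATIO,  # 1.0 means perfectly linear scaling
        "records_per_minute_8_nodes": records / minutes_8,
    }

for label, r in results.items():
    print(
        f"{label}: speedup {r['speedup']:.2f}x, "
        f"efficiency {r['efficiency']:.2f}, "
        f"{r['records_per_minute_8_nodes']:,.0f} records/min on 8 nodes"
    )
```

On these two data points, doubling the node count at least halves the query time, which suggests roughly linear scaling for this workload; two measurements are too few to generalize.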
64. Use Case
Capabilities and Cost
• No current ability to write custom code (Java/C++/Python/R)
– Statistics and algorithms must be implemented in SQL
• Compression is not straightforward
• Cost is sensitive to the actual compression ratio
– 2.6:1 is break-even
• 8XL vs. High Storage instance (16 cores, 48TB)
• 3 years with 100% utilization
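As an illustration of implementing statistics in SQL when no procedural code can run in the database, the sketch below computes a per-group count, mean, and population variance using only standard aggregates. It is a minimal example under stated assumptions: Python's built-in sqlite3 stands in for the MPP database, and the table `cpu_events`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MPP database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu_events (cpu_id INTEGER, metric REAL)")

# Hypothetical sample rows, for illustration only.
rows = [(1, 2.0), (1, 4.0), (1, 6.0), (2, 10.0), (2, 10.0)]
conn.executemany("INSERT INTO cpu_events VALUES (?, ?)", rows)

# Population variance expressed with plain aggregates: E[x^2] - (E[x])^2.
query = """
SELECT cpu_id,
       COUNT(*)                                         AS n,
       AVG(metric)                                      AS mean,
       AVG(metric * metric) - AVG(metric) * AVG(metric) AS pop_variance
FROM cpu_events
GROUP BY cpu_id
ORDER BY cpu_id
"""

stats = {cpu_id: (n, mean, var) for cpu_id, n, mean, var in conn.execute(query)}
for cpu_id, (n, mean, var) in stats.items():
    print(f"cpu_id={cpu_id} n={n} mean={mean:.3f} pop_variance={var:.3f}")

conn.close()
```

The E[x^2] - (E[x])^2 form keeps everything in a single pass of aggregates, at the price of possible precision loss when the mean is large relative to the spread.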
75. [Diagram: data pipeline. Labels: Web Servers; User Action Trace Events (Join via Facebook, Add a Skill Page, Invite Friends); Raw Data, Amazon S3; Raw Events; EMR; Aggregated Data, Amazon S3, Amazon Redshift; Get Data; Tableau; Excel; Internal Web; Data Analyst]
Hive Scripts Process Content
• Process log files with regular expressions to parse out the info we need.
• Process cookies into useful searchable data such as Session, UserId, API Security token.
• Filter surplus info like internal Varnish logging.
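The three processing steps above can be sketched in a few lines. This is an illustration in Python rather than Hive, and everything about the input is an assumption: the log line layout (timestamp, source, path, quoted cookie string), the cookie names `Session`, `UserId` and `ApiToken`, and the sample lines are all hypothetical.

```python
import re

# Hypothetical layout: <timestamp> <source> <path> "<cookie string>"
LOG_PATTERN = re.compile(
    r'^(?P<timestamp>\S+) (?P<source>\S+) (?P<path>\S+) "(?P<cookie>[^"]*)"$'
)


def parse_cookie(cookie_string):
    """Split 'k1=v1; k2=v2' into a dict, ignoring malformed pairs."""
    pairs = (part.strip() for part in cookie_string.split(";"))
    return dict(pair.split("=", 1) for pair in pairs if "=" in pair)


def parse_line(line):
    """Return a record dict, or None for unparseable or surplus lines."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # line does not follow the assumed layout
    if match.group("source") == "varnish":
        return None  # filter internal Varnish logging
    cookie = parse_cookie(match.group("cookie"))
    return {
        "timestamp": match.group("timestamp"),
        "path": match.group("path"),
        "session": cookie.get("Session"),
        "user_id": cookie.get("UserId"),
        "api_token": cookie.get("ApiToken"),
    }


# Hypothetical sample input, for illustration only.
sample_lines = [
    '2000-01-01T10:00:00 web /join "Session=abc123; UserId=42; ApiToken=tok9"',
    '2000-01-01T10:00:01 varnish /health ""',
    "not a log line",
]

records = [r for r in map(parse_line, sample_lines) if r is not None]
for record in records:
    print(record)
```

Only the first sample line survives: the second is dropped as Varnish logging and the third does not match the assumed layout.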