2. What is Big Data*?
The collection & analysis of data sets so large, complex & rapidly changing
that they are difficult to process & understand using traditional data processing tools &
applications
• Term coined by either Professor Francis Diebold, University of Pennsylvania, or John Mashey, Chief Scientist of Silicon Graphics, around 1999
5/5/2015 Big Data Workshop 3
3. Traditional Analytics vs. Big Data Analytics
Velocity
• Traditional Analytics: reports past events; processing times of 1-2 days; batch-file oriented
• Big Data Analytics: responds NOW; processing times of <1-5 seconds; near-real-time oriented
Variety
• Traditional Analytics: traditional big relational DB data; self-generated; defined meta-data structures; batch-file oriented
• Big Data Analytics: real-time data + warehouse; everyone creates data, e.g., industry, cross-industry, government, etc.; all forms: images, videos, texts; near real time
Volume
• Traditional Analytics: linear growth; mostly sampled; gigabytes (10^9), terabytes (10^12)
• Big Data Analytics: exponential growth; all the data; petabytes (10^15), exabytes (10^18), zettabytes (10^21), yottabytes, etc.
Relevance
• Traditional Analytics: sustained relevance of data series
• Big Data Analytics: short-term relevance of data snippets
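To make the unit scale in the Volume comparison concrete, here is a minimal Python sketch that expresses a byte count in the decimal units named above. The function name `humanize_bytes` and the unit table are illustrative assumptions, not part of the workshop material; the yottabyte value (10^24) is the standard decimal definition.

```python
# Decimal (SI) storage units named in the comparison, smallest to largest.
UNITS = [
    ("gigabytes", 10**9),
    ("terabytes", 10**12),
    ("petabytes", 10**15),
    ("exabytes", 10**18),
    ("zettabytes", 10**21),
    ("yottabytes", 10**24),
]


def humanize_bytes(n_bytes: int) -> str:
    """Express a byte count in the largest listed unit that keeps the value >= 1.

    Counts below one gigabyte fall back to (fractional) gigabytes.
    """
    name, size = UNITS[0]
    for unit_name, unit_size in UNITS:
        if n_bytes >= unit_size:
            name, size = unit_name, unit_size
    return f"{n_bytes / size:g} {name}"


print(humanize_bytes(5 * 10**12))    # 5 terabytes
print(humanize_bytes(295 * 10**18))  # 295 exabytes
```

Each step up the list is a factor of 1,000, so moving from the "traditional" terabyte range to the zettabyte range is a factor of a billion.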
4. More Big Data “facts” – Appendix pages 24 - 26
5. Drivers of Big Data
(not all-inclusive)
Historical
• Cost of Data Storage
• Cost of Computing
• Mobile Phones and tablets
• Increased access to the Internet
• Social Media and eCommerce
• Web Search
• Etc.
Future
• IoT (Internet of Things)
• Internet Cloud
• Analog to Digital Conversions
• Data Driven Decision-making
• Enterprise Applications
• Fraud, Security, CRM, ERP, etc.
[Figure: Data Consumption Over the Years. Y-axis: zettabytes (0 to 40); x-axis: 2010, 2015, 2020; series: Total Data, Enterprise Managed, Enterprise Created]
- In 2007, it was estimated that all human knowledge amounted to 295 exabytes
- In 2015, 1 exabyte was created each day on the internet = 250 million DVDs’ worth of information
- By 2020, there will be 5.2 terabytes per person on earth
- 70% of all data is generated by individuals, but 80% is stored & managed by enterprises
* The Rapid Growth of Big Data - CSC
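The DVD comparison above can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal units (1 exabyte = 10^18 bytes) and uses the nominal 4.7 GB capacity of a single-layer DVD as a reference point; neither assumption is stated in the source.

```python
EXABYTE = 10**18          # bytes, decimal definition (assumption)
DVD_COUNT = 250_000_000   # "250 million DVDs" from the slide
SINGLE_LAYER_DVD = 4.7e9  # nominal single-layer DVD capacity in bytes (assumption)

# Bytes per DVD implied by "1 exabyte = 250 million DVDs".
implied_bytes_per_dvd = EXABYTE / DVD_COUNT
print(f"Implied capacity per DVD: {implied_bytes_per_dvd / 1e9:.1f} GB")

# How many nominal single-layer DVDs one exabyte would fill.
dvds_needed = EXABYTE / SINGLE_LAYER_DVD
print(f"Single-layer DVDs per exabyte: {dvds_needed / 1e6:.0f} million")
```

The implied figure of 4 GB per disc is in the same range as a single-layer DVD, so the slide's comparison reads as a reasonable round number rather than an exact conversion.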
More Big Data “Facts Pages
6. People of Big Data
[Figure labels: OPS; Big Data; System Administration; Sys Design/Engineering; Apps, Process, Business]
Data Scientist
• Recent role, emerging with Big Data around 2010
• Uses internal, external & 3rd-party data sets to answer questions
• Knowledge of BD systems and tools like Hadoop, Spark, Python/R, statistics, etc.
• Looks for hidden insights to solve business problems
• Usually PhD or MS degree
Data Engineer
• Traditional engineer who knows DB systems, Excel, Access, etc.
• Compiles and installs DB systems, writes queries, etc.
• Knows DB software such as SQL/NoSQL
• Usually CS or Info Sys degree
Data Analyst
• Compiles & analyzes information – generally not Big Data
• Draws analytical insights from available data and makes business reports to aid decision making – e.g., sales analyst, operations analyst, etc.
• Usually has CS or Business degree
Data Architect
• Manages lots of data
• Translates data into usable info
• Designs DBs and manages data sets
• Usually Computer Science, Computer Engineering, Information Systems, etc. degree
7. Generalized Big Data Education Curricula*
Usually Interdisciplinary
• PhD Programs: Statistics, Computer Science, Elected Discipline**
• Masters Programs: Elected Discipline**, Statistics, Computer Science
* Based on experience with Carnegie Mellon, Stanford and UC Berkeley
** e.g., Engineering Programs, Biology, Economics, Psychology, etc.
8. Big Data Challenges
1. General Business Topics
a. Value vs. Cost vs. Business Case
b. Understanding how to apply Big Data
• Focus on use cases, not technology
c. Corporate Commitment & Leadership
• Hesitation usually exists
d. Common Taxonomy/Definitions
e. HR/Personnel Issues
• Finding skilled personnel
– everyone thinks they know Big Data ;>
• Educating the masses
• Vendor- vs. employee-led
f. Privacy, Security & Policy/Regulations
• New/increased costs
g. Etc.
[Figure: Big Data Capability Blocks. Blocks: Reporting, Monitoring, Data Mining, Evaluation, Prediction. Questions shown: What is happening? What is happening now? Why did it happen? (correlations only) Why did it happen? (hypothesis based) What will happen? Axes: Complexity; Value & Cost]
[Figure: Privacy vs. Value Breakdown. Axes: $Value; Privacy. Items: Usage, Preferences, Status, Location, Personal Identifiers, Intent, Relationships, Demographics, Interactions]
3 Steps to Sustained Big Data Analytics Evolution – Appendix page 27
9. Big Data Challenges (cont’d)
2) Data Management Topics
a. Data Governance – e.g., efficiencies of
standardization & data sharing
b. Collecting the “right” data – not everything
c. Lack of common data dictionary across
enterprise
d. How much data to collect – e.g., storage costs
e. Amount of data integrity – e.g., redundant
data systems vs. how they handle data
f. Data retention – e.g., how long to store
g. Etc.
3) Data / BI Architectures Topics
a. Redundant & customized data systems, e.g.,
siloed data
b. Greenfield vs. Metadata & Federation Usage
c. Where analytics are performed – e.g.,
data transport costs
d. Etc.
10. What are the essential elements needed to create
a Big Data professional education program?
Charge to Participants
This could include:
• What are the challenge areas in Big Data environment?
• What are the skills and knowledge needed to meet those challenges?
• How do we address these educational needs or gaps through training?
11. Example Use Cases by Industry
• Educators (page 14)
• Health Care (page 15)
• HR Administrators (page 16)
• Language/Linguistics (page 17)
• Mobile Carriers (page 18)
• Railroad (page 19)
• Sales - Retail & Wholesale (page 20)
• Water Utilities (page 21)
• Web Search (page 22)
12. Educational Sector Use Cases
1) Student Performance Management & Intervention
– What sequence of topics/subjects is most effective for a specific student?
– Predict student academic and behavior issues using social media, web semantics, grade interpretation, etc.
– Develop student- and class-specific recommendations, such as individual or small group tutoring, supplemental learning materials in “problem” subject areas, or even changes in classes or majors
2) Teaching Effectiveness Analytics
– More data allows for more robust comparison of teachers across years and schools/districts
– Provides a common basis of comparison when trying new curricula
3) Academic Research
– Wide open at all scholastic levels and subjects
13. Health Care Sector Use Cases
1) Predict sickness/disease outbreaks using reliable information on geographical
movement
– Reliable information from patients, doctors submitting reports, mobility records
2) Use pattern matching for predictive health: Correlate patient visits, diagnostics,
and hospital/provider interactions across years of multiple visits
– Find repeatable patterns in patient data & long term illness diagnosis (hypertension, diabetes, cancer,
etc.)
– Predict retreatment risk & address it proactively to avoid readmission within Medicare’s 30-day window
3) Identify best care approach via clinical analysis
– Longitudinal analysis of care across patients and diagnoses
– Cluster analysis around influencers on treatment: physicians, therapists, patient social relationships, mobility, income, etc.
4) Perform fraud analysis and identification via pattern analysis
– Understand relationships among parties (physicians, consumers, organizations), locations, time of
filing, frequency and circumstances
– Detect potentially computer-generated claims; apply graph analysis to cohort networks
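As a small illustration of the readmission use case above, the sketch below flags admissions that fall within 30 days of the same patient's previous discharge. The record layout, the sample data, and the function name `find_readmissions` are all hypothetical; a real predictive model would use labels like these as training targets rather than stop here.

```python
from datetime import date

# Hypothetical visit records: (patient_id, admission_date, discharge_date).
visits = [
    ("p1", date(2015, 1, 5), date(2015, 1, 9)),
    ("p1", date(2015, 1, 30), date(2015, 2, 2)),   # 21 days after discharge
    ("p2", date(2015, 2, 1), date(2015, 2, 3)),
    ("p2", date(2015, 4, 20), date(2015, 4, 22)),  # 76 days after discharge
]


def find_readmissions(records, window_days=30):
    """Return (patient_id, admission_date) for admissions that fall within
    window_days of the same patient's previous discharge."""
    readmissions = []
    last_discharge = {}  # patient_id -> most recent discharge date
    # Sorting by patient and admission date lets us see visits in order.
    for patient, admitted, discharged in sorted(records):
        previous = last_discharge.get(patient)
        if previous is not None and 0 <= (admitted - previous).days <= window_days:
            readmissions.append((patient, admitted))
        last_discharge[patient] = discharged
    return readmissions


print(find_readmissions(visits))
```

With the sample data, only the second visit of patient `p1` is flagged; the second visit of `p2` falls outside the 30-day window.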
14. Human Resources Sector Use Cases
1) Hiring New Employees
– Profile candidates on various data points (e.g., situational/problem solving, social media interactions, etc.) to estimate the probability of how candidates will perform in specific positions, reduce employee turnover, impact employee happiness, etc.
– Minimize risk of negative results and missed budgets by guiding managers to make better, more
informed decisions on which employees to select
2) Labor Force Cost Controls
– Use holistic industry and cross-industry analytics to control labor costs by recommending the right level of labor (e.g., Mgr. vs. VP) and overall scope of position responsibilities
– Use holistic industry and cross-industry analytics to reveal the optimal organizational size and shape
3) Productivity Improvements
– Improve workforce productivity with quick, timely adjustments of labor levels to fluctuating workload volume
15. Language/Linguistics Sector Use Cases
1) Identify Author of Anonymous Text
– Discover who wrote anonymous text, where it comes from, who claims what, who refutes whom, and
how many people claim this and how many people claim that position, etc.
2) Language Translation
– Better & faster real-time translations of words and names, as well as their cultural/societal meanings, between various languages and locations, even slang and dialects (almost 7,000 primary languages and dialects, with another 39k sublanguages & dialects)
– Significant improvements for verbal & textual web search capabilities across the world
3) Computational Linguistics
– Allows for a baseline understanding of an individual’s cultural, social, political, religious, etc., biases based upon the information they read or obtain. Once an individual baseline is established, information from web searches can be summarized and delivered with that individual’s biases in mind, e.g., read all info with a “Republican” bias, etc.
16. Mobile Carrier Sector Use Cases
1) Improved Network Operations
– Better network coverage by understanding customer movement patterns based on weather, social
events, etc.
– Identifying and resolving network bottlenecks in minutes
– Proactively managing customer experience and churn
– Managing and planning for capacity requirements to maintain and improve the quality of service
2) Proactive Call Centers
– Identifying and resolving service issues in minutes
– Proactively managing customer experience and churn
– Maximizing revenue and margins from existing subscriber base
– Decreasing average call handling times and network operating costs
3) Movement Analyses
– What is the movement between geographical markets/stores?
– What % of people visit more than 1 location within a specific timeframe?
– What locations share the same traffic patterns?
– Associate movement w/activities – e.g., purchasing trends, diseases, energy consumption
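One of the movement questions above, the percentage of people who visit more than one location within a timeframe, can be sketched in a few lines. The record layout, sample data, and the function name `share_visiting_multiple` are hypothetical; a carrier would derive such sightings from network data rather than a hand-written list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical location sightings: (subscriber_id, location, timestamp).
sightings = [
    ("a", "store_1", datetime(2015, 5, 5, 9, 0)),
    ("a", "store_2", datetime(2015, 5, 5, 13, 0)),
    ("b", "store_1", datetime(2015, 5, 5, 10, 0)),
    ("c", "store_3", datetime(2015, 5, 5, 11, 0)),
    ("c", "store_3", datetime(2015, 5, 5, 15, 0)),
    ("d", "store_2", datetime(2015, 5, 6, 9, 0)),  # outside the window used below
]


def share_visiting_multiple(records, start, end):
    """Fraction of people seen in [start, end) who visited more than one
    distinct location during that window."""
    locations_by_person = defaultdict(set)
    for person, location, seen_at in records:
        if start <= seen_at < end:
            locations_by_person[person].add(location)
    if not locations_by_person:
        return 0.0
    multi = sum(1 for locs in locations_by_person.values() if len(locs) > 1)
    return multi / len(locations_by_person)


share = share_visiting_multiple(
    sightings, datetime(2015, 5, 5), datetime(2015, 5, 6)
)
print(f"{share:.0%} of people visited more than one location")
```

Using a set per person means repeat visits to the same location (subscriber `c` here) are not counted as multiple locations.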
17. Railroad Sector Use Cases
1) Improve Fuel Efficiency (annual figures: 1.7 trillion ton-miles of freight with 3.6 billion tons of fuel)
– Holistically determine shortest route, congestion, carriage determinations, etc.
– Predict & track conditions based on usage, weather, by train type, etc.
2) Predict train maintenance schedules based on type of train, mileage, parts usage
and shelf-life expectations, etc.
– Use augmented-reality handsets to identify specific engine parts in real time, where they are located, and how best to get them to the nearest local maintenance facility
3) Improve Safety
– Use various sensors, thermometers, and trends to detect problems with railway tracks in order to
predict and prevent potential derailments
18. Retail Sales Sector Use Cases
1) Store Traverse Pattern Analysis
– Use sensors to understand individual shoppers’ movements and purchases throughout the store. How long do they stand in one place vs. another? How many items are reviewed before one is selected (if any at all)?
– Understand shopper demographics, interests and possible desired brands as they walk in the store
and then “help” them make selections via electronic means
2) Voice of Customer
– Use social sentiment analytics (web semantics) to assess overall store preferences, brand status and
the launch of new products, services, or offers
– By combining social and mobile analytics with loyalty information, retailers can create personalized,
more relevant engagements with shoppers
3) Encourage Store Visits
– Using presence and location-based mobility analytics, retailers pinpoint the location of opt-in
shoppers when they are close to a store location
19. Water Utilities Sector Use Cases
1) Better Understanding of Customers Improves Service & Reduces Costs
– Understanding customer profiles, weather patterns, population growth & dynamics, and customer mobility helps to understand and predict usage patterns
– Knowing the size & condition of the distribution network and better understanding individual residential or business consumption patterns make it much easier to customize pricing packages to meet needs
2) Improving Operations
– Pinpointing or forecasting the location of outages for workforce deployment, scanning for potential fraud, theft, or security breaches, and identifying trends or patterns in unbilled accounts all improve net income
3) Predictive Asset Management
– Forecast potential performance or equipment failures in the distribution network, as well as rate the capacity of the network
– Leaks can also help predict potential distribution plant failures.
20. Web Search Sector Use Cases
1) Customer Analysis
– Advertising effectiveness - How many, who are they, and how long do they look at an advertisement?
– Where are they from?
– Apply demographic, lifestyle, interest
– Socioeconomic profiling
– Understand advertising campaign effectiveness
– Identify repeat viewer patterns
2) Economics of Social Networks
• Who are the Influencers in specific social network groups?
• How do Influencers impact purchases by others?
• Understand impacts on churn
• Understand impacts of acquisitions
3) Customer Behavior
– Do various channel solicitations influence viewing, purchasing or no action?
– How often do they shop/stay on your web store?
– What is the “Next Best Offer”?
22. With better big data integration, various industries could have significant savings:
Healthcare – could save $300b per year (this is almost $1k for every man, woman and child)
Mobile Carriers – could save 50% of annual budgets by better understanding and forecasting usage patterns
Decoding the human genome originally took 10 years to complete; now it can be achieved in one week
Wal-Mart has more than 1 million customer transactions every hour
Retail sales revenue would increase by 60% if retailers made optimal use of big data analytics
Over 90% of data was created in the last 2 years
The NSA is thought to analyze 1.6% of all global internet traffic (~ 30 petabytes)
An average mobile phone user “looks” at their phone 45 times per hour. Each transaction generates on average over 1,200 data points, or about 800k data points per user per day
There is more data generated on mobile phones than on desktop and laptop computers combined
One Boeing jet creates 50 terabytes of data per hour (the data can’t be downloaded until the plane lands)
More Big Data “Facts”?
(many attributed to Bernard Marr)
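The mobile phone figures above can be cross-checked with simple arithmetic. This sketch takes the slide's numbers at face value and works out how many hours of phone use per day the daily total implies; the variable names are illustrative, and treating the hourly rate as constant is an assumption.

```python
LOOKS_PER_HOUR = 45        # from the slide
POINTS_PER_LOOK = 1200     # "over 1200 data points" per transaction
CLAIMED_PER_DAY = 800_000  # "about 800k data points" per user per day

points_per_hour = LOOKS_PER_HOUR * POINTS_PER_LOOK
# Hours of phone use per day that the daily figure implies.
implied_hours = CLAIMED_PER_DAY / points_per_hour

print(f"Data points per hour: {points_per_hour:,}")
print(f"Implied active hours per day: {implied_hours:.1f}")
```

The figures are mutually consistent only if the hourly rate applies to roughly 15 hours per day, which is close to a typical person's waking hours.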