Executives are still waiting on our “Big Data Deep Insights”. Many of us are down the path of collecting, extracting, and analyzing our ever-growing data in Hadoop environments. We are building our data science expertise and expanding data governance. Yet we are still not getting what we are waiting for. This talk is about:
1. Getting to the right questions
2. Setting expectations with the executive team
3. The unintentional consequence of suddenly having lots of data
4. Framing the boundaries of our data science
5. Pragmatic data governance
6. Looking outside your data to 3rd party data
8. Questions - Approaches
• Understand which manual process you want to automate:
identify what is currently predicted manually that could be
automated, and determine whether there is any way to get training
data consisting of <input, output> pairs.
• Consider methods to augment existing data with a “pivot”
column that can be used to join. For example, geo-location
of an IP address could lead to joining with Census Data
based on zip+4.
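The “pivot” column idea above can be sketched as a small join. This is only an illustration under stated assumptions: the IP-to-zip+4 lookup, the census table, and every value in them are made up, and the function name augment_with_census is hypothetical. In practice the zip+4 would come from an IP geolocation service and the demographics from real Census tables.

```python
# Hypothetical sample data; none of these values come from a real source.
web_events = [
    {"ip": "203.0.113.7", "page": "/pricing"},
    {"ip": "198.51.100.23", "page": "/signup"},
    {"ip": "192.0.2.44", "page": "/home"},
]

# Assumed lookup: IP address -> zip+4 (the derived "pivot" column).
ip_to_zip4 = {
    "203.0.113.7": "95110-1234",
    "198.51.100.23": "30303-5678",
}

# Assumed third-party table keyed by zip+4 (made-up numbers).
census_by_zip4 = {
    "95110-1234": {"median_income": 98000},
    "30303-5678": {"median_income": 61000},
}


def augment_with_census(events, ip_lookup, census):
    """Left-join events to census rows through the derived zip+4 column."""
    joined = []
    for event in events:
        row = dict(event)
        zip4 = ip_lookup.get(event["ip"])  # derive the pivot column
        row["zip4"] = zip4
        # Keep unmatched events so coverage gaps stay visible.
        row.update(census.get(zip4, {"median_income": None}))
        joined.append(row)
    return joined


augmented = augment_with_census(web_events, ip_to_zip4, census_by_zip4)
```

Keeping unmatched rows (a left join) is a deliberate choice here: it shows how much of your own data the third-party source actually covers before any analysis depends on it.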
9. Questions - Approaches
• Determine if your problem is one of prediction or one of
grouping (clustering). The latter is more of a task that can
lead to better understanding rather than solving a direct
business problem.
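To make the distinction concrete, here is a minimal sketch of both problem types on one numeric column. It assumes made-up monthly spend values and plan labels; predict_nearest and cluster_1d are hypothetical helper names, and nearest-neighbor lookup and k-means are stand-ins chosen for brevity, not methods named in this talk. Prediction needs labeled <input, output> pairs, while grouping needs only the inputs.

```python
def predict_nearest(training_pairs, x):
    """Prediction: return the output of the closest labeled input."""
    _, nearest_output = min(training_pairs, key=lambda pair: abs(pair[0] - x))
    return nearest_output


def cluster_1d(values, centers, iterations=10):
    """Grouping: plain k-means on one numeric column, no labels needed."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to its group's mean; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical labeled pairs: (monthly spend, plan).
labeled = [(20, "basic"), (25, "basic"), (200, "premium"), (220, "premium")]
prediction = predict_nearest(labeled, 190)

# Hypothetical unlabeled spend values and assumed starting centers.
spend = [20, 25, 30, 200, 220, 210]
centers, groups = cluster_1d(spend, centers=[0, 100])
```

The prediction call answers a direct question about one new customer. The clustering call only reports that the data falls into two groups, and it is up to the analyst to decide what those groups mean for the business.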
10. Questions - Approaches
• Determine if you are more interested in finding “interesting”
relationships among data columns than in understanding the
columns themselves. This is a task I’d call “discovery” rather
than prediction, but the idea is still to express one column as
the output column in terms of the other columns as input.
• Doing this for all output columns can lead to “discovery”
of those correlations that are the strongest (e.g., every
time a customer buys beer at 5PM, he is likely to buy
diapers). This is more of a fishing expedition, but can
lead to unusual insights.
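A minimal sketch of this kind of fishing expedition, assuming a tiny made-up table of purchase flags. It simplifies the idea to one input column per output column and scores each pair by confidence and lift, which are standard association measures but not necessarily what was used in the work described here. The function name rank_associations and all data are hypothetical.

```python
from itertools import permutations

# Hypothetical transactions: each row flags which items were bought.
transactions = [
    {"beer": 1, "diapers": 1, "milk": 0},
    {"beer": 1, "diapers": 1, "milk": 1},
    {"beer": 0, "diapers": 0, "milk": 1},
    {"beer": 1, "diapers": 0, "milk": 0},
    {"beer": 0, "diapers": 1, "milk": 1},
    {"beer": 0, "diapers": 0, "milk": 1},
]


def rank_associations(rows):
    """Score every (input -> output) column pair by lift, strongest first."""
    n = len(rows)
    columns = list(rows[0])
    results = []
    for inp, out in permutations(columns, 2):
        inp_count = sum(r[inp] for r in rows)
        out_count = sum(r[out] for r in rows)
        both = sum(1 for r in rows if r[inp] and r[out])
        if inp_count == 0 or out_count == 0:
            continue  # lift is undefined for columns that never fire
        confidence = both / inp_count        # estimate of P(out | inp)
        lift = confidence / (out_count / n)  # above 1 means positive association
        results.append((inp, out, confidence, lift))
    return sorted(results, key=lambda r: r[3], reverse=True)


ranked = rank_associations(transactions)
for inp, out, confidence, lift in ranked:
    print(f"{inp} -> {out}: confidence={confidence:.2f}, lift={lift:.2f}")
```

With many columns this loop tests a large number of pairs, so some strong-looking associations will appear by chance. Treat the top of the ranking as hypotheses to verify, not as findings.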
15. About Impetus
» Accelerated consulting and services leader for Big Data;
Headquartered in San Jose since 1996; 1400+; Presences
in Silicon Valley, Atlanta, NYC; offices in India; Expertise
through Architects
» Pioneers in distributed software engineering with vertical
and functional expertise; Dedicated innovation labs; 200+
Big Data practitioners; 80+ dedicated to R&D
16. Data Science Approach
Drill
* Incoming Question
* Problem Landscape
* Underlying Constraints
* Specific Goals
Assess
* Goal Driven Hypotheses
* Data Requirement
* Resource Requirements
* Analysis Plan
Target
* Data Collection
* Quality Assessment
* Cross Validation
* Restructuring
Analyze
* Test Previous Hypotheses
* Explore New Hypotheses
* Test
* Quantify Results
Recommend
* Summary of Results
* Key Novel Insights
* Impact Analysis
* Action Items
17. Data Science Focus Areas
» Recommender Systems
» Sentiment Analysis
» Topic Identification
» Predictive Analytics
» Data Stream Analytics
Contact us at bigdata@impetus.com
Sometimes clustering could be enough to solve a business problem
We must understand the columns well before understanding the relationships
Data science results lead to better database marketing: churn analytics, upselling, cross-selling, RFM/LTV.
These are some of the areas where we’ve used data science and machine learning to come up with some interesting models.