White paper
The Eight-Fold Path of Data Science
Disruptive Data Science Series
by Kaushik Kunal Das

Introduction

I recently spent an evening in the San Francisco Bay Area with the emerging community of data scientists. Our group included students and academics, practitioners with varying levels of experience, and even a smattering of venture capitalists and angel investors, meeting to discuss practical applications of data science with D J Patil, one of the gurus of this new community.
After years of practitioners, academics, and industries applying mathematical models to solve practical problems, big data technologies have created an opportunity for scientists and engineers from many different backgrounds to come together and consider how we create meaning from vast amounts of data.
The Bay Area is a hotbed for data science, but its relevance and impact are global. If the Industrial Revolution strengthened the muscular and skeletal systems of the global economy, the Internet of Things is ready to do the same for the economy's brain and nervous system. Many smart devices already exist, such as smart energy meters and sensors on car and plane engines. The challenge comes in connecting these devices and the data they produce to accelerate insights and action. We see examples of this in the applications data scientists are already building, such as efforts to make oil drilling platforms smarter so they can catch issues early on and activate appropriate shut-off procedures, preventing explosions such as the one that caused the Gulf of Mexico oil spill.
The basic methodology of analytics, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM), remains unchanged. What data scientists have done is build upon that foundation to take into account the increasing complexity of our problems and the capabilities of our tools. Annika Jimenez, who leads the Data Science team here at Pivotal, has talked about eight steps of value creation from data in her Disruptive Data Science white paper. In this paper, I am going to zero in on the data science practices that are part of this process of value creation. The key to a successful data science project is following an eightfold path, consisting of four phases and four differentiating factors:

Phase 1: Problem Formulation – Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders.
Phase 2: Data Step – Build the right feature set, making full use of the volume, variety and velocity of all available data.
Phase 3: Modeling Step – Move from answering what, where and when to answering why and what if.
Phase 4: Application – Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things.
Technology Selection – Select the right platform and the right set of tools for solving the problem at hand.
Creativity – Take the opportunity to innovate at every phase.
Iterative Approach – Perform each phase in an agile manner and iterate as required.
Building a Narrative – Create a fact-based narrative that clearly communicates insights to stakeholders.
Phase 1: Problem Formulation – Are you solving the right problem?
We could simply improve existing analytics processes with these new technologies, but the big opportunities will be found by harnessing the new data and capabilities at our disposal to formulate new problems. An example from the data side would be improving a consumer churn propensity model in the telecom sector by adding social graph information from Call Data Records. Such models have found that users who are close friends with someone who has switched carriers are more likely to also switch.
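To make that idea concrete, here is a minimal sketch of how such a social graph feature might be derived; the toy data and column names (caller, callee, churned) are illustrative assumptions, not details from the engagement described above.

```python
# Hypothetical sketch: deriving a "churned contacts" feature from call
# records. Column names and toy data are illustrative assumptions.
import pandas as pd

# Toy call detail records: one row per call between two subscribers.
calls = pd.DataFrame({
    "caller": ["a", "a", "b", "c", "c"],
    "callee": ["b", "c", "c", "d", "a"],
})

# Known churn status per subscriber.
churned = {"a": 0, "b": 1, "c": 0, "d": 1}

# Treat the call graph as undirected: each call links two subscribers.
edges = pd.concat([
    calls.rename(columns={"caller": "user", "callee": "contact"}),
    calls.rename(columns={"callee": "user", "caller": "contact"}),
])[["user", "contact"]].drop_duplicates()

# Feature: how many of a subscriber's contacts have already churned.
edges["contact_churned"] = edges["contact"].map(churned).fillna(0)
churned_contacts = edges.groupby("user")["contact_churned"].sum()
print(churned_contacts)  # candidate input to a churn propensity model
```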
An example of the new capabilities can be found in the exciting world of the Internet of Things, where we can help manufacturers understand how well their products are performing and, more importantly, when and why they are not. In both cases, it is very important to include all stakeholders while formulating the problem: not just the IT department that will maintain the technology platform, but also the business users who will actually use the results of the process. Domain knowledge is crucial when formulating a problem with an impactful solution.
Phase 2: Data Step – Do you have the right feature set?
This is the step that usually takes the most time. It encompasses a number of key questions: What data sources are available to us inside and outside the organization? What variables will we use for our analysis? The more data we have, the better off we are. Equally important is that we incorporate domain knowledge in the features that we build. In this area, data science has made a significant contribution by exploiting big data technologies' ability to refine data to increasing levels of detail and to combine different types of structured, unstructured and semi-structured datasets.
This phase produces a set of features that are used in the subsequent analysis. For instance, incorporating the Call Data Records of a phone company in a regression model for predicting churn propensity involves creating thousands of variable candidates, called features. These features are based on factors such as whether customers use text messages or phone calls, the number of outgoing or incoming calls, and the time periods of call activity. In this step, we decide whether to combine fields and determine the appropriate level of aggregation for each feature. High levels of aggregation can mask signals, but data at too low a level may not be statistically significant. This process of defining features determines the boundaries of our solution.
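As a hedged illustration of that aggregation trade-off, the following sketch builds the same toy call records into features at two levels of detail; the column names and values are assumptions for demonstration only.

```python
# Hypothetical sketch: churn-model features at two aggregation levels.
# Column names and toy values are illustrative assumptions.
import pandas as pd

cdr = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "kind": ["call", "sms", "call", "sms", "sms"],
    "hour": [9, 22, 23, 10, 11],
    "duration_sec": [60, 0, 300, 0, 0],
})

# Highly aggregated features: one row per user.
per_user = cdr.groupby("user").agg(
    n_events=("kind", "size"),
    total_talk_time=("duration_sec", "sum"),
    sms_share=("kind", lambda k: (k == "sms").mean()),
    night_share=("hour", lambda h: (h >= 22).mean()),
)
print(per_user)

# Finer-grained features: per user per time-of-day bucket. Lower aggregation
# can expose signals (e.g. late-night callers), but very sparse cells may
# not be statistically significant.
cdr["daypart"] = pd.cut(cdr["hour"], bins=[0, 6, 12, 18, 24],
                        labels=["night", "morning", "afternoon", "evening"])
by_daypart = cdr.pivot_table(index="user", columns="daypart",
                             values="duration_sec", aggfunc="sum",
                             fill_value=0, observed=False)
print(by_daypart)
```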
Phase 3: Modeling Step – Deploy the right algorithms to uncover causal links.
The modeling step is where we identify patterns among our features. We might need to explore and transform our feature space using Principal Component Analysis (PCA), Fourier and wavelet transforms, and other such techniques. If we have enough data to find strong correlations that could indicate causal relationships, we can use various powerful regression techniques like that old warhorse logistic regression, or newer ones like Support Vector Machines. In cases where we cannot prove strong correlations, we cluster large datasets to find
hidden patterns. In many cases, we use optimizations or solve complicated inverse problems, like creating images of the earth using data from earthquakes or artificial seismic surveys.
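To illustrate the transform-and-explore pattern named above, here is a minimal sketch that moves synthetic sensor traces into the frequency domain with a Fourier transform and then applies PCA; the shapes and random data are assumptions, not project data.

```python
# Minimal sketch of feature-space exploration: a Fourier transform to move
# sensor time series into the frequency domain, then PCA to find the
# directions of greatest variation. Data and shapes are illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_sensors, n_samples = 20, 256
signals = rng.normal(size=(n_sensors, n_samples))  # stand-in sensor traces

# Frequency-domain features: magnitude of the real FFT per sensor.
spectra = np.abs(np.fft.rfft(signals, axis=1))

# Project the spectra onto their first few principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(spectra)
print(components.shape)              # (20, 3)
print(pca.explained_variance_ratio_)  # how much variation each axis captures
```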
Fortunately, there is a large and ever-growing set of techniques and associated algorithms available today. If we go back to the example of the churn model, we find that logistic regression is used very widely today, as it is very easy to explain. Logistic regression does not necessarily provide the best estimate, however. In this case, we improved the explanatory power of the model by using a generalization of logistic regression called Generalized Additive Models, which combines the variable transformation and regression steps and fits them together.
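The following is a hedged sketch of that combine-and-fit idea using scikit-learn, which is my own assumption rather than the tooling used in the engagement: spline transforms of each feature are fit jointly with a logistic regression, approximating a GAM.

```python
# GAM-like sketch (an approximation, not the engagement's actual tooling):
# smooth spline transforms of each feature are fit jointly with logistic
# regression, combining transformation and regression in one pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))  # stand-in call-behavior features
# Toy churn target with a nonlinear dependence on the first feature.
y = (np.sin(X[:, 0]) + 0.1 * rng.normal(size=500) > 0).astype(int)

gam_like = make_pipeline(
    SplineTransformer(n_knots=6, degree=3),  # per-feature smooth basis
    LogisticRegression(max_iter=1000),       # linear fit on the spline basis
)
gam_like.fit(X, y)
print(gam_like.score(X, y))  # in-sample accuracy of the combined fit
```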
Phase 4: Application – This is where we finally solve the problem.
The potential applications of these insights are numerous: they might inform a decision support tool, or a control system that acts based on the patterns we have uncovered. In many cases, the insights serve multiple applications. For example, when leveraging the Internet of Things to make an oil drilling platform smarter so that it can detect signs of catastrophic failure and take corrective action, we need to build a dashboard for human operators along with connections to the drill controls.
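Schematically, that dual use might look like the sketch below; the sensor names, threshold, and handler functions are invented for illustration and are not the platform's actual interfaces.

```python
# Illustrative sketch only: one anomaly signal feeding both a human-facing
# dashboard and an automated control hook. Sensor names, thresholds, and
# handlers are invented for illustration.
def check_reading(sensor, value, threshold=0.9):
    """Return an alert record if a reading crosses its threshold."""
    if value > threshold:
        return {"sensor": sensor, "value": value, "action": "shut_off"}
    return None

def update_dashboard(alert):
    print(f"DASHBOARD: {alert['sensor']} at {alert['value']:.2f}")

def trigger_controls(alert):
    print(f"CONTROL: initiating {alert['action']} procedure")

for sensor, value in [("annulus_pressure", 0.95), ("mud_flow", 0.40)]:
    alert = check_reading(sensor, value)
    if alert:
        update_dashboard(alert)   # keep human operators informed
        trigger_controls(alert)   # and act on the drill controls directly
```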
If you step back and look at these four phases, they mimic the process of human thought: starting with the formulation of a question, determining the ontology, creating a cognitive model, and applying the conclusions. Professor George Lakoff of UC Berkeley has a very interesting theory about this phenomenon. Of course, a mathematical model is a special type of cognitive model, in that it is defined rigorously using the language of math. It is important to be careful when applying any mathematical model and to verify that the results of the model correspond to reality.
The four differentiating factors are principles that we need to keep in mind as we go through the aforementioned four-phase process. They are:
1) Technology selection
Numerous very powerful technologies exist for tackling various types of problems, and any single data science project might require us to use several of them. For example, you might want to use the Python library NLTK or GPText for text analysis, the fft function in the R stats package for doing a frequency analysis of the word count, and the LDA function in MADlib for topic modeling. It is crucial that we select an open and flexible platform that allows us to leverage all the technologies we need without having to move the data. The Pivotal platform is built with that in mind: we keep the data and the compute together in parallel, while minimizing the movement of data.
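As a minimal sketch of the first two of those steps in a single Python environment, the following counts words with NLTK and runs a frequency analysis of a daily word count with an FFT; numpy's FFT stands in here for R's fft, and the text and counts are fabricated for illustration.

```python
# Sketch of text analysis plus frequency analysis in one place. numpy's FFT
# stands in for R's fft; the sample text and daily counts are fabricated.
import numpy as np
import nltk

text = "The drill sensor reported pressure. Pressure rose again overnight."
tokens = [w.strip(".,").lower() for w in text.split()]
print(nltk.FreqDist(tokens).most_common(3))  # top word counts

# Daily counts of a term over four weeks (fabricated), with a weekly cycle.
days = np.arange(28)
counts = 10 + 3 * np.sin(2 * np.pi * days / 7)
spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
dominant = np.argmax(spectrum[1:]) + 1          # skip the DC component
print("dominant period (days):", len(days) / dominant)  # -> 7.0
```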
2) Creativity
There is ample room for creativity in all of these steps. Indeed, this may be the most exciting part of the job. While designing a project, look for opportunities to be creative and do something that hasn't been done before. For example, we use many signal processing techniques on the time series data produced by sensors connected to the Internet of Things, which makes problems that are often considered difficult much easier to deal with.
3) Iterative approach
We also keep our projects iterative, with a meeting at the completion of every step described above, where we show our results and solicit feedback from the stakeholders, which we then incorporate. The German philosopher Edmund Husserl pointed out that we communicate using concepts based on a shared reality he called the "Lebenswelt," or lifeworld. A common problem for data science projects is that we are creating a new application, one which is not a shared reality since it does not exist yet, and which is therefore difficult to convey in words. It makes sense to create a prototype as soon as possible and demo it to the stakeholders. This elicits very productive feedback and prevents us from spending time on things that would be difficult to adopt.
The four-step process described here is rarely linear. We often need more than one iteration to reach the most impactful solution. For example, we created a marketing mix model for an engagement in which we started modeling at the level of store groups. After the first iteration, we realized that the television effects were not strong enough at that level, and that those variables needed to be aggregated at the national level. (This would be described as "pooling" in the context of a Bayesian hierarchical model.) Making that iterative change made the models much more accurate.
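A minimal sketch of that re-aggregation step might look as follows; the panel data and column names are invented for illustration.

```python
# Illustrative sketch of the re-aggregation step: television ad variables
# that were too weak at the store-group level are pooled into a single
# national series before refitting. Data and column names are invented.
import pandas as pd

panel = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "store_group": ["east", "west", "east", "west"],
    "tv_grps": [12.0, 8.0, 15.0, 9.0],  # TV gross rating points
    "sales": [100, 90, 120, 95],
})

# Pool the TV variable across store groups to the national level per week,
# then attach it back to every store-group row for the next model iteration.
national_tv = (panel.groupby("week", as_index=False)["tv_grps"].sum()
                    .rename(columns={"tv_grps": "tv_grps_national"}))
panel = panel.merge(national_tv, on="week")
print(panel)
```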
4) Building a narrative
Finally, the human element remains important in the process of building a story, a narrative that makes sense of all these steps. Whether you are applying data science internally to your organization, or you are implementing a data science project for a customer, it is very important that you are able to explain what you have done and how it is helpful.

For instance, take the smart drilling platform project. In this case, we have a compelling goal: preventing explosions such as the one that caused the Gulf of Mexico oil spill.