White paper
The Eight-Fold Path of Data Science
Disruptive Data Science Series
goPivotal.com
by Kaushik Kunal Das
Introduction
I recently spent an evening in the San Francisco Bay Area with
the emerging community of data scientists. Our group included
students and academics, practitioners with varying levels of
experience, and even a smattering of venture capitalists and
angel investors, meeting to discuss practical applications of
data science with DJ Patil, one of the gurus of this new
community.
After years of practitioners, academics, and industries applying
mathematical models to solve practical problems, big data
technologies have created an opportunity for scientists and
engineers from many different backgrounds to come together
and consider how we create meaning from vast amounts of data.
The Bay Area is a hotbed for data science, but its relevance
and impact are global. If the Industrial Revolution strengthened
the muscular and skeletal systems of the global economy, the
Internet of Things is ready to do the same to the economy’s
brain and nervous system. Many smart devices already exist,
such as smart energy meters and sensors on car and plane
engines. The challenge comes in connecting these devices and
the data they produce to accelerate insights and action. We see
examples of this in the applications data scientists are already
building, such as efforts to make oil drilling platforms smarter
so they can catch issues early on and activate appropriate
shut-off procedures, preventing explosions such as the one that
caused the Gulf of Mexico oil spill.
The basic methodology of analytics, such as the Cross Industry
Standard Process for Data Mining (CRISP-DM), remains unchanged.
What data scientists have done is build upon that foundation to
take into account the increasing complexity of our problems and
capabilities of our tools. Annika Jimenez, who leads the Data
Science team here at Pivotal, has talked about eight steps of value
creation from data in her Disruptive Data Science white paper.
In this paper, I am going to zero in on the data science practices
that are part of this process of value creation. The key to a
successful data science project is following an eightfold path,
consisting of four phases and four differentiating factors.
Phase 1: Problem Formulation – Are you solving the
right problem?
We could simply improve existing analytics processes with
these new technologies, but the big opportunities will be found
by harnessing the new data and capabilities at our disposal
to formulate new problems. An example from the data side
would be improving a consumer churn propensity model in
the telecom sector by adding social graph information from
Call Data Records. Such models have found that users who are
close friends with someone who has switched carriers are more
likely to also switch.
An example of the new capabilities can be found in the
exciting world of the Internet of Things, where we can help
manufacturers understand how well their products are
performing and, more importantly, when and why they are not.
In both cases, it is very important to include all stakeholders
while formulating the problem: not just the IT department
that will maintain the technology platform, but also the
business users who will actually use the results of the process.
Domain knowledge is crucial when formulating a problem with
an impactful solution.
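The social-graph signal mentioned under Phase 1 (customers whose close contacts have switched carriers are more likely to switch) could be turned into a model input along the following lines. This is only a minimal sketch: the record layout, the function name `churned_contact_fraction`, and the sample data are assumptions made for illustration, not the schema or method used in the project described here.

```python
from collections import defaultdict


def churned_contact_fraction(call_records, churned_users):
    """For each user, return the fraction of distinct contacts who have churned.

    call_records is an iterable of (caller, callee) pairs; this layout is
    an assumption for illustration, not a real Call Data Record schema.
    """
    contacts = defaultdict(set)
    for caller, callee in call_records:
        # Treat a call in either direction as a social tie.
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {
        user: sum(1 for c in friends if c in churned_users) / len(friends)
        for user, friends in contacts.items()
    }


# Hypothetical records: (caller, callee) pairs.
records = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("dan", "ann")]
features = churned_contact_fraction(records, churned_users={"cat"})
```

A value like `features["bob"]` could then be added as one more candidate variable in a churn propensity model; a real project would probably weight ties by call frequency or duration rather than treating all contacts equally.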
Phase 2: Data Step – Do you have the right feature-set?
This is the step that usually takes the most time. It
encompasses a number of key questions: What are the
data sources that are available to us inside and outside the
organization? What are the variables that we will use for our
analysis? The more data we have, the better off we are. Equally
important is that we incorporate domain knowledge in the
features that we build. In this area, data science has made a
significant contribution by exploiting the ability of big data
technologies to refine data to increasing levels of detail and to
combine different types of structured, unstructured and
semi-structured datasets.
This phase produces a set of features that are used in the
subsequent analysis. For instance, incorporating Call Data
Records of a phone company in a regression model for
predicting churn propensity involves creating thousands of
variable candidates, called features. These features are based
on factors such as whether customers use text messages or
phone calls, the number of outgoing or incoming calls, and the
time periods of call activity. In this step, we decide whether to
combine fields and determine the level of aggregation that each
feature needs. High levels of aggregation can mask
signals, but data at too low a level may not be statistically
significant. This process of defining features determines the
boundaries of our solution.
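As a rough illustration of the feature-building step described above, the sketch below aggregates raw call records into one row of candidate features per customer. The field names (`src`, `dst`, `kind`, `time`), the night-time bucket, and the sample records are all hypothetical; a real Call Data Record feed would have a different schema and far more candidate features.

```python
from collections import defaultdict
from datetime import datetime


def build_features(records):
    """Aggregate raw call data records into one feature row per customer."""
    rows = defaultdict(lambda: {"calls_out": 0, "calls_in": 0,
                                "texts_out": 0, "texts_in": 0,
                                "night_events": 0})
    for rec in records:
        hour = datetime.fromisoformat(rec["time"]).hour
        kind = "texts" if rec["kind"] == "text" else "calls"
        # Each record contributes to both the sender's and the receiver's row.
        for customer, direction in ((rec["src"], "out"), (rec["dst"], "in")):
            row = rows[customer]
            row[f"{kind}_{direction}"] += 1
            # Time-of-day bucket: one possible level of aggregation.
            if hour >= 22 or hour < 6:
                row["night_events"] += 1
    return dict(rows)


# Hypothetical records, for illustration only.
cdrs = [
    {"src": "ann", "dst": "bob", "kind": "call", "time": "2013-09-01T09:15:00"},
    {"src": "ann", "dst": "cat", "kind": "text", "time": "2013-09-01T23:40:00"},
    {"src": "bob", "dst": "ann", "kind": "call", "time": "2013-09-02T02:05:00"},
]
features = build_features(cdrs)
```

Changing the bucket boundaries, or splitting counts by day or week, is where the aggregation trade-off shows up: coarser buckets may hide a signal, while finer ones leave too few events per bucket.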
Phase 3: Modeling Step – Deploy the right algorithms to
uncover causal links.
The modeling step is where we identify patterns among our
features. We might need to explore and transform our feature
space using Principal Component Analysis (PCA), Fourier and
wavelet transforms, and other such techniques. If we have
enough data to find strong correlations that could indicate
causal relationships, we can use various powerful regression
techniques like that old warhorse, logistic regression, or newer
ones like Support Vector Machines. In cases where we cannot
prove strong correlations, we cluster large datasets to find
hidden patterns. In many cases we use optimization or solve
complicated inverse problems, like creating images of the earth
using data from earthquakes or artificial seismic surveys.
Figure: The eight-fold path of data science
Phase 1: Problem Formulation. Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders.
Phase 2: Data Step. Build the right feature set, making full use of the volume, variety and velocity of all available data.
Phase 3: Modeling Step. This is where you move from answering what, where and when to answering why and what if.
Phase 4: Application. Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things.
Technology Selection. Select the right platform and the right set of tools for solving the problem at hand.
Creativity. Take the opportunity to innovate at every phase.
Iterative Approach. Perform each phase in an agile manner and iterate as required.
Building a Narrative. Create a fact-based narrative that clearly communicates insights to stakeholders.
Fortunately, there is a large and ever-growing set of techniques
and associated algorithms available today. If we go back to the
example of the churn model, we find that logistic regression
is used very widely today, as it is very easy to explain. However,
logistic regression does not necessarily provide the best estimate. In
this case, we improved the explanatory power of the model
by using a generalized version of logistic regression called
Generalized Additive Models that combined the variable
transformation and regression steps and fit them together.
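To make the modeling step concrete, here is a minimal sketch of plain logistic regression fitted by batch gradient descent. It illustrates only the baseline technique named above, not the Generalized Additive Model the project used. The single feature (the fraction of a customer's contacts who churned), the toy data, and the learning rate and epoch count are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """Probability of the positive class; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        # Step against the average gradient over the batch.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


# Hypothetical toy data: fraction of churned contacts vs. churn label.
X = [[0.0], [0.1], [0.2], [0.6], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
```

The appeal mentioned in the text is visible here: the fitted coefficient `w[1]` has a direct reading as the change in log-odds of churn per unit of the feature, which is easy to explain to stakeholders even when a more flexible model would fit better.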
Phase 4: Application – This is where we finally solve
the problem.
The potential applications of these insights are numerous: they
might inform a decision support tool, or a control system that
acts based on the patterns we have uncovered. In many cases,
the insights serve multiple applications. For example, when
leveraging the Internet of Things to make an oil drilling platform
smarter so that it can detect signs of catastrophic failure and
take corrective action, we need to build a dashboard for human
operators along with connections to the drill controls.
If you step back and look at these four phases, they mimic the
process of human thought, starting with the formulation of a
question, determining the ontology, creating a cognitive model,
and applying the conclusions. Professor George Lakoff of UC
Berkeley has a very interesting theory about this phenomenon.
Of course, a mathematical model is a special type of cognitive
model, in that it is defined rigorously using the language
of math. It’s important to be careful when applying any
mathematical model, and verify that the results of the models
correspond to reality.
The four differentiating factors are principles that we need to
keep in mind as we go through the aforementioned four-phase
process. They are:
1) Technology selection
There are numerous very powerful technologies that exist for
tackling various types of problems. Any single data science project
might require us to use several such technologies. For example,
you might want to use the Python library NLTK or GPText for
text analysis, the fft function in the R stats package for a
frequency analysis of the word counts, and the LDA function in MADlib
for topic modeling. It is crucial that we select an open and flexible
platform that allows us to leverage all the technologies we need,
without having to move the data. The Pivotal platform is built
with that in mind. We keep the data and compute in parallel, while
minimizing the movement of data.
2) Creativity
There is ample room for creativity in all of these steps. Indeed,
this may be the most exciting part of the job. While designing a
project, look for opportunities to be creative and to do something
that hasn’t been done before. For example, we use many signal
processing techniques on time series data that is produced by
sensors connected to the Internet of Things. This makes it much
easier to deal with problems which are often considered difficult.
3) Iterative approach
We also keep our projects iterative, with a meeting at the
completion of every step described above. At these meetings we show
our results and solicit feedback from the stakeholders, which we
then incorporate. The German philosopher Edmund Husserl pointed
out that we communicate using concepts based on a shared
reality he called “Lebenswelt,” or lifeworld. A common problem
for data science projects is that we are creating a new application,
one which is not a shared reality since it does not exist yet, which
is therefore difficult to convey in words. It makes sense to create
a prototype as soon as possible and demo it to the stakeholders.
This elicits very productive feedback, and prevents us from
spending time on things that would be difficult to adopt.
The four-step process described here is rarely linear. We often
need more than one iteration to reach the most impactful
solution. For example, we created a marketing mix model for an
engagement, in which we started modeling at the level of store
groups. After the first iteration, we realized that the television
effects were not strong enough at that level, and that those
variables needed to be aggregated at the national level. (This
would be described as “pooling” in the context of a Bayesian
hierarchical model). Making that iterative change made the
models much more accurate.
4) Building a narrative
Finally, the human element remains important in the process
of building a story, a narrative that makes sense of all these
steps. Whether you are applying data science internally to your
organization, or you are implementing a data science project for
a customer, it is very important that you are able to explain what
you have done and how it is helpful.
GoPivotal, Pivotal, and the Pivotal logo are registered trademarks or trademarks of GoPivotal, Inc. in the United States and other countries. All other trademarks used herein are the property of their respective owners.
© Copyright 2013 Go Pivotal, Inc. All rights reserved. Published in the USA. PVTL-WP-210-09/13
For instance, take the smart drilling platform project. In this case,
we have a compelling goal: preventing explosions such as the
one that caused the Gulf of Mexico oil spill. We must consider:
what are the steps that lead us to that goal? We start with sensor
data from the drill bit, data from analysis of the drill mud that is
coming out, data from the drill control system, and data from
other sensors and instruments on the platform. From there, we
get a set of features from the data and apply advanced signal
processing techniques to get rid of the noise. Perhaps we ignore
the high frequency part of the signal, treat that as noise, and use
a filtering function in frequency space to get rid of the noise.
Running an FFT in parallel is very fast on large datasets if you use
the right technology. Now we can build regression models on
our dataset, using data from several wells in one oilfield to predict
the rate of penetration of the drill. This model can be used for
prediction in real-time, and deviations from this prediction can
be treated as anomalies. From this model, we can attach the
right action (e.g. raising an alarm, stopping the drill, etc.) to each
anomaly using labeled data.
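Two steps of this narrative can be sketched in a few lines: filtering in frequency space, and flagging deviations from a prediction. The code below uses a naive O(n²) discrete Fourier transform from the standard library as a stand-in for a parallel FFT, and it takes the predicted values as given rather than building the regression model. The cutoff `keep`, the threshold, and the synthetic signal are assumptions for illustration only.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (a stand-in for a real FFT)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def low_pass(signal, keep):
    """Zero every frequency bin above `keep` (and its mirror image), then invert."""
    n = len(signal)
    spectrum = dft(signal)
    filtered = [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]
    return inverse_dft(filtered)


def flag_anomalies(observed, predicted, threshold):
    """Return the indices where the observation deviates from the prediction."""
    return [i for i, (o, p) in enumerate(zip(observed, predicted))
            if abs(o - p) > threshold]


# Synthetic sensor trace: a slow component plus high-frequency "noise".
n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [s + 0.3 * math.sin(2 * math.pi * 20 * t / n) for t, s in enumerate(slow)]
smooth = low_pass(noisy, keep=5)

anomalies = flag_anomalies([1.0, 1.1, 5.0], [1.0, 1.0, 1.0], threshold=0.5)
```

In this synthetic case the filter recovers the slow component almost exactly because both components fall on exact frequency bins; real sensor data would not separate this cleanly, and the choice of cutoff becomes a modeling decision. Attaching an action to each flagged index would be the next step, using labeled data as described above.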
By articulating this sort of story to stakeholders, you can
communicate the various goals, steps, and the value of developing
these models. This is a crucial step in communicating what you are
doing and why it is useful, and in getting stakeholders’ feedback. This
process also helps explain results, facilitates discussions about
improvements, and prompts next steps.
This is a very exciting time to be involved in data science. What I
have outlined here is an emerging process for what is an emerging
field. As data scientists work on an increasing variety of problems
and innovate in small garages and large companies, this framework
will evolve, and become all the more significant, sophisticated, and
meaningful over time.
About Kaushik Kunal Das
Kaushik is a Senior Principal Data Scientist with Pivotal. His job is
to formulate data science problems and solve them using the
Pivotal Big Data Platform. He leads a team of highly accomplished
data scientists working in energy, telecommunications, retail and
digital media. Kaushik has an engineering background focused
on solving mathematical problems requiring large datasets. He
studied engineering at the Institute of Technology of the Banaras
Hindu University and the University of California at Berkeley. He
is interested in questions such as: How well can a company know
its customers and customize its actions in a context-sensitive
fashion? How can our living and working environments get
smarter, and how can we get there?
Learn More
To learn more about our products, services and solutions, visit us
at goPivotal.com.

Más contenido relacionado

Destacado (14)

Mon contrast comm with today
Mon contrast comm with todayMon contrast comm with today
Mon contrast comm with today
 
Windows Server 2012 Disk Dedupe
Windows Server 2012 Disk DedupeWindows Server 2012 Disk Dedupe
Windows Server 2012 Disk Dedupe
 
Clientes
ClientesClientes
Clientes
 
мультимедийные технологии
мультимедийные технологиимультимедийные технологии
мультимедийные технологии
 
San valentino
San valentinoSan valentino
San valentino
 
La televisió blai
La televisió blaiLa televisió blai
La televisió blai
 
Insaat kursu-sancaktepe
Insaat kursu-sancaktepeInsaat kursu-sancaktepe
Insaat kursu-sancaktepe
 
Animal power pont
Animal power pontAnimal power pont
Animal power pont
 
Thur change to s or d
Thur change to s or dThur change to s or d
Thur change to s or d
 
Wed quiz and communism
Wed quiz and communismWed quiz and communism
Wed quiz and communism
 
семинар использование икт
семинар использование иктсеминар использование икт
семинар использование икт
 
Tues islam hajj
Tues islam hajjTues islam hajj
Tues islam hajj
 
Sean melton
Sean meltonSean melton
Sean melton
 
Gestão de stocks lingua inglesa 1
Gestão de stocks lingua inglesa 1Gestão de stocks lingua inglesa 1
Gestão de stocks lingua inglesa 1
 

Más de EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

Más de EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Último

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Último (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts

Disruptive Data Science Series: The Eight-Fold Path of Data Science

  • 1. White paper: The Eight-Fold Path of Data Science. Disruptive Data Science Series. By Kaushik Kunal Das. goPivotal.com
Introduction
I recently spent an evening in the San Francisco Bay Area with the emerging community of data scientists. Our group included students and academics, practitioners with varying levels of experience, and even a smattering of venture capitalists and angel investors, meeting to discuss practical applications of data science with D J Patil, one of the gurus of this new community.
After years of practitioners, academics, and industries applying mathematical models to solve practical problems, big data technologies have created an opportunity for scientists and engineers from many different backgrounds to come together and consider how we create meaning from vast amounts of data. The Bay Area is a hotbed for data science, but its relevance and impact are global. If the Industrial Revolution strengthened the muscular and skeletal systems of the global economy, the Internet of Things is ready to do the same for the economy’s brain and nervous system. Many smart devices already exist, such as smart energy meters and sensors on car and plane engines. The challenge comes in connecting these devices and the data they produce to accelerate insights and action. We see examples of this in the applications data scientists are already building, such as efforts to make oil drilling platforms smarter so they can catch issues early on and activate appropriate shut-off procedures, preventing explosions such as the one that caused the Gulf of Mexico oil spill.
The basic methodology of analytics, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), remains unchanged. What data scientists have done is build upon that foundation to take into account the increasing complexity of our problems and the growing capabilities of our tools. Annika Jimenez, who leads the Data Science team here at Pivotal, has talked about eight steps of value creation from data in her Disruptive Data Science white paper. In this paper, I am going to zero in on the data science practices that are part of this process of value creation. The key to a successful data science project is following an eightfold path, consisting of four phases and four differentiating factors.
Phase 1: Problem Formulation – Are you solving the right problem?
We could simply improve existing analytics processes with these new technologies, but the big opportunities will be found by harnessing the new data and capabilities at our disposal to formulate new problems. An example from the data side would be improving a consumer churn propensity model in the telecom sector by adding social graph information from Call Data Records. Such models have found that users who are close friends with someone who has switched carriers are more likely to also switch. An example of the new capabilities can be found in the exciting world of the Internet of Things, where we can help manufacturers understand how well their products are performing and, more importantly, when and why they are not. In both cases, it is very important to include all stakeholders while formulating the problem: not just the IT department that will maintain the technology platform, but also the business users who will actually use the results of the process. Domain knowledge is crucial when formulating a problem with an impactful solution.
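To make the telecom churn example from Phase 1 concrete, here is a minimal sketch of one possible social-graph feature: the share of a customer's contacts who have already switched carriers. The record layout (caller and callee pairs), the function name `churned_contact_share`, and the choice to treat every contact as a "friend" are assumptions made for illustration; the paper does not describe how the original feature was built.

```python
from collections import defaultdict

def churned_contact_share(call_records, churned):
    """For each customer, return the share of distinct contacts who churned.

    call_records: iterable of (caller, callee) pairs (hypothetical CDR layout).
    churned: set of customer ids known to have switched carriers.
    """
    contacts = defaultdict(set)
    for caller, callee in call_records:
        if caller == callee:
            continue  # ignore self-calls
        # Treat the call graph as undirected: a call links both parties.
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {
        customer: len(peers & churned) / len(peers)
        for customer, peers in contacts.items()
    }

# Tiny made-up call log: "cat" has already switched carriers.
calls = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("dan", "ann")]
shares = churned_contact_share(calls, churned={"cat"})
```

In a real project one would probably weight contacts by call frequency or duration to approximate "close friends"; the unweighted share is only the simplest starting point.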
  • 2. Phase 2: Data Step – Do you have the right feature-set?
This is the step that usually takes the most time. It encompasses a number of key questions: What are the data sources that are available to us inside and outside the organization? What are the variables that we will use for our analysis? The more data we have, the better off we are. Equally important is that we incorporate domain knowledge in the features that we build. In this area, data science has made a significant contribution by exploiting the ability of big data technologies to refine data to increasing levels of detail and to combine different types of structured, unstructured and semi-structured datasets. This phase produces a set of features that are used in the subsequent analysis.
For instance, incorporating the Call Data Records of a phone company in a regression model for predicting churn propensity involves creating thousands of candidate variables, called features. These features are based on factors such as whether customers use text messages or phone calls, the number of outgoing or incoming calls, and the time periods of call activity. In this step, we decide whether to combine fields and determine the level of aggregation that each feature needs to be at. High levels of aggregation can mask signals, but data at too low a level may not yield statistically significant signals. This process of defining features determines the boundaries of our solution.
Phase 3: Modeling Step – Deploy the right algorithms to uncover causal links.
The modeling step is where we identify patterns among our features. We might need to explore and transform our feature space using Principal Component Analysis (PCA), Fourier and wavelet transforms, and other such techniques.
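The Phase 2 feature-building step for Call Data Records can be sketched as a per-customer aggregation. Everything specific here is an assumption for illustration: the event keys (`customer`, `kind`, `direction`, `hour`), the function name `build_features`, the four output features, and the definition of "night" as 22:00 to 05:59. A production feature set would contain far more candidates than this.

```python
from collections import defaultdict

def build_features(events):
    """Aggregate raw CDR events into one feature row per customer.

    Each event is a dict with hypothetical keys:
    customer, kind ("call" or "text"), direction ("in" or "out"), hour (0-23).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for e in events:
        row = counts[e["customer"]]
        row["n_events"] += 1
        row[f"n_{e['kind']}_{e['direction']}"] += 1
        # Assumed definition of "night": 22:00 to 05:59.
        if e["hour"] >= 22 or e["hour"] < 6:
            row["n_night"] += 1

    features = {}
    for customer, row in counts.items():
        n = row["n_events"]
        texts = row["n_text_in"] + row["n_text_out"]
        features[customer] = {
            "calls_out": row["n_call_out"],
            "calls_in": row["n_call_in"],
            "text_share": texts / n,          # texts versus all activity
            "night_share": row["n_night"] / n,  # time period of activity
        }
    return features

events = [
    {"customer": "ann", "kind": "call", "direction": "out", "hour": 9},
    {"customer": "ann", "kind": "text", "direction": "out", "hour": 23},
    {"customer": "ann", "kind": "text", "direction": "in", "hour": 2},
    {"customer": "ann", "kind": "call", "direction": "in", "hour": 14},
    {"customer": "bob", "kind": "call", "direction": "out", "hour": 20},
]
features = build_features(events)
```

Aggregating per customer over the whole log is one choice of level; the same loop could key on (customer, week) instead, which is the kind of aggregation decision the text describes.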
If we have enough data to find strong correlations that could indicate causal relationships, we can use various powerful regression techniques, such as that old warhorse logistic regression, or newer ones like Support Vector Machines. In cases where we cannot prove strong correlations, we cluster large datasets to find hidden patterns.
Figure: The eight-fold path of data science.
Phase 1, Problem Formulation: Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders.
Phase 2, Data Step: Build the right feature set, making full use of the volume, variety and velocity of all available data.
Phase 3, Modeling Step: This is where you move from answering what, where and when to answering why and what if.
Phase 4, Application: Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things.
Technology Selection: Select the right platform and the right set of tools for solving the problem at hand.
Creativity: Take the opportunity to innovate at every phase.
Iterative Approach: Perform each phase in an agile manner and iterate as required.
Building a Narrative: Create a fact-based narrative that clearly communicates insights to stakeholders.
  • 3. In many cases we use optimizations or solve complicated inverse problems, like creating images of the earth using data from earthquakes or artificial seismic surveys. Fortunately, there is a large and ever-growing set of techniques and associated algorithms available today.
If we go back to the example of the churn model, we find that logistic regression is used very widely today, as it is very easy to explain. Logistic regression does not necessarily provide the best estimate, however. In this case, we improved the explanatory power of the model by using a generalized version of logistic regression called Generalized Additive Models, which combine the variable transformation and regression steps and fit them together.
Phase 4: Application – This is where we finally solve the problem.
The potential applications of these insights are numerous: they might inform a decision support tool, or a control system that acts based on the patterns we have uncovered. In many cases, the insights serve multiple applications. For example, when leveraging the Internet of Things to make an oil drilling platform smarter so that it can detect signs of catastrophic failure and take corrective action, we need to build a dashboard for human operators along with connections to the drill controls.
If you step back and look at these four phases, they mimic the process of human thought, starting with the formulation of a question, determining the ontology, creating a cognitive model, and applying the conclusions. Professor George Lakoff of UC Berkeley has a very interesting theory about this phenomenon. Of course, a mathematical model is a special type of cognitive model, in that it is defined rigorously using the language of math. It’s important to be careful when applying any mathematical model, and to verify that the results of the model correspond to reality.
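As a hedged illustration of the churn model discussed in Phase 3, and of the advice to check a model against reality, the sketch below fits a plain logistic regression by batch gradient descent and scores it on rows it never saw during fitting. It is not the Generalized Additive Model the paper mentions, and the data is synthetic: the two standardized features, the coefficient 3, the learning rate, and the 300/200 split are all assumptions chosen only to keep the example small.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, xi):
    """Predicted churn probability for one feature row; w[0] is the bias."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit weights by batch gradient descent on the average log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Synthetic, already-standardized data (illustration only):
# feature 0 drives churn, feature 1 is pure noise.
rng = random.Random(42)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(500)]
labels = [1 if rng.random() < sigmoid(3 * x[0]) else 0 for x in rows]

# Hold out the last 200 rows so the check uses data the model never saw.
train_X, test_X = rows[:300], rows[300:]
train_y, test_y = labels[:300], labels[300:]

w = fit_logistic(train_X, train_y)
hits = sum((predict_proba(w, x) >= 0.5) == (y == 1) for x, y in zip(test_X, test_y))
accuracy = hits / len(test_y)
```

The fitted weights are easy to explain, which is the property the text credits for logistic regression's popularity: a larger weight on a feature means a stronger association with churn, and the noise feature should end up with a weight near zero.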
The four differentiating factors are principles that we need to keep in mind as we go through the four-phase process described above. They are:
1) Technology selection
There are numerous very powerful technologies for tackling various types of problems. Any single data science project might require us to use several of them. For example, you might want to use the Python library NLTK or GPText for text analysis, the fft function in the R stats package for a frequency analysis of the word count, and the LDA function in MADlib for topic modeling. It is crucial that we select an open and flexible platform that allows us to leverage all the technologies we need without having to move the data. The Pivotal platform is built with that in mind. We keep the data and compute in parallel, while minimizing the movement of data.
2) Creativity
There is ample room for creativity in all of these steps. Indeed, this may be the most exciting part of the job. While designing a project, look for opportunities to be creative and to do something that hasn’t been done before. For example, we use many signal processing techniques on the time series data produced by sensors connected to the Internet of Things. This makes it much easier to deal with problems that are often considered difficult.
3) Iterative approach
We also keep our projects iterative, with a meeting at the completion of every step described above. There we show our results and solicit feedback from the stakeholders, which we then incorporate. The German philosopher Edmund Husserl pointed out that we communicate using concepts based on a shared reality he called “lebenswelt,” or lifeworld. A common problem for data science projects is that we are creating a new application, one that is not yet part of any shared reality because it does not exist, and that is therefore difficult to convey in words. It makes sense to create a prototype as soon as possible and demo it to the stakeholders.
This elicits very productive feedback and prevents us from spending time on things that would be difficult to adopt.
The four-step process described here is rarely linear. We often need more than one iteration to reach the most impactful solution. For example, we created a marketing mix model for an engagement, in which we started modeling at the level of store groups. After the first iteration, we realized that the television effects were not strong enough at that level, and that those variables needed to be aggregated at the national level. (This would be described as “pooling” in the context of a Bayesian hierarchical model.) Making that iterative change made the models much more accurate.
4) Building a narrative
Finally, the human element remains important in the process of building a story, a narrative that makes sense of all these steps. Whether you are applying data science internally to your organization or implementing a data science project for a customer, it is very important that you are able to explain what you have done and how it is helpful. For instance, take the smart drilling platform project. In this case, we have a compelling goal: preventing explosions such as the one that caused the Gulf of Mexico oil spill.
  • 4. GoPivotal, Pivotal, and the Pivotal logo are registered trademarks or trademarks of GoPivotal, Inc. in the United States and other countries. All other trademarks used herein are the property of their respective owners. © Copyright 2013 GoPivotal, Inc. All rights reserved. Published in the USA. PVTL-WP-210-09/13
Pivotal, committed to open source and open standards, is a leading provider of application and data infrastructure software, agile development services, and data science consulting. Pivotal’s revolutionary Enterprise PaaS product, powered by Cloud Foundry, will be available in Q4 2013. Pivotal CF is a complete, next generation Enterprise Platform-as-a-Service that makes it possible, for the first time, for the employees of the enterprise to rapidly create modern applications. To create powerful experiences that serve their customers in the context of who they are, where they are, and what they are doing in the moment. To store, manage and deliver value from fast, massive data sets. To build, deploy and scale at an unprecedented pace. Uniting selected technology, people and programs from EMC and VMware, the following products and services are now part of Pivotal: Greenplum®, Cloud Foundry, Spring, Cetas, Pivotal Labs®, GemFire® and other products from the VMware vFabric™ Suite.
Pivotal, 1900 S Norfolk Street, San Mateo, CA 94403. goPivotal.com
We must consider: what are the steps that lead us to that goal? We start with sensor data from the drill bit, data from analysis of the drill mud that is coming out, data from the drill control system, and data from other sensors and instruments on the platform. From there, we get a set of features from the data and apply advanced signal processing techniques to get rid of the noise.
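One way to get rid of noise in a sensor trace is a low-pass filter applied in frequency space. The sketch below is only an illustration under stated assumptions: the signal is synthetic (a slow 2-cycle trend plus 20-cycle noise), the cutoff of 5 cycles is arbitrary, the names `dft`, `inverse_dft` and `low_pass` are made up for this example, and the transform is a naive O(n²) DFT written with the standard library. An FFT computes the same spectrum much faster and is what one would use on large datasets.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def inverse_dft(X):
    """Naive inverse transform, matching dft() above."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def low_pass(signal, cutoff):
    """Zero every frequency bin above `cutoff` cycles per window, then invert."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k carry the same frequency (positive and negative).
        if min(k, n - k) > cutoff:
            spectrum[k] = 0
    return [value.real for value in inverse_dft(spectrum)]

n = 64
trend = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]          # slow trend
noise = [0.3 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]   # fast noise
noisy = [a + b for a, b in zip(trend, noise)]
cleaned = low_pass(noisy, cutoff=5)
```

Because both components here fall exactly on frequency bins, the filter recovers the trend almost perfectly; real drilling data would leak across bins, so the choice of cutoff becomes a judgment call that needs domain knowledge.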
Perhaps we ignore the high-frequency part of the signal, treating it as noise, and use a filtering function in frequency space to remove it. Running an FFT in parallel is very fast on large datasets if you use the right technology. Now we can build regression models on our dataset, using data from several wells in one oilfield to predict the rate of penetration of the drill. This model can be used for prediction in real time, and deviations from this prediction can be treated as anomalies. From this model, we can attach the right action (e.g. raising an alarm or stopping the drill) to each anomaly using labeled data.
By articulating this sort of story to stakeholders, you can communicate the various goals, the steps, and the value of developing these models. This is a crucial step in communicating what you are doing and why it is useful, and in getting stakeholders’ feedback. This process also helps explain results, facilitates discussions about improvements, and prompts next steps.
This is a very exciting time to be involved in data science. What I have outlined here is an emerging process for an emerging field. As data scientists work on an increasing variety of problems and innovate in small garages and large companies, this framework will evolve and become all the more significant, sophisticated, and meaningful over time.
About Kaushik Kunal Das
Kaushik is Senior Principal Data Scientist with Pivotal. His job is to formulate data science problems and solve them using the Pivotal Big Data Platform. He leads a team of highly accomplished data scientists working in energy, telecommunications, retail and digital media. Kaushik has an engineering background focused on solving mathematical problems requiring large datasets. He studied engineering at the Institute of Technology of the Banaras Hindu University and the University of California at Berkeley. He is interested in questions like: How much can a company know about its customers, and how can it customize its actions in a context-sensitive fashion? How can our living and working environments get smarter, and how can we get there?
Learn More
To learn more about our products, services and solutions, visit us at goPivotal.com.