SlideShare a Scribd company logo
1 of 6
Download to read offline
Data Science Revealed: A Data-Driven
Glimpse into the Burgeoning New Field
Introduction
As the cost of computing power, data storage, and high-bandwidth Internet access and have plunged
exponentially over the past two decades, companies around the globe recognized the power of
harnessing data as a source of competitive advantage. But it was only recently, as social web
applications and massive, parallel processing have become more widely available that the nescient
field of data science revealed what many are becoming to understand: that data is the new oil,i the
source for corporate energy and differentiation in the 21st century. Companies like Facebook,
LinkedIn, Yahoo, and Google are generating data not only as their primary product, but are analyzing
it to continuously improve their products. Pharmaceutical and biomedical companies are using big
data to find new cures and analyze genetic information, while marketers leverage the same
technology to generate new customer insights. In order to tap this newfound wealth, organizations
of all sizes are turning to practitioners in the new field of data science who are capable of translating
massive data into predictive insights that lead to results.

Data science is an emerging field, with rapid changes, great uncertainty, and exciting opportunities.
Our study attempts the first ever benchmark of the data science community, looking at how they
interact with their data, the tools they use, their education, and how their organizations approach
data-driven problem solving. We also looked at a smaller group of business intelligence
professionals to identify areas of contrast between the emerging role of data scientists and the more
mature field of BI. Our findings, summarized here, show an emerging talent gap between
organizational needs and current industry capabilities exemplified by the unique contributions data
scientists can make to an organization and the broad expectations of data science professionals
generally.



The Emerging Talent Gap
For the past two decades, business and political leaders have sounded a warning call about the
shortage of trained computer scientists, programmers, and engineers necessary to continue to
advance a high tech economy. Less noticed has been the coming gap in analytics professionals
needed to use all of the data created by these high tech systems. In 2009, Google Chief Economist
Hal Varian predicted that the statisticians will be the next sexy jobii, and in our own survey, 83% of
Data Scientists felt that new
technology would increase the                                                                         
demand for data scientists, and 64%
believe that it will outpace the supply
of available talent.                        Decrease  the  
                                           number  of  data                          Increase  the  
                                              scientists                            number  of  data  
This makes intuitive sense for a young       required  by                              scientists  
                                             automating                               required  by  
field. Technological trends tend to         much  of  the                           opening  up  new  
                                           analytical  work                           possibilities  
follow a cycle, where the initial               17%                                      83%  

opportunity leads to ever increasing
demand for a certain set of skills, while later demand wanes as many of those initial skills are
automated by even newer tools. Consider, for instance, the way many data processing and network
management jobs that used to require legions of computer operators are now handled by automated
monitoring tools. Data science is still in its very early phase, with the amount of data exploding and
                                                           the right tools to process them just becoming
   The  best  source  of  new  Data  Science  talent       available.
                            is:  
  Today's  BI  
                        Other  
                         3%  
                                                       Although data science is generating new
 professionals  
     12%  
                                                       opportunities, our capacity to train new data
                                        Students       scientists is not keeping up, and nearly two-
                                        studying  
                                        computer       thirds of respondents foresee a looming
    Professionals                        science  
    in  disciplines                       34%          shortfall in the number of data scientists over
    other  than  IT  
     or  computer              Students                the next five years. This aligns with other
        science                studying  
          27%                fields  other             research, including a recent McKinsey Global
                                 than  
                              computer                 Institute study that predicts a shortage of
                                science  
                                 24%                   190,000 data scientists by the year 2019iii.
                                                       And when our respondents were asked where
                                                       the best source for talent was, few looked to

                                                       Instead, nearly two-
university students.




Who is the Data Scientist?
Although the term data science has been around for decades indeed, most scientists use data of
some form the term data scientist in its current context is relatively new, frequently credited to DJ
Patil, who started the data science team at LinkedIn.iv But as a new term, the field is still very much
                                                                                out what it may mean.
In our survey, we allowed users to self-identify as

conflicts over terminology in job titles. In this       Jim Asplund, Chief Scientist at Gallup Consulting, is a data
                                                     scientist focused on evaluating the role that human perception
by comparing them with the previous big player in     has on everything from disease conditions and GDP to worker
the analytics space, business intelligence            productivity and consumer behavior. He works with massive
professionals.                                         data sets linking perception with actual behavior, and micro
                                                          and macroeconomic outcomes. His work has isolated
Twenty years ago, business intelligence was itself     emotional factors that are most highly related to outcomes
a new term, just emerging to take over the                               organizations care about.
various database management and decisions
support functions within an organization. As the
field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent
expectations for talent, better training, and more rigorous organizational standards. As our data
demonstrates, data scientists are currently going through that transition,
It may be helpful to think of data science and business intelligence as being on two ends of the same
spectrum, with business intelligence focused on managing and reporting existing business data in
order to monitor or manage various concerns within the enterprise. In contrast, data science applies
advanced analytical tools and algorithms to generate predictive insights and new product
innovations that are a direct result of the data.

The need for rigorous scientific training was born out in our research on data scientists, and paints a
clear distinction between data scientists and BI professionals. The most popular undergraduate
degree for BI professionals was in business at 37% - more than the next three categories combined.
In contrast, the most popular degree for data science professionals was computer science (24%),
followed closely by engineering (17%) and the hard sciences (11%). We also found that data science

likely to have a doctoral degree as business intelligence professionals.

The data science toolkit is more varied and more technically sophisticated than the BI toolkit. While
most BI professionals do their analysis and data processing in Excel, data science professionals are
using SQL, advanced statistical packages, and NoSQL databases. Further, although big-data tools
like Hadoop, and advanced visualization tools like Tableau are just starting to emerge in the data
science world, they are almost unseen in the business intelligence world. Open Source tools, like the
R statistics package, Python, and Perl, are each used by one in five data science professionals, but
around one in twenty BI professionals.

       Data Storage                     Data                           Data Analysis                         Data Visualization
                                     Management

 Microsoft SQL              51%                                                                              Microsoft Business                44%
                                                              68%                                  43%       Intelligence Tools
    Server                 46%      Excel                                                                                                       51%
                                                              77%        SAS
                                                                                                 37%           Oracle Business             35%
                          42%                                                                                  Intellience Tools
          IBM                                                                                                                            25%
                         35%
                                                        52%
                                     SQL                                                                  IBM/Cognos Business             32%
                                                  22%                                            37%
                          39%                                          STATA                                Intelligence Tools           24%
        Oracle
                    21%                                                                    27%
                                                                                                          SAP/Business Objects           22%
                                                  22%                                                     Business Intelligence         17%
    Other SQL      15%             Python
    database       13%                      5%
                                                                                             29%         MicroStrategy Business       21%
                                                                        SPSS                                Intelligence Tools      11%
  Other NoSQL     7%                                                                 13%
                                                  20%
   database        13%               Perl                                                                                               18%
                                            7%                                                                         Tableau
                                                                                                                                   4%
                   10%                                                                 19%
      Netezza                                                       Greenplum                                                        15%
                 0%                              16%                                                                  QlikView
                                    BASH                                        7%                                                 4%
                                            2%
                  11%
   Greenplum                                                                                                                         15%
                 4%                                                                                                  Datameer
                                                                                                                                   5%
                                                                                       20%
                                                 14%                       R
                   12%               AWK                                                                                            9%
      Hadoop                                3%                                  5%                                Karmasphere
                 2%                                                                                                                5%




An important facet of data science is the ability to run experiments on data, as evidenced by DJ


           It would have been easy to turn this into a high-ceremony development project that would take thousands of
           hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn's
Data Scientist Profile
         membership. But the process worked quite differently: it started out with a
         relatively small, simple program that looked at members' profiles and made
         recommendations accordingly. Asking things like, did you go to Cornell? Then
         you might like to join the Cornell Alumni group. It then branched out
         incrementally. In addition to looking at profiles, LinkedIn's data scientists
         started looking at events that members attended. Then at books members
         had in their libraries. The result was a valuable data product that analyzed a
         huge database -- but it was never conceived as such. It started small, and
                                                                                            Usama Fayyad
         added value iteratively. It was an agile, flexible process that built toward its
         goal incrementally, rather than tackling a huge mountain of data all at once.      Former Chief Data Officer at Yahoo, and
                                                                                            currently CEO or Chairman at 3 mid-stage
                                  Without free access to the data and the                   start-up companies
 All employees have access        ability to run tests without bureaucratic
 to run experiments on data
      (% strongly agree)          hurdles, LinkedIn would never have been
                                  able to add this critical feature. In the Data            As chief data officer and executive VP,
                                                                                            Research & Strategic Data Solutions
Data Scientists            22%    Science Survey, we found that data
                                                                                            at Yahoo! Inc. Fayyad was responsible
                                  scientists were nearly twice as likely as
                                                                                            for Yahoo!'s overall data strategy,
                                  business intelligence professionals to have               policies, systems, and managing the
    Business                      this freedom, but that most organizations
   Intelligence
                     12%                                                                    Company's data processing/analytics
                                  are falling short on this critical                        infrastructure. He also oversaw Yahoo!
                                  characteristic.                                           Research in building the premier
                                                                                            scientific research organization to
                                                                                            develop the new sciences of the
Broadening the Data Science Community                                                       Internet, on-line marketing, and
                                                                                            innovative interactive applications.

While Data Science is most often associated with Big Data, it is                            Prior to this, Fayyad co-founded and
important to consider the host of other professions and roles that deem                     led the DMX Group, a data mining and
their work to be data science. This includes people from fields as                          data strategy company that was
                                                                                            acquired by Yahoo! in 2004.
diverse as Market Research, Financial Analysis, Information Technology,
Management Consulting, Marketing and Media, Academia, Social
                                                                                            Fayyad earned his Ph.D. in
Research, Demographic and Census Research and the Intelligence
                                                                                            engineering from the University of
Community it is no wonder this segment is difficult to define                               Michigan, Ann Arbor, and also holds
                                                                                            BSE's in both electrical and computer
As part of our study, we asked respondents for their job titles, and in the
                                                                                            engineering.
ranks of many business analysts, consultants, and analytics managers,
we also found graphic designers, physicians, and research scientists,                       Fayyad serves as an ideal example of
including one fisheries biologist. Successful data science often requires                                            taking his
working with teams, including specialists who are able to divide labor                      statistical smarts and turning them
and collaborate on the final outcome.                                                       into viable business entities and an
                                                                                            executive role in a Fortune 500
The absence of clear boundaries defining data science, and the many                         organization. In a recent keynote at
people co-opting the term for their own, is a good thing for the                            KAUST (King Abdulla University of
burgeoning function. It creates more interest in data science, both to                      Science and Technology), Usama
support organizational decision-making, as well as to attract more talent                   spoke to the power of data to help
into the field of data science. Raising the profile of both the function                    define the social web and how
                                                                                            businesses can use this data to their
                                                                                            own advantage and that without the
                                                                                            expertise of data scientists,
                                                                                            businesses will not be able to use the
                                                                                            data at their disposal (particularly big
                                                                                            data).
and the role is critical in organizations making the most out of the explosion of data they have
access to.

                                                             In addition to data science representing a broad sample of the
      How frequently do you partner
     with each role (%very frequently)
                                                             population, we also found that data scientists are not lone actors,
                                                             but work on teams that frequently partner with diverse roles in an
     Graphic Designer                  18%
                                         23%                 organization. On average, a data science professional has one
                                       18%
                   HR                   20%                  additional frequent partner than the average business
                Sales                          30%
                                              27%            intelligence professional, and will partner more frequently with
          Statistician              14%
                                                 32%
                                                 32%
                                                             statisticians, programmers, IT administrators, and most
            Marketing                          29%
    Strategic Planning                       24%             importantly other data scientists to gather, organize, and make
                                                 32%
         Programmer                  15%
                                                  36%
                                                             use of their data.
     IT Administration                 16%
                                                 33%
Business Management                               38%
                                                  35%
        Data Scientist         9%
                                                  38%

                                      Perhaps the greatest emerging
     Business Intelligence        Data Scientist
                                      opportunity in data science is
              the ability to analyze massive data sets generated by
web logs, sensor systems, and transaction data in order to
identify insights and derive new data products. In addition to
asking whether respondents self-identified as data scientists, our
survey also constructed a psychometric screener to identify data
scientists who worked on big data, based on background research

                                                               Based                v

on these five items, we created a subsample that includes only
respondents who answer
side of the spectrum. Four-fifths of big data scientists classifed
themselves as data scientists, and the remaining 20% cosnidered themselves business intelligence
professionals.

Our findings showed that the emerging big data scientist is distinctly different from other data
                                                    professionals. For instance, nearly half of big data
                                                    scientists use R, despite the fact that it is only used
      About  how  much  time  do  you  spend  on  
          the  following  activities  (%  A  lot)   by only 13% of other practitioners. They are also
                                                    twice as likely to use a big data storage tool like
      Acquiring  new  data  sets
                                      27%  
                                            48%  
                                                    Hadoop, Greenplum, or Netezza. Big data scientists
                                             50%    are also remarkably educated 40% have a
               Parsing  data  sets
                                                21%  

                                                                58%  
  Filtering  and  organizing  data
                                                       34%                 doctorate. Over 90% have at least a college
       Mining  data  for  patterns                           52%           education.
                                                 23%  

 Applying  advanced  algorithms
  to  solve  analytical  problems                22%  
                                                          48%              Big data is also even more of a team sport. Half of
                                                               54%  
                                                                           big data scientists partner very frequently with a
    Representing  data  visually
                                                     31%  
                                                                           Data Scientist, Statistician, or Programmer nearly
                                                             50%  
       Telling  a  story  with  data
                                                  27%                      twice the rate of the normal data group. They are
          Interacting  with  data                               58%  
               dynamically                         30%  

     Making  business  decisions                                60%  
          based  on  data                              35%  


                         Big  Data        Normal  Data
also more likely to partner with frequently with business management, but are interestingly no more
likely to partner with IT administration.

Finally, big data scientists touch data in more ways. They are twice as likely as those working with
normal data to work across the data life cycle, everywhere from acquiring new data to business
decision making, and around half spend a lot of time on each of these activities.


Organizational Implications
In order to remain competitive in the world of data science, companies need to create organizational
cultures that are conducive to data-driven decision making. First, they need to expand their view on
the possibilities when hiring data scientists, and look outside business degrees, and even computer
science, to find practitioners with the intellectual curiosity and technical depth to solve big data
problems, with academic concentrations in the hard sciences, statistics, and mathematics. Data
scientists use a variety of tools, but also recognize skill gaps as a barrier to adoption. Rather than
hiring for experience with a certain toolkit, companies should invest in on-the-job training with their
chosen set of emerging technologies.

Once companies have brought in the right talent, they need to create an environment conducive to
effective data science. That means building high-performing, cross-functional teams that include a
variety of roles, including programmers, statisticians, and graphic designers, and aligning them to
directly support interested business decision makers. They should also loosen restrictions on data in
the enterprise, allowing employees to more freely run data-driven experiments. Finally, data
scientists should be given free access to run experiments on data, without bureaucratic obstacles,
so that they can rapidly translate their own intellectual curiosity into business results.


Our Methodology
The EMC Data Science Community Survey interviewed 497 data scientists and business intelligence
professionals from around the world, including deliberate samples in the United States, India, China,
the United Kingdom, Germany, and France. 465 respondents were collected through a partnership
                                                                                                -
screened for information technology decision making authority, and further screened as either data
science professionals or business intelligence professionals. An additional 25 responses came from
participants in the 2011 Data Science Summit, and six through publication by Kaggle, an online
contest community for data scientists. All groups were asked the same questions with the same
screeners.


i http://gerdleonhard.typepad.com/files/wef_ittc_personaldatanewasset_report_2011.pdf
iihttp://www.nytimes.com/2009/08/06/technology/06stats.html
http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_f
or_innovation
iv http://radar.oreilly.com/2011/09/building-data-science-teams.html
v http://radar.oreilly.com/2010/06/what-is-data-science.html

More Related Content

Viewers also liked (16)

Batatgal
BatatgalBatatgal
Batatgal
 
Mon printing press
Mon printing pressMon printing press
Mon printing press
 
Informe consulta devoluciones
Informe consulta devolucionesInforme consulta devoluciones
Informe consulta devoluciones
 
La televisió blai
La televisió blaiLa televisió blai
La televisió blai
 
White Paper: The Essential Guide to Exchange and the Private Cloud
White Paper: The Essential Guide to Exchange and the Private CloudWhite Paper: The Essential Guide to Exchange and the Private Cloud
White Paper: The Essential Guide to Exchange and the Private Cloud
 
Day 7 reconstuction
Day 7 reconstuctionDay 7 reconstuction
Day 7 reconstuction
 
Grasphatch - Social Venture Captital
Grasphatch - Social Venture CaptitalGrasphatch - Social Venture Captital
Grasphatch - Social Venture Captital
 
Canals de tv via satel·lit asma
Canals de tv via satel·lit asmaCanals de tv via satel·lit asma
Canals de tv via satel·lit asma
 
Tatev hashvetvutyun
Tatev hashvetvutyunTatev hashvetvutyun
Tatev hashvetvutyun
 
Conduct monetary policy
Conduct monetary policyConduct monetary policy
Conduct monetary policy
 
Linux kursu-pendik
Linux kursu-pendikLinux kursu-pendik
Linux kursu-pendik
 
Gdp per capita macro
Gdp per capita macroGdp per capita macro
Gdp per capita macro
 
Advertising wed
Advertising wedAdvertising wed
Advertising wed
 
Jose esteves 1
Jose esteves 1Jose esteves 1
Jose esteves 1
 
Acercamiento a las_ciencias_naturales_lepri
Acercamiento a las_ciencias_naturales_lepriAcercamiento a las_ciencias_naturales_lepri
Acercamiento a las_ciencias_naturales_lepri
 
Jose esteves 1
Jose esteves 1Jose esteves 1
Jose esteves 1
 

More from EMC

Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
EMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
EMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

Whitepaper : Data Science Revealed in global study

  • 1. Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field Introduction As the cost of computing power, data storage, and high-bandwidth Internet access and have plunged exponentially over the past two decades, companies around the globe recognized the power of harnessing data as a source of competitive advantage. But it was only recently, as social web applications and massive, parallel processing have become more widely available that the nescient field of data science revealed what many are becoming to understand: that data is the new oil,i the source for corporate energy and differentiation in the 21st century. Companies like Facebook, LinkedIn, Yahoo, and Google are generating data not only as their primary product, but are analyzing it to continuously improve their products. Pharmaceutical and biomedical companies are using big data to find new cures and analyze genetic information, while marketers leverage the same technology to generate new customer insights. In order to tap this newfound wealth, organizations of all sizes are turning to practitioners in the new field of data science who are capable of translating massive data into predictive insights that lead to results. Data science is an emerging field, with rapid changes, great uncertainty, and exciting opportunities. Our study attempts the first ever benchmark of the data science community, looking at how they interact with their data, the tools they use, their education, and how their organizations approach data-driven problem solving. We also looked at a smaller group of business intelligence professionals to identify areas of contrast between the emerging role of data scientists and the more mature field of BI. Our findings, summarized here, show an emerging talent gap between organizational needs and current industry capabilities exemplified by the unique contributions data scientists can make to an organization and the broad expectations of data science professionals generally. The Emerging Talent Gap For the past two decades, business and political leaders have sounded a warning call about the shortage of trained computer scientists, programmers, and engineers necessary to continue to advance a high tech economy. Less noticed has been the coming gap in analytics professionals needed to use all of the data created by these high tech systems. In 2009, Google Chief Economist Hal Varian predicted that the statisticians will be the next sexy jobii, and in our own survey, 83% of Data Scientists felt that new technology would increase the   demand for data scientists, and 64% believe that it will outpace the supply of available talent. Decrease  the   number  of  data   Increase  the   scientists   number  of  data   This makes intuitive sense for a young required  by   scientists   automating   required  by   field. Technological trends tend to much  of  the   opening  up  new   analytical  work   possibilities   follow a cycle, where the initial 17%   83%   opportunity leads to ever increasing
  • 2. demand for a certain set of skills, while later demand wanes as many of those initial skills are automated by even newer tools. Consider, for instance, the way many data processing and network management jobs that used to require legions of computer operators are now handled by automated monitoring tools. Data science is still in its very early phase, with the amount of data exploding and the right tools to process them just becoming The  best  source  of  new  Data  Science  talent   available. is:   Today's  BI   Other   3%   Although data science is generating new professionals   12%   opportunities, our capacity to train new data Students   scientists is not keeping up, and nearly two- studying   computer   thirds of respondents foresee a looming Professionals   science   in  disciplines   34%   shortfall in the number of data scientists over other  than  IT   or  computer   Students   the next five years. This aligns with other science   studying   27%   fields  other   research, including a recent McKinsey Global than   computer   Institute study that predicts a shortage of science   24%   190,000 data scientists by the year 2019iii. And when our respondents were asked where the best source for talent was, few looked to Instead, nearly two- university students. Who is the Data Scientist? Although the term data science has been around for decades indeed, most scientists use data of some form the term data scientist in its current context is relatively new, frequently credited to DJ Patil, who started the data science team at LinkedIn.iv But as a new term, the field is still very much out what it may mean. In our survey, we allowed users to self-identify as conflicts over terminology in job titles. In this Jim Asplund, Chief Scientist at Gallup Consulting, is a data scientist focused on evaluating the role that human perception by comparing them with the previous big player in has on everything from disease conditions and GDP to worker the analytics space, business intelligence productivity and consumer behavior. He works with massive professionals. data sets linking perception with actual behavior, and micro and macroeconomic outcomes. His work has isolated Twenty years ago, business intelligence was itself emotional factors that are most highly related to outcomes a new term, just emerging to take over the organizations care about. various database management and decisions support functions within an organization. As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition,
  • 3. It may be helpful to think of data science and business intelligence as being on two ends of the same spectrum, with business intelligence focused on managing and reporting existing business data in order to monitor or manage various concerns within the enterprise. In contrast, data science applies advanced analytical tools and algorithms to generate predictive insights and new product innovations that are a direct result of the data. The need for rigorous scientific training was born out in our research on data scientists, and paints a clear distinction between data scientists and BI professionals. The most popular undergraduate degree for BI professionals was in business at 37% - more than the next three categories combined. In contrast, the most popular degree for data science professionals was computer science (24%), followed closely by engineering (17%) and the hard sciences (11%). We also found that data science likely to have a doctoral degree as business intelligence professionals. The data science toolkit is more varied and more technically sophisticated than the BI toolkit. While most BI professionals do their analysis and data processing in Excel, data science professionals are using SQL, advanced statistical packages, and NoSQL databases. Further, although big-data tools like Hadoop, and advanced visualization tools like Tableau are just starting to emerge in the data science world, they are almost unseen in the business intelligence world. Open Source tools, like the R statistics package, Python, and Perl, are each used by one in five data science professionals, but around one in twenty BI professionals. Data Storage Data Data Analysis Data Visualization Management Microsoft SQL 51% Microsoft Business 44% 68% 43% Intelligence Tools Server 46% Excel 51% 77% SAS 37% Oracle Business 35% 42% Intellience Tools IBM 25% 35% 52% SQL IBM/Cognos Business 32% 22% 37% 39% STATA Intelligence Tools 24% Oracle 21% 27% SAP/Business Objects 22% 22% Business Intelligence 17% Other SQL 15% Python database 13% 5% 29% MicroStrategy Business 21% SPSS Intelligence Tools 11% Other NoSQL 7% 13% 20% database 13% Perl 18% 7% Tableau 4% 10% 19% Netezza Greenplum 15% 0% 16% QlikView BASH 7% 4% 2% 11% Greenplum 15% 4% Datameer 5% 20% 14% R 12% AWK 9% Hadoop 3% 5% Karmasphere 2% 5% An important facet of data science is the ability to run experiments on data, as evidenced by DJ It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn's
  • 4. Data Scientist Profile membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members' profiles and made recommendations accordingly. Asking things like, did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn's data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database -- but it was never conceived as such. It started small, and Usama Fayyad added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once. Former Chief Data Officer at Yahoo, and currently CEO or Chairman at 3 mid-stage Without free access to the data and the start-up companies All employees have access ability to run tests without bureaucratic to run experiments on data (% strongly agree) hurdles, LinkedIn would never have been able to add this critical feature. In the Data As chief data officer and executive VP, Research & Strategic Data Solutions Data Scientists 22% Science Survey, we found that data at Yahoo! Inc. Fayyad was responsible scientists were nearly twice as likely as for Yahoo!'s overall data strategy, business intelligence professionals to have policies, systems, and managing the Business this freedom, but that most organizations Intelligence 12% Company's data processing/analytics are falling short on this critical infrastructure. He also oversaw Yahoo! characteristic. Research in building the premier scientific research organization to develop the new sciences of the Broadening the Data Science Community Internet, on-line marketing, and innovative interactive applications. While Data Science is most often associated with Big Data, it is Prior to this, Fayyad co-founded and important to consider the host of other professions and roles that deem led the DMX Group, a data mining and their work to be data science. This includes people from fields as data strategy company that was acquired by Yahoo! in 2004. diverse as Market Research, Financial Analysis, Information Technology, Management Consulting, Marketing and Media, Academia, Social Fayyad earned his Ph.D. in Research, Demographic and Census Research and the Intelligence engineering from the University of Community it is no wonder this segment is difficult to define Michigan, Ann Arbor, and also holds BSE's in both electrical and computer As part of our study, we asked respondents for their job titles, and in the engineering. ranks of many business analysts, consultants, and analytics managers, we also found graphic designers, physicians, and research scientists, Fayyad serves as an ideal example of including one fisheries biologist. Successful data science often requires taking his working with teams, including specialists who are able to divide labor statistical smarts and turning them and collaborate on the final outcome. into viable business entities and an executive role in a Fortune 500 The absence of clear boundaries defining data science, and the many organization. In a recent keynote at people co-opting the term for their own, is a good thing for the KAUST (King Abdulla University of burgeoning function. It creates more interest in data science, both to Science and Technology), Usama support organizational decision-making, as well as to attract more talent spoke to the power of data to help into the field of data science. Raising the profile of both the function define the social web and how businesses can use this data to their own advantage and that without the expertise of data scientists, businesses will not be able to use the data at their disposal (particularly big data).
  • 5. and the role is critical in organizations making the most out of the explosion of data they have access to. In addition to data science representing a broad sample of the How frequently do you partner with each role (%very frequently) population, we also found that data scientists are not lone actors, but work on teams that frequently partner with diverse roles in an Graphic Designer 18% 23% organization. On average, a data science professional has one 18% HR 20% additional frequent partner than the average business Sales 30% 27% intelligence professional, and will partner more frequently with Statistician 14% 32% 32% statisticians, programmers, IT administrators, and most Marketing 29% Strategic Planning 24% importantly other data scientists to gather, organize, and make 32% Programmer 15% 36% use of their data. IT Administration 16% 33% Business Management 38% 35% Data Scientist 9% 38% Perhaps the greatest emerging Business Intelligence Data Scientist opportunity in data science is the ability to analyze massive data sets generated by web logs, sensor systems, and transaction data in order to identify insights and derive new data products. In addition to asking whether respondents self-identified as data scientists, our survey also constructed a psychometric screener to identify data scientists who worked on big data, based on background research Based v on these five items, we created a subsample that includes only respondents who answer side of the spectrum. Four-fifths of big data scientists classifed themselves as data scientists, and the remaining 20% cosnidered themselves business intelligence professionals. Our findings showed that the emerging big data scientist is distinctly different from other data professionals. For instance, nearly half of big data scientists use R, despite the fact that it is only used About  how  much  time  do  you  spend  on   the  following  activities  (%  A  lot)   by only 13% of other practitioners. They are also twice as likely to use a big data storage tool like Acquiring  new  data  sets 27%   48%   Hadoop, Greenplum, or Netezza. Big data scientists 50%   are also remarkably educated 40% have a Parsing  data  sets 21%   58%   Filtering  and  organizing  data 34%   doctorate. Over 90% have at least a college Mining  data  for  patterns 52%   education. 23%   Applying  advanced  algorithms to  solve  analytical  problems 22%   48%   Big data is also even more of a team sport. Half of 54%   big data scientists partner very frequently with a Representing  data  visually 31%   Data Scientist, Statistician, or Programmer nearly 50%   Telling  a  story  with  data 27%   twice the rate of the normal data group. They are Interacting  with  data 58%   dynamically 30%   Making  business  decisions 60%   based  on  data 35%   Big  Data Normal  Data
  • 6. also more likely to partner with frequently with business management, but are interestingly no more likely to partner with IT administration. Finally, big data scientists touch data in more ways. They are twice as likely as those working with normal data to work across the data life cycle, everywhere from acquiring new data to business decision making, and around half spend a lot of time on each of these activities. Organizational Implications In order to remain competitive in the world of data science, companies need to create organizational cultures that are conducive to data-driven decision making. First, they need to expand their view on the possibilities when hiring data scientists, and look outside business degrees, and even computer science, to find practitioners with the intellectual curiosity and technical depth to solve big data problems, with academic concentrations in the hard sciences, statistics, and mathematics. Data scientists use a variety of tools, but also recognize skill gaps as a barrier to adoption. Rather than hiring for experience with a certain toolkit, companies should invest in on-the-job training with their chosen set of emerging technologies. Once companies have brought in the right talent, they need to create an environment conducive to effective data science. That means building high-performing, cross-functional teams that include a variety of roles, including programmers, statisticians, and graphic designers, and aligning them to directly support interested business decision makers. They should also loosen restrictions on data in the enterprise, allowing employees to more freely run data-driven experiments. Finally, data scientists should be given free access to run experiments on data, without bureaucratic obstacles, so that they can rapidly translate their own intellectual curiosity into business results. Our Methodology The EMC Data Science Community Survey interviewed 497 data scientists and business intelligence professionals from around the world, including deliberate samples in the United States, India, China, the United Kingdom, Germany, and France. 465 respondents were collected through a partnership - screened for information technology decision making authority, and further screened as either data science professionals or business intelligence professionals. An additional 25 responses came from participants in the 2011 Data Science Summit, and six through publication by Kaggle, an online contest community for data scientists. All groups were asked the same questions with the same screeners. i http://gerdleonhard.typepad.com/files/wef_ittc_personaldatanewasset_report_2011.pdf iihttp://www.nytimes.com/2009/08/06/technology/06stats.html http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_f or_innovation iv http://radar.oreilly.com/2011/09/building-data-science-teams.html v http://radar.oreilly.com/2010/06/what-is-data-science.html