3. Business Requirements
• Data scientists need to work with business people and domain experts to understand both the data and the business
• Specify the business requirements
• For instance, healthcare data
4. Understand the data: e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
Understand the business: Goal: Predict Readmission Rate
Database: Healthcare: Readmissions Database
Modeling
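A minimal sketch of why a field such as DISCWT matters when the goal is a readmission rate: each sampled discharge stands in for several discharges nationally, so totals and rates are computed with the weights rather than by counting rows. The records and the `readmitted` flag below are made-up illustrations, not actual HCUP data or column names.

```python
# Hypothetical sample of discharge records. DISCWT is the discharge-level
# weight; "readmitted" is an illustrative 0/1 flag, not a real HCUP column.
records = [
    {"DISCWT": 2.0, "readmitted": 1},
    {"DISCWT": 1.5, "readmitted": 0},
    {"DISCWT": 2.5, "readmitted": 0},
    {"DISCWT": 2.0, "readmitted": 1},
]


def weighted_total(rows):
    """Estimated national number of discharges: the sum of the weights."""
    return sum(r["DISCWT"] for r in rows)


def weighted_readmission_rate(rows):
    """Weighted share of discharges that were followed by a readmission."""
    total = weighted_total(rows)
    if total == 0:
        raise ValueError("no weighted discharges to estimate from")
    readmitted = sum(r["DISCWT"] * r["readmitted"] for r in rows)
    return readmitted / total


national_discharges = weighted_total(records)
rate = weighted_readmission_rate(records)
print(national_discharges, rate)
```

An unweighted count would treat the four rows as four discharges; the weighted version treats them as the larger population they represent.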
5. Data Collection
• Data from product line
• Purchase third-party data
• Social media (Facebook, LinkedIn)
• Web crawling
• Open source data (Opendata, U.S. Census Data)
Challenge: data storage, data management
6. Data sources: Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data, Product Line
Formats: XML, CSV, LOG, SQL, …
Business Intelligence, Data Science, App
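As a rough illustration of handling several of these formats at once, the sketch below reads CSV, XML, and web-log text into one common list of records using only the Python standard library. The field names (`customer_id`, `amount`), the sample strings, and the log layout are assumptions made up for this example.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical inputs, one per format; real sources would be files or feeds.
csv_text = "customer_id,amount\n1,9.99\n2,20.00\n"
xml_text = "<orders><order customer_id='3' amount='5.50'/></orders>"
log_text = "2024-01-01 GET /buy?customer_id=4&amount=7.25\n"


def from_csv(text):
    """Each CSV row becomes a dict keyed by the header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_xml(text):
    """Each <order> element's attributes become a dict."""
    return [dict(el.attrib) for el in ET.fromstring(text).iter("order")]


# Assumed log layout: the two fields appear as query-string parameters.
LOG_PATTERN = re.compile(r"customer_id=(\d+)&amount=([\d.]+)")


def from_log(text):
    """Pull the two fields out of each matching log line."""
    return [
        {"customer_id": m.group(1), "amount": m.group(2)}
        for m in LOG_PATTERN.finditer(text)
    ]


records = from_csv(csv_text) + from_xml(xml_text) + from_log(log_text)
print(records)
```

All values are still strings at this point; converting types and reconciling conflicts belongs to the data preparation step.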
7. Data Preparation (Data Wrangling)
• Cleaning data (semantic errors, missing entries, or inconsistent formatting)
• Challenge: data integration
• Takes about 80% of the time in the project workflow
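The three kinds of cleaning named above can be sketched in a few lines. This is only an illustration under assumed rules: the rows are invented, the accepted date formats are a guess at what a real source might contain, and the 0 to 120 age range is an arbitrary plausibility check.

```python
from datetime import datetime

# Hypothetical raw rows showing the three problems from the slide.
raw_rows = [
    {"name": "  Alice ", "age": "34", "visit": "2020-01-15"},
    {"name": "BOB", "age": "", "visit": "15/02/2020"},      # missing entry
    {"name": "carol", "age": "-5", "visit": "2020-03-01"},  # semantic error
]

# Assumed set of date layouts seen in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Normalize inconsistent date formatting to ISO 8601, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    name = row["name"].strip().title()  # fix inconsistent formatting
    age = int(row["age"]) if row["age"].strip() else None  # missing entry
    if age is not None and not 0 <= age <= 120:
        age = None  # semantic error: implausible value treated as missing
    return {"name": name, "age": age, "visit": parse_date(row["visit"])}


cleaned = [clean_row(r) for r in raw_rows]
print(cleaned)
```

Whether an implausible value should be dropped, flagged, or corrected is a business decision; the sketch simply marks it as missing.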
Data Source A, Data Source B, Data Source B → ETL → Data Warehouse
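A toy version of the ETL flow, assuming two sources that describe the same thing with different column names and types. The source layouts, the `sales` table, and the use of an in-memory SQLite database as a stand-in for the warehouse are all assumptions for illustration.

```python
import sqlite3

# Hypothetical sources with mismatched schemas (the integration challenge).
source_a = [{"cust_id": 1, "total": "9.99"}, {"cust_id": 2, "total": "20.00"}]
source_b = [{"CustomerID": "3", "Amount": 5.5}]


def extract():
    """Yield (source name, raw row) pairs from every source."""
    yield from (("A", row) for row in source_a)
    yield from (("B", row) for row in source_b)


def transform(source, row):
    """Map each source's columns and types onto one common schema."""
    if source == "A":
        return (int(row["cust_id"]), float(row["total"]), source)
    return (int(row["CustomerID"]), float(row["Amount"]), source)


def load(conn, rows):
    """Write the unified rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, amount REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, [transform(source, row) for source, row in extract()])
row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

Keeping the `source` column makes it possible to trace each warehouse row back to where it came from.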