The document discusses combining data from Jawbone with public datasets using Spark and Redshift. It describes how to model the problem, find and understand the different data sources, and validate and fuse the data while accounting for confounding variables. It then provides examples of using Redshift and Spark to analyze the combined dataset and identify relationships between factors such as temperature and daily steps. The key message is that data fusion can provide powerful insights, but it requires handling noisy data and understanding the problem domain.
Better Together - Using Spark and Redshift to Combine Your Data with Public Datasets
1. BETTER TOGETHER
USING SPARK AND REDSHIFT TO COMBINE
YOUR DATA WITH PUBLIC DATASETS
EUGENE MANDEL (@EUGMANDEL)
JAWBONE
QCON SF 2014
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-redshift-spark
Presented at QCon San Francisco
www.qconsf.com
8. DATA FUSION IS THE PROCESS OF
INTEGRATION OF MULTIPLE DATA AND
KNOWLEDGE REPRESENTING THE
SAME REAL-WORLD OBJECT INTO A
CONSISTENT, ACCURATE, AND USEFUL
REPRESENTATION.
(WIKIPEDIA)
9. DATA FUSION -
HOW TO FIND THE
ELEPHANT
IMAGE SOURCE: HTTP://COMMONS.WIKIMEDIA.ORG/WIKI/FILE%3ABLIND_MEN_AND_ELEPHANT.PNG
18. DATA GENERATION PROCESS
• NETWORK OF WEATHER STATIONS
• FREQUENCY OF MEASUREMENTS - HOURLY TO DAILY
• COLLABORATION WITH INTERNATIONAL AGENCIES
• AGGREGATION AND QA BY NCDC
25. DOMAIN SPECIFIC
HOW?
WEATHER STATION B
LAT: 39.35
LON: -74.44
TIME: 2014-07-09 13:00:00
AIR TEMP: 60°F
WEATHER STATION A
LAT: 39.36
LON: -74.45
TIME: 2014-07-09 13:04:00
AIR TEMP: 74°F
ELEVATION: 30FT ELEVATION: 120FT
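The two readings on this slide can be compared numerically. The sketch below is not from the talk: it is a minimal illustration, assuming the haversine formula as the distance measure and using hypothetical names (`haversine_km`, `station_a`, `station_b`). It shows that the stations are close in space and time yet disagree on temperature, which is why domain knowledge such as elevation is needed before fusing them. The slide lists elevations of 30 ft and 120 ft, but the transcript does not show which station each belongs to, so elevation is left out of the records.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Values copied from the slide; the dict layout is an assumption.
station_a = {"lat": 39.36, "lon": -74.45,
             "time": datetime(2014, 7, 9, 13, 4), "temp_f": 74}
station_b = {"lat": 39.35, "lon": -74.44,
             "time": datetime(2014, 7, 9, 13, 0), "temp_f": 60}

distance_km = haversine_km(station_a["lat"], station_a["lon"],
                           station_b["lat"], station_b["lon"])
time_gap_s = abs((station_a["time"] - station_b["time"]).total_seconds())
temp_gap_f = abs(station_a["temp_f"] - station_b["temp_f"])

print(f"distance: {distance_km:.2f} km, "
      f"time gap: {time_gap_s:.0f} s, temperature gap: {temp_gap_f} F")
```

A nearest-station join that relies only on latitude, longitude, and time would treat these two readings as interchangeable, although they differ by 14°F.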
26. DO THE DATASETS INTERSECT ENOUGH?
COVERAGE
• PLACES
• TIMES
• USERS
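One way to put a number on the coverage question is to measure what share of your own records has a counterpart in the public dataset. The sketch below is an illustration only, not the speaker's method: the `coverage` helper, the (place, day) key layout, and the sample keys are all hypothetical.

```python
def coverage(own_keys, public_keys):
    """Fraction of distinct keys in own_keys that also appear in public_keys."""
    own = set(own_keys)
    if not own:
        return 0.0
    return len(own & set(public_keys)) / len(own)


# Hypothetical (place, day) keys; real keys could also include a user dimension.
step_keys = [
    ("08401", "2014-07-09"),
    ("08401", "2014-07-10"),
    ("94103", "2014-07-09"),
    ("94103", "2014-07-10"),
]
weather_keys = [
    ("08401", "2014-07-09"),
    ("08401", "2014-07-10"),
    ("94103", "2014-07-09"),
]

share = coverage(step_keys, weather_keys)
print(f"{share:.0%} of step records have a matching weather record")
```

Running the same check separately per place, per time period, and per user would show where the intersection is too thin to support conclusions.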
32. IN-MEMORY DATA PROCESSING FRAMEWORK
• MODELS COMPUTATION AS A GRAPH OF RDDS (RESILIENT DISTRIBUTED DATASETS)
• FUNCTIONAL PROGRAMMING MODEL (SCALA, PYTHON)
• SQL
• CAN READ FROM SAME SOURCES AS HADOOP