This document provides an overview of data science innovations and the Hadoop ecosystem. It discusses data science workflows and discovery, as well as Hadoop and Spark. Specific innovations are highlighted, such as using sensor data from trucks to forecast GDP and analyzing social media and IoT data. Apache Spark is also introduced as a framework for big data analytics. The document aims to outline the current state of data science and provide a roadmap for further innovation using big data technologies.
1. Data Science Innovations:
Roadmap to Hadoop Ecosystem & Spark
Suresh.sood@uts.edu.au
linkedin.com/in/sureshsood
@soody
http://www.slideshare.net/ssood/datainnovation
February 4, 2015
2. Topic Areas for Discussion
1. Statistics/Data mining or Data Science?
2. What is big data and the challenge today ?
3. Data types
4. Data Science workflows & discovery
5. Hadoop
6. Data Science innovation
7. New Sources of Information (Big data) Data Driven Innovations
8. Internet of Things
9. Data Science Innovations
10. Apache Spark
3. Statistics, Data Mining or Data Science ?
• Statistics
– precise deterministic causal analysis over precisely collected data
• Data Mining
– deterministic causal analysis over re-purposed data carefully sampled
• Data Science
– trending/correlation analysis over existing data using bulk of
population i.e. big data
Adapted from:
NIST Big Data taxonomy draft report (see http://bigdatawg.nist.gov/show_InputDoc.php)
4. What is Big Data?
Unknown relationships
Unstructured data
95% of data not collected
Social-Psychological-local-Mobile-GPS-M2M
Beyond transactions, including interactions and observations
5. Big Data Challenge Today : Moving from
Transactions Alone to Relationships and Empathy
Current State
= Transactions $$$
We do this stuff well e.g.
Collect payments …
Future State
= Human Empathy (relationships)
We don't really do this, e.g. user-generated
content, ratings, reviews, 1:1 dialogue,
distress signals, geolocation
6. Data Types
• Astronomical
• Documents
• Earthquake
• Email
• Environmental sensors
• Fingerprints
• Health (personal)
• Images
• Graph data (social network)
• Location
• Marine
• Particle accelerator
• Satellite
• Scanned survey data
• Sound
• Text
• Transactions
• Video
9. Hadoop Configurations (Single and Multi-Rack)
Adapted from: http://stackiq.com/
Cluster manager e.g. Apache Ambari, Apache Mesos, or Rocks
A configuration of 18 data nodes with 3 TB drives
represents 648 TB of raw storage.
With the HDFS standard replication factor of 3,
this gives 216 TB of usable storage.
Name/secondary/data nodes – 6 core 96 GB
Management node – 4 core 16 GB
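The storage figures on this slide can be checked with a short calculation. This is only a sizing sketch: the function name is hypothetical, and the value of 12 drives per data node is an assumption inferred from 648 TB / (18 nodes x 3 TB), since the slide does not state it.

```python
def hdfs_capacity_tb(data_nodes, drives_per_node, drive_tb, replication=3):
    """Return (raw_tb, usable_tb) for a simple HDFS sizing estimate."""
    raw_tb = data_nodes * drives_per_node * drive_tb
    # Each block is stored `replication` times, so usable space shrinks accordingly.
    usable_tb = raw_tb / replication
    return raw_tb, usable_tb


# Assumption: 12 drives per node, inferred from the slide's 648 TB total.
raw, usable = hdfs_capacity_tb(data_nodes=18, drives_per_node=12, drive_tb=3)
print(f"raw = {raw} TB, usable = {usable:.0f} TB")
```

In practice some space is also reserved for the operating system and temporary data, so real usable capacity would be somewhat lower than this estimate.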
10. Data Science Innovation
Data science innovation is something an
organization has not done before or even
something nobody anywhere has done before. A
data science innovation focuses on discovering
and using new or untraditional data sources to
solve new problems.
Adapted from:
Franks, B. (2012) Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons
12. Internet of Things (IoT)
“trillion sensors”
Source: www.tsensorssummit.org
13. Data Science Innovations
ID | Analytics | Innovative info source | Innovation | Software/Platform
1 | Node-Link (NLA) | Multiple | Reduce suspect list from 18 m to 230/32 | New version: Spark GraphX
2 | ANZ Truckometer | NZ transport authority real time traffic data | GDP forecast 6 months in advance | N/A
3 | Driving (Usage Based) | Black box (telematics), unstructured data | Pay as you drive policy, pay how you drive | Hadoop Map Reduce
4a | Deception (veracity) | Found stories, online blogs | Flag fake stories: text, images and short video | MongoDB – Python dictionary
4b | Psychological State | Twitter and Instagram | Junk words | MongoDB – Python dictionary
4c | Thematic Apperception Technique | Mobile phone screen customisation | Automated informant testing | Sparkling Water (H2O/Spark), Deep Learning
5 | Brand | Brand stories “found” online | Brand user profile | R/Hadoop
6 | Supermarket shopper behavior | CCTV / beacon transmitters | “My store” product placement based on time of day, predictive shopping behaviour | MongoDB, Hadoop 2 Cluster, Spark GraphX, Spark MLlib
7 | Sandbag exercise | Sandbag sensors | Virtual trainer | Spark GraphX, Spark MLlib
8 | Oil reserves shipment monitoring | Skybox (Google) satellite images | Improved oil forecast | “Busboy” – C/Hadoop
9 | J score for mobile energy usage | Sparse incomplete data from community of mobile users | Energy bug mgmt. | Spark/Amazon Web
Suresh Sood 2015
14. 1. Node Link Analytics
• In the 1990s, Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer
• Everyone in Australia was a suspect
• Large volumes of data from multiple sources
RTA Vehicle records
Gym Memberships
Gun Licensing records
Internal Police records
• Police applied node link analysis techniques (NetMap) to the data
• Harness power of the human mind
• Analyst can spot indirect links, patterns, structure, relationships and anomalies
• A bottom-up approach with process of discovery to uncover structure
• Reduced the suspect list from 18 million to 230
• Further analysis with the use of additional satellite information reduced this to 32
Data Information Knowledge
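The narrowing step can be illustrated in miniature. The sketch below is not the NetMap method used by the police, which relied on an analyst exploring the network visually. It is a simplified, automated stand-in that links people to entities across sources and keeps those connected to every entity of interest. All records, names and entity labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (source, person, linked entity). Not real case data.
records = [
    ("vehicle_records", "person_a", "4wd_vehicle"),
    ("vehicle_records", "person_b", "4wd_vehicle"),
    ("vehicle_records", "person_c", "sedan"),
    ("gun_licences", "person_a", "rifle_licence"),
    ("gun_licences", "person_c", "rifle_licence"),
    ("gym_memberships", "person_a", "gym_near_site"),
    ("gym_memberships", "person_b", "gym_near_site"),
]


def build_links(records):
    """Build an undirected node-link structure as an adjacency map."""
    links = defaultdict(set)
    for _source, person, entity in records:
        links[person].add(entity)
        links[entity].add(person)
    return links


def narrow_suspects(links, required_entities):
    """Keep only the people linked to every entity of interest."""
    candidate_sets = [links[entity] for entity in required_entities]
    return set.intersection(*candidate_sets) if candidate_sets else set()


links = build_links(records)
step1 = narrow_suspects(links, ["4wd_vehicle"])
step2 = narrow_suspects(links, ["4wd_vehicle", "rifle_licence"])
print(sorted(step1))  # ['person_a', 'person_b']
print(sorted(step2))  # ['person_a']
```

Each extra data source acts as another filter, which is how a very large population can shrink to a short list.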
15. 2. ANZ Truckometer
The ANZ Heavy Traffic Index comprises flows
of vehicles weighing more than 3.5 tonnes
(primarily trucks) on 11 selected roads around
NZ. It is contemporaneous with GDP growth.
The ANZ Light Traffic Index is made up of light
or total traffic flows (primarily cars and
vans) on 10 selected roads around the
country. It gives a six month lead on GDP
growth.
http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
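One way to look for a lead of this kind is to correlate the indicator with the target shifted forward by different numbers of periods. The sketch below is an illustration only, not ANZ's methodology. The series are synthetic and constructed so that the target follows the indicator by two quarters (six months), and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_lead(indicator, target, max_lead):
    """Find the lead at which indicator[t] best correlates with target[t + lead]."""
    scores = {}
    for lead in range(max_lead + 1):
        xs = indicator[: len(indicator) - lead]
        ys = target[lead:]
        scores[lead] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Synthetic quarterly data (not real traffic or GDP figures).
traffic = [1.0, 1.4, 0.8, 0.2, 0.9, 1.7, 1.1, 0.3, 0.6, 1.5, 1.2, 0.4]
gdp = [0.5, 0.5] + traffic[:-2]  # built so that GDP follows traffic by 2 quarters

lead, scores = best_lead(traffic, gdp, max_lead=4)
print(f"best lead: {lead} quarters")
```

With real data the correlation would be far from perfect, and a high correlation at some lead does not by itself establish a reliable forecast.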
16. 3. Black Box Insurance
• Big data transforms actuarial insurance from using probability methods to estimate premiums into dynamic risk
management that uses real data to generate individually tailored premiums
• Estimate: on a 20 km work or home journey, a data point is acquired every min and the journey captures 12 points per km.
Assuming 1,000 km of driving per month, that is 12,000 points per month, or 144,000 points per car per annum. Hence,
1,000 cars generate 144 million points per annum.
• A telematics (black box) monitor helps assess driving behavior and prices the policy on true driver-centric
premiums by capturing:
– Number of journeys
– Distances travelled
– Types of roads
– Speed
– Time of travel
– Acceleration and braking
– Any accidents
– Location?
• Benefits low mileage, smooth and safe drivers
• Privacy vs. saving money on insurance (Canada; http://bit.ly/Black_box)
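The data volume estimate on this slide can be reproduced directly. The constants come from the slide, while the function name is hypothetical.

```python
POINTS_PER_KM = 12       # from the slide
KM_PER_MONTH = 1_000     # from the slide
MONTHS_PER_YEAR = 12


def telematics_points(cars):
    """Return (points per car per month, per car per year, per fleet per year)."""
    per_month = POINTS_PER_KM * KM_PER_MONTH
    per_car_year = per_month * MONTHS_PER_YEAR
    return per_month, per_car_year, per_car_year * cars


per_month, per_car_year, fleet_year = telematics_points(cars=1_000)
print(f"{per_month:,} points per month")           # 12,000
print(f"{per_car_year:,} points per car per year")  # 144,000
print(f"{fleet_year:,} points per year for fleet")  # 144,000,000
```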
17. Psychological analytics helps put human context into Business
• Behavior data links human emotions to business: analyse the footprints left behind.
• What does customer satisfaction really mean? Is the person actually happy?
• How do we take the emotional dimension into account for customer experience?
• How do we recognize someone is dissatisfied?
• How do we recognize a “distressed” person?
• Do we use text and voice? Will sleeping patterns and eating habits help?
• Would you act differently if someone is happy?
• How do you coach employees to see how someone sounds in emotional terms?
• Understanding when distress exists and when a customer needs enhanced service
• Behavior data reveals attitude and intent. This is more predictive of future
opportunities and risk than historical data
19. Elaboration of Trip to Paris Blog Story (Means-End & Heider)
Woodside, Sood & Miller (2008), When Consumers and Brands Talk, Psychology & Marketing
1. Gayle
2. Paige
3. Paris
4. "The occasion was my cousin Paige’s 16th"
5. "I am a Canadian and get by in French."
6. "All I can say is WOW! We rented a 2 bedroom, 1 ½ bath apartment (two showers), "Merlot" from ParisPerfect http://www.parisperfect.com/ and boy was it ever perfect!"
7. "We had a full view of the Eiffel from our charming little terrace. ....We were within walking distance to two metro stops (Pont d'Alma or Ecole Militaire)"
8. "We were walkable to many good bistros, cafes and bakeries and only a few blocks from the wonderful market street Rue Cler."
9. "I bought a Paris Pratique pocket-sized book at a Metro station. This handy guide has detailed maps of each arrondisement, as well as the metro lines, the bus lines, the RER and the SCNF (trains). I'll never be without this again."
10. "Six months before our trip, I gave Paige a couple of good guide books on Paris and suggested she let me know what her interests were since after all, this was to be her trip."
11. Sites
• The Marais
• Notre Dame
• L'Arc de Triomphe - 248 steps up and 248 steps down...
• Champs Elysee
• Jacquemart Museum
• Louvre Lite
• Musee D'Orsay
• Les Invalides, Napoleon's Tomb and the Napoleon Museum
• Sacre Coeur
• Monmartre
• Rodin Museum
• Pompidou Museum
• Train to Vernon, bike to Giverny with Fat Tire Bike Tours (http://www.fattirebiketoursparis.com/)
• Eiffel Tower
12. Unforgettable Memories: "This trip had so many memories, but here are a few choice highlights........On our very first night, knowing that the Eiffel Tower light show started at 10:00 p.m.... she [Paige] dropped her camera…down 6 flights…we were stunned…Spanish Family below standing below [with pieces of the camera]"
13. "The father stretched out his cupped hands which held all of the pieces they were able to recover, including the memory stick and he very solemnly said, "El muerto..."."
14. "They had decide to come to Paris to find the Harley Davidson store so they could buy Harley Paris t-shirts."
15. "Michael Osman is an American artists living in Paris." "He supplements his income by being a tour guide." "I found out about him on Fodors" "So I engaged Michael for two days."
16. "On our trip to Giverny, we met a young woman from Brisbane, Australia who was traveling on her own and we invited her to join us. Three of us enjoyed delicious and innovative soufflés, while Paige had the rack of lamb. We shared two dessert soufflés, one chocolate and the other cherry/almond. Yum"
17. "I wanted Paige to get a feel for shopping experiences that she would not have at home (aka the ubiquitous mall)."
18. "We went on Fat Tire's day trip to Monet's gardens and house in Giverny, about an hour outside Paris."
19. "I know Paige will treasure the memory of this girl's trip for many years to come."
21. The Newman Model of Deception (Pennebaker et al.)
Key word categories for deception mapping:
1. Self words, e.g. “I” and “me”: decrease when someone distances
themselves from the content
2. Exclusive words, e.g. “but” and “or”: decrease with fabricated
content owing to the complexity of maintaining the deception
3. Negative emotion words, e.g. “hate”: increase owing
to feelings of shame or guilt
4. Motion verbs, e.g. “go” or “move”: increase as exclusive words go
down, to keep the story on track
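The four categories can be measured with a plain Python dictionary of word sets. The sketch below only computes the rate of each category in a text. It does not produce a deception score, and the word lists are short hypothetical samples built around the slide's examples rather than the actual LIWC dictionaries.

```python
import re

# Hypothetical, tiny word lists seeded with the slide's examples.
CATEGORIES = {
    "self": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "or", "except", "without"},
    "negative_emotion": {"hate", "sad", "angry", "worthless"},
    "motion": {"go", "move", "walk", "run"},
}


def category_rates(text):
    """Return the share of words in `text` that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty text
    return {
        name: sum(word in vocab for word in words) / total
        for name, vocab in CATEGORIES.items()
    }


rates = category_rates("I hate to go but I must move")
print(rates)
# {'self': 0.25, 'exclusive': 0.125, 'negative_emotion': 0.125, 'motion': 0.25}
```

Under the model, a story would be compared against a baseline: lower self and exclusive rates together with higher negative emotion and motion rates point towards fabricated content.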
23. 4b. Psychological State
• LIWC (analyzewords.com)
– Reveal personality from word usage
– Uses LIWC classification of words
• TweetPsych (tweetpsych.com/)
– Linguistic analysis using:
– RID
– LIWC
Note: TweetPsych is not without critics:
http://psychcentral.com/blog/archives/2009/06/18/putting-cool-ahead-of-science-tweetpsych/
30. 7. Smart Sandbag System
smart-dove.com
The first 3 columns are the x, y, z axes of the gyroscope, followed by
the x, y, z axes of the accelerometer. These are raw data from 40
repetitions of the shoulder press exercise. Standard deviation and a
moving average algorithm are used to build the chart, and a Hidden
Markov Model is used to extract features and build a model of the
exercise. All models are put into the cloud for scoring trainee
exercise.
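The smoothing step described above can be sketched as follows. The sample rows are hypothetical values laid out in the six-column order the slide describes, and only the moving average and standard deviation are shown; the Hidden Markov Model stage is not covered.

```python
import statistics


def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical rows: gyroscope x, y, z followed by accelerometer x, y, z.
rows = [
    (0.0, 0.1, 0.0, 0.0, 1.0, 0.0),
    (0.1, 0.2, 0.0, 0.1, 2.0, 0.0),
    (0.2, 0.1, 0.1, 0.0, 3.0, 0.1),
    (0.1, 0.0, 0.0, 0.1, 4.0, 0.0),
    (0.0, 0.1, 0.1, 0.0, 5.0, 0.1),
]

accel_y = [row[4] for row in rows]        # fifth column: accelerometer y axis
smoothed = moving_average(accel_y, window=3)
spread = statistics.pstdev(accel_y)       # population standard deviation

print(smoothed)  # [2.0, 3.0, 4.0]
print(round(spread, 4))  # 1.4142
```

The window size of 3 is an arbitrary choice for illustration; a real system would tune it to the sensor sampling rate.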
35. Square Kilometer Array (SKA)
• Data collected in a single day would take nearly two million years to play back on an MP3 player
• Central computer has processing power of about one hundred million PCs.
• SKA will use enough optical fiber linking up all the radio telescopes to wrap twice around the Earth.
• Dishes of SKA when fully operational will produce 10 times the global internet traffic as of 2013.
• Aperture arrays in the SKA could produce more than 100 times the global internet traffic as of 2013.
• The SKA will generate enough raw data to fill 15 million 64 GB MP3 players every day.
• The SKA supercomputer will perform 10^18 operations per second - equivalent to the number of stars in three
million Milky Way galaxies - in order to process all the data that the SKA will produce.
• So sensitive that it will be able to detect an airport radar on a planet 50 light years away.
• Thousands of antennas with collecting area of about one square kilometer (that's 1,000,000 square meters).
• Previous mapping of the Centaurus A galaxy took a team 12,000 hours of observations, or several years. SKA ETA: 5
minutes!
• In its first six hours of operation, the SKA will generate more information than all previous radio telescopes
in the world combined.
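To put the MP3 player comparison above into storage units, the figure of 15 million 64 GB players per day can be converted as below. This is a rough sketch assuming decimal units (1 PB = 1,000,000 GB); the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume(players, gb_per_player):
    """Return (petabytes per day, terabytes per second), using decimal units."""
    total_gb = players * gb_per_player
    pb_per_day = total_gb / 1_000_000
    tb_per_second = total_gb / 1_000 / SECONDS_PER_DAY
    return pb_per_day, tb_per_second


pb_per_day, tb_per_second = daily_volume(players=15_000_000, gb_per_player=64)
print(f"{pb_per_day:.0f} PB per day, about {tb_per_second:.1f} TB per second")
```

That is roughly 960 PB, or just under one exabyte, of raw data per day.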
To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument
which, according to Luijten, will lead to “fundamental discoveries of how life and planets and
matter all came into existence. As a scientist, this is a once in a lifetime opportunity.”
Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska
Centaurus A