A Primer on Big Data
MAURO MEANTI
HAWASSA UNIVERSITY, MARCH 11-13 2014
Based on the work of V. Mayer-Schönberger and K. Cukier: «Big Data»
Agenda
 Today:
 What is Big Data about
 More Data
 Messy Data
 Correlation
 Thursday:
 Data: their essence,
their value
 Implications
 Risk
 Remedies
2
WHAT IS BIG DATA ABOUT
3
What is Big Data about:
The 2009 US Flu Epidemic
 H1N1: big scare, no vaccine. Need for a map to contain the spread
 Centers for Disease Control and Prevention (CDC) method: good but 2 weeks late
 Google came up with a predictive algorithm based on what people searched for.
50M terms and 450M models were narrowed down to 45 “marker” terms. No prior
assumptions were made (a toy version of this marker selection follows below)
 When the flu struck, those 45 terms painted the same map as the CDC, but in real
time
 http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
4
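The mechanics of this kind of marker selection can be sketched in a few lines. The code below is a toy illustration with invented data, not Google's actual method: it ranks candidate search terms purely by how well their weekly volumes correlate with an official flu indicator, then keeps the top scorers as markers.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = 52

# Hypothetical weekly series: an official flu indicator (think CDC visit rates)...
cdc_flu_rate = np.abs(np.sin(np.linspace(0, 3, weeks))) + rng.normal(0, 0.05, weeks)

# ...and search volumes for a few candidate query terms (all invented).
candidate_terms = {
    "flu symptoms":    cdc_flu_rate * 1.8 + rng.normal(0, 0.1, weeks),  # tracks the flu
    "fever remedy":    cdc_flu_rate * 0.9 + rng.normal(0, 0.2, weeks),  # tracks the flu
    "football scores": rng.normal(1.0, 0.3, weeks),                     # unrelated noise
}

# Rank terms purely by correlation with the official series -- no medical
# assumptions about WHY a term should matter, which is the point of the slide.
scores = {term: np.corrcoef(vol, cdc_flu_rate)[0, 1]
          for term, vol in candidate_terms.items()}
markers = sorted(scores, key=scores.get, reverse=True)[:2]

print({t: round(s, 2) for t, s in scores.items()})
print("selected markers:", markers)
```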
What is Big Data about:
A Definition
THE ABILITY OF SOCIETY TO
HARNESS INFORMATION IN NOVEL WAYS
TO PRODUCE USEFUL INSIGHT
OR GOODS OR SERVICES OF SIGNIFICANT
VALUE
5
What is Big Data about:
Buying Airplane Tickets
 Airplane fares do not behave linearly with time. A computer scientist (Oren Etzioni)
got mad at it
 He collected all historical data on a number of routes. The data, not the rules
behind them. 200 billion flight-price records
 It can predict whether a price will go up or down with a 75% hit rate, saving $50 on
average
 Sold to Microsoft for $110M
6
What is Big Data about:
What is New Here?
 Ability to process huge quantities of data, and not necessarily tidy ones
 Hadoop vs relational DataBases
 A Mindset shift:
 DATA are no longer static – they have more value than their original use
 DATA can be reused
 DATA can reveal secrets
 Look for correlation versus causality
 A quantity shift leads to a quality shift
7
What is Big Data about:
How Big?
 Non-linear growth of data – new telescopes today collect 50 times the info they collected 5
years ago
 Google processes 24 petabytes per day = the US Library of Congress times 1,000
 Facebook receives 10M photo uploads per hour and 3 billion “likes” per day
 YouTube adds one hour of video every second
….
 In 2000 – 25% of data were digital
 In 2007 – 300 exabytes of data were stored, as in 300 billion compressed digital films. And that
represented 93% of all data
 In 2012 – 1,200 exabytes – representing 98% of all data. Like 5 stacks of CDs reaching the moon
 Every person on Earth now has 320 times the information that was (by estimate) stored in the
Library of Alexandria
 In Gutenberg’s time it took 50 years to double the amount of info; now it takes 3 years
8
What is Big Data about:
3 main shifts
 MORE Data
 We can now process almost ALL data we want
 Using ALL data lets us see details we could not see when we were limited
 MESSY Data
 Having ALL data available we can forgive some imperfections in them
 Removing the sampling error allows for some measurement error
 The loss in accuracy at the micro level is compensated by the insight at the macro level
 From causality to CORRELATION
 Big Data tells us the “WHAT”, not the “WHY”
 From validation of our hypotheses to observing connections we never thought about
9
What is Big Data about:
Datafication
 Taking information about everything and making it analyzable opens the door to new
uses for the data
 Like a gold rush, there is a lot of value to be discovered
 Data is the OIL of the “Information Economy” and will soon move to the balance sheets
of companies
 Subject matter experts will become less relevant; statisticians will become more so (!)
 There will be value in data, in people able to manage them, and in people with
ideas on HOW to use them
10
What is Big Data about:
Risks
 Moving from human-driven decisions (based on small datasets) to machine-based
decisions (based on huge datasets containing OUR data) has implications
 Who regulates the algorithms?
 How do we preserve the “sanctity” of individual volition?
 Examples:
 Data predict you will have a heart attack soon. Insurance asks you to pay more
 Data predict you will default on a mortgage. The mortgage is denied
 Data predict you will commit a crime. Should you be arrested?
11
MORE DATA
12
More Data:
We were all biased by scarcity
 Statistics in the past: confirm the richest finding using the smallest amount of data
 The CENSUS history (by the way, CENSUS comes from the Latin for “to estimate”)
 Caesar Augustus (1 BC)
 Domesday Book (1086). King William I did not live long enough to see it completed
 London during the plague (1390) – first attempt at statistical inference
 US, 19th century. The Constitution mandates one every 10 years; in 1890 the estimate was that it
would take 13 years
 So Herman Hollerith invented punch cards and tabulation machines – Data Processing (and
IBM!) was born
 Still too complex and expensive to be run more frequently than each decade
 Sampling gets invented
 At first, building a “representative sample” looked like the best approach
 1934: Jerzy Neyman proves that random sampling provides a better result (illustrated below)
13
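The power and the limits of sampling are easy to demonstrate. The sketch below uses an assumed synthetic income distribution: a random sample of 1,000 estimates the population mean well (Neyman's point), yet it almost certainly misses the rare outliers that slide 16 will care about.

```python
import random
import statistics

random.seed(42)

# Synthetic "population": one million incomes (an assumed distribution).
population = [random.lognormvariate(10, 0.6) for _ in range(1_000_000)]
true_mean = statistics.fmean(population)

# A modest random sample already estimates the mean well -- Neyman's point.
sample = random.sample(population, 1_000)
print(f"true mean  : {true_mean:,.0f}")
print(f"sample mean: {statistics.fmean(sample):,.0f}  (n=1,000)")

# But a sample cannot answer questions about rare cases: the richest 100
# people (the "outliers" of slide 16) are almost certainly not in it.
cutoff = sorted(population)[-100]
print("outliers present in sample:", sum(1 for x in sample if x >= cutoff))
```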
More Data:
23andMe
 For $100 they (used to) analyze your DNA
to reveal traits making you more likely to
develop certain heart problems and cancers
 But they only sequence a small portion of
your DNA – relative to the markers they
know
 So, if a new marker is discovered – they
would need to sequence you again
 So, working with a subset only answers the
questions you considered in advance
14
More Data:
Steve Jobs
 He got his entire DNA sequenced (3B
pairs)
 In choosing medications, doctors normally
hope for similarities between what they
know of their patient’s DNA and the DNA of
those who participated in the drug’s trial
 In Jobs’s case, they could precisely select
drugs according to their efficacy given his
genetic make-up
 They kept changing treatment, as the
cancer mutated
 This did not save Steve’s life, but extended
it by many years
15
More Data:
Sampling no longer makes sense
 In many cases today, we can get close to N=ALL
 Google Flu Trends used billions of search queries
 Farecast used all US routes price data for an entire year
 In many cases, the interesting data points are the “outliers” – and you only see them
when you get N=ALL
 Detection of credit card fraud – based on anomalies, needs to be real time
 International money transfer: Xoom. Discovered a large scam when they observed a pattern
where there should not have been a pattern
16
More Data:
“Big” does not need to be BIG
 The real power comes not from the sheer size of the
data, but from N=ALL
 The SUMO example
 Steven Levitt from the University of Chicago proved
(after many unsuccessful attempts) what everybody
knew: there was corruption in Sumo!
 Analyzed 11 years of matches, all of them (64K)
 Crossed the results with the ranking
 Corruption was not in the matches for the top
position but in the matches with mid-ranking players
(you need to win 8 of 15 matches to retain your
salary and ranking)
 When an 8-6 player met a 7-7 player at the end of the
season, he lost 25% more often than normal
 And in their first match of the next season, the former
8-6 won much more frequently than normal….as a
gift back
17
More Data:
Summing up
 Big Data (or N=ALL) allows us to reuse information and not to
resample
 It allows us to look at details and test new hypotheses at each level of
granularity
 The Albert-László Barabási example
 The chart on the right comes from ALL calls made over one mobile
operator’s network in a 4-month period
 The study (Barabási et al.) is the first network analysis at a societal level
 It shows that people with many links are less important than people with
links outside their immediate community. It indicates a premium on
diversity within societies
 Using random samples in the era of big data is like using dial phones
in the era of cell phones. Go for ALL, whenever you can!
18
MESSY DATA
19
Messy Data:
We were all obsessed with precision
 Focused on sampling, we were trying to get exactitude – since errors got hugely
amplified
 And with few data, the quest for exactitude was reasonable and aligned with our inner
belief since the 19th century
 Quantum mechanics in 1920s should have changed that mindset, but did not
 But if we relax the precision standard, we can get many more data, and “more trumps
better”
 Messiness (likelihood of errors) grows linearly with more data
 Messiness grows when combining different types of data: think of a customer registry where the
company IBM can be represented as IBM, I.B.M., International Business Machines, T.J.W
Labs… (an entity-resolution sketch follows this slide)
 Messiness kicks in when we transform data, as when we use Twitter messages to predict the
success of a movie
20
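The company-name mess above is a classic entity-resolution problem. A minimal sketch, with hypothetical registry entries and a hand-made alias table: normalize each string to a canonical key before grouping.

```python
import re
from collections import defaultdict

# Hypothetical registry entries for the same company, typed in different ways.
records = ["IBM", "I.B.M.", "International Business Machines",
           "international business machine", "IBM Corp."]

# Hand-made alias table; a real system would learn or curate this at scale.
ALIASES = {"international business machine": "ibm",
           "international business machines": "ibm"}

def normalize(name: str) -> str:
    """Crude canonical key: lowercase, strip punctuation and legal suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()
    key = re.sub(r"\b(corp|inc|ltd)\b", "", key).strip()
    return ALIASES.get(key, key)

groups = defaultdict(list)
for r in records:
    groups[normalize(r)].append(r)

print(dict(groups))  # all five variants collapse under the single key 'ibm'
```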
Messy Data:
The vineyard example
 If we have one temperature sensor for a whole
vineyard, we must and can ensure it works perfectly,
having a high-cost sensor and high-cost maintenance
 If we have one per vine, we can use cheaper sensors
since the aggregate data will provide a better picture
even with few imprecise measurements
 If each sensor sends a reading every minute, we have
no sync issues, but if each sends every millisecond, we
can have data “out-of-sequence” but we still collect a
much better representation
 Maintaining exactitude in the world of Big Data can be
done (look at Wall Street’s 30,000 trades per second) but
is expensive
 Very often less precision is “good enough” and allows
data to scale (see the averaging sketch below)
21
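The statistics behind the vineyard argument is just the law of large numbers: many cheap, imprecise readings average out to a better estimate than a single precise one. A quick sketch with assumed temperatures and noise levels:

```python
import random
import statistics

random.seed(7)
TRUE_TEMP = 21.3  # assumed true vineyard temperature, in Celsius

# One expensive sensor: small measurement error (stdev 0.1 degrees).
expensive = TRUE_TEMP + random.gauss(0, 0.1)

# 500 cheap sensors, one per vine: each five times noisier (stdev 0.5)...
cheap_readings = [TRUE_TEMP + random.gauss(0, 0.5) for _ in range(500)]

# ...yet their average is, on expectation, closer to the truth than the single
# precise reading, and the per-vine detail comes for free.
print(f"one precise sensor: {expensive:.2f}")
print(f"mean of 500 cheap : {statistics.fmean(cheap_readings):.2f}")
print(f"true temperature  : {TRUE_TEMP:.2f}")
```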
Messy Data:
More Trumps Less
 Just as Moore’s Law says processor speed keeps improving, the
performance of algorithms has also kept increasing
 But most of the gains do not come from faster chips or better
algorithms but from more data.
 In chess, the system has been fed ALL data for endgames with <= 6
pieces left, and now the computer always wins
 In natural language, given 4 existing algorithms for grammar checking,
Microsoft discovered that feeding them more words improved their
performance dramatically, and also altered the ranking of the
algorithms
 So Microsoft invested in developing a corpus of words rather than developing
new algorithms
22
Messy Data:
The case of machine-translation
 It started with very small data: 250 word pairs were used to translate 60 Russian phrases
into English during the Cold War. It worked, but it was useless
 And it did not improve fast: the issue was fuzzy words: is “bonjour” good morning, or
good day, or hello, or hi?
 In 1990 IBM launched Candide: ten years of Canadian parliament transcripts in French
and English. 3 million sentence pairs, very well translated. It worked better, but not well
enough to become commercial. And it could not improve further
 Enter Google:
 It takes every translation it can find on the web. A trillion words, 95 billion English sentences,
unevenly translated
 It works far better than anything else before
 Not because of a better algorithm, not because of a better-quality dataset. Just because of its
size
 And it got the size because it accepted messiness
23
Messy Data:
The Billion Prices Project
 Calculating the Consumer Price Index (or inflation rate) is a complex project that costs
$250M a year. And it gives its output after a few weeks, often too late to predict crises
 MIT launched a project to get the prices of products over the Web. 500,000 prices a
day. Messy and not neatly comparable
 The project produces an accurate prediction of CPI in real time (the index mechanics
are sketched below)
 It spun off a commercial venture, PriceStats, which sells real-time analyses to banks and
governments around the world, at a much lower price
24
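Mechanically, a scraped price index boils down to comparing today's prices with a base period for a matched basket of products. A toy daily index with made-up prices (real CPI methodology also weights product categories, which this sketch skips):

```python
from statistics import fmean

# Hypothetical scraped prices: product -> [price on day 0, price on day 1].
prices = {
    "milk 1L":     [0.99, 1.04],
    "bread loaf":  [2.10, 2.10],
    "coffee 250g": [4.50, 4.72],
}

def daily_index(day: int) -> float:
    """Average price relative to the base day, scaled so day 0 = 100."""
    relatives = [series[day] / series[0] for series in prices.values()]
    return 100 * fmean(relatives)

print(f"index day 0: {daily_index(0):.1f}")  # 100.0 by construction
print(f"index day 1: {daily_index(1):.1f}")  # above 100 -> prices rising
```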
Messy Data:
Tags –imprecise but powerful
 Traditional hierarchical taxonomies were painful, but good enough for small data
 But how do you categorize the 6 billion photos Flickr holds, from 75M users?
 Use TAGS. Created by people in an ad-hoc way, simply typed in
 They may be misspelled, so they introduce inaccuracy, but they give us natural access
to our universe of photos, thoughts, expressions…
25
Messy Data:
How to handle them?
 Traditional databases (“Structured Query Language”) require structured
and precise data. If a field is defined as numeric, it must be a number. And
so on
 They are designed for a world where data are few, and hence curated
carefully and precisely
 Indexes are also predefined, so you need to know in advance what you
will be searching for
 Now we have large amounts of data of different types and different
qualities, and we need to mix them. This requires a new database design:
“noSQL”.
 Hadoop is an example of this (the MapReduce idea behind it is sketched below).
 It accepts data of different types and sizes, it accepts messy data, and it allows
searching for anything
 But it requires more processing and storage, typically distributed across physical
locations.
 It has redundancy built in, and it performs processing in place.
 Its output is less precise than a SQL output. So don’t use it for your bank account
 Segmenting a list of customers for a marketing campaign: Visa reduced
processing time from one month to 13 minutes
26
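The processing model behind Hadoop is MapReduce: map each block of data to key/value pairs where it sits, then merge (reduce) the partial results. A single-process Python sketch of the idea, using the canonical word-count example:

```python
from collections import Counter
from functools import reduce

# Three "blocks" of messy text, as if stored on three different machines.
blocks = ["big data is messy",
          "messy data still wins",
          "more data beats better algorithms"]

def map_phase(block: str) -> Counter:
    """Runs where the block lives: emit (word -> count) pairs for that block."""
    return Counter(block.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    """Merge partial counts arriving from different machines."""
    return a + b

partials = [map_phase(b) for b in blocks]          # in Hadoop: parallel, in place
totals = reduce(reduce_phase, partials, Counter())
print(totals.most_common(3))                       # [('data', 3), ('messy', 2), ...]
```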
Messy Data:
Past the tradeoffs
 Only 5% of data are structured – we must accept that, and the
inevitable messiness it brings, if we want to tap into the universe of web
pages, pictures, videos,…
 We used to be limited to small sets and focused on exactitude
 We can now embrace reality: data sets ARE large and they ARE
messy. We have the tools to handle those characteristics and better
understand the world
27
CORRELATION
28
Correlation:
The Amazon story
 In 1997, the top selling tool for Amazon was critics’ reviews.
It had 12 full-time book critics
 Then it realized it had huge quantities of data: every
purchase, every book looked at but not bought, the time
spent on each book…
 First attempt to use those data: taking a sample to find
similarities across customers. Outcome: dumb.
 Second attempt: use all data and just look at correlations
between products (“item-to-item” collaborative filtering,
sketched below)
 It worked, and it was book-independent
 They market-tested the two approaches: books suggested by
the algorithm beat books suggested by the critics 100:1
 The 12 critics got fired, and Amazon sales soared
29
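Item-to-item collaborative filtering can be sketched very simply: count how often two products co-occur across purchase histories and recommend the strongest co-occurring items; no notion of WHY is involved. A toy version with invented baskets (Amazon's production system is of course far more elaborate):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: one set of books per customer.
baskets = [
    {"book_a", "book_b"},
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_c"},
    {"book_a", "book_b", "book_d"},
]

# Count how often every pair of items is bought together.
co_counts = defaultdict(int)
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def recommend(item: str, k: int = 2) -> list:
    """Items most often co-purchased with `item` -- no WHY involved."""
    related = {y: n for (x, y), n in co_counts.items() if x == item}
    return sorted(related, key=related.get, reverse=True)[:k]

print(recommend("book_a"))  # ['book_b', 'book_c']
```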
Correlation:
Machine-gen recommendations work
 Nobody knows WHY a customer who bought book A also wants to buy book B
 But one third of Amazon’s sales result from this system
 75% of orders for Netflix come from this system
 It is like the merchandise placed close to the cashiers – but it analyzes your cart in real
time and, in real time, puts the right merchandise in the basket
 Professional skills, subject-matter expertise, have no impact on those sales processes
 Knowing what, not why, is good enough
 Correlation cannot foretell the future, but by identifying a really good proxy for
a phenomenon, it can predict it with a certain likelihood
30
Correlation:
Don’t make hypotheses, be data-driven
 Walmart – the largest retailer in the world – crossed its historical sales
data with weather reports. It discovered that before every hurricane,
people rushed to buy….
Pop-Tarts, a sugary snack. Now they know, and they stock it next to the
hurricane supplies
 Nobody could have made that hypothesis
 The traditional approach was to make hypotheses and validate them through tests. Slow,
cumbersome, and influenced by our biases
 Let sophisticated computational analysis identify the optimal proxy
 No need to know which search terms are correlated with the flu
 No need to know the rules the airlines use to compute prices
 No need to know the tastes of Walmart shoppers
31
Correlation:
More examples of the use of correlation
 FICO Medication Adherence Score:
 To know whether somebody will take their medicines, FICO analyzed apparently irrelevant
variables such as how often they changed jobs, whether they were married, whether they had a car
 Historical data gave them correlations that helped create an index letting health
providers better target the money they spend reminding patients to take their
medicines
 Experian estimates people’s income based on their credit history. It costs $1 to get it,
while it would cost $10 to get the tax return form
 Aviva uses credit reports and lifestyle data as proxies for blood and urine tests.
The data-driven prediction costs $5, while the tests would cost $125
 Target used shopping histories to predict whether a woman was pregnant. It found 20
products that were good predictors and used them to target those women – even
targeting the different phases of pregnancy
32
Correlation:
Predictive Analysis
 Place sensors on motors, equipment, or infrastructure like bridges
to monitor the data patterns around temperature, vibration,
sound, etc.
 Failures typically follow a pattern in those data, so once the
pattern is spotted, predicting them becomes easy
 UPS uses it for its 60,000 vehicles. Before, it replaced each part every 2
years, to be on the safe side. Now it has saved millions of dollars
 The University of Ontario used it to help make better diagnostic
decisions while caring for premature babies
 Data showed that very constant vital signs are a precursor of a serious
infection – against any apparent logic (sketched below)
 This stability is likely the calm before the storm, but the causality is not
important, the correlation is
 Big data saves lives
33
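The premature-babies finding, where unusual stability precedes trouble, can be sketched as a rolling-variance alarm. The vital-sign numbers and the threshold below are entirely made up for illustration:

```python
import statistics

# Hypothetical heart-rate samples: normal variability, then suspicious calm.
heart_rate = [142, 148, 139, 151, 144, 147, 140, 145,   # healthy variation
              144, 144, 145, 144, 144, 145, 144, 144]   # "calm before the storm"

WINDOW, THRESHOLD = 8, 1.0  # assumed window size and stdev alarm threshold

for i in range(WINDOW, len(heart_rate) + 1):
    spread = statistics.stdev(heart_rate[i - WINDOW:i])
    if spread < THRESHOLD:
        print(f"sample {i}: stdev={spread:.2f} -> ALERT, vitals unusually stable")
        break
else:
    print("no alert raised")
```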
Correlation:
Not only linear
 We already said that in small data every analysis started with a hypothesis
 Today, with big data, the hypothesis is no longer important
 Also, in small data the analysis was limited to linear correlations. Today, no longer
 Are happiness and income directly correlated?
 They are linearly correlated at low incomes, then happiness plateaus (see the
correlation sketch below)
 How does measles immunity depend on healthcare spending?
 Again, it is linear at the beginning but then it drops (likely because more affluent people shy
away from vaccines)
34
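The happiness/income shape, linear at first and then flat, is exactly where a plain (Pearson) linear correlation understates the relationship while a rank-based (Spearman) correlation still captures it. A sketch with an assumed saturating curve:

```python
import numpy as np

rng = np.random.default_rng(1)
income = np.linspace(1_000, 200_000, 200)
# Assumed relationship: happiness rises steeply at low incomes, then plateaus.
happiness = np.log(income) + rng.normal(0, 0.05, 200)

def pearson(x, y):
    """Plain linear correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Rank correlation: Pearson on the ranks, so any monotonic shape scores high."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

print(f"Pearson : {pearson(income, happiness):.3f}")   # noticeably below 1: the plateau hurts a linear fit
print(f"Spearman: {spearman(income, happiness):.3f}")  # close to 1: the ranks still move together
```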
Correlation:
A philosophical problem
 These analyses help us understand the world by primarily asking WHAT and not
WHY
 As humans, we desire to make sense of the world through causal explanations
 Causal intuition is normally a very superficial (quick, illusory) mechanism. When two
events happen one after the other, we are urged to see a causal relation.
 Got the flu? It must be because I did not wear a hat yesterday
 Got a stomach ache? It must be because I ate at the restaurant yesterday
 Big data correlations will routinely disprove our causal intuitions
 Sometimes causality is a deep scientific experimental process
 In that case, correlation is a fast and cheap way to accelerate it, providing proxies
instead of hypotheses
 Be careful with correlation:
 In a “quality of used cars” study, cars painted orange were found to be 50%
less prone to defects.
 But painting your car orange will not do the trick!
35
Correlation:
The Manhattan Manhole
 In Manhattan there are 51,000 manholes, each weighing around 150 kg
 They tend to explode into the air and crash onto the ground 
 A typical Big Data problem for MIT: identify the ones at risk so as to be
able to service them preventively
 94,000 miles of cables, some laid before 1930
 Records kept since 1880, in immensely different formats. The same object (a
“service box”) was identified by 38 different names
 After a huge effort to format the data and make them machine-readable, the
MIT team identified 106 predictors and mapped them against the historical
data up to 2008, then used the result to predict 2009
 It turned out that 2 of them mattered most: the age of the cables and having
had previous problems. The top 10% of the manholes, in a list prioritized by
those two factors, contained 44% of the manholes that had incidents (a toy
version of such a ranking follows below)
 Using those predictors in the future makes it possible to reduce the number of
incidents dramatically
36
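The final model amounts to a risk ranking on two predictors. The toy version below uses invented manhole records and illustrative weights (the real study fit its weights from the historical data):

```python
# Hypothetical manhole records: (id, cable installation year, prior incidents).
manholes = [
    ("MH-001", 1928, 3),
    ("MH-002", 1995, 0),
    ("MH-003", 1941, 1),
    ("MH-004", 2005, 0),
    ("MH-005", 1930, 0),
]

def risk_score(install_year: int, prior_incidents: int) -> float:
    """Illustrative weights only; the real study fit 106 predictors to history."""
    cable_age = 2009 - install_year
    return 0.02 * cable_age + 1.0 * prior_incidents

# Crews then work down this list, servicing the riskiest manholes first.
for mh_id, year, incidents in sorted(manholes,
                                     key=lambda m: risk_score(m[1], m[2]),
                                     reverse=True):
    print(f"{mh_id}: risk={risk_score(year, incidents):.2f}")
```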
Correlation:
Is it the end of theory?
 In Wired in 2008, Chris Anderson asked his readers whether correlation
and statistical analysis mark the end of theory
 Likely NO: Big Data is itself founded on theories and requires
them throughout its process
 But it marks a shift in the way we make sense of the world,
and this change will take time to get used to
 And this change is, in the end, due to the fact that we have
far more data than ever
37
Data: their essence, their value
38
Data: their essence, their value
Navigating the Oceans
 In 1840, ocean navigation was still a mystery. Captains were afraid of the
unknown; they kept repeating their own preferred routes, with no rationale
 Enter M. Maury, head of the “Depot of Charts and Instruments” bureau of
the US Navy
 In his office, he discovers hundreds of thousands of “logs” of previous trips.
They contain info on winds, tides, currents, weather….
 He hires 10 “computers” to transform those logs into data he can tabulate;
he divides the oceans into 5x5-degree squares … and those data
indicate amazingly clearly the most efficient routes. On average, they saved
one third of the navigation time
 To improve further and get more data, he then created standards for logging
(to save $ on the “computers”), gave his charts only to those who agreed to
return their data, and gave flags to the ships supporting the initiative to fly
 In the end he tabulated 1.2M data points, and changed the world. His maps
are still in use
39
Data: their essence, their value
Datafication
 Commander Maury was one of the first to understand the special value of a huge corpus
of data. He took data nobody cared about and transformed them into objects of value.
This is called datafication
 Similarly, Farecast took old price points for airplane tickets, and Google took
old search queries, and they transformed those into something of value
 Another example: a research project at Japan’s Advanced Institute of Industrial Technology
 They took data nobody thought to use: the way people sit in a car. 360 sensors in the car
seat
 They obtain a digital signature that can be used as an anti-theft device, for insurance
purposes, or as a safety tool
 This is another example of taking data with apparently little use and transforming
them into useful data. Datafication
40
Data: their essence, their value
Datafication ≠ Digitalization
 To datafy a phenomenon is to put it in a quantified format so it can be
tabulated and analyzed
 It requires us to know how to measure and how to record what we
measure
 This idea pre-dates the “IT Revolution” age by far.
 Roman numerals were extremely hard to use for calculating large (or very
small) amounts. Counting boards helped with calculating but were no use for
recording
 Arabic numerals were introduced in Europe around 1200 but they only took off
in the 1500s thanks to Luca Pacioli and double-entry bookkeeping: a
clear tool for datafication
 Double-entry bookkeeping standardized the recording of information,
allowed quick queries of the data set, and provided an audit trail to allow
data to be retraced (a built-in “error-correction” mechanism)
 Computers made datafying much more efficient and improved
immensely the ability to analyze data. But the act of digitization, by
itself, does not datafy
41
Data: their essence, their value
Google vs Amazon
 Both Google and Amazon have datafied a huge number of books
 Google as part of its huge “Google Book Search” project
 It first digitized the text, then, using custom-built OCR, datafied it
 Now 20M titles are fully searchable. Look at
http://books.google.com/ngrams for a quick idea
 15% of all published books
 Google uses it for its machine-translation service
 Amazon, with Kindle, has datafied books too – millions of new
books
 but it has decided not to use that for any relevant project/analyses (with
the exception of its “statistically improbable phrases” service)
 Possibly because books are its core business
42
Data: their essence, their value
Location is also Data
 The introduction of GPS in 1978 allowed the simple datafication of location data
 The price falling from >$100 to <$1 makes it possible to get location data for many
different things
 Insurers now price (also) based on location logs
 UPS used geo-location to build, much as Maury built his navigation map of the oceans, an
optimized navigation map for its 60,000 vehicles, saving 30 million miles
 AirSage buys cellphone data to create real-time traffic reports
 Jana uses cellphone data to understand consumer behavior…. A powerful tool
 The important point is that those data are used for purposes different from the ones they
were created for
43
Data: their essence, their value
Interaction is also Data
 Facebook’s social graph (in 2012) covered >10% of the world
population, all datafied and available to a single company
 This could be used for credit scores: bad payers tend to stick with their
own kind; Facebook could be the next credit-scoring agency
 Twitter (which sells access to its data) is already used to read the
“sentiment” about politics, movies, songs….
 Now sentiment analysis is starting to be used to drive investments in
the stock market as well. MarketPsych sells reports on that, covering 18,864
indices across 119 countries
 Social media networks sit on an immense treasure trove of data, the
exploitation of which has just started
44
Data: their essence, their value
Everything is also Data
 The “Internet of Things” is about sensors on everything, incessantly transmitting data in
a format suitable for datafication
 It is starting with fitness, medicine, and manufacturing
 Zeo has created a database of sleep activity, uncovering differences between men
and women
 Heapsylon has created a sock that tells your phone whether you are running well or not
 Georgia Tech has created an app that allows a phone to monitor a person’s body
tremors to diagnose and track Parkinson’s disease. It is just less effective than the
expensive tools used in hospitals
 GreenGoose sells tiny sensors that anyone can put on objects to measure how
much they are used, letting anyone create their own data environment
45
Data: their essence, their value
Datafication is a fundamental project
 It is an infrastructure project rivaling those of the past, like the
Roman aqueducts or the Encyclopédie of the Enlightenment
age
 We may not notice, because we are in the middle of it
 In time, datafication will give us the means to map the world
in a quantifiable, analyzable way
 Today, it is mostly used in business to create new forms of
value
46
Data: their essence, their value
The Value of reusing
 You all (annoyingly) type the CAPTCHA Luis von Ahn
invented in 2000
 When von Ahn realized he was wasting 10 seconds of
your time 100M times a day, he thought harder
 He invented reCAPTCHA. The second word is a
digitized word a computer cannot read
 5 consistent user inputs disambiguate that word (a
simple majority vote, sketched below)
 The data has a primary use (to prove you are human) and
a secondary use (to decipher unclear words)
 And it saves $750M/yr in manual digitization work
CAPTCHA = Completely Automated Public Turing test to tell Computers and Humans Apart
47
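The "5 consistent user inputs" rule is a plain majority vote over independent transcriptions. A sketch with invented user answers and the slide's threshold:

```python
from collections import Counter

# Hypothetical transcriptions typed by different users for one unclear scan.
answers = ["morning", "morning", "moming", "morning", "morning", "morning"]

MIN_AGREEMENT = 5  # threshold from the slide: 5 consistent inputs

word, votes = Counter(answers).most_common(1)[0]
if votes >= MIN_AGREEMENT:
    print(f"accepted: '{word}' ({votes} matching inputs)")
else:
    print("still ambiguous -- keep showing the word to users")
```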
Data: their essence, their value
A new Value for Data
 Data have always been used and traded
 Prices, content, financial information, personal data…
 But they used to be either ancillary to the business, or narrowly used, as with
content or personal information
 Now, all data can become valuable
 Fuel levels from a delivery vehicle
 Readings from heat sensors
 Billions of old search queries
 Old price records for airline tickets
 ….
 And the cost of gathering and keeping them keeps falling. In 50 years
storage density has increased 50-million-fold….
48
Data: their essence, their value
Data can be reused and multiused
 The primary use of data is typically evident to whoever collects them:
Stores for proper accounting
Factories for quality control
Websites for content optimization
Social sites for ads optimization
 But data do not get consumed by usage and can be reused for
multiple purposes.
 So the full value of data is greater than what is extracted from their first
use
 This is called the “option value” of data. They have a “potential
energy”
49
Data: their essence, their value
Reuse
 Search terms are a classic case of reuse
 Hitwise uses search terms to learn about consumer preferences. Will “pink” or
“black” be next season’s fashion color?
 The Bank of England uses search terms to get a sense of the housing market
 Logistics companies use their records to create business forecasts they sell
(under a different company name)
 SWIFT offers GDP forecasts based on the money transfers it handles
 Mobile operators are starting to resell their information (enriched with geo-location) for
local advertisement and promotions
 They can also sell signal-strength information (with geo-location) to
handset manufacturers to improve reception quality
 Large companies are starting to spin off dedicated companies to take financial
advantage of their data’s option value
50
Data: their essence, their value
Data combination
 At times the dormant value can only be unleashed through
combining different datasets – often very different ones
 Cancer and cell phones
A question that has always been hanging around
The Danes took an N=ALL approach, combining all consumer
mobile-operator data from 1987 to 1995, all cancer-patient
registers from 1990 to 2007, and all income and education
information for each inhabitant
The result: there was NO correlation
 With big data, the sum is more valuable than the parts
51
Data: their essence, their value
Data Extensibility
 To enable reuse – design extensibility in from the ground up
 Google Street View imagery was originally used to provide the “street
view” in Google Maps.
But the data had been collected with extensibility in mind, so
they will be reused to enable Google’s self-driving
car
 In-shop cameras (and software) are designed to prevent
shoplifting, but they can be extended to provide
marketing-relevant data on customer behavior and
preferences
 The extra cost of collecting multiple data streams is low,
and it can drive massive benefits when a dataset can be
used for multiple purposes
52
Data: their essence, their value
Data Exhaust
 Bad, incorrect, or defective data can carry
value
 Google’s spell-checker is built from the input end users give when correcting
misspelled queries
 Data exhaust, in general, means the data users leave behind them
 Voice recognition and spam-filter systems improve in a similar way
 Social networks are obviously looking at this
 But other sectors are starting too:
 E-book readers gather an amazing amount of information that could help
authors and publishers make better books
 Online education programs can predict student behavior
 This will constitute a huge barrier to entry for new entrants
53
Data: their essence, their value
What is the value of data?
 Data are an intangible asset, like brand, talent, and
strategy
 But this can explain some strange things that
happened recently, like WhatsApp’s valuation (or
the Facebook IPO itself)
 There are emerging marketplaces for data, like
Import.io or Factual
 But there is no clear answer yet, also because most of
the value of data is in their (re)use, not in their
possession
54
Implications
55
Implications
Decide.com
 Decide.com had an ambition: to be a price-prediction engine for
almost every consumer product
 It scraped the web to obtain 25 billion price observations. Lots of
data, and lots of text to be transformed into data
 It identified unnatural behaviors, like prices increasing for an old model at
the introduction of a new one
 It spotted any unnatural price spike
 It provided 77% accuracy, and saved on average $100 per purchase
 If the prediction was wrong, they reimbursed the difference
 They got bought by eBay…
 What makes them special? The data were available on the Internet; they
did not use any special algorithm….
56
Implications
Ideas matter
 Decide.com had an IDEA. And that idea came from a big-data
mindset: they saw the opportunity and realized it could be
pursued with existing data and tools
 Moving from the data itself to the companies who use data, how does
the value chain work?
 There are three types of big-data companies, differentiated by the
value they offer:
 The Data
 The Skills
 The Ideas
(and of course some companies have a mix….)
57
Implications
Who has Data
 Some companies have a lot of data, but data is not what
they are in business for
 Twitter – as an example – turned to two independent
companies to license its data to other users
 Telecom companies could do the same – and in some
cases they are starting to
 ITA provided data to Farecast – it did not do the job
itself since it would have been in competition
with the airlines
 MasterCard created a division (MasterCard Advisors) to extract
value from its data and resell it
58
Implications
Who has Skills
 Consultants, technology vendors, and analytics providers who
have the competencies to do the work but do not have
access to data and do not have a “big-data” mindset
 Accenture is a good example
 Microsoft (Consulting) is another:
 It worked with a hospital in Seattle to analyze years of
anonymized medical records to find a way to minimize
readmissions
 It found that the mental state of the patient is a key predictor
 Addressing that reduced the overall healthcare spend
59
Implications
Who has a big-data mindset (1)
 They see opportunities before the others, and they see
what is possible without worrying too early about feasibility
 FlightCaster.com – predicts whether a flight will be delayed
 It analyzes every flight over ten years, matches them against
weather data, and applies the correlations to current
flights and current weather
 The data were all openly available (government-owned),
but the government had no interest in using them
 Airlines had no interest (they want to hide the delays)
 It worked perfectly… even airline pilots used it...
 They were a first mover – it was not difficult to copy them
60
Implications
Who has a big-data mindset (2)
 Very often it takes an outsider to get a brilliant idea
 Incumbents are often too “encumbered” by their present to think
clearly about the future
 Amazon was not founded by a bookstore but by someone from a hedge fund…
 eBay was not launched by an auction company but by a software
developer….
 Entrepreneurs with a big-data mindset do not normally have the data,
but they also lack the vested interests and fears preventing its use
61
Implications
Data Intermediaries
 Today, both skills and ideas seem to dominate the value chain, but long term most
of the value will be in the data themselves
 Data intermediaries will emerge
 Inrix – a traffic-analysis firm
 It gets geo-location data from car manufacturers, taxis, delivery vans
 It aggregates them, combines them with historical data, weather data, and
local-events information, and predicts traffic
 It collects data from rival companies, which could do nothing with their data alone
and which have no competencies in predictive methods
 What Inrix does benefits its customers, so they get a return themselves (even if not
a competitive advantage)
 This “collaboration” is not new (banks need to send their data to central banks, etc.),
but now it is about a secondary use of data. And maybe tertiary: Inrix started using
traffic data to provide information about the health of shopping centers and the health
of the economy in general….
62
Implications
What are the experts for?
 In the movie Moneyball the old “scouts” confront the geek statistician
and offer their arguments against him
 “He’s got a baseball body… a good face”
 “He has an ugly girlfriend, it means no confidence”
 This shows the shortcomings of human judgment
 Data-driven decisions are poised to augment and overrule human judgment
 The subject-matter expert loses appeal versus the data analyst
 The online training company Coursera uses machine-recorded data to advise
teachers on what to improve in their lessons
 Skills in the workplace are changing. Experience is a bit like exactitude: very useful
in a small-data world where you need to make many inferences, less useful in a big-
data world where the data talk
64
Implications
Who will be the winners
 Large companies will continue to soar. Their advantage will rest on data
scale and not on physical scale. And ownership of large sets of data will be a
competitive barrier.
 But large companies need to get the big-data mindset. Rolls-Royce is a
good example – using sensors and big data it transformed itself from a
manufacturer into a services company (charging for usage time and
support)
 Small companies will also do well, since they can have “scale without mass”:
big data does not require large initial investments, they can license
data instead of owning them, and they can rely on cheap cloud computing and
storage
 Mid-sized companies will be squeezed in between
 Individuals will likely be able to take advantage of this revolution. Personal
data ownership may empower individual consumers. But it will need new
technologies, although companies such as Mydex are already working on it
66
Risks
67
Risks
3 categories of risk
 The Internet already threatened PRIVACY; with big data the change of
scale creates a change of state. Google knows what we search for,
Amazon knows what we buy (or would like to buy), Twitter and
Facebook know how we feel and whom we like
 PROPENSITY can now become something affecting our lives. We may see
insurance and mortgages denied, even if we have never been sick or
never been a bad payer
 We can fall victim to a DATA DICTATORSHIP where we fetishize our analyses
and end up misusing them
68
Risks
Privacy
 Big Data is not all about personal information (think of the UPS or the
manhole examples), but much of the data being generated now
contains personal information (or can be traced back to it)
 “Smart meters” collect info on electricity usage every 6 minutes. They can tell
which appliance you use, and of course when
 The traditional approach to privacy is “notice and consent”, which limits
use to the primary purpose
 How do you apply it in a big-data world where secondary usages have not
been imagined yet?
 Opting out leaves a trace
 Anonymization does not work either, since big data creates too many
references to ensure we cannot be identified
69
Risks
Probability and free will
 Parole boards in the US use data-analysis-based predictions to decide
whether to release somebody from prison
 US Homeland Security has a project to identify terrorists by monitoring body
language and other physiological patterns
 In Los Angeles the police use big data to select which streets, groups, and individuals
need to be subject to more surveillance
 It looks like a great idea (preventing crime) but it is dangerous: we
may end up punishing the probable criminal
 And while “small data” techniques were based on profiling built on a
model of the issue at hand (causal), “big data” only looks at correlations
– and that makes things even more dangerous
71
Risks
A potential bad outcome
 Going back to the Google Flu example
 What if the government decides to impose a quarantine on people in
the riskier areas?
 The Google algorithm makes it possible to identify them individually
 So they can be quarantined merely because they made the queries…
 But remember: correlation is NOT causation….
72
Remedies
73
Remedies
Every revolution brings new rules
 Gutenberg’s invention brought censorship, licensing,
copyright, freedom of speech, and defamation rules
 First the focus was on limiting the information flow, then it
edged in the opposite direction
 With the Big Data transformation, we will also need a new
set of rules. Simply adapting the existing ones will not be
sufficient. And we need to move fast
74
Remedies
Few suggestions
 Privacy should move from end-user consent to data-user accountability
 Big-data users should provide use assessments of the dangers of the
intended use
 They should also provide a time frame for the usage (and retention) of data,
to avoid a “permanent memory” scenario (as we have today)
 Decisions based on big-data predictions must be documented and the
algorithms certified, and they need to be disprovable
 Decisions must be framed in a language of risks and avoidance, not in a
language of “personal responsibility”
 Judgment must stick to personal responsibility and actual behavior
75
Remedies
A new profession
 As the complexity of finance paved the way for the creation of
auditing firms, we will need a new set of experts: the “algorithmists”
 Companies will have internal algorithmists, as they have controllers
now, and external ones, as they have auditors
 These people will be the experts ensuring that big-data systems do not
remain “black boxes” offering no accountability, traceability, or
confidence
76
Remedies
Data Antitrust
 As for any other raw material or key service, access to data must be
regulated
 Competition must be ensured and data transactions enabled
through licensing and interoperability
 Governments (and others willing to do so) should publicly release their
own data (this is already happening under the name of “Open
Data”)
77
Closing
79
Big Data today
 The effects are large on a practical level, finding solutions to real problems
 Big Data is when the “Information Society” becomes true.
Data (information) take center stage, and they speak
 Data will keep increasing
 Messiness will be acceptable in return for capturing far more data
 Correlation is faster and cheaper than causality, so it is often preferable
 Much of the value will come from secondary uses of data
 We will need to establish new principles to govern the change
 Big Data is a resource and a tool. It informs, it does not explain. It points us
towards understanding, but it is not the truth
80
Big Data tomorrow
What’s past is prologue
(William Shakespeare)
81

Más contenido relacionado

Similar a A Primer on Big Data taken by the book: "Big Data" by Schoenberger and Cukier

Big data chicago v2 5 14 14
Big data chicago v2 5 14 14Big data chicago v2 5 14 14
Big data chicago v2 5 14 14
Tim Gilchrist
 
Sensory transformation
Sensory transformationSensory transformation
Sensory transformation
Karlos Svoboda
 
​Big data and the examined life
​Big data and the examined life​Big data and the examined life
​Big data and the examined life
Sherry Jones
 
Business stats assignment
Business stats assignmentBusiness stats assignment
Business stats assignment
Infosys
 
TED Wiley Visualizing .docx
TED  Wiley Visualizing .docxTED  Wiley Visualizing .docx
TED Wiley Visualizing .docx
ssuserf9c51d
 
Dawn Nafus's presentation at eComm 2008
Dawn Nafus's presentation at eComm 2008Dawn Nafus's presentation at eComm 2008
Dawn Nafus's presentation at eComm 2008
eComm2008
 

Similar a A Primer on Big Data taken by the book: "Big Data" by Schoenberger and Cukier (20)

Big Data--Variant Data Concept
Big Data--Variant Data ConceptBig Data--Variant Data Concept
Big Data--Variant Data Concept
 
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech oneHeavy, Messy, Misleading: why Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech one
 
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
 
Heavy, Messy, Misleading: How Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: How Big Data is a human problem, not a tech oneHeavy, Messy, Misleading: How Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: How Big Data is a human problem, not a tech one
 
Big data chicago v2 5 14 14
Big data chicago v2 5 14 14Big data chicago v2 5 14 14
Big data chicago v2 5 14 14
 
Roger hoerl say award presentation 2013
Roger hoerl say award presentation 2013Roger hoerl say award presentation 2013
Roger hoerl say award presentation 2013
 
Sensory transformation
Sensory transformationSensory transformation
Sensory transformation
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
How New York Genome Center Manages the Massive Data Generated from DNA Sequen...
How New York Genome Center Manages the Massive Data Generated from DNA Sequen...How New York Genome Center Manages the Massive Data Generated from DNA Sequen...
How New York Genome Center Manages the Massive Data Generated from DNA Sequen...
 
NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
 
​Big data and the examined life
​Big data and the examined life​Big data and the examined life
​Big data and the examined life
 
Business stats assignment
Business stats assignmentBusiness stats assignment
Business stats assignment
 
JanData-mining-to-knowledge-discovery.ppt
JanData-mining-to-knowledge-discovery.pptJanData-mining-to-knowledge-discovery.ppt
JanData-mining-to-knowledge-discovery.ppt
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
TED Wiley Visualizing .docx
TED  Wiley Visualizing .docxTED  Wiley Visualizing .docx
TED Wiley Visualizing .docx
 
Data mining and knowledge Discovery
Data mining and knowledge DiscoveryData mining and knowledge Discovery
Data mining and knowledge Discovery
 
Big Data-Job 2
Big Data-Job 2Big Data-Job 2
Big Data-Job 2
 
The Paradox of Big Data: Misery or Magic?
The Paradox of Big Data: Misery or Magic?The Paradox of Big Data: Misery or Magic?
The Paradox of Big Data: Misery or Magic?
 
Dawn Nafus's presentation at eComm 2008
Dawn Nafus's presentation at eComm 2008Dawn Nafus's presentation at eComm 2008
Dawn Nafus's presentation at eComm 2008
 
Big data new physics giga om structure conference ny - march 2011
Big data new physics   giga om structure conference ny - march 2011Big data new physics   giga om structure conference ny - march 2011
Big data new physics giga om structure conference ny - march 2011
 

Último

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 

Último (20)

APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 

A Primer on Big Data taken by the book: "Big Data" by Schoenberger and Cukier

  • 1. A Primer on Big Data MAURO MEANTI HAWASSA UNIVERSITY, MARCH 11-13 2014 Based on the work of V.M. Schonberger and K. Cukier: «Big Data»
  • 2. Agenda  Today:  What is Big Data about  More Data  Messy Data  Correlation  Thursday:  Data: their essence, their value  Implications  Risk  Remedies 2
  • 3. WHAT IS BIG DATA ABOUT 3
  • 4. What is Big Data about: The 2009 US Flu Epidemic  H1N1: big scare, no vaccine. Need for a map to contain the spread  Center for Diseas Control and Prevention method: good but 2 weeks late  Google came with a predictive algorithm based on what people searched for. 50m terms and 450m models brought down to 45 “marker” terms. No assumptions were taken  When the flu stroke, those 45 terms painted the same map as CDC, but in real time  http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html 4
  • 5. What is Big Data about: A Definition THE ABILITY OF SOCIETY TO HARNESS INFORMATION IN NOVEL WAYS TO PRODUCE USEFUL INSIGHT OR GOODS OR SERVICES OF SIGNIFICANT VALUE 5
  • 6. What is Big Data about: Buying Airplane Tickets  Airplane fares do not behave linearly with time. A Computer Scientist (Oren Etzioli) got mad at it  He collected all historical data on a number of routes. The data, not the rules behind them. 200Billion flight-price records  Can predict if a price will go up or down with a 75% hit rate, saving 50$ in average  Sold to Microsoft for 110M$ 6
  • 7. What is Big Data about: What is New Here?  Ability to process huge quantities of data, and not necessary tidy  Hadoop vs relational DataBases  A Mindset shift:  DATA are no longer static – they have more value than their original use  DATA can be reused  DATA can reveal secrets  Look for correlation versus causality  Quantity shift leads to a Quality shift 7
  • 8. What is Big Data about: How Big?  Non Linear growth of data– New telescopes collect today 50 times the info they collected 5 years ago  Google process 24 petabytes per day = US Library of Congress time 1000  Facebook uploads 10M photos per hour and 3Billion “like” per day  YouTube adds one hour of video every second ….  In 2000 – 25% of data were digital  In 2007 – 300 exabytes of data were stored. As in 300 Billions compressed digital films. And it represented 93% of data  In 2012 - 1200 exabytes – representing 98% of all data. Like 5 piles of CD reaching the moon  Every person on Earth now has 320 times the information that were (estimate) stored in the Library of Alexandria   In Gutenberg time, it took 50 years to double the amount of info, now it takes 3 years 8
  • 9. What is Big Data about: 3 main shifts  MORE Data  We can now process almost ALL data we want  Using ALL data les us see details we could not see when we were limited  MESSY Data  Having ALL data available we can forgive some imperfections in them  Removing the sampling error allows for some measurement error  The loss in accuracy at the micro level is compensated by the insight at the macro level  From causality to CORRELATION  Big Data tells us the “WHAT”, not the “WHY”  From validation of our hypotheses to observing connections we never thought about 9
  • 10. What is Big Data about: Datafication  Taking informations on everything and making it analyzable opens the door to new usage for the data  Like a gold hunt, there is lot of value to be discovered  Data is the OIL of the “Information Economy” and will soon move to the Balance Sheets of companies  Subject Matter Expert will become less relevant, Statisticians will become more (! )  There will be value for Data, for people being able to manage them and for people with ideas on HOW to use them 10
  • 11. What is Big Data about: Risks  Moving from human-driven decision (based on small dataset) to machine-based decision (based on huge dataset containing OUR data) have implications  Who regulates the algorithms  How we preserve individual volition “sanctity”?  Examples:  Data predict you will have a hearth attack soon. Insurance asks you to pay more  Data predict you will default on a mortgage. Mortgage is denied  Data predict you will commit a crime. Should you be arrested? 11
  • 13. More Data: We were all biased by scarcity  Statistic in the past: confirm the richest finding using the smallest amount of data  The CENSUS history (by the way, CENSUS comes from “to estimate”)  Caesar Augustus (1 BC)  Domesday Book (1086). King William I did not live enough to see its end  London during the plague (1390) – First attempt to make inference.  US, 19th century. Constitution mandates one every 10 years, in 1890 the estimation was for 13 years  So Herman Hollerith invented punch cards and tabulation machines – Data Processing (and IBM!) is born  Still too complex and expensive to be run more frequently than each decade  Sampling gets invented  First, it looked like building a “representative sample” was the best approach  1934: Jerry Neyzman proves that random sampling provides a better result 13
  • 14. More Data: 23andMe  For 100$ they (used to) analyze your DNA to reveal traits making you more likely to get some heart and cancer problems  But they only sequence a small portion of your DNA – relative to the markers they know  So, if a new marker is discovered – they would need to sequence you again  So, working with a subset only answers the questions you considered in advance 14
  • 15. More Data: Steve Jobs  He got his entire DNA sequenced (3B pairs)  In choosing medications, doctors normally hope for similarities between what they know of their patient DNA and the one of who participated to the drug’s trial  In Job’s case, they could precisely select drugs according to their efficacy given his genetic make-up  They kept changing treatment, as the cancer mutated  This did not save Steve’s life, but extended it by many years 15
More Data:
Sampling no longer makes sense
 In many cases today, we can get close to N=ALL
 Google Flu Trends used billions of search queries
 Farecast used all US route price data for an entire year
 In many cases, the interesting data points are the “outliers” – and you only see them when you have N=ALL
 Credit-card fraud detection is based on anomalies and needs to run in real time
 International money transfers: Xoom discovered a large scam when it observed a pattern where there should not have been one
16
More Data:
“Big” does not need to be BIG
 The real power comes not from the sheer size of the data but from having N=ALL
 The SUMO example
 Steven Levitt of the University of Chicago proved (after many unsuccessful attempts by others) what everybody knew: there was corruption in Sumo!
 He analyzed 11 years of matches – all of them (64K)
 He crossed the results with the rankings
 The corruption was not in the matches for the top positions but in the matches involving mid-ranking players (you need to win 8 of 15 matches to retain your salary and ranking)
 When an 8-6 player met a 7-7 player at the end of the season, he lost 25% more often than normal
 And in their first match of the next season, the former 8-6 player won much more frequently than normal… the gift being returned
17
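As an illustration of the kind of analysis behind this slide, here is a minimal Python/pandas sketch; the bout records below are invented, and the real study crossed the full set of roughly 64,000 matches with the official rankings.

```python
# A minimal pandas sketch of the Levitt-style analysis; the bout
# records are invented for illustration.
import pandas as pd

bouts = pd.DataFrame({
    "a_record": ["8-6", "8-6", "8-6", "7-7"],   # wrestler A going in
    "b_record": ["7-7", "7-7", "7-7", "8-6"],   # wrestler B going in
    "a_won":    [False, False, True, True],
})

# End-of-season bouts where a "safe" 8-6 wrestler meets a 7-7 wrestler
# who still needs one win to keep his rank and salary.
critical = bouts[(bouts.a_record == "8-6") & (bouts.b_record == "7-7")]

# If the safe wrestler loses far more often here than his overall win
# rate predicts, that asymmetry is the fingerprint of match "gifting".
print("8-6 win rate in critical bouts:", critical.a_won.mean())
```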
More Data:
Summing up
 Big Data (or N=ALL) allows us to reuse information instead of resampling
 It allows us to look at details and test new hypotheses at every level of granularity
 The Albert-László Barabási example
 The chart on the right comes from ALL calls made over one mobile operator's network in a four-month period
 The study (Barabási et al.) is the first network analysis at a societal level
 It shows that people with many links are less important than people with links outside their immediate community. It indicates a premium on diversity within societies
 Using random samples in the era of big data is like using dial phones in the era of cell phones. Go for ALL, whenever you can!
18
Messy Data:
We were all obsessed with precision
 Focused on sampling, we had to strive for exactitude – since errors got hugely amplified
 And with few data, the quest for exactitude was reasonable, and aligned with the mindset we have held since the 19th century
 Quantum mechanics in the 1920s should have changed that mindset, but did not
 But if we relax the precision standard, we can get many more data, and “more trumps better”
 Messiness (the likelihood of errors) grows linearly with more data
 Messiness grows when combining different types of data: think of a customer registry where the company IBM can be recorded as IBM, I.B.M., International Business Machines, T.J. Watson Labs…
 Messiness kicks in when we transform data, as when we use Twitter messages to predict the success of a movie
20
Messy Data:
The vineyard example
 If we have one temperature sensor for a whole vineyard, we must (and can) ensure it works perfectly, which means a high-cost sensor and high-cost maintenance
 If we have one sensor per vine, we can use cheaper sensors, since the aggregated data will paint a better picture even if individual measurements are imprecise
 If each sensor sends a reading every minute, we have no sync issues; if each sends one every millisecond, we may get data “out of sequence”, yet we still collect a much better representation
 Maintaining exactitude in the world of Big Data can be done (look at Wall Street's 30,000 trades per second), but it is expensive
 Very often less precision is “good enough” and allows the data to scale
21
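A toy numpy illustration of the vineyard argument (not from the deck's source; all figures invented): each cheap sensor is very noisy, but the mean of N independent readings converges on the truth as N grows.

```python
# One expensive, precise sensor vs. 1000 cheap, sloppy ones: the
# average of many noisy readings can beat the single precise reading.
import numpy as np

rng = np.random.default_rng(0)
true_temp = 21.0                                   # actual temperature

one_precise = true_temp + rng.normal(0, 0.1)       # one expensive sensor
many_cheap = true_temp + rng.normal(0, 2.0, 1000)  # 1000 sloppy sensors

print(f"precise sensor error:     {abs(one_precise - true_temp):.3f}")
print(f"cheap-sensor average err: {abs(many_cheap.mean() - true_temp):.3f}")
```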
Messy Data:
More Trumps Less
 Just as Moore's Law says processor speed keeps improving, the performance of algorithms has kept increasing too
 But most of the gains come not from faster chips or better algorithms but from more data
 In chess, systems have been fed ALL the data for positions with 6 or fewer pieces left, and now the computer always wins those endgames
 In natural language, given 4 existing algorithms for grammar-checking, Microsoft discovered that feeding them more words improved performance dramatically – and also reshuffled the ranking of the algorithms
 So Microsoft invested in building a larger corpus of words rather than in developing new algorithms
22
Messy Data:
The case of machine translation
 It started with very small data: during the Cold War, 250 word pairs were used to translate 60 Russian phrases into English. It worked, but it was useless
 And it did not improve fast: the issue was fuzzy words. Is “bonjour” good morning, or good day, or hello, or hi?
 In 1990 IBM launched Candide: ten years of Canadian parliament transcripts in French and English. 3 million sentence pairs, very well translated. It worked better, but not well enough to become commercial. And it could not improve further
 Enter Google:
 It takes every translation it can find on the web. A trillion words, 95 billion English sentences, unevenly translated
 It works far better than anything before it
 Not because of a better algorithm, not because of a better-quality dataset. Just because of its size
 And it got that size because it accepted messiness
23
Messy Data:
The Billion Prices Project
 Calculating the Consumer Price Index (and hence the inflation rate) is a complex project that costs 250M$ a year. And it delivers its output after a few weeks – often too late to predict crises
 MIT launched a project to collect product prices from the Web: 500,000 prices a day, messy and not neatly comparable
 The project produces an accurate prediction of the CPI in real time
 It spun off a commercial venture, PriceStats, that sells real-time analysis to banks and governments around the world, at a much lower price
24
Messy Data:
Tags – imprecise but powerful
 Traditional hierarchical taxonomies were painful, but good enough in a small-data world
 But how do you categorize the 6 billion photos Flickr holds, from 75M users?
 Use TAGS: created by people in an ad-hoc way, simply typed in
 They may be misspelled, so they introduce inaccuracy, but they give us natural access to our universe of photos, thoughts, expressions…
25
Messy Data:
How to handle them?
 Traditional “Structured Query Language” databases require structured and precise data. If a field is defined as numeric, it must contain a number. And so on
 They are designed for a world where data are few, and hence curated carefully and precisely
 Indexes, too, are predefined, so you need to know in advance what you will be searching for
 Now we have large amounts of data of different types and different qualities, and we need to mix them. This required a new database design: “NoSQL”
 Hadoop is an example of this
 It accepts data of different types and sizes, it accepts messy data, and it allows searching across everything
 But it requires more processing and storage, typically distributed across physical locations
 It has redundancy built in, and it performs the processing in place
 Its output is less precise than a SQL output. So don't use it for your bank account
 Segmenting a list of customers for a marketing campaign: Visa reduced the processing time from one month to 13 minutes
26
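To make the contrast with SQL concrete, here is a toy, single-machine Python sketch of the MapReduce pattern Hadoop is built around (the classic word count); this is the generic pattern, not Hadoop's actual API, and real Hadoop distributes the map, shuffle, and reduce phases across machines, processing the data where it is stored.

```python
# Map: turn each record into (key, value) pairs.
# Shuffle: group the values by key.
# Reduce: combine each key's values into one result.
from collections import defaultdict

def map_phase(record):
    # Emit a (word, 1) pair for every word in the record.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Combine all counts emitted for one word.
    return key, sum(values)

records = ["Big Data is big", "data about data"]

groups = defaultdict(list)                 # the "shuffle" step
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

print(dict(reduce_phase(k, v) for k, v in groups.items()))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```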
Messy Data:
Past the tradeoffs
 Only 5% of data are structured – we need to accept that, and the inevitable messiness it brings, if we want to tap into the universe of web pages, pictures, videos…
 We used to be limited to small datasets and focused on exactitude
 We can now embrace reality: datasets ARE large and they ARE messy. We have the tools to handle those characteristics and better understand the world
27
Correlation:
The Amazon story
 In 1997, Amazon's top selling tool was critics' reviews. It employed 12 full-time book critics
 Then it realized it was sitting on huge quantities of data: every purchase, every book looked at but not bought, the time spent on each book…
 First attempt to use those data: taking a sample to find similarities across customers. Outcome: dumb recommendations
 Second attempt: use all the data and just look at correlations between products (“item-to-item” collaborative filtering)
 It worked, and it was book-independent
 They market-tested the two approaches: books suggested by the algorithm beat books suggested by the critics 100:1
 The 12 critics were let go, and Amazon's sales soared
29
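A stripped-down Python sketch of the “item-to-item” idea, assuming a handful of invented orders (Amazon's production system is of course far more sophisticated): recommend the items most often co-purchased with a given item.

```python
# Count how often each pair of items appears in the same order, then
# recommend the items with the highest co-occurrence count.
from collections import Counter
from itertools import combinations

orders = [
    {"book_a", "book_b"},
    {"book_a", "book_b", "book_c"},
    {"book_b", "book_c"},
]

co_counts = Counter()                       # (item, other) -> count
for order in orders:
    for x, y in combinations(sorted(order), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def recommend(item, k=2):
    scores = {other: n for (it, other), n in co_counts.items() if it == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("book_b"))                  # items most bought with book_b
```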
Correlation:
Machine-generated recommendations work
 Nobody knows WHY a customer who bought book A also wants to buy book B
 But one third of Amazon's sales result from this system
 75% of Netflix orders come from its equivalent
 It is like the merchandise placed close to the cashiers – but it analyzes your cart in real time and, in real time, puts the right merchandise in front of you
 Professional skills and subject-matter expertise have no impact on those sales processes
 Knowing what, not why, is good enough
 Correlation cannot foretell the future, but by identifying a really good proxy for a phenomenon, it can predict it with a certain likelihood
30
Correlation:
Don't make hypotheses, be data-driven
 Walmart – the largest retailer in the world – crossed its historical sales data with weather reports. It discovered that before every hurricane, people rushed to buy… Pop-Tarts, a sugary snack. Now it knows, and it stocks them next to the hurricane supplies
 Nobody could have formulated that hypothesis
 The traditional approach was to make hypotheses and validate them through tests. Slow, cumbersome, and influenced by our biases
 Let sophisticated computational analysis identify the optimal proxy
 No need to know which search terms correlate with the flu
 No need to know the rules the airlines use to compute prices
 No need to know the tastes of Walmart shoppers
31
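A hedged sketch of hypothesis-free correlation hunting in the Walmart spirit, with entirely invented sales figures: correlate every product's daily sales with a hurricane-warning flag and let the ranking surface the surprises.

```python
# Brute-force correlation ranking: no hypothesis needed, just compute
# the correlation of every product with the event of interest.
import numpy as np

rng = np.random.default_rng(1)
days = 365
hurricane = (rng.random(days) < 0.03).astype(float)   # warning days

sales = {
    "pop_tarts":  50 + 40 * hurricane + rng.normal(0, 5, days),
    "flashlight": 10 + 30 * hurricane + rng.normal(0, 5, days),
    "toothpaste": 30 + rng.normal(0, 5, days),
}

ranked = sorted(sales,
                key=lambda p: np.corrcoef(sales[p], hurricane)[0, 1],
                reverse=True)
print(ranked)    # products most correlated with hurricanes come first
```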
Correlation:
More examples of the use of correlation
 FICO Medication Adherence Score:
 To know whether somebody will take their medicines, FICO analyzed apparently irrelevant variables such as how often they changed jobs, whether they were married, whether they owned a car
 Historical data yielded correlations from which FICO built an index that helps health providers better target the money they spend reminding patients to take their medicines
 Experian estimates people's income based on their credit history. It costs 1$ to obtain, while getting the tax return form would cost 10$
 Aviva uses credit reports and lifestyle data as proxies for blood and urine tests. The data-driven prediction costs 5$, while the tests would cost 125$
 Target used shopping histories to predict whether a woman was pregnant. It found 20 products that were good predictors and used them to target those women – even targeting the different phases of pregnancy
32
Correlation:
Predictive Analysis
 Place sensors on motors, equipment, or infrastructure like bridges to monitor the data patterns around temperature, vibration, sound, etc.
 Failures typically follow a pattern in those data, so once the pattern is spotted, prediction becomes easy
 UPS uses it for its 60,000 vehicles. Before, it replaced each part every 2 years to be on the safe side. Now it has saved millions of dollars
 The University of Ontario used it to help make better diagnostic decisions while caring for premature babies
 The data showed that very constant vital signs are a precursor of a serious infection – against any apparent logic
 This stability is likely the calm before the storm, but the causality is not important; the correlation is
 Big data saves lives
33
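A synthetic Python sketch of the pattern idea (the signal values are invented, not real vitals): flag a sensor stream whose readings turn abnormally stable, echoing the premature-baby finding that unusually constant vitals can precede trouble.

```python
# Flag windows whose rolling standard deviation drops suspiciously low.
import numpy as np

def rolling_std(series, window=10):
    return np.array([series[i - window:i].std()
                     for i in range(window, len(series))])

rng = np.random.default_rng(3)
normal_phase = rng.normal(120, 4, 100)            # healthy variation
calm_phase = 120 + rng.normal(0, 0.2, 50)         # suspicious calm
signal = np.concatenate([normal_phase, calm_phase])

alerts = np.where(rolling_std(signal) < 1.0)[0]
print("first low-variability alert at sample:", alerts[0] + 10)
```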
Correlation:
Not only linear
 We already said that in the small-data world every analysis started with a hypothesis
 Today, with big data, the hypothesis is no longer important
 Also, in the small-data world the analysis was limited to linear correlations. No longer
 Are happiness and income directly correlated?
 They are linearly correlated for low incomes; then the curve plateaus
 How does measles immunization depend on healthcare spending?
 Again, the relation is linear at the beginning, but then it drops (likely because more affluent people shy away from vaccines)
34
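A small synthetic sketch of looking past linear correlation (the income/happiness data below are made up): on a plateauing curve, Pearson's linear coefficient understates the relation, while Spearman's rank correlation still captures the monotonic link.

```python
# Compare linear (Pearson) and rank-based (Spearman) correlation on a
# relationship that rises and then flattens out.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
income = np.linspace(1_000, 200_000, 200)
happiness = np.log(income) + rng.normal(0, 0.1, 200)   # plateauing curve

print("Pearson :", stats.pearsonr(income, happiness)[0])
print("Spearman:", stats.spearmanr(income, happiness)[0])
```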
Correlation:
A philosophical problem
 These analyses help us understand the world by primarily asking WHAT, not WHY
 As humans, we desire to make sense of the world through causal explanations
 Causal intuition is normally a very superficial (quick, illusory) mechanism. When two events happen one after the other, we are urged to see a causal relation
 Got the flu? It must be because I did not wear a hat yesterday
 Got a stomach ache? It must be the restaurant where I ate yesterday
 Big-data correlations will routinely disprove our causal intuitions
 Sometimes establishing causality is a deep, scientific, experimental process
 In that case, correlation is a fast and cheap way to accelerate it, providing proxies instead of hypotheses
 Be careful with correlation:
 In a “quality of used cars” study, cars painted orange proved 50% less prone to defects
 But painting your car orange will not do the trick!
35
Correlation:
The Manhattan Manholes
 In Manhattan there are 51,000 manholes, each weighing about 150 kg
 They tend to explode into the air and crash back onto the ground
 A typical Big Data problem for MIT: identify the ones at risk so they can be serviced preventively
 94,000 miles of cables, some laid before 1930
 Records kept since 1880, in immensely different formats. The same object (a “service box”) is identified by 38 different names
 After a huge effort to format the data and make them machine-readable, the MIT team identified 106 predictors and mapped them against the historical data up to 2008, then used the result to predict 2009
 It turned out that only 2 really mattered: the age of the cables and having had previous problems. The top 10% of the manholes in the list prioritized by those two factors contained 44% of the manholes that had incidents
 Using those predictors in the future makes it possible to reduce the number of incidents dramatically
36
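A simplified Python sketch of the prioritization step; the five records are invented and the two-feature score is naive (the real study started from 106 candidate predictors), but it shows the shape of the exercise: score, rank, and check how many of the next year's incidents land near the top of the list.

```python
# Rank manholes by a naive score built from the two predictors that
# mattered, then measure how many incidents the top of the list captures.
import pandas as pd

manholes = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "cable_age_years": [95, 20, 60, 80, 15],
    "prior_incidents": [3, 0, 1, 2, 0],
    "incident_next_year": [True, False, False, True, False],
})

# Normalize each predictor to [0, 1] and sum.
score = (manholes.cable_age_years / manholes.cable_age_years.max()
         + manholes.prior_incidents / manholes.prior_incidents.max())
ranked = manholes.assign(score=score).sort_values("score", ascending=False)

top = ranked.head(2)                       # top of the priority list
print("incidents captured in top of list:",
      int(top.incident_next_year.sum()), "of",
      int(manholes.incident_next_year.sum()))
```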
Correlation:
Is it the end of theory?
 In Wired in 2008, Chris Anderson asked his readers whether correlation and statistical analysis mark the end of theory
 Likely NO: Big Data is itself founded on theories, and requires them throughout its process
 But it marks a shift in the way we make sense of the world, and this change will take time to get used to
 And this change is, in the end, due to the fact that we have far more data than ever
37
Data: their essence, their value
38
Data: their essence, their value
Navigating the Oceans
 In 1840, ocean navigation was still a mystery. Captains were afraid of the unknown; they kept repeating their own preferred routes, with no rationale
 Enter Matthew Maury, head of the “Depot of Charts and Instruments” bureau of the US Navy
 In his office, he discovered hundreds of thousands of “logs” of previous voyages. They contained information on winds, tides, currents, weather…
 He hired 10 “computers” to turn those logs into tabulatable data, and he divided the oceans into 5x5-degree squares… and those data indicated, amazingly clearly, the most efficient routes. On average, they saved one third of the navigation time
 To improve further and get more data, he then created standards for logging (to save $ on the “computers”), gave his charts only to those who agreed to return their data, and gave flags to display to the ships supporting the initiative
 In the end he tabulated 1.2M data points, and changed the world. His maps are still in use
39
Data: their essence, their value
Datafication
 Commander Maury was one of the first to understand the special value of a huge corpus of data. He took data nobody cared about and transformed them into objects of value. This is called datafication
 Similarly, Farecast took old price points for airplane tickets, and Google took old search queries, and they transformed them into something of value
 Another example: research at Japan's Advanced Institute of Industrial Technology
 They took data nobody had thought to use – the way people sit in a car – via 360 sensors on the car seat
 They obtain a digital map that can be used as an anti-theft signature, for insurance purposes, or as a safety tool
 This is another example of taking data with apparently little use and transforming them into useful data. Datafication
40
Data: their essence, their value
Datafication ≠ Digitization
 To datafy a phenomenon is to put it in a quantified format so it can be tabulated and analyzed
 This requires us to know both how to measure and how to record what we measure
 The idea pre-dates the “IT Revolution” by far
 Roman numerals were extremely hard to use for calculating large (or very small) amounts. Counting boards helped with calculating but were of no use for recording
 Arabic numerals were introduced into Europe around 1200, but they only took off in the 1500s thanks to Luca Pacioli and double-entry bookkeeping: a clear tool for datafication
 Double-entry bookkeeping standardized the recording of information, allowed quick queries of the data set, and provided an audit trail allowing data to be retraced (a built-in “error-correction” mechanism)
 Computers made datafying much more efficient, and immensely improved our ability to analyze data. But the act of digitization, by itself, does not datafy
41
Data: their essence, their value
Google vs Amazon
 Both Google and Amazon have datafied a huge number of books
 Google did it as part of its huge “Google Book Search” project
 It first digitized the text, then, using custom-built OCR, datafied it
 Now 20M titles are fully searchable. Look at http://books.google.com/ngrams for a quick idea
 That is 15% of all published books
 Google uses it for its machine-translation service
 Amazon, with Kindle, has datafied books too – millions of new books
 But it has decided not to use that for any significant project or analysis (with the exception of its “statistically improbable phrases” service)
 Possibly because books are its core business
42
Data: their essence, their value
Location is also Data
 The introduction of GPS in 1978 allowed the simple datafication of location
 With prices falling from >100$ to <1$, it became possible to collect location data for many different things
 Insurers now price (also) based on location logs
 UPS used geo-location to build – much as Maury built his navigation map of the oceans – an optimized route map for its 60,000 vehicles, saving 30 million miles
 AirSage buys cellphone data to create real-time traffic reports
 Jana uses cellphone data to understand consumer behavior… a powerful tool
 The important point is that these data are used for purposes different from those they were created for
43
Data: their essence, their value
Interaction is also Data
 Facebook's social graph (in 2012) covered >10% of the world's population – all datafied and available to a single company
 It could be used for credit scoring: bad payers tend to cluster with people like them. Facebook could be the next credit-scoring agency
 Twitter (which sells access to its data) is already used to read the “sentiment” about politics, movies, songs…
 Sentiment analysis is now starting to be used to drive investments in the stock market. MarketPsych sells reports on it, covering 18,864 indices across 119 countries
 Social media networks sit on an immense treasure of data, the exploitation of which has only just started
44
Data: their essence, their value
Everything is also Data
 The “Internet of Things” is about sensors on everything, incessantly transmitting data in a format suitable for datafication
 It is starting with fitness, medical, and manufacturing applications
 Zeo has created a database of sleep activity, uncovering differences between men and women
 Heapsylon has created a sock that tells your phone whether you are running well
 Georgia Tech has created an app that allows a phone to monitor a person's body tremor to diagnose and track Parkinson's disease. It is only slightly less effective than the expensive tools used in hospitals
 GreenGoose sells tiny sensors that anyone can attach to objects to measure how much they are used – letting everyone create their own data environment
45
Data: their essence, their value
Datafication is a fundamental project
 It is an infrastructure project rivaling those of the past, like the Roman aqueducts or the Encyclopédie of the Enlightenment
 We may not notice, because we are in the middle of it
 In time, datafication will give us the means to map the world in a quantifiable, analyzable way
 Today, it is mostly used in business to create new forms of value
46
Data: their essence, their value
The Value of Reusing
 You all (annoyingly) type the CAPTCHAs Luis von Ahn invented in 2000
 When von Ahn realized he was wasting 10 seconds of your time 100M times a day, he thought harder
 He invented reCAPTCHA: the second word shown is a digitized word a computer cannot read
 5 consistent user inputs disambiguate that word
 The data has a primary use (proving you are human) and a secondary use (deciphering unclear words)
 And it saves 750M$/yr in manual digitization work
 CAPTCHA = Completely Automated Public Turing test to tell Computers and Humans Apart
47
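A toy Python sketch of the disambiguation step (the inputs are invented and this is not reCAPTCHA's actual implementation): accept a transcription of the unreadable word once enough independent users agree, with the threshold of 5 matching the slide.

```python
# Majority vote over independent user transcriptions of the same word.
from collections import Counter

def disambiguate(transcriptions, threshold=5):
    word, count = Counter(transcriptions).most_common(1)[0]
    return word if count >= threshold else None

user_inputs = ["moral", "morai", "moral", "moral", "moral", "moral"]
print(disambiguate(user_inputs))           # 'moral' - 5 users agree
```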
Data: their essence, their value
A new Value for Data
 Data have always been used and traded
 Prices, content, financial information, personal data…
 But they used to be either ancillary to the business, or narrowly used, as with content or personal information
 Now, all data can become valuable
 Fuel levels from a delivery vehicle
 Readings from heat sensors
 Billions of old search queries
 Old price records for airline tickets
 …
 And the cost of gathering and keeping them keeps falling: in 50 years, storage density has increased by a factor of 50 million…
48
Data: their essence, their value
Data can be reused and multi-used
 The primary use of data is typically evident to whoever collects them: stores for proper accounting, factories for quality control, websites for content optimization, social sites for ads optimization
 But data are not consumed by usage, and can be reused for multiple purposes
 So the full value of data is greater than what is extracted from their first use
 This is called the “option value” of data. They hold a “potential energy”
49
Data: their essence, their value
Reuse
 Search terms are a classic case of reuse
 Hitwise uses search terms to learn about consumer preferences. Will “pink” or “black” be next season's fashion color?
 The Bank of England uses search terms to get a sense of the housing market
 Logistics companies use their records to create business forecasts they sell (under a different company name)
 SWIFT offers GDP forecasts based on the money transfers it handles
 Mobile operators are starting to resell their information (enriched with geo-location) for local advertising and promotions
 They can also sell signal-strength information (with geo-location) to handset manufacturers to improve reception quality
 Large companies are starting to spin off dedicated companies to monetize the option value of their data
50
Data: their essence, their value
Data combination
 At times the dormant value can only be unleashed by combining different datasets – often very different ones
 Cancer and cell phones: a question that had always been hanging around
 The Danes took an N=ALL approach, combining all consumer mobile-operator data from 1987 to 1995, all cancer-patient registers from 1990 to 2007, and the income and education information of every inhabitant
 The result: NO correlation
 With big data, the sum is more valuable than the parts
51
Data: their essence, their value
Data Extensibility
 To enable reuse, design extensibility in from the ground up
 Google Street View was originally collected to power the “street view” in Google Maps. But the data were collected with extensibility in mind, so they can be reused for the functioning of Google's self-driving car
 In-shop cameras (and software) are designed to prevent shoplifting, but they can be extended to provide marketing-relevant data on customer behavior and preferences
 The extra cost of collecting multiple data streams is low, and it can drive massive benefits when a dataset can serve multiple uses
52
Data: their essence, their value
Data Exhaust
 Bad, incorrect, or defective data can also bring value
 Google's spell-checker is built from end-user input: the corrections people type after misspelling a query
 Data exhaust, in general, means the data users leave behind them
 Voice recognition and spam-filter systems improve in a similar way
 Social networks are obviously looking at this
 But other sectors are starting too:
 E-book readers gather an amazing amount of information that could help authors and publishers make better books
 Online education programs can predict student behavior
 This will constitute a huge barrier to entry for new entrants
53
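A toy Python sketch of learning from data exhaust (the query logs are invented, and this is not Google's actual pipeline): count which corrected query users typed right after a misspelled one, and reuse the most frequent rewrite as the suggestion.

```python
# Learn spelling corrections from pairs of (typed, corrected) queries.
from collections import Counter, defaultdict

query_logs = [("recieve", "receive"), ("recieve", "receive"),
              ("recieve", "recieved"), ("teh", "the")]

rewrites = defaultdict(Counter)
for typed, corrected in query_logs:
    rewrites[typed][corrected] += 1

def suggest(query):
    fixes = rewrites.get(query)
    return fixes.most_common(1)[0][0] if fixes else query

print(suggest("recieve"))                  # 'receive'
```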
Data: their essence, their value
What is the value of data?
 Data are an intangible asset, like brand, talent, and strategy
 But they can explain some strange recent events, like the WhatsApp valuation (or the Facebook IPO itself)
 Marketplaces for data are emerging, like Import.io or Factual
 But there is no clear answer yet, partly because most of the value of data lies in their (re)use, not in their possession
54
Implications
Decide.com
 Decide.com had an ambition: to be a price-prediction engine for almost every consumer product
 It scraped the web to obtain 25 billion price observations. Lots of data, and lots of text to be transformed into data
 It identified unnatural behaviors, like prices increasing for an old model at the introduction of a new one
 It spotted unnatural price spikes
 It achieved 77% accuracy and saved, on average, 100$ per purchase
 If the prediction was wrong, it reimbursed the difference
 It got bought by eBay…
 What made it special? The data were available on the Internet, and it did not use any special algorithm…
56
Implications
Ideas matter
 Decide.com had an IDEA. And that idea came from a big-data mindset: they saw the opportunity and understood it could be realized with existing data and tools
 Moving from the data itself to the companies that use data: how does the value chain work?
 There are three types of big-data companies, differentiated by the value they offer:
 The Data
 The Skills
 The Ideas
 (and of course some companies have a mix…)
57
Implications
Who has Data
 Some companies have lots of data, but data is not what they are in business for
 Twitter, for example, turned to two independent companies to license its data to other users
 Telecom companies could do the same – and in some cases they are starting to
 ITA provided data to Farecast; it did not do the job itself, since it would have ended up competing with the airlines
 MasterCard created a division (MasterCard Advisors) to extract value from its data and resell it
58
Implications
Who has Skills
 Consultants, technology vendors, and analytics providers who have the competencies to do the work but have no access to data and no “big-data” mindset
 Accenture is a good example
 Microsoft (Consulting) is another:
 It worked with a hospital in Seattle to analyze years of anonymized medical records to find a way to minimize readmissions
 It found that the mental state of the patient is a key predictor
 Addressing that reduced overall healthcare spending
59
Implications
Who has a big-data mindset (1)
 They see opportunities before the others, and they see what is possible without worrying too early about feasibility
 FlightCaster.com predicts whether a flight will be delayed
 It analyzed every flight over ten years, matched them against weather data, and applied the correlations to current flights and current weather
 The data were all openly available (government-owned), but the government had no interest in using them
 The airlines had no interest either (they would rather hide the delays)
 It worked perfectly… even airline pilots used it…
 They were a first mover – and it was not difficult to copy them
60
Implications
Who has a big-data mindset (2)
 Very often it takes an outsider to come up with a brilliant idea
 Incumbents are often too “encumbered” by their present to think clearly about the future
 Amazon was not founded by a bookseller but by someone coming from a hedge fund…
 eBay was not launched by an auction house but by a software developer…
 Entrepreneurs with a big-data mindset do not normally own the data, but they also lack the vested interests and fears that keep incumbents from using it
61
Implications
Data Intermediaries
 Today, both skills and ideas seem to dominate the value chain, but in the long term most of the value will be in the data themselves
 Data intermediaries will emerge
 Inrix – a traffic-analysis firm
 It gets geo-location data from car manufacturers, taxis, and delivery vans
 It aggregates them, combines them with historical data, weather data, and local-event information, and predicts traffic
 It collects data from rival companies, which could do nothing with their data alone and have no competencies in predictive methods
 What Inrix does benefits its customers, so they get a return themselves (even if not a competitive advantage)
 This “collaboration” is not new (banks have to send their data to the central bank, etc.), but now it concerns a secondary use of data. And maybe a tertiary one: Inrix started using traffic data to provide information on the health of shopping centers and of the economy in general…
62
Implications
What are the experts for?
 In the movie Moneyball, the old “scouts” confront the geek statistician and offer their arguments against him
 “He's got a baseball body… a good face”
 “He has an ugly girlfriend; it means no confidence”
 This shows the shortcomings of human judgment
 Data-driven decisions are poised to augment, and sometimes overrule, human judgment
 The subject-matter expert loses appeal versus the data analyst
 The online training company Coursera uses machine-recorded data to advise teachers on what to improve in their lessons
 Skills in the workplace are changing. Experience is a bit like exactitude: very useful in a small-data world where you need to make many inferences, less useful in a big-data world where the data talk
64
Implications
Who will be the winners
 Large companies will continue to soar. Their advantage will rest on data scale rather than physical scale, and ownership of large sets of data will be a competitive barrier
 But large companies need to adopt the big-data mindset. Rolls-Royce is a good example: using sensors and big data, it transformed from a manufacturer into a services company (charging for usage time and support)
 Small companies will also do well, since they can have “scale without mass”: big data does not require large initial investments – they can license data instead of owning them and rely on cheap cloud computing and storage
 Mid-sized companies will be squeezed in between
 Individuals will likely be able to take advantage of this revolution too. Personal data ownership may empower individual consumers, but it will need new technologies – though companies such as Mydex are already working on it
66
Risks
3 categories of risk
 The Internet already threatened PRIVACY; with big data, the change of scale has produced a change of state. Google knows what we search for, Amazon knows what we buy (or would like to buy), Twitter and Facebook know how we feel and whom we like
 PROPENSITY can now affect our lives: we may see insurance and mortgages denied even if we have never been sick or never missed a payment
 We can fall victim to a DATA DICTATORSHIP, where we fetishize our analyses and end up misusing them
68
Risks
Privacy
 Big Data is not all about personal information (think of the UPS or manhole examples), but much of the data being generated now contains personal information (or can be traced back to it)
 “Smart meters” collect information on electricity usage every 6 minutes. They can tell which appliance you use, and of course when
 The traditional approach to privacy is “notice and consent”, which restricts data to their primary usage
 How can that work in a big-data world where the secondary usages have not even been imagined yet?
 Opting out leaves a trace
 Anonymization does not work either, since big data creates too many cross-references for us to be sure we cannot be identified
69
Risks
Probability and free will
 Parole boards in the US use predictions based on data analysis to decide whether to release somebody from prison
 The US Department of Homeland Security has a project to identify terrorists by monitoring body language and other physiological patterns
 In Los Angeles, the police use big data to select which streets, groups, and individuals should be subject to more surveillance
 It looks like a great idea (preventing crime), but it is dangerous: we may end up punishing the probable criminal
 And while “small data” techniques profiled people based on a (causal) model of the issue at hand, “big data” only looks at correlations – which makes things even more dangerous
71
Risks
A potential bad outcome
 Going back to the Google Flu example:
 What if the government decided to impose a quarantine on the people in the riskier areas?
 The Google algorithm makes it possible to identify them individually
 So they could be quarantined merely for having made the queries…
 But remember: correlation is NOT causation…
72
Remedies
Every revolution brings new rules
 Gutenberg's invention brought censorship, licensing, copyright, freedom of speech, and defamation rules
 At first the focus was on limiting the flow of information; then it edged in the opposite direction
 With the Big Data transformation, we will also need a new set of rules. Simply adapting the existing ones will not be sufficient. And we need to move fast
74
Remedies
A few suggestions
 Privacy should move from end-user consent to data-user accountability
 Big-data users should provide use assessments of the dangers of the intended use
 They should also set a time frame for the usage (and retention) of data, to avoid the “permanent memory” scenario we have today
 Decisions based on big-data predictions must be documented, their algorithms certified, and the predictions disprovable
 Decisions must be framed in a language of risk and avoidance, not in a language of “personal responsibility”
 Judgment must stick to personal responsibility and actual behavior
75
Remedies
A new profession
 As the complexity of finance paved the way for the creation of auditing firms, we will need a new class of experts: the “algorithmists”
 Companies will have internal algorithmists, as they have controllers now, and external ones, as they have auditors
 These people will be the experts who ensure that big-data systems do not remain “black boxes” offering no accountability, traceability, or confidence
76
Remedies
Data Antitrust
 As with any other raw material or key service, access to data must be regulated
 Competition must be ensured, and data transactions enabled through licensing and interoperability
 Governments (and others willing to do so) should publicly release their own data (this is already happening under the name of “Open Data”)
77
Big Data today
 The effects are large on a practical level, providing solutions to real problems
 Big Data is when the “Information Society” becomes real: data (information) takes center stage, and it speaks
 Data will keep increasing
 Messiness will be acceptable in return for capturing far more data
 Correlation is faster and cheaper than causality, so it is often preferable
 Much of the value will come from the secondary use of data
 We will need to establish new principles to govern the change
 Big Data is a resource and a tool. It informs; it does not explain. It points us towards understanding, but it is not the truth
80
Big Data tomorrow
What's past is prologue (William Shakespeare)