SlideShare una empresa de Scribd logo
1 de 36
Bigger Than Any One
      Solving Large Scale Data Problems with People and Machines




Tyler Bell: @twbell
"You just want to survive long
enough to have a long run, to
prove that your product has
value. And in the short term,
human solutions require
much less work. Worry about
scaling when you need to"

            Get 'Data Jujitsu' free at: http://oreilly.
            com/data/radarreports/data-jujitsu.csp
And now for something completely Factual
US Places: 25m entities, one pixel per place
474,691 inserts
4,971,825 updates


     Updates and Inserts Overlay: Jan 15 to Jan 30 2013
Detail: Santa Clara, CA
inserts
updates


          Updates and Inserts Overlay: Jan 15-30 2013
Discussing the Human Role as:


        ●   Turkers (Sloggers)

        ●   Experts (Thinkers)

        ●   Contributors (Players)
Turkers (Sloggers)
A Good Body of Literature Exists:

  ●   CrowdDB "harnessing Human Computation" to
      address the Closed World Assumptions of SQL

  ●   Crowdsourced Databases: "Query Processing with
      People"

  ●   Deco: Declarative Crowdsourcing "Human
      Computation"

  ●   Mob Data Sourcing: a good state-of-the-art survey
Image Source: http://bit.ly/XiCbBW
http://turkrequesters.blogspot.com/2013/01/the-reasons-why-amazon-mechanical-turk.html
"We’ve built a real-time human
computation engine to help us
identify search queries as soon as
they're trending, send these
queries to real humans to be
judged, and then incorporate the
human annotations into our back-
end models."

                                     "Clockwork Raven steps in to do
"we use a small custom pool of       what algorithms cannot: it sends
Mechanical Turk judges to ensure     your data analysis tasks to real
high quality."                       people and gets fast, cheap and
                                     accurate results."



                                      Edwin Chen & Alpa Jai, Twitter Blog, 8 Jan 2013: http://bit.ly/SixU2q
                                                                   Clockwork Raven: http://bit.ly/RkhxkW
Gold Data             Search for Candidate Matches              Resolve Candidates




                                         Compare Attributes




Likely Matches                                Unsure                               Likely Mismatches




                                   Send to MTurk for Evaluation



                                                                                             Factual:
                        Likely Matches                  Likely Mismatches
                                                                                             MT for QA



                                     Compute Accuracy Score




                                             Accuracy
                                              Score
Factual Moderation Queue


  ~60% automatically accepted or rejected based on preset
  rules/filters

  ~30% can be parsed, processed by attribute, and validated by
  MTurkers

  ~10% handled by in-house moderators or 'Trusted Turkers'
Experts (Thinkers)
The Data Science Debate 2012
Peter Skomoroch (LinkedIn), Michael Driscoll (Metamarkets), DJ Patil (Greylock Partners),Toby Segaran (Google), Pete
Warden (Jetpac), Amy Heineike (Quid)




"In data science, domain expertise is more important than
machine learning skill"

                                                                                                 2012 Strata Debate: http://bit.ly/A78ASK
                                                    Mike Loukides, 'The Unreasonable Necessity of Subject Experts' : http://oreil.ly/GAJHD9
"We've discovered
that creative-data
scientists can solve
problems in every
field better than
experts in those
fields can."



"The expertise you
need is strategy
expertise in
answering these
questions."


                       Interview with Jeremy Howard, Chief Scientist, Kaggle: http://slate.me/UM0wQ7
"Data struggles with context.
Human decisions are not discrete
events. They are embedded in
sequences and contexts. The
human brain has evolved to
account for this reality [...] Data
analysis is pretty bad at narrative
and emergent thinking, and it
cannot match the explanatory
suppleness of even a mediocre
novel."


"Data inherently has all of the
foibles of being human, [it] is not a
magic force in society; it’s an
extension of us."

                                                      NYT Article 24 Feb 2013: http://nyti.ms/15cMV80
                                        David Brooks NYT Op-Ed 18 Feb 2013: http://nyti.ms/12EQmpQ
                                             Google Flu Trends: http://www.google.org/flutrends/us/#US
"People from different places
and different backgrounds tend
to produce different sorts of
information. And so we risk
ignoring a lot of important
nuance if relying on big data as
a social/economic/political
mirror."



"Big data is undoubtedly useful for addressing and overcoming many
important issues face by society. But we need to ensure that we
aren't seduced by the promises of big data to render theory
unnecessary."


                           Mark Graham, 'Big Data and the End of Theory', The Guardian, 9 Mar 2012: http://bit.ly/yq87yW
                                                    Chris Anderson, 'The End of Theory', Wired (2008): http://bit.ly/LOU8
"Without
[cleansing and
harmonisation ]
they’re not data
markets; they’re
data jumble
sales."
                                     Edd Dumbill, 'Data Markets Compared', 7 March 2012: http://bit.ly/Ogsh30
                                                                               InfoChimps: http://bit.ly/yIzY8x
                   Paul Miller, 'Discussing Data Markets in New York City', 13 March 2013: http://bit.ly/TFjNnj
Contributors (Players)
"It’s just dumb that a
 "Humans don’t tend to                                           100mil+ people carry GPS
 think in floating point                                         device in their pockets and
 pairs."                                                         we have to buy expensive
                                                                 proprietary data to find out
                                                                 about the shape of where
                                                                 we live."


"It is equal parts sad-
making and hate-making
that we're all still stuck                                        "Seriously, if you’re
suffering the lack of a                                           presenting people
comprehensive and open                                            geographic data, ask them
dataset for places."                                              sometime if you’re getting
                                                                  it right."

                             Aaron Cope: http://www.aaronland.info/weblog/2013/02/03/reality/#youarehere
                             Kellan Elliott-McCrea: http://laughingmeme.org/2013/02/03/are-you-here-a-feature-on-the-side/
Verizon Wireless Precision Market Insights: http://vz.to/13l6Hje
"So far InfoArmy has paid
$146K to researchers
(averaging $23/report), and
some researchers have made
over $5,000 individually. We
are very happy that many hard     ●   Single report owners
working researchers have          ●   High prices
made money. However,              ●   Low quality
without a clear path to revenue   ●   Unpopularity of updates
this model is unsustainable."     ●   Report Abandonment
                                  ●   Sold 44 reports total


                                                  Info Army: https://www.infoarmy.com/
                                               Tech Crunch report: http://tcrn.ch/VsXnCv
Apple Maps'
Report a Problem




                   Image: http://www.macrumors.com/2012/09/24/how-to-report-a-problem-with-ios-6-maps-data/
40 million users

65,000 Waze users made 500 million edits.

Users fixed 70% of system-detected map problems and 100% of all user-
reported map problems.

Almost all user-reported map problems were taken care of within a week.

Users added more than 50,000 gas stations on the map in the first month.

Users spend 7 hours a month on the platform.

                    'Waze Now Lets Users Instantly Map Closed Roads, Acts Like Google Maps On Steroids': http://bit.ly/15iHyEo
Factual
provides both 'Flag' and 'Write' interfaces




                 (this is good)
                              Factual Write API: http://developer.factual.com/display/docs/Core+API+-+Writes
                                       Flag API: http://developer.factual.com/display/docs/Core+API+-+Flag
With No Means to Follow-Up




     (this is bad. But we're on it)
Now: extend this thinking to organizations
Ryan Kim, GigaOm, 6 Nov 2012 http://bit.ly/SLb1C7
Play ball with the competitors: "Think instead about collaborating with all
the other players and getting the network effects of a larger audience and a
bigger pool of offers. [...] The individual companies behind the consortium
still compete to acquire subscribers, but they now collaborate with a
common platform for monetizing those subscribers through marketing."




                                     Alistair Goodman, CEO Placecast, All Things D, 7 Feb 2012 http://bit.ly/YA8Nq4
"Seeds [...] are one of the original information storage devices, it’s almost
hard to understand why libraries haven’t always included seeds"




                                                Long Now Foundation: 'Seeds are the new books": http://blog.
                                                        longnow.org/02013/02/26/seeds-are-the-new-books/
Data In game theory and economic theory, a zero–sum game is a
mathematical representation of a situation in which a participant's
gain (or loss) of utility is exactly balanced by the losses (or gains)
of the utility of the other participant(s). If the total gains of the
participants are added up, and the total losses are subtracted, they
will sum to zero. Thus cutting a cake, where taking a larger piece
reduces the amount of cake available for others, is a zero–sum
game if all participants value each unit of cake equally (see
marginal utility). In contrast, non-zero–sum describes a situation in
which the interacting parties' aggregate gains and losses are either
less than or more than zero. A zero–sum game is also called a
strictly competitive game while non-zero–sum games can be either
competitive or non-competitive. Zero–sum games are most often
solved with the minimax theorem which is closely related to linear
programming duality or with Nash equilibrium.
                                                        http://en.wikipedia.org/wiki/Zero_sum
Tyler Bell
  Factual

    @twbell

tyler@factual.com

Más contenido relacionado

Similar a Bigger than Any One: Solving Large Scale Data Problems with People and Machines

Strata Conference NYC 2013 Full Version
Strata Conference NYC 2013 Full VersionStrata Conference NYC 2013 Full Version
Strata Conference NYC 2013 Full Version
Taewook Eom
 

Similar a Bigger than Any One: Solving Large Scale Data Problems with People and Machines (20)

Fontys Eric van Tol
Fontys Eric van TolFontys Eric van Tol
Fontys Eric van Tol
 
Big Data and the Future of Journalism (Futurist Keynote Speaker Gerd Leonhard...
Big Data and the Future of Journalism (Futurist Keynote Speaker Gerd Leonhard...Big Data and the Future of Journalism (Futurist Keynote Speaker Gerd Leonhard...
Big Data and the Future of Journalism (Futurist Keynote Speaker Gerd Leonhard...
 
Machine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldMachine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our World
 
Présentation de Bruno Schroder au 20e #mforum (07/12/2016)
Présentation de Bruno Schroder au 20e #mforum (07/12/2016)Présentation de Bruno Schroder au 20e #mforum (07/12/2016)
Présentation de Bruno Schroder au 20e #mforum (07/12/2016)
 
Opportunities with data science
Opportunities with data scienceOpportunities with data science
Opportunities with data science
 
Origins of the Marketing Intelligence Engine (SXSW 2015)
Origins of the Marketing Intelligence Engine (SXSW 2015)Origins of the Marketing Intelligence Engine (SXSW 2015)
Origins of the Marketing Intelligence Engine (SXSW 2015)
 
Strata Conference NYC 2013 Full Version
Strata Conference NYC 2013 Full VersionStrata Conference NYC 2013 Full Version
Strata Conference NYC 2013 Full Version
 
Big Data Re-Told
Big Data Re-ToldBig Data Re-Told
Big Data Re-Told
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference
 
Big Data v. Small data - Rules to thumb for 2015
Big Data v. Small data - Rules to thumb for 2015Big Data v. Small data - Rules to thumb for 2015
Big Data v. Small data - Rules to thumb for 2015
 
Jobs Complexity
Jobs ComplexityJobs Complexity
Jobs Complexity
 
The Need for Deep Learning Transparency
The Need for Deep Learning TransparencyThe Need for Deep Learning Transparency
The Need for Deep Learning Transparency
 
Sweden future of ai 20180921 v7
Sweden future of ai 20180921 v7Sweden future of ai 20180921 v7
Sweden future of ai 20180921 v7
 
Enabling the data driven enterprise
Enabling the data driven enterpriseEnabling the data driven enterprise
Enabling the data driven enterprise
 
Future of technology
Future of technologyFuture of technology
Future of technology
 
Big data tokyo (extended version)
Big data tokyo  (extended version)Big data tokyo  (extended version)
Big data tokyo (extended version)
 
Designing AI for Humanity at dmi:Design Leadership Conference in Boston
Designing AI for Humanity at dmi:Design Leadership Conference in BostonDesigning AI for Humanity at dmi:Design Leadership Conference in Boston
Designing AI for Humanity at dmi:Design Leadership Conference in Boston
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Inside Out and Upside Down - FOO Camp 2016 - Peter Coffee
Inside Out and Upside Down - FOO Camp 2016 - Peter CoffeeInside Out and Upside Down - FOO Camp 2016 - Peter Coffee
Inside Out and Upside Down - FOO Camp 2016 - Peter Coffee
 
Why CxOs care about Data Governance; the roadblock to digital mastery
Why CxOs care about Data Governance; the roadblock to digital masteryWhy CxOs care about Data Governance; the roadblock to digital mastery
Why CxOs care about Data Governance; the roadblock to digital mastery
 

Bigger than Any One: Solving Large Scale Data Problems with People and Machines

  • 1. Bigger Than Any One Solving Large Scale Data Problems with People and Machines Tyler Bell: @twbell
  • 2. "You just want to survive long enough to have a long run, to prove that your product has value. And in the short term, human solutions require much less work. Worry about scaling when you need to" Get 'Data Jujitsu' free at: http://oreilly. com/data/radarreports/data-jujitsu.csp
  • 3. And now for something completely Factual
  • 4. US Places: 25m entities, one pixel per place
  • 5. 474,691 inserts 4,971,825 updates Updates and Inserts Overlay: Jan 15 to Jan 30 2013
  • 7. inserts updates Updates and Inserts Overlay: Jan 15-30 2013
  • 8. Discussing the Human Role as: ● Turkers (Sloggers) ● Experts (Thinkers) ● Contributors (Players)
  • 10. A Good Body of Literature Exists: ● CrowdDB "harnessing Human Computation" to address the Closed World Assumptions of SQL ● Crowdsourced Databases: "Query Processing with People" ● Deco: Declarative Crowdsourcing "Human Computation" ● Mob Data Sourcing: a good state-of-the-art survey
  • 12.
  • 14. "We’ve built a real-time human computation engine to help us identify search queries as soon as they're trending, send these queries to real humans to be judged, and then incorporate the human annotations into our back- end models." "Clockwork Raven steps in to do "we use a small custom pool of what algorithms cannot: it sends Mechanical Turk judges to ensure your data analysis tasks to real high quality." people and gets fast, cheap and accurate results." Edwin Chen & Alpa Jai, Twitter Blog, 8 Jan 2013: http://bit.ly/SixU2q Clockwork Raven: http://bit.ly/RkhxkW
  • 15. Gold Data Search for Candidate Matches Resolve Candidates Compare Attributes Likely Matches Unsure Likely Mismatches Send to MTurk for Evaluation Factual: Likely Matches Likely Mismatches MT for QA Compute Accuracy Score Accuracy Score
  • 16. Factual Moderation Queue ~60% automatically accepted or rejected based on preset rules/filters ~30% can be parsed, processed by attribute, and validated by MTurkers ~10% handled by in-house moderators or 'Trusted Turkers'
  • 18. The Data Science Debate 2012 Peter Skomoroch (LinkedIn), Michael Driscoll (Metamarkets), DJ Patil (Greylock Partners),Toby Segaran (Google), Pete Warden (Jetpac), Amy Heineike (Quid) "In data science, domain expertise is more important than machine learning skill" 2012 Strata Debate: http://bit.ly/A78ASK Mike Loukides, 'The Unreasonable Necessity of Subject Experts' : http://oreil.ly/GAJHD9
  • 19. "We've discovered that creative-data scientists can solve problems in every field better than experts in those fields can." "The expertise you need is strategy expertise in answering these questions." Interview with Jeremy Howard, Chief Scientist, Kaggle: http://slate.me/UM0wQ7
  • 20. "Data struggles with context. Human decisions are not discrete events. They are embedded in sequences and contexts. The human brain has evolved to account for this reality [...] Data analysis is pretty bad at narrative and emergent thinking, and it cannot match the explanatory suppleness of even a mediocre novel." "Data inherently has all of the foibles of being human, [it] is not a magic force in society; it’s an extension of us." NYT Article 24 Feb 2013: http://nyti.ms/15cMV80 David Brooks NYT Op-Ed 18 Feb 2013: http://nyti.ms/12EQmpQ Google Flu Trends: http://www.google.org/flutrends/us/#US
  • 21. "People from different places and different backgrounds tend to produce different sorts of information. And so we risk ignoring a lot of important nuance if relying on big data as a social/economic/political mirror." "Big data is undoubtedly useful for addressing and overcoming many important issues face by society. But we need to ensure that we aren't seduced by the promises of big data to render theory unnecessary." Mark Graham, 'Big Data and the End of Theory', The Guardian, 9 Mar 2012: http://bit.ly/yq87yW Chris Anderson, 'The End of Theory', Wired (2008): http://bit.ly/LOU8
  • 22. "Without [cleansing and harmonisation ] they’re not data markets; they’re data jumble sales." Edd Dumbill, 'Data Markets Compared', 7 March 2012: http://bit.ly/Ogsh30 InfoChimps: http://bit.ly/yIzY8x Paul Miller, 'Discussing Data Markets in New York City', 13 March 2013: http://bit.ly/TFjNnj
  • 24. "It’s just dumb that a "Humans don’t tend to 100mil+ people carry GPS think in floating point device in their pockets and pairs." we have to buy expensive proprietary data to find out about the shape of where we live." "It is equal parts sad- making and hate-making that we're all still stuck "Seriously, if you’re suffering the lack of a presenting people comprehensive and open geographic data, ask them dataset for places." sometime if you’re getting it right." Aaron Cope: http://www.aaronland.info/weblog/2013/02/03/reality/#youarehere Kellan Elliott-McCrea: http://laughingmeme.org/2013/02/03/are-you-here-a-feature-on-the-side/
  • 25. Verizon Wireless Precision Market Insights: http://vz.to/13l6Hje
  • 26. "So far InfoArmy has paid $146K to researchers (averaging $23/report), and some researchers have made over $5,000 individually. We are very happy that many hard ● Single report owners working researchers have ● High prices made money. However, ● Low quality without a clear path to revenue ● Unpopularity of updates this model is unsustainable." ● Report Abandonment ● Sold 44 reports total Info Army: https://www.infoarmy.com/ Tech Crunch report: http://tcrn.ch/VsXnCv
  • 27. Apple Maps' Report a Problem Image: http://www.macrumors.com/2012/09/24/how-to-report-a-problem-with-ios-6-maps-data/
  • 28. 40 million users 65,000 Waze users made 500 million edits. Users fixed 70% of system-detected map problems and 100% of all user- reported map problems. Almost all user-reported map problems were taken care of within a week. Users added more than 50,000 gas stations on the map in the first month. Users spend 7 hours a month on the platform. 'Waze Now Lets Users Instantly Map Closed Roads, Acts Like Google Maps On Steroids': http://bit.ly/15iHyEo
  • 29. Factual provides both 'Flag' and 'Write' interfaces (this is good) Factual Write API: http://developer.factual.com/display/docs/Core+API+-+Writes Flag API: http://developer.factual.com/display/docs/Core+API+-+Flag
  • 30. With No Means to Follow-Up (this is bad. But we're on it)
  • 31. Now: extend this thinking to organizations
  • 32. Ryan Kim, GigaOm, 6 Nov 2012 http://bit.ly/SLb1C7
  • 33. Play ball with the competitors: "Think instead about collaborating with all the other players and getting the network effects of a larger audience and a bigger pool of offers. [...] The individual companies behind the consortium still compete to acquire subscribers, but they now collaborate with a common platform for monetizing those subscribers through marketing." Alistair Goodman, CEO Placecast, All Things D, 7 Feb 2012 http://bit.ly/YA8Nq4
  • 34. "Seeds [...] are one of the original information storage devices, it’s almost hard to understand why libraries haven’t always included seeds" Long Now Foundation: 'Seeds are the new books": http://blog. longnow.org/02013/02/26/seeds-are-the-new-books/
  • 35. Data In game theory and economic theory, a zero–sum game is a mathematical representation of a situation in which a participant's gain (or loss) of utility is exactly balanced by the losses (or gains) of the utility of the other participant(s). If the total gains of the participants are added up, and the total losses are subtracted, they will sum to zero. Thus cutting a cake, where taking a larger piece reduces the amount of cake available for others, is a zero–sum game if all participants value each unit of cake equally (see marginal utility). In contrast, non-zero–sum describes a situation in which the interacting parties' aggregate gains and losses are either less than or more than zero. A zero–sum game is also called a strictly competitive game while non-zero–sum games can be either competitive or non-competitive. Zero–sum games are most often solved with the minimax theorem which is closely related to linear programming duality or with Nash equilibrium. http://en.wikipedia.org/wiki/Zero_sum
  • 36. Tyler Bell Factual @twbell tyler@factual.com