Knowledge graphs are used in various applications and have been widely analyzed. A question that has received little attention is: what does it cost to produce them? In this paper, we propose ways to estimate the cost of those knowledge graphs. We show that the cost of manually curating a triple is between $2 and $6, and that automatically created knowledge graphs are cheaper by a factor of 15 to 150 (i.e., 1c to 15c per statement). Furthermore, we advocate taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
3. 10/15/18 Heiko Paulheim 3
...and Today’s Calculations
• So what would have been a good price for Freebase?
• Some back of the envelope calculations...
What Do We Know?
• No. of facts in Freebase: 3B
• Cost of creating a single fact: unknown
• Freebase was edited much like a wiki
– assumption: adding a fact is as expensive as adding a sentence in Wikipedia
Cost of Manual Triple Creation
• Assumption: adding a fact is as expensive as adding a sentence in Wikipedia
– English Wikipedia up to April 2011: 41M working hours (Geiger and Halfaker, 2013)
– size in April 2011: 3.6M pages, avg. 36.4 sentences each
→ 18.7 minutes per sentence
– using US minimum wage: $2.25 per sentence
→ $2.25 per statement
• Result: the total cost of creating Freebase would be about $6.75B (3B facts × $2.25)
• Cyc
– Total development cost: $120M (according to a presentation by Lenat in 2017)
– Total #statements: 21M
→ $5.71 per statement
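The arithmetic on this slide can be retraced with a short sketch. The input figures are the ones quoted above; the hourly wage of $7.25 is an assumption (the US federal minimum wage), since the slide only says "US minimum wage", and all variable and function names are illustrative.

```python
# Back-of-the-envelope retrace of the manual-curation estimate.
WIKIPEDIA_HOURS = 41e6      # working hours up to April 2011 (Geiger and Halfaker, 2013)
WIKIPEDIA_PAGES = 3.6e6     # pages in April 2011
SENTENCES_PER_PAGE = 36.4   # average sentences per page
HOURLY_WAGE_USD = 7.25      # ASSUMPTION: US federal minimum wage, not stated on the slide
FREEBASE_FACTS = 3e9        # number of facts in Freebase


def minutes_per_sentence(hours, pages, sentences_per_page):
    """Average editing time per Wikipedia sentence, in minutes."""
    return hours * 60 / (pages * sentences_per_page)


def cost_per_sentence(minutes, hourly_wage_usd):
    """Labor cost in USD for the given number of minutes of work."""
    return minutes / 60 * hourly_wage_usd


minutes = minutes_per_sentence(WIKIPEDIA_HOURS, WIKIPEDIA_PAGES, SENTENCES_PER_PAGE)
sentence_cost = cost_per_sentence(minutes, HOURLY_WAGE_USD)
freebase_total = FREEBASE_FACTS * sentence_cost

# Cyc: total development cost divided by the number of statements.
cyc_cost_per_statement = 120e6 / 21e6

print(f"minutes per sentence:    {minutes:.2f}")
print(f"cost per sentence (USD): {sentence_cost:.2f}")
print(f"Freebase total (USD):    {freebase_total:.3e}")
print(f"Cyc per statement (USD): {cyc_cost_per_statement:.2f}")
```

Without intermediate rounding this gives roughly 18.8 minutes and $2.27 per sentence, slightly above the rounded 18.7 minutes and $2.25 on the slide; the Freebase total shifts accordingly.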
Cost of Automatic/Heuristic Creation
• DBpedia
– 4.9M LOC, 2.2M LOC for mappings
– software project development: ~37 LOC per hour (Devanbu et al., 1996)
– we use German PhD salaries as a cost estimate
→ 1.85c per statement
• YAGO
– made from 1.6M LOC
– uses WordNet: 117k synsets, we treat each synset like a Wiki page
→ 0.83c per statement
• NELL
– 103k LOC
→ 14.25c per statement
• Compared to manual curation: saving factor 16-250
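The estimation scheme behind these figures (lines of code, converted to developer hours, then to salary cost, then divided by the number of statements) can be sketched as follows. Only the productivity figure of ~37 LOC per hour comes from the slide; the function names are illustrative, and the inputs in the first example are hypothetical placeholders, because the slides give neither the hourly salary nor the statement counts used.

```python
def cost_per_statement_cents(lines_of_code, statements, hourly_wage_usd,
                             loc_per_hour=37.0):
    """Estimate the cost per statement in US cents from code size.

    loc_per_hour defaults to the ~37 LOC/hour quoted on the slide
    (Devanbu et al., 1996); every other input must be supplied by the caller.
    """
    hours = lines_of_code / loc_per_hour
    total_usd = hours * hourly_wage_usd
    return 100 * total_usd / statements


def saving_factor(manual_usd, automatic_cents):
    """How many times cheaper an automatic statement is than a manual one."""
    return manual_usd / (automatic_cents / 100)


# HYPOTHETICAL inputs, chosen only to make the arithmetic easy to follow:
# 37,000 LOC, 1,000,000 statements, $30 per hour.
example_cents = cost_per_statement_cents(37_000, 1_000_000, 30.0)
print(f"example cost per statement: {example_cents:.2f}c")

# Figures from the slides: $2.25 (Wikipedia-based estimate) vs. 14.25c (NELL).
low_factor = saving_factor(2.25, 14.25)
print(f"saving factor, $2.25 vs. 14.25c: {low_factor:.1f}")
```

The second call pairs the cheapest manual estimate with the most expensive automatic one, which gives a factor of about 16, the lower end of the range on the slide. The slides do not say which pairing produces the upper end.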
Cost vs. Quality
• Graph error rate against cost
– we can pay for accuracy
– NELL is a bit of an outlier
• Error rates according to Färber et al. (2018), Mitchell et al. (2015)
Summary
• We can estimate the cost of KG creation
• A manually curated triple costs about $2 to $6
• An automatically/heuristically created triple costs about 1c-15c
– saving factor: around 100
• We can observe a relation between cost and quality
Open Questions
• Debatable approximations
– can we do better?
• Rate KG refinement approaches by their cost
• What about Wikidata?
• What about the provision and maintenance cost?
• ...and...
...back to the Initial Question
• Acquisition by Google: estimated as $60-$300M!
• Estimated value of Freebase: ~$6.75B
• Was that a good deal?
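As a rough illustration (the ratios are not stated on the slides), dividing the estimated value of Freebase by the upper and lower acquisition estimates gives:

```latex
\[
\frac{\$6.75\,\text{B}}{\$300\,\text{M}} = 22.5,
\qquad
\frac{\$6.75\,\text{B}}{\$60\,\text{M}} = 112.5
\]
```

Under these estimates, the creation cost would exceed the acquisition price by a factor of roughly 20 to 110.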