4. ‘ontogeny recapitulates phylogeny’
or
A (Very) Brief History of ML
• Late 1950s: Noam Chomsky ‘generative
grammars’
• 1969: Charles Goldfarb (w/ Ed Mosher and
Ray Lorie) created GML
• 1986: SGML formalized
• 1998: XML 1.0 W3C recommendation
• 1998 – 2012: A lot of stuff happened
• Future: XML 2.0 … microXML ?
5. RDBMS Goliath vs XML David
• Back then, XML was the proto ‘NoSQL’
• The X in AJAX
• Now there are many ‘Davids’
• AJAJ (AJAX with JSON in place of XML)
6. Documents
• Back then, it wasn’t unusual for a vendor to say
‘tough luck’ about your data (pay up)
• Now, most office documents are in XML
7. The ‘long tail’ of XML Vocabularies
• Back then, vocabularies were built with
proprietary approaches
• Today, 1,000s of vocabularies are based on
XML
– ‘2012 U.S. GAAP Taxonomy Adopted by SEC;
FASB Webcast April 3’
9. Back then, XML/Markup Conferences
• Software Development 99 East, November 8-13, 1999, Washington D.C.
• XML One Fall 99, November 8-11, 1999, Santa Clara, CA
• XML '99 December 6-9, 1999, Philadelphia PA
• Markup Technologies '99 Conference December 5-9, 1999, Philadelphia
• Web Design 2000, February 7-9, 2000, Atlanta
• XTech '2000, February 27-March 2, San Jose
• Software Development 2000 West, March 20-24, 2000, San Jose
• Sixteenth International Unicode Conference, Boston, March 27-30, 2000
• The Ninth International World Wide Web Conference, May 15-19, 2000,
Amsterdam
• DL 2000: Fifth ACM Conference on Digital Libraries, June 3-6 2000, Texas
• XML Europe 2000, June 12-16, Paris
• Web Design World 2000, July 17-21, 2000, Seattle, Washington
• MetaStructures, August 14-16, 2000, Montreal, Quebec, Canada
• XML Developers Conference, August 17-18, 2000, Montreal, Quebec
• Internet World Expo, October 25-27, 2000, New York City
• XML 2000/Markup Technologies 2000, December 3-7, Washington
• ….. Even a Geek Cruises XML Excursion - January 2001
10. Today - XML/Markup Conferences
• The XML ‘parallelogram’
– Balisage
– XML Summer School
– XML Prague
– XML Amsterdam
• Xtech*
• markupForum
• XATA
• MarkLogic World (600 ppl)
• databaseX (London November 2013 ?)
11. Other important good stuff
• Evolution of the Operating System
– Unix is the operating system for text
– Windows tried to be the operating system for
binaries, then adopted XML … a mixed bag
– Java (the VM) has a strong XML stack
• The web changed everything to text-based
markup
• Cheap RAM/disk/CPU
• Virtualization = scale out
12. Other important good stuff
http://googleblog.blogspot.cz/2012/02/unicode-over-60-percent-of-web.html
13. Unfair to point out failures ?
• Namespaces
• XLink
• WS* astronautics
• Draconian error checking
• XML Schema
• XForms
• XSLT 1.0 (or any XML) in the browser
• XHTML vs HTML5
• Too many specs (modularity good, complexity
bad)
14. “Winning isn’t everything. There
should be no conceit in victory and no
despair in defeat.” - Matt Busby
• 2001 I was the RDBMS serial killer
– ‘kill RDBMS’
• Define successful ?
– Adoption ?
– Cheaper ?
– Faster ?
– Better ?
15. Drill-down distraction - Why is XQuery
successful/productive ?
• Chose my most successful (ad hoc stories,
visible success)
• Functional, dynamic … works with structure,
text and values … stored proc + query language
• XPath
• Is it possible to qualify/quantify XQuery
productivity?
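The “structure, text and values in one query” claim can be sketched outside XQuery too. A minimal illustration (my own, not from the talk) using Python’s stdlib `ElementTree` and its limited XPath support; the `orders` document and the `Acme` filter are invented for the example, and the shape of the loop is roughly what an XQuery FLWOR expression does declaratively:

```python
# Illustrative only: selecting by structure, reading text, and
# computing over values in a single pass -- the style of query the
# slide attributes to XQuery, sketched with the Python stdlib.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<orders>
  <order id="1"><customer>Acme</customer><total>120.50</total></order>
  <order id="2"><customer>Birch</customer><total>75.00</total></order>
  <order id="3"><customer>Acme</customer><total>19.99</total></order>
</orders>
""")

# Structure (order elements), text (customer names), values (totals):
acme_total = sum(
    float(o.findtext("total"))
    for o in doc.findall("order")
    if o.findtext("customer") == "Acme"
)
print(round(acme_total, 2))  # 140.49
```

In XQuery the equivalent is a one-line FLWOR expression, which is the productivity point the slide is making.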
19. Developing an Enterprise Web
Application in XQuery - 2009, Martin
Kaufmann, Donald Kossmann
Lines of code:
             Java/J2EE   XQuery
Model        3100        240
View         4100        1500
Controller   900         1180
Total        8100 (?)    2920 (3490)
20. Nooooo! The problem with LOC
• Correlation of failure with very high LOC is
the only certain fact about LOC
• That’s about it
21. An empirical comparison of C, C++,
Java, Perl, Python, Rexx, and Tcl for a
search/string-processing program
Lutz Prechelt (prechelt@ira.uka.de), Fakultät für Informatik,
Universität Karlsruhe
Language   #LOC per function point
C          91
C++        53
Java       54
Perl       21
* Designing and writing programs in dynamic languages tended to
take half as long and resulted in half the code.
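The table can be turned into a back-of-envelope estimator. This is my own illustration (not from the study): the 500-function-point project size is hypothetical, and the per-language averages are the slide’s figures:

```python
# Rough LOC estimate from the slide's LOC-per-function-point averages.
LOC_PER_FP = {"C": 91, "C++": 53, "Java": 54, "Perl": 21}

def estimated_loc(function_points, language):
    """Back-of-envelope LOC for a project of the given FP size."""
    return function_points * LOC_PER_FP[language]

fp = 500  # hypothetical project size in function points
for lang in LOC_PER_FP:
    print(f"{lang}: ~{estimated_loc(fp, lang)} LOC")
# Perl at ~10,500 LOC vs C at ~45,500 LOC is the "half the time,
# half the code" dynamic-language effect the footnote mentions.
```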
29. Do you think XQuery makes you a
more productive programmer ?
Yes 67%
Maybe 14%
No 10%
Don’t know 8%
30. Is XQuery more productive than Java
in developing web-based data applications ?
58%
22%
12%
8%
31. Time to bust one myth
• ‘XML is too slow and bloated’
• http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
• In data-oriented AJAJ scenarios, JSON is at
best ~30% faster in most benchmarks today,
with less load (so more with fewer resources)
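The point can be checked locally. A small illustrative micro-benchmark (my own, not the navioo.com benchmark) parsing equivalent JSON and XML payloads with the Python stdlib; absolute numbers vary by machine and parser, so the code makes no claim about the exact gap:

```python
# Illustrative micro-benchmark: parse equivalent JSON and XML payloads.
# The interesting observation is that the gap is modest, not the old
# "2-10x faster" claim -- but measure on your own stack.
import json
import timeit
import xml.etree.ElementTree as ET

json_payload = '{"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}'
xml_payload = ('<items><item id="1"><name>a</name></item>'
               '<item id="2"><name>b</name></item></items>')

json_time = timeit.timeit(lambda: json.loads(json_payload), number=10_000)
xml_time = timeit.timeit(lambda: ET.fromstring(xml_payload), number=10_000)
print(f"JSON: {json_time:.3f}s  XML: {xml_time:.3f}s")

# Sanity check: both encodings carry the same data.
from_json = [(i["id"], i["name"]) for i in json.loads(json_payload)["items"]]
from_xml = [(int(i.get("id")), i.findtext("name"))
            for i in ET.fromstring(xml_payload)]
assert from_json == from_xml
```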
44. Channel effect of Aging in Technology
• “Average age of @guardian Facebook
audience is 29. Website is 37, print paper 44.
Amazing channel effect, really. #newsrw”
• Babyboomers, Gen X, Y and Z
• I feel a bit uneasy framing generational
arguments …
45. Death of the XML Child
…Overachieving Child Prodigies
grow up
51. 2001 Edd Dumbill – xml.com
‘Stop the XML hype, I want to get off
As editor of XML.com, I welcome the massive
success XML has had. But things prized by the XML
community — openness and interoperability — are
getting swallowed up in a blaze of marketing hype. Is this
the price of success, or something we can avoid? ‘
Source: Edd Dumbill (March 2001)
52. 2012 Edd Dumbill g+ post
‘For many years I was the editor of XML.com,
and the chair of the XML Europe conference.
Today, it seems that XML's mission to be a web
language is mostly dead. I'm not saying XML is
useless: it has proved itself as a more easily-used
SGML, but I'm not sure it's expanded too far
outside of that.’
Source: Edd Dumbill (March 2012)
53. Current Status: XML is dead
• XML fought too many battles (RDBMS, NoSQL,
web developers, HTML5)
• Age channeling and the hype curve are in effect
• But XML technology stack is embracing JSON
etc …
• No room for sentimentality in technology
58. Is XML Applicable to Big Data ?
• We know it is, that’s why I am here
• Some of you already know
• Need to dig into the detail
• But we first need to simplify things
62. Managing data variability, volume & velocity is hard
You need to be a (data) scientist to build this rocket ship.
63. So what’s the problem again ?
#1 – How to apply Modern XML to your BigData
problems ?
#1a – The XML milieu is too complicated; need to identify
what is successful as Modern XML
#1b – BigData is a huge opportunity
#1c – BigData has a huge learning curve and high risks
64. Solving #1 – Defining Modern XML
• Identify the technologies
• Identify and classify the Scenarios
65. Modern XML Technology analysis
• Internal survey of ML Customer projects &
External survey of projects (w/ pref towards
Big/Complex projects)
• Informal Survey (polldaddy)
• Qualitative and quantitative
66. Eisenhower - "What is important is seldom
urgent and what is urgent is seldom important."
              URGENT         NOT URGENT
IMPORTANT     Critical       Goals
NOT IMPORTANT Interruptions  Distractions
67. Survey Interpretations
• XML 1.0, Namespaces is important now
• XProc, XHTML important now
• XSLT 2 and XQuery 1 very important now
• XSLT 2 and XQuery 2 in the browser near future
• XQuery 3.0 important near future
• SAX/DOM now, XOM possible future
• XML Schema 1.0 now, 1.1 for the near future
• Schematron surprising
• Semweb is for the future
• SVG and MathML due to web browser support
• XML vocabularies have a very ‘long tail’
68. Modern XML – Technology Candidates
Core          XML 1.0, Namespaces
Other         –
Transform     XSLT 2.0, XQuery 1.0
Processing    SAX, DOM
Schema        Schematron, XML Schema 1.0
Semantics     RDF, OWL
Vocabularies  Office Doc ML, SVG
These technologies trended highly across all analyses.
Bold – could be trending due to browser implementations / historical dependencies.
69. Modern XML – Tier 1
Core          XML 1.0, Namespaces
Other         XProc
Transform     XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0
Processing    SAX, DOM
Schema        Schematron, XML Schema 1.0 / 1.1
Semantics     RDF, OWL
Vocabularies  Office Doc ML, SVG
These technologies trended highly across all analyses.
Bold – could be trending due to browser implementations / historical dependencies.
Italic – strong signal: early usage, interest in unproven specs/tech.
70. Modern XML
Core – Tier 1: XML 1.0, Namespaces; Tier 2: XML Canonicalization, xml:id
Other – Tier 1: XProc; Tier 2: XHTML*
Transform – Tier 1: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0; Tier 2: XSLT 1.0
Processing – Tier 1: SAX, DOM; Tier 2: XOM, StAX
Schema – Tier 1: Schematron, XML Schema 1.0 / 1.1; Tier 2: RELAX NG
Semantics – Tier 1: RDF, OWL; Tier 2: SPARQL
Vocabularies – Tier 1: Office Doc ML, SVG; Tier 2: MathML, DocBook, SOAP*, DITA, EPUB
71. [image: the XML ‘big picture’ – http://kensall.com/big-picture/bigpix22.html]
72. Modern XML
Core – Tier 1: XML 1.0, Namespaces; Tier 2: XML Canonicalization, xml:id, XML Infoset
Other – Tier 1: XProc; Tier 2: XHTML*
Transform – Tier 1: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0; Tier 2: XSLT 1.0
Processing – Tier 1: SAX, DOM; Tier 2: XOM, StAX
Schema – Tier 1: Schematron, XML Schema 1.0 / 1.1; Tier 2: RELAX NG
Semantics – Tier 1: RDF, OWL; Tier 2: SPARQL
Vocabularies – Tier 1: Office Doc ML, SVG; Tier 2: MathML, DocBook, SOAP, DITA, EPUB
Data Formats – XML, text, binary, JSON
73. The technology triggers
• XML database – reduces the complexity/risk of
BigData
– MarkLogic
– eXist
– Zorba
– Sedna
– BaseX
– Others (Oracle!)
• XQuery – rapid prototyping
• Avoid purist architectures, embrace
heterogeneity
74. Modern XML / BigData Scenarios
• Classic Scenarios
– Document (xml) Database
– Aggregation
– Enterprise Search
– Heterogeneous Content store
– Publishing
• BigData Scenarios
– BigData ‘classic’
– Extreme personalisation
– Predictive analytics
– Financial analysis
– Realtime analysis (management/financial)
– Actionable intelligence
• Semantic Web – too early to categorize but it’s for real
75. Solving Problem #2 – Focus on the
Practicalities
• What type of Big Data problem do you have ?
– The urgent, important ones you know about
– The urgent, important ones you don’t know about
• Create a dedicated team (analytics, problem
domain experts) to identify the latter
• Assess data maturity (Data Audit)
• With power comes responsibility … Ethical
Analytics
76. BigData Tech Advice
• Start using an XML database ASAP!
• Don’t get distracted by the zoo and start
hadooping right away
• ‘Data outlives code’ – spend more time on the
data: clean abstractions, keeping it cogent, opening it up
77. Size appropriately
Volume – relative to your current capability;
is the requirement a magnitude greater than your
current infrastructure can scale to?
Velocity – updates versus reads ? High volatility
with realtime queries ?
Variety – managing versioning ?
Complexity – multiples, complex processes
78. Size Appropriately: Are you a
‘Facebook’ (Google, Yahoo…) ?
• 2.5 billion content items shared per day (status updates + wall posts +
photos + videos + comments)
• 2.7 billion Likes per day
• 300 million photos uploaded per day
• 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS)
clusters
• 105 terabytes of data scanned via Hive, Facebook’s Hadoop query
language, every 30 minutes
• 70,000 queries executed on these databases per day
• 500+ terabytes of new data ingested into the databases every day
• Are you planning to scale out to ~180,900 servers ?
• ~18,000 database servers ingesting 500+ terabytes of data through a
guesstimated 50+ billion calls … a day!
http://www.datacenterknowledge.com/the-facebook-data-center-faq/
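A quick back-of-envelope division of the slide’s figures (my own arithmetic, rounded; the inputs are themselves guesstimates) shows why the per-server numbers matter for sizing:

```python
# Back-of-envelope arithmetic from the slide's Facebook figures.
servers = 18_000                 # ~database servers
ingest_tb_per_day = 500          # new data ingested per day
calls_per_day = 50_000_000_000   # guesstimated calls per day

gb_per_server_per_day = ingest_tb_per_day * 1024 / servers
calls_per_server_per_sec = calls_per_day / servers / 86_400

print(f"~{gb_per_server_per_day:.0f} GB ingested per server per day")
print(f"~{calls_per_server_per_sec:.0f} calls per server per second")
# Per server the load is unremarkable (~28 GB/day, ~32 calls/sec);
# it is the fleet size and the aggregate that make this 'Facebook-scale'.
```

If your per-server numbers look like this but your fleet is three machines, you are not a ‘Facebook’, and you should size accordingly.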
79. Solving Problem #3 – Understanding
the risks
• Biggest mistakes seen with BigData adoption
• ‘data scientists themselves don't have much of
intuition either…and that is a problem. I saw an
estimate recently that said 70 to 80 percent of the
results that are found in the machine learning
literature, which is a key Big Data scientific field, are
probably wrong because the researchers didn't
understand that they were overfitting the data.’ –
Alex Pentland, MIT’s “Big Data guy”
80. Summary
• We reviewed some aspects of XML’s current
status in the dataverse
• Identified a subset of the XML milieu – calling
it Modern XML
• Identified the scenarios where Modern XML
is being brought to bear on BigData
• Reviewed common mistakes and risks with
BigData
81. Final Thesis
• Modern XML provides a great foundation today
– Great for ‘classic’ scenarios
– Great technical positioning for addressing the
challenges of BigData
– Great technical positioning for the semantic web
• Adopting an XML database mitigates risk
• Knowing the BigData/Modern XML scenarios helps
us mitigate risk
• There is a big prize if you get BigData right
82. Avoid stereotypes
I’m a RDBMS
I’m a Protocol Buffer
I’m a Json
I’m an XML
84. Be wary of Paradigm Shifts
• RedMonk – language divergence
• Andreessen – software is eating the world
• 128-bit and beyond the current von
Neumann/Harvard architectures ?
• Power wall (at server farms/mobile devices)
• The web revolution is not done yet
(http://www.firebase.com/index.html)
86. ‘Form is temporary.
Class is permanent’
• XML is emerging from its ‘trough of
disillusionment’ because it’s useful, productive
and reacting to new requirements
• Modern XML is successful on many different
measures, mature and dead boring
• Modern XML can help solve your BigData
problems
87. Pull the Technology Trigger –
Try an XML Database Today!
• MarkLogic 6
– Web dev ‘surface area’, work with JSON
– REST API
– Java API
– Work across different data
• Zorba
• eXist
• BaseX
• Sedna
Editor’s notes
First encounter with BigData – mapmaking (Gravity map of Rhode Island) late 1980’s – geophysics generates a lot of data points
Apologies for the gratuitous football analogies … it was either that or Jaws
Chomsky proposed the notion of grammar to capture the structural constraints of a particular language. A grammar is described as a set of production rules. Depending on the kind of rules one is allowed to write, Chomsky distinguished four types of grammars of decreasing complexity, from type 0 (unconstrained) to type 3 (regular grammar). While type 0 and type 1 grammars need a full-fledged Turing machine to be checked, type 2 or context free grammars (CFG) only need a stack machine, and type 3 or regular grammars only need a finite state automaton. The last two are interesting from a computer science perspective, as they require less complex algorithms.
Binaries replaced in most office programs. Avg 200 Word docs on each PC (comScore tech matrix study 2008); 1 billion * 100 billion XML files latently living on PC users’ hard drives.
Gartner study 2010 – as little as a few hundred billion XML-based MS Word docs on the web. What’s in email, SharePoint, websites?
These are all lowball figures … not including open source file formats, or ebooks.
80% of all companies use some form of Office (a few years ago MS quoted that there were a billion instances of Office worldwide), with nearly half of these being versions that generate XML by default … that’s a lot of XML.
Australia: Australia’s Department of Finance has released a desktop policy that required all agencies to adopt Office Open XML as the standard document format.[37]
Belgium: Belgium’s Federal Public Service for Information and Communication Technology in 2006 was evaluating the adoption of the Office Open XML format. It already then confirmed that it would consider all ISO standards to be open standards, mentioning Office Open XML as such a possible future ISO standard.[38]
Denmark: In June 2007, the Danish Ministry of Science, Technology and Innovation recommended that beginning with January 1, 2008 public authorities must support at least one of the two word processing document formats Office Open XML or Open Document Format in all new IT solutions, where appropriate.[39]
Germany: In Germany the Office Open XML standard is currently under observation by the Federal Commissioner for Information Technology ("Die Beauftragte der Bundesregierung für Informationstechnik"). The latest release of "SAGA" (Standards and Architectures for E-Government Applications) includes Office Open XML file formats in both its strict and transitional variants. The ISO/IEC 29500 standard may be used to exchange complex documents when further processing is required.[40]
Japan: On June 29, 2007, the government of Japan published a new interoperability framework which gives preference to the procurement of products that follow open standards.[41][42] On July 2 the government declared that they hold the view that formats like Office Open XML, which organizations such as Ecma International and ISO had also approved, were, according to them, an open standard.[43] Also, they said that whether the format is open was one of the preferences in choosing which software the government shall deploy.
Lithuania: The Lithuanian Standards Board has adopted the ISO/IEC 29500:2008 Office Open XML format standard as the Lithuanian national standard. The decision was made by Technical Committee 4 Information Technology on March 5, 2009. The proposal to adopt the Office Open XML format standard was submitted by the Lithuanian Archives Department of the Government of the Republic of Lithuania.[44]
Norway: Norway’s Ministry of Government Administration and Reform is evaluating the adoption of the Office Open XML format. The ministry put the document standard under observation in December 2007.[45]
Sweden: The Kingdom of Sweden has adopted Office Open XML as a 4-part Swedish National Standard, SS-ISO/IEC 29500:2009.[46][47][48][49]
Switzerland: In July 2007, the Swiss Federal Council announced adherence to SAGA.ch e-Government standards, mandatory for its departments as well as for cantons, cities and municipalities. The latest version of SAGA.ch includes Office Open XML file formats.[50]
United Kingdom: The UK has put out an action plan for use of open standards, which includes ISO/IEC 29500 as one of several formats to be supported.[51][52]
United States of America: On April 15, 2009, the ANSI-accredited INCITS organisation voted to adopt ISO/IEC 29500:2008 as an American National Standard.[53] The state of Massachusetts has been examining its options for implementing XML-based document processing. In early 2005, Eric Kriss, Secretary of Administration and Finance in Massachusetts, was the first government official in the United States to publicly connect open formats to a public policy purpose: "It is an overriding imperative of the American democratic system that we cannot have our public documents locked up in some kind of proprietary format, perhaps unreadable in the future, or subject to a proprietary system license that restricts access".[54] Since 2007 Massachusetts has classified Office Open XML as "Open Format" and has amended its approved technical standards list – the Enterprise Technical Reference Model (ETRM) – to include Office Open XML. Massachusetts, under heavy pressure from some vendors, now formally endorses Office Open XML formats for its public records.[55]
The Ninth International World Wide Web Conference, May 15-19, 2000, Amsterdam had an XML Track: http://www9.org/ http://www9.org/w9-devxml.html
Smaller. More focused. There are also conferences on vocabularies, but they are less about XML and more about the problem domain itself.
C/C++ are the languages for binaries. Java heavily adopted XML, good at text/binaries. With HTML being the single preferred markup language.
HTML5 + JavaScript kills Flash. It remains to be seen what will kill PDFs. Virtualisation. Cheaper hardware/software.
Instead of focusing on the negatives we know about, I thought I would spend some time being more precise on the positives
It’s ML’s special sauce
Searched around in the literature of how to measure a programming language’s productivity
Amazon client libraries written in XQuery have 80% less code than their equivalent written in Java.
Useful study on implementing an entire enterprise web application.
Dave Thomas mentioned that a program getting bigger is the single worst thing.
A long paper trail of software engineering studies has shown that many internal code metrics (such as methods per class, depth of inheritance tree, coupling among classes etc.) are correlated with external attributes, the most important of which is bugs. What the authors of this paper show is that when they introduce a second variable, namely the total size of the program, into the statistical analysis and control for it, the correlation between all these code metrics and bugs disappears.
Furthermore, this relates to larger development teams, who by dint of their size generate large LOC; e.g. the failure rate of projects with over 300-400 developers working on them skyrockets.
Probably not to do with loc itself, but with the fact that larger programs usually have more features to fail!
The following study (related to the previous study) discovered the avg number of lines of code to implement a single function point.
Designing and writing programs using dynamic languages tends to take half as long, resulting in half the code.
More code = more bugs; studies have shown a direct relationship to failure with high LOC.
Settled on #LOC per function point.
LOC: line of code.
Function points: a method of decomposing a project’s requirements in the hope of being able to estimate the effort to do the project.
Before you start throwing stuff at me for mentioning LOC and FP: I do not subscribe to using LOC and FP for project estimation … though clearly there is a lot of historical analysis which I will leverage.
Projects have been anonymised to protect the innocent (my colleagues, clients, etc) … disclaimer: I did 4 of these XQuery projects.
Tried to reduce the mixed-language effect … e.g. because of XQuery’s ‘DSL-ness’ for things like data apps, no problems.
FP range between ~250-1200. Took me 4 days. VAF in actuality remained close to 1.0.
Methodology: analyzed 11 reasonably sized projects (4 were done by me); cloc-defined lines of code; based on the user point of view I defined FPs and summed them; defined VAF for each project.
VAF = (TDI * 0.01) + 0.65
AVP = VAF * sum of FP
Close to SQL. XQuery is a query language and a ‘good enough’ stored proc language for working with XML. Seems to match up that it’s twice as productive as Java on paper.
The VAF modifier ranges between .6-1.3 … in actuality for most of the projects it was very close to 1.0 (confirms its usage across the industry). Largest: ~15000 LOC.
Quite surprised by the results … they seem to confirm what people are feeling: that XQuery does the job with less code. Would need to analyze a lot more projects … probably not enough XQuery projects in existence to match the function point historical data tables for other languages.
Threats to validity: low sample size; inaccurate FP analysis; selection bias; mixed-language effect.
A job survey demonstrates that XQuery jobs are in demand … an indirect measure that shows there is something cooking with XQuery. An adhoc survey shows that a significant % of XQuery developers think they are more productive when using XQuery … specifically when programming with XQuery and Java, C++, and JS, working with XML, text, RDBMS and JSON. #LOC/FP analysis confirms that XQuery is about as productive as SQL but has a much larger applicability … the adhoc survey seems to indicate that XQuery is significantly leveraged when used in conjunction with an XML datastore.
Findings: XQuery is a DSL, though expansive; not yet a GPL, and it’s unclear if it should be. Needs better docs, tooling, libraries. Is good because of FP; is bad because of FP. Very good with XML. XQuery’s most suitable purpose is in making semi-structured (i.e. XML) information repositories accessible, scrutable, and tractable. XSLT is complementary by generating the view. XRX is productive. Productive used in conjunction with Java.
Ran from Sept 20 – Oct 1st. 102 people responded. 15,000,000 programmers worldwide (Wikipedia); ~100 people, 95% certain, confidence interval: +-9.8% error.
United States 43%, United Kingdom 15%, Germany 10%, France 8%, Czech Republic 3%, Netherlands 2%, Switzerland 2%.
50% of people put their name to the poll.
This survey targeted developers who used XQuery. Strong correlation between usage of XQuery and Java and XSLT. Multiple choice:
XQuery 73 (22%), Java 55 (17%), XSLT 45 (14%), JavaScript 32 (10%), C++ 22 (7%), Python 18 (5%), C++ 14 (4%), Perl 12 (4%), C# 12 (4%), Ruby 10 (3%), PHP 10 (3%), Haskell 9 (3%), Scala 9 (3%), Lisp 7 (2%), Erlang 1 (>1%)
Strong correlation between XML and usage of text, RDBMS and JSON. Multiple choice:
XML 95 (36%), Text 40 (15%), RDBMS 39 (15%), JSON 32 (12%), Binaries (images, video, etc) 27 (10%), Office documents 18 (7%), Semantic web stuff (RDF, OWL, etc) 15 (6%)
Single option: Yes 67 (67%), Maybe 14 (14%), No 10 (10%), Don’t know 8 (8%)
OK, we’ve drilled down into XQuery … we don’t have time to drill down into every technology we deem productive … but clearly there is something to this XML stack that is real.
http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
Claiming 2 to 10 times faster; now little difference: http://www.navioo.com/ajax/examples/json/test.php
Optimisations in the browser have helped both, as have those in the programming language.
Evidence: with IE8, CSS2 started getting its act together (nightmares of IE6 fading in the distance) … earlier XSLT 1.0 looked promising, CSS3 even more promising. Safari/Chrome/Opera. Data- vs document-orientated … clearly only some scenarios.
‘XML is too slow or bloated’: XML is not HTML … and the whole XHTML thing. Forced XML processing with XSLT 1.0 in the browser. Dynamic dispatch and FP = big learning curve for most web developers. Tooling and browsers misinterpreted draconian well-formedness.
Hstore for PostgreSQL is a key-value store with ACID. Dropping ACID.
If we told people that Goldfarb’s GML was born in the 60s … which begot SGML, hence XML …
Jeni Tennison evoked wonderful imagery at her XML Prague 2012 keynote
Though sometimes it’s hard to not fight a war, when encountering people with well-meaning sentiments
We fought many wars: RDBMS; web browsers (browser ppl won, HTML5 is markup); interchange (JSON won). We are in a ‘don’t mention the war’ period.
Not necessarily isolationist … the Modern XML technology stack (as we will identify later) is very active in embracing JSON.
Web people think textual markup is dead whilst using it? Strange irony to that, but they are just emerging from the trough of disillusionment. XML folks are embracing how to integrate with JSON … web dev ppl don’t want to know about it.
Lost the war with RDBMS. Lost the war for the browser. Lost the war for interchange.
2002 – lots of books, lots of adoption, lots of hype.
2006 – in December 2005, Yahoo! began offering some of its web services in JSON, and Google starts providing JSON for GData.
XML’s perception tainted by the financial crisis (lots of content providers going out of business). Yet XML Prague attendance doubled (and sold out between 2009-2011). BigData and the semantic web showing that we need more.
WS* astronautics were shooting XML into orbit. Heavy on the Enterprise. Investment by browsers, Sun, Microsoft etc. The XML hype cycle was several years in the making (we are now on the slope of enlightenment = Modern XML). Switching from relational to hierarchical (text, structure (mixed content), values = semistructured data).
Though I find it a bit unfair … BigData is mentioned on this list as if it was a ‘thing’, but it’s an underpinning. The map/reduce hype cycle?
Disagree with some things: HTML5 is probably just about starting down the trough of disillusionment … Ian Hickson / Anne … HTML is looking like PDF these days (HTML5+JS+CSS3) … it’s great progress but not on things I consider important.
http://www.itworld.com/it-managementstrategy/293397/gartner-dead-wrong-about-big-data-hype-cycle
Argues that the hype cycle is wrong because BigData has ‘real’ benefits … he is missing the point.
Don’t get upset if your pet technology goes in and out of fashion … expect this to happen a few times in your career.
Sentimentality – that’s like saying you should start using goto statements because you miss them … XML needs to have a meaning, a use, a valid domain to be applied to.
We’ve talked about where XML has been and where it is today, as well as updated some of the older perma-topics. But I mainly wanted to talk to you about XML’s place in the dataverse … as it relates to BigData.
Is the problem that XML is dead, or that XML’s time is up? Not really … because XML is everywhere … it’s not going anywhere soon. It’s everywhere in a way that JSON will never be … which is one of the reasons for JSON’s success/uptake. The problem is not XML vs JSON; we’ve been over that debate and I think everyone here can see the benefits of each data format.
http://kensall.com/big-picture/bigpix22.html
I said the NoSQL word, now I will say the other word, e.g. BigData. … Curt Monash, well-known DB analyst, calls this polystructured … many call it unstructured, but even text data will have some structure. You’ve probably all heard about the 85% of data that goes unused. Just a 10% increase in using a company’s existing data can result in giant gains.
Show how this relates to specific industry sectors …
The three Vs of data are hard to manage. When I first saw this graphic I thought it was a pair of programmers (mostly because the guys look kind of like Larry Wall), but I think these guys are business guys, and it occurred to me that we are in a strange place now where business folk are making commercial decisions based on algorithms … algorithms are absolutely crucial to our craft, but it trivializes the solution … like saying we will use hammers to build a house; of course we will use hammers. Developers need to balance off their desire to learn algorithms with the reality of getting stuff done.
http://jimfuller2011.polldaddy.com/surveys/1906925/report/locations
Caveat – we are talking about solutions with databases!
If things are in the urgent/important cell, that’s what you work on first; try to push everything into the important/not-urgent category. Never read ‘The 7 Habits of Highly Effective People’.
Items in bold are almost certainly skewed by either large historical dependency and/or browsers now implementing them. Items in italic/underline are either in the recommendation stage or were just ‘on the line’ in terms of ranking data.
This is a much better subset. Items in bold are skewed by either large historical dependency and/or browser support.
http://kensall.com/big-picture/bigpix22.html
Data maturity:
Stage one – ‘no usable data’.
Stage two – ‘too much data’ isn’t much better though. When you are swamped with data it will take up too much of your time to sort through it, and the chances are that you will end up with many, if not most, of your insights being unrelated to your core business strategy. Before you know it you’re running around in woods that are heavily dense with trees and inhabited by wild geese.
Stage three – ‘the right data’ is better, as you may well assume. With the ‘right’ data you can get the insights that support your primary business focus, ensuring that you have as much information to facilitate success in your chosen field as possible.
Stage four – ‘predictive’ is the one that many consider to be the optimum stage. This is where you make the transition from reactive to proactive. When you reach the predictive stage you can start to understand how certain influences in the future will affect your business and plan accordingly. A slightly banal yet illustrative example is to calculate what the expected peaks in website visitors will be following an advertising campaign, so that enough bandwidth can be employed to cope. Something more complex might involve a simulation of market patterns and supply chain effects should a large-scale natural disaster occur.
The final stage – ‘strategic’ – is the most data intensive.
So far most of the scenarios I showed are BigData … or at a minimum represent maximums for their industry sector.
Biggest mistakes: no feasibility study – an initial sanity check of whether what you want to do is possible. No organized selection process – self-selection means no support/buy-in at the various levels needed … FOSS selects itself! No proof of concept. Premature project initiation, before data is ready. Overfitting.
In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
Seems like common sense, but our most successful clients avoided most of these mistakes, which reduced risk immeasurably.
Shoot for the stars, but you don’t really want to build a rocketship. FOSS can be an important onramp to BigData, but eventually you will want to be able to create commercial partnerships. A PoC is a scaled-down version; the goal is to identify gaps in your skillset … you should help build the PoC. Starting a project early is common in enterprise; it’s a mistake. Vendors want to sell their software … resist the urge to let them do the work, and drill down into the detail with your own problem domain experts.
Marc Andreessen: 'software is eating the world' http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html

RedMonk reports that programming languages have never been as diverse as today.

RedMonk Tier 1 Languages (02/12): 1 C, 2 C#, 3 C++, 4 Java, 5 JavaScript, 6 Objective-C, 7 Perl, 8 PHP, 9 Python, 10 Ruby, 11 Shellscript. Source: RedMonk.

RedMonk Tier 2 Languages (02/12): 1 ASP, 2 ActionScript, 3 Assembly, 4 Clojure, 5 CoffeeScript, 6 ColdFusion, 7 Common Lisp, 8 D, 9 Delphi, 10 Emacs Lisp.

We've been living in a fairly stable hardware bubble for 30 years, i.e. the same techniques, just smaller and faster.

The power wall: about five years ago, however, the top speed for most microprocessors peaked when their clocks hit about 3 gigahertz. The problem is not that the individual transistors themselves can't be pushed to run faster; they can. But doing so for the many millions of them found on a typical microprocessor would require that chip to dissipate impractical amounts of heat. Computer engineers call this the power wall. Given that obstacle, it's clear that all kinds of computers, including supercomputers, are not going to advance at nearly the rates they have in the past.

Advances
• Tissue engineering
• Terascale neuromorphic chips (memristor synapses), nanostore memory (logic and memory together)
• Many billions and probably trillions of electronic tattoos (less than a penny each in most cases) with processing, sensors, memory, wireless
• 2000-qubit adiabatic quantum computers
• The Human Brain Project (if funded it would be done, and if not there are other DARPA and Asian projects of comparable scale)
• Memristors at exascale (supercomputer class), petascale for very affordable systems
• Sensors even more capable
• Electronic tattoos even cheaper and more capable
• Deep robotics commercialization and adoption
• Beamed power and persistent UAVs
• Megascale or gigascale adiabatic quantum computers

Hardware
• Optical computing – trapping, storing and manipulating light is difficult.
• Quantum computing
• Neuronal computing
• DNA computing
• Reversible computing – normally, every computational operation that involves losing a bit of information also discards the energy used to represent it. Reversible computing aims to recover and reuse this energy.
• Billiard-ball computing – involves chain reactions of electrons passing from molecule to molecule inside a circuit.
• Magnetic (NMR) computing – every glass of water contains a computer, if you just know how to operate it.
• Glooper computer – one of the weirdest computers ever built forsakes traditional hardware in favour of "gloopware". Andrew Adamatzky at the University of the West of England, UK, can make interfering waves of propagating ions in a chemical goo behave like logic gates, the building blocks of computers.
• Mouldy computers
• Water-wave computing – perhaps the most unlikely place to see computing power is in the ripples in a tank of water. Using a ripple tank and an overhead camera, Chrisantha Fernando and Sampsa Sojakka at the University of Sussex used wave patterns to make a type of logic gate called an "exclusive OR gate", or XOR gate.
Remember 'write once, run everywhere'? Programming for the browser. Well, things change.
• Java was originally designed for interactive television.
• "Write Once, Run Anywhere" (WORA)
• Java Applets
• Be skeptical of purity: XML is for data.
When extensibility is not required, XML will always lose against DSLs. Diversity makes strong ecosystems.

As I've shown you, modern XML is being applied to BigData problems today. It provides:
• A stable, fast and mature toolset of technologies for working with textual markup, text and, in many cases, many different kinds of data
• A foundation for the Semantic Web
• Applicability to a wide range of BigData scenarios
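The "XML loses to DSLs" point can be made concrete with a toy comparison, a hypothetical sketch (the config, its field names, and the JSON-as-DSL choice are all invented for illustration): the same flat configuration carried once as XML and once as terser JSON parses to identical data, but the XML form spends more bytes on markup:

```python
import json
import xml.etree.ElementTree as ET

# The same configuration, once as XML and once as a terser JSON 'DSL'.
xml_cfg = "<server><host>localhost</host><port>8080</port></server>"
json_cfg = '{"host": "localhost", "port": 8080}'

root = ET.fromstring(xml_cfg)
from_xml = {"host": root.findtext("host"), "port": int(root.findtext("port"))}
from_json = json.loads(json_cfg)

print(from_xml == from_json)        # → True: identical information
print(len(xml_cfg), len(json_cfg))  # the XML form is markedly longer
```

When you do need extensibility (namespaces, mixed content, validation), the trade-off reverses, which is exactly the qualification in the claim above.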