4. ‘ontogeny recapitulates phylogeny’
or
A (Very) Brief History of ML
• Late 1950s: Noam Chomsky ‘generative
grammars’
• 1969: Charles Goldfarb (w/ Ed Mosher and
Ray Lorie) created GML
• 1986: SGML formalized
• 1998: XML 1.0 W3C recommendation
• 1998 – 2012: A lot of stuff happened
• Future: XML 2.0 … microXML ?
5. RDBMS Goliath vs XML David
• Back then, XML was the proto ‘NoSQL’
• The X in AJAX
• Now there are many ‘Davids’
• AJAJ (AJAX with JSON in place of XML)
6. Documents
• Back then, it wasn’t unusual for a vendor to say
‘tough luck’ about your data (pay up)
• Now, most office documents are in XML
7. The ‘long tail’ of XML Vocabularies
• Back then, vocabularies were built with
proprietary approaches
• Today, 1,000s of vocabularies are based on
XML
– ‘2012 U.S. GAAP Taxonomy Adopted by SEC;
FASB Webcast April 3’
9. Back then, XML/Markup Conferences
• Software Development 99 East, November 8-13, 1999, Washington D.C.
• XML One Fall 99, November 8-11, 1999, Santa Clara, CA
• XML '99 December 6-9, 1999, Philadelphia PA
• Markup Technologies '99 Conference December 5-9, 1999, Philadelphia
• Web Design 2000, February 7-9, 2000, Atlanta
• XTech '2000, February 27-March 2, San Jose
• Software Development 2000 West, March 20-24, 2000, San Jose
• Sixteenth International Unicode Conference, Boston, March 27-30, 2000
• The Ninth International World Wide Web Conference, May 15-19, 2000,
Amsterdam
• DL 2000: Fifth ACM Conference on Digital Libraries, June 3-6 2000, Texas
• XML Europe 2000, June 12-16, Paris
• Web Design World 2000, July 17-21, 2000, Seattle, Washington
• MetaStructures, August 14-16, 2000, Montreal, Quebec, Canada
• XML Developers Conference, August 17-18, 2000, Montreal, Quebec
• Internet World Expo, October 25-27, 2000, New York City
• XML 2000/Markup Technologies 2000, December 3-7, Washington
• ….. Even a Geek Cruises XML Excursion - January 2001
10. Today - XML/Markup Conferences
• The XML ‘parallelogram’
– Balisage
– XML Summer School
– XML Prague
– XML Amsterdam
• Xtech*
• markupForum
• XATA
• MarkLogic World (600 ppl)
• databaseX (London November 2013 ?)
11. Other important good stuff
• Evolution of the Operating System
– Unix is the operating system for text
– Windows tried to be the operating system for
binaries, then adopted XML … a mixed bag
– Java (the VM) has a strong XML stack
• The web changed everything to text-based
markup
• Cheap RAM/disk/CPU
• Virtualization = scale out
12. Other important good stuff
http://googleblog.blogspot.cz/2012/02/unicode-over-60-percent-of-web.html
13. Unfair to point out failures ?
• Namespaces
• XLink
• WS* astronautics
• Draconian error checking
• XML Schema
• XForms
• XSLT 1.0 (or any XML) in the browser
• XHTML vs HTML5
• Too many specs (modularity good, complexity
bad)
14. “Winning isn’t everything. There
should be no conceit in victory and no
despair in defeat.” - Matt Busby
• 2001 I was the RDBMS serial killer
– ‘kill RDBMS’
• Define successful ?
– Adoption ?
– Cheaper ?
– Faster ?
– Better ?
15. Drill-down distraction - Why is XQuery
successful/productive ?
• Chose my most successful (ad hoc stories,
visible success)
• Functional, dynamic … works with structure,
text and values … stored proc + query language
• XPath
• Is it possible to qualify/quantify XQuery
productivity?
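The “structure, text and values in one query” claim can be sketched outside XQuery too. A minimal illustration (my own, not from the talk) using Python’s stdlib `ElementTree` and its limited XPath support; the `orders` document and the `Acme` filter are invented for the example, and the shape of the loop is roughly what an XQuery FLWOR expression does declaratively:

```python
# Illustrative only: selecting by structure, reading text, and
# computing over values in a single pass -- the style of query the
# slide attributes to XQuery, sketched with the Python stdlib.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<orders>
  <order id="1"><customer>Acme</customer><total>120.50</total></order>
  <order id="2"><customer>Birch</customer><total>75.00</total></order>
  <order id="3"><customer>Acme</customer><total>19.99</total></order>
</orders>
""")

# Structure (order elements), text (customer names), values (totals):
acme_total = sum(
    float(o.findtext("total"))
    for o in doc.findall("order")
    if o.findtext("customer") == "Acme"
)
print(round(acme_total, 2))  # 140.49
```

In XQuery the equivalent is a one-line FLWOR expression, which is the productivity point the slide is making.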
19. Developing an Enterprise Web
Application in XQuery - 2009, Martin
Kaufmann, Donald Kossmann
Lines of code:
             Java/J2EE   XQuery
Model        3100        240
View         4100        1500
Controller   900         1180
Total        8100 (?)    2920 (3490)
20. Nooooo! The problem with LOC
• Correlation of failure with very high LOC is
the only certain fact about LOC
• That’s about it
21. An empirical comparison of C, C++,
Java, Perl, Python, Rexx, and Tcl for a
search/string-processing program
Lutz Prechelt (prechelt@ira.uka.de), Fakultät für Informatik,
Universität Karlsruhe
Language   #LOC per function point
C          91
C++        53
Java       54
Perl       21
* Designing and writing programs in dynamic languages tended to
take half as long and resulted in half the code.
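The table can be turned into a back-of-envelope estimator. This is my own illustration (not from the study): the 500-function-point project size is hypothetical, and the per-language averages are the slide’s figures:

```python
# Rough LOC estimate from the slide's LOC-per-function-point averages.
LOC_PER_FP = {"C": 91, "C++": 53, "Java": 54, "Perl": 21}

def estimated_loc(function_points, language):
    """Back-of-envelope LOC for a project of the given FP size."""
    return function_points * LOC_PER_FP[language]

fp = 500  # hypothetical project size in function points
for lang in LOC_PER_FP:
    print(f"{lang}: ~{estimated_loc(fp, lang)} LOC")
# Perl at ~10,500 LOC vs C at ~45,500 LOC is the "half the time,
# half the code" dynamic-language effect the footnote mentions.
```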
29. Do you think XQuery makes you a
more productive programmer ?
Yes 67%
Maybe 14%
No 10%
Don’t know 8%
30. Is XQuery more productive than Java
in developing web-based data applications ?
58%
22%
12%
8%
31. Time to bust one myth
• ‘XML is too slow and bloated’
• http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
• In data-oriented AJAJ scenarios, JSON is at
best ~30% faster in most benchmarks today,
with less load (so more with fewer resources)
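The point can be checked locally. A small illustrative micro-benchmark (my own, not the navioo.com benchmark) parsing equivalent JSON and XML payloads with the Python stdlib; absolute numbers vary by machine and parser, so the code makes no claim about the exact gap:

```python
# Illustrative micro-benchmark: parse equivalent JSON and XML payloads.
# The interesting observation is that the gap is modest, not the old
# "2-10x faster" claim -- but measure on your own stack.
import json
import timeit
import xml.etree.ElementTree as ET

json_payload = '{"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}'
xml_payload = ('<items><item id="1"><name>a</name></item>'
               '<item id="2"><name>b</name></item></items>')

json_time = timeit.timeit(lambda: json.loads(json_payload), number=10_000)
xml_time = timeit.timeit(lambda: ET.fromstring(xml_payload), number=10_000)
print(f"JSON: {json_time:.3f}s  XML: {xml_time:.3f}s")

# Sanity check: both encodings carry the same data.
from_json = [(i["id"], i["name"]) for i in json.loads(json_payload)["items"]]
from_xml = [(int(i.get("id")), i.findtext("name"))
            for i in ET.fromstring(xml_payload)]
assert from_json == from_xml
```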
44. Channel effect of Aging in Technology
• “Average age of @guardian Facebook
audience is 29. Website is 37, print paper 44.
Amazing channel effect, really. #newsrw”
• Babyboomers, Gen X, Y and Z
• I feel a bit uneasy framing generational
arguments …
45. Death of the XML Child
…Overachieving Child Prodigies
grow up
51. 2001 Edd Dumbill – xml.com
‘Stop the XML hype, I want to get off
As editor of XML.com, I welcome the massive
success XML has had. But things prized by the XML
community — openness and interoperability — are
getting swallowed up in a blaze of marketing hype. Is this
the price of success, or something we can avoid? ‘
Source: Edd Dumbill (March 2001)
52. 2012 Edd Dumbill g+ post
‘For many years I was the editor of XML.com,
and the chair of the XML Europe conference.
Today, it seems that XML's mission to be a web
language is mostly dead. I'm not saying XML is
useless: it has proved itself as a more easily-used
SGML, but I'm not sure it's expanded too far
outside of that.’
Source: Edd Dumbill (March 2012)
53. Current Status: XML is dead
• XML fought too many battles (RDBMS, NoSQL,
web developers, HTML5)
• Age channeling and the hype curve are in effect
• But XML technology stack is embracing JSON
etc …
• No room for sentimentality in technology
58. Is XML Applicable to Big Data ?
• We know it is, that’s why I am here
• Some of you already know
• Need to dig into the detail
• But we first need to simplify things
62. Managing data variability, volume & velocity is hard
You need to be a (data) scientist to build this rocket ship.
63. So what’s the problem again ?
#1 – How to apply Modern XML to your BigData
problems ?
#1a – The XML milieu is too complicated; need to identify
what is successful as Modern XML
#1b – BigData is a huge opportunity
#1c – BigData has a huge learning curve and high risks
64. Solving #1 – Defining Modern XML
• Identify the technologies
• Identify and classify the Scenarios
65. Modern XML Technology analysis
• Internal survey of ML Customer projects &
External survey of projects (w/ pref towards
Big/Complex projects)
• Informal Survey (polldaddy)
• Qualitative and quantitative
66. Eisenhower - "What is important is seldom
urgent and what is urgent is seldom important."
              URGENT         NOT URGENT
IMPORTANT     Critical       Goals
NOT IMPORTANT Interruptions  Distractions
67. Survey Interpretations
• XML 1.0, Namespaces is important now
• XProc, XHTML important now
• XSLT 2 and XQuery 1 very important now
• XSLT 2 and XQuery 2 in the browser near future
• XQuery 3.0 important near future
• SAX/DOM now, XOM possible future
• XML Schema 1.0 now, 1.1 for the near future
• Schematron surprising
• Semweb is for the future
• SVG and MathML due to web browser support
• XML vocabularies have a very ‘long tail’
68. Modern XML – Technology Candidates
Core          XML 1.0, Namespaces
Other         –
Transform     XSLT 2.0, XQuery 1.0
Processing    SAX, DOM
Schema        Schematron, XML Schema 1.0
Semantics     RDF, OWL
Vocabularies  Office Doc ML, SVG
These technologies trended highly across all analyses.
Bold – could be trending due to browser implementations / historical dependencies.
69. Modern XML – Tier 1
Core          XML 1.0, Namespaces
Other         XProc
Transform     XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0
Processing    SAX, DOM
Schema        Schematron, XML Schema 1.0 / 1.1
Semantics     RDF, OWL
Vocabularies  Office Doc ML, SVG
These technologies trended highly across all analyses.
Bold – could be trending due to browser implementations / historical dependencies.
Italic – strong signal: early usage, interest in unproven specs/tech.
70. Modern XML
Core – Tier 1: XML 1.0, Namespaces; Tier 2: XML Canonicalization, xml:id
Other – Tier 1: XProc; Tier 2: XHTML*
Transform – Tier 1: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0; Tier 2: XSLT 1.0
Processing – Tier 1: SAX, DOM; Tier 2: XOM, StAX
Schema – Tier 1: Schematron, XML Schema 1.0 / 1.1; Tier 2: RELAX NG
Semantics – Tier 1: RDF, OWL; Tier 2: SPARQL
Vocabularies – Tier 1: Office Doc ML, SVG; Tier 2: MathML, DocBook, SOAP*, DITA, EPUB
71. [image: the XML ‘big picture’ – http://kensall.com/big-picture/bigpix22.html]
72. Modern XML
Core – Tier 1: XML 1.0, Namespaces; Tier 2: XML Canonicalization, xml:id, XML Infoset
Other – Tier 1: XProc; Tier 2: XHTML*
Transform – Tier 1: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0; Tier 2: XSLT 1.0
Processing – Tier 1: SAX, DOM; Tier 2: XOM, StAX
Schema – Tier 1: Schematron, XML Schema 1.0 / 1.1; Tier 2: RELAX NG
Semantics – Tier 1: RDF, OWL; Tier 2: SPARQL
Vocabularies – Tier 1: Office Doc ML, SVG; Tier 2: MathML, DocBook, SOAP, DITA, EPUB
Data Formats – XML, text, binary, JSON
73. The technology triggers
• XML database – reduces the complexity/risk of
BigData
– MarkLogic
– eXist
– Zorba
– Sedna
– BaseX
– Others (Oracle!)
• XQuery – rapid prototyping
• Avoid purist architectures, embrace
heterogeneity
74. Modern XML / BigData Scenarios
• Classic Scenarios
– Document (xml) Database
– Aggregation
– Enterprise Search
– Heterogeneous Content store
– Publishing
• BigData Scenarios
– BigData ‘classic’
– Extreme personalisation
– Predictive analytics
– Financial analysis
– Realtime analysis (management/financial)
– Actionable intelligence
• Semantic Web – too early to categorize but it’s for real
75. Solving Problem #2 – Focus on the
Practicalities
• What type of Big Data problem do you have ?
– The urgent, important ones you know about
– The urgent, important ones you don’t know about
• Create a dedicated team (analytics, problem
domain experts) to identify the latter
• Assess data maturity (Data Audit)
• With power comes responsibility … Ethical
Analytics
76. BigData Tech Advice
• Start using an XML database ASAP!
• Don’t get distracted by the zoo and start
hadooping right away
• ‘Data outlives code’ – spend more time on the
data: clean abstractions, keeping it cogent, opening it up
77. Size appropriately
Volume – relative to your current capability;
is the requirement a magnitude greater than your
current infrastructure can scale to?
Velocity – updates versus reads ? High volatility
with realtime queries ?
Variety – managing versioning ?
Complexity – multiples, complex processes
78. Size Appropriately: Are you a
‘Facebook’ (Google, Yahoo…) ?
• 2.5 billion content items shared per day (status updates + wall posts +
photos + videos + comments)
• 2.7 billion Likes per day
• 300 million photos uploaded per day
• 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS)
clusters
• 105 terabytes of data scanned via Hive, Facebook’s Hadoop query
language, every 30 minutes
• 70,000 queries executed on these databases per day
• 500+ terabytes of new data ingested into the databases every day
• Are you planning to scale out to ~180,900 servers ?
• ~18,000 database servers ingesting 500+ terabytes of data through a
guesstimated 50+ billion calls … a day!
http://www.datacenterknowledge.com/the-facebook-data-center-faq/
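A quick back-of-envelope division of the slide’s figures (my own arithmetic, rounded; the inputs are themselves guesstimates) shows why the per-server numbers matter for sizing:

```python
# Back-of-envelope arithmetic from the slide's Facebook figures.
servers = 18_000                 # ~database servers
ingest_tb_per_day = 500          # new data ingested per day
calls_per_day = 50_000_000_000   # guesstimated calls per day

gb_per_server_per_day = ingest_tb_per_day * 1024 / servers
calls_per_server_per_sec = calls_per_day / servers / 86_400

print(f"~{gb_per_server_per_day:.0f} GB ingested per server per day")
print(f"~{calls_per_server_per_sec:.0f} calls per server per second")
# Per server the load is unremarkable (~28 GB/day, ~32 calls/sec);
# it is the fleet size and the aggregate that make this 'Facebook-scale'.
```

If your per-server numbers look like this but your fleet is three machines, you are not a ‘Facebook’, and you should size accordingly.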
79. Solving Problem #3 – Understanding
the risks
• Biggest mistakes seen with BigData adoption
• ‘data scientists themselves don't have much of
intuition either…and that is a problem. I saw an
estimate recently that said 70 to 80 percent of the
results that are found in the machine learning
literature, which is a key Big Data scientific field, are
probably wrong because the researchers didn't
understand that they were overfitting the data.’ –
Alex Pentland, MIT’s “Big Data guy”
80. Summary
• We reviewed some aspects of XML’s current
status in the dataverse
• Identified a subset of the XML milieu – calling
it Modern XML
• Identified the scenarios where Modern XML
is being brought to bear on BigData
• Reviewed common mistakes and risks with
BigData
81. Final Thesis
• Modern XML provides a great foundation today
– Great for ‘classic’ scenarios
– Great technical positioning for addressing the
challenges of BigData
– Great technical positioning for the semantic web
• Adopting an XML database mitigates risk
• Knowing the BigData/Modern XML scenarios helps
us mitigate risk
• There is a big prize if you get BigData right
82. Avoid stereotypes
I’m a RDBMS
I’m a Protocol Buffer
I’m a Json
I’m an XML
84. Be wary of Paradigm Shifts
• RedMonk – language divergence
• Andreessen – software is eating the world
• 128-bit and beyond the current von
Neumann/Harvard architectures ?
• Power wall (at server farms/mobile devices)
• The web revolution is not done yet
(http://www.firebase.com/index.html)
86. ‘Form is temporary.
Class is permanent’
• XML is emerging from its ‘trough of
disillusionment’ because it’s useful, productive
and reacting to new requirements
• Modern XML is successful on many different
measures, mature and dead boring
• Modern XML can help solve your BigData
problems
87. Pull the Technology Trigger –
Try an XML Database Today!
• MarkLogic 6
– Web dev ‘surface area’, work with JSON
– REST API
– Java API
– Work across different data
• Zorba
• eXist
• BaseX
• Sedna
Editor’s notes
First encounter with BigData – mapmaking (Gravity map of Rhode Island) late 1980’s – geophysics generates a lot of data points
Apologies for the gratuitous football analogies … it was either that or Jaws
Chomsky proposed the notion of grammar to capture the structural constraints of a particular language. A grammar is described as a set of production rules. Depending on the kind of rules one is allowed to write, Chomsky distinguished four types of grammars of decreasing complexity, from type 0 (unconstrained) to type 3 (regular grammar). While type 0 and type 1 grammars need a full-fledged Turing machine to be checked, type 2 or context free grammars (CFG) only need a stack machine, and type 3 or regular grammars only need a finite state automaton. The last two are interesting from a computer science perspective, as they require less complex algorithms.
Binaries replaced in most office programs. Avg 200 Word docs on each PC (comScore tech matrix study 2008); 1 billion * 100 billion XML files latently living on PC users’ hard drives.
Gartner study 2010 – as little as a few hundred billion XML-based MS Word docs on the web. What’s in email, SharePoint, websites?
These are all lowball figures … not including open source file formats, or ebooks.
80% of all companies use some form of Office (a few years ago MS quoted that there were a billion instances of Office worldwide), with nearly half of these being versions that generate XML by default … that’s a lot of XML.
Australia: Australia’s Department of Finance has released a desktop policy that required all agencies to adopt Office Open XML as the standard document format.[37]
Belgium: Belgium’s Federal Public Service for Information and Communication Technology in 2006 was evaluating the adoption of the Office Open XML format. It already then confirmed that it would consider all ISO standards to be open standards, mentioning Office Open XML as such a possible future ISO standard.[38]
Denmark: In June 2007, the Danish Ministry of Science, Technology and Innovation recommended that beginning with January 1, 2008 public authorities must support at least one of the two word processing document formats Office Open XML or Open Document Format in all new IT solutions, where appropriate.[39]
Germany: In Germany the Office Open XML standard is currently under observation by the Federal Commissioner for Information Technology ("Die Beauftragte der Bundesregierung für Informationstechnik"). The latest release of "SAGA" (Standards and Architectures for E-Government Applications) includes Office Open XML file formats in both its strict and transitional variants. The ISO/IEC 29500 standard may be used to exchange complex documents when further processing is required.[40]
Japan: On June 29, 2007, the government of Japan published a new interoperability framework which gives preference to the procurement of products that follow open standards.[41][42] On July 2 the government declared that they hold the view that formats like Office Open XML, which organizations such as Ecma International and ISO had also approved, were, according to them, an open standard.[43] Also, they said that whether the format is open was one of the preferences in choosing which software the government shall deploy.
Lithuania: The Lithuanian Standards Board has adopted the ISO/IEC 29500:2008 Office Open XML format standard as the Lithuanian national standard. The decision was made by Technical Committee 4 Information Technology on March 5, 2009. The proposal to adopt the Office Open XML format standard was submitted by the Lithuanian Archives Department of the Government of the Republic of Lithuania.[44]
Norway: Norway’s Ministry of Government Administration and Reform is evaluating the adoption of the Office Open XML format. The ministry put the document standard under observation in December 2007.[45]
Sweden: The Kingdom of Sweden has adopted Office Open XML as a 4-part Swedish National Standard, SS-ISO/IEC 29500:2009.[46][47][48][49]
Switzerland: In July 2007, the Swiss Federal Council announced adherence to SAGA.ch e-Government standards, mandatory for its departments as well as for cantons, cities and municipalities. The latest version of SAGA.ch includes Office Open XML file formats.[50]
United Kingdom: The UK has put out an action plan for use of open standards, which includes ISO/IEC 29500 as one of several formats to be supported.[51][52]
United States of America: On April 15, 2009, the ANSI-accredited INCITS organisation voted to adopt ISO/IEC 29500:2008 as an American National Standard.[53] The state of Massachusetts has been examining its options for implementing XML-based document processing. In early 2005, Eric Kriss, Secretary of Administration and Finance in Massachusetts, was the first government official in the United States to publicly connect open formats to a public policy purpose: "It is an overriding imperative of the American democratic system that we cannot have our public documents locked up in some kind of proprietary format, perhaps unreadable in the future, or subject to a proprietary system license that restricts access".[54] Since 2007 Massachusetts has classified Office Open XML as "Open Format" and has amended its approved technical standards list – the Enterprise Technical Reference Model (ETRM) – to include Office Open XML. Massachusetts, under heavy pressure from some vendors, now formally endorses Office Open XML formats for its public records.[55]
The Ninth International World Wide Web Conference, May 15-19, 2000, Amsterdam had an XML Track: http://www9.org/ http://www9.org/w9-devxml.html
Smaller. More focused. There are also conferences on vocabularies, but they are less about XML and more about the problem domain itself.
C/C++ are the languages for binaries. Java heavily adopted XML, good at text/binaries. With HTML being the single preferred markup language.
HTML5 + JavaScript kills Flash. It remains to be seen what will kill PDFs. Virtualisation. Cheaper hardware/software.
Instead of focusing on the negatives we know about, I thought I would spend some time being more precise on the positives
It’s ML’s special sauce
Searched around in the literature of how to measure a programming language’s productivity
Amazon client libraries written in XQuery have 80% less code than their equivalent written in Java.
Useful study on implementing an entire enterprise web application.
Dave Thomas mentioned that a program getting bigger is the single worst thing.
A long paper trail of software engineering studies has shown that many internal code metrics (such as methods per class, depth of inheritance tree, coupling among classes etc.) are correlated with external attributes, the most important of which is bugs. What the authors of this paper show is that when they introduce a second variable, namely the total size of the program, into the statistical analysis and control for it, the correlation between all these code metrics and bugs disappears.
Furthermore, this relates to larger development teams, who by dint of their size generate large LOC; e.g. the failure rate of projects with over 300-400 developers working on them skyrockets.
Probably not to do with loc itself, but with the fact that larger programs usually have more features to fail!
The following study (related to the previous study) discovered the avg number of lines of code to implement a single function point.
Designing and writing programs using dynamic languages tends to take half as long, resulting in half the code.
More code = more bugs; studies have shown a direct relationship to failure with high LOC.
Settled on #LOC per function point.
LOC: line of code.
Function points: a method of decomposing a project’s requirements in the hope of being able to estimate the effort to do the project.
Before you start throwing stuff at me for mentioning LOC and FP: I do not subscribe to using LOC and FP for project estimation … though clearly there is a lot of historical analysis which I will leverage.
Projects have been anonymised to protect the innocent (my colleagues, clients, etc) … disclaimer: I did 4 of these XQuery projects.
Tried to reduce the mixed-language effect … e.g. because of XQuery’s ‘DSL-ness’ for things like data apps, no problems.
FP range between ~250-1200. Took me 4 days. VAF in actuality remained close to 1.0.
Methodology: analyzed 11 reasonably sized projects (4 were done by me); cloc-defined lines of code; based on the user point of view I defined FPs and summed them; defined VAF for each project.
VAF = (TDI * 0.01) + 0.65
AVP = VAF * sum of FP
Close to SQL. XQuery is a query language and a ‘good enough’ stored proc language for working with XML. Seems to match up that it’s twice as productive as Java on paper.
The VAF modifier ranges between .6-1.3 … in actuality for most of the projects it was very close to 1.0 (confirms its usage across the industry). Largest: ~15000 LOC.
Quite surprised by the results … they seem to confirm what people are feeling: that XQuery does the job with less code. Would need to analyze a lot more projects … probably not enough XQuery projects in existence to match the function point historical data tables for other languages.
Threats to validity: low sample size; inaccurate FP analysis; selection bias; mixed-language effect.
A job survey demonstrates that XQuery jobs are in demand … an indirect measure that shows there is something cooking with XQuery. An adhoc survey shows that a significant % of XQuery developers think they are more productive when using XQuery … specifically when programming with XQuery and Java, C++, and JS, working with XML, text, RDBMS and JSON. #LOC/FP analysis confirms that XQuery is about as productive as SQL but has a much larger applicability … the adhoc survey seems to indicate that XQuery is significantly leveraged when used in conjunction with an XML datastore.
Findings: XQuery is a DSL, though expansive; not yet a GPL, and it’s unclear if it should be. Needs better docs, tooling, libraries. Is good because of FP; is bad because of FP. Very good with XML. XQuery’s most suitable purpose is in making semi-structured (i.e. XML) information repositories accessible, scrutable, and tractable. XSLT is complementary by generating the view. XRX is productive. Productive used in conjunction with Java.
Ran from Sept 20 – Oct 1st. 102 people responded. 15,000,000 programmers worldwide (Wikipedia); ~100 people, 95% certain, confidence interval: +-9.8% error.
United States 43%, United Kingdom 15%, Germany 10%, France 8%, Czech Republic 3%, Netherlands 2%, Switzerland 2%.
50% of people put their name to the poll.
This survey targeted developers who used XQuery. Strong correlation between usage of XQuery and Java and XSLT. Multiple choice:
XQuery 73 (22%), Java 55 (17%), XSLT 45 (14%), JavaScript 32 (10%), C++ 22 (7%), Python 18 (5%), C++ 14 (4%), Perl 12 (4%), C# 12 (4%), Ruby 10 (3%), PHP 10 (3%), Haskell 9 (3%), Scala 9 (3%), Lisp 7 (2%), Erlang 1 (>1%)
Strong correlation between XML and usage of text, RDBMS and JSON. Multiple choice:
XML 95 (36%), Text 40 (15%), RDBMS 39 (15%), JSON 32 (12%), Binaries (images, video, etc) 27 (10%), Office documents 18 (7%), Semantic web stuff (RDF, OWL, etc) 15 (6%)
Single option: Yes 67 (67%), Maybe 14 (14%), No 10 (10%), Don’t know 8 (8%)
OK, we’ve drilled down into XQuery … we don’t have time to drill down into every technology we deem productive … but clearly there is something to this XML stack that is real.
http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
Claiming 2 to 10 times faster; now little difference: http://www.navioo.com/ajax/examples/json/test.php
Optimisations in the browser have helped both, as have those in the programming language.
Evidence: with IE8, CSS2 started getting its act together (nightmares of IE6 fading in the distance) … earlier XSLT 1.0 looked promising, CSS3 even more promising. Safari/Chrome/Opera. Data- vs document-orientated … clearly only some scenarios.
‘XML is too slow or bloated’: XML is not HTML … and the whole XHTML thing. Forced XML processing with XSLT 1.0 in the browser. Dynamic dispatch and FP = big learning curve for most web developers. Tooling and browsers misinterpreted draconian well-formedness.
Hstore for PostgreSQL is a key-value store with ACID. Dropping ACID.
If we told people that Goldfarb’s GML was born in the 60s … which begot SGML, hence XML …
Jeni Tennison evoked wonderful imagery at her XML Prague 2012 keynote
Though sometimes it’s hard to not fight a war, when encountering people with well-meaning sentiments
We fought many wars: RDBMS; web browsers (browser ppl won, HTML5 is markup); interchange (JSON won). We are in a ‘don’t mention the war’ period.
Not necessarily isolationist … the Modern XML technology stack (as we will identify later) is very active in embracing JSON.
Web people think textual markup is dead whilst using it? Strange irony to that, but they are just emerging from the trough of disillusionment. XML folks are embracing how to integrate with JSON … web dev ppl don’t want to know about it.
Lost the war with RDBMS. Lost the war for the browser. Lost the war for interchange.
2002 – lots of books, lots of adoption, lots of hype.
2006 – in December 2005, Yahoo! began offering some of its web services in JSON, and Google starts providing JSON for GData.
XML’s perception tainted by the financial crisis (lots of content providers going out of business). Yet XML Prague attendance doubled (and sold out between 2009-2011). BigData and the semantic web showing that we need more.
WS* astronautics were shooting XML into orbit. Heavy on the Enterprise. Investment by browsers, Sun, Microsoft etc. The XML hype cycle was several years in the making (we are now on the slope of enlightenment = Modern XML). Switching from relational to hierarchical (text, structure (mixed content), values = semistructured data).
Though I find it a bit unfair … BigData is mentioned on this list as if it was a ‘thing’, but it’s an underpinning. The map/reduce hype cycle?
Disagree with some things: HTML5 is probably just about starting down the trough of disillusionment … Ian Hickson / Anne … HTML is looking like PDF these days (HTML5+JS+CSS3) … it’s great progress but not on things I consider important.
http://www.itworld.com/it-managementstrategy/293397/gartner-dead-wrong-about-big-data-hype-cycle
Argues that the hype cycle is wrong because BigData has ‘real’ benefits … he is missing the point.
Don’t get upset if your pet technology goes in and out of fashion … expect this to happen a few times in your career.
Sentimentality – that’s like saying you should start using goto statements because you miss them … XML needs to have a meaning, a use, a valid domain to be applied to.
We’ve talked about where XML has been and where it is today, as well as updated some of the older perma-topics. But I mainly wanted to talk to you about XML’s place in the dataverse … as it relates to BigData.
Is the problem that XML is dead, or that XML’s time is up? Not really … because XML is everywhere … it’s not going anywhere soon. It’s everywhere in a way that JSON will never be … which is one of the reasons for JSON’s success/uptake. The problem is not XML vs JSON; we’ve been over that debate and I think everyone here can see the benefits of each data format.
http://kensall.com/big-picture/bigpix22.html
I said the NoSQL word, now I will say the other word, e.g. BigData. … Curt Monash, well-known DB analyst, calls this polystructured … many call it unstructured, but even text data will have some structure. You’ve probably all heard about the 85% of data that goes unused. Just a 10% increase in using a company’s existing data can result in giant gains.
Show how this relates to specific industry sectors …
The three Vs of data are hard to manage. When I first saw this graphic I thought it was a pair of programmers (mostly because the guys look kind of like Larry Wall), but I think these guys are business guys, and it occurred to me that we are in a strange place now where business folk are making commercial decisions based on algorithms … algorithms are absolutely crucial to our craft, but it trivializes the solution … like saying we will use hammers to build a house; of course we will use hammers. Developers need to balance off their desire to learn algorithms with the reality of getting stuff done.
http://jimfuller2011.polldaddy.com/surveys/1906925/report/locations
Caveat – we are talking about solutions with databases!
If things are in the urgent/important cell, that’s what you work on first; try to push everything into the important/not-urgent category. Never read ‘The 7 Habits of Highly Effective People’.
Items in bold are almost certainly skewed by either large historical dependency and/or browsers now implementing them. Items in italic/underline are either in the recommendation stage or were just ‘on the line’ in terms of ranking data.
This is a much better subset. Items in bold are skewed by either large historical dependency and/or browser support.
http://kensall.com/big-picture/bigpix22.html
Data maturity:
Stage one – ‘no usable data’.
Stage two – ‘too much data’ isn’t much better though. When you are swamped with data it will take up too much of your time to sort through it, and the chances are that you will end up with many, if not most, of your insights being unrelated to your core business strategy. Before you know it you’re running around in woods that are heavily dense with trees and inhabited by wild geese.
Stage three – ‘the right data’ is better, as you may well assume. With the ‘right’ data you can get the insights that support your primary business focus, ensuring that you have as much information to facilitate success in your chosen field as possible.
Stage four – ‘predictive’ is the one that many consider to be the optimum stage. This is where you make the transition from reactive to proactive. When you reach the predictive stage you can start to understand how certain influences in the future will affect your business and plan accordingly. A slightly banal yet illustrative example is to calculate what the expected peaks in website visitors will be following an advertising campaign, so that enough bandwidth can be employed to cope. Something more complex might involve a simulation of market patterns and supply chain effects should a large-scale natural disaster occur.
The final stage – ‘strategic’ – is the most data intensive.
So far most of the scenarios I showed are BigData … or at a minimum represent maximums for their industry sector.
Biggest mistakes: no feasibility study – an initial sanity check of whether what you want to do is possible. No organized selection process – self-selection means no support/buy-in at the various levels needed … FOSS selects itself! No proof of concept. Premature project initiation, before data is ready. Overfitting.
In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
Seems like common sense, but our most successful clients avoided most of these mistakes, which reduced risk immeasurably.
Shoot for the stars, but you don’t really want to build a rocketship. FOSS can be an important onramp to BigData, but eventually you will want to be able to create commercial partnerships. A PoC is a scaled-down version; the goal is to identify gaps in your skillset … you should help build the PoC. Starting a project early is common in enterprise; it’s a mistake. Vendors want to sell their software … resist the urge to let them do the work, and drill down into the detail with your own problem domain experts.
Marc Andreessen: 'software is eating the world' http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html

RedMonk reports that programming languages have never been as diverse as today.

RedMonk Tier 1 Languages (02/12): 1 C, 2 C#, 3 C++, 4 Java, 5 JavaScript, 6 Objective-C, 7 Perl, 8 PHP, 9 Python, 10 Ruby, 11 Shellscript. Source: RedMonk.

RedMonk Tier 2 Languages (02/12): 1 ASP, 2 ActionScript, 3 Assembly, 4 Clojure, 5 CoffeeScript, 6 ColdFusion, 7 Common Lisp, 8 D, 9 Delphi, 10 Emacs Lisp.

We've been living in a fairly stable hardware bubble for 30 years, i.e. the same techniques, just smaller and faster.

The power wall: about five years ago, however, the top speed for most microprocessors peaked when their clocks hit about 3 gigahertz. The problem is not that the individual transistors themselves can't be pushed to run faster; they can. But doing so for the many millions of them found on a typical microprocessor would require that chip to dissipate impractical amounts of heat. Computer engineers call this the power wall. Given that obstacle, it's clear that all kinds of computers, including supercomputers, are not going to advance at nearly the rates they have in the past.

Advances
• Tissue engineering
• Terascale neuromorphic chips (memristor synapses), nanostore memory (logic and memory together)
• Many billions and probably trillions of electronic tattoos (less than a penny each in most cases) with processing, sensors, memory, wireless
• 2000-qubit adiabatic quantum computers
• The Human Brain Project (if funded it would be done, and if not there are other DARPA and Asian projects of comparable scale)
• Memristors at exascale (supercomputer class), petascale for very affordable systems
• Sensors even more capable
• Electronic tattoos even cheaper and more capable
• Deep robotics commercialization and adoption
• Beamed power and persistent UAVs
• Megascale or gigascale adiabatic quantum computers

Hardware
• Optical computing – trapping, storing and manipulating light is difficult.
• Quantum computing
• Neuronal computing
• DNA computing
• Reversible computing – normally, every computational operation that involves losing a bit of information also discards the energy used to represent it. Reversible computing aims to recover and reuse this energy.
• Billiard-ball computing – involves chain reactions of electrons passing from molecule to molecule inside a circuit.
• Magnetic (NMR) computing – every glass of water contains a computer, if you just know how to operate it.
• Glooper computer – one of the weirdest computers ever built forsakes traditional hardware in favour of "gloopware". Andrew Adamatzky at the University of the West of England, UK, can make interfering waves of propagating ions in a chemical goo behave like logic gates, the building blocks of computers.
• Mouldy computers
• Water-wave computing – perhaps the most unlikely place to see computing power is in the ripples in a tank of water. Using a ripple tank and an overhead camera, Chrisantha Fernando and Sampsa Sojakka at the University of Sussex used wave patterns to make a type of logic gate called an "exclusive OR gate", or XOR gate.
Remember 'write once, run everywhere'? Programming for the browser. Well, things change.
• Java was originally designed for interactive television.
• "Write Once, Run Anywhere" (WORA)
• Java Applets
• Be skeptical of purity: XML is for data.
When extensibility is not required, XML will always lose against DSLs. Diversity makes strong ecosystems.

As I've shown you, modern XML is being applied to BigData problems today. It provides:
• A stable, fast and mature toolset of technologies for working with textual markup, text and, in many cases, many different kinds of data
• A foundation for the Semantic Web
• Applicability to a wide range of BigData scenarios
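The "XML loses to DSLs" point can be made concrete with a toy comparison, a hypothetical sketch (the config, its field names, and the JSON-as-DSL choice are all invented for illustration): the same flat configuration carried once as XML and once as terser JSON parses to identical data, but the XML form spends more bytes on markup:

```python
import json
import xml.etree.ElementTree as ET

# The same configuration, once as XML and once as a terser JSON 'DSL'.
xml_cfg = "<server><host>localhost</host><port>8080</port></server>"
json_cfg = '{"host": "localhost", "port": 8080}'

root = ET.fromstring(xml_cfg)
from_xml = {"host": root.findtext("host"), "port": int(root.findtext("port"))}
from_json = json.loads(json_cfg)

print(from_xml == from_json)        # → True: identical information
print(len(xml_cfg), len(json_cfg))  # the XML form is markedly longer
```

When you do need extensibility (namespaces, mixed content, validation), the trade-off reverses, which is exactly the qualification in the claim above.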