2. Career is a mash-up of:
start-ups + enterprise
customer + vendor
data + application
technical + business
3. The View from 30,000 Feet … ok - from low earth orbit
The time has come to manage information across the enterprise for strategic benefit.
Be the “Googler” of your enterprise
4. Simply put: manage a company’s information as an asset - at least as well as
Google tries to manage the world’s information as an asset
Assume your information assets are as diverse as the modern web - but not the
same - data matters more than documents.
What does this mean?
However…
MOST OF US ARE NOT GOOGLE in
the quality and quantity of our
engineering resources
Google makes it look easy sometimes
because they have much of the best
talent in the world
Data Silos are a primary bottleneck
5. Viz tools are democratizing analysis - D3.js, Tableau, Spotfire, etc.
“Big Data Mania” represents an opportunity to re-architect for flexibility + agility
Monolithic, hard-coded warehouses & ETL constrain experimentation, collaboration and agility
Entities do not have perfect definitions - don’t try to force it...
Static schemas/data structures are great for collection but have a “drag coefficient” for analytics
Embrace data variety as a reality - leave the monolithic vendors pretending they can lock us in
Semantic approaches allow access to diverse data and agile integration to solve specific questions
Data marts should be available “on demand” using tech “@ target” that suits the analytic
“You can’t get there from here”: NOT Enterprise Data “Business as Usual”
6. Part of the Answer:
The 3 V’s?
important but...
…not enough...
7. Try this …
Start with the Questions, not the Answer - “Analytic context will set you free”
● Ask aspirational/transformational analytic questions
● Use them as context for defining all the work you do
● Build your infrastructure to answer the analytical questions
In the process….
● Get a broad and dynamic inventory of all your data
● Match workload to appropriate engine/tech
● Use Distributed Systems - radically lower cost vs. traditional
● Expect modern and dynamic visualization - iterative vs. reporting
● Treat Cloud as a first-order resource - not just ancillary
● Modern DevOps - core capability
● JSON sources will proliferate...embrace it
● Bottom-up data/metadata management
● Internal and external data - both valuable, but not the same
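One way to picture the "broad and dynamic inventory" and "bottom-up metadata" bullets above: profile whatever records each source actually emits, instead of declaring a schema first. The sketch below is a minimal illustration, not a product design. The function names (`profile`, `build_inventory`) and the sample source name are hypothetical, and it assumes each source can hand over newline-delimited JSON objects.

```python
import json
from collections import defaultdict


def profile(source_name, json_lines):
    """Infer field names and observed value types for one source (bottom-up)."""
    fields = defaultdict(set)
    for line in json_lines:
        record = json.loads(line)  # assumption: one JSON object per line
        for key, value in record.items():
            fields[key].add(type(value).__name__)
    return {
        "source": source_name,
        "fields": {key: sorted(types) for key, types in fields.items()},
    }


def build_inventory(sources):
    """Profile every source; `sources` maps a source name to its JSON lines."""
    return [profile(name, lines) for name, lines in sources.items()]


if __name__ == "__main__":
    # Hypothetical source with inconsistent typing, as real sources often have.
    sample = {"crm_export": ['{"id": 1, "name": "Acme"}', '{"id": "2"}']}
    for entry in build_inventory(sample):
        print(entry)
```

Re-running the profile on a schedule is what would make the inventory "dynamic": new fields and type drift show up without anyone editing a central model.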
8. Start with the Questions, not the Answer….
….but sometimes it’s not simple….
...embrace the ambiguity...
Same but Different - Identity depends on the question:
● Gleevec, Glivec and Imatinib
● Same InChIKey
● Formulation vs. Substance
● Product versus compound
● Regional naming difference
● Canonicalization depends on context
InChI=1S/C29H31N7O/c1-21-5-10-25(18-27(21)34-29-31-13-11-26(33-29)24-4-3-12-30-19-24)32-28(37)23-8-6-22(7-9-23)20-36-16-14-35(2)15-17-36/h3-13,18-19H,14-17,20H2,1-2H3,(H,32,37)(H,31,33,34)
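"Canonicalization depends on context" can be sketched as a grouping key chosen by the question being asked. The example below is a minimal illustration under stated assumptions: the `DrugRecord` type, the `canonical_key` and `group` helpers, and the region labels are all hypothetical, and real identity resolution would key on a structure identifier such as the InChI rather than a hand-typed substance name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    name: str       # name as it appears in a source system
    substance: str  # active compound behind the name
    region: str     # illustrative region label, not authoritative


RECORDS = [
    DrugRecord("Gleevec", "imatinib", "US"),
    DrugRecord("Glivec", "imatinib", "EU"),
    DrugRecord("Imatinib", "imatinib", "generic"),
]


def canonical_key(record, context):
    """Pick the identity key that fits the analytic question."""
    if context == "substance":
        # Chemistry question: all three names are the same compound.
        return record.substance
    if context == "product":
        # Commercial question: brand and region keep the records distinct.
        return (record.name.lower(), record.region)
    raise ValueError(f"unknown context: {context}")


def group(records, context):
    """Group record names under the context-dependent canonical key."""
    groups = {}
    for record in records:
        groups.setdefault(canonical_key(record, context), []).append(record.name)
    return groups


if __name__ == "__main__":
    print(group(RECORDS, "substance"))
    print(group(RECORDS, "product"))
```

Asked as a substance question, the three names collapse into one entity; asked as a product question, they stay three. Neither answer is "the" canonical form.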
9. Pick a problem that is:
- greenfield
- well-defined
- valuable
DO NOT BOIL THE OCEAN
11. Distributed Systems
For data science at scale, we can’t afford to pay the
“enterprise IT tax”
Need to build an enterprise infrastructure as
inexpensive, scalable and persistent as that of modern
web companies
Mindset: Put tight spending limits on storage and
systems infrastructure … and it will take you toward a
place similar to the modern internet consumer
companies - this is a good place :)
Facebook CIO talking about Vertica
12. The Cloud
A first-class citizen in the enterprise infrastructure
Fact...not opinion: The world’s largest high-
performance computing and persistence infrastructure
is available for you to rent on-demand
Let’s drop the hubris of on-prem enterprise data
centers, much as we no longer generate our own
electricity…
13. DevOps
DevOps matters as much for data as for software
DevOps is to the Cloud as Systems Management was
to Client-Server computing
● Couldn’t live without Systems Management then
● Can’t live without DevOps now
Getting to scale (managing hundreds/thousands of
machines) on demand requires automated tools and a
modern DevOps infrastructure.
14. JSON
JSON is now a primary tool to access data
Ultimate evolution of relational and object-oriented
technologies coming together
Provides a loose, flexible coupling between data access
and applications
Definition of flexibility: As long as it’s JSON, we don’t
need to care what’s behind it
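The loose coupling described above can be sketched with two very different backends that both answer in JSON, and one consumer that only knows JSON. This is a minimal illustration, not a reference architecture: the backend functions, the field names, and the sample rows are hypothetical placeholders.

```python
import csv
import io
import json


def from_csv_backend():
    """Hypothetical backend 1: a flat file, serialized to JSON on the way out."""
    raw = "entity,source\nimatinib,csv\n"
    rows = list(csv.DictReader(io.StringIO(raw)))
    return json.dumps(rows)


def from_keyvalue_backend():
    """Hypothetical backend 2: an in-memory key-value store, also emitting JSON."""
    store = {"rec1": {"entity": "imatinib", "source": "kv"}}
    return json.dumps(list(store.values()))


def entities_seen(payload):
    """Consumer: depends only on the JSON shape, not on what produced it."""
    return {row["entity"] for row in json.loads(payload)}


if __name__ == "__main__":
    for backend in (from_csv_backend, from_keyvalue_backend):
        print(backend.__name__, entities_seen(backend()))
```

The consumer never imports `csv` logic or touches the store; swapping a backend changes nothing on the application side as long as the JSON shape holds. The shape itself is still a contract, so "we don't need to care what's behind it" does not extend to what is inside it.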
15. Variety - how to tackle the enterprise data silo problem
Standardization and Aggregation are necessary but not
sufficient to solve the challenges of Enterprise
Analytics 3.0
16. Bottom-Up + Top Down Data Modeling & “Collaborative Curation”
Time to embrace the reality of extreme data variety across
the entire enterprise - “Unified Data”
Requires a bottom-up, probabilistic approach to data
curation and integration (complementing deterministic rules)
● mix of 80% probabilistic & 20% deterministic
● Tamr’s primary design pattern
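To make the probabilistic-plus-deterministic mix concrete, here is a minimal sketch of matching source column names to target attributes. It is not Tamr's method: plain string similarity from Python's `difflib` stands in for a real probabilistic model, and the rule table, target attributes, and threshold are hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher

# Deterministic part: hand-curated exact mappings (hypothetical).
RULES = {"cust_id": "customer_id"}

# Target attributes in the unified view (hypothetical).
TARGET_ATTRIBUTES = ["customer_id", "customer_name", "postal_code"]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; a stand-in for a model score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_column(column, threshold=0.6):
    """Return (target, score, method) for one source column name."""
    if column in RULES:
        # A curated rule always wins over a scored guess.
        return RULES[column], 1.0, "deterministic"
    best = max(TARGET_ATTRIBUTES, key=lambda target: similarity(column, target))
    score = similarity(column, best)
    if score >= threshold:
        return best, score, "probabilistic"
    # Low-confidence matches go to a person: the "collaborative curation" step.
    return None, score, "needs_review"


if __name__ == "__main__":
    for name in ["cust_id", "CustomerName", "zzz"]:
        print(name, match_column(name))
```

The design point is the third outcome: anything below the threshold is routed to a human expert, and their answer can be added to the rule table so the deterministic share grows from actual usage.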
Back to the future:
● 1990s web: probabilistic search and website connection
● 2020s enterprise: probabilistic data source connection &
curation
17. Internal and External Data
Internally and externally generated data are
now BOTH important
If our orgs are going to become truly data-
driven, we have to embrace external data
We need to get to the point that, a la Google,
we don’t care where it comes from
Google Maps, for example
● Seamless integration of internal Google
and external data
● And Google just doesn’t care
18. In Summary
● Manage your information as an asset
● Start with a broad inventory of all your data
● Embrace ambiguity/variety of enterprise data
● Throw the “one schema to rule them all” into the
fires of Mordor…
● Embrace modern viz & iterative analytics
● Don’t ignore the Cloud - it’s inevitable
● DevOps is cool - and fun :)
● JSON is the future of data access - it’s ok
● True shared-nothing distributed systems are the
only way out of the “Enterprise IT Tax”