The document discusses a company that finances, develops, and operates renewable energy and efficiency installations. They collect large amounts of time series data from these installations, including temperature readings and flow rates taken at regular intervals. The author is considering using MongoDB to build a flexible data pipeline to store, search, and analyze this time series data. Key requirements are that the system needs to scale to potentially large amounts of data from many installations, and that it is designed with analytics and flexibility in mind to support a variety of use cases and evolving business needs.
2. ajackson@skylineinnovations.com
Sunday, July 24, 2011
3. a tale of rapid
prototyping, data
warehousing, solar
power, an architecture
designed for data
analysis at "scale"
...and arduinos!
So here's what I'd like to talk about: who we are, how we got started, and most importantly,
how we've been able to use MongoDB to help us. We're not a traditional startup -- and while
I know that this is not a "startups" talk, but a Mongo one, I'd like to show how Mongo's
flexible nature really helped us as a business, and how Mongo specifically has been a good
choice for us as we build some of our tools. Here are some themes:
4. Scaling
Mongo has come to have a pretty strong association with the word "scaling."
Scaling is a word we throw around a lot, and it almost always means "software performance,
as inputs grow by orders of magnitude."
But scaling also means performance as the variety of inputs increases. I'd argue that it's
scaling to go from 10 users to 10,000, and it's also scaling to go from ten "kinds" of input to
a hundred.
There's another word for this.
5. Scaling
Flexibility
Particularly when you scale in the real world, you start to find that it's complicated and messy
and entropic in ways that software isn't always equipped to handle. So for us, when we say
"mongo helps us scale", we don't necessarily mean scaling to petabytes of data. We'll come
back to them as well.
6. Business-first
development
This generally means flexible, lightweight processes. Things that become fixed &
unchangeable quickly become obsolete and sad :'(
7. When Does
"Context"
become "Yak
Shaving"?
When I read new things or hear about new stuff, I'm always trying to put it in context. So,
sometimes I put too much context in my talks :( To avoid it, I sometimes go a little too fast
over the context that *is* important. So please stop me to ask questions! Also, the problem
domain here is a little different than what we might be used to, so bear with me as we go into
plumbing & construction.
12. finance, develop, and operate
renewable energy and efficiency
installations, for measurable,
guaranteed savings.
13. finance, develop, and
operate renewable energy
and efficiency installations, for
measurable, guaranteed savings.
We'll pay to put stuff on your roof, and we'll keep it maximally awesome.
14. finance, develop, and operate
renewable energy and
efficiency installations, for
measurable, guaranteed savings.
Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
15. finance, develop, and operate
renewable energy and efficiency
installations, for measurable,
guaranteed savings.
So, here's the interesting part. Since we put stuff on your roof for free, we need to get that
money back. What we do is, we'll charge you for the energy that it saved you, but, here's the
twist. Other companies have done similar things, where they say "we'll pay for a
system/retrofit/whatever, and you'll agree to pay us an arbitrary number, and we say you'll get
savings, but you won't actually be able to tell, really." That always seemed sketchy to us. So,
we actually measure the performance of this stuff, collect the data, and guarantee that you
save money.
18. • Why solar thermal?
• Why hasn't anyone else done this before?
• Pivots? Iterations?
• What's the market size?
• Funding? Capital structures?
• Wait, how do you guys make money?
Oh, right, this isn't a startup talk. But feel free to ask me these later!
19. Solar Thermal in Five
Minutes
( mongo next, i promise! )
20. Municipal => Roof => Tank => Customer
22. Temperatures
(about a dozen)
23. Flow Rates
(at least two)
24. Parallel data streams
(hopefully many)
e.g., weather data, insolation data. It'd be nice if we didn't have to collect it all ourselves.
25. how much data?
20 data points @ 4 bytes
1 minute intervals
at 1000 projects (I wish!)
for 10 years
80 * 60 * 24 * 365 * 10 * 1000 = ~420 GB
...not much, really, "in the raw"
Unfortunately, we can't really store it with maximal efficiency, because of things like
timestamps, metadata, etc., but still.
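The estimate on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are illustrative, and decimal gigabytes (10^9 bytes) are assumed. It comes out a little over 400 GB of raw values, before timestamps and metadata.

```python
# Back-of-the-envelope storage estimate using the slide's numbers.
POINTS_PER_SAMPLE = 20      # data points per reading
BYTES_PER_POINT = 4         # raw size of each value
SAMPLES_PER_DAY = 60 * 24   # one sample per minute
DAYS_PER_YEAR = 365
YEARS = 10
PROJECTS = 1000


def raw_bytes(projects=PROJECTS, years=YEARS):
    """Raw payload size in bytes, ignoring timestamps and metadata."""
    bytes_per_sample = POINTS_PER_SAMPLE * BYTES_PER_POINT  # 80 bytes
    return bytes_per_sample * SAMPLES_PER_DAY * DAYS_PER_YEAR * years * projects


total = raw_bytes()
print(f"{total:,} bytes = {total / 1e9:.1f} GB")
```

A single project for a single year is about 42 MB by the same arithmetic, which is why the raw volume is not the hard part of the problem.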
26.
I hope this provides enough context on the business problems we're trying to solve. It looks
like we'll need a data pipeline, and we'll need one fast.
We've got data that we'll need to use to build, monitor, and monetize these energy
technologies. Having worked at other smart grid companies before, I've seen some good
data pipelines and some bad data pipelines. I'd like to build a good one. The less stuff I
have to build, the better.
27.
As I do some research, I find that a lot of these data pipelines have a few well-defined areas
of responsibility.
28. Acquisition,
Storage,
Search,
Retrieval,
Analytics.
These should be self explanatory. What's interesting is that not only are most of the
end-users of the system analysts, interested in analyzing, but that most systems seem to be
designed for the other functionality. More importantly, they're not very well decoupled: by
the time the analysts get to start building tools, the design decisions from the beginning are
inextricable from the systems that came before.
29. Acquisition,
Storage,
Search,
Retrieval, <= Designed for these four
Analytics. <= Users are here
30. Acquisition,
Storage,
Search,
Retrieval,
Analytics.
It's important to remember that, while you can't get good analytics without the other stuff,
the analytics is where almost all of the value is! Search & retrieval are approaching "solved".
31. Acquisition,
Storage,
Search,
Retrieval, <= Designed for these four
Analytics. <= Users are here
Business value is here!
32.
So, here's how I started thinking about things. This is a design diagram from the early days
of the company.
33.
Easy, Python, no problem. There are some interesting topics here, but they're not MongoDB
related. I was pretty sure I knew how to build this part, and I was pretty sure I knew what the
data would look like.
34.
This part was also easy -- e-mail reports, CSVs, maybe some fancy graphs, possibly some
light webapps for internal use. These would be dictated by business goals first, but the
technological questions were straightforward.
35.
Here was the real question.
What would an analyst having a good experience look like? What would
they expect the tools to do?
36. Now we can think
about what the data
looks like
So, let's think about what this data looks like, how it's structured and what it is. Then, after
that, we can look at the best ways to organize it for future usefulness.
37. Time series?
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614
38. TIME SERIES
DATA
So what is time series data?
39. Features, Over Time
Multi-dimensional features. What's fun in a business like this is that we're not really sure
what the features we study will be. -- Flexibility callout
40. Features, Over Time
Thing
(Feature vector, v)
Time
(t)
42.
A couple of ideas:
sampling rates. "regularity". "completeness"
analog vs. digital
instantaneous vs. cumulative (tradeoffs)
43. [Figure: time axis with a range marked from t_n to t_n+1]
Finding known interesting ranges (definitely the most common)
45. [Figure: time axis marked t, t′, etc.]
Using features to find interesting ranges.
These two ways to look for things should inform our design decisions.
46. [Figure: adds a y axis over t, t′, etc.]
47. [Figure: adds thresholds at y and y′]
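The two lookups described in these notes (pulling a known time range, and using a feature threshold to discover ranges) can be sketched in plain Python. The sample values and function names below are hypothetical, chosen only to illustrate the two access patterns; a real store would answer these with indexed queries.

```python
from datetime import datetime

# Hypothetical samples: (timestamp, value) pairs, already sorted by time.
samples = [
    (datetime(2010, 3, 9, 23, minute), value)
    for minute, value in [(1, 14.8), (2, 15.0), (3, 15.3), (4, 15.4), (5, 15.1), (6, 15.6)]
]


def in_time_range(series, start, end):
    """Known range: keep samples with start <= t < end."""
    return [(t, v) for t, v in series if start <= t < end]


def ranges_above(series, threshold):
    """Feature-driven: return (t_start, t_end) spans where value > threshold."""
    spans = []
    span_start = None
    last_t = None
    for t, v in series:
        if v > threshold:
            if span_start is None:
                span_start = t  # a new span opens here
            last_t = t
        elif span_start is not None:
            spans.append((span_start, last_t))  # value dropped: close the span
            span_start = None
    if span_start is not None:
        spans.append((span_start, last_t))  # series ended while above threshold
    return spans
```

The first function needs only an index on time; the second has to look at values, which is what makes precomputed features attractive.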
49. (more complicated stuff
can be thought of as
transformations...)
e.g., frequency analysis, wavelets, whatever.
50.
At this point, I go off and do a bunch of research on existing technologies. I really hate
reinventing the wheel, and we really don't have the manpower.
51. Time series specific tools
Scientific tools & libraries
Traditional data-warehousing approaches
So, these were some of the options I looked at. I want to quickly point out why I eliminated
the first two classes of tools.
52. Time series specific tools
RRDtool -- Round Robin Database
There are surprisingly few of these. One of the best is RRDtool. It's pretty sweet, and
I highly recommend it. Unfortunately, it's really designed for applications that are highly
regular, and that are already pretty digital, for instance, sampling latencies, or temperatures
in a datacenter. It's not really good for unreliable sensors, nor is it really designed for long
term persistence. It also has a really high lock-in, with legacy data formats, etc. Don't get
me wrong, it's totally rad, but I didn't think it was for us.
53. Scientific tools & libraries
e.g., PyTables
Pretty cool, but not many of these were mature & ready for primetime. Some that were, like
PyTables, didn't really match our business use-case.
54. Traditional data-warehousing approaches
So, these were some of the options I looked at. I want to quickly point out why I eliminated
the first two classes of tools. [...]. That leaves us with the traditional approaches. This
represents a pretty well established field, but very few of the tools are free, lightweight, and
mature.
55. Enterprise buzzwords
(Just google for OLAP)
But the biggest idea I learned is that most data warehousing revolves around the idea of a
"fact table". They call it a "multidimensional OLAP cube", but basically it exists as a totally
denormalized SQL table.
56. "Measures"
and their
"Dimensions"
(or facts)
63. Time series are not relational!
Even extracted features are not inherently relational!
Also: you don't know what you're looking for, you don't know when you'll find it, you won't
know when you'll have to start looking for something different.
Why would you lock yourself into a schema?
64. We don't know what
we'll want to know.
We won't know what we want to know. Not only are we warehousing time-series of
multidimensional feature vectors, we don't even know the dimensions we'll be interested in
yet!
65. natural fit for
documents
This makes a schema-less database a natural fit for these sorts of things. Think about all the
alter-table calls I've avoided...
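To make the schema-less point concrete, here is a small sketch of two fact documents that carry different measures, plus helpers that cope with the difference. The field names (`install`, `measures`, `panel_class`, and so on) and the values are illustrative assumptions, not the actual schema used in the talk.

```python
# Two hypothetical "fact" documents for different installs. Adding a new
# measure or dimension to one of them requires no migration of the other.
fact_a = {
    "install": {"name": "site-a"},
    "timestamp": "2010-03-09T23:01:44",
    "measures": {"municipal_water_in_T": 14.76, "tank_room_ambient_T": 22.69},
}
fact_b = {
    "install": {"name": "site-b", "panel_class": "flat-plate"},
    "timestamp": "2010-03-09T23:01:44",
    "measures": {"municipal_water_in_T": 13.9, "insolation": 0.0},
}


def measure_names(docs):
    """Collect every measure seen across documents, however they differ."""
    names = set()
    for doc in docs:
        names.update(doc.get("measures", {}))
    return sorted(names)


def values_for(docs, measure):
    """Pull one measure, skipping documents that never recorded it."""
    return [d["measures"][measure] for d in docs if measure in d.get("measures", {})]
```

In a relational fact table, the `insolation` column on only some rows would have meant an ALTER TABLE and a lot of NULLs; here it is just another key.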
68. ...actually, not a good
model.
The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure
provides another dimension.
Anyway!
70. the goal: good numbers!
Remember, the goal here is to make it easy for analysts to get comparable numbers, so when
I ask for the delivered energy for one system, compared to the delivered energy from
another, I can just get the time-series data, without having to worry about whether sensors
changed, when the network was out, when a logger was replaced with another one, etc.
71.
So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV
series. It doesn't really provide a lot of intelligence, and is basically the raw numbers.
72. from rows
to columns
So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful
way. I'm gonna walk through that process, quickly.
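As a minimal illustration of the rows-to-columns idea, the sketch below pivots row-major CSV text into one list per column using only the standard library. The excerpt is a shortened, rounded subset of the logger output shown in the slides, and `rows_to_columns` is a hypothetical helper name, not part of the original pipeline.

```python
import csv
import io

# A shortened, rounded excerpt in the same shape as the logger CSV.
RAW = """Time,municipal water in T,solar heated water out T,Cycle Count
Tue Mar 9 23:01:44 2010,14.76,53.78,333458
Tue Mar 9 23:02:44 2010,14.96,53.76,333462
Tue Mar 9 23:03:45 2010,15.11,53.70,333462
"""


def rows_to_columns(text):
    """Turn row-major CSV text into {column name: [values...]}."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Keep timestamps as strings; parse everything else as a number.
            columns[name].append(value if name == "Time" else float(value))
    return columns


columns = rows_to_columns(RAW)
```

Once the data is column-major, each sensor becomes its own series, which is the shape the measure formulas want to consume.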
74. row-major data
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614
75. "Functional"
import functools

class Mass(BasicMeasure):
    def __init__(self, density, volume):
        ...
        # Bind density and the volume measure up front; only data varies per call.
        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)

    def __call__(self, data):
        return self._result_func(data)
Quasi-functional classes that describe how to calculate a value from data.
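The slide leaves `BasicMeasure` and the `volume` argument undefined, so here is a self-contained sketch of how the pattern might fit together. The `BasicMeasure` base class and the `Volume` measure below are hypothetical stand-ins invented for this example; only `Mass` follows the slide.

```python
import functools


class BasicMeasure:
    """Hypothetical stand-in for the talk's base class: a named callable."""
    name = "measure"

    def __call__(self, data):
        raise NotImplementedError


class Volume(BasicMeasure):
    """Hypothetical measure: reads a volume (litres) out of a data record."""
    name = "volume"

    def __call__(self, data):
        return data["litres"]


class Mass(BasicMeasure):
    name = "mass"

    def __init__(self, density, volume):
        # Bind the constant and the upstream measure once; only data varies.
        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)

    def __call__(self, data):
        return self._result_func(data)


# 1.0 kg per litre is roughly water; the record layout is an assumption.
mass = Mass(density=1.0, volume=Volume())
print(mass({"litres": 250}))
```

Because measures are callables that accept other measures, formulas compose: a derived quantity can be built from `Mass` the same way `Mass` is built from `Volume`.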
77. Creating a Cube
For each install, for each chunk of data:
apply all known formulas to get values
make some convenience keys (e.g., day_of_year)
stuff it in mongo
Then, map/reduce to whatever dimensionalities you're
interested in: e.g., downsampling.
Here's some pseudocode for how to make a cube of multidimensional data.
So, what's the payoff?
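The pseudocode on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not the talk's implementation: the formulas, field names, and reading layout are hypothetical, a list stands in for the MongoDB collection, and a dictionary-based group-and-sum stands in for map/reduce.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical formulas: name -> function of one raw reading.
FORMULAS = {
    "delta_T": lambda r: r["out_T"] - r["in_T"],
    "gallons": lambda r: r["gallons"],
}


def build_facts(install_name, readings):
    """Apply every known formula to each reading and add convenience keys."""
    facts = []
    for reading in readings:
        ts = reading["time"]
        facts.append({
            "install": {"name": install_name},
            "timestamp": ts,
            "day_of_year": ts.timetuple().tm_yday,
            "day_of_week": ts.weekday(),
            "measures": {name: f(reading) for name, f in FORMULAS.items()},
        })
    return facts  # in a real pipeline these would be inserted into MongoDB


def downsample_daily(facts, measure):
    """Group facts by (install, day) and sum one measure per group."""
    totals = defaultdict(float)
    for fact in facts:
        # "map" step: emit a key for this fact
        key = (fact["install"]["name"], fact["timestamp"].date())
        # "reduce" step: fold the value into the running total for that key
        totals[key] += fact["measures"][measure]
    return dict(totals)
```

Each install and each chunk is independent in `build_facts`, which is what makes the cube-building step easy to run in parallel.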
78. How much water did
[x] use, monthly?
> db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1})
Complicated analytical queries can be boiled down to nearly single-line mongo queries.
Here are some examples:
79. What were our highest
production days?
> db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1})
80. How does the distribution of [x]
on the weekend compare to its
distribution on the weekdays?
> weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}})
> weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}})
> do stuff
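The "do stuff" step in the weekend/weekday comparison might look something like the sketch below. It assumes `collection` is any object with a PyMongo-style `find(filter, projection)` method (for example a `pymongo` collection for `facts_daily`), and that `day_of_week` follows Python's `datetime.weekday()` numbering, where 5 and 6 are Saturday and Sunday; the slides do not state either of these. The function names are illustrative.

```python
from statistics import mean, median

# Assumed numbering: Python's datetime.weekday(), so 5 = Saturday, 6 = Sunday.
WEEKEND_DAYS = [5, 6]


def split_by_weekend(collection, measure):
    """Return (weekend_values, weekday_values) for one measure."""
    def values(day_filter):
        cursor = collection.find(
            {"day_of_week": day_filter},
            {"measures." + measure: 1},  # project only the measure we need
        )
        return [
            doc["measures"][measure]
            for doc in cursor
            if measure in doc.get("measures", {})
        ]

    return values({"$in": WEEKEND_DAYS}), values({"$nin": WEEKEND_DAYS})


def summarize(values):
    """A few summary statistics for comparing two distributions."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}
```

With a live database this would be called as `split_by_weekend(db.facts_daily, "Energy Sold")`, followed by `summarize` on each half or a histogram of the two lists.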
81. What's the production of installs north of a certain
latitude, with a certain class of panel, on Tuesdays?
For hours where the average delivered temperature
delta was above [x], what was our generation
efficiency?
Normalize by number of panels? (map/reduce)
Normalize by distance from equinox? (map/reduce)
...etc.
82. • Building a cube can be done in parallel
• Map/reduce is an easy way to think about
transforms.
• Not maximally efficient, but parallelizes on
commodity hardware.
Some advantages.
re #3 -- so what? It's not a webapp.
83. mongoDB:
The future of enterprise
business intelligence.
(they just don't know it yet)
So, here's my thesis:
document databases are far superior to relational databases for business intelligence cases.
Not only that, but MongoDB and some common sense let you replace multimillion dollar
IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
85. Mongo expands in an
organization.
It's cool, don't fight it. Once we started using it for our analytics, we realized there was a lot
of other schema-loose data that we could use it for -- like the definitions of the measures
themselves, or the details about an install, etc., etc.
88. Flexible tools means
business responsiveness
should be easy
89. "Scaling" doesn't just
mean depth-first.
Businesses grow deep, in the sense of adding more users, but they also grow broad.
91. Epilogue
Quest for Logging Hardware
92. This'll be easy!
This is such an obvious and well
explored problem space, I'm
sure we'll be able to find a
solution that matches our needs
without breaking the bank!
93. Shopping List!
16 temperature sensors
4 flow sensors
maybe some miscellaneous ones
internet backhaul
no software/data lock in.
94. Conventions
FTW!
And since we've walked a couple
convention floors and product
catalogs from major industrial
supply vendors, I'm sure it's in
here somewhere!
95. derp derp
"internet"?
I'm sure there's a reason why all
of these loggers have to connect
via USB...
Pace Scientific XR5:
8 analog
3 pulse
ONE MB
no internet?
$500?!?
96. yay windows?
...and require proprietary
(windows!) software or
subscription plans that route my
data through their servers
(basically all of them!)
97. Maybe the gov't
can help!
Perhaps there's some kind of
standard that the governments
require for solar thermal
monitoring systems to be
eligible for incentives or tax
credits.
99. A "Certified"
Logger
two temperature sensors
one pulse
no increase in accuracy
no data backhaul -- at all
...
what's the price?
105. TL; DR:
Existing loggers
are terrible.
Also, existing industries aren't really ready for rapid prototyping and its destructive effects.